Should fix the following [1]:
2024-10-07 21:13:38 02784_connection_string: [ OK ] 9.69 sec.
2024-10-07 21:13:38 Process Process-5:
2024-10-07 21:13:38 Traceback (most recent call last):
2024-10-07 21:13:38 File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
2024-10-07 21:13:38 self.run()
2024-10-07 21:13:38 File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2024-10-07 21:13:38 self._target(*self._args, **self._kwargs)
2024-10-07 21:13:38 File "/usr/bin/clickhouse-test", line 2609, in run_tests_process
2024-10-07 21:13:38 return run_tests_array(*args, **kwargs)
2024-10-07 21:13:38 File "/usr/bin/clickhouse-test", line 2327, in run_tests_array
2024-10-07 21:13:38 stop_tests()
2024-10-07 21:13:38 File "/usr/bin/clickhouse-test", line 445, in stop_tests
2024-10-07 21:13:38 cleanup_child_processes(os.getpid())
2024-10-07 21:13:38 File "/usr/bin/clickhouse-test", line 433, in cleanup_child_processes
2024-10-07 21:13:38 child_pgid = os.getpgid(child)
2024-10-07 21:13:38 ProcessLookupError: [Errno 3] No such process
[1]: https://s3.amazonaws.com/clickhouse-test-reports/70448/cd826389e90065466ddfef140fc344b30e8c6de0/stateless_tests__aarch64_.html
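
A minimal sketch of the idea, not the actual clickhouse-test code: a child
can exit between being listed and the os.getpgid() call, so the lookup is
wrapped in try/except and children that have already vanished are skipped.
The function name collect_foreign_pgids and its arguments are hypothetical.

```python
import os


def collect_foreign_pgids(child_pids, own_pgid=None):
    """Return the process group IDs of children outside our own group.

    Hypothetical helper: children that exit between enumeration and the
    lookup are silently skipped instead of raising ProcessLookupError.
    """
    if own_pgid is None:
        own_pgid = os.getpgid(0)  # 0 means "the calling process"
    pgids = set()
    for child in child_pids:
        try:
            child_pgid = os.getpgid(child)
        except ProcessLookupError:
            # The child is already gone, so there is nothing to clean up.
            continue
        if child_pgid != own_pgid:
            pgids.add(child_pgid)
    return pgids
```

The caller can then kill only the groups that are returned, and a child
that finished on its own no longer aborts the whole cleanup.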
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Though now there are oddities with multiprocessing_manager.list():
Having 1 errors! 0 tests passed. 0 tests skipped. 2.20 s elapsed (MainProcess).
Won't run stateful tests because test data wasn't loaded.
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/managers.py", line 813, in _callmethod
    conn = self._tls.connection
           ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 3707, in <module>
    main(args)
  File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 3126, in main
    if len(restarted_tests) > 0:
       ^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 2, in __len__
  File "/usr/lib/python3.12/multiprocessing/managers.py", line 817, in _callmethod
    self._connect()
  File "/usr/lib/python3.12/multiprocessing/managers.py", line 804, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 519, in Client
    c = SocketClient(address)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 647, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Here [1] the hung query failed:
2024.10.07 21:13:29.044675 [ 16750 ] {484a1200-d576-4c03-a82b-2d389b8e773f} <Debug> executeQuery: (from [::1]:43374) SELECT 1 /*hung check*/
(stage: Complete)
2024.10.07 21:13:29.047252 [ 16750 ] {484a1200-d576-4c03-a82b-2d389b8e773f} <Error> executeQuery: Code: 210. DB::Exception: I/O error: Broken pipe, while writing to socket ([::1]:8123 -> [::1]:43374): While executing TabSeparatedRowOutputFormat. (NETWORK_ERROR) (version 24.10.1.1368) (from [::1]:43374) (in query: SELECT 1 /*hung check*/
[1]: https://s3.amazonaws.com/clickhouse-test-reports/70448/cd826389e90065466ddfef140fc344b30e8c6de0/stateless_tests__aarch64_.html
I don't see any possible reason for this other than the client closing
the connection. My bet is that the query was sent a long time ago, but
due to a VM stall (#70473) it was not accepted in time.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This test creates too many threads (1500), and sometimes there are
stalls on arm64 [1]:
2024.10.07 21:06:38.768974 [ 4479 ] {3041e49a-89c3-4991-a59f-6ae221a5afd0} <Debug> executeQuery: (from [::1]:52210) (comment: 00913_many_threads.sql) SELECT count() FROM t; (stage: Complete)
...
2024.10.07 21:07:37.028725 [ 15271 ] {3041e49a-89c3-4991-a59f-6ae221a5afd0} <Trace> AggregatingTransform: Aggregated. 1 to 1 rows (from 0.00 B) in 58.247881431 sec. (0.017 rows/sec., 0.00 B/sec.)
2024.10.07 21:11:20.662232 [ 15270 ] {3041e49a-89c3-4991-a59f-6ae221a5afd0} <Trace> AggregatingTransform: Aggregating
...
As you can see, there are no logs for almost 4 minutes, and trace_log
contains no records for event_time between '2024.10.07 21:07:37' and
'2024.10.07 21:11:20'.
[1]: https://s3.amazonaws.com/clickhouse-test-reports/70448/cd826389e90065466ddfef140fc344b30e8c6de0/stateless_tests__aarch64_.html
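
The size of the gap can be double-checked from the two timestamps quoted
above. A small sketch (the helper name log_gap_seconds is made up for
illustration):

```python
from datetime import datetime

# Timestamp layout used in the server log lines quoted above.
FMT = "%Y.%m.%d %H:%M:%S.%f"


def log_gap_seconds(before, after):
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(after, FMT) - datetime.strptime(before, FMT)
    return delta.total_seconds()


gap = log_gap_seconds("2024.10.07 21:07:37.028725", "2024.10.07 21:11:20.662232")
print(f"{gap:.1f} s")  # 223.6 s, i.e. about 3 min 44 s
```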
At first I thought about adding a new metric like
GlobalThreadPoolThreadSlowCreationMicroseconds, but it is unlikely to
help with understanding the root cause.
Then I thought about exposing the CPU steal metric, but the problem is
that AWS uses Nitro instead of Xen for these instances (r6g.2xlarge),
and Nitro does not expose this metric.
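
For reference, on Linux the steal counter is the eighth value on the
aggregate "cpu" line of /proc/stat. A minimal sketch of reading it,
assuming a hypothetical helper name and a made-up sample line (not taken
from the CI host); where the hypervisor does not report steal time the
value would simply stay at zero:

```python
def parse_cpu_steal(proc_stat_text):
    """Return the steal counter (in USER_HZ ticks) from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # The aggregate line starts with "cpu"; per-CPU lines are "cpu0", ...
        if fields and fields[0] == "cpu":
            # Column order: user nice system idle iowait irq softirq steal ...
            values = [int(v) for v in fields[1:]]
            return values[7] if len(values) > 7 else 0
    raise ValueError("no aggregate cpu line found")


# Made-up sample content, for illustration only.
sample = "cpu  100 0 50 1000 5 0 2 7 0 0\ncpu0 100 0 50 1000 5 0 2 7 0 0\n"
print(parse_cpu_steal(sample))  # 7
```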
So let's just disable parallel execution of this test to stabilize the
CI.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>