Should fix the following [1]:
2024-10-07 21:13:38 02784_connection_string: [ OK ] 9.69 sec.
2024-10-07 21:13:38 Process Process-5:
2024-10-07 21:13:38 Traceback (most recent call last):
2024-10-07 21:13:38 File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
2024-10-07 21:13:38 self.run()
2024-10-07 21:13:38 File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2024-10-07 21:13:38 self._target(*self._args, **self._kwargs)
2024-10-07 21:13:38 File "/usr/bin/clickhouse-test", line 2609, in run_tests_process
2024-10-07 21:13:38 return run_tests_array(*args, **kwargs)
2024-10-07 21:13:38 File "/usr/bin/clickhouse-test", line 2327, in run_tests_array
2024-10-07 21:13:38 stop_tests()
2024-10-07 21:13:38 File "/usr/bin/clickhouse-test", line 445, in stop_tests
2024-10-07 21:13:38 cleanup_child_processes(os.getpid())
2024-10-07 21:13:38 File "/usr/bin/clickhouse-test", line 433, in cleanup_child_processes
2024-10-07 21:13:38 child_pgid = os.getpgid(child)
2024-10-07 21:13:38 ProcessLookupError: [Errno 3] No such process
[1]: https://s3.amazonaws.com/clickhouse-test-reports/70448/cd826389e90065466ddfef140fc344b30e8c6de0/stateless_tests__aarch64_.html
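
A minimal sketch of the kind of guard that avoids the traceback above, assuming the child can exit between the moment it is listed and the moment it is inspected. The helper name getpgid_or_none is hypothetical and is not taken from clickhouse-test:

```python
import os


def getpgid_or_none(pid):
    """Return the process group id of pid, or None if the process is already gone."""
    try:
        return os.getpgid(pid)
    except ProcessLookupError:
        # The process exited (and was reaped) before we looked at it;
        # there is nothing left to clean up for this pid.
        return None
```

A caller would then skip every pid for which None is returned instead of aborting the whole cleanup.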
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Though now there are oddities with multiprocessing_manager.list():
Having 1 errors! 0 tests passed. 0 tests skipped. 2.20 s elapsed (MainProcess).
Won't run stateful tests because test data wasn't loaded.
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/managers.py", line 813, in _callmethod
conn = self._tls.connection
^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 3707, in <module>
main(args)
File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 3126, in main
if len(restarted_tests) > 0:
^^^^^^^^^^^^^^^^^^^^
File "<string>", line 2, in __len__
File "/usr/lib/python3.12/multiprocessing/managers.py", line 817, in _callmethod
self._connect()
File "/usr/lib/python3.12/multiprocessing/managers.py", line 804, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 519, in Client
c = SocketClient(address)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 647, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Here [1] the hung check query failed:
2024.10.07 21:13:29.044675 [ 16750 ] {484a1200-d576-4c03-a82b-2d389b8e773f} <Debug> executeQuery: (from [::1]:43374) SELECT 1 /*hung check*/
(stage: Complete)
2024.10.07 21:13:29.047252 [ 16750 ] {484a1200-d576-4c03-a82b-2d389b8e773f} <Error> executeQuery: Code: 210. DB::Exception: I/O error: Broken pipe, while writing to socket ([::1]:8123 -> [::1]:43374): While executing TabSeparatedRowOutputFormat. (NETWORK_ERROR) (version 24.10.1.1368) (from [::1]:43374) (in query: SELECT 1 /*hung check*/
[1]: https://s3.amazonaws.com/clickhouse-test-reports/70448/cd826389e90065466ddfef140fc344b30e8c6de0/stateless_tests__aarch64_.html
I don't see any possible reason for this other than the client closing
the connection. My bet is that the query had been sent a long time ago,
but due to the VM stall (#70473) it was not accepted in time.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
After https://github.com/ClickHouse/ClickHouse/pull/67737 the output
will be broken, because in case of a timeout it prints to stdout.
Let's just capture that output and append it to stderr.
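
A rough sketch of the capture-and-append idea, assuming the noisy code is a Python function that prints via sys.stdout. Both run_capturing_stdout and noisy_cleanup are hypothetical names used only for illustration:

```python
import contextlib
import io


def run_capturing_stdout(func, *args, **kwargs):
    """Call func and return (result, everything it printed to sys.stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_cleanup(pgid):
    # Hypothetical stand-in for a helper that logs to stdout.
    print(f"Killing process group {pgid}")
    return True


# Append the captured text to what will be reported as stderr,
# so that the stdout of the test itself stays untouched.
stderr_text = "test timed out\n"
_, captured = run_capturing_stdout(noisy_cleanup, 12345)
stderr_text += captured
```

Note that contextlib.redirect_stdout only intercepts writes made through Python's sys.stdout; output written directly by child processes is not captured this way.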
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Previously it was possible to get the original pgid from the spawned
threads, which could lead to killing the caller script, and in case of CI
that could be the init process [1].
[1]: https://s3.amazonaws.com/clickhouse-test-reports/67737/e68c9c8d16f37f6c25739076c9b071ed97952269/stress_test__asan_/stress_test_run_21.txt
Repro:
$ echo "SELECT '1" > tests/queries/0_stateless/00001_select_1.sql # break the test
$ cat /tmp/test.sh
./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings
Before this change:
$ /tmp/test.sh
Using queries from '/src/ch/worktrees/clickhouse-upstream/tests/queries' directory
Connecting to ClickHouse server... OK
Connected to server 24.8.1.1 @ bef896ce143ea4e0464c9829de6277ba06cc1a53 mt/rename-without-lock-v2
Running 3 stateless tests (MainProcess).
00001_select_1: [ FAIL ]
Reason: return code: 62
Code: 62. DB::Exception: Syntax error: failed at position 8 (''1;
'): '1;
. Single quoted string is not closed: ''1;
'. (SYNTAX_ERROR)
, result:
stdout:
Database: test_hz2zwymr
Child processes of 13041:
13042 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings
Killing process group 13040
Processes in process group 13040:
13040 -bash
13042 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings
[2]+ Stopped /tmp/test.sh
[1]$ Process group 13040 should be killed
Max failures chain
[2]+ Killed /tmp/test.sh
After:
$ /tmp/test.sh
Using queries from '/src/ch/worktrees/clickhouse-upstream/tests/queries' directory
Connecting to ClickHouse server... OK
Connected to server 24.8.1.1 @ bef896ce143ea4e0464c9829de6277ba06cc1a53 mt/rename-without-lock-v2
Running 3 stateless tests (MainProcess).
00001_select_1: [ FAIL ]
Reason: return code: 62
Code: 62. DB::Exception: Syntax error: failed at position 8 (''1;
'): '1;
. Single quoted string is not closed: ''1;
'. (SYNTAX_ERROR)
, result:
stdout:
Database: test_urz6rk5z
Child processes of 9782:
9785 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings
Max failures chain
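
The difference between the two runs above can be illustrated with a small sketch: a child started without start_new_session=True stays in the caller's process group, so killing "its" group also kills the caller. The helper is_safe_to_kill is a hypothetical name, and the sleeping children are only placeholders for real tests:

```python
import os
import subprocess
import sys

SLEEP = [sys.executable, "-c", "import time; time.sleep(30)"]


def is_safe_to_kill(pgid):
    """True only if pgid is not the process group of the current script."""
    return pgid != os.getpgid(0)


inherited = subprocess.Popen(SLEEP)  # stays in our process group
detached = subprocess.Popen(SLEEP, start_new_session=True)  # leads its own group
try:
    inherited_pgid = os.getpgid(inherited.pid)
    detached_pgid = os.getpgid(detached.pid)
finally:
    # Clean up the placeholder children one by one, not by group.
    for proc in (inherited, detached):
        proc.kill()
        proc.wait()
```

Here is_safe_to_kill(inherited_pgid) is False, because that group also contains the script itself (and whatever shell launched it), while is_safe_to_kill(detached_pgid) is True.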
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
It was simply wrong before, but now that capturing the stack trace can
take some time, it is a must.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Previously, process cleanup on e.g. SIGINT simply did not work, because
the launcher kills only the processes in its own process group, while
tests are launched with start_new_session=True for Popen(), which
creates a separate process group.
(start_new_session=True is needed to be able to kill the whole process
group in case of a test timeout.)
So instead, look at the parent pid and kill the process groups of its
children.
Also add some logging to make it more explicit which processes will be
killed.
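
A Linux-only sketch of this approach, assuming children are discovered by scanning /proc for processes whose parent pid matches. The function names are hypothetical, and the actual script may discover children differently:

```python
import os
import signal
import sys


def child_process_groups(parent_pid):
    """Return {pgid: [pids]} for the direct children of parent_pid (via /proc)."""
    groups = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name may contain spaces or ')', so split on the last ')'.
        # The fields that follow are: state, ppid, pgrp, ...
        fields = stat.rsplit(")", 1)[1].split()
        ppid, pgid = int(fields[1]), int(fields[2])
        if ppid == parent_pid:
            groups.setdefault(pgid, []).append(int(entry))
    return groups


def kill_child_process_groups(parent_pid, sig=signal.SIGKILL):
    """Kill every process group that contains a child of parent_pid."""
    own_pgid = os.getpgid(0)
    killed = []
    for pgid, pids in child_process_groups(parent_pid).items():
        if pgid == own_pgid:
            continue  # never signal the group we are running in
        print(f"Killing process group {pgid} (children: {pids})", file=sys.stderr)
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            continue  # the group vanished in the meantime
        killed.append(pgid)
    return killed
```

A SIGINT handler could then call kill_child_process_groups(os.getpid()), which reaches tests started in their own sessions because they are still children of the launcher.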
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>