Set --server-check-retries to 90 (and this is ~45 seconds), since right
now sometimes it is not enough [1].
[1]: https://clickhouse-test-reports.s3.yandex.net/30191/0e34a9d550cfe6924fe575871f36c44dd44acdaa/functional_stateless_tests_(memory).html#fail1
And the reason I guess is clickhouse-test had been rewritten to
http.client in #30065, and since now it does not need to execute
clickhouse-client binary, which in debug/sanitizers builds can take also
sometime.
That said that with clickhouse-client for hung check it was not 15
seconds, but more (each clickhouse-client requires 0.6sec with
sanitizers for simple SELECT 1, while w/o 0.1second, also too much
should be optimized)
Cons of clickhouse-driver:
- it is one more extra dependency
- it does not have correct timeouts (only for socket operations, and
this is not the same, so we need to set timeout by ourself)
- it is one more thing which can break (@alesapin)
Pros:
- Using native protocol over executing binaries is always better
- `clickhouse-client` in debug build takes almost a second to execute simple `SELECT 1`
and `clickhouse-test` requires ~5 queries at start (determine some
flags, zk, alive, create database)
Notes:
- `FORMAT Vertical` had been replaced with printing of `pandas.DataFrame`
And after this patch tiny tests work with the speed of the test, and
does not requires +-5 seconds of bootstrapping.
Right now it is possible to get the following error:
Having 20 errors! 0 tests passed. 0 tests skipped. 57.37 s elapsed (MainProcess).
Won't run stateful tests because test data wasn't loaded.
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 1462, in <module>
main(args)
File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 1261, in main
if len(restarted_tests) > 0:
File "<string>", line 2, in __len__
File "/usr/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
self._connect()
File "/usr/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python3.9/multiprocessing/connection.py", line 507, in Client
c = SocketClient(address)
File "/usr/lib/python3.9/multiprocessing/connection.py", line 635, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
The reason behind this is that manager's thread got terminated:
ipdb> p restarted_tests._manager._process
<ForkProcess name='SyncManager-1' pid=25125 parent=24939 stopped exitcode=-SIGTERM>
Refs: #29259 (cc: @vdimir)
Follow-up for: #29197 (cc: @tavplubix)
Variables aren't shared when using multiprocessing, use shared memory
instead
https://docs.python.org/3/library/multiprocessing.html#shared-ctypes-objects.
There appears to be a deadlock when multiple threads try to send
sigterm signal at the same time. Avoid it by making sure sigterm is sent
only once for the process group.
Replacing stderr by default does not makes a lot of sense, since it does
not compared with .reference anyway. From the other hand it may confuse
the reader (seeing "default" instead of real database name in the test
logs).
P.S. I guess some tests may relay on this, let's see.
This is another try of not leaving child processes in clickhouse-test,
first one was in [1] by @akuzm:
"I tried to do this earlier with a separate process group + atexit callbacks:
573983d407 (diff-3a359de18cacf146f406a7ae332fb47196aa5e0aa430eb4b157a202a3cb8e6e3R578)
But that commit was later reverted because it also tried to switch to
multithreading instead of multiprocessing, and that didn't go good.
SIG_IGN and SIG_DFL were broken then
(https://bugs.python.org/issue23395), now they are fixed but not quite
but maybe it's not relevant for us."
I looked (only briefly) through that bug report in python, but I don't
see any issues with killing child processes during testing this patch.
Plus to me it is better to get some unknown python error (and fix it
somehow) instead of leaving child processes.
v2: correctly catch INT/TERM/HUP too