Recently (#32094) test database had been overlapped, and random prefix
for database had been increased from 6 to 8.
But actually 6 bytes for random prefix should be enough (with existing
alphabet (0-9a-z) it is 36**6=2'176'782'336), and the real reason of
this overlap is that random generator by default uses shared state [1]:
The functions supplied by this module are actually bound methods of
a hidden instance of the random.Random class. You can instantiate your
own instances of Random to get generators that don’t share state.
[1]: https://docs.python.org/3/library/random.html
I've played a little bit with random in python, and using default random
generator it generates non-unique strings pretty fast, just in a few
runs, but using SystemRandom (that uses /dev/urandom) it takes ~1 minute.
Test:
```sh
$ while /tmp/test.py | LANG=c sort -S5G | LANG=c uniq -d | tee /dev/stderr | wc -l | fgrep -q -x -c 0; do :; done
```
```python
#!/usr/bin/env python3
import multiprocessing
import string
import random
def random_str(length=6):
alphabet = string.ascii_lowercase + string.digits
return ''.join(random.SystemRandom().choice(alphabet) for _ in range(length))
def worker(_):
print(random_str())
with multiprocessing.Pool(processes=2) as pool:
pool.map(worker, range(0, int(10e3)))
```
So let's switch to SystemRandom and use 6-byte prefix.
Set --server-check-retries to 90 (and this is ~45 seconds), since right
now sometimes it is not enough [1].
[1]: https://clickhouse-test-reports.s3.yandex.net/30191/0e34a9d550cfe6924fe575871f36c44dd44acdaa/functional_stateless_tests_(memory).html#fail1
And the reason I guess is clickhouse-test had been rewritten to
http.client in #30065, and since now it does not need to execute
clickhouse-client binary, which in debug/sanitizers builds can take also
sometime.
That said that with clickhouse-client for hung check it was not 15
seconds, but more (each clickhouse-client requires 0.6sec with
sanitizers for simple SELECT 1, while w/o 0.1second, also too much
should be optimized)
Cons of clickhouse-driver:
- it is one more extra dependency
- it does not have correct timeouts (only for socket operations, and
this is not the same, so we need to set timeout by ourself)
- it is one more thing which can break (@alesapin)
Pros:
- Using native protocol over executing binaries is always better
- `clickhouse-client` in debug build takes almost a second to execute simple `SELECT 1`
and `clickhouse-test` requires ~5 queries at start (determine some
flags, zk, alive, create database)
Notes:
- `FORMAT Vertical` had been replaced with printing of `pandas.DataFrame`
And after this patch tiny tests work with the speed of the test, and
does not requires +-5 seconds of bootstrapping.
Right now it is possible to get the following error:
Having 20 errors! 0 tests passed. 0 tests skipped. 57.37 s elapsed (MainProcess).
Won't run stateful tests because test data wasn't loaded.
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 1462, in <module>
main(args)
File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 1261, in main
if len(restarted_tests) > 0:
File "<string>", line 2, in __len__
File "/usr/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
self._connect()
File "/usr/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python3.9/multiprocessing/connection.py", line 507, in Client
c = SocketClient(address)
File "/usr/lib/python3.9/multiprocessing/connection.py", line 635, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
The reason behind this is that manager's thread got terminated:
ipdb> p restarted_tests._manager._process
<ForkProcess name='SyncManager-1' pid=25125 parent=24939 stopped exitcode=-SIGTERM>
Refs: #29259 (cc: @vdimir)
Follow-up for: #29197 (cc: @tavplubix)
Variables aren't shared when using multiprocessing, use shared memory
instead
https://docs.python.org/3/library/multiprocessing.html#shared-ctypes-objects.
There appears to be a deadlock when multiple threads try to send
sigterm signal at the same time. Avoid it by making sure sigterm is sent
only once for the process group.
Replacing stderr by default does not makes a lot of sense, since it does
not compared with .reference anyway. From the other hand it may confuse
the reader (seeing "default" instead of real database name in the test
logs).
P.S. I guess some tests may relay on this, let's see.
This is another try of not leaving child processes in clickhouse-test,
first one was in [1] by @akuzm:
"I tried to do this earlier with a separate process group + atexit callbacks:
573983d407 (diff-3a359de18cacf146f406a7ae332fb47196aa5e0aa430eb4b157a202a3cb8e6e3R578)
But that commit was later reverted because it also tried to switch to
multithreading instead of multiprocessing, and that didn't go good.
SIG_IGN and SIG_DFL were broken then
(https://bugs.python.org/issue23395), now they are fixed but not quite
but maybe it's not relevant for us."
I looked (only briefly) through that bug report in python, but I don't
see any issues with killing child processes during testing this patch.
Plus to me it is better to get some unknown python error (and fix it
somehow) instead of leaving child processes.
v2: correctly catch INT/TERM/HUP too