Right now it is possible to have "false-positive" for this hardening,
because there is a tiny delay (that can be quite significant on CI, when
it is under pressure) between when the server sends EndOfStream and the
server removes an entry from system.processes.
But now system.processes has is_all_data_sent column, that means that
the EndOfStream was sent, and we can use it to avoid false positive.
Here is an example of such report [1]:
2022-04-25 03:47:18 00806_alter_update: [ FAIL ] 0.95 sec. - Queries left in background after the test finished:
2022-04-25 03:47:18 "elapsed": 0.100084746,
2022-04-25 03:47:18 "is_cancelled": 0,
2022-04-25 03:47:18 "query": "DROP TABLE alter_update_00806;",
2022-04-25 03:47:18 "thread_ids": [
2022-04-25 03:47:18 "8950"
2022-04-25 03:47:18 ],
2022.04.25 03:47:17.887095 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> executeQuery: (from [::1]:52012) (comment: 00806_alter_update.sql) DROP TABLE alter_update_00806;
2022.04.25 03:47:17.887493 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Trace> ContextAccess (default): Access granted: DROP TABLE ON test_7ntsjn.alter_update_00806
2022.04.25 03:47:17.887765 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Trace> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Found 2 old parts to remove.
2022.04.25 03:47:17.887947 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Removing part from filesystem 20180101_20180101_1_1_0
2022.04.25 03:47:17.888960 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> test_7ntsjn.alter_update_00806 (1bc92bca-10a7-444e-be5e-7f61f4650169): Removing part from filesystem 20180102_20180102_2_2_0
2022.04.25 03:47:17.890620 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> DatabaseCatalog: Waiting for table 1bc92bca-10a7-444e-be5e-7f61f4650169 to be finally dropped
2022.04.25 03:47:17.895046 [ 8950 ] {7c062004-4c22-486c-934a-f405846e2c81} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
...
2022.04.25 03:47:17.938328 [ 4422 ] {aa01985a-78f5-4c0e-b646-8d04a4a1dc77} <Debug> executeQuery: (from [::1]:59416) (comment: 00806_alter_update.sql) DROP DATABASE test_7ntsjn
2022.04.25 03:47:17.938667 [ 4422 ] {aa01985a-78f5-4c0e-b646-8d04a4a1dc77} <Trace> ContextAccess (default): Access granted: DROP DATABASE ON test_7ntsjn.*
...
2022.04.25 03:47:18.154847 [ 8950 ] {} <Debug> TCPHandler: Processed in 0.269358257 sec.
2022.04.25 03:47:18.154991 [ 8950 ] {} <Debug> TCPHandler: Done processing connection.
2022.04.25 03:47:18.155181 [ 8950 ] {} <Debug> TCP-Session: e1d8176a-ee62-4e0a-9855-fe9eb52e06dc Destroying unnamed session of user 94309d50-4f52-5250-31bd-74fecac179db
[1]: https://s3.amazonaws.com/clickhouse-test-reports/36319/a646cf76b6d4699f06aea1e8d777edb1ad6fd2c5/stateless_tests__debug__actions__[1/3]/runlog.log
So as you can see here DROP TABLE was captured when elapsed was 0.1,
while TCPHandler processes it for 0.26 seconds.
Also from the same report you are seeing that DROP DATABASE was executed
before TCPHandler stopoped processing DROP TABLE.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Before:
00170_s3_cache:
Found queries left in processlist after running 00170_s3_cache.sql (database=test_8zpbah):
[
...
]
2022-03-27 00:11:27 [ FAIL ] 600.06 sec. - Timeout!
After:
02263_test: [ FAIL ] - Queries left in background after the test finished:
[
...
]
Settings used in the test: --max_insert_threads=11 --group_by_two_level_threshold=100000 --group_by_two_level_threshold_bytes=50000000 --distributed_aggregation_memory_efficient=0 --fsync_metadata=0 --priority=2 --output_format_parallel_formatting=1 --input_format_parallel_parsing=0 --min_chunk_bytes_for_parallel_parsing=12372842 --max_read_buffer_size=855730 --prefer_localhost_replica=1 --max_block_size=74323 --max_threads=56
Database: default
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Look at log_comment/current_database for this check under --no-left-queries-check
v2: add no-left-queries-check tag support
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Recently (#32094) test database had been overlapped, and random prefix
for database had been increased from 6 to 8.
But actually 6 bytes for random prefix should be enough (with existing
alphabet (0-9a-z) it is 36**6=2'176'782'336), and the real reason of
this overlap is that random generator by default uses shared state [1]:
The functions supplied by this module are actually bound methods of
a hidden instance of the random.Random class. You can instantiate your
own instances of Random to get generators that don’t share state.
[1]: https://docs.python.org/3/library/random.html
I've played a little bit with random in python, and using default random
generator it generates non-unique strings pretty fast, just in a few
runs, but using SystemRandom (that uses /dev/urandom) it takes ~1 minute.
Test:
```sh
$ while /tmp/test.py | LANG=c sort -S5G | LANG=c uniq -d | tee /dev/stderr | wc -l | fgrep -q -x -c 0; do :; done
```
```python
#!/usr/bin/env python3
import multiprocessing
import string
import random
def random_str(length=6):
alphabet = string.ascii_lowercase + string.digits
return ''.join(random.SystemRandom().choice(alphabet) for _ in range(length))
def worker(_):
print(random_str())
with multiprocessing.Pool(processes=2) as pool:
pool.map(worker, range(0, int(10e3)))
```
So let's switch to SystemRandom and use 6-byte prefix.
Set --server-check-retries to 90 (and this is ~45 seconds), since right
now sometimes it is not enough [1].
[1]: https://clickhouse-test-reports.s3.yandex.net/30191/0e34a9d550cfe6924fe575871f36c44dd44acdaa/functional_stateless_tests_(memory).html#fail1
And the reason I guess is clickhouse-test had been rewritten to
http.client in #30065, and since now it does not need to execute
clickhouse-client binary, which in debug/sanitizers builds can take also
sometime.
That said that with clickhouse-client for hung check it was not 15
seconds, but more (each clickhouse-client requires 0.6sec with
sanitizers for simple SELECT 1, while w/o 0.1second, also too much
should be optimized)
Cons of clickhouse-driver:
- it is one more extra dependency
- it does not have correct timeouts (only for socket operations, and
this is not the same, so we need to set timeout by ourself)
- it is one more thing which can break (@alesapin)
Pros:
- Using native protocol over executing binaries is always better
- `clickhouse-client` in debug build takes almost a second to execute simple `SELECT 1`
and `clickhouse-test` requires ~5 queries at start (determine some
flags, zk, alive, create database)
Notes:
- `FORMAT Vertical` had been replaced with printing of `pandas.DataFrame`
And after this patch tiny tests work with the speed of the test, and
does not requires +-5 seconds of bootstrapping.
Right now it is possible to get the following error:
Having 20 errors! 0 tests passed. 0 tests skipped. 57.37 s elapsed (MainProcess).
Won't run stateful tests because test data wasn't loaded.
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 1462, in <module>
main(args)
File "/src/ch/clickhouse/.cmake/../tests/clickhouse-test", line 1261, in main
if len(restarted_tests) > 0:
File "<string>", line 2, in __len__
File "/usr/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
self._connect()
File "/usr/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python3.9/multiprocessing/connection.py", line 507, in Client
c = SocketClient(address)
File "/usr/lib/python3.9/multiprocessing/connection.py", line 635, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
The reason behind this is that manager's thread got terminated:
ipdb> p restarted_tests._manager._process
<ForkProcess name='SyncManager-1' pid=25125 parent=24939 stopped exitcode=-SIGTERM>
Refs: #29259 (cc: @vdimir)
Follow-up for: #29197 (cc: @tavplubix)
Variables aren't shared when using multiprocessing, use shared memory
instead
https://docs.python.org/3/library/multiprocessing.html#shared-ctypes-objects.
There appears to be a deadlock when multiple threads try to send
sigterm signal at the same time. Avoid it by making sure sigterm is sent
only once for the process group.
Replacing stderr by default does not makes a lot of sense, since it does
not compared with .reference anyway. From the other hand it may confuse
the reader (seeing "default" instead of real database name in the test
logs).
P.S. I guess some tests may relay on this, let's see.
This is another try of not leaving child processes in clickhouse-test,
first one was in [1] by @akuzm:
"I tried to do this earlier with a separate process group + atexit callbacks:
573983d407 (diff-3a359de18cacf146f406a7ae332fb47196aa5e0aa430eb4b157a202a3cb8e6e3R578)
But that commit was later reverted because it also tried to switch to
multithreading instead of multiprocessing, and that didn't go good.
SIG_IGN and SIG_DFL were broken then
(https://bugs.python.org/issue23395), now they are fixed but not quite
but maybe it's not relevant for us."
I looked (only briefly) through that bug report in python, but I don't
see any issues with killing child processes during testing this patch.
Plus to me it is better to get some unknown python error (and fix it
somehow) instead of leaving child processes.
v2: correctly catch INT/TERM/HUP too