Found with fuzzer [1] for 00992_system_parts_race_condition_zookeeper:
2021.03.13 11:12:30.385188 [ 42042 ] {2d3a8e17-26be-47c1-974f-bd2c9fc7c3af} <Debug> executeQuery: (from [::1]:58192, using production parser) (comment: '/usr/share/clickhouse-test/queries/1_stateful/00153_aggregate_arena_race.sql') CREATE TABLE alter_tabl
e (a UInt8, b Int16, c Float32, d String, e Array(UInt8), f Nullable(UUID), g Tuple(UInt8, UInt16)) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test_3.alter_table', 'r1') ORDER BY a PARTITION BY b % 10 SETTINGS old_parts_lifetime = 1, cleanup_delay_p
eriod = 1, cleanup_delay_period_random_add = 0;
...
2021.03.13 11:12:30.678387 [ 42042 ] {528cafc5-a02b-4df8-a531-a9a98e37b478} <Debug> executeQuery: (from [::1]:58192, using production parser) (comment: '/usr/share/clickhouse-test/queries/1_stateful/00153_aggregate_arena_race.sql') CREATE TABLE alter_table2 (a UInt8, b Int16, c Float32, d String, e Array(UInt8), f Nullable(UUID), g Tuple(UInt8, UInt16)) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test_3.alter_table', 'r2') ORDER BY a PARTITION BY b % 10 SETTINGS old_parts_lifetime = 1, cleanup_delay_period = 1, cleanup_delay_period_random_add = 0;
...
2021.03.13 11:12:40.671994 [ 4193 ] {d96ee93c-69b0-4e89-b411-16c382ae27a8} <Debug> executeQuery: (from [::1]:59714, using production parser) (comment: '/usr/share/clickhouse-test/queries/1_stateful/00153_aggregate_arena_race.sql') OPTIMIZE TABLE alter_table FINAL
...
2021.03.13 11:12:40.990174 [ 2298 ] {a80f9306-3a73-4778-a921-db53249247e3} <Debug> executeQuery: (from [::1]:59768, using production parser) (comment: '/usr/share/clickhouse-test/queries/1_stateful/00153_aggregate_arena_race.sql') DROP TABLE alter_table;
...
2021.03.13 11:12:41.333054 [ 2298 ] {a80f9306-3a73-4778-a921-db53249247e3} <Debug> test_3.alter_table (d4fedaca-e0f6-4c22-9a4f-9f4d11b6b705): Removing part from filesystem 7_0_0_0
...
2021.03.13 11:12:41.335380 [ 2298 ] {a80f9306-3a73-4778-a921-db53249247e3} <Debug> DatabaseCatalog: Waiting for table d4fedaca-e0f6-4c22-9a4f-9f4d11b6b705 to be finally dropped
...
2021.03.13 11:12:41.781032 [ 4193 ] {d96ee93c-69b0-4e89-b411-16c382ae27a8} <Debug> test_3.alter_table (d4fedaca-e0f6-4c22-9a4f-9f4d11b6b705): Waiting for queue-0000000085 to disappear from r2 queue
...
2021.03.13 11:12:41.900039 [ 371 ] {} <Trace> test_3.alter_table2 (ReplicatedMergeTreeQueue): Not executing log entry queue-0000000085 of type MERGE_PARTS for part 7_0_0_1 because part 7_0_0_0 is not ready yet (log entry for that part is being processed).
2021.03.13 11:12:41.900213 [ 365 ] {} <Trace> test_3.alter_table2 (ReplicatedMergeTreeQueue): Cannot execute alter metadata queue-0000000056 with version 22 because another alter 21 must be executed before
2021.03.13 11:12:41.900231 [ 13762 ] {} <Trace> test_3.alter_table2 (ae877c49-0d30-416d-9afe-27fd457d8fc4): Executing log entry to merge parts -7_0_0_0 to -7_0_0_1
2021.03.13 11:12:41.900330 [ 13762 ] {} <Debug> test_3.alter_table2 (ae877c49-0d30-416d-9afe-27fd457d8fc4): Don't have all parts for merge -7_0_0_1; will try to fetch it instead
...
[1]: https://clickhouse-test-reports.s3.yandex.net/21691/eb3710c164b991b8d4f86b1435a65f9eceb8f1f5/stress_test_(address).html#fail1
Before this patch for distributed queries, that requires cancellation
(simple select from multiple shards with limit, i.e. `select * from
remote('127.{2,3}', system.numbers) limit 100`) it is very easy to
trigger the situation when remote shard is in the middle of sending Data
block while the initiator already send Cancel and expecting some new
packet, but it will receive not new packet, but part of the Data block
that was in the middle of sending before cancellation, and this will
lead to some various errors, like:
- Unknown packet X from server Y
- Unexpected packet from server Y
- and a lot more...
Fix this, by correctly waiting for the pending packet before
cancellation.
It is not very easy to write a test, since localhost is too fast.
Also note, that it is not possible to get these errors with hedged
requests (use_hedged_requests=1) since handle fibers correctly.
But it had been disabled by default for 21.3 in #21534, while
async_socket_for_remote is enabled by default.