Fix possible "No more packets are available" for distributed queries

CI founds the following case:

<details>

    2022.05.25 22:36:06.778808 [ 3037 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Fatal> : Logical error: 'No more packets are available.'.
    2022.05.25 22:42:24.960075 [ 17397 ] {} <Fatal> BaseDaemon: ########################################
    2022.05.25 22:42:24.971173 [ 17397 ] {} <Fatal> BaseDaemon: (version 22.6.1.1, build id: 9A1F9489854CED36) (from thread 3037) (query_id: 77743723-1fcd-4b3d-babc-d0615e3ff40e) (query: SELECT * FROM
    2022.05.25 22:42:25.046871 [ 17397 ] {} <Fatal> BaseDaemon: 5. ./build_docker/../src/Common/Exception.cpp:47: DB::abortOnFailedAssertion()
    2022.05.25 22:42:25.181449 [ 17397 ] {} <Fatal> BaseDaemon: 6. ./build_docker/../src/Common/Exception.cpp:70: DB::Exception::Exception()
    2022.05.25 22:42:25.367710 [ 17397 ] {} <Fatal> BaseDaemon: 7. ./build_docker/../src/Client/MultiplexedConnections.cpp:0: DB::MultiplexedConnections::receivePacketUnlocked()
    2022.05.25 22:42:25.414201 [ 17397 ] {} <Fatal> BaseDaemon: 8. ./build_docker/../src/Client/MultiplexedConnections.cpp:0: DB::MultiplexedConnections::receivePacket()
    2022.05.25 22:42:25.493066 [ 17397 ] {} <Fatal> BaseDaemon: 9. ./build_docker/../src/QueryPipeline/RemoteQueryExecutor.cpp:279: DB::RemoteQueryExecutor::read()
    2022.05.25 22:42:25.612679 [ 17397 ] {} <Fatal> BaseDaemon: 10. ./build_docker/../src/Processors/Sources/RemoteSource.cpp:0: DB::RemoteSource::tryGenerate()

Here are additional logs for this query:
    $ pigz -cd clickhouse-server.stress.log.gz | fgrep -a 77743723-1fcd-4b3d-babc-d0615e3ff40e | fgrep -e Connection -e Distributed -e Fatal
    2022.05.25 22:36:04.698671 [ 6613 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Trace> Connection (127.0.0.2:9000): Connecting. Database: (not specified). User: default
    2022.05.25 22:36:04.722568 [ 3419 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Trace> Connection (127.0.0.2:9000): Connecting. Database: (not specified). User: default
    2022.05.25 22:36:05.014432 [ 6613 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Trace> Connection (127.0.0.2:9000): Connected to ClickHouse server version 22.6.1.
    2022.05.25 22:36:05.091397 [ 6613 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Debug> Connection (127.0.0.2:9000): Sent data for 2 scalars, total 2 rows in 0.000125814 sec., 15602 rows/sec., 68.00 B (517.81 KiB/sec.), compressed 0.4594594594594595 times to 148.00 B (1.10 MiB/sec.)
    2022.05.25 22:36:05.301301 [ 3419 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Trace> Connection (127.0.0.2:9000): Connected to ClickHouse server version 22.6.1.
    2022.05.25 22:36:05.343140 [ 3419 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Debug> Connection (127.0.0.2:9000): Sent data for 2 scalars, total 2 rows in 0.000116304 sec., 16889 rows/sec., 68.00 B (559.80 KiB/sec.), compressed 0.4594594594594595 times to 148.00 B (1.19 MiB/sec.)
    2022.05.25 22:36:06.682535 [ 6613 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Trace> StorageDistributed (remote): (127.0.0.2:9000) Cancelling query because enough data has been read
    2022.05.25 22:36:06.778808 [ 3037 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Fatal> : Logical error: 'No more packets are available.'.
    2022.05.25 22:36:06.789505 [ 3419 ] {77743723-1fcd-4b3d-babc-d0615e3ff40e} <Trace> StorageDistributed (remote): (127.0.0.2:9000) Cancelling query because enough data has been read
    2022.05.25 22:42:24.971173 [ 17397 ] {} <Fatal> BaseDaemon: (version 22.6.1.1, build id: 9A1F9489854CED36) (from thread 3037) (query_id: 77743723-1fcd-4b3d-babc-d0615e3ff40e) (query: SELECT * FROM

</details>

So between cancelling different sources the LOGICAL_ERROR occured, I
believe that this is because of the race:

    T1:                                        T2:
    RemoteQueryExecutor::read()
    checks was_cancelled
                                               RemoteQueryExecutor::tryCancel()
                                               connections->cancel()
    calls connections->receivePacket()

Note, for this problem async_socket_for_remote/use_hedged_requests
should be disabled, and original settings was:
- --max_parallel_replicas=3
- --use_hedged_requests=false
- --allow_experimental_parallel_reading_from_replicas=3

CI: https://s3.amazonaws.com/clickhouse-test-reports/37469/41cb029ed23e77f3a108e07e6b1b1bcb03dc7fcf/stress_test__undefined__actions_/fatal_messages.txt
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This commit is contained in:
Azat Khuzhin 2022-06-03 14:49:22 +03:00
parent 5faaf1b553
commit 078678237e

View File

@ -275,6 +275,7 @@ Block RemoteQueryExecutor::read()
while (true)
{
std::lock_guard lock(was_cancelled_mutex);
if (was_cancelled)
return Block();
@ -602,7 +603,8 @@ void RemoteQueryExecutor::sendExternalTables()
void RemoteQueryExecutor::tryCancel(const char * reason, std::unique_ptr<ReadContext> * read_context)
{
/// Flag was_cancelled is atomic because it is checked in read().
/// Flag was_cancelled is atomic because it is checked in read(),
/// in case of packet had been read by fiber (async_socket_for_remote).
std::lock_guard guard(was_cancelled_mutex);
if (was_cancelled)