For async S3 writes, flushing of the final parts was deferred until the whole
INSERT block had been processed; however, with too many partitions/columns you
may exceed the max_memory_usage limit (since each stream has overhead).
Introduce max_insert_delayed_streams_for_parallel_writes (defaulting to 1000
for S3 and 0 otherwise) to avoid this.
This should fix "Memory limit exceeded" errors in performance tests.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
In #33291 the final part commit was deferred, and now it can take
significantly more time, which may lead to a "Part directory doesn't exist"
error during INSERT:
2022.02.21 18:18:06.979881 [ 11329 ] {insert} <Debug> executeQuery: (from 127.1:24572, user: default) INSERT INTO db.table (...) VALUES
2022.02.21 20:58:03.933593 [ 11329 ] {insert} <Trace> db.table: Renaming temporary part tmp_insert_20220214_18044_18044_0 to 20220214_270654_270654_0.
2022.02.21 21:16:50.961917 [ 11329 ] {insert} <Trace> db.table: Renaming temporary part tmp_insert_20220214_18197_18197_0 to 20220214_270689_270689_0.
...
2022.02.22 21:16:57.632221 [ 64878 ] {} <Warning> db.table: Removing temporary directory /clickhouse/data/db/table/tmp_insert_20220214_18232_18232_0/
...
2022.02.23 12:23:56.277480 [ 11329 ] {insert} <Trace> db.table: Renaming temporary part tmp_insert_20220214_18232_18232_0 to 20220214_273459_273459_0.
2022.02.23 12:23:56.299218 [ 11329 ] {insert} <Error> executeQuery: Code: 107. DB::Exception: Part directory /clickhouse/data/db/table/tmp_insert_20220214_18232_18232_0/ doesn't exist. Most likely it is a logical error. (FILE_DOESNT_EXIST) (version 22.2.1.1) (from 127.1:24572) (in query: INSERT INTO db.table (...) VALUES), Stack trace (when copying this message, always include the lines below):
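Until the commit ordering is fixed, one possible mitigation (an assumption,
not part of this change) is to raise the MergeTree setting
temporary_directories_lifetime so the cleanup thread does not remove
tmp_insert_* directories while such a long-running INSERT is still in flight:

    -- Hypothetical mitigation: keep temporary part directories around longer
    -- than the slowest expected INSERT, so the deferred rename still finds
    -- them. The value (in seconds) is illustrative.
    ALTER TABLE db.table MODIFY SETTING temporary_directories_lifetime = 259200;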
Follow-up for: #28760
Refs: #33291
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Before this change it was possible for StorageMergeTree::totalRows() (which is
used for optimize_trivial_count_query) to return a stale counter, since the
accounting is done in two steps:
- subtract the old number of rows
<-- here the counter can read as zero, even though there are rows -->
- add the new number of rows
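A minimal sketch of how that window is visible to a query (shard_0.tbl is the
table from the test below; the behaviour requires optimize_trivial_count_query
to be enabled):

    -- If this runs while a concurrent merge is between the "subtract" and the
    -- "add" steps, the trivial-count path reads the transiently-zero counter
    -- instead of scanning parts, and count() reports 0 despite existing rows.
    SELECT count() != 0 FROM shard_0.tbl
    SETTINGS optimize_trivial_count_query = 1;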
This was found by CI [1] in the 01615_random_one_shard_insertion test.
Here you can see that the INSERT went to both tables:
<details>
2022.01.16 09:07:34.288252 [ 154369 ] {a1905be0-93da-460c-8c6f-9b5adace72a0} <Debug> DistributedBlockOutputStream: It took 0.035197041 sec. to insert 100
blocks, 2841.1479249065287 rows per second. Insertion status:
Wrote 54 blocks and 54 rows on shard 0 replica 0, localhost:9000 (average 0 ms per block, the slowest block 1 ms)
Wrote 46 blocks and 46 rows on shard 1 replica 0, localhost:9000 (average 0 ms per block, the slowest block 1 ms)
</details>
But the test fails, since the select from shard_0.tbl returns 0; the problem
was a concurrent merge:
<details>
2022.01.16 09:07:34.289470 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Debug> executeQuery: (from [::1]:42082) (comment: 01615_random_one_shard_insertion.sql) select count() != 0 from shard_0.tbl;
2022.01.16 09:07:34.289564 [ 375 ] {c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab::all_1_54_9} <Debug> MergeTask::MergeProjectionsStage: Merge sorted 54 rows, containing 1 columns (1 merged, 0 gathered) in 0.00171193 sec., 31543.345814373253 rows/sec., 246.43 KiB>
2022.01.16 09:07:34.289810 [ 375 ] {c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab::all_1_54_9} <Trace> shard_0.tbl (c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab): Renaming temporary part tmp_merge_all_1_54_9 to all_1_54_9.
2022.01.16 09:07:34.289858 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> ContextAccess (default): Access granted: SELECT(number) ON shard_0.tbl
2022.01.16 09:07:34.289897 [ 375 ] {c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab::all_1_54_9} <Trace> shard_0.tbl (c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab) (MergerMutator): Merged 6 parts: from all_1_49_8 to all_54_54_0
2022.01.16 09:07:34.289920 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> InterpreterSelectQuery: WithMergeableState -> Complete
2022.01.16 09:07:34.289987 [ 375 ] {} <Debug> MemoryTracker: Peak memory usage Mutate/Merge: 3.12 MiB.
2022.01.16 09:07:34.290305 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> MergingAggregatedTransform: Reading blocks of partially aggregated data.
2022.01.16 09:07:34.290332 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Debug> MergingAggregatedTransform: Read 1 blocks of partially aggregated data, total 1 rows.
2022.01.16 09:07:34.290343 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> Aggregator: Merging partially aggregated single-level data.
2022.01.16 09:07:34.290358 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> Aggregator: Merged partially aggregated single-level data.
2022.01.16 09:07:34.290366 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> Aggregator: Converting aggregated data to blocks
2022.01.16 09:07:34.290391 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Debug> Aggregator: Converted aggregated data to blocks. 1 rows, 8.00 B in 1.0939e-05 sec. (91416.034 rows/sec., 714.19 KiB/sec.)
2022.01.16 09:07:34.290709 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Information> executeQuery: Read 1 rows, 4.01 KiB in 0.001187722 sec., 841 rows/sec., 3.30 MiB/sec.
2022.01.16 09:07:34.290774 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
</details>
[1]: https://s3.amazonaws.com/clickhouse-test-reports/33675/7848ea7d609e4c720e8e4494eb6207c0751f5aea/stateless_tests__ubsan__actions_.html
This also fixes a race between DROP TABLE check and INSERT/merges.
v0: use Active parts instead.
v2: fix total counters accounting instead.
v2: rebase against MergeTask
v3: rebase due to conflicts in src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
v4:
- rebase due to conflicts in src/Storages/MergeTree/MergeTask.cpp
- drop common/scope_guard_safe.h (not used)
* initial commit: add setting and stub
* typo
* added test stub
* fix
* wip merging new integration test and code proto
* adding steps interpreters
* adding the first proposed solution (moving parts, etc.)
* added checking zookeeper path existence
* fixing the include
* fixing and sorting includes
* fixing outdated struct
* fix the name
* added ast ptr as level of indirection
* fix ref
* updating the changes
* working on test stub
* fix iterator -> reference
* revert rocksdb submodule update
* fixed show privileges test
* updated the test stub
* replaced rand() with thread_local_rng(), updated the tests
updated the test
fixed test config path
test fix
removed error messages
fixed the test
updated the test
fixed string literal
fixed literal
typo: =
* fixed the empty replica error message
* updated the test and the code with logs
* updated the possible test cases, updated
* added the code/test milestone comments
* updated the test (added more testcases)
* replaced native assert with CH one
* individual replicas recursive delete fix
* updated the AS db.name AST
* two small logging fixes
* manually generated AST fixes
* Updated the test, added the possible algo change
* Some thoughts about optimizing the solution:
ALTER MOVE PARTITION .. TO TABLE -> move to detached/ + ALTER ... ATTACH
* fix
* Removed the replica sync in test as it's invalid
* Some test tweaks
* tmp
* Rewrote the algo to use executeQuery instead of
hand-crafting the ASTPtr.
Two questions are still open.
* tr: logging active parts
* Extracted the parts moving algo into a separate helper function
* Fixed the test data and the queries slightly
* Replaced the query to system.parts with a direct invocation,
started building the test that breaks on various parts.
* Added the case for tables when at least one replica is alive
* Updated the test to test replicas restoration by detaching/attaching
* Altered the test to check restoration without replica restart
* Added the tables swap at startup if the server failed last time
* Hotfix when only /replicas/replica... path was deleted
* Restore ZK paths while creating a replicated MergeTree table
* Updated the docs, fixed the algo for individual replicas restoration case
* Initial parts table storage fix, tests sync fix
* Reverted individual replica restoration to general algo
* Slightly optimised getDataParts
* Trying another solution with parts detaching
* Rewrote algo without any steps, added ON CLUSTER support
* Attaching parts from other replica on restoration
* Getting part checksums from ZK
* Removed ON CLUSTER, finished working solution
* Multiple small changes after review
* Fixing parallel test
* Supporting rewritten form on cluster
* Test fix
* Moar logging
* Using source replica as checksum provider
* improve test, remove some code from parser
* Trying solution with move to detached + forget
* Moving all parts (not only Committed) to detached
* Edited docs for RESTORE REPLICA
* Re-merging
* minor fixes
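For reference, a minimal usage sketch of the resulting statement; the exact
syntax and the db.table name are assumptions here, not spelled out in the
commit list above:

    -- Hypothetical recovery sequence after the replica's ZooKeeper metadata is
    -- lost: re-attach the table (it comes up read-only because the metadata is
    -- missing), then restore the metadata and re-attach the local parts,
    -- taking part checksums from a healthy replica when one exists.
    SYSTEM RESTART REPLICA db.table;
    SYSTEM RESTORE REPLICA db.table;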
Co-authored-by: Alexander Tokmakov <avtokmakov@yandex-team.ru>