Before this change it was possible to return a stale counter from
StorageMergeTree::totalRows() (which is used for
optimize_trivial_count_query), since the accounting is done in two steps
(see the sketch after this list):
- subtract the old number of rows
  <-- at this point the counter can be zero, even though there are rows -->
- add the new number of rows
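A minimal sketch of the race and of the fix, assuming a simplified atomic `total_rows` counter (these are not the real MergeTreeData member or method names): applying the change as a single delta removes the window in which a concurrent totalRows() reader can observe 0.

```cpp
#include <atomic>
#include <cstdint>

/// Hypothetical, simplified stand-in for the parts-size accounting;
/// the member and function names here are illustrative only.
struct PartsAccounting
{
    std::atomic<std::uint64_t> total_rows{0};

    /// Buggy variant: the update is done in two separate steps. Between the
    /// subtraction and the addition, a concurrent totalRows() call (used by
    /// optimize_trivial_count_query) can observe 0 even though the table
    /// still contains rows.
    void replacePartsRacy(std::uint64_t old_rows, std::uint64_t new_rows)
    {
        total_rows.fetch_sub(old_rows);   /// <-- reader may see a stale 0 here
        total_rows.fetch_add(new_rows);
    }

    /// Fixed variant: apply the change as a single delta, so the counter
    /// jumps from the old total directly to the new total without passing
    /// through an intermediate, too-small value.
    void replacePartsFixed(std::uint64_t old_rows, std::uint64_t new_rows)
    {
        if (new_rows >= old_rows)
            total_rows.fetch_add(new_rows - old_rows);
        else
            total_rows.fetch_sub(old_rows - new_rows);
    }

    std::uint64_t totalRows() const { return total_rows.load(); }
};
```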
This was found by CI [1] in the 01615_random_one_shard_insertion test.
Here you can see that the INSERT went to both tables:
<details>
2022.01.16 09:07:34.288252 [ 154369 ] {a1905be0-93da-460c-8c6f-9b5adace72a0} <Debug> DistributedBlockOutputStream: It took 0.035197041 sec. to insert 100
blocks, 2841.1479249065287 rows per second. Insertion status:
Wrote 54 blocks and 54 rows on shard 0 replica 0, localhost:9000 (average 0 ms per block, the slowest block 1 ms)
Wrote 46 blocks and 46 rows on shard 1 replica 0, localhost:9000 (average 0 ms per block, the slowest block 1 ms)
</details>
But the test fails, since the select from shard1.tbl returns 0, and the
problem was a concurrent merge:
<details>
2022.01.16 09:07:34.289470 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Debug> executeQuery: (from [::1]:42082) (comment: 01615_random_one_shard_insertion.sql) select count() != 0 from shard_0.tbl;
2022.01.16 09:07:34.289564 [ 375 ] {c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab::all_1_54_9} <Debug> MergeTask::MergeProjectionsStage: Merge sorted 54 rows, containing 1 columns (1 merged, 0 gathered) in 0.00171193 sec., 31543.345814373253 rows/sec., 246.43 KiB>
2022.01.16 09:07:34.289810 [ 375 ] {c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab::all_1_54_9} <Trace> shard_0.tbl (c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab): Renaming temporary part tmp_merge_all_1_54_9 to all_1_54_9.
2022.01.16 09:07:34.289858 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> ContextAccess (default): Access granted: SELECT(number) ON shard_0.tbl
2022.01.16 09:07:34.289897 [ 375 ] {c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab::all_1_54_9} <Trace> shard_0.tbl (c7a885fa-4ef4-4dcf-a4de-1650d44fa0ab) (MergerMutator): Merged 6 parts: from all_1_49_8 to all_54_54_0
2022.01.16 09:07:34.289920 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> InterpreterSelectQuery: WithMergeableState -> Complete
2022.01.16 09:07:34.289987 [ 375 ] {} <Debug> MemoryTracker: Peak memory usage Mutate/Merge: 3.12 MiB.
2022.01.16 09:07:34.290305 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> MergingAggregatedTransform: Reading blocks of partially aggregated data.
2022.01.16 09:07:34.290332 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Debug> MergingAggregatedTransform: Read 1 blocks of partially aggregated data, total 1 rows.
2022.01.16 09:07:34.290343 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> Aggregator: Merging partially aggregated single-level data.
2022.01.16 09:07:34.290358 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> Aggregator: Merged partially aggregated single-level data.
2022.01.16 09:07:34.290366 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Trace> Aggregator: Converting aggregated data to blocks
2022.01.16 09:07:34.290391 [ 154344 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Debug> Aggregator: Converted aggregated data to blocks. 1 rows, 8.00 B in 1.0939e-05 sec. (91416.034 rows/sec., 714.19 KiB/sec.)
2022.01.16 09:07:34.290709 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Information> executeQuery: Read 1 rows, 4.01 KiB in 0.001187722 sec., 841 rows/sec., 3.30 MiB/sec.
2022.01.16 09:07:34.290774 [ 146495 ] {cd9d4cf2-7131-4179-b0b2-3aeec4045755} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
</details>
[1]: https://s3.amazonaws.com/clickhouse-test-reports/33675/7848ea7d609e4c720e8e4494eb6207c0751f5aea/stateless_tests__ubsan__actions_.html
This also fixes a race between the DROP TABLE check and INSERT/merges.
v0: use Active parts instead.
v2: fix total counters accounting instead.
v2: rebase against MergeTask
v3: rebase due to conflicts in src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
v4:
- rebase due to conflicts in src/Storages/MergeTree/MergeTask.cpp
- drop common/scope_guard_safe.h (not used)
* initial commit: add setting and stub
* typo
* added test stub
* fix
* wip merging new integration test and code proto
* adding steps interpreters
* adding the first proposed solution (moving parts, etc.)
* added checking zookeeper path existence
* fixing the include
* fixing and sorting includes
* fixing outdated struct
* fix the name
* added ast ptr as level of indirection
* fix ref
* updating the changes
* working on test stub
* fix iterator -> reference
* revert rocksdb submodule update
* fixed show privileges test
* updated the test stub
* replaced rand() with thread_local_rng(), updated the tests
updated the test
fixed test config path
test fix
removed error messages
fixed the test
updated the test
fixed string literal
fixed literal
typo: =
* fixed the empty replica error message
* updated the test and the code with logs
* updated the possible test cases, updated
* added the code/test milestone comments
* updated the test (added more testcases)
* replaced native assert with CH one
* individual replicas recursive delete fix
* updated the AS db.name AST
* two small logging fixes
* manually generated AST fixes
* Updated the test, added the possible algo change
* Some thoughts about optimizing the solution:
ALTER MOVE PARTITION .. TO TABLE -> move to detached/ + ALTER ... ATTACH
* fix
* Removed the replica sync in test as it's invalid
* Some test tweaks
* tmp
* Rewrote the algo to use executeQuery instead of hand-crafting the ASTPtr.
Two questions are still open.
* tr: logging active parts
* Extracted the parts moving algo into a separate helper function
* Fixed the test data and the queries slightly
* Replaced the query to system.parts with a direct invocation,
started building the test that breaks on various parts.
* Added the case for tables when at least one replica is alive
* Updated the test to test replicas restoration by detaching/attaching
* Altered the test to check restoration without replica restart
* Added the tables swap in the start if the server failed last time
* Hotfix when only /replicas/replica... path was deleted
* Restore ZK paths while creating a replicated MergeTree table
* Updated the docs, fixed the algo for individual replicas restoration case
* Initial parts table storage fix, tests sync fix
* Reverted individual replica restoration to general algo
* Slightly optimised getDataParts
* Trying another solution with parts detaching
* Rewrote algo without any steps, added ON CLUSTER support
* Attaching parts from other replica on restoration
* Getting part checksums from ZK
* Removed ON CLUSTER, finished working solution
* Multiple small changes after review
* Fixing parallel test
* Supporting rewritten form on cluster
* Test fix
* Moar logging
* Using source replica as checksum provider
* improve test, remove some code from parser
* Trying solution with move to detached + forget
* Moving all parts (not only Committed) to detached
* Edited docs for RESTORE REPLICA
* Re-merging
* minor fixes
Co-authored-by: Alexander Tokmakov <avtokmakov@yandex-team.ru>
TODO (suggested by Nikolai)
1. Build the query plan for the current query (inside storage::read) up to WithMergeableState
2. Check that the plan is simple enough: Aggregating - Expression - Filter - ReadFromStorage (or simpler)
3. Check that the filter is the same as the filter in the projection, and that the expression calculates the same aggregation keys as in the projection
4. Return WithMergeableState if the projection applies
Step 3 will be easier to do with ActionsDAG, because it sees all functions and the dependencies are direct (but it is also possible with ExpressionActions).
Also need to figure out how prewhere works for projections, and
row_filter_policies.
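A rough sketch of the structural check in step 2, written against a hypothetical simplified plan-node type rather than the real QueryPlan/IQueryPlanStep classes: it only walks a linear plan and verifies that the steps form an Aggregating - Expression - Filter - ReadFromStorage chain, with optional steps allowed.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

/// Hypothetical, simplified stand-in for a query plan node; the real code
/// would inspect QueryPlan nodes and the concrete step classes
/// (aggregating, expression, filter, storage read, ...).
enum class StepKind { Aggregating, Expression, Filter, ReadFromStorage, Other };

struct PlanNode
{
    StepKind kind;
    std::vector<std::shared_ptr<PlanNode>> children;
};

/// Step 2 of the TODO: the plan is "simple enough" if it is a single linear
/// chain whose steps appear in the order
///   Aggregating -> Expression -> Filter -> ReadFromStorage,
/// where any step except the final read may be missing.
bool isSimpleAggregationPlan(const PlanNode & root)
{
    static const StepKind expected[] = {
        StepKind::Aggregating, StepKind::Expression,
        StepKind::Filter, StepKind::ReadFromStorage};

    const PlanNode * node = &root;
    std::size_t pos = 0;
    while (true)
    {
        /// Skip expected step kinds that this particular plan does not have.
        while (pos < 4 && expected[pos] != node->kind)
            ++pos;
        if (pos == 4)
            return false;   /// a step kind outside the allowed chain

        if (node->kind == StepKind::ReadFromStorage)
            return node->children.empty();   /// the read must be the leaf

        if (node->children.size() != 1)
            return false;   /// joins/unions etc. are not "simple enough"

        node = node->children.front().get();
        ++pos;
    }
}
```

Steps 3 and 4 would then compare the filter/expression actions against the projection's (per the TODO, more convenient over ActionsDAG) and, on a match, answer WithMergeableState.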
wip