This was introduced in https://github.com/ClickHouse/ClickHouse/pull/8602.
The idea was to avoid data re-appearing in ClickHouse after DROP/DETACH
PARTITION. This problem was only present in MergeTree engine and I don't
understand why we need to do the same in ReplicatedMergeTree.
For ReplicatedMergeTree the source of truth is stored in ZK; deleting
things from the filesystem just introduces inconsistencies, and this is the
main source of errors like "No active replica has part X or covering
part".
The resulting problem is fixed by
https://github.com/ClickHouse/ClickHouse/pull/25820, but in my opinion
it would be better to avoid introducing the ZK/FS inconsistency in the
first place.
When does this inconsistency appear? Often the sequence is like this:
0. Write 2 parts to ZK [all_0_0_0, all_1_1_0]
1. A merge gets scheduled
2. New part replaces old parts [new: all_0_1_1, old: all_0_0_0, all_1_1_0]
3. Replica gets shut down and old parts are removed from filesystem
4. Replica comes back online, metadata about all parts is still stored in ZK for this new replica.
5. Other replica after cleanup thread runs will have only [all_0_1_1] in
ZK
6. User triggers a DROP_RANGE after a while (drop range is for all_0_1_9999*)
7. Each replica deletes from ZK only [all_0_1_1]. The replica that got
restarted uses its in-memory state to choose nodes to delete from ZK.
8. Restart the replica again. It will now think that there are 2 parts
that it lost and needs to fetch them [all_0_0_0, all_1_1_0].
`clearOldPartsAndRemoveFromZK`, which is triggered from the cleanup thread,
runs the cleanup sequence correctly: it first removes things from ZK and
then from the filesystem. I don't see much benefit in triggering it on
shutdown and would rather have it called from a single place.
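The sequence above can be replayed with a toy model. This is only a sketch, not ClickHouse code: the `Replica` class, its methods, and the `covers` helper are hypothetical names, and ZooKeeper, the filesystem, and the in-memory part set are each modelled as a plain Python set.

```python
def block_range(part):
    # Part names look like <partition>_<min_block>_<max_block>_<level>.
    partition, lo, hi, _level = part.rsplit("_", 3)
    return partition, int(lo), int(hi)


def covers(big, small):
    bp, blo, bhi = block_range(big)
    sp, slo, shi = block_range(small)
    return bp == sp and blo <= slo and shi <= bhi


class Replica:
    def __init__(self):
        self.zk = set()        # part names registered in ZK for this replica
        self.disk = set()      # part directories on the filesystem
        self.memory = set()    # active parts known to the running server
        self.outdated = set()  # parts replaced by a merge, awaiting cleanup

    def insert(self, part):
        for state in (self.zk, self.disk, self.memory):
            state.add(part)

    def merge(self, old_parts, new_part):
        self.insert(new_part)
        self.memory -= set(old_parts)
        self.outdated |= set(old_parts)

    def cleanup_thread(self):
        # Correct order: ZK first, then the filesystem.
        self.zk -= self.outdated
        self.disk -= self.outdated
        self.outdated.clear()

    def shutdown(self):
        # The behaviour under discussion: old parts leave the disk, not ZK.
        self.disk -= self.outdated
        self.outdated.clear()
        self.memory.clear()

    def start(self):
        # Load parts from disk; report ZK parts that are missing and uncovered.
        self.memory = set(self.disk)
        missing = self.zk - self.disk
        return {p for p in missing if not any(covers(q, p) for q in self.disk)}

    def drop_range(self):
        # Nodes to delete from ZK are chosen from the in-memory state.
        self.zk -= self.memory
        self.disk -= self.memory
        self.memory.clear()


def replay():
    restarted, healthy = Replica(), Replica()
    for replica in (restarted, healthy):
        replica.insert("all_0_0_0")
        replica.insert("all_1_1_0")
        replica.merge(["all_0_0_0", "all_1_1_0"], "all_0_1_1")
    restarted.shutdown()
    restarted.start()          # nothing is lost yet: all_0_1_1 covers both
    healthy.cleanup_thread()
    for replica in (restarted, healthy):
        replica.drop_range()
    restarted.shutdown()
    healthy.shutdown()
    return restarted.start(), healthy.start()
```

In this model `replay()` reports the two old parts as lost on the replica that was shut down before cleanup, and nothing on the replica whose cleanup thread ran first.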
---
This is a very, very edge case situation but it proves that the current
"fix" (https://github.com/ClickHouse/ClickHouse/pull/25820) isn't
complete.
```
create table test(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'one')
order by v
settings old_parts_lifetime = 30;
create table test2(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'two')
order by v
settings old_parts_lifetime = 30;
create table test3(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'three')
order by v
settings old_parts_lifetime = 30;
insert into table test values (1), (2), (3);
insert into table test values (4);
optimize table test final;
detach table test;
detach table test2;
alter table test3 drop partition tuple();
attach table test;
attach table test2;
```
```
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/one/parts
all_0_0_0
all_1_1_0
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/two/parts
all_0_0_0
all_1_1_0
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/three/parts
```
```
detach table test;
attach table test;
```
`test` will now figure out that parts exist only in ZK and will issue `GET_PART`
after first removing parts from ZK.
`test2` will receive a fetch request for unknown parts and will trigger part checks itself.
Because `test` no longer has the parts in ZK, `test2` will mark them as LostForever.
It will also not insert empty parts, because the partition is empty.
`test` is left with `GET_PART` in the queue and stuck.
```
SELECT
table,
type,
replica_name,
new_part_name,
last_exception
FROM system.replication_queue
Query id: 74c5aa00-048d-4bc1-a2ea-6f69501c11a0
Row 1:
──────
table: test
type: GET_PART
replica_name: one
new_part_name: all_0_0_0
last_exception: Code: 234. DB::Exception: No active replica has part all_0_0_0 or covering part. (NO_REPLICA_HAS_PART) (version 21.9.1.1)
Row 2:
──────
table: test
type: GET_PART
replica_name: one
new_part_name: all_1_1_0
last_exception: Code: 234. DB::Exception: No active replica has part all_1_1_0 or covering part. (NO_REPLICA_HAS_PART) (version 21.9.1.1)
```
* initial commit: add setting and stub
* typo
* added test stub
* fix
* wip merging new integration test and code proto
* adding steps interpreters
* adding firstly proposed solution (moving parts etc)
* added checking zookeeper path existence
* fixing the include
* fixing and sorting includes
* fixing outdated struct
* fix the name
* added ast ptr as level of indirection
* fix ref
* updating the changes
* working on test stub
* fix iterator -> reference
* revert rocksdb submodule update
* fixed show privileges test
* updated the test stub
* replaced rand() with thread_local_rng(), updated the tests
updated the test
fixed test config path
test fix
removed error messages
fixed the test
updated the test
fixed string literal
fixed literal
typo: =
* fixed the empty replica error message
* updated the test and the code with logs
* updated the possible test cases, updated
* added the code/test milestone comments
* updated the test (added more testcases)
* replaced native assert with CH one
* individual replicas recursive delete fix
* updated the AS db.name AST
* two small logging fixes
* manually generated AST fixes
* Updated the test, added the possible algo change
* Some thoughts about optimizing the solution:
ALTER MOVE PARTITION .. TO TABLE -> move to detached/ + ALTER ... ATTACH
* fix
* Removed the replica sync in test as it's invalid
* Some test tweaks
* tmp
* Rewrote the algo by using the executeQuery instead of
hand-crafting the ASTPtr.
Two questions still active.
* tr: logging active parts
* Extracted the parts moving algo into a separate helper function
* Fixed the test data and the queries slightly
* Replaced query to system.parts to direct invocation,
started building the test that breaks on various parts.
* Added the case for tables when at least one replica is alive
* Updated the test to test replicas restoration by detaching/attaching
* Altered the test to check restoration without replica restart
* Added the tables swap in the start if the server failed last time
* Hotfix when only /replicas/replica... path was deleted
* Restore ZK paths while creating a replicated MergeTree table
* Updated the docs, fixed the algo for individual replicas restoration case
* Initial parts table storage fix, tests sync fix
* Reverted individual replica restoration to general algo
* Slightly optimised getDataParts
* Trying another solution with parts detaching
* Rewrote algo without any steps, added ON CLUSTER support
* Attaching parts from other replica on restoration
* Getting part checksums from ZK
* Removed ON CLUSTER, finished working solution
* Multiple small changes after review
* Fixing parallel test
* Supporting rewritten form on cluster
* Test fix
* Moar logging
* Using source replica as checksum provider
* improve test, remove some code from parser
* Trying solution with move to detached + forget
* Moving all parts (not only Committed) to detached
* Edited docs for RESTORE REPLICA
* Re-merging
* minor fixes
Co-authored-by: Alexander Tokmakov <avtokmakov@yandex-team.ru>
Before this patch, KILL MUTATION marked a mutation as canceled just by name
(and part numbers), so if multiple tables had the same part name,
killing a mutation for one table would mark it as killed for
another too.
Fix this by comparing StorageID too (it is better to use StorageID over
database/table to avoid ambiguity by using UUIDs for comparing).
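
A minimal sketch of the difference, assuming a flat list of running mutations. `RunningMutation` and both `cancel_*` functions are hypothetical names used only for illustration, and the string `storage_id` stands in for a StorageID.

```python
from dataclasses import dataclass


@dataclass
class RunningMutation:
    storage_id: str   # stand-in for StorageID (e.g. the table UUID)
    part_name: str
    cancelled: bool = False


def cancel_by_part_name(running, part_name):
    # Behaviour described as the bug: the table is not part of the match.
    for mutation in running:
        if mutation.part_name == part_name:
            mutation.cancelled = True


def cancel_by_storage_and_part(running, storage_id, part_name):
    # Behaviour after the fix: the table identity must match as well.
    for mutation in running:
        if mutation.storage_id == storage_id and mutation.part_name == part_name:
            mutation.cancelled = True


def demo():
    def fresh():
        return [
            RunningMutation("uuid-a", "all_0_0_0"),
            RunningMutation("uuid-b", "all_0_0_0"),
        ]

    before = fresh()
    cancel_by_part_name(before, "all_0_0_0")
    after = fresh()
    cancel_by_storage_and_part(after, "uuid-a", "all_0_0_0")
    return [m.cancelled for m in before], [m.cancelled for m in after]
```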
Here is a failure of the 01414_freeze_does_not_prevent_alters on CI [1].
[1]: https://clickhouse-test-reports.s3.yandex.net/24069/9fb69dcf98c71a939d200cad3c8491bf43a44622/functional_stateless_tests_(ubsan).html#fail1
TODO (suggested by Nikolai)
1. Build query plan for current query (inside storage::read) up to WithMergeableState
2. Check that the plan is simple enough: Aggregating - Expression - Filter - ReadFromStorage (or simpler)
3. Check that the filter is the same as the filter in the projection, and also that the expression calculates the same aggregation keys as in the projection
4. Return WithMergeableState if projection applies
Step 3 will be easier to do with ActionsDAG, because it sees all functions and dependencies are direct (but it is also possible with ExpressionActions)
Also need to figure out how prewhere works for projections, and
row_filter_policies.
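
One way to read the "simple enough" check from step 2, as a sketch only. It assumes the plan is given as a list of step names from top to bottom, that each step may appear at most once, and that "or simpler" means any of the first three steps may be absent. `is_simple_enough` and `ALLOWED_ORDER` are hypothetical names.

```python
ALLOWED_ORDER = ["Aggregating", "Expression", "Filter", "ReadFromStorage"]


def is_simple_enough(steps):
    """True if `steps` is an in-order subset of ALLOWED_ORDER ending in a read."""
    if not steps or steps[-1] != "ReadFromStorage":
        return False
    # `step in remaining` consumes the iterator, so order is enforced and
    # a repeated or unknown step exhausts it and fails the check.
    remaining = iter(ALLOWED_ORDER)
    return all(step in remaining for step in steps)
```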
wip
with the explicit data loading and verification.
If the function fails, the exception is rethrown to the caller,
cancelling the fetch in the handling of the replicated log entry,
but this event will also wake the PartCheckThread, which will
issue a re-fetch.
When the part data (e.g. data.bin) is corrupted, but the checksums.txt
is present -- explicitly deleting the checksums.txt.
Removed the extra logging, changed some exception messages.
Now, instead of iterating through the directories, we iterate through the
directories of one of the table disks (which doesn't give us a
substantial boost but is a bit neater to read).
- Updated the system.replication_queue command types.
- Fixed the part ptr being empty (added the checksum loading and
initialization).
- Removed extra logging.
- Updated the docs to make everything clear.
- Multiple small logger fixes.
- Changed the attach_part command -- now it's after check for the
covering parts -- motivation is to do less work with the checksums
fetching.
- Better logging in the integration test.
Found with fuzzer [1] for 00992_system_parts_race_condition_zookeeper:
2021.03.13 11:12:30.385188 [ 42042 ] {2d3a8e17-26be-47c1-974f-bd2c9fc7c3af} <Debug> executeQuery: (from [::1]:58192, using production parser) (comment: '/usr/share/clickhouse-test/queries/1_stateful/00153_aggregate_arena_race.sql') CREATE TABLE alter_table (a UInt8, b Int16, c Float32, d String, e Array(UInt8), f Nullable(UUID), g Tuple(UInt8, UInt16)) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test_3.alter_table', 'r1') ORDER BY a PARTITION BY b % 10 SETTINGS old_parts_lifetime = 1, cleanup_delay_period = 1, cleanup_delay_period_random_add = 0;
...
2021.03.13 11:12:30.678387 [ 42042 ] {528cafc5-a02b-4df8-a531-a9a98e37b478} <Debug> executeQuery: (from [::1]:58192, using production parser) (comment: '/usr/share/clickhouse-test/queries/1_stateful/00153_aggregate_arena_race.sql') CREATE TABLE alter_table2 (a UInt8, b Int16, c Float32, d String, e Array(UInt8), f Nullable(UUID), g Tuple(UInt8, UInt16)) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test_3.alter_table', 'r2') ORDER BY a PARTITION BY b % 10 SETTINGS old_parts_lifetime = 1, cleanup_delay_period = 1, cleanup_delay_period_random_add = 0;
...
2021.03.13 11:12:40.671994 [ 4193 ] {d96ee93c-69b0-4e89-b411-16c382ae27a8} <Debug> executeQuery: (from [::1]:59714, using production parser) (comment: '/usr/share/clickhouse-test/queries/1_stateful/00153_aggregate_arena_race.sql') OPTIMIZE TABLE alter_table FINAL
...
2021.03.13 11:12:40.990174 [ 2298 ] {a80f9306-3a73-4778-a921-db53249247e3} <Debug> executeQuery: (from [::1]:59768, using production parser) (comment: '/usr/share/clickhouse-test/queries/1_stateful/00153_aggregate_arena_race.sql') DROP TABLE alter_table;
...
2021.03.13 11:12:41.333054 [ 2298 ] {a80f9306-3a73-4778-a921-db53249247e3} <Debug> test_3.alter_table (d4fedaca-e0f6-4c22-9a4f-9f4d11b6b705): Removing part from filesystem 7_0_0_0
...
2021.03.13 11:12:41.335380 [ 2298 ] {a80f9306-3a73-4778-a921-db53249247e3} <Debug> DatabaseCatalog: Waiting for table d4fedaca-e0f6-4c22-9a4f-9f4d11b6b705 to be finally dropped
...
2021.03.13 11:12:41.781032 [ 4193 ] {d96ee93c-69b0-4e89-b411-16c382ae27a8} <Debug> test_3.alter_table (d4fedaca-e0f6-4c22-9a4f-9f4d11b6b705): Waiting for queue-0000000085 to disappear from r2 queue
...
2021.03.13 11:12:41.900039 [ 371 ] {} <Trace> test_3.alter_table2 (ReplicatedMergeTreeQueue): Not executing log entry queue-0000000085 of type MERGE_PARTS for part 7_0_0_1 because part 7_0_0_0 is not ready yet (log entry for that part is being processed).
2021.03.13 11:12:41.900213 [ 365 ] {} <Trace> test_3.alter_table2 (ReplicatedMergeTreeQueue): Cannot execute alter metadata queue-0000000056 with version 22 because another alter 21 must be executed before
2021.03.13 11:12:41.900231 [ 13762 ] {} <Trace> test_3.alter_table2 (ae877c49-0d30-416d-9afe-27fd457d8fc4): Executing log entry to merge parts -7_0_0_0 to -7_0_0_1
2021.03.13 11:12:41.900330 [ 13762 ] {} <Debug> test_3.alter_table2 (ae877c49-0d30-416d-9afe-27fd457d8fc4): Don't have all parts for merge -7_0_0_1; will try to fetch it instead
...
[1]: https://clickhouse-test-reports.s3.yandex.net/21691/eb3710c164b991b8d4f86b1435a65f9eceb8f1f5/stress_test_(address).html#fail1
This is used for removing part metadata from ZooKeeper when executing
queue events like `DROP_RANGE` triggered when a user tries to drop a
part or a partition. There are other uses but I'll focus only on this
one.
Before this change the method was giving up silently if it was unable to
remove parts from ZooKeeper and this behaviour seems to be problematic.
It could lead to operation being reported as successful at first but
data reappearing later (very rarely) or "stuck" events in replication
queue.
Here is one particular scenario which I think we've hit:
* Execute a DETACH PARTITION
* DROP_RANGE event put in the queue
* Replicas try to execute dropRange but some of them get disconnected
from ZK and 5 retries aren't enough (ZK is misbehaving); the return code
(false) is ignored and the log pointer advances.
* One of the replicas where dropRange failed is restarted.
* checkParts is executed and it finds parts that weren't removed from
ZK, logs `Removing locally missing part from ZooKeeper and queueing a
fetch` and puts GET_PART on the queue.
* A few things can happen from here:
  * There is a lagging replica that didn't execute DROP_RANGE yet: the part will be
    fetched. The other replica will execute DROP_RANGE later and we'll
    get a diverging set of parts on replicas.
  * Another replica also silently failed to remove parts from ZK: both
    of them are left with GET_PART in the queue and neither of them can
    make progress, logging: `No active replica has part ... or covering
    part`.
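
The difference between giving up silently and failing loudly can be sketched as follows. This is an illustration under assumptions, not the actual method: `ZooKeeperError`, `remove_parts_or_throw`, and `execute_drop_range` are hypothetical names, and the log pointer is modelled as a plain integer.

```python
class ZooKeeperError(Exception):
    """Hypothetical stand-in for a failed ZooKeeper request."""


def remove_parts_or_throw(remove_once, part_names, max_retries=5):
    # Retry a bounded number of times, then raise instead of returning False.
    last_error = None
    for _ in range(max_retries):
        try:
            remove_once(part_names)
            return
        except ZooKeeperError as error:
            last_error = error
    raise RuntimeError(
        f"failed to remove {part_names} from ZK after {max_retries} attempts"
    ) from last_error


def execute_drop_range(log_pointer, remove_once, part_names):
    # The pointer only advances if the removal really happened.
    remove_parts_or_throw(remove_once, part_names)
    return log_pointer + 1


def demo():
    def always_failing(_part_names):
        raise ZooKeeperError("session expired")

    pointer = 7
    try:
        pointer = execute_drop_range(pointer, always_failing, ["all_0_1_1"])
    except RuntimeError:
        pass  # the entry stays in the queue and can be retried later
    return pointer
```

With the exception propagating, the entry is not treated as done, so a later restart does not find parts in ZK that the replica believed it had already removed.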
The year 1925 is the starting point because most timezones
switched to saner (mostly 15-minute-based) offsets sometime
during 1924 or before, which significantly simplifies the implementation.
2238 is to simplify arithmetic for sanitizing LUT index access;
there are fewer than 0x1ffff days from 1925.
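
The arithmetic behind these bounds can be checked with a short sketch. The names `lut_index`, `EPOCH_OFFSET`, and `LUT_MASK` are made up for illustration, and masking the index is an assumed way of sanitizing it, not a description of the actual DateLUTImpl code.

```python
from datetime import date, timedelta

LUT_START = date(1925, 1, 1)
EPOCH = date(1970, 1, 1)
LUT_MASK = 0x1FFFF

# Number of LUT entries that describe days before 1970-01-01.
EPOCH_OFFSET = (EPOCH - LUT_START).days


def lut_index(day_num):
    """Map a signed day number (relative to 1970-01-01) to a LUT index."""
    # Masking keeps any input inside the table, so no range check is needed.
    return (day_num + EPOCH_OFFSET) & LUT_MASK


def days_from_lut_start(day):
    return (day - LUT_START).days


def last_day_for_16_bit_day_num():
    # Largest date reachable with an unsigned 16-bit day number.
    return EPOCH + timedelta(days=0xFFFF)
```

Every day from 1925-01-01 through the end of 2238 gets an index below 0x1ffff, and a 16-bit day number counted from 1970 reaches 2149-06-06.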
* Extended DateLUTImpl internal LUT to 0x1ffff items, some of which
represent negative (pre-1970) time values.
As a collateral benefit, Date now correctly supports dates up to 2149
(instead of 2106).
* Added a new strong typedef ExtendedDayNum, which represents dates
pre-1970 and post 2149.
* Functions that used to return DayNum now return ExtendedDayNum.
* Refactored DateLUTImpl to untie DayNum from the dual role of being
a value and an index (due to negative time). Index is now a different
type LUTIndex with explicit conversion functions from DayNum, time_t,
and ExtendedDayNum.
* Updated DateLUTImpl to properly support values close to epoch start
(1970-01-01 00:00), including negative ones.
* Reduced resolution of DateLUTImpl::Values::time_at_offset_change
to a multiple of 15 minutes to allow storing 64 bits of time_t in
DateLUTImpl::Value while keeping the same size.
* Minor performance updates to DateLUTImpl when building month LUT
by skipping non-start-of-month days.
* Fixed extractTimeZoneFromFunctionArguments to work correctly
with DateTime64.
* New unit-tests and stateless integration tests for both DateTime
and DateTime64.
ATTACH_PART into the replicated log.
The LogEntry now also has the pre-calculated part checksum for this
entry type, which is later used while searching in the detached/ folder.
The original ticket idea was to search for possibly available data
in the detached/ folders for the GET_PART command, but
@tavplubix pointed out this would be quite expensive for every
fetch.
So a new command is going to be introduced, ATTACH_PART, which will
cover ALTER TABLE ATTACH PART and will be the only command that triggers the search.
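
A rough sketch of the lookup idea, assuming detached parts are given as an in-memory mapping from part name to file contents. `part_checksum` and `find_in_detached` are hypothetical names, and SHA-256 over names and contents is only a stand-in for whatever checksum the log entry actually carries.

```python
import hashlib


def part_checksum(files):
    """Hash file names and contents in a stable order (toy checksum)."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode())
        digest.update(files[name])
    return digest.hexdigest()


def find_in_detached(detached, expected_checksum):
    """Return the name of a detached part matching the checksum, else None."""
    for part_name, files in detached.items():
        if part_checksum(files) == expected_checksum:
            return part_name
    return None
```

When nothing matches, the caller would fall back to fetching the part from another replica, as it does for GET_PART.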
The following headers are pretty generic, so use forward declaration as
much as possible:
- Context.h
- Settings.h
- ConnectionTimeouts.h
(This also revealed some missing includes, which have been fixed.)
And split ConnectionTimeouts.h into ConnectionTimeoutsContext.h (since
module part cannot be added for it, due to recursive build dependencies
that will be introduced)
Also remove Settings from the RemoteBlockInputStream/RemoteQueryExecutor
and just pass the context, since settings was passed only in specific
places that can allow making a copy of Context (i.e. Copier).
Approx results (how many units will be recompiled after changing file X?):
- ConnectionTimeouts.h
  - mainline: 100
- Context.h:
  - mainline: ~800
  - patched: 415
- Settings.h:
  - mainline: 900-1K
  - patched: 440 (most of them because of the Context.h)
Extended OPTIMIZE ... DEDUPLICATE syntax to allow explicit (or implicit with asterisk/column transformers) list of columns to check for duplicates on.
Following syntax variants are now supported:
OPTIMIZE TABLE table DEDUPLICATE; -- the old one
OPTIMIZE TABLE table DEDUPLICATE BY *;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);
Note that * behaves just like in SELECT: MATERIALIZED and ALIAS columns are not used for expansion.
Also, it is an error to specify an empty list of columns, to write an expression that results in an empty list of columns, or to deduplicate by an ALIAS column.
Column transformers other than EXCEPT are not supported.
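
The expansion rules above can be illustrated with a small sketch. It is not the parser or the actual implementation: `columns_to_deduplicate_by` is a hypothetical helper, the table is a list of `(name, kind)` pairs, and it assumes that `COLUMNS('regex')` is a partial match over the same columns that `*` expands to.

```python
import re


def columns_to_deduplicate_by(table, by="*", except_cols=()):
    """Resolve a DEDUPLICATE BY clause to a list of column names.

    `by` is "*", a compiled regex (for COLUMNS('...')), or a list of names.
    """
    kinds = dict(table)
    ordinary = [name for name, kind in table if kind == "ordinary"]
    if by == "*":
        selected = ordinary
    elif isinstance(by, re.Pattern):
        selected = [name for name in ordinary if by.search(name)]
    else:
        selected = list(by)
        for name in selected:
            if name not in kinds:
                raise ValueError(f"unknown column {name}")
            if kinds[name] == "alias":
                raise ValueError(f"cannot deduplicate by ALIAS column {name}")
    excluded = set(except_cols)
    selected = [name for name in selected if name not in excluded]
    if not selected:
        raise ValueError("empty list of columns to deduplicate by")
    return selected


TABLE = [
    ("k", "ordinary"),
    ("v", "ordinary"),
    ("m", "materialized"),
    ("a", "alias"),
]
```

For example, `columns_to_deduplicate_by(TABLE, "*", except_cols=("v",))` corresponds to `DEDUPLICATE BY * EXCEPT v` and yields only `k` in this model.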