If during removing replica_path from zookeeper, some error occurred
(zookeeper goes away), then it may not remove everything from zookeeper.
And on DETACH/ATTACH (or server restart, like stress tests does in the
analysis from this comment [1]), it will trigger an error:
The local set of parts of table test_1.alter_table_4 doesn't look like the set of parts in ZooKeeper:
[1]: https://github.com/ClickHouse/ClickHouse/pull/28296#issuecomment-915829943
Fix this, by removing "metadata" at first, and only after this
everything else, this will avoid this error, since on ATTACH such table
will be marked as read-only.
v2: forget to remove remote_replica_path itself
v3: fix test_drop_replica by adding a check for remote_replica_path existence
v2: rebase against MergeTask
v3: rebase due to conflicts in src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
v4:
- rebase due to conflicts in src/Storages/MergeTree/MergeTask.cpp
- drop common/scope_guard_safe.h (not used)
Sometimes we want to push a log entry once and only once. Because it is
not possible to create a sequential node in ZooKeeper and store its name
to a well known location in the same transaction we'll do it in the
other order. First somehow generate a unique identifier, then submit a
log entry with that identifier. Later, we can search through log entries
using the identifier we provided to find the node.
Required for part movement between shards.
This allows to use the same functions with very short timeouts while
ensuring that the actual state is checked at least once instead of
timing out before even looking at at ZK at least once.
The use case is to alert when queue contains broken entries. Especially
important when ClickHouse breaks backwards compatibility between
versions and log entries written by newer versions aren't parseable by
old versions.
```
Code: 27, e.displayText() = DB::Exception: Cannot parse input: expected 'quorum: ' before: 'merge_type: 2\n'
```
This was introduced in https://github.com/ClickHouse/ClickHouse/pull/8602.
The idea was to avoid data re-appearing in ClickHouse after DROP/DETACH
PARTITION. This problem was only present in MergeTree engine and I don't
understand why we need to do the same in ReplicatedMergeTree.
For ReplicatedMergeTree the state of truth is stored in ZK, deleting
things from filesystem just introduces inconsistencies and this is the
main source for errors like "No active replica has part X or covering
part".
The resulting problem is fixed by
https://github.com/ClickHouse/ClickHouse/pull/25820, but in my opinion
we would better avoid introducing the ZK/FS inconsistency in the first
place.
When does this inconsistency appear? Often the sequence is like this:
0. Write 2 parts to ZK [all_0_0_0, all_1_1_0]
1. A merge gets scheduled
2. New part replaces old parts [new: all_0_1_1, old: all_0_0_0, all_1_1_0]
3. Replica gets shutdown and old parts are removed from filesystem
4. Replica comes back online, metadata about all parts is still stored in ZK for this new replica.
5. Other replica after cleanup thread runs will have only [all_0_1_1] in
ZK
5. User triggers a DROP_RANGE after a while (drop range is for all_0_1_9999*)
6. Each replica deletes from ZK only [all_0_1_1]. The replica that got
restarted uses its in-memory state to choose nodes to delete from ZK.
7. Restart the replica again. It will now think that there are 2 parts
that it lost and needs to fetch them [all_0_0_0, all_1_1_0].
`clearOldPartsAndRemoveFromZK` which is triggered from cleanup thread
runs cleanup sequence correctly, it first removes things from ZK and
then from filesystem. I don't see much benefit of triggering it on
shutdown and would rather have it called only from a single place.
---
This is a very, very edge case situation but it proves that the current
"fix" (https://github.com/ClickHouse/ClickHouse/pull/25820) isn't
complete.
```
create table test(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'one')
order by v
settings old_parts_lifetime = 30;
create table test2(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'two')
order by v
settings old_parts_lifetime = 30;
create table test3(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'three')
order by v
settings old_parts_lifetime = 30;
insert into table test values (1), (2), (3);
insert into table test values (4);
optimize table test final;
detach table test;
detach table test2;
alter table test3 drop partition tuple();
attach table test;
attach table test2;
```
```
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/one/parts
all_0_0_0
all_1_1_0
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/two/parts
all_0_0_0
all_1_1_0
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/three/parts
```
```
detach table test;
attach table test;
```
`test` will now figure out that parts exist only in ZK and will issue `GET_PART`
after first removing parts from ZK.
`test2` will receive fetch for unknown parts and will trigger part checks itself.
Because `test` doesn't have the parts anymore in ZK `test2` will mark them as LostForever.
It will also not insert empty parts, because the partition is empty.
`test` is left with `GET_PART` in the queue and stuck.
```
SELECT
table,
type,
replica_name,
new_part_name,
last_exception
FROM system.replication_queue
Query id: 74c5aa00-048d-4bc1-a2ea-6f69501c11a0
Row 1:
──────
table: test
type: GET_PART
replica_name: one
new_part_name: all_0_0_0
last_exception: Code: 234. DB::Exception: No active replica has part all_0_0_0 or covering part. (NO_REPLICA_HAS_PART) (version 21.9.1.1)
Row 2:
──────
table: test
type: GET_PART
replica_name: one
new_part_name: all_1_1_0
last_exception: Code: 234. DB::Exception: No active replica has part all_1_1_0 or covering part. (NO_REPLICA_HAS_PART) (version 21.9.1.1)
```
* initial commit: add setting and stub
* typo
* added test stub
* fix
* wip merging new integration test and code proto
* adding steps interpreters
* adding firstly proposed solution (moving parts etc)
* added checking zookeeper path existence
* fixing the include
* fixing and sorting includes
* fixing outdated struct
* fix the name
* added ast ptr as level of indirection
* fix ref
* updating the changes
* working on test stub
* fix iterator -> reference
* revert rocksdb submodule update
* fixed show privileges test
* updated the test stub
* replaced rand() with thread_local_rng(), updated the tests
updated the test
fixed test config path
test fix
removed error messages
fixed the test
updated the test
fixed string literal
fixed literal
typo: =
* fixed the empty replica error message
* updated the test and the code with logs
* updated the possible test cases, updated
* added the code/test milestone comments
* updated the test (added more testcases)
* replaced native assert with CH one
* individual replicas recursive delete fix
* updated the AS db.name AST
* two small logging fixes
* manually generated AST fixes
* Updated the test, added the possible algo change
* Some thoughts about optimizing the solution:
ALTER MOVE PARTITION .. TO TABLE -> move to detached/ + ALTER ... ATTACH
* fix
* Removed the replica sync in test as it's invalid
* Some test tweaks
* tmp
* Rewrote the algo by using the executeQuery instead of
hand-crafting the ASTPtr.
Two questions still active.
* tr: logging active parts
* Extracted the parts moving algo into a separate helper function
* Fixed the test data and the queries slightly
* Replaced query to system.parts to direct invocation,
started building the test that breaks on various parts.
* Added the case for tables when at least one replica is alive
* Updated the test to test replicas restoration by detaching/attaching
* Altered the test to check restoration without replica restart
* Added the tables swap in the start if the server failed last time
* Hotfix when only /replicas/replica... path was deleted
* Restore ZK paths while creating a replicated MergeTree table
* Updated the docs, fixed the algo for individual replicas restoration case
* Initial parts table storage fix, tests sync fix
* Reverted individual replica restoration to general algo
* Slightly optimised getDataParts
* Trying another solution with parts detaching
* Rewrote algo without any steps, added ON CLUSTER support
* Attaching parts from other replica on restoration
* Getting part checksums from ZK
* Removed ON CLUSTER, finished working solution
* Multiple small changes after review
* Fixing parallel test
* Supporting rewritten form on cluster
* Test fix
* Moar logging
* Using source replica as checksum provider
* improve test, remove some code from parser
* Trying solution with move to detached + forget
* Moving all parts (not only Committed) to detached
* Edited docs for RESTORE REPLICA
* Re-merging
* minor fixes
Co-authored-by: Alexander Tokmakov <avtokmakov@yandex-team.ru>
Before this patch KILL MUTATION marks mutation as canceled just by name
(and part numbers) so if you have multiple tables with the same part
name, then killing mutation for one table, will mark it as killed for
another too.
Fix this by comparing StorageID too (it is better to use StorageID over
database/table to avoid ambiguity by using UUIDs for comparing).
Here is a failure of the 01414_freeze_does_not_prevent_alters on CI [1].
[1]: https://clickhouse-test-reports.s3.yandex.net/24069/9fb69dcf98c71a939d200cad3c8491bf43a44622/functional_stateless_tests_(ubsan).html#fail1