[1] breaks the on-disk format (and the relevant change in the:
[1]: https://github.com/ClickHouse/ClickHouse/pull/12455#discussion_r682830812
Too bad that I checked this patchset only for compatibility after
reverting this patch [2] (the use case: I applied it manually, then
reverted it, and data skipping indexes over a Nullable column were
broken).
[2]: https://github.com/ClickHouse/ClickHouse/pull/12455#issuecomment-823423772
But this patchset actually breaks compatibility with older versions of
ClickHouse for Nullable data skipping indexes after a simple upgrade.
Here is a simple reproducer:
--
-- run this with 21.6 or similar (i.e. w/o this patch)
--
CREATE TABLE data
(
`key` Int,
`value` Nullable(Int),
INDEX value_index value TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY key;
INSERT INTO data SELECT
number,
number
FROM numbers(10000);
SELECT * FROM data WHERE value = 20000 SETTINGS force_data_skipping_indices = 'value_index', max_rows_to_read=1;
Now upgrade and run the query again:
SELECT * FROM data WHERE value = 20000 SETTINGS force_data_skipping_indices = 'value_index', max_rows_to_read=1;
And it will fail because of the on-disk format change:
$ ll --time-style=+ data/*/data/all_1_1_0/skp*.idx
-rw-r----- 1 azat azat 36 data/with_nullable_patch/data/all_1_1_0/skp_idx_value_index.idx
-rw-r----- 1 azat azat 37 data/without_nullable_patch/data/all_1_1_0/skp_idx_value_index.idx
$ md5sum data/*/data/all_1_1_0/skp*.idx
a19c95c4a14506c65665a1e30ab404bf data/with_nullable_patch/data/all_1_1_0/skp_idx_value_index.idx
e50e2fcfa873b232196623d56ab26105 data/without_nullable_patch/data/all_1_1_0/skp_idx_value_index.idx
Note that there is no stable release with this patch included yet, so
there is no need to backport it.
Also note that it was possible to create data skipping indexes over a
Nullable column even before [3].
[3]: https://github.com/ClickHouse/ClickHouse/pull/12455
v2: break the cases when granules have Null values, for backward
compatibility
In case one Distributed table has multiple shards while the underlying
Distributed table has only one, there can be a case where the query
processing stage gets promoted from Complete to
WithMergeableStateAfterAggregation, which is obviously wrong.
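For illustration, a minimal sketch of such a layout (the cluster and table names are hypothetical, not taken from the patch):

```sql
-- `dist_over_dist` is a Distributed table over a cluster with multiple shards,
-- while the table it points to on each shard (`dist_single`) is itself a
-- Distributed table over a single-shard cluster.
CREATE TABLE local_table (k Int) ENGINE = MergeTree ORDER BY k;

CREATE TABLE dist_single AS local_table
ENGINE = Distributed(single_shard_cluster, currentDatabase(), local_table);

CREATE TABLE dist_over_dist AS local_table
ENGINE = Distributed(multi_shard_cluster, currentDatabase(), dist_single);

-- The stage for the inner query over `dist_single` must not be promoted from
-- Complete to WithMergeableStateAfterAggregation just because it has one shard.
SELECT k, count() FROM dist_over_dist GROUP BY k;
```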
Before, the following query was running the LimitBy/Distinct step on the
initiator:
select distinct sharding_key from dist order by k
while this can be omitted.
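For context, a sketch of the setup this applies to (the cluster and table names are hypothetical): because `dist` is sharded by `sharding_key`, every distinct value of that key lives on exactly one shard, so the Distinct computed on the shards is already final.

```sql
CREATE TABLE local_table (k Int, sharding_key Int) ENGINE = MergeTree ORDER BY k;

CREATE TABLE dist AS local_table
ENGINE = Distributed(test_cluster, currentDatabase(), local_table, sharding_key);

-- Rows with equal sharding_key always land on the same shard, so re-running
-- the Distinct/LimitBy step on the initiator adds nothing.
SELECT DISTINCT sharding_key FROM dist ORDER BY k;
```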
Before this patch it wasn't possible to optimize a simple SELECT * FROM
dist ORDER BY (w/o GROUP BY and DISTINCT) to a more optimal stage
(QueryProcessingStage::WithMergeableStateAfterAggregationAndLimit),
since that code was under
allow_nondeterministic_optimize_skip_unused_shards; this patch reworks
it and makes that possible.
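As an illustration (the `dist` table and its `key` column are hypothetical), the kind of query that can now be processed up to WithMergeableStateAfterAggregationAndLimit on the shards:

```sql
-- No GROUP BY or DISTINCT: each shard can sort and apply the LIMIT locally,
-- sending only the already-limited stream to the initiator.
SELECT *
FROM dist
ORDER BY key
LIMIT 10
SETTINGS distributed_push_down_limit = 1;
```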
Also, distributed_push_down_limit is now respected for
optimize_distributed_group_by_sharding_key.
The next step will be to enable distributed_push_down_limit by default.
v2: fix detection of aggregates
After #25884, parts that were dropped (DROP_RANGE) will not be removed
from old mutations, and this makes the mutation stuck.
Interestingly, such a mutation may continue after some merge.
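An illustrative sequence only (the table name and layout are hypothetical), showing how a mutation can end up referring to parts that a later DROP PARTITION covers with a DROP_RANGE:

```sql
CREATE TABLE t (k Int, v Int)
ENGINE = ReplicatedMergeTree('/clickhouse/t', 'r1')
ORDER BY k;

INSERT INTO t SELECT number, number FROM numbers(10);

-- The mutation is recorded against the currently existing parts.
ALTER TABLE t UPDATE v = v + 1 WHERE 1;

-- Dropping the partition produces a DROP_RANGE for those parts; without this fix
-- the dropped parts were not removed from the old mutation, leaving it stuck.
ALTER TABLE t DROP PARTITION tuple();

SELECT mutation_id, is_done, parts_to_do FROM system.mutations WHERE table = 't';
```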
The use case is to alert when the queue contains broken entries. This is
especially important when ClickHouse breaks backwards compatibility
between versions and log entries written by newer versions aren't
parseable by old versions:
```
Code: 27, e.displayText() = DB::Exception: Cannot parse input: expected 'quorum: ' before: 'merge_type: 2\n'
```
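For illustration, one possible way to surface such entries for alerting (the exact query and thresholds are up to the operator, not part of this patch):

```sql
-- Entries that keep failing, e.g. because this version cannot parse a log entry
-- written by a newer replica, show up with a non-empty last_exception.
SELECT database, table, replica_name, type, new_part_name, num_tries, last_exception
FROM system.replication_queue
WHERE last_exception != ''
ORDER BY num_tries DESC;
```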
This was introduced in https://github.com/ClickHouse/ClickHouse/pull/8602.
The idea was to avoid data re-appearing in ClickHouse after DROP/DETACH
PARTITION. This problem was only present in the MergeTree engine, and I
don't understand why we need to do the same in ReplicatedMergeTree.
For ReplicatedMergeTree the source of truth is stored in ZK; deleting
things from the filesystem just introduces inconsistencies, and this is
the main source of errors like "No active replica has part X or covering
part".
The resulting problem is fixed by
https://github.com/ClickHouse/ClickHouse/pull/25820, but in my opinion
we would be better off avoiding the ZK/FS inconsistency in the first
place.
When does this inconsistency appear? Often the sequence is like this:
0. Write 2 parts to ZK [all_0_0_0, all_1_1_0]
1. A merge gets scheduled
2. The new part replaces the old parts [new: all_0_1_1, old: all_0_0_0, all_1_1_0]
3. The replica gets shut down and the old parts are removed from the filesystem
4. The replica comes back online; metadata about all the parts is still stored in ZK for this replica
5. The other replica, after the cleanup thread runs, will have only [all_0_1_1] in ZK
6. The user triggers a DROP_RANGE after a while (the drop range is for all_0_1_9999*)
7. Each replica deletes from ZK only [all_0_1_1]. The replica that got
restarted uses its in-memory state to choose nodes to delete from ZK.
8. Restart the replica again. It will now think that there are 2 parts
that it lost and needs to fetch them [all_0_0_0, all_1_1_0].
`clearOldPartsAndRemoveFromZK`, which is triggered from the cleanup
thread, runs the cleanup sequence correctly: it first removes things from
ZK and then from the filesystem. I don't see much benefit in triggering
it on shutdown and would rather have it called from a single place only.
---
This is a very, very edge-case situation, but it proves that the current
"fix" (https://github.com/ClickHouse/ClickHouse/pull/25820) isn't
complete.
```
create table test(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'one')
order by v
settings old_parts_lifetime = 30;
create table test2(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'two')
order by v
settings old_parts_lifetime = 30;
create table test3(
v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'three')
order by v
settings old_parts_lifetime = 30;
insert into table test values (1), (2), (3);
insert into table test values (4);
optimize table test final;
detach table test;
detach table test2;
alter table test3 drop partition tuple();
attach table test;
attach table test2;
```
```
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/one/parts
all_0_0_0
all_1_1_0
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/two/parts
all_0_0_0
all_1_1_0
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/three/parts
```
```
detach table test;
attach table test;
```
`test` will now figure out that the parts exist only in ZK and will issue `GET_PART`
after first removing the parts from ZK.
`test2` will receive fetch requests for unknown parts and will trigger part checks itself.
Because `test` no longer has the parts in ZK, `test2` will mark them as LostForever.
It will also not insert empty parts, because the partition is empty.
`test` is left with `GET_PART` entries in its queue and gets stuck.
```
SELECT
table,
type,
replica_name,
new_part_name,
last_exception
FROM system.replication_queue
Query id: 74c5aa00-048d-4bc1-a2ea-6f69501c11a0
Row 1:
──────
table: test
type: GET_PART
replica_name: one
new_part_name: all_0_0_0
last_exception: Code: 234. DB::Exception: No active replica has part all_0_0_0 or covering part. (NO_REPLICA_HAS_PART) (version 21.9.1.1)
Row 2:
──────
table: test
type: GET_PART
replica_name: one
new_part_name: all_1_1_0
last_exception: Code: 234. DB::Exception: No active replica has part all_1_1_0 or covering part. (NO_REPLICA_HAS_PART) (version 21.9.1.1)
```
UBsan reports:
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../src/Storages/Distributed/DirectoryMonitor.cpp:435:53 in
../src/Storages/Distributed/DirectoryMonitor.cpp:435: runtime error: 1.15292e+19 is outside the range of representable values of type 'long'
0 0x1df0c286 in DB::StorageDistributedDirectoryMonitor::run() obj-x86_64-linux-gnu/../src/Storages/Distributed/DirectoryMonitor.cpp:435:53
It is pretty easy to reproduce by limiting max_server_memory_usage
before starting the test.
Before, we incorrectly checked that the columns from GROUP BY were a
subset of the columns from the sharding key, which is not right. Consider
the following example:
select k1, any(k2), sum(v) from remote('127.{1,2}', view(select 1 k1, 2 k2, 3 v), cityHash64(k1, k2)) group by k1
Here the columns from GROUP BY are a subset of the columns from the
sharding key, but the optimization cannot be applied, since there is no
guarantee that a particular shard contains distinct values of k1.
So instead we should check that GROUP BY contains all the columns that
are required for calculating the sharding key expression, i.e.:
select k1, k2, sum(v) from remote('127.{1,2}', view(select 1 k1, 2 k2, 3 v), cityHash64(k1, k2)) group by k1, k2