This patchset [1] breaks the on-disk format (see the relevant change in the linked discussion):
[1]: https://github.com/ClickHouse/ClickHouse/pull/12455#discussion_r682830812
Too bad that I checked this patchset for compatibility only after
reverting the patch [2] (use case: I applied it manually, then reverted
it, and data skipping indexes over a Nullable column were broken).
[2]: https://github.com/ClickHouse/ClickHouse/pull/12455#issuecomment-823423772
But this patchset actually breaks compatibility with older versions of
ClickHouse for data skipping indexes over Nullable columns after a
simple upgrade:
Here is a simple reproducer:
--
-- run this with 21.6 or similar (i.e. w/o this patch)
--
CREATE TABLE data
(
    `key` Int,
    `value` Nullable(Int),
    INDEX value_index value TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY key;
INSERT INTO data SELECT
number,
number
FROM numbers(10000);
SELECT * FROM data WHERE value = 20000 SETTINGS force_data_skipping_indices = 'value_index', max_rows_to_read = 1;
Now upgrade and run the query again:
SELECT * FROM data WHERE value = 20000 SETTINGS force_data_skipping_indices = 'value_index', max_rows_to_read = 1;
And it will fail because the on-disk format has changed:
$ ll --time-style=+ data/*/data/all_1_1_0/skp*.idx
-rw-r----- 1 azat azat 36 data/with_nullable_patch/data/all_1_1_0/skp_idx_value_index.idx
-rw-r----- 1 azat azat 37 data/without_nullable_patch/data/all_1_1_0/skp_idx_value_index.idx
$ md5sum data/*/data/all_1_1_0/skp*.idx
a19c95c4a14506c65665a1e30ab404bf data/with_nullable_patch/data/all_1_1_0/skp_idx_value_index.idx
e50e2fcfa873b232196623d56ab26105 data/without_nullable_patch/data/all_1_1_0/skp_idx_value_index.idx
Note that there is no stable release with this patch included yet, so
there is no need to backport it.
Also note that data skipping indexes over a Nullable column could be
created even before [3].
[3]: https://github.com/ClickHouse/ClickHouse/pull/12455
v2: break the cases when granules have NULL in values, for the sake of
backward compatibility
If the outer Distributed table has multiple shards while the underlying
Distributed table has only one, the query processing stage could be
advanced from Complete to WithMergeableStateAfterAggregation, which is
obviously wrong.
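A minimal sketch of such a layout (cluster and table names here are
hypothetical, purely to illustrate the shape of the problem):
CREATE TABLE local (key Int) ENGINE = MergeTree ORDER BY key;
-- inner Distributed over a single-shard cluster
CREATE TABLE dist_inner AS local
    ENGINE = Distributed(cluster_one_shard, currentDatabase(), local, key);
-- outer Distributed over a multi-shard cluster, pointing at dist_inner
CREATE TABLE dist_outer AS dist_inner
    ENGINE = Distributed(cluster_two_shards, currentDatabase(), dist_inner, key);
-- before the fix, processing of this query could be advanced past the
-- correct stage on the inner layer
SELECT count() FROM dist_outer;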
Before this patch, the following query ran the LimitBy/Distinct step on
the initiator:
select distinct sharding_key from dist order by k
even though this step can be omitted there (see the sketch below).
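For reference, a hypothetical layout for which this applies: dist is
sharded by sharding_key, so every distinct value of sharding_key lives
on exactly one shard and DISTINCT is already complete on the shards.
CREATE TABLE local (k Int, sharding_key Int) ENGINE = MergeTree ORDER BY k;
-- dist routes rows by sharding_key
CREATE TABLE dist AS local
    ENGINE = Distributed(some_cluster, currentDatabase(), local, sharding_key);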
Before this patch it was not possible to optimize a simple SELECT * FROM
dist ORDER BY (without GROUP BY or DISTINCT) to the more optimal stage
(QueryProcessingStage::WithMergeableStateAfterAggregationAndLimit),
since that code path was guarded by
allow_nondeterministic_optimize_skip_unused_shards; rework it and make
this possible.
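For example, assuming a dist table like the sketch above and
distributed_push_down_limit enabled, a query of this shape can now be
processed up to WithMergeableStateAfterAggregationAndLimit:
SET distributed_push_down_limit = 1;
-- each shard applies ORDER BY ... LIMIT locally; the initiator only
-- merge-sorts the already-limited streams
SELECT * FROM dist ORDER BY k LIMIT 10;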
Also, distributed_push_down_limit is now respected by
optimize_distributed_group_by_sharding_key.
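A sketch of how the two settings now interact (the query shape is
illustrative):
SET optimize_distributed_group_by_sharding_key = 1,
    distributed_push_down_limit = 1;
-- GROUP BY over the sharding key is complete on each shard, and with
-- the push-down the LIMIT is applied there as well
SELECT sharding_key, count() FROM dist GROUP BY sharding_key LIMIT 10;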
The next step will be to enable distributed_push_down_limit by default.
v2: fix detection of aggregates