Merge branch 'master' into in_memory_raft

2024-09-20 08:40:50 +00:00 · 2021-02-08 10:30:31 +03:00 · 2021-02-08 10:30:31 +03:00 · d5ee7d33e9
commit d5ee7d33e9
parent ab98040003 be27578294
151 changed files with 4084 additions and 587 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,3 +1,148 @@
+## ClickHouse release 21.2
+
+### ClickHouse release v21.2.2.8-stable, 2021-02-07
+
+#### Backward Incompatible Change
+
+* Bitwise functions (`bitAnd`, `bitOr`, etc) are forbidden for floating point arguments. Now you have to do explicit cast to integer. [#19853](https://github.com/ClickHouse/ClickHouse/pull/19853) ([Azat Khuzhin](https://github.com/azat)).
+* Forbid `lcm`/`gcd` for floats. [#19532](https://github.com/ClickHouse/ClickHouse/pull/19532) ([Azat Khuzhin](https://github.com/azat)).
+* Fix memory tracking for `OPTIMIZE TABLE`/merges; account query memory limits and sampling for `OPTIMIZE TABLE`/merges. [#18772](https://github.com/ClickHouse/ClickHouse/pull/18772) ([Azat Khuzhin](https://github.com/azat)).
+* Disallow floating point column as partition key, see [#18421](https://github.com/ClickHouse/ClickHouse/issues/18421#event-4147046255). [#18464](https://github.com/ClickHouse/ClickHouse/pull/18464) ([hexiaoting](https://github.com/hexiaoting)).
+* Excessive parenthesis in type definitions no longer supported, example: `Array((UInt8))`.
+
+#### New Feature
+
+* Added `PostgreSQL` table engine (both select/insert, with support for multidimensional arrays), also as table function. Added `PostgreSQL` dictionary source. Added `PostgreSQL` database engine. [#18554](https://github.com/ClickHouse/ClickHouse/pull/18554) ([Kseniia Sumarokova](https://github.com/kssenii)).
+* Data type `Nested` now supports arbitrary levels of nesting. Introduced subcolumns of complex types, such as `size0` in `Array`, `null` in `Nullable`, names of `Tuple` elements, which can be read without reading of whole column. [#17310](https://github.com/ClickHouse/ClickHouse/pull/17310) ([Anton Popov](https://github.com/CurtizJ)).
+* Added `Nullable` support for `FlatDictionary`, `HashedDictionary`, `ComplexKeyHashedDictionary`, `DirectDictionary`, `ComplexKeyDirectDictionary`, `RangeHashedDictionary`. [#18236](https://github.com/ClickHouse/ClickHouse/pull/18236) ([Maksim Kita](https://github.com/kitaisreal)).
+* Adds a new table called `system.distributed_ddl_queue` that displays the queries in the DDL worker queue. [#17656](https://github.com/ClickHouse/ClickHouse/pull/17656) ([Bharat Nallan](https://github.com/bharatnc)).
+* Added support of mapping LDAP group names, and attribute values in general, to local roles for users from ldap user directories. [#17211](https://github.com/ClickHouse/ClickHouse/pull/17211) ([Denis Glazachev](https://github.com/traceon)).
+* Support insert into table function `cluster`, and for both table functions `remote` and `cluster`, support distributing data across nodes by specify sharding key. Close [#16752](https://github.com/ClickHouse/ClickHouse/issues/16752). [#18264](https://github.com/ClickHouse/ClickHouse/pull/18264) ([flynn](https://github.com/ucasFL)).
+* Add function `decodeXMLComponent` to decode characters for XML. Example: `SELECT decodeXMLComponent('Hello,&quot;world&quot;!')` [#17659](https://github.com/ClickHouse/ClickHouse/issues/17659). [#18542](https://github.com/ClickHouse/ClickHouse/pull/18542) ([nauta](https://github.com/nautaa)).
+* Added functions `parseDateTimeBestEffortUSOrZero`, `parseDateTimeBestEffortUSOrNull`. [#19712](https://github.com/ClickHouse/ClickHouse/pull/19712) ([Maksim Kita](https://github.com/kitaisreal)).
+* Add `sign` math function. [#19527](https://github.com/ClickHouse/ClickHouse/pull/19527) ([flynn](https://github.com/ucasFL)).
+* Add information about used features (functions, table engines, etc) into system.query_log. [#18495](https://github.com/ClickHouse/ClickHouse/issues/18495). [#19371](https://github.com/ClickHouse/ClickHouse/pull/19371) ([Kseniia Sumarokova](https://github.com/kssenii)).
+* Function `formatDateTime` support the `%Q` modification to format date to quarter. [#19224](https://github.com/ClickHouse/ClickHouse/pull/19224) ([Jianmei Zhang](https://github.com/zhangjmruc)).
+* Support MetaKey+Enter hotkey binding in play UI. [#19012](https://github.com/ClickHouse/ClickHouse/pull/19012) ([sundyli](https://github.com/sundy-li)).
+* Add three functions for map data type: 1. `mapContains(map, key)` to check weather map.keys include the second parameter key. 2. `mapKeys(map)` return all the keys in Array format 3. `mapValues(map)` return all the values in Array format. [#18788](https://github.com/ClickHouse/ClickHouse/pull/18788) ([hexiaoting](https://github.com/hexiaoting)).
+* Add `log_comment` setting related to [#18494](https://github.com/ClickHouse/ClickHouse/issues/18494). [#18549](https://github.com/ClickHouse/ClickHouse/pull/18549) ([Zijie Lu](https://github.com/TszKitLo40)).
+* Add support of tuple argument to `argMin` and `argMax` functions. [#17359](https://github.com/ClickHouse/ClickHouse/pull/17359) ([Ildus Kurbangaliev](https://github.com/ildus)).
+* Support `EXISTS VIEW` syntax. [#18552](https://github.com/ClickHouse/ClickHouse/pull/18552) ([Du Chuan](https://github.com/spongedu)).
+* Add `SELECT ALL` syntax. closes [#18706](https://github.com/ClickHouse/ClickHouse/issues/18706). [#18723](https://github.com/ClickHouse/ClickHouse/pull/18723) ([flynn](https://github.com/ucasFL)).
+
+#### Performance Improvement
+
+* Faster parts removal by lowering the number of `stat` syscalls. This returns the optimization that existed while ago. More safe interface of `IDisk`. This closes [#19065](https://github.com/ClickHouse/ClickHouse/issues/19065). [#19086](https://github.com/ClickHouse/ClickHouse/pull/19086) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Aliases declared in `WITH` statement are properly used in index analysis. Queries like `WITH column AS alias SELECT ... WHERE alias = ...` may use index now. [#18896](https://github.com/ClickHouse/ClickHouse/pull/18896) ([Amos Bird](https://github.com/amosbird)).
+* Add `optimize_alias_column_prediction` (on by default), that will: - Respect aliased columns in WHERE during partition pruning and skipping data using secondary indexes; - Respect aliased columns in WHERE for trivial count queries for optimize_trivial_count; - Respect aliased columns in GROUP BY/ORDER BY for optimize_aggregation_in_order/optimize_read_in_order. [#16995](https://github.com/ClickHouse/ClickHouse/pull/16995) ([sundyli](https://github.com/sundy-li)).
+* Speed up aggregate function `sum`. Improvement only visible on synthetic benchmarks and not very practical. [#19216](https://github.com/ClickHouse/ClickHouse/pull/19216) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Update libc++ and use another ABI to provide better performance. [#18914](https://github.com/ClickHouse/ClickHouse/pull/18914) ([Danila Kutenin](https://github.com/danlark1)).
+* Rewrite `sumIf()` and `sum(if())` function to `countIf()` function when logically equivalent. [#17041](https://github.com/ClickHouse/ClickHouse/pull/17041) ([flynn](https://github.com/ucasFL)).
+* Use a connection pool for S3 connections, controlled by the `s3_max_connections` settings. [#13405](https://github.com/ClickHouse/ClickHouse/pull/13405) ([Vladimir Chebotarev](https://github.com/excitoon)).
+* Add support for zstd long option for better compression of string columns to save space. [#17184](https://github.com/ClickHouse/ClickHouse/pull/17184) ([ygrek](https://github.com/ygrek)).
+* Slightly improve server latency by removing access to configuration on every connection. [#19863](https://github.com/ClickHouse/ClickHouse/pull/19863) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Reduce lock contention for multiple layers of the `Buffer` engine. [#19379](https://github.com/ClickHouse/ClickHouse/pull/19379) ([Azat Khuzhin](https://github.com/azat)).
+* Support splitting `Filter` step of query plan into `Expression + Filter` pair. Together with `Expression + Expression` merging optimization ([#17458](https://github.com/ClickHouse/ClickHouse/issues/17458)) it may delay execution for some expressions after `Filter` step. [#19253](https://github.com/ClickHouse/ClickHouse/pull/19253) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+
+#### Improvement
+
+* `SELECT count() FROM table` now can be executed if only one any column can be selected from the `table`. This PR fixes [#10639](https://github.com/ClickHouse/ClickHouse/issues/10639). [#18233](https://github.com/ClickHouse/ClickHouse/pull/18233) ([Vitaly Baranov](https://github.com/vitlibar)).
+* Set charset to `utf8mb4` when interacting with remote MySQL servers. Fixes [#19795](https://github.com/ClickHouse/ClickHouse/issues/19795). [#19800](https://github.com/ClickHouse/ClickHouse/pull/19800) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* `S3` table function now supports `auto` compression mode (autodetect). This closes [#18754](https://github.com/ClickHouse/ClickHouse/issues/18754). [#19793](https://github.com/ClickHouse/ClickHouse/pull/19793) ([Vladimir Chebotarev](https://github.com/excitoon)).
+* Correctly output infinite arguments for `formatReadableTimeDelta` function. In previous versions, there was implicit conversion to implementation specific integer value. [#19791](https://github.com/ClickHouse/ClickHouse/pull/19791) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Table function `S3` will use global region if the region can't be determined exactly. This closes [#10998](https://github.com/ClickHouse/ClickHouse/issues/10998). [#19750](https://github.com/ClickHouse/ClickHouse/pull/19750) ([Vladimir Chebotarev](https://github.com/excitoon)).
+* In distributed queries if the setting `async_socket_for_remote` is enabled, it was possible to get stack overflow at least in debug build configuration if very deeply nested data type is used in table (e.g. `Array(Array(Array(...more...)))`). This fixes [#19108](https://github.com/ClickHouse/ClickHouse/issues/19108). This change introduces minor backward incompatibility: excessive parenthesis in type definitions no longer supported, example: `Array((UInt8))`. [#19736](https://github.com/ClickHouse/ClickHouse/pull/19736) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Add separate pool for message brokers (RabbitMQ and Kafka). [#19722](https://github.com/ClickHouse/ClickHouse/pull/19722) ([Azat Khuzhin](https://github.com/azat)).
+* Fix rare `max_number_of_merges_with_ttl_in_pool` limit overrun (more merges with TTL can be assigned) for non-replicated MergeTree. [#19708](https://github.com/ClickHouse/ClickHouse/pull/19708) ([alesapin](https://github.com/alesapin)).
+* Dictionary: better error message during attribute parsing. [#19678](https://github.com/ClickHouse/ClickHouse/pull/19678) ([Maksim Kita](https://github.com/kitaisreal)).
+* Add an option to disable validation of checksums on reading. Should never be used in production. Please do not expect any benefits in disabling it. It may only be used for experiments and benchmarks. The setting only applicable for tables of MergeTree family. Checksums are always validated for other table engines and when receiving data over network. In my observations there is no performance difference or it is less than 0.5%. [#19588](https://github.com/ClickHouse/ClickHouse/pull/19588) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Support constant result in function `multiIf`. [#19533](https://github.com/ClickHouse/ClickHouse/pull/19533) ([Maksim Kita](https://github.com/kitaisreal)).
+* Enable function length/empty/notEmpty for datatype Map, which returns keys number in Map. [#19530](https://github.com/ClickHouse/ClickHouse/pull/19530) ([taiyang-li](https://github.com/taiyang-li)).
+* Add `--reconnect` option to `clickhouse-benchmark`. When this option is specified, it will reconnect before every request. This is needed for testing. [#19872](https://github.com/ClickHouse/ClickHouse/pull/19872) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Support using the new location of `.debug` file. This fixes [#19348](https://github.com/ClickHouse/ClickHouse/issues/19348). [#19520](https://github.com/ClickHouse/ClickHouse/pull/19520) ([Amos Bird](https://github.com/amosbird)).
+* `toIPv6` function parses `IPv4` addresses. [#19518](https://github.com/ClickHouse/ClickHouse/pull/19518) ([Bharat Nallan](https://github.com/bharatnc)).
+* Add `http_referer` field to `system.query_log`, `system.processes`, etc. This closes [#19389](https://github.com/ClickHouse/ClickHouse/issues/19389). [#19390](https://github.com/ClickHouse/ClickHouse/pull/19390) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Improve MySQL compatibility by making more functions case insensitive and adding aliases. [#19387](https://github.com/ClickHouse/ClickHouse/pull/19387) ([Daniil Kondratyev](https://github.com/dankondr)).
+* Add metrics for MergeTree parts (Wide/Compact/InMemory) types. [#19381](https://github.com/ClickHouse/ClickHouse/pull/19381) ([Azat Khuzhin](https://github.com/azat)).
+* Allow docker to be executed with arbitrary uid. [#19374](https://github.com/ClickHouse/ClickHouse/pull/19374) ([filimonov](https://github.com/filimonov)).
+* Fix wrong alignment of values of `IPv4` data type in Pretty formats. They were aligned to the right, not to the left. This closes [#19184](https://github.com/ClickHouse/ClickHouse/issues/19184). [#19339](https://github.com/ClickHouse/ClickHouse/pull/19339) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Allow change `max_server_memory_usage` without restart. This closes [#18154](https://github.com/ClickHouse/ClickHouse/issues/18154). [#19186](https://github.com/ClickHouse/ClickHouse/pull/19186) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* The exception when function `bar` is called with certain NaN argument may be slightly misleading in previous versions. This fixes [#19088](https://github.com/ClickHouse/ClickHouse/issues/19088). [#19107](https://github.com/ClickHouse/ClickHouse/pull/19107) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Explicitly set uid / gid of clickhouse user & group to the fixed values (101) in clickhouse-server images. [#19096](https://github.com/ClickHouse/ClickHouse/pull/19096) ([filimonov](https://github.com/filimonov)).
+* Fixed `PeekableReadBuffer: Memory limit exceed` error when inserting data with huge strings. Fixes [#18690](https://github.com/ClickHouse/ClickHouse/issues/18690). [#18979](https://github.com/ClickHouse/ClickHouse/pull/18979) ([tavplubix](https://github.com/tavplubix)).
+* Docker image: several improvements for clickhouse-server entrypoint. [#18954](https://github.com/ClickHouse/ClickHouse/pull/18954) ([filimonov](https://github.com/filimonov)).
+* Add `normalizeQueryKeepNames` and `normalizedQueryHashKeepNames` to normalize queries without masking long names with `?`. This helps better analyze complex query logs. [#18910](https://github.com/ClickHouse/ClickHouse/pull/18910) ([Amos Bird](https://github.com/amosbird)).
+* Check per-block checksum of the distributed batch on the sender before sending (without reading the file twice, the checksums will be verified while reading), this will avoid stuck of the INSERT on the receiver (on truncated .bin file on the sender). Avoid reading .bin files twice for batched INSERT (it was required to calculate rows/bytes to take squashing into account, now this information included into the header, backward compatible is preserved). [#18853](https://github.com/ClickHouse/ClickHouse/pull/18853) ([Azat Khuzhin](https://github.com/azat)).
+* Fix issues with RIGHT and FULL JOIN of tables with aggregate function states. In previous versions exception about `cloneResized` method was thrown. [#18818](https://github.com/ClickHouse/ClickHouse/pull/18818) ([templarzq](https://github.com/templarzq)).
+* Added prefix-based S3 endpoint settings. [#18812](https://github.com/ClickHouse/ClickHouse/pull/18812) ([Vladimir Chebotarev](https://github.com/excitoon)).
+* Add [UInt8, UInt16, UInt32, UInt64] arguments types support for bitmapTransform, bitmapSubsetInRange, bitmapSubsetLimit, bitmapContains functions. This closes [#18713](https://github.com/ClickHouse/ClickHouse/issues/18713). [#18791](https://github.com/ClickHouse/ClickHouse/pull/18791) ([sundyli](https://github.com/sundy-li)).
+* Allow CTE (Common Table Expressions) to be further aliased. Propagate CSE (Common Subexpressions Elimination) to subqueries in the same level when `enable_global_with_statement = 1`. This fixes [#17378](https://github.com/ClickHouse/ClickHouse/issues/17378) . This fixes https://github.com/ClickHouse/ClickHouse/pull/16575#issuecomment-753416235 . [#18684](https://github.com/ClickHouse/ClickHouse/pull/18684) ([Amos Bird](https://github.com/amosbird)).
+* Update librdkafka to v1.6.0-RC2. Fixes [#18668](https://github.com/ClickHouse/ClickHouse/issues/18668). [#18671](https://github.com/ClickHouse/ClickHouse/pull/18671) ([filimonov](https://github.com/filimonov)).
+* In case of unexpected exceptions automatically restart background thread which is responsible for execution of distributed DDL queries. Fixes [#17991](https://github.com/ClickHouse/ClickHouse/issues/17991). [#18285](https://github.com/ClickHouse/ClickHouse/pull/18285) ([徐炘](https://github.com/weeds085490)).
+* Updated AWS C++ SDK in order to utilize global regions in S3. [#17870](https://github.com/ClickHouse/ClickHouse/pull/17870) ([Vladimir Chebotarev](https://github.com/excitoon)).
+* Added support for `WITH ... [AND] [PERIODIC] REFRESH [interval_in_sec]` clause when creating `LIVE VIEW` tables. [#14822](https://github.com/ClickHouse/ClickHouse/pull/14822) ([vzakaznikov](https://github.com/vzakaznikov)).
+* Restrict `MODIFY TTL` queries for `MergeTree` tables created in old syntax. Previously the query succeeded, but actually it had no effect. [#19064](https://github.com/ClickHouse/ClickHouse/pull/19064) ([Anton Popov](https://github.com/CurtizJ)).
+
+#### Bug Fix
+
+* Fix index analysis of binary functions with constant argument which leads to wrong query results. This fixes [#18364](https://github.com/ClickHouse/ClickHouse/issues/18364). [#18373](https://github.com/ClickHouse/ClickHouse/pull/18373) ([Amos Bird](https://github.com/amosbird)).
+* Fix starting the server with tables having default expressions containing dictGet(). Allow getting return type of dictGet() without loading dictionary. [#19805](https://github.com/ClickHouse/ClickHouse/pull/19805) ([Vitaly Baranov](https://github.com/vitlibar)).
+* Fix server crash after query with `if` function with `Tuple` type of then/else branches result. `Tuple` type must contain `Array` or another complex type. Fixes [#18356](https://github.com/ClickHouse/ClickHouse/issues/18356). [#20133](https://github.com/ClickHouse/ClickHouse/pull/20133) ([alesapin](https://github.com/alesapin)).
+* `MaterializeMySQL` (experimental feature): Fix replication for statements that update several tables. [#20066](https://github.com/ClickHouse/ClickHouse/pull/20066) ([Håvard Kvålen](https://github.com/havardk)).
+* Prevent "Connection refused" in docker during initialization script execution. [#20012](https://github.com/ClickHouse/ClickHouse/pull/20012) ([filimonov](https://github.com/filimonov)).
+* `EmbeddedRocksDB` is an experimental storage. Fix the issue with lack of proper type checking. Simplified code. This closes [#19967](https://github.com/ClickHouse/ClickHouse/issues/19967). [#19972](https://github.com/ClickHouse/ClickHouse/pull/19972) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Fix a segfault in function `fromModifiedJulianDay` when the argument type is `Nullable(T)` for any integral types other than Int32. [#19959](https://github.com/ClickHouse/ClickHouse/pull/19959) ([PHO](https://github.com/depressed-pho)).
+* The function `greatCircleAngle` returned inaccurate results in previous versions. This closes [#19769](https://github.com/ClickHouse/ClickHouse/issues/19769). [#19789](https://github.com/ClickHouse/ClickHouse/pull/19789) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Fix rare bug when some replicated operations (like mutation) cannot process some parts after data corruption. Fixes [#19593](https://github.com/ClickHouse/ClickHouse/issues/19593). [#19702](https://github.com/ClickHouse/ClickHouse/pull/19702) ([alesapin](https://github.com/alesapin)).
+* Background thread which executes `ON CLUSTER` queries might hang waiting for dropped replicated table to do something. It's fixed. [#19684](https://github.com/ClickHouse/ClickHouse/pull/19684) ([yiguolei](https://github.com/yiguolei)).
+* Fix wrong deserialization of columns description. It makes INSERT into a table with a column named `\` impossible. [#19479](https://github.com/ClickHouse/ClickHouse/pull/19479) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Mark distributed batch as broken in case of empty data block in one of files. [#19449](https://github.com/ClickHouse/ClickHouse/pull/19449) ([Azat Khuzhin](https://github.com/azat)).
+* Fixed very rare bug that might cause mutation to hang after `DROP/DETACH/REPLACE/MOVE PARTITION`. It was partially fixed by [#15537](https://github.com/ClickHouse/ClickHouse/issues/15537) for the most cases. [#19443](https://github.com/ClickHouse/ClickHouse/pull/19443) ([tavplubix](https://github.com/tavplubix)).
+* Fix possible error `Extremes transform was already added to pipeline`. Fixes [#14100](https://github.com/ClickHouse/ClickHouse/issues/14100). [#19430](https://github.com/ClickHouse/ClickHouse/pull/19430) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Fix default value in join types with non-zero default (e.g. some Enums). Closes [#18197](https://github.com/ClickHouse/ClickHouse/issues/18197). [#19360](https://github.com/ClickHouse/ClickHouse/pull/19360) ([vdimir](https://github.com/vdimir)).
+* Do not mark file for distributed send as broken on EOF. [#19290](https://github.com/ClickHouse/ClickHouse/pull/19290) ([Azat Khuzhin](https://github.com/azat)).
+* Fix leaking of pipe fd for `async_socket_for_remote`. [#19153](https://github.com/ClickHouse/ClickHouse/pull/19153) ([Azat Khuzhin](https://github.com/azat)).
+* Fix infinite reading from file in `ORC` format (was introduced in [#10580](https://github.com/ClickHouse/ClickHouse/issues/10580)). Fixes [#19095](https://github.com/ClickHouse/ClickHouse/issues/19095). [#19134](https://github.com/ClickHouse/ClickHouse/pull/19134) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Fix issue in merge tree data writer which can lead to marks with bigger size than fixed granularity size. Fixes [#18913](https://github.com/ClickHouse/ClickHouse/issues/18913). [#19123](https://github.com/ClickHouse/ClickHouse/pull/19123) ([alesapin](https://github.com/alesapin)).
+* Fix startup bug when clickhouse was not able to read compression codec from `LowCardinality(Nullable(...))` and throws exception `Attempt to read after EOF`. Fixes [#18340](https://github.com/ClickHouse/ClickHouse/issues/18340). [#19101](https://github.com/ClickHouse/ClickHouse/pull/19101) ([alesapin](https://github.com/alesapin)).
+* Simplify the implementation of `tupleHammingDistance`. Support for tuples of any equal length. Fixes [#19029](https://github.com/ClickHouse/ClickHouse/issues/19029). [#19084](https://github.com/ClickHouse/ClickHouse/pull/19084) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Make sure `groupUniqArray` returns correct type for argument of Enum type. This closes [#17875](https://github.com/ClickHouse/ClickHouse/issues/17875). [#19019](https://github.com/ClickHouse/ClickHouse/pull/19019) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Fix possible error `Expected single dictionary argument for function` if use function `ignore` with `LowCardinality` argument. Fixes [#14275](https://github.com/ClickHouse/ClickHouse/issues/14275). [#19016](https://github.com/ClickHouse/ClickHouse/pull/19016) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Fix inserting of `LowCardinality` column to table with `TinyLog` engine. Fixes [#18629](https://github.com/ClickHouse/ClickHouse/issues/18629). [#19010](https://github.com/ClickHouse/ClickHouse/pull/19010) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Fix minor issue in JOIN: Join tries to materialize const columns, but our code waits for them in other places. [#18982](https://github.com/ClickHouse/ClickHouse/pull/18982) ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)).
+* Disable `optimize_move_functions_out_of_any` because optimization is not always correct. This closes [#18051](https://github.com/ClickHouse/ClickHouse/issues/18051). This closes [#18973](https://github.com/ClickHouse/ClickHouse/issues/18973). [#18981](https://github.com/ClickHouse/ClickHouse/pull/18981) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Fix possible exception `QueryPipeline stream: different number of columns` caused by merging of query plan's `Expression` steps. Fixes [#18190](https://github.com/ClickHouse/ClickHouse/issues/18190). [#18980](https://github.com/ClickHouse/ClickHouse/pull/18980) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Fixed very rare deadlock at shutdown. [#18977](https://github.com/ClickHouse/ClickHouse/pull/18977) ([tavplubix](https://github.com/tavplubix)).
+* Fixed rare crashes when server run out of memory. [#18976](https://github.com/ClickHouse/ClickHouse/pull/18976) ([tavplubix](https://github.com/tavplubix)).
+* Fix incorrect behavior when `ALTER TABLE ... DROP PART 'part_name'` query removes all deduplication blocks for the whole partition. Fixes [#18874](https://github.com/ClickHouse/ClickHouse/issues/18874). [#18969](https://github.com/ClickHouse/ClickHouse/pull/18969) ([alesapin](https://github.com/alesapin)).
+* Fixed issue [#18894](https://github.com/ClickHouse/ClickHouse/issues/18894) Add a check to avoid exception when long column alias('table.column' style, usually auto-generated by BI tools like Looker) equals to long table name. [#18968](https://github.com/ClickHouse/ClickHouse/pull/18968) ([Daniel Qin](https://github.com/mathfool)).
+* Fix error `Task was not found in task queue` (possible only for remote queries, with `async_socket_for_remote = 1`). [#18964](https://github.com/ClickHouse/ClickHouse/pull/18964) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Fix bug when mutation with some escaped text (like `ALTER ... UPDATE e = CAST('foo', 'Enum8(\'foo\' = 1')` serialized incorrectly. Fixes [#18878](https://github.com/ClickHouse/ClickHouse/issues/18878). [#18944](https://github.com/ClickHouse/ClickHouse/pull/18944) ([alesapin](https://github.com/alesapin)).
+* ATTACH PARTITION will reset mutations. [#18804](https://github.com/ClickHouse/ClickHouse/issues/18804). [#18935](https://github.com/ClickHouse/ClickHouse/pull/18935) ([fastio](https://github.com/fastio)).
+* Fix issue with `bitmapOrCardinality` that may lead to nullptr dereference. This closes [#18911](https://github.com/ClickHouse/ClickHouse/issues/18911). [#18912](https://github.com/ClickHouse/ClickHouse/pull/18912) ([sundyli](https://github.com/sundy-li)).
+* Fixed `Attempt to read after eof` error when trying to `CAST` `NULL` from `Nullable(String)` to `Nullable(Decimal(P, S))`. Now function `CAST` returns `NULL` when it cannot parse decimal from nullable string. Fixes [#7690](https://github.com/ClickHouse/ClickHouse/issues/7690). [#18718](https://github.com/ClickHouse/ClickHouse/pull/18718) ([Winter Zhang](https://github.com/zhang2014)).
+* Fix data type convert issue for MySQL engine. [#18124](https://github.com/ClickHouse/ClickHouse/pull/18124) ([bo zeng](https://github.com/mis98zb)).
+* Fix clickhouse-client abort exception while executing only `select`. [#19790](https://github.com/ClickHouse/ClickHouse/pull/19790) ([taiyang-li](https://github.com/taiyang-li)).
+
+
+#### Build/Testing/Packaging Improvement
+
+* Run [SQLancer](https://twitter.com/RiggerManuel/status/1352345625480884228) (logical SQL fuzzer) in CI. [#19006](https://github.com/ClickHouse/ClickHouse/pull/19006) ([Ilya Yatsishin](https://github.com/qoega)).
+* Query Fuzzer will fuzz newly added tests more extensively. This closes [#18916](https://github.com/ClickHouse/ClickHouse/issues/18916). [#19185](https://github.com/ClickHouse/ClickHouse/pull/19185) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Integrate with [Big List of Naughty Strings](https://github.com/minimaxir/big-list-of-naughty-strings/) for better fuzzing. [#19480](https://github.com/ClickHouse/ClickHouse/pull/19480) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Add integration tests run with MSan. [#18974](https://github.com/ClickHouse/ClickHouse/pull/18974) ([alesapin](https://github.com/alesapin)).
+* Fixed MemorySanitizer errors in cyrus-sasl and musl. [#19821](https://github.com/ClickHouse/ClickHouse/pull/19821) ([Ilya Yatsishin](https://github.com/qoega)).
+* Insuffiient arguments check in `positionCaseInsensitiveUTF8` function triggered address sanitizer. [#19720](https://github.com/ClickHouse/ClickHouse/pull/19720) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Remove --project-directory for docker-compose in integration test. Fix logs formatting from docker container. [#19706](https://github.com/ClickHouse/ClickHouse/pull/19706) ([Ilya Yatsishin](https://github.com/qoega)).
+* Made generation of macros.xml easier for integration tests. No more excessive logging from dicttoxml. dicttoxml project is not active for 5+ years. [#19697](https://github.com/ClickHouse/ClickHouse/pull/19697) ([Ilya Yatsishin](https://github.com/qoega)).
+* Allow to explicitly enable or disable watchdog via environment variable `CLICKHOUSE_WATCHDOG_ENABLE`. By default it is enabled if server is not attached to terminal. [#19522](https://github.com/ClickHouse/ClickHouse/pull/19522) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Allow building ClickHouse with Kafka support on arm64. [#19369](https://github.com/ClickHouse/ClickHouse/pull/19369) ([filimonov](https://github.com/filimonov)).
+* Allow building librdkafka without ssl. [#19337](https://github.com/ClickHouse/ClickHouse/pull/19337) ([filimonov](https://github.com/filimonov)).
+* Restore Kafka input in FreeBSD builds. [#18924](https://github.com/ClickHouse/ClickHouse/pull/18924) ([Alexandre Snarskii](https://github.com/snar)).
+* Fix potential nullptr dereference in table function `VALUES`. [#19357](https://github.com/ClickHouse/ClickHouse/pull/19357) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+* Avoid UBSan reports in `arrayElement` function, `substring` and `arraySum`. Fixes [#19305](https://github.com/ClickHouse/ClickHouse/issues/19305). Fixes [#19287](https://github.com/ClickHouse/ClickHouse/issues/19287). This closes [#19336](https://github.com/ClickHouse/ClickHouse/issues/19336). [#19347](https://github.com/ClickHouse/ClickHouse/pull/19347) ([alexey-milovidov](https://github.com/alexey-milovidov)).
+
+
 ## ClickHouse release 21.1

 ### ClickHouse release v21.1.3.32-stable, 2021-02-03
--- a/README.md
+++ b/README.md
@ -8,7 +8,7 @@ ClickHouse® is an open-source column-oriented database management system that a
 * [Tutorial](https://clickhouse.tech/docs/en/getting_started/tutorial/) shows how to set up and query small ClickHouse cluster.
 * [Documentation](https://clickhouse.tech/docs/en/) provides more in-depth information.
 * [YouTube channel](https://www.youtube.com/c/ClickHouseDB) has a lot of content about ClickHouse in video format.
-* [Slack](https://join.slack.com/t/clickhousedb/shared_invite/zt-d2zxkf9e-XyxDa_ucfPxzuH4SJIm~Ng) and [Telegram](https://telegram.me/clickhouse_en) allow to chat with ClickHouse users in real-time.
+* [Slack](https://join.slack.com/t/clickhousedb/shared_invite/zt-ly9m4w1x-6j7x5Ts_pQZqrctAbRZ3cg) and [Telegram](https://telegram.me/clickhouse_en) allow to chat with ClickHouse users in real-time.
 * [Blog](https://clickhouse.yandex/blog/en/) contains various ClickHouse-related articles, as well as announcements and reports about events.
 * [Code Browser](https://clickhouse.tech/codebrowser/html_report/ClickHouse/index.html) with syntax highlight and navigation.
 * [Yandex.Messenger channel](https://yandex.ru/chat/#/join/20e380d9-c7be-4123-ab06-e95fb946975e) shares announcements and useful links in Russian.
--- a/contrib/base64-cmake/CMakeLists.txt
+++ b/contrib/base64-cmake/CMakeLists.txt
@ -11,7 +11,7 @@ endif ()
 target_compile_options(base64_scalar PRIVATE -falign-loops)

 if (ARCH_AMD64)
-    target_compile_options(base64_ssse3 PRIVATE -mssse3 -falign-loops)
+    target_compile_options(base64_ssse3 PRIVATE -mno-avx -mno-avx2 -mssse3 -falign-loops)
    target_compile_options(base64_avx PRIVATE -falign-loops -mavx)
    target_compile_options(base64_avx2 PRIVATE -falign-loops -mavx2)
 else ()
--- a/contrib/hyperscan-cmake/CMakeLists.txt
+++ b/contrib/hyperscan-cmake/CMakeLists.txt
@ -252,6 +252,7 @@ if (NOT EXTERNAL_HYPERSCAN_LIBRARY_FOUND)
    target_compile_definitions (hyperscan PUBLIC USE_HYPERSCAN=1)
    target_compile_options (hyperscan
        PRIVATE -g0 # Library has too much debug information
+        -mno-avx -mno-avx2 # The library is using dynamic dispatch and is confused if AVX is enabled globally
        -march=corei7 -O2 -fno-strict-aliasing -fno-omit-frame-pointer -fvisibility=hidden # The options from original build system
        -fno-sanitize=undefined # Assume the library takes care of itself
    )
--- a/contrib/poco
+++ b/contrib/poco
@ -1 +1 @@
-Subproject commit e11f3c971570cf6a31006cd21cadf41a259c360a
+Subproject commit fbaaba4a02e29987b8c584747a496c79528f125f
--- a/docker/test/fasttest/run.sh
+++ b/docker/test/fasttest/run.sh
@ -321,6 +321,7 @@ function run_tests

         # In fasttest, ENABLE_LIBRARIES=0, so rocksdb engine is not enabled by default
        01504_rocksdb
+        01686_rocksdb

        # Look at DistributedFilesToInsert, so cannot run in parallel.
        01460_DistributedFilesToInsert
--- a/docker/test/integration/runner/Dockerfile
+++ b/docker/test/integration/runner/Dockerfile
@ -61,7 +61,7 @@ RUN python3 -m pip install \
    aerospike \
    avro \
    cassandra-driver \
-    confluent-kafka \
+    confluent-kafka==1.5.0 \
    dict2xml \
    dicttoxml \
    docker \
--- a/docs/en/introduction/adopters.md
+++ b/docs/en/introduction/adopters.md
@ -74,6 +74,7 @@ toc_title: Adopters
 | <a href="https://getnoc.com/" class="favicon">NOC Project</a> | Network Monitoring | Analytics | Main Product | — | [Official Website](https://getnoc.com/features/big-data/) |
 | <a href="https://www.nuna.com/" class="favicon">Nuna Inc.</a> | Health Data Analytics | — | — | — | [Talk in English, July 2020](https://youtu.be/GMiXCMFDMow?t=170) |
 | <a href="https://www.oneapm.com/" class="favicon">OneAPM</a> | Monitorings and Data Analysis | Main product | — | — | [Slides in Chinese, October 2018](https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup19/8.%20clickhouse在OneAPM的应用%20杜龙.pdf) |
+| <a href="https://panelbear.com/" class="favicon">Panelbear | Analytics | Monitoring and Analytics | — | — | [Tech Stack, November 2020](https://panelbear.com/blog/tech-stack/) |
 | <a href="https://www.percent.cn/" class="favicon">Percent 百分点</a> | Analytics | Main Product | — | — | [Slides in Chinese, June 2019](https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup24/4.%20ClickHouse万亿数据双中心的设计与实践%20.pdf) |
 | <a href="https://www.percona.com/" class="favicon">Percona</a> | Performance analysis | Percona Monitoring and Management | — | — | [Official website, Mar 2020](https://www.percona.com/blog/2020/03/30/advanced-query-analysis-in-percona-monitoring-and-management-with-direct-clickhouse-access/) |
 | <a href="https://plausible.io/" class="favicon">Plausible</a> | Analytics | Main Product | — | — | [Blog post, June 2020](https://twitter.com/PlausibleHQ/status/1273889629087969280) |
--- a/docs/en/sql-reference/data-types/array.md
+++ b/docs/en/sql-reference/data-types/array.md
@ -45,6 +45,8 @@ SELECT [1, 2] AS x, toTypeName(x)

 ## Working with Data Types {#working-with-data-types}

+The maximum size of an array is limited to one million elements. 
+
 When creating an array on the fly, ClickHouse automatically defines the argument type as the narrowest data type that can store all the listed arguments. If there are any [Nullable](../../sql-reference/data-types/nullable.md#data_type-nullable) or literal [NULL](../../sql-reference/syntax.md#null-literal) values, the type of an array element also becomes [Nullable](../../sql-reference/data-types/nullable.md).

 If ClickHouse couldn’t determine the data type, it generates an exception. For instance, this happens when trying to create an array with strings and numbers simultaneously (`SELECT array(1, 'a')`).
--- a/docs/en/sql-reference/functions/array-functions.md
+++ b/docs/en/sql-reference/functions/array-functions.md
@ -1288,73 +1288,226 @@ Returns the index of the first element in the `arr1` array for which `func` retu

 Note that the `arrayFirstIndex` is a [higher-order function](../../sql-reference/functions/index.md#higher-order-functions). You must pass a lambda function to it as the first argument, and it can’t be omitted.

-## arrayMin(\[func,\] arr1, …) {#array-min}
+## arrayMin {#array-min}

-Returns the min of the `func` values. If the function is omitted, it just returns the min of the array elements.
+Returns the minimum of elements in the source array. 
+
+If the `func` function is specified, returns the mininum of elements converted by this function.

 Note that the `arrayMin` is a [higher-order function](../../sql-reference/functions/index.md#higher-order-functions). You can pass a lambda function to it as the first argument.

-Examples:
+**Syntax**
+
 ```sql
-SELECT arrayMin([1, 2, 4]) AS res
+arrayMin([func,] arr)
+```
+
+**Parameters**
+
+-   `func` — Function. [Expression](../../sql-reference/data-types/special-data-types/expression.md).
+-   `arr` — Array. [Array](../../sql-reference/data-types/array.md).
+
+**Returned value**
+
+-   The minimum of function values (or the array minimum). 
+
+Type: if `func` is specified, matches `func` return value type, else matches the array elements type. 
+
+**Examples**
+
+Query:
+
+```sql
+SELECT arrayMin([1, 2, 4]) AS res;
+```
+
+Result:
+
+```text
 ┌─res─┐
 │   1 │
 └─────┘
+```

+Query:

-SELECT arrayMin(x -> (-x), [1, 2, 4]) AS res
+```sql
+SELECT arrayMin(x -> (-x), [1, 2, 4]) AS res;
+```
+
+Result:
+
+```text
 ┌─res─┐
 │  -4 │
 └─────┘
 ```

-## arrayMax(\[func,\] arr1, …) {#array-max}
+## arrayMax {#array-max}

-Returns the max of the `func` values. If the function is omitted, it just returns the max of the array elements.
+Returns the maximum of elements in the source array. 
+
+If the `func` function is specified, returns the maximum of elements converted by this function.

 Note that the `arrayMax` is a [higher-order function](../../sql-reference/functions/index.md#higher-order-functions). You can pass a lambda function to it as the first argument.

-Examples:
+**Syntax**
+
 ```sql
-SELECT arrayMax([1, 2, 4]) AS res
+arrayMax([func,] arr)
+```
+
+**Parameters**
+
+-   `func` — Function. [Expression](../../sql-reference/data-types/special-data-types/expression.md).
+-   `arr` — Array. [Array](../../sql-reference/data-types/array.md).
+
+**Returned value**
+
+-   The maximum of function values (or the array maximum). 
+
+Type: if `func` is specified, matches `func` return value type, else matches the array elements type. 
+
+**Examples**
+
+Query:
+
+```sql
+SELECT arrayMax([1, 2, 4]) AS res;
+```
+
+Result:
+
+```text
 ┌─res─┐
 │   4 │
 └─────┘
+```

+Query:

-SELECT arrayMax(x -> (-x), [1, 2, 4]) AS res
+```sql
+SELECT arrayMax(x -> (-x), [1, 2, 4]) AS res;
+```
+
+Result:
+
+```text
 ┌─res─┐
 │  -1 │
 └─────┘
 ```

-## arraySum(\[func,\] arr1, …) {#array-sum}
+## arraySum {#array-sum}

-Returns the sum of the `func` values. If the function is omitted, it just returns the sum of the array elements.
+Returns the sum of elements in the source array. 
+
+If the `func` function is specified, returns the sum of elements converted by this function.

 Note that the `arraySum` is a [higher-order function](../../sql-reference/functions/index.md#higher-order-functions). You can pass a lambda function to it as the first argument.

-Examples:
+**Syntax**
+
 ```sql
-SELECT arraySum([2,3]) AS res
+arraySum([func,] arr)
+```
+
+**Parameters**
+
+-   `func` — Function. [Expression](../../sql-reference/data-types/special-data-types/expression.md).
+-   `arr` — Array. [Array](../../sql-reference/data-types/array.md).   
+
+**Returned value**
+
+-   The sum of the function values (or the array sum).
+
+Type: for decimal numbers in source array (or for converted values, if `func` is specified) — [Decimal128](../../sql-reference/data-types/decimal.md), for floating point numbers — [Float64](../../sql-reference/data-types/float.md), for numeric unsigned — [UInt64](../../sql-reference/data-types/int-uint.md), and for numeric signed — [Int64](../../sql-reference/data-types/int-uint.md).
+
+**Examples**
+
+Query:
+
+```sql
+SELECT arraySum([2, 3]) AS res;
+```
+
+Result:
+
+```text
 ┌─res─┐
 │   5 │
 └─────┘
+```

+Query:

-SELECT arraySum(x -> x*x, [2, 3]) AS res
+```sql
+SELECT arraySum(x -> x*x, [2, 3]) AS res;
+```
+
+Result:
+
+```text
 ┌─res─┐
 │  13 │
 └─────┘
 ```

+## arrayAvg {#array-avg}

-## arrayAvg(\[func,\] arr1, …) {#array-avg}
+Returns the average of elements in the source array. 

-Returns the average of the `func` values. If the function is omitted, it just returns the average of the array elements.
+If the `func` function is specified, returns the average of elements converted by this function.

 Note that the `arrayAvg` is a [higher-order function](../../sql-reference/functions/index.md#higher-order-functions). You can pass a lambda function to it as the first argument.

+**Syntax**
+
+```sql
+arrayAvg([func,] arr)
+```
+
+**Parameters**
+
+-   `func` — Function. [Expression](../../sql-reference/data-types/special-data-types/expression.md).
+-   `arr` — Array. [Array](../../sql-reference/data-types/array.md).   
+
+**Returned value**
+
+-   The average of function values (or the array average).
+
+Type: [Float64](../../sql-reference/data-types/float.md).
+
+**Examples**
+
+Query:
+
+```sql
+SELECT arrayAvg([1, 2, 4]) AS res;
+```
+
+Result:
+
+```text
+┌────────────────res─┐
+│ 2.3333333333333335 │
+└────────────────────┘
+```
+
+Query:
+
+```sql
+SELECT arrayAvg(x -> (x * x), [2, 4]) AS res;
+```
+
+Result:
+
+```text
+┌─res─┐
+│  10 │
+└─────┘
+```
+
 ## arrayCumSum(\[func,\] arr1, …) {#arraycumsumfunc-arr1}

 Returns an array of partial sums of elements in the source array (a running sum). If the `func` function is specified, then the values of the array elements are converted by this function before summing.
--- a/docs/en/sql-reference/functions/date-time-functions.md
+++ b/docs/en/sql-reference/functions/date-time-functions.md
@ -380,7 +380,7 @@ Alias: `dateTrunc`.

 **Parameters**

-   `unit` — Part of date. [String](../syntax.md#syntax-string-literal).
+-   `unit` — The type of interval to truncate the result. [String Literal](../syntax.md#syntax-string-literal).
    Possible values:

    - `second`
@ -435,6 +435,201 @@ Result:

 -   [toStartOfInterval](#tostartofintervaltime-or-data-interval-x-unit-time-zone)

+## date\_add {#date_add}
+
+Adds specified date/time interval to the provided date.
+
+**Syntax** 
+
+``` sql
+date_add(unit, value, date)
+```
+
+Aliases: `dateAdd`, `DATE_ADD`. 
+
+**Parameters**
+
+-   `unit` — The type of interval to add. [String](../../sql-reference/data-types/string.md).
+
+        Supported values: second, minute, hour, day, week, month, quarter, year.
+-   `value` - Value in specified unit - [Int](../../sql-reference/data-types/int-uint.md)    
+-   `date` — [Date](../../sql-reference/data-types/date.md) or [DateTime](../../sql-reference/data-types/datetime.md).
+
+
+**Returned value**
+
+Returns Date or DateTime with `value` expressed in `unit` added to `date`. 
+
+**Example**
+
+```sql
+select date_add(YEAR, 3, toDate('2018-01-01'));
+```
+
+```text
+┌─plus(toDate('2018-01-01'), toIntervalYear(3))─┐
+│                                    2021-01-01 │
+└───────────────────────────────────────────────┘
+```
+
+## date\_diff {#date_diff}
+
+Returns the difference between two Date or DateTime values.
+
+**Syntax**
+
+``` sql
+date_diff('unit', startdate, enddate, [timezone])
+```
+
+Aliases: `dateDiff`, `DATE_DIFF`. 
+
+**Parameters**
+
+-   `unit` — The type of interval for result [String](../../sql-reference/data-types/string.md).
+
+        Supported values: second, minute, hour, day, week, month, quarter, year.
+
+-   `startdate` — The first time value to subtract (the subtrahend). [Date](../../sql-reference/data-types/date.md) or [DateTime](../../sql-reference/data-types/datetime.md).
+
+-   `enddate` — The second time value to subtract from (the minuend). [Date](../../sql-reference/data-types/date.md) or [DateTime](../../sql-reference/data-types/datetime.md).
+
+-   `timezone` — Optional parameter. If specified, it is applied to both `startdate` and `enddate`. If not specified, timezones of `startdate` and `enddate` are used. If they are not the same, the result is unspecified.
+
+**Returned value**
+
+Difference between `enddate` and `startdate` expressed in `unit`.
+
+Type: `int`.
+
+**Example**
+
+Query:
+
+``` sql
+SELECT dateDiff('hour', toDateTime('2018-01-01 22:00:00'), toDateTime('2018-01-02 23:00:00'));
+```
+
+Result:
+
+``` text
+┌─dateDiff('hour', toDateTime('2018-01-01 22:00:00'), toDateTime('2018-01-02 23:00:00'))─┐
+│                                                                                     25 │
+└────────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+## date\_sub {#date_sub}
+
+Subtracts a time/date interval from the provided date.
+
+**Syntax**
+
+``` sql
+date_sub(unit, value, date)
+```
+
+Aliases: `dateSub`, `DATE_SUB`. 
+
+**Parameters**
+
+-   `unit` — The type of interval to subtract. [String](../../sql-reference/data-types/string.md).
+
+        Supported values: second, minute, hour, day, week, month, quarter, year.
+-   `value` - Value in specified unit - [Int](../../sql-reference/data-types/int-uint.md)    
+-   `date` — [Date](../../sql-reference/data-types/date.md) or [DateTime](../../sql-reference/data-types/datetime.md) to subtract value from.
+
+**Returned value**
+
+Returns Date or DateTime with `value` expressed in `unit` subtracted from `date`. 
+
+**Example**
+
+Query:
+
+``` sql
+SELECT date_sub(YEAR, 3, toDate('2018-01-01'));
+```
+
+Result:
+
+``` text
+┌─minus(toDate('2018-01-01'), toIntervalYear(3))─┐
+│                                     2015-01-01 │
+└────────────────────────────────────────────────┘
+```
+
+## timestamp\_add {#timestamp_add}
+
+Adds the specified time value with the provided date or date time value.
+
+**Syntax** 
+
+``` sql
+timestamp_add(date, INTERVAL value unit)
+```
+
+Aliases: `timeStampAdd`, `TIMESTAMP_ADD`. 
+
+**Parameters**
+    
+-   `date` — Date or Date with time - [Date](../../sql-reference/data-types/date.md) or [DateTime](../../sql-reference/data-types/datetime.md).
+-   `value` - Value in specified unit - [Int](../../sql-reference/data-types/int-uint.md)
+-   `unit` — The type of interval to add. [String](../../sql-reference/data-types/string.md).
+
+        Supported values: second, minute, hour, day, week, month, quarter, year.
+
+**Returned value**
+
+Returns Date or DateTime with the specified `value`  expressed in `unit` added to `date`. 
+    
+**Example**
+
+```sql
+select timestamp_add(toDate('2018-01-01'), INTERVAL 3 MONTH);
+```
+
+```text
+┌─plus(toDate('2018-01-01'), toIntervalMonth(3))─┐
+│                                     2018-04-01 │
+└────────────────────────────────────────────────┘
+```
+
+## timestamp\_sub {#timestamp_sub}
+
+Returns the difference between two dates in the specified unit.
+
+**Syntax** 
+
+``` sql
+timestamp_sub(unit, value, date)
+```
+
+Aliases: `timeStampSub`, `TIMESTAMP_SUB`. 
+
+**Parameters**
+
+-   `unit` — The type of interval to add. [String](../../sql-reference/data-types/string.md).
+
+        Supported values: second, minute, hour, day, week, month, quarter, year.
+-   `value` - Value in specified unit - [Int](../../sql-reference/data-types/int-uint.md).   
+-   `date`- [Date](../../sql-reference/data-types/date.md) or [DateTime](../../sql-reference/data-types/datetime.md).
+
+**Returned value**
+
+Difference between `date` and the specified `value` expressed in `unit`.
+
+**Example**
+
+```sql
+select timestamp_sub(MONTH, 5, toDateTime('2018-12-18 01:02:03'));
+```
+
+```text
+┌─minus(toDateTime('2018-12-18 01:02:03'), toIntervalMonth(5))─┐
+│                                          2018-07-18 01:02:03 │
+└──────────────────────────────────────────────────────────────┘
+```
+    
 ## now {#now}

 Returns the current date and time. 
@ -550,50 +745,6 @@ SELECT
 └──────────────────────────┴───────────────────────────────┘
 ```

-## dateDiff {#datediff}
-
-Returns the difference between two Date or DateTime values.
-
-**Syntax**
-
-``` sql
-dateDiff('unit', startdate, enddate, [timezone])
-```
-
-**Parameters**
-
-   `unit` — Time unit, in which the returned value is expressed. [String](../../sql-reference/syntax.md#syntax-string-literal).
-
-        Supported values: second, minute, hour, day, week, month, quarter, year.
-
-   `startdate` — The first time value to compare. [Date](../../sql-reference/data-types/date.md) or [DateTime](../../sql-reference/data-types/datetime.md).
-
-   `enddate` — The second time value to compare. [Date](../../sql-reference/data-types/date.md) or [DateTime](../../sql-reference/data-types/datetime.md).
-
-   `timezone` — Optional parameter. If specified, it is applied to both `startdate` and `enddate`. If not specified, timezones of `startdate` and `enddate` are used. If they are not the same, the result is unspecified.
-
-**Returned value**
-
-Difference between `startdate` and `enddate` expressed in `unit`.
-
-Type: `int`.
-
-**Example**
-
-Query:
-
-``` sql
-SELECT dateDiff('hour', toDateTime('2018-01-01 22:00:00'), toDateTime('2018-01-02 23:00:00'));
-```
-
-Result:
-
-``` text
-┌─dateDiff('hour', toDateTime('2018-01-01 22:00:00'), toDateTime('2018-01-02 23:00:00'))─┐
-│                                                                                     25 │
-└────────────────────────────────────────────────────────────────────────────────────────┘
-```
-
 ## timeSlots(StartTime, Duration,\[, Size\]) {#timeslotsstarttime-duration-size}

 For a time interval starting at ‘StartTime’ and continuing for ‘Duration’ seconds, it returns an array of moments in time, consisting of points from this interval rounded down to the ‘Size’ in seconds. ‘Size’ is an optional parameter: a constant UInt32, set to 1800 by default.
--- a/docs/en/sql-reference/window-functions/index.md
+++ b/docs/en/sql-reference/window-functions/index.md
@ -1,9 +1,14 @@
-# [development] Window Functions
+---
+toc_priority: 62
+toc_title: Window Functions
+---
+
+# [experimental] Window Functions

 !!! warning "Warning"
 This is an experimental feature that is currently in development and is not ready
 for general use. It will change in unpredictable backwards-incompatible ways in
-the future releases.
+the future releases. Set `allow_experimental_window_functions = 1` to enable it.

 ClickHouse currently supports calculation of aggregate functions over a window.
 Pure window functions such as `rank`, `lag`, `lead` and so on are not yet supported.
@ -11,9 +16,7 @@ Pure window functions such as `rank`, `lag`, `lead` and so on are not yet suppor
 The window can be specified either with an `OVER` clause or with a separate
 `WINDOW` clause.

-Only two variants of frame are supported, `ROWS` and `RANGE`. The only supported
-frame boundaries are `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`.
-
+Only two variants of frame are supported, `ROWS` and `RANGE`. Offsets for the `RANGE` frame are not yet supported.

 ## References

@ -28,6 +31,7 @@ https://github.com/ClickHouse/ClickHouse/blob/master/tests/performance/window_fu
 https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/01591_window_functions.sql

 ### Postgres Docs
+https://www.postgresql.org/docs/current/sql-select.html#SQL-WINDOW
 https://www.postgresql.org/docs/devel/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
 https://www.postgresql.org/docs/devel/functions-window.html
 https://www.postgresql.org/docs/devel/tutorial-window.html
--- a/docs/ru/development/style.md
+++ b/docs/ru/development/style.md
@ -714,6 +714,7 @@ auto s = std::string{"Hello"};
 ### Пользовательская ошибка {#error-messages-user-error}

 Такая ошибка вызвана действиями пользователя (неверный синтаксис запроса) или конфигурацией внешних систем (кончилось место на диске). Предполагается, что пользователь может устранить её самостоятельно. Для этого в сообщении об ошибке должна содержаться следующая информация:
+
 * что произошло. Это должно объясняться в пользовательских терминах (`Function pow() is not supported for data type UInt128`), а не загадочными конструкциями из кода (`runtime overload resolution failed in DB::BinaryOperationBuilder<FunctionAdaptor<pow>::Impl, UInt128, Int8>::kaboongleFastPath()`).
 * почему/где/когда -- любой контекст, который помогает отладить проблему. Представьте, как бы её отлаживали вы (программировать и пользоваться отладчиком нельзя).
 * что можно предпринять для устранения ошибки. Здесь можно перечислить типичные причины проблемы, настройки, влияющие на это поведение, и так далее.
--- a/docs/ru/sql-reference/data-types/array.md
+++ b/docs/ru/sql-reference/data-types/array.md
@ -47,6 +47,8 @@ SELECT [1, 2] AS x, toTypeName(x)

 ## Особенности работы с типами данных {#osobennosti-raboty-s-tipami-dannykh}

+Максимальный размер массива ограничен одним миллионом элементов.
+
 При создании массива «на лету» ClickHouse автоматически определяет тип аргументов как наиболее узкий тип данных, в котором можно хранить все перечисленные аргументы. Если среди аргументов есть [NULL](../../sql-reference/data-types/array.md#null-literal) или аргумент типа [Nullable](nullable.md#data_type-nullable), то тип элементов массива — [Nullable](nullable.md).

 Если ClickHouse не смог подобрать тип данных, то он сгенерирует исключение. Это произойдёт, например, при попытке создать массив одновременно со строками и числами `SELECT array(1, 'a')`.
--- a/docs/ru/sql-reference/functions/array-functions.md
+++ b/docs/ru/sql-reference/functions/array-functions.md
@ -1135,11 +1135,225 @@ SELECT

 Функция `arrayFirstIndex` является [функцией высшего порядка](../../sql-reference/functions/index.md#higher-order-functions) — в качестве первого аргумента ей нужно передать лямбда-функцию, и этот аргумент не может быть опущен.

-## arraySum(\[func,\] arr1, …) {#array-sum}
+## arrayMin {#array-min}

-Возвращает сумму значений функции `func`. Если функция не указана - просто возвращает сумму элементов массива.
+Возвращает значение минимального элемента в исходном массиве. 

-Функция `arraySum` является [функцией высшего порядка](../../sql-reference/functions/index.md#higher-order-functions) - в качестве первого аргумента ей можно передать лямбда-функцию.
+Если передана функция `func`, возвращается минимум из элементов массива, преобразованных этой функцией.
+
+Функция `arrayMin` является [функцией высшего порядка](../../sql-reference/functions/index.md#higher-order-functions) — в качестве первого аргумента ей можно передать лямбда-функцию.
+
+**Синтаксис**
+
+```sql
+arrayMin([func,] arr)
+```
+
+**Параметры**
+
+-   `func` — функция. [Expression](../../sql-reference/data-types/special-data-types/expression.md).
+-   `arr` — массив. [Array](../../sql-reference/data-types/array.md).
+
+**Возвращаемое значение**
+
+-   Минимальное значение функции (или минимальный элемент массива).
+
+Тип: если передана `func`, соответствует типу ее возвращаемого значения, иначе соответствует типу элементов массива.
+
+**Примеры**
+
+Запрос:
+
+```sql
+SELECT arrayMin([1, 2, 4]) AS res;
+```
+
+Результат:
+
+```text
+┌─res─┐
+│   1 │
+└─────┘
+```
+
+Запрос:
+
+```sql
+SELECT arrayMin(x -> (-x), [1, 2, 4]) AS res;
+```
+
+Результат:
+
+```text
+┌─res─┐
+│  -4 │
+└─────┘
+```
+
+## arrayMax {#array-max}
+
+Возвращает значение максимального элемента в исходном массиве. 
+
+Если передана функция `func`, возвращается максимум из элементов массива, преобразованных этой функцией.
+
+Функция `arrayMax` является [функцией высшего порядка](../../sql-reference/functions/index.md#higher-order-functions) — в качестве первого аргумента ей можно передать лямбда-функцию.
+
+**Синтаксис**
+
+```sql
+arrayMax([func,] arr)
+```
+
+**Параметры**
+
+-   `func` — функция. [Expression](../../sql-reference/data-types/special-data-types/expression.md).
+-   `arr` — массив. [Array](../../sql-reference/data-types/array.md).
+
+**Возвращаемое значение**
+
+-   Максимальное значение функции (или максимальный элемент массива).
+
+Тип: если передана `func`, соответствует типу ее возвращаемого значения, иначе соответствует типу элементов массива.
+
+**Примеры**
+
+Запрос:
+
+```sql
+SELECT arrayMax([1, 2, 4]) AS res;
+```
+
+Результат:
+
+```text
+┌─res─┐
+│   4 │
+└─────┘
+```
+
+Запрос:
+
+```sql
+SELECT arrayMax(x -> (-x), [1, 2, 4]) AS res;
+```
+
+Результат:
+
+```text
+┌─res─┐
+│  -1 │
+└─────┘
+```
+
+## arraySum {#array-sum}
+
+Возвращает сумму элементов в исходном массиве. 
+
+Если передана функция `func`, возвращается сумма элементов массива, преобразованных этой функцией.
+
+Функция `arraySum` является [функцией высшего порядка](../../sql-reference/functions/index.md#higher-order-functions) — в качестве первого аргумента ей можно передать лямбда-функцию.
+
+**Синтаксис**
+
+```sql
+arraySum([func,] arr)
+```
+
+**Параметры**
+
+-   `func` — функция. [Expression](../../sql-reference/data-types/special-data-types/expression.md).
+-   `arr` — массив. [Array](../../sql-reference/data-types/array.md).   
+
+**Возвращаемое значение**
+
+-   Сумма значений функции (или сумма элементов массива).
+
+Тип: для Decimal чисел в исходном массиве (если функция `func` была передана, то для чисел, преобразованных ею) — [Decimal128](../../sql-reference/data-types/decimal.md), для чисел с плавающей точкой — [Float64](../../sql-reference/data-types/float.md), для беззнаковых целых чисел — [UInt64](../../sql-reference/data-types/int-uint.md), для целых чисел со знаком — [Int64](../../sql-reference/data-types/int-uint.md).
+
+**Примеры**
+
+Запрос:
+
+```sql
+SELECT arraySum([2, 3]) AS res;
+```
+
+Результат:
+
+```text
+┌─res─┐
+│   5 │
+└─────┘
+```
+
+Запрос:
+
+```sql
+SELECT arraySum(x -> x*x, [2, 3]) AS res;
+```
+
+Результат:
+
+```text
+┌─res─┐
+│  13 │
+└─────┘
+```
+
+## arrayAvg {#array-avg}
+
+Возвращает среднее значение элементов в исходном массиве. 
+
+Если передана функция `func`, возвращается среднее значение элементов массива, преобразованных этой функцией.
+
+Функция `arrayAvg` является [функцией высшего порядка](../../sql-reference/functions/index.md#higher-order-functions) — в качестве первого аргумента ей можно передать лямбда-функцию.
+
+**Синтаксис**
+
+```sql
+arrayAvg([func,] arr)
+```
+
+**Параметры**
+
+-   `func` — функция. [Expression](../../sql-reference/data-types/special-data-types/expression.md).
+-   `arr` — массив. [Array](../../sql-reference/data-types/array.md).   
+
+**Возвращаемое значение**
+
+-   Среднее значение функции (или среднее значение элементов массива).
+
+Тип: [Float64](../../sql-reference/data-types/float.md).
+
+**Примеры**
+
+Запрос:
+
+```sql
+SELECT arrayAvg([1, 2, 4]) AS res;
+```
+
+Результат:
+
+```text
+┌────────────────res─┐
+│ 2.3333333333333335 │
+└────────────────────┘
+```
+
+Запрос:
+
+```sql
+SELECT arrayAvg(x -> (x * x), [2, 4]) AS res;
+```
+
+Результат:
+
+```text
+┌─res─┐
+│  10 │
+└─────┘
+```

 ## arrayCumSum(\[func,\] arr1, …) {#arraycumsumfunc-arr1}

--- a/programs/client/Client.cpp
+++ b/programs/client/Client.cpp
@ -1719,7 +1719,7 @@ private:
            }
            // Remember where the data ended. We use this info later to determine
            // where the next query begins.
-            parsed_insert_query->end = data_in.buffer().begin() + data_in.count();
+            parsed_insert_query->end = parsed_insert_query->data + data_in.count();
        }
        else if (!is_interactive)
        {
@ -1900,6 +1900,9 @@ private:

        switch (packet.type)
        {
+            case Protocol::Server::PartUUIDs:
+                return true;
+
            case Protocol::Server::Data:
                if (!cancelled)
                    onData(packet.block);
--- a/programs/client/QueryFuzzer.cpp
+++ b/programs/client/QueryFuzzer.cpp
@ -325,6 +325,51 @@ void QueryFuzzer::fuzzColumnLikeExpressionList(IAST * ast)
    // the generic recursion into IAST.children.
 }

+void QueryFuzzer::fuzzWindowFrame(WindowFrame & frame)
+{
+    switch (fuzz_rand() % 40)
+    {
+        case 0:
+        {
+            const auto r = fuzz_rand() % 3;
+            frame.type = r == 0 ? WindowFrame::FrameType::Rows
+                : r == 1 ? WindowFrame::FrameType::Range
+                    : WindowFrame::FrameType::Groups;
+            break;
+        }
+        case 1:
+        {
+            const auto r = fuzz_rand() % 3;
+            frame.begin_type = r == 0 ? WindowFrame::BoundaryType::Unbounded
+                : r == 1 ? WindowFrame::BoundaryType::Current
+                    : WindowFrame::BoundaryType::Offset;
+            break;
+        }
+        case 2:
+        {
+            const auto r = fuzz_rand() % 3;
+            frame.end_type = r == 0 ? WindowFrame::BoundaryType::Unbounded
+                : r == 1 ? WindowFrame::BoundaryType::Current
+                    : WindowFrame::BoundaryType::Offset;
+            break;
+        }
+        case 3:
+        {
+            frame.begin_offset = getRandomField(0).get<Int64>();
+            break;
+        }
+        case 4:
+        {
+            frame.end_offset = getRandomField(0).get<Int64>();
+            break;
+        }
+        default:
+            break;
+    }
+
+    frame.is_default = (frame == WindowFrame{});
+}
+
 void QueryFuzzer::fuzz(ASTs & asts)
 {
    for (auto & ast : asts)
@ -409,6 +454,7 @@ void QueryFuzzer::fuzz(ASTPtr & ast)
            auto & def = fn->window_definition->as<ASTWindowDefinition &>();
            fuzzColumnLikeExpressionList(def.partition_by.get());
            fuzzOrderByList(def.order_by.get());
+            fuzzWindowFrame(def.frame);
        }

        fuzz(fn->children);
@ -421,6 +467,23 @@ void QueryFuzzer::fuzz(ASTPtr & ast)

        fuzz(select->children);
    }
+    /*
+     * The time to fuzz the settings has not yet come.
+     * Apparently we don't have any infractructure to validate the values of
+     * the settings, and the first query with max_block_size = -1 breaks
+     * because of overflows here and there.
+     *//*
+     * else if (auto * set = typeid_cast<ASTSetQuery *>(ast.get()))
+     * {
+     *      for (auto & c : set->changes)
+     *      {
+     *          if (fuzz_rand() % 50 == 0)
+     *          {
+     *              c.value = fuzzField(c.value);
+     *          }
+     *      }
+     * }
+     */
    else if (auto * literal = typeid_cast<ASTLiteral *>(ast.get()))
    {
        // There is a caveat with fuzzing the children: many ASTs also keep the
--- a/programs/client/QueryFuzzer.h
+++ b/programs/client/QueryFuzzer.h
@ -14,6 +14,7 @@ namespace DB

 class ASTExpressionList;
 class ASTOrderByElement;
+struct WindowFrame;

 /*
 * This is an AST-based query fuzzer that makes random modifications to query
@ -65,6 +66,7 @@ struct QueryFuzzer
    void fuzzOrderByElement(ASTOrderByElement * elem);
    void fuzzOrderByList(IAST * ast);
    void fuzzColumnLikeExpressionList(IAST * ast);
+    void fuzzWindowFrame(WindowFrame & frame);
    void fuzz(ASTs & asts);
    void fuzz(ASTPtr & ast);
    void collectFuzzInfoMain(const ASTPtr ast);
--- a/src/Client/Connection.cpp
+++ b/src/Client/Connection.cpp
@ -542,6 +542,12 @@ void Connection::sendData(const Block & block, const String & name, bool scalar)
        throttler->add(out->count() - prev_bytes);
 }

+void Connection::sendIgnoredPartUUIDs(const std::vector<UUID> & uuids)
+{
+    writeVarUInt(Protocol::Client::IgnoredPartUUIDs, *out);
+    writeVectorBinary(uuids, *out);
+    out->next();
+}

 void Connection::sendPreparedData(ReadBuffer & input, size_t size, const String & name)
 {
@ -798,6 +804,10 @@ Packet Connection::receivePacket(std::function<void(Poco::Net::Socket &)> async_
            case Protocol::Server::EndOfStream:
                return res;

+            case Protocol::Server::PartUUIDs:
+                readVectorBinary(res.part_uuids, *in);
+                return res;
+
            default:
                /// In unknown state, disconnect - to not leave unsynchronised connection.
                disconnect();
--- a/src/Client/Connection.h
+++ b/src/Client/Connection.h
@ -66,6 +66,7 @@ struct Packet
    std::vector<String> multistring_message;
    Progress progress;
    BlockStreamProfileInfo profile_info;
+    std::vector<UUID> part_uuids;

    Packet() : type(Protocol::Server::Hello) {}
 };
@ -157,6 +158,8 @@ public:
    void sendScalarsData(Scalars & data);
    /// Send all contents of external (temporary) tables.
    void sendExternalTablesData(ExternalTablesData & data);
+    /// Send parts' uuids to excluded them from query processing
+    void sendIgnoredPartUUIDs(const std::vector<UUID> & uuids);

    /// Send prepared block of data (serialized and, if need, compressed), that will be read from 'input'.
    /// You could pass size of serialized/compressed block.
--- a/src/Client/MultiplexedConnections.cpp
+++ b/src/Client/MultiplexedConnections.cpp
@ -140,6 +140,21 @@ void MultiplexedConnections::sendQuery(
    sent_query = true;
 }

+void MultiplexedConnections::sendIgnoredPartUUIDs(const std::vector<UUID> & uuids)
+{
+    std::lock_guard lock(cancel_mutex);
+
+    if (sent_query)
+        throw Exception("Cannot send uuids after query is sent.", ErrorCodes::LOGICAL_ERROR);
+
+    for (ReplicaState & state : replica_states)
+    {
+        Connection * connection = state.connection;
+        if (connection != nullptr)
+            connection->sendIgnoredPartUUIDs(uuids);
+    }
+}
+
 Packet MultiplexedConnections::receivePacket()
 {
    std::lock_guard lock(cancel_mutex);
@ -195,6 +210,7 @@ Packet MultiplexedConnections::drain()

        switch (packet.type)
        {
+            case Protocol::Server::PartUUIDs:
            case Protocol::Server::Data:
            case Protocol::Server::Progress:
            case Protocol::Server::ProfileInfo:
@ -253,6 +269,7 @@ Packet MultiplexedConnections::receivePacketUnlocked(std::function<void(Poco::Ne

    switch (packet.type)
    {
+        case Protocol::Server::PartUUIDs:
        case Protocol::Server::Data:
        case Protocol::Server::Progress:
        case Protocol::Server::ProfileInfo:
--- a/src/Client/MultiplexedConnections.h
+++ b/src/Client/MultiplexedConnections.h
@ -50,6 +50,9 @@ public:
    /// Send a request to the replica to cancel the request
    void sendCancel();

+    /// Send parts' uuids to replicas to exclude them from query processing
+    void sendIgnoredPartUUIDs(const std::vector<UUID> & uuids);
+
    /** On each replica, read and skip all packets to EndOfStream or Exception.
      * Returns EndOfStream if no exception has been received. Otherwise
      * returns the last received packet of type Exception.
--- a/src/Columns/ColumnsNumber.h
+++ b/src/Columns/ColumnsNumber.h
@ -26,4 +26,6 @@ using ColumnInt256 = ColumnVector<Int256>;
 using ColumnFloat32 = ColumnVector<Float32>;
 using ColumnFloat64 = ColumnVector<Float64>;

+using ColumnUUID = ColumnVector<UInt128>;
+
 }
--- a/src/Common/ErrorCodes.cpp
+++ b/src/Common/ErrorCodes.cpp
@ -533,7 +533,8 @@
    M(564, INTERSERVER_SCHEME_DOESNT_MATCH) \
    M(565, TOO_MANY_PARTITIONS) \
    M(566, CANNOT_RMDIR) \
-    M(567, RAFT_ERROR) \
+    M(567, DUPLICATED_PART_UUIDS) \
+    M(568, RAFT_ERROR) \
    \
    M(999, KEEPER_EXCEPTION) \
    M(1000, POCO_EXCEPTION) \
--- a/src/Common/StackTrace.h
+++ b/src/Common/StackTrace.h
@ -34,7 +34,15 @@ public:
        std::optional<std::string> file;
        std::optional<UInt64> line;
    };
-    static constexpr size_t capacity = 32;
+
+    static constexpr size_t capacity =
+#ifndef NDEBUG
+        /* The stacks are normally larger in debug version due to less inlining. */
+        64
+#else
+        32
+#endif
+        ;
    using FramePointers = std::array<void *, capacity>;
    using Frames = std::array<Frame, capacity>;

--- a/src/Common/ThreadProfileEvents.cpp
+++ b/src/Common/ThreadProfileEvents.cpp
@ -68,7 +68,7 @@ TasksStatsCounters::TasksStatsCounters(const UInt64 tid, const MetricsProvider p
    case MetricsProvider::Netlink:
        stats_getter = [metrics_provider = std::make_shared<TaskStatsInfoGetter>(), tid]()
                {
-                    ::taskstats result;
+                    ::taskstats result{};
                    metrics_provider->getStat(result, tid);
                    return result;
                };
@ -76,7 +76,7 @@ TasksStatsCounters::TasksStatsCounters(const UInt64 tid, const MetricsProvider p
    case MetricsProvider::Procfs:
        stats_getter = [metrics_provider = std::make_shared<ProcfsMetricsProvider>(tid)]()
                {
-                    ::taskstats result;
+                    ::taskstats result{};
                    metrics_provider->getTaskStats(result);
                    return result;
                };
--- a/src/Core/MySQL/MySQLReplication.cpp
+++ b/src/Core/MySQL/MySQLReplication.cpp
@ -136,6 +136,7 @@ namespace MySQLReplication
        out << "XID: " << this->xid << '\n';
    }

+    /// https://dev.mysql.com/doc/internals/en/table-map-event.html
    void TableMapEvent::parseImpl(ReadBuffer & payload)
    {
        payload.readStrict(reinterpret_cast<char *>(&table_id), 6);
@ -257,15 +258,19 @@ namespace MySQLReplication
        out << "Null Bitmap: " << bitmap_str << '\n';
    }

-    void RowsEvent::parseImpl(ReadBuffer & payload)
+    void RowsEventHeader::parse(ReadBuffer & payload)
    {
        payload.readStrict(reinterpret_cast<char *>(&table_id), 6);
        payload.readStrict(reinterpret_cast<char *>(&flags), 2);

+        UInt16 extra_data_len;
        /// This extra_data_len contains the 2 bytes length.
        payload.readStrict(reinterpret_cast<char *>(&extra_data_len), 2);
        payload.ignore(extra_data_len - 2);
+    }

+    void RowsEvent::parseImpl(ReadBuffer & payload)
+    {
        number_columns = readLengthEncodedNumber(payload);
        size_t columns_bitmap_size = (number_columns + 7) / 8;
        switch (header.type)
@ -795,37 +800,50 @@ namespace MySQLReplication
            {
                event = std::make_shared<TableMapEvent>(std::move(event_header));
                event->parseEvent(event_payload);
-                table_map = std::static_pointer_cast<TableMapEvent>(event);
+                auto table_map = std::static_pointer_cast<TableMapEvent>(event);
+                table_maps[table_map->table_id] = table_map;
                break;
            }
            case WRITE_ROWS_EVENT_V1:
            case WRITE_ROWS_EVENT_V2: {
-                if (doReplicate())
-                    event = std::make_shared<WriteRowsEvent>(table_map, std::move(event_header));
+                RowsEventHeader rows_header(event_header.type);
+                rows_header.parse(event_payload);
+                if (doReplicate(rows_header.table_id))
+                    event = std::make_shared<WriteRowsEvent>(table_maps.at(rows_header.table_id), std::move(event_header), rows_header);
                else
                    event = std::make_shared<DryRunEvent>(std::move(event_header));

                event->parseEvent(event_payload);
+                if (rows_header.flags & ROWS_END_OF_STATEMENT)
+                    table_maps.clear();
                break;
            }
            case DELETE_ROWS_EVENT_V1:
            case DELETE_ROWS_EVENT_V2: {
-                if (doReplicate())
-                    event = std::make_shared<DeleteRowsEvent>(table_map, std::move(event_header));
+                RowsEventHeader rows_header(event_header.type);
+                rows_header.parse(event_payload);
+                if (doReplicate(rows_header.table_id))
+                    event = std::make_shared<DeleteRowsEvent>(table_maps.at(rows_header.table_id), std::move(event_header), rows_header);
                else
                    event = std::make_shared<DryRunEvent>(std::move(event_header));

                event->parseEvent(event_payload);
+                if (rows_header.flags & ROWS_END_OF_STATEMENT)
+                    table_maps.clear();
                break;
            }
            case UPDATE_ROWS_EVENT_V1:
            case UPDATE_ROWS_EVENT_V2: {
-                if (doReplicate())
-                    event = std::make_shared<UpdateRowsEvent>(table_map, std::move(event_header));
+                RowsEventHeader rows_header(event_header.type);
+                rows_header.parse(event_payload);
+                if (doReplicate(rows_header.table_id))
+                    event = std::make_shared<UpdateRowsEvent>(table_maps.at(rows_header.table_id), std::move(event_header), rows_header);
                else
                    event = std::make_shared<DryRunEvent>(std::move(event_header));

                event->parseEvent(event_payload);
+                if (rows_header.flags & ROWS_END_OF_STATEMENT)
+                    table_maps.clear();
                break;
            }
            case GTID_EVENT:
@ -843,6 +861,19 @@ namespace MySQLReplication
            }
        }
    }
+
+    bool MySQLFlavor::doReplicate(UInt64 table_id)
+    {
+        if (replicate_do_db.empty())
+            return false;
+        if (table_id == 0x00ffffff)
+        {
+            // Special "dummy event"
+            return false;
+        }
+        auto table_map = table_maps.at(table_id);
+        return table_map->schema == replicate_do_db;
+    }
 }

 }
--- a/src/Core/MySQL/MySQLReplication.h
+++ b/src/Core/MySQL/MySQLReplication.h
@ -430,6 +430,22 @@ namespace MySQLReplication
        void parseMeta(String meta);
    };

+    enum RowsEventFlags
+    {
+        ROWS_END_OF_STATEMENT = 1
+    };
+
+    class RowsEventHeader
+    {
+    public:
+        EventType type;
+        UInt64 table_id;
+        UInt16 flags;
+
+        RowsEventHeader(EventType type_) : type(type_), table_id(0), flags(0) {}
+        void parse(ReadBuffer & payload);
+    };
+
    class RowsEvent : public EventBase
    {
    public:
@ -438,9 +454,11 @@ namespace MySQLReplication
        String table;
        std::vector<Field> rows;

-        RowsEvent(std::shared_ptr<TableMapEvent> table_map_, EventHeader && header_)
-            : EventBase(std::move(header_)), number_columns(0), table_id(0), flags(0), extra_data_len(0), table_map(table_map_)
+        RowsEvent(std::shared_ptr<TableMapEvent> table_map_, EventHeader && header_, const RowsEventHeader & rows_header)
+            : EventBase(std::move(header_)), number_columns(0), table_map(table_map_)
        {
+            table_id = rows_header.table_id;
+            flags = rows_header.flags;
            schema = table_map->schema;
            table = table_map->table;
        }
@ -450,7 +468,6 @@ namespace MySQLReplication
    protected:
        UInt64 table_id;
        UInt16 flags;
-        UInt16 extra_data_len;
        Bitmap columns_present_bitmap1;
        Bitmap columns_present_bitmap2;

@ -464,21 +481,24 @@ namespace MySQLReplication
    class WriteRowsEvent : public RowsEvent
    {
    public:
-        WriteRowsEvent(std::shared_ptr<TableMapEvent> table_map_, EventHeader && header_) : RowsEvent(table_map_, std::move(header_)) {}
+        WriteRowsEvent(std::shared_ptr<TableMapEvent> table_map_, EventHeader && header_, const RowsEventHeader & rows_header)
+            : RowsEvent(table_map_, std::move(header_), rows_header) {}
        MySQLEventType type() const override { return MYSQL_WRITE_ROWS_EVENT; }
    };

    class DeleteRowsEvent : public RowsEvent
    {
    public:
-        DeleteRowsEvent(std::shared_ptr<TableMapEvent> table_map_, EventHeader && header_) : RowsEvent(table_map_, std::move(header_)) {}
+        DeleteRowsEvent(std::shared_ptr<TableMapEvent> table_map_, EventHeader && header_, const RowsEventHeader & rows_header)
+            : RowsEvent(table_map_, std::move(header_), rows_header) {}
        MySQLEventType type() const override { return MYSQL_DELETE_ROWS_EVENT; }
    };

    class UpdateRowsEvent : public RowsEvent
    {
    public:
-        UpdateRowsEvent(std::shared_ptr<TableMapEvent> table_map_, EventHeader && header_) : RowsEvent(table_map_, std::move(header_)) {}
+        UpdateRowsEvent(std::shared_ptr<TableMapEvent> table_map_, EventHeader && header_, const RowsEventHeader & rows_header)
+            : RowsEvent(table_map_, std::move(header_), rows_header) {}
        MySQLEventType type() const override { return MYSQL_UPDATE_ROWS_EVENT; }
    };

@ -546,10 +566,10 @@ namespace MySQLReplication
        Position position;
        BinlogEventPtr event;
        String replicate_do_db;
-        std::shared_ptr<TableMapEvent> table_map;
+        std::map<UInt64, std::shared_ptr<TableMapEvent> > table_maps;
        size_t checksum_signature_length = 4;

-        inline bool doReplicate() { return (replicate_do_db.empty() || table_map->schema == replicate_do_db); }
+        bool doReplicate(UInt64 table_id);
    };
 }

--- a/src/Core/Protocol.h
+++ b/src/Core/Protocol.h
@ -75,8 +75,9 @@ namespace Protocol
            TablesStatusResponse = 9, /// A response to TablesStatus request.
            Log = 10,                 /// System logs of the query execution
            TableColumns = 11,        /// Columns' description for default values calculation
+            PartUUIDs = 12,           /// List of unique parts ids.

-            MAX = TableColumns,
+            MAX = PartUUIDs,
        };

        /// NOTE: If the type of packet argument would be Enum, the comparison packet >= 0 && packet < 10
@ -98,6 +99,7 @@ namespace Protocol
                "TablesStatusResponse",
                "Log",
                "TableColumns",
+                "PartUUIDs",
            };
            return packet <= MAX
                ? data[packet]
@ -132,8 +134,9 @@ namespace Protocol
            TablesStatusRequest = 5, /// Check status of tables on the server.
            KeepAlive = 6,           /// Keep the connection alive
            Scalar = 7,              /// A block of data (compressed or not).
+            IgnoredPartUUIDs = 8,    /// List of unique parts ids to exclude from query processing

-            MAX = Scalar,
+            MAX = IgnoredPartUUIDs,
        };

        inline const char * toString(UInt64 packet)
@ -147,6 +150,7 @@ namespace Protocol
                "TablesStatusRequest",
                "KeepAlive",
                "Scalar",
+                "IgnoredPartUUIDs",
            };
            return packet <= MAX
                ? data[packet]
--- a/src/Core/Settings.h
+++ b/src/Core/Settings.h
@ -420,6 +420,8 @@ class IColumn;
    M(Bool, async_socket_for_remote, true, "Asynchronously read from socket executing remote query", 0) \
    \
    M(Bool, optimize_rewrite_sum_if_to_count_if, true, "Rewrite sumIf() and sum(if()) function countIf() function when logically equivalent", 0) \
+    M(Bool, allow_experimental_query_deduplication, false, "Allow sending parts' UUIDs for a query in order to deduplicate data parts if any", 0) \
+    \
    /** Obsolete settings that do nothing but left for compatibility reasons. Remove each one after half a year of obsolescence. */ \
    \
    M(UInt64, max_memory_usage_for_all_queries, 0, "Obsolete. Will be removed after 2020-10-20", 0) \
--- a/src/DataStreams/RemoteQueryExecutor.cpp
+++ b/src/DataStreams/RemoteQueryExecutor.cpp
@ -13,6 +13,7 @@
 #include <Interpreters/InternalTextLogsQueue.h>
 #include <IO/ConnectionTimeoutsContext.h>
 #include <Common/FiberStack.h>
+#include <Storages/MergeTree/MergeTreeDataPartUUID.h>

 namespace DB
 {
@ -20,6 +21,7 @@ namespace DB
 namespace ErrorCodes
 {
    extern const int UNKNOWN_PACKET_FROM_SERVER;
+    extern const int DUPLICATED_PART_UUIDS;
 }

 RemoteQueryExecutor::RemoteQueryExecutor(
@ -158,6 +160,7 @@ void RemoteQueryExecutor::sendQuery()
    std::lock_guard guard(was_cancelled_mutex);

    established = true;
+    was_cancelled = false;

    auto timeouts = ConnectionTimeouts::getTCPTimeoutsWithFailover(settings);
    ClientInfo modified_client_info = context.getClientInfo();
@ -167,6 +170,12 @@ void RemoteQueryExecutor::sendQuery()
        modified_client_info.client_trace_context = CurrentThread::get().thread_trace_context;
    }

+    {
+        std::lock_guard lock(duplicated_part_uuids_mutex);
+        if (!duplicated_part_uuids.empty())
+            multiplexed_connections->sendIgnoredPartUUIDs(duplicated_part_uuids);
+    }
+
    multiplexed_connections->sendQuery(timeouts, query, query_id, stage, modified_client_info, true);

    established = false;
@ -196,6 +205,8 @@ Block RemoteQueryExecutor::read()

        if (auto block = processPacket(std::move(packet)))
            return *block;
+        else if (got_duplicated_part_uuids)
+            return std::get<Block>(restartQueryWithoutDuplicatedUUIDs());
    }
 }

@ -211,7 +222,7 @@ std::variant<Block, int> RemoteQueryExecutor::read(std::unique_ptr<ReadContext>
            return Block();
    }

-    if (!read_context)
+    if (!read_context || resent_query)
    {
        std::lock_guard lock(was_cancelled_mutex);
        if (was_cancelled)
@ -234,6 +245,8 @@ std::variant<Block, int> RemoteQueryExecutor::read(std::unique_ptr<ReadContext>
        {
            if (auto data = processPacket(std::move(read_context->packet)))
                return std::move(*data);
+            else if (got_duplicated_part_uuids)
+                return restartQueryWithoutDuplicatedUUIDs(&read_context);
        }
    }
    while (true);
@ -242,10 +255,39 @@ std::variant<Block, int> RemoteQueryExecutor::read(std::unique_ptr<ReadContext>
 #endif
 }

+
+std::variant<Block, int> RemoteQueryExecutor::restartQueryWithoutDuplicatedUUIDs(std::unique_ptr<ReadContext> * read_context)
+{
+    /// Cancel previous query and disconnect before retry.
+    cancel(read_context);
+    multiplexed_connections->disconnect();
+
+    /// Only resend once, otherwise throw an exception
+    if (!resent_query)
+    {
+        if (log)
+            LOG_DEBUG(log, "Found duplicate UUIDs, will retry query without those parts");
+
+        resent_query = true;
+        sent_query = false;
+        got_duplicated_part_uuids = false;
+        /// Consecutive read will implicitly send query first.
+        if (!read_context)
+            return read();
+        else
+            return read(*read_context);
+    }
+    throw Exception("Found duplicate uuids while processing query.", ErrorCodes::DUPLICATED_PART_UUIDS);
+}
+
 std::optional<Block> RemoteQueryExecutor::processPacket(Packet packet)
 {
    switch (packet.type)
    {
+        case Protocol::Server::PartUUIDs:
+            if (!setPartUUIDs(packet.part_uuids))
+                got_duplicated_part_uuids = true;
+            break;
        case Protocol::Server::Data:
            /// If the block is not empty and is not a header block
            if (packet.block && (packet.block.rows() > 0))
@ -306,6 +348,20 @@ std::optional<Block> RemoteQueryExecutor::processPacket(Packet packet)
    return {};
 }

+bool RemoteQueryExecutor::setPartUUIDs(const std::vector<UUID> & uuids)
+{
+    Context & query_context = const_cast<Context &>(context).getQueryContext();
+    auto duplicates = query_context.getPartUUIDs()->add(uuids);
+
+    if (!duplicates.empty())
+    {
+        std::lock_guard lock(duplicated_part_uuids_mutex);
+        duplicated_part_uuids.insert(duplicated_part_uuids.begin(), duplicates.begin(), duplicates.end());
+        return false;
+    }
+    return true;
+}
+
 void RemoteQueryExecutor::finish(std::unique_ptr<ReadContext> * read_context)
 {
    /** If one of:
@ -383,6 +439,7 @@ void RemoteQueryExecutor::sendExternalTables()
    {
        std::lock_guard lock(external_tables_mutex);

+        external_tables_data.clear();
        external_tables_data.reserve(count);

        for (size_t i = 0; i < count; ++i)
--- a/src/DataStreams/RemoteQueryExecutor.h
+++ b/src/DataStreams/RemoteQueryExecutor.h
@ -57,6 +57,9 @@ public:
    /// Create connection and send query, external tables and scalars.
    void sendQuery();

+    /// Query is resent to a replica, the query itself can be modified.
+    std::atomic<bool> resent_query { false };
+
    /// Read next block of data. Returns empty block if query is finished.
    Block read();

@ -152,6 +155,14 @@ private:
      */
    std::atomic<bool> got_unknown_packet_from_replica { false };

+    /** Got duplicated uuids from replica
+      */
+    std::atomic<bool> got_duplicated_part_uuids{ false };
+
+    /// Parts uuids, collected from remote replicas
+    std::mutex duplicated_part_uuids_mutex;
+    std::vector<UUID> duplicated_part_uuids;
+
    PoolMode pool_mode = PoolMode::GET_MANY;
    StorageID main_table = StorageID::createEmpty();

@ -163,6 +174,14 @@ private:
    /// Send all temporary tables to remote servers
    void sendExternalTables();

+    /// Set part uuids to a query context, collected from remote replicas.
+    /// Return true if duplicates found.
+    bool setPartUUIDs(const std::vector<UUID> & uuids);
+
+    /// Cancell query and restart it with info about duplicated UUIDs
+    /// only for `allow_experimental_query_deduplication`.
+    std::variant<Block, int> restartQueryWithoutDuplicatedUUIDs(std::unique_ptr<ReadContext> * read_context = nullptr);
+
    /// If wasn't sent yet, send request to cancel all connections to replicas
    void tryCancel(const char * reason, std::unique_ptr<ReadContext> * read_context);

@ -174,6 +193,10 @@ private:

    /// Process packet for read and return data block if possible.
    std::optional<Block> processPacket(Packet packet);
+
+    /// Reads packet by packet
+    Block readPackets();
+
 };

 }
--- a/src/DataTypes/NumberTraits.h
+++ b/src/DataTypes/NumberTraits.h
@ -104,11 +104,16 @@ template <typename A, typename B> struct ResultOfIntegerDivision
        sizeof(A)>::Type;
 };

-/** Division with remainder you get a number with the same number of bits as in divisor.
-    */
+/** Division with remainder you get a number with the same number of bits as in divisor,
+  * or larger in case of signed type.
+  */
 template <typename A, typename B> struct ResultOfModulo
 {
-    using Type0 = typename Construct<is_signed_v<A> || is_signed_v<B>, false, sizeof(B)>::Type;
+    static constexpr bool result_is_signed = is_signed_v<A>;
+    /// If modulo of division can yield negative number, we need larger type to accommodate it.
+    /// Example: toInt32(-199) % toUInt8(200) will return -199 that does not fit in Int8, only in Int16.
+    static constexpr size_t size_of_result = result_is_signed ? nextSize(sizeof(B)) : sizeof(B);
+    using Type0 = typename Construct<result_is_signed, false, size_of_result>::Type;
    using Type = std::conditional_t<std::is_floating_point_v<A> || std::is_floating_point_v<B>, Float64, Type0>;
 };

--- a/src/Disks/DiskCacheWrapper.cpp
+++ b/src/Disks/DiskCacheWrapper.cpp
@ -108,7 +108,7 @@ DiskCacheWrapper::readFile(const String & path, size_t buf_size, size_t estimate
    if (!cache_file_predicate(path))
        return DiskDecorator::readFile(path, buf_size, estimated_size, aio_threshold, mmap_threshold);

-    LOG_DEBUG(&Poco::Logger::get("DiskS3"), "Read file {} from cache", backQuote(path));
+    LOG_DEBUG(&Poco::Logger::get("DiskCache"), "Read file {} from cache", backQuote(path));

    if (cache_disk->exists(path))
        return cache_disk->readFile(path, buf_size, estimated_size, aio_threshold, mmap_threshold);
@ -122,11 +122,11 @@ DiskCacheWrapper::readFile(const String & path, size_t buf_size, size_t estimate
        {
            /// This thread will responsible for file downloading to cache.
            metadata->status = DOWNLOADING;
-            LOG_DEBUG(&Poco::Logger::get("DiskS3"), "File {} doesn't exist in cache. Will download it", backQuote(path));
+            LOG_DEBUG(&Poco::Logger::get("DiskCache"), "File {} doesn't exist in cache. Will download it", backQuote(path));
        }
        else if (metadata->status == DOWNLOADING)
        {
-            LOG_DEBUG(&Poco::Logger::get("DiskS3"), "Waiting for file {} download to cache", backQuote(path));
+            LOG_DEBUG(&Poco::Logger::get("DiskCache"), "Waiting for file {} download to cache", backQuote(path));
            metadata->condition.wait(lock, [metadata] { return metadata->status == DOWNLOADED || metadata->status == ERROR; });
        }
    }
@ -139,7 +139,7 @@ DiskCacheWrapper::readFile(const String & path, size_t buf_size, size_t estimate
        {
            try
            {
-                auto dir_path = getDirectoryPath(path);
+                auto dir_path = directoryPath(path);
                if (!cache_disk->exists(dir_path))
                    cache_disk->createDirectories(dir_path);

@ -151,11 +151,11 @@ DiskCacheWrapper::readFile(const String & path, size_t buf_size, size_t estimate
                }
                cache_disk->moveFile(tmp_path, path);

-                LOG_DEBUG(&Poco::Logger::get("DiskS3"), "File {} downloaded to cache", backQuote(path));
+                LOG_DEBUG(&Poco::Logger::get("DiskCache"), "File {} downloaded to cache", backQuote(path));
            }
            catch (...)
            {
-                tryLogCurrentException("DiskS3", "Failed to download file + " + backQuote(path) + " to cache");
+                tryLogCurrentException("DiskCache", "Failed to download file + " + backQuote(path) + " to cache");
                result_status = ERROR;
            }
        }
@ -180,9 +180,9 @@ DiskCacheWrapper::writeFile(const String & path, size_t buf_size, WriteMode mode
    if (!cache_file_predicate(path))
        return DiskDecorator::writeFile(path, buf_size, mode);

-    LOG_DEBUG(&Poco::Logger::get("DiskS3"), "Write file {} to cache", backQuote(path));
+    LOG_DEBUG(&Poco::Logger::get("DiskCache"), "Write file {} to cache", backQuote(path));

-    auto dir_path = getDirectoryPath(path);
+    auto dir_path = directoryPath(path);
    if (!cache_disk->exists(dir_path))
        cache_disk->createDirectories(dir_path);

@ -217,7 +217,7 @@ void DiskCacheWrapper::moveFile(const String & from_path, const String & to_path
 {
    if (cache_disk->exists(from_path))
    {
-        auto dir_path = getDirectoryPath(to_path);
+        auto dir_path = directoryPath(to_path);
        if (!cache_disk->exists(dir_path))
            cache_disk->createDirectories(dir_path);

@ -230,7 +230,7 @@ void DiskCacheWrapper::replaceFile(const String & from_path, const String & to_p
 {
    if (cache_disk->exists(from_path))
    {
-        auto dir_path = getDirectoryPath(to_path);
+        auto dir_path = directoryPath(to_path);
        if (!cache_disk->exists(dir_path))
            cache_disk->createDirectories(dir_path);

@ -239,19 +239,6 @@ void DiskCacheWrapper::replaceFile(const String & from_path, const String & to_p
    DiskDecorator::replaceFile(from_path, to_path);
 }

-void DiskCacheWrapper::copyFile(const String & from_path, const String & to_path)
-{
-    if (cache_disk->exists(from_path))
-    {
-        auto dir_path = getDirectoryPath(to_path);
-        if (!cache_disk->exists(dir_path))
-            cache_disk->createDirectories(dir_path);
-
-        cache_disk->copyFile(from_path, to_path);
-    }
-    DiskDecorator::copyFile(from_path, to_path);
-}
-
 void DiskCacheWrapper::removeFile(const String & path)
 {
    cache_disk->removeFileIfExists(path);
@ -280,9 +267,10 @@ void DiskCacheWrapper::removeRecursive(const String & path)

 void DiskCacheWrapper::createHardLink(const String & src_path, const String & dst_path)
 {
-    if (cache_disk->exists(src_path))
+    /// Don't create hardlinks for cache files to shadow directory as it just waste cache disk space.
+    if (cache_disk->exists(src_path) && !dst_path.starts_with("shadow/"))
    {
-        auto dir_path = getDirectoryPath(dst_path);
+        auto dir_path = directoryPath(dst_path);
        if (!cache_disk->exists(dir_path))
            cache_disk->createDirectories(dir_path);

@ -303,11 +291,6 @@ void DiskCacheWrapper::createDirectories(const String & path)
    DiskDecorator::createDirectories(path);
 }

-inline String DiskCacheWrapper::getDirectoryPath(const String & path)
-{
-    return Poco::Path{path}.setFileName("").toString();
-}
-
 /// TODO: Current reservation mechanism leaks IDisk abstraction details.
 /// This hack is needed to return proper disk pointer (wrapper instead of implementation) from reservation object.
 class ReservationDelegate : public IReservation
--- a/src/Disks/DiskCacheWrapper.h
+++ b/src/Disks/DiskCacheWrapper.h
@ -32,7 +32,6 @@ public:
    void moveDirectory(const String & from_path, const String & to_path) override;
    void moveFile(const String & from_path, const String & to_path) override;
    void replaceFile(const String & from_path, const String & to_path) override;
-    void copyFile(const String & from_path, const String & to_path) override;
    std::unique_ptr<ReadBufferFromFileBase>
    readFile(const String & path, size_t buf_size, size_t estimated_size, size_t aio_threshold, size_t mmap_threshold) const override;
    std::unique_ptr<WriteBufferFromFileBase>
@ -46,7 +45,6 @@ public:

 private:
    std::shared_ptr<FileDownloadMetadata> acquireDownloadMetadata(const String & path) const;
-    static String getDirectoryPath(const String & path);

    /// Disk to cache files.
    std::shared_ptr<DiskLocal> cache_disk;
--- a/src/Disks/DiskDecorator.cpp
+++ b/src/Disks/DiskDecorator.cpp
@ -103,11 +103,6 @@ void DiskDecorator::replaceFile(const String & from_path, const String & to_path
    delegate->replaceFile(from_path, to_path);
 }

-void DiskDecorator::copyFile(const String & from_path, const String & to_path)
-{
-    delegate->copyFile(from_path, to_path);
-}
-
 void DiskDecorator::copy(const String & from_path, const std::shared_ptr<IDisk> & to_disk, const String & to_path)
 {
    delegate->copy(from_path, to_disk, to_path);
@ -185,4 +180,9 @@ SyncGuardPtr DiskDecorator::getDirectorySyncGuard(const String & path) const
    return delegate->getDirectorySyncGuard(path);
 }

+void DiskDecorator::onFreeze(const String & path)
+{
+    delegate->onFreeze(path);
+}
+
 }
--- a/src/Disks/DiskDecorator.h
+++ b/src/Disks/DiskDecorator.h
@ -32,7 +32,6 @@ public:
    void createFile(const String & path) override;
    void moveFile(const String & from_path, const String & to_path) override;
    void replaceFile(const String & from_path, const String & to_path) override;
-    void copyFile(const String & from_path, const String & to_path) override;
    void copy(const String & from_path, const std::shared_ptr<IDisk> & to_disk, const String & to_path) override;
    void listFiles(const String & path, std::vector<String> & file_names) override;
    std::unique_ptr<ReadBufferFromFileBase>
@ -48,8 +47,9 @@ public:
    void setReadOnly(const String & path) override;
    void createHardLink(const String & src_path, const String & dst_path) override;
    void truncateFile(const String & path, size_t size) override;
-    const String getType() const override { return delegate->getType(); }
+    DiskType::Type getType() const override { return delegate->getType(); }
    Executor & getExecutor() override;
+    void onFreeze(const String & path) override;
    SyncGuardPtr getDirectorySyncGuard(const String & path) const override;

 protected:
--- a/src/Disks/DiskLocal.cpp
+++ b/src/Disks/DiskLocal.cpp
@ -218,11 +218,6 @@ void DiskLocal::replaceFile(const String & from_path, const String & to_path)
        from_file.renameTo(to_file.path());
 }

-void DiskLocal::copyFile(const String & from_path, const String & to_path)
-{
-    Poco::File(disk_path + from_path).copyTo(disk_path + to_path);
-}
-
 std::unique_ptr<ReadBufferFromFileBase>
 DiskLocal::readFile(const String & path, size_t buf_size, size_t estimated_size, size_t aio_threshold, size_t mmap_threshold) const
 {
--- a/src/Disks/DiskLocal.h
+++ b/src/Disks/DiskLocal.h
@ -67,8 +67,6 @@ public:

    void replaceFile(const String & from_path, const String & to_path) override;

-    void copyFile(const String & from_path, const String & to_path) override;
-
    void copy(const String & from_path, const std::shared_ptr<IDisk> & to_disk, const String & to_path) override;

    void listFiles(const String & path, std::vector<String> & file_names) override;
@ -100,7 +98,7 @@ public:

    void truncateFile(const String & path, size_t size) override;

-    const String getType() const override { return "local"; }
+    DiskType::Type getType() const override { return DiskType::Type::Local; }

    SyncGuardPtr getDirectorySyncGuard(const String & path) const override;

--- a/src/Disks/DiskMemory.cpp
+++ b/src/Disks/DiskMemory.cpp
@ -314,11 +314,6 @@ void DiskMemory::replaceFileImpl(const String & from_path, const String & to_pat
    files.insert(std::move(node));
 }

-void DiskMemory::copyFile(const String & /*from_path*/, const String & /*to_path*/)
-{
-    throw Exception("Method copyFile is not implemented for memory disks", ErrorCodes::NOT_IMPLEMENTED);
-}
-
 std::unique_ptr<ReadBufferFromFileBase> DiskMemory::readFile(const String & path, size_t /*buf_size*/, size_t, size_t, size_t) const
 {
    std::lock_guard lock(mutex);
--- a/src/Disks/DiskMemory.h
+++ b/src/Disks/DiskMemory.h
@ -60,8 +60,6 @@ public:

    void replaceFile(const String & from_path, const String & to_path) override;

-    void copyFile(const String & from_path, const String & to_path) override;
-
    void listFiles(const String & path, std::vector<String> & file_names) override;

    std::unique_ptr<ReadBufferFromFileBase> readFile(
@ -91,7 +89,7 @@ public:

    void truncateFile(const String & path, size_t size) override;

-    const String getType() const override { return "memory"; }
+    DiskType::Type getType() const override { return DiskType::Type::RAM; }

 private:
    void createDirectoriesImpl(const String & path);
--- a/src/Disks/IDisk.h
+++ b/src/Disks/IDisk.h
@ -57,6 +57,29 @@ public:

 using SpacePtr = std::shared_ptr<Space>;

+struct DiskType
+{
+    enum class Type
+    {
+        Local,
+        RAM,
+        S3
+    };
+    static String toString(Type disk_type)
+    {
+        switch (disk_type)
+        {
+            case Type::Local:
+                return "local";
+            case Type::RAM:
+                return "memory";
+            case Type::S3:
+                return "s3";
+        }
+        __builtin_unreachable();
+    }
+};
+
 /**
 * A guard, that should synchronize file's or directory's state
 * with storage device (e.g. fsync in POSIX) in its destructor.
@ -140,9 +163,6 @@ public:
    /// If a file with `to_path` path already exists, it will be replaced.
    virtual void replaceFile(const String & from_path, const String & to_path) = 0;

-    /// Copy the file from `from_path` to `to_path`.
-    virtual void copyFile(const String & from_path, const String & to_path) = 0;
-
    /// Recursively copy data containing at `from_path` to `to_path` located at `to_disk`.
    virtual void copy(const String & from_path, const std::shared_ptr<IDisk> & to_disk, const String & to_path);

@ -191,7 +211,7 @@ public:
    virtual void truncateFile(const String & path, size_t size);

    /// Return disk type - "local", "s3", etc.
-    virtual const String getType() const = 0;
+    virtual DiskType::Type getType() const = 0;

    /// Invoked when Global Context is shutdown.
    virtual void shutdown() { }
@ -199,6 +219,9 @@ public:
    /// Returns executor to perform asynchronous operations.
    virtual Executor & getExecutor() { return *executor; }

+    /// Invoked on partitions freeze query.
+    virtual void onFreeze(const String &) { }
+
    /// Returns guard, that insures synchronization of directory metadata with storage device.
    virtual SyncGuardPtr getDirectorySyncGuard(const String & path) const;

@ -269,4 +292,11 @@ inline String fileName(const String & path)
 {
    return Poco::Path(path).getFileName();
 }
+
+/// Return directory path for the specified path.
+inline String directoryPath(const String & path)
+{
+    return Poco::Path(path).setFileName("").toString();
+}
+
 }
--- a/src/Disks/S3/DiskS3.cpp
+++ b/src/Disks/S3/DiskS3.cpp
@ -23,6 +23,8 @@
 #include <aws/s3/model/CopyObjectRequest.h>
 #include <aws/s3/model/DeleteObjectsRequest.h>
 #include <aws/s3/model/GetObjectRequest.h>
+#include <aws/s3/model/ListObjectsV2Request.h>
+#include <aws/s3/model/HeadObjectRequest.h>

 #include <boost/algorithm/string.hpp>

@ -32,12 +34,15 @@ namespace DB

 namespace ErrorCodes
 {
+    extern const int S3_ERROR;
    extern const int FILE_ALREADY_EXISTS;
    extern const int CANNOT_SEEK_THROUGH_FILE;
    extern const int UNKNOWN_FORMAT;
    extern const int INCORRECT_DISK_INDEX;
+    extern const int BAD_ARGUMENTS;
    extern const int PATH_ACCESS_DENIED;
    extern const int CANNOT_DELETE_DIRECTORY;
+    extern const int LOGICAL_ERROR;
 }


@ -76,12 +81,12 @@ String getRandomName()
 }

 template <typename Result, typename Error>
-void throwIfError(Aws::Utils::Outcome<Result, Error> && response)
+void throwIfError(Aws::Utils::Outcome<Result, Error> & response)
 {
    if (!response.IsSuccess())
    {
        const auto & err = response.GetError();
-        throw Exception(err.GetMessage(), static_cast<int>(err.GetErrorType()));
+        throw Exception(std::to_string(static_cast<int>(err.GetErrorType())) + ": " + err.GetMessage(), ErrorCodes::S3_ERROR);
    }
 }

@ -244,7 +249,7 @@ public:
        if (whence == SEEK_CUR)
        {
            /// If position within current working buffer - shift pos.
-            if (working_buffer.size() && size_t(getPosition() + offset_) < absolute_position)
+            if (!working_buffer.empty() && size_t(getPosition() + offset_) < absolute_position)
            {
                pos += offset_;
                return getPosition();
@ -257,7 +262,7 @@ public:
        else if (whence == SEEK_SET)
        {
            /// If position within current working buffer - shift pos.
-            if (working_buffer.size() && size_t(offset_) >= absolute_position - working_buffer.size()
+            if (!working_buffer.empty() && size_t(offset_) >= absolute_position - working_buffer.size()
                && size_t(offset_) < absolute_position)
            {
                pos = working_buffer.end() - (absolute_position - offset_);
@ -500,17 +505,17 @@ private:
    CurrentMetrics::Increment metric_increment;
 };

-/// Runs tasks asynchronously using global thread pool.
+/// Runs tasks asynchronously using thread pool.
 class AsyncExecutor : public Executor
 {
 public:
-    explicit AsyncExecutor() = default;
+    explicit AsyncExecutor(int thread_pool_size) : pool(ThreadPool(thread_pool_size)) { }

    std::future<void> execute(std::function<void()> task) override
    {
        auto promise = std::make_shared<std::promise<void>>();

-        GlobalThreadPool::instance().scheduleOrThrowOnError(
+        pool.scheduleOrThrowOnError(
            [promise, task]()
            {
                try
@ -531,6 +536,9 @@ public:

        return promise->get_future();
    }
+
+private:
+    ThreadPool pool;
 };


@ -544,8 +552,10 @@ DiskS3::DiskS3(
    size_t min_upload_part_size_,
    size_t max_single_part_upload_size_,
    size_t min_bytes_for_seek_,
-    bool send_metadata_)
-    : IDisk(std::make_unique<AsyncExecutor>())
+    bool send_metadata_,
+    int thread_pool_size_,
+    int list_object_keys_size_)
+    : IDisk(std::make_unique<AsyncExecutor>(thread_pool_size_))
    , name(std::move(name_))
    , client(std::move(client_))
    , proxy_configuration(std::move(proxy_configuration_))
@ -556,6 +566,8 @@ DiskS3::DiskS3(
    , max_single_part_upload_size(max_single_part_upload_size_)
    , min_bytes_for_seek(min_bytes_for_seek_)
    , send_metadata(send_metadata_)
+    , revision_counter(0)
+    , list_object_keys_size(list_object_keys_size_)
 {
 }

@ -613,45 +625,31 @@ void DiskS3::moveFile(const String & from_path, const String & to_path)
 {
    if (exists(to_path))
        throw Exception("File already exists: " + to_path, ErrorCodes::FILE_ALREADY_EXISTS);
+
+    if (send_metadata)
+    {
+        auto revision = ++revision_counter;
+        const DiskS3::ObjectMetadata object_metadata {
+            {"from_path", from_path},
+            {"to_path", to_path}
+        };
+        createFileOperationObject("rename", revision, object_metadata);
+    }
+
    Poco::File(metadata_path + from_path).renameTo(metadata_path + to_path);
 }

 void DiskS3::replaceFile(const String & from_path, const String & to_path)
 {
-    Poco::File from_file(metadata_path + from_path);
-    Poco::File to_file(metadata_path + to_path);
-    if (to_file.exists())
+    if (exists(to_path))
    {
-        Poco::File tmp_file(metadata_path + to_path + ".old");
-        to_file.renameTo(tmp_file.path());
-        from_file.renameTo(metadata_path + to_path);
-        removeFile(to_path + ".old");
+        const String tmp_path = to_path + ".old";
+        moveFile(to_path, tmp_path);
+        moveFile(from_path, to_path);
+        removeFile(tmp_path);
    }
    else
-        from_file.renameTo(to_file.path());
-}
-
-void DiskS3::copyFile(const String & from_path, const String & to_path)
-{
-    if (exists(to_path))
-        removeFile(to_path);
-
-    auto from = readMeta(from_path);
-    auto to = createMeta(to_path);
-
-    for (const auto & [path, size] : from.s3_objects)
-    {
-        auto new_path = getRandomName();
-        Aws::S3::Model::CopyObjectRequest req;
-        req.SetCopySource(bucket + "/" + s3_root_path + path);
-        req.SetBucket(bucket);
-        req.SetKey(s3_root_path + new_path);
-        throwIfError(client->CopyObject(req));
-
-        to.addObject(new_path, size);
-    }
-
-    to.save();
+        moveFile(from_path, to_path);
 }

 std::unique_ptr<ReadBufferFromFileBase> DiskS3::readFile(const String & path, size_t buf_size, size_t, size_t, size_t) const
@ -673,7 +671,17 @@ std::unique_ptr<WriteBufferFromFileBase> DiskS3::writeFile(const String & path,

    /// Path to store new S3 object.
    auto s3_path = getRandomName();
-    auto object_metadata = createObjectMetadata(path);
+
+    std::optional<ObjectMetadata> object_metadata;
+    if (send_metadata)
+    {
+        auto revision = ++revision_counter;
+        object_metadata = {
+            {"path", path}
+        };
+        s3_path = "r" + revisionToString(revision) + "-file-" + s3_path;
+    }
+
    if (!exist || mode == WriteMode::Rewrite)
    {
        /// If metadata file exists - remove and create new.
@ -777,7 +785,8 @@ void DiskS3::removeAws(const AwsS3KeyKeeper & keys)
            Aws::S3::Model::DeleteObjectsRequest request;
            request.SetBucket(bucket);
            request.SetDelete(delkeys);
-            throwIfError(client->DeleteObjects(request));
+            auto outcome = client->DeleteObjects(request);
+            throwIfError(outcome);
        }
    }
 }
@ -852,6 +861,17 @@ Poco::Timestamp DiskS3::getLastModified(const String & path)

 void DiskS3::createHardLink(const String & src_path, const String & dst_path)
 {
+    /// We don't need to record hardlinks created to shadow folder.
+    if (send_metadata && !dst_path.starts_with("shadow/"))
+    {
+        auto revision = ++revision_counter;
+        const ObjectMetadata object_metadata {
+            {"src_path", src_path},
+            {"dst_path", dst_path}
+        };
+        createFileOperationObject("hardlink", revision, object_metadata);
+    }
+
    /// Increment number of references.
    auto src = readMeta(src_path);
    ++src.ref_count;
@ -886,12 +906,368 @@ void DiskS3::shutdown()
    client->DisableRequestProcessing();
 }

-std::optional<DiskS3::ObjectMetadata> DiskS3::createObjectMetadata(const String & path) const
+void DiskS3::createFileOperationObject(const String & operation_name, UInt64 revision, const DiskS3::ObjectMetadata & metadata)
 {
-    if (send_metadata)
-        return (DiskS3::ObjectMetadata){{"path", path}};
+    const String key = "operations/r" + revisionToString(revision) + "-" + operation_name;
+    WriteBufferFromS3 buffer(client, bucket, s3_root_path + key, min_upload_part_size, max_single_part_upload_size, metadata);
+    buffer.write('0');
+    buffer.finalize();
+}

-    return {};
+void DiskS3::startup()
+{
+    if (!send_metadata)
+        return;
+
+    LOG_INFO(&Poco::Logger::get("DiskS3"), "Starting up disk {}", name);
+
+    /// Find last revision.
+    UInt64 l = 0, r = LATEST_REVISION;
+    while (l < r)
+    {
+        LOG_DEBUG(&Poco::Logger::get("DiskS3"), "Check revision in bounds {}-{}", l, r);
+
+        auto revision = l + (r - l + 1) / 2;
+        auto revision_str = revisionToString(revision);
+
+        LOG_DEBUG(&Poco::Logger::get("DiskS3"), "Check object with revision {}", revision);
+
+        /// Check file or operation with such revision exists.
+        if (checkObjectExists(s3_root_path + "r" + revision_str)
+            || checkObjectExists(s3_root_path + "operations/r" + revision_str))
+            l = revision;
+        else
+            r = revision - 1;
+    }
+    revision_counter = l;
+    LOG_INFO(&Poco::Logger::get("DiskS3"), "Found last revision number {} for disk {}", revision_counter, name);
+}
+
+bool DiskS3::checkObjectExists(const String & prefix)
+{
+    Aws::S3::Model::ListObjectsV2Request request;
+    request.SetBucket(bucket);
+    request.SetPrefix(prefix);
+    request.SetMaxKeys(1);
+
+    auto outcome = client->ListObjectsV2(request);
+    throwIfError(outcome);
+
+    return !outcome.GetResult().GetContents().empty();
+}
+
+Aws::S3::Model::HeadObjectResult DiskS3::headObject(const String & source_bucket, const String & key)
+{
+    Aws::S3::Model::HeadObjectRequest request;
+    request.SetBucket(source_bucket);
+    request.SetKey(key);
+
+    auto outcome = client->HeadObject(request);
+    throwIfError(outcome);
+
+    return outcome.GetResultWithOwnership();
+}
+
+void DiskS3::listObjects(const String & source_bucket, const String & source_path, std::function<bool(const Aws::S3::Model::ListObjectsV2Result &)> callback)
+{
+    Aws::S3::Model::ListObjectsV2Request request;
+    request.SetBucket(source_bucket);
+    request.SetPrefix(source_path);
+    request.SetMaxKeys(list_object_keys_size);
+
+    Aws::S3::Model::ListObjectsV2Outcome outcome;
+    do
+    {
+        outcome = client->ListObjectsV2(request);
+        throwIfError(outcome);
+
+        bool should_continue = callback(outcome.GetResult());
+
+        if (!should_continue)
+            break;
+
+        request.SetContinuationToken(outcome.GetResult().GetNextContinuationToken());
+    } while (outcome.GetResult().GetIsTruncated());
+}
+
+void DiskS3::copyObject(const String & src_bucket, const String & src_key, const String & dst_bucket, const String & dst_key)
+{
+    Aws::S3::Model::CopyObjectRequest request;
+    request.SetCopySource(src_bucket + "/" + src_key);
+    request.SetBucket(dst_bucket);
+    request.SetKey(dst_key);
+
+    auto outcome = client->CopyObject(request);
+    throwIfError(outcome);
+}
+
+struct DiskS3::RestoreInformation
+{
+    UInt64 revision = LATEST_REVISION;
+    String source_bucket;
+    String source_path;
+};
+
+void DiskS3::readRestoreInformation(DiskS3::RestoreInformation & restore_information)
+{
+    ReadBufferFromFile buffer(metadata_path + restore_file_name, 512);
+    buffer.next();
+
+    /// Empty file - just restore all metadata.
+    if (!buffer.hasPendingData())
+        return;
+
+    try
+    {
+        readIntText(restore_information.revision, buffer);
+        assertChar('\n', buffer);
+
+        if (!buffer.hasPendingData())
+            return;
+
+        readText(restore_information.source_bucket, buffer);
+        assertChar('\n', buffer);
+
+        if (!buffer.hasPendingData())
+            return;
+
+        readText(restore_information.source_path, buffer);
+        assertChar('\n', buffer);
+
+        if (buffer.hasPendingData())
+            throw Exception("Extra information at the end of restore file", ErrorCodes::UNKNOWN_FORMAT);
+    }
+    catch (const Exception & e)
+    {
+        throw Exception("Failed to read restore information", e, ErrorCodes::UNKNOWN_FORMAT);
+    }
+}
+
+void DiskS3::restore()
+{
+    if (!exists(restore_file_name))
+        return;
+
+    try
+    {
+        RestoreInformation information;
+        information.source_bucket = bucket;
+        information.source_path = s3_root_path;
+
+        readRestoreInformation(information);
+        if (information.revision == 0)
+            information.revision = LATEST_REVISION;
+        if (!information.source_path.ends_with('/'))
+            information.source_path += '/';
+
+        if (information.source_bucket == bucket)
+        {
+            /// In this case we need to additionally cleanup S3 from objects with later revision.
+            /// Will be simply just restore to different path.
+            if (information.source_path == s3_root_path && information.revision != LATEST_REVISION)
+                throw Exception("Restoring to the same bucket and path is allowed if revision is latest (0)", ErrorCodes::BAD_ARGUMENTS);
+
+            /// This case complicates S3 cleanup in case of unsuccessful restore.
+            if (information.source_path != s3_root_path && s3_root_path.starts_with(information.source_path))
+                throw Exception("Restoring to the same bucket is allowed only if source path is not a sub-path of configured path in S3 disk", ErrorCodes::BAD_ARGUMENTS);
+        }
+
+        ///TODO: Cleanup FS and bucket if previous restore was failed.
+
+        LOG_INFO(&Poco::Logger::get("DiskS3"), "Starting to restore disk {}. Revision: {}, Source bucket: {}, Source path: {}",
+                 name, information.revision, information.source_bucket, information.source_path);
+
+        restoreFiles(information.source_bucket, information.source_path, information.revision);
+        restoreFileOperations(information.source_bucket, information.source_path, information.revision);
+
+        Poco::File restore_file(metadata_path + restore_file_name);
+        restore_file.remove();
+
+        LOG_INFO(&Poco::Logger::get("DiskS3"), "Restore disk {} finished", name);
+    }
+    catch (const Exception & e)
+    {
+        LOG_ERROR(&Poco::Logger::get("DiskS3"), "Failed to restore disk. Code: {}, e.displayText() = {}, Stack trace:\n\n{}", e.code(), e.displayText(), e.getStackTraceString());
+
+        throw;
+    }
+}
+
+void DiskS3::restoreFiles(const String & source_bucket, const String & source_path, UInt64 target_revision)
+{
+    LOG_INFO(&Poco::Logger::get("DiskS3"), "Starting restore files for disk {}", name);
+
+    std::vector<std::future<void>> results;
+    listObjects(source_bucket, source_path, [this, &source_bucket, &source_path, &target_revision, &results](auto list_result)
+    {
+        std::vector<String> keys;
+        for (const auto & row : list_result.GetContents())
+        {
+            const String & key = row.GetKey();
+
+            /// Skip file operations objects. They will be processed separately.
+            if (key.find("/operations/") != String::npos)
+                continue;
+
+            const auto [revision, _] = extractRevisionAndOperationFromKey(key);
+            /// Filter early if it's possible to get revision from key.
+            if (revision > target_revision)
+                continue;
+
+            keys.push_back(key);
+        }
+
+        if (!keys.empty())
+        {
+            auto result = getExecutor().execute([this, &source_bucket, &source_path, keys]()
+            {
+                processRestoreFiles(source_bucket, source_path, keys);
+            });
+
+            results.push_back(std::move(result));
+        }
+
+        return true;
+    });
+
+    for (auto & result : results)
+        result.wait();
+    for (auto & result : results)
+        result.get();
+
+    LOG_INFO(&Poco::Logger::get("DiskS3"), "Files are restored for disk {}", name);
+}
+
+void DiskS3::processRestoreFiles(const String & source_bucket, const String & source_path, Strings keys)
+{
+    for (const auto & key : keys)
+    {
+        auto head_result = headObject(source_bucket, key);
+        auto object_metadata = head_result.GetMetadata();
+
+        /// Restore file if object has 'path' in metadata.
+        auto path_entry = object_metadata.find("path");
+        if (path_entry == object_metadata.end())
+            throw Exception("Failed to restore key " + key + " because it doesn't have 'path' in metadata", ErrorCodes::S3_ERROR);
+
+        const auto & path = path_entry->second;
+
+        createDirectories(directoryPath(path));
+        auto metadata = createMeta(path);
+        auto relative_key = shrinkKey(source_path, key);
+
+        /// Copy object if we restore to different bucket / path.
+        if (bucket != source_bucket || s3_root_path != source_path)
+            copyObject(source_bucket, key, bucket, s3_root_path + relative_key);
+
+        metadata.addObject(relative_key, head_result.GetContentLength());
+        metadata.save();
+
+        LOG_DEBUG(&Poco::Logger::get("DiskS3"), "Restored file {}", path);
+    }
+}
+
+void DiskS3::restoreFileOperations(const String & source_bucket, const String & source_path, UInt64 target_revision)
+{
+    LOG_INFO(&Poco::Logger::get("DiskS3"), "Starting restore file operations for disk {}", name);
+
+    /// Enable recording file operations if we restore to different bucket / path.
+    send_metadata = bucket != source_bucket || s3_root_path != source_path;
+
+    listObjects(source_bucket, source_path + "operations/", [this, &source_bucket, &target_revision](auto list_result)
+    {
+        const String rename = "rename";
+        const String hardlink = "hardlink";
+
+        for (const auto & row : list_result.GetContents())
+        {
+            const String & key = row.GetKey();
+
+            const auto [revision, operation] = extractRevisionAndOperationFromKey(key);
+            if (revision == UNKNOWN_REVISION)
+            {
+                LOG_WARNING(&Poco::Logger::get("DiskS3"), "Skip key {} with unknown revision", key);
+                continue;
+            }
+
+            /// S3 ensures that keys will be listed in ascending UTF-8 bytes order (revision order).
+            /// We can stop processing if revision of the object is already more than required.
+            if (revision > target_revision)
+                return false;
+
+            /// Keep original revision if restore to different bucket / path.
+            if (send_metadata)
+                revision_counter = revision - 1;
+
+            auto object_metadata = headObject(source_bucket, key).GetMetadata();
+            if (operation == rename)
+            {
+                auto from_path = object_metadata["from_path"];
+                auto to_path = object_metadata["to_path"];
+                if (exists(from_path))
+                {
+                    moveFile(from_path, to_path);
+                    LOG_DEBUG(&Poco::Logger::get("DiskS3"), "Revision {}. Restored rename {} -> {}", revision, from_path, to_path);
+                }
+            }
+            else if (operation == hardlink)
+            {
+                auto src_path = object_metadata["src_path"];
+                auto dst_path = object_metadata["dst_path"];
+                if (exists(src_path))
+                {
+                    createDirectories(directoryPath(dst_path));
+                    createHardLink(src_path, dst_path);
+                    LOG_DEBUG(&Poco::Logger::get("DiskS3"), "Revision {}. Restored hardlink {} -> {}", revision, src_path, dst_path);
+                }
+            }
+        }
+
+        return true;
+    });
+
+    send_metadata = true;
+
+    LOG_INFO(&Poco::Logger::get("DiskS3"), "File operations restored for disk {}", name);
+}
+
+std::tuple<UInt64, String> DiskS3::extractRevisionAndOperationFromKey(const String & key)
+{
+    UInt64 revision = UNKNOWN_REVISION;
+    String operation;
+
+    re2::RE2::FullMatch(key, key_regexp, &revision, &operation);
+
+    return {revision, operation};
+}
+
+String DiskS3::shrinkKey(const String & path, const String & key)
+{
+    if (!key.starts_with(path))
+        throw Exception("The key " + key + " prefix mismatch with given " + path, ErrorCodes::LOGICAL_ERROR);
+
+    return key.substr(path.length());
+}
+
+String DiskS3::revisionToString(UInt64 revision)
+{
+    static constexpr size_t max_digits = 19; /// UInt64 max digits in decimal representation.
+
+    /// Align revision number with leading zeroes to have strict lexicographical order of them.
+    auto revision_str = std::to_string(revision);
+    auto digits_to_align = max_digits - revision_str.length();
+    for (size_t i = 0; i < digits_to_align; ++i)
+        revision_str = "0" + revision_str;
+
+    return revision_str;
+}
+
+void DiskS3::onFreeze(const String & path)
+{
+    createDirectories(path);
+    WriteBufferFromFile revision_file_buf(metadata_path + path + "revision.txt", 32);
+    writeIntText(revision_counter.load(), revision_file_buf);
+    revision_file_buf.finalize();
 }

 }
--- a/src/Disks/S3/DiskS3.h
+++ b/src/Disks/S3/DiskS3.h
@ -1,11 +1,16 @@
 #pragma once

+#include <atomic>
 #include "Disks/DiskFactory.h"
 #include "Disks/Executor.h"
 #include "ProxyConfiguration.h"

 #include <aws/s3/S3Client.h>
+#include <aws/s3/model/HeadObjectResult.h>
+#include <aws/s3/model/ListObjectsV2Result.h>
+
 #include <Poco/DirectoryIterator.h>
+#include <re2/re2.h>


 namespace DB
@ -25,6 +30,7 @@ public:

    class AwsS3KeyKeeper;
    struct Metadata;
+    struct RestoreInformation;

    DiskS3(
        String name_,
@ -36,7 +42,9 @@ public:
        size_t min_upload_part_size_,
        size_t max_single_part_upload_size_,
        size_t min_bytes_for_seek_,
-        bool send_metadata_);
+        bool send_metadata_,
+        int thread_pool_size_,
+        int list_object_keys_size_);

    const String & getName() const override { return name; }

@ -74,8 +82,6 @@ public:

    void replaceFile(const String & from_path, const String & to_path) override;

-    void copyFile(const String & from_path, const String & to_path) override;
-
    void listFiles(const String & path, std::vector<String> & file_names) override;

    std::unique_ptr<ReadBufferFromFileBase> readFile(
@ -105,22 +111,47 @@ public:

    void setReadOnly(const String & path) override;

-    const String getType() const override { return "s3"; }
+    DiskType::Type getType() const override { return DiskType::Type::S3; }

    void shutdown() override;

+    /// Actions performed after disk creation.
+    void startup();
+
+    /// Restore S3 metadata files on file system.
+    void restore();
+
+    /// Dumps current revision counter into file 'revision.txt' at given path.
+    void onFreeze(const String & path) override;
+
 private:
    bool tryReserve(UInt64 bytes);

    void removeMeta(const String & path, AwsS3KeyKeeper & keys);
    void removeMetaRecursive(const String & path, AwsS3KeyKeeper & keys);
    void removeAws(const AwsS3KeyKeeper & keys);
-    std::optional<ObjectMetadata> createObjectMetadata(const String & path) const;

    Metadata readMeta(const String & path) const;
    Metadata createMeta(const String & path) const;

-private:
+    void createFileOperationObject(const String & operation_name, UInt64 revision, const ObjectMetadata & metadata);
+    static String revisionToString(UInt64 revision);
+
+    bool checkObjectExists(const String & prefix);
+    Aws::S3::Model::HeadObjectResult headObject(const String & source_bucket, const String & key);
+    void listObjects(const String & source_bucket, const String & source_path, std::function<bool(const Aws::S3::Model::ListObjectsV2Result &)> callback);
+    void copyObject(const String & src_bucket, const String & src_key, const String & dst_bucket, const String & dst_key);
+
+    void readRestoreInformation(RestoreInformation & restore_information);
+    void restoreFiles(const String & source_bucket, const String & source_path, UInt64 target_revision);
+    void processRestoreFiles(const String & source_bucket, const String & source_path, std::vector<String> keys);
+    void restoreFileOperations(const String & source_bucket, const String & source_path, UInt64 target_revision);
+
+    /// Remove 'path' prefix from 'key' to get relative key.
+    /// It's needed to store keys to metadata files in RELATIVE_PATHS version.
+    static String shrinkKey(const String & path, const String & key);
+    std::tuple<UInt64, String> extractRevisionAndOperationFromKey(const String & key);
+
    const String name;
    std::shared_ptr<Aws::S3::S3Client> client;
    std::shared_ptr<S3::ProxyConfiguration> proxy_configuration;
@ -135,6 +166,18 @@ private:
    UInt64 reserved_bytes = 0;
    UInt64 reservation_count = 0;
    std::mutex reservation_mutex;
+
+    std::atomic<UInt64> revision_counter;
+    static constexpr UInt64 LATEST_REVISION = (static_cast<UInt64>(1)) << 63;
+    static constexpr UInt64 UNKNOWN_REVISION = 0;
+
+    /// File at path {metadata_path}/restore contains metadata restore information
+    const String restore_file_name = "restore";
+    /// The number of keys listed in one request (1000 is max value)
+    int list_object_keys_size;
+
+    /// Key has format: ../../r{revision}-{operation}
+    const re2::RE2 key_regexp {".*/r(\\d+)-(\\w+).*"};
 };

 }
--- a/src/Disks/S3/registerDiskS3.cpp
+++ b/src/Disks/S3/registerDiskS3.cpp
@ -152,7 +152,9 @@ void registerDiskS3(DiskFactory & factory)
            context.getSettingsRef().s3_min_upload_part_size,
            context.getSettingsRef().s3_max_single_part_upload_size,
            config.getUInt64(config_prefix + ".min_bytes_for_seek", 1024 * 1024),
-            config.getBool(config_prefix + ".send_object_metadata", false));
+            config.getBool(config_prefix + ".send_metadata", false),
+            config.getInt(config_prefix + ".thread_pool_size", 16),
+            config.getInt(config_prefix + ".list_object_keys_size", 1000));

        /// This code is used only to check access to the corresponding disk.
        if (!config.getBool(config_prefix + ".skip_access_check", false))
@ -162,6 +164,9 @@ void registerDiskS3(DiskFactory & factory)
            checkRemoveAccess(*s3disk);
        }

+        s3disk->restore();
+        s3disk->startup();
+
        bool cache_enabled = config.getBool(config_prefix + ".cache_enabled", true);

        if (cache_enabled)
--- a/src/Functions/CMakeLists.txt
+++ b/src/Functions/CMakeLists.txt
@ -117,3 +117,6 @@ target_link_libraries(clickhouse_functions PRIVATE clickhouse_functions_array)
 if (USE_STATS)
    target_link_libraries(clickhouse_functions PRIVATE stats)
 endif()
+
+# Signed integer overflow on user-provided data inside boost::geometry - ignore.
+set_source_files_properties("pointInPolygon.cpp" PROPERTIES COMPILE_FLAGS -fno-sanitize=signed-integer-overflow)
--- a/src/Functions/DateTimeTransforms.h
+++ b/src/Functions/DateTimeTransforms.h
@ -704,7 +704,11 @@ struct DateTimeTransformImpl
    {
        using Op = Transformer<typename FromDataType::FieldType, typename ToDataType::FieldType, Transform>;

-        const DateLUTImpl & time_zone = extractTimeZoneFromFunctionArguments(arguments, 1, 0);
+        size_t time_zone_argument_position = 1;
+        if constexpr (std::is_same_v<ToDataType, DataTypeDateTime64>)
+            time_zone_argument_position = 2;
+
+        const DateLUTImpl & time_zone = extractTimeZoneFromFunctionArguments(arguments, time_zone_argument_position, 0);

        const ColumnPtr source_col = arguments[0].column;
        if (const auto * sources = checkAndGetColumn<typename FromDataType::ColumnType>(source_col.get()))
--- a/src/Functions/FunctionsConversion.h
+++ b/src/Functions/FunctionsConversion.h
@ -477,6 +477,61 @@ template <typename Name> struct ConvertImpl<DataTypeDate, DataTypeDateTime64, Na
 template <typename Name> struct ConvertImpl<DataTypeDateTime, DataTypeDateTime64, Name, ConvertDefaultBehaviorTag>
    : DateTimeTransformImpl<DataTypeDateTime, DataTypeDateTime64, ToDateTime64Transform> {};

+/** Conversion of numeric to DateTime64
+  */
+
+template <typename FromType>
+struct ToDateTime64TransformUnsigned
+{
+    static constexpr auto name = "toDateTime64";
+
+    const DateTime64::NativeType scale_multiplier = 1;
+
+    ToDateTime64TransformUnsigned(UInt32 scale = 0)
+        : scale_multiplier(DecimalUtils::scaleMultiplier<DateTime64::NativeType>(scale))
+    {}
+
+    inline NO_SANITIZE_UNDEFINED DateTime64::NativeType execute(FromType from, const DateLUTImpl &) const
+    {
+        from = std::min(time_t(from), time_t(0xFFFFFFFF));
+        return DecimalUtils::decimalFromComponentsWithMultiplier<DateTime64>(from, 0, scale_multiplier);
+    }
+};
+template <typename FromType>
+struct ToDateTime64TransformSigned
+{
+    static constexpr auto name = "toDateTime64";
+
+    const DateTime64::NativeType scale_multiplier = 1;
+
+    ToDateTime64TransformSigned(UInt32 scale = 0)
+        : scale_multiplier(DecimalUtils::scaleMultiplier<DateTime64::NativeType>(scale))
+    {}
+
+    inline NO_SANITIZE_UNDEFINED DateTime64::NativeType execute(FromType from, const DateLUTImpl &) const
+    {
+        if (from < 0)
+            return 0;
+        from = std::min(time_t(from), time_t(0xFFFFFFFF));
+        return DecimalUtils::decimalFromComponentsWithMultiplier<DateTime64>(from, 0, scale_multiplier);
+    }
+};
+
+template <typename Name> struct ConvertImpl<DataTypeInt8, DataTypeDateTime64, Name>
+    : DateTimeTransformImpl<DataTypeInt8, DataTypeDateTime64, ToDateTime64TransformSigned<Int8>> {};
+template <typename Name> struct ConvertImpl<DataTypeInt16, DataTypeDateTime64, Name>
+    : DateTimeTransformImpl<DataTypeInt16, DataTypeDateTime64, ToDateTime64TransformSigned<Int16>> {};
+template <typename Name> struct ConvertImpl<DataTypeInt32, DataTypeDateTime64, Name>
+    : DateTimeTransformImpl<DataTypeInt32, DataTypeDateTime64, ToDateTime64TransformSigned<Int32>> {};
+template <typename Name> struct ConvertImpl<DataTypeInt64, DataTypeDateTime64, Name>
+    : DateTimeTransformImpl<DataTypeInt64, DataTypeDateTime64, ToDateTime64TransformSigned<Int64>> {};
+template <typename Name> struct ConvertImpl<DataTypeUInt64, DataTypeDateTime64, Name>
+    : DateTimeTransformImpl<DataTypeUInt64, DataTypeDateTime64, ToDateTime64TransformUnsigned<UInt64>> {};
+template <typename Name> struct ConvertImpl<DataTypeFloat32, DataTypeDateTime64, Name>
+    : DateTimeTransformImpl<DataTypeFloat32, DataTypeDateTime64, ToDateTime64TransformSigned<Float32>> {};
+template <typename Name> struct ConvertImpl<DataTypeFloat64, DataTypeDateTime64, Name>
+    : DateTimeTransformImpl<DataTypeFloat64, DataTypeDateTime64, ToDateTime64TransformSigned<Float64>> {};
+
 /** Conversion of DateTime64 to Date or DateTime: discards fractional part.
 */
 template <typename Transform>
@ -1294,7 +1349,12 @@ public:
    bool useDefaultImplementationForNulls() const override { return checked_return_type; }

    bool useDefaultImplementationForConstants() const override { return true; }
-    ColumnNumbers getArgumentsThatAreAlwaysConstant() const override { return {1}; }
+    ColumnNumbers getArgumentsThatAreAlwaysConstant() const override
+    {
+        if constexpr (std::is_same_v<ToDataType, DataTypeDateTime64>)
+            return {2};
+        return {1};
+    }
    bool canBeExecutedOnDefaultArguments() const override { return false; }

    ColumnPtr executeImpl(const ColumnsWithTypeAndName & arguments, const DataTypePtr & result_type, size_t input_rows_count) const override
@ -2313,7 +2373,7 @@ private:
                using LeftDataType = typename Types::LeftType;
                using RightDataType = typename Types::RightType;

-                if constexpr (IsDataTypeDecimalOrNumber<LeftDataType> && IsDataTypeDecimalOrNumber<RightDataType>)
+                if constexpr (IsDataTypeDecimalOrNumber<LeftDataType> && IsDataTypeDecimalOrNumber<RightDataType> && !std::is_same_v<DataTypeDateTime64, RightDataType>)
                {
                    if (wrapper_cast_type == CastType::accurate)
                    {
--- a/src/Functions/array/arrayCumSum.cpp
+++ b/src/Functions/array/arrayCumSum.cpp
@ -45,6 +45,41 @@ struct ArrayCumSumImpl
    }


+    template <typename Src, typename Dst>
+    static void NO_SANITIZE_UNDEFINED implConst(
+        size_t size, const IColumn::Offset * __restrict offsets, Dst * __restrict res_values, Src src_value)
+    {
+        size_t pos = 0;
+        for (const auto * end = offsets + size; offsets < end; ++offsets)
+        {
+            auto offset = *offsets;
+            Dst accumulated{};
+            for (; pos < offset; ++pos)
+            {
+                accumulated += src_value;
+                res_values[pos] = accumulated;
+            }
+        }
+    }
+
+    template <typename Src, typename Dst>
+    static void NO_SANITIZE_UNDEFINED implVector(
+        size_t size, const IColumn::Offset * __restrict offsets, Dst * __restrict res_values, const Src * __restrict src_values)
+    {
+        size_t pos = 0;
+        for (const auto * end = offsets + size; offsets < end; ++offsets)
+        {
+            auto offset = *offsets;
+            Dst accumulated{};
+            for (; pos < offset; ++pos)
+            {
+                accumulated += src_values[pos];
+                res_values[pos] = accumulated;
+            }
+        }
+    }
+
+
    template <typename Element, typename Result>
    static bool executeType(const ColumnPtr & mapped, const ColumnArray & array, ColumnPtr & res_ptr)
    {
@ -75,19 +110,7 @@ struct ArrayCumSumImpl

            typename ColVecResult::Container & res_values = res_nested->getData();
            res_values.resize(column_const->size());
-
-            size_t pos = 0;
-            for (auto offset : offsets)
-            {
-                // skip empty arrays
-                if (pos < offset)
-                {
-                    res_values[pos++] = x; // NOLINT
-                    for (; pos < offset; ++pos)
-                        res_values[pos] = res_values[pos - 1] + x;
-                }
-            }
-
+            implConst(offsets.size(), offsets.data(), res_values.data(), x);
            res_ptr = ColumnArray::create(std::move(res_nested), array.getOffsetsPtr());
            return true;
        }
@ -103,18 +126,7 @@ struct ArrayCumSumImpl

        typename ColVecResult::Container & res_values = res_nested->getData();
        res_values.resize(data.size());
-
-        size_t pos = 0;
-        for (auto offset : offsets)
-        {
-            // skip empty arrays
-            if (pos < offset)
-            {
-                res_values[pos] = data[pos]; // NOLINT
-                for (++pos; pos < offset; ++pos)
-                    res_values[pos] = res_values[pos - 1] + data[pos];
-            }
-        }
+        implVector(offsets.size(), offsets.data(), res_values.data(), data.data());
        res_ptr = ColumnArray::create(std::move(res_nested), array.getOffsetsPtr());
        return true;

--- a/src/Functions/array/arrayCumSumNonNegative.cpp
+++ b/src/Functions/array/arrayCumSumNonNegative.cpp
@ -48,6 +48,26 @@ struct ArrayCumSumNonNegativeImpl
    }


+    template <typename Src, typename Dst>
+    static void NO_SANITIZE_UNDEFINED implVector(
+        size_t size, const IColumn::Offset * __restrict offsets, Dst * __restrict res_values, const Src * __restrict src_values)
+    {
+        size_t pos = 0;
+        for (const auto * end = offsets + size; offsets < end; ++offsets)
+        {
+            auto offset = *offsets;
+            Dst accumulated{};
+            for (; pos < offset; ++pos)
+            {
+                accumulated += src_values[pos];
+                if (accumulated < 0)
+                    accumulated = 0;
+                res_values[pos] = accumulated;
+            }
+        }
+    }
+
+
    template <typename Element, typename Result>
    static bool executeType(const ColumnPtr & mapped, const ColumnArray & array, ColumnPtr & res_ptr)
    {
@ -70,26 +90,7 @@ struct ArrayCumSumNonNegativeImpl

        typename ColVecResult::Container & res_values = res_nested->getData();
        res_values.resize(data.size());
-
-        size_t pos = 0;
-        Result accum_sum = 0;
-        for (auto offset : offsets)
-        {
-            // skip empty arrays
-            if (pos < offset)
-            {
-                accum_sum = data[pos] > 0 ? data[pos] : Element(0); // NOLINT
-                res_values[pos] = accum_sum;
-                for (++pos; pos < offset; ++pos)
-                {
-                    accum_sum = accum_sum + data[pos];
-                    if (accum_sum < 0)
-                        accum_sum = 0;
-
-                    res_values[pos] = accum_sum;
-                }
-            }
-        }
+        implVector(offsets.size(), offsets.data(), res_values.data(), data.data());
        res_ptr = ColumnArray::create(std::move(res_nested), array.getOffsetsPtr());
        return true;

--- a/src/Functions/array/mapPopulateSeries.cpp
+++ b/src/Functions/array/mapPopulateSeries.cpp
@ -16,6 +16,7 @@ namespace ErrorCodes
    extern const int ILLEGAL_COLUMN;
    extern const int ILLEGAL_TYPE_OF_ARGUMENT;
    extern const int NUMBER_OF_ARGUMENTS_DOESNT_MATCH;
+    extern const int TOO_LARGE_ARRAY_SIZE;
 }

 class FunctionMapPopulateSeries : public IFunction
@ -188,9 +189,13 @@ private:
                }
            }

+            static constexpr size_t MAX_ARRAY_SIZE = 1ULL << 30;
+            if (static_cast<size_t>(max_key - min_key) > MAX_ARRAY_SIZE)
+                throw Exception(ErrorCodes::TOO_LARGE_ARRAY_SIZE, "Too large array size in the result of function {}", getName());
+
            /* fill the result arrays */
            KeyType key;
-            for (key = min_key; key <= max_key; ++key)
+            for (key = min_key;; ++key)
            {
                to_keys_data.insert(key);

@ -205,6 +210,8 @@ private:
                }

                ++offset;
+                if (key == max_key)
+                    break;
            }

            to_keys_offsets.push_back(offset);
--- a/src/Functions/if.cpp
+++ b/src/Functions/if.cpp
@ -532,7 +532,7 @@ private:
        return nullptr;
    }

-    ColumnPtr executeTuple(const ColumnsWithTypeAndName & arguments, size_t input_rows_count) const
+    ColumnPtr executeTuple(const ColumnsWithTypeAndName & arguments, const DataTypePtr & result_type, size_t input_rows_count) const
    {
        /// Calculate function for each corresponding elements of tuples.

@ -558,6 +558,7 @@ private:

        const DataTypeTuple & type1 = static_cast<const DataTypeTuple &>(*arg1.type);
        const DataTypeTuple & type2 = static_cast<const DataTypeTuple &>(*arg2.type);
+        const DataTypeTuple & tuple_result = static_cast<const DataTypeTuple &>(*result_type);

        ColumnsWithTypeAndName temporary_columns(3);
        temporary_columns[0] = arguments[0];
@ -570,7 +571,7 @@ private:
            temporary_columns[1] = {col1_contents[i], type1.getElements()[i], {}};
            temporary_columns[2] = {col2_contents[i], type2.getElements()[i], {}};

-            tuple_columns[i] = executeImpl(temporary_columns, std::make_shared<DataTypeUInt8>(), input_rows_count);
+            tuple_columns[i] = executeImpl(temporary_columns, tuple_result.getElements()[i], input_rows_count);
        }

        return ColumnTuple::create(tuple_columns);
@ -988,7 +989,7 @@ public:
            || (res = executeTyped<UInt128, UInt128>(cond_col, arguments, result_type, input_rows_count))
            || (res = executeString(cond_col, arguments, result_type))
            || (res = executeGenericArray(cond_col, arguments, result_type))
-            || (res = executeTuple(arguments, input_rows_count))))
+            || (res = executeTuple(arguments, result_type, input_rows_count))))
        {
            return executeGeneric(cond_col, arguments, input_rows_count);
        }
--- a/src/Functions/tests/gtest_number_traits.cpp
+++ b/src/Functions/tests/gtest_number_traits.cpp
@ -258,7 +258,7 @@ TEST(NumberTraits, Others)
    ASSERT_EQ(getTypeString(DB::NumberTraits::ResultOfFloatingPointDivision<DB::UInt16, DB::Int16>::Type()), "Float64");
    ASSERT_EQ(getTypeString(DB::NumberTraits::ResultOfFloatingPointDivision<DB::UInt32, DB::Int16>::Type()), "Float64");
    ASSERT_EQ(getTypeString(DB::NumberTraits::ResultOfIntegerDivision<DB::UInt8, DB::Int16>::Type()), "Int8");
-    ASSERT_EQ(getTypeString(DB::NumberTraits::ResultOfModulo<DB::UInt32, DB::Int8>::Type()), "Int8");
+    ASSERT_EQ(getTypeString(DB::NumberTraits::ResultOfModulo<DB::UInt32, DB::Int8>::Type()), "UInt8");
 }


--- a/src/IO/BrotliReadBuffer.cpp
+++ b/src/IO/BrotliReadBuffer.cpp
@ -77,7 +77,7 @@ bool BrotliReadBuffer::nextImpl()
        if (in->eof())
        {
            eof = true;
-            return working_buffer.size() != 0;
+            return !working_buffer.empty();
        }
        else
        {
--- a/src/IO/BufferBase.h
+++ b/src/IO/BufferBase.h
@ -40,6 +40,7 @@ public:
        inline Position end() const { return end_pos; }
        inline size_t size() const { return size_t(end_pos - begin_pos); }
        inline void resize(size_t size) { end_pos = begin_pos + size; }
+        inline bool empty() const { return size() == 0; }

        inline void swap(Buffer & other)
        {
--- a/src/IO/ConcatReadBuffer.h
+++ b/src/IO/ConcatReadBuffer.h
@ -25,11 +25,16 @@ protected:
            return false;

        /// First reading
-        if (working_buffer.size() == 0 && (*current)->hasPendingData())
+        if (working_buffer.empty())
        {
-            working_buffer = Buffer((*current)->position(), (*current)->buffer().end());
-            return true;
+            if ((*current)->hasPendingData())
+            {
+                working_buffer = Buffer((*current)->position(), (*current)->buffer().end());
+                return true;
+            }
        }
+        else
+            (*current)->position() = position();

        if (!(*current)->next())
        {
@ -51,14 +56,12 @@ protected:
    }

 public:
-    ConcatReadBuffer(const ReadBuffers & buffers_) : ReadBuffer(nullptr, 0), buffers(buffers_), current(buffers.begin()) {}
-
-    ConcatReadBuffer(ReadBuffer & buf1, ReadBuffer & buf2) : ReadBuffer(nullptr, 0)
+    explicit ConcatReadBuffer(const ReadBuffers & buffers_) : ReadBuffer(nullptr, 0), buffers(buffers_), current(buffers.begin())
    {
-        buffers.push_back(&buf1);
-        buffers.push_back(&buf2);
-        current = buffers.begin();
+        assert(!buffers.empty());
    }
+
+    ConcatReadBuffer(ReadBuffer & buf1, ReadBuffer & buf2) : ConcatReadBuffer({&buf1, &buf2}) {}
 };

 }
--- a/src/IO/HashingReadBuffer.h
+++ b/src/IO/HashingReadBuffer.h
@ -1,10 +1,11 @@
 #pragma once

-#include <IO/ReadBuffer.h>
 #include <IO/HashingWriteBuffer.h>
+#include <IO/ReadBuffer.h>

 namespace DB
 {
+
 /*
 * Calculates the hash from the read data. When reading, the data is read from the nested ReadBuffer.
 * Small pieces are copied into its own memory.
@ -12,14 +13,14 @@ namespace DB
 class HashingReadBuffer : public IHashingBuffer<ReadBuffer>
 {
 public:
-    HashingReadBuffer(ReadBuffer & in_, size_t block_size_ = DBMS_DEFAULT_HASHING_BLOCK_SIZE) :
-        IHashingBuffer<ReadBuffer>(block_size_), in(in_)
+    explicit HashingReadBuffer(ReadBuffer & in_, size_t block_size_ = DBMS_DEFAULT_HASHING_BLOCK_SIZE)
+        : IHashingBuffer<ReadBuffer>(block_size_), in(in_)
    {
        working_buffer = in.buffer();
        pos = in.position();

        /// calculate hash from the data already read
-        if (working_buffer.size())
+        if (!working_buffer.empty())
        {
            calculateHash(pos, working_buffer.end() - pos);
        }
@ -39,7 +40,7 @@ private:
        return res;
    }

-private:
    ReadBuffer & in;
 };
+
 }
--- a/src/IO/LZMAInflatingReadBuffer.cpp
+++ b/src/IO/LZMAInflatingReadBuffer.cpp
@ -66,7 +66,7 @@ bool LZMAInflatingReadBuffer::nextImpl()
        if (in->eof())
        {
            eof = true;
-            return working_buffer.size() != 0;
+            return !working_buffer.empty();
        }
        else
        {
--- a/src/IO/LimitReadBuffer.cpp
+++ b/src/IO/LimitReadBuffer.cpp
@ -1,4 +1,5 @@
 #include <IO/LimitReadBuffer.h>
+
 #include <Common/Exception.h>


@ -13,6 +14,8 @@ namespace ErrorCodes

 bool LimitReadBuffer::nextImpl()
 {
+    assert(position() >= in.position());
+
    /// Let underlying buffer calculate read bytes in `next()` call.
    in.position() = position();

@ -25,7 +28,10 @@ bool LimitReadBuffer::nextImpl()
    }

    if (!in.next())
+    {
+        working_buffer = in.buffer();
        return false;
+    }

    working_buffer = in.buffer();

@ -50,7 +56,7 @@ LimitReadBuffer::LimitReadBuffer(ReadBuffer & in_, UInt64 limit_, bool throw_exc
 LimitReadBuffer::~LimitReadBuffer()
 {
    /// Update underlying buffer's position in case when limit wasn't reached.
-    if (working_buffer.size() != 0)
+    if (!working_buffer.empty())
        in.position() = position();
 }

--- a/src/IO/MemoryReadWriteBuffer.cpp
+++ b/src/IO/MemoryReadWriteBuffer.cpp
@ -61,7 +61,7 @@ private:
            position() = nullptr;
        }

-        return buffer().size() != 0;
+        return !buffer().empty();
    }

    using Container = std::forward_list<BufferBase::Buffer>;
--- a/src/IO/ReadBuffer.h
+++ b/src/IO/ReadBuffer.h
@ -55,13 +55,19 @@ public:
      */
    bool next()
    {
+        assert(!hasPendingData());
+        assert(position() <= working_buffer.end());
+
        bytes += offset();
        bool res = nextImpl();
        if (!res)
-            working_buffer.resize(0);
-
-        pos = working_buffer.begin() + nextimpl_working_buffer_offset;
+            working_buffer = Buffer(pos, pos);
+        else
+            pos = working_buffer.begin() + nextimpl_working_buffer_offset;
        nextimpl_working_buffer_offset = 0;
+
+        assert(position() <= working_buffer.end());
+
        return res;
    }

@ -72,7 +78,7 @@ public:
            next();
    }

-    virtual ~ReadBuffer() {}
+    virtual ~ReadBuffer() = default;


    /** Unlike std::istream, it returns true if all data was read
@ -192,7 +198,7 @@ private:
      */
    virtual bool nextImpl() { return false; }

-    [[noreturn]] void throwReadAfterEOF()
+    [[noreturn]] static void throwReadAfterEOF()
    {
        throw Exception("Attempt to read after eof", ErrorCodes::ATTEMPT_TO_READ_AFTER_EOF);
    }
--- a/src/IO/ReadBufferFromFileDescriptor.cpp
+++ b/src/IO/ReadBufferFromFileDescriptor.cpp
@ -90,6 +90,7 @@ bool ReadBufferFromFileDescriptor::nextImpl()
    if (bytes_read)
    {
        ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorReadBytes, bytes_read);
+        working_buffer = internal_buffer;
        working_buffer.resize(bytes_read);
    }
    else
--- a/src/IO/ReadWriteBufferFromHTTP.h
+++ b/src/IO/ReadWriteBufferFromHTTP.h
@ -76,9 +76,7 @@ public:
        }
    }

-    virtual ~UpdatableSessionBase()
-    {
-    }
+    virtual ~UpdatableSessionBase() = default;
 };


@ -205,6 +203,8 @@ namespace detail
        {
            if (next_callback)
                next_callback(count());
+            if (!working_buffer.empty())
+                impl->position() = position();
            if (!impl->next())
                return false;
            internal_buffer = impl->buffer();
--- a/src/IO/SeekableReadBuffer.h
+++ b/src/IO/SeekableReadBuffer.h
@ -21,6 +21,12 @@ public:
     */
    virtual off_t seek(off_t off, int whence) = 0;

+    /**
+     * Keep in mind that seekable buffer may encounter eof() once and the working buffer
+     * may get into inconsistent state. Don't forget to reset it on the first nextImpl()
+     * after seek().
+     */
+
    /**
     * @return Offset from the begin of the underlying buffer / file corresponds to the buffer current position.
     */
--- a/src/IO/WriteBuffer.h
+++ b/src/IO/WriteBuffer.h
@ -61,7 +61,7 @@ public:
    /** it is desirable in the derived classes to place the next() call in the destructor,
      * so that the last data is written
      */
-    virtual ~WriteBuffer() {}
+    virtual ~WriteBuffer() = default;

    inline void nextIfAtEnd()
    {
@ -75,7 +75,7 @@ public:
        size_t bytes_copied = 0;

        /// Produces endless loop
-        assert(working_buffer.size() > 0);
+        assert(!working_buffer.empty());

        while (bytes_copied < n)
        {
--- a/src/IO/WriteHelpers.h
+++ b/src/IO/WriteHelpers.h
@ -910,6 +910,7 @@ inline void writeBinary(const StringRef & x, WriteBuffer & buf) { writeStringBin
 inline void writeBinary(const std::string_view & x, WriteBuffer & buf) { writeStringBinary(x, buf); }
 inline void writeBinary(const Int128 & x, WriteBuffer & buf) { writePODBinary(x, buf); }
 inline void writeBinary(const UInt128 & x, WriteBuffer & buf) { writePODBinary(x, buf); }
+inline void writeBinary(const UUID & x, WriteBuffer & buf) { writePODBinary(x, buf); }
 inline void writeBinary(const DummyUInt256 & x, WriteBuffer & buf) { writePODBinary(x, buf); }
 inline void writeBinary(const Decimal32 & x, WriteBuffer & buf) { writePODBinary(x, buf); }
 inline void writeBinary(const Decimal64 & x, WriteBuffer & buf) { writePODBinary(x, buf); }
--- a/src/IO/ZlibInflatingReadBuffer.cpp
+++ b/src/IO/ZlibInflatingReadBuffer.cpp
@ -70,7 +70,7 @@ bool ZlibInflatingReadBuffer::nextImpl()
        if (in->eof())
        {
            eof = true;
-            return working_buffer.size() != 0;
+            return !working_buffer.empty();
        }
        else
        {
--- a/src/IO/ZstdInflatingReadBuffer.cpp
+++ b/src/IO/ZstdInflatingReadBuffer.cpp
@ -54,7 +54,7 @@ bool ZstdInflatingReadBuffer::nextImpl()
    if (in->eof())
    {
        eof = true;
-        return working_buffer.size() != 0;
+        return !working_buffer.empty();
    }

    return true;
--- a/src/Interpreters/ActionsDAG.cpp
+++ b/src/Interpreters/ActionsDAG.cpp
@ -454,36 +454,42 @@ bool ActionsDAG::tryRestoreColumn(const std::string & column_name)
    return false;
 }

-void ActionsDAG::removeUnusedInput(const std::string & column_name)
+bool ActionsDAG::removeUnusedResult(const std::string & column_name)
 {
+    /// Find column in index and remove.
+    const Node * col;
+    {
+        auto it = index.begin();
+        for (; it != index.end(); ++it)
+            if ((*it)->result_name == column_name)
+                break;
+
+        if (it == index.end())
+            throw Exception(ErrorCodes::LOGICAL_ERROR, "Not found result {} in ActionsDAG\n{}", column_name, dumpDAG());
+
+        col = *it;
+        index.remove(it);
+    }
+
+    /// Check if column is in input.
    auto it = inputs.begin();
    for (; it != inputs.end(); ++it)
-        if ((*it)->result_name == column_name)
+        if (*it == col)
            break;

    if (it == inputs.end())
-        throw Exception(ErrorCodes::LOGICAL_ERROR, "Not found input {} in ActionsDAG\n{}", column_name, dumpDAG());
+        return false;

-    auto * input = *it;
+    /// Check column has no dependent.
    for (const auto & node : nodes)
        for (const auto * child : node.children)
-            if (input == child)
-                throw Exception(ErrorCodes::LOGICAL_ERROR,
-                                "Cannot remove input {} because it has dependent nodes in ActionsDAG\n{}",
-                                column_name, dumpDAG());
-
-    for (auto jt = index.begin(); jt != index.end(); ++jt)
-    {
-        if (*jt == input)
-        {
-            index.remove(jt);
-            break;
-        }
-    }
+            if (col == child)
+                return false;

+    /// Remove from nodes and inputs.
    for (auto jt = nodes.begin(); jt != nodes.end(); ++jt)
    {
-        if (&(*jt) == input)
+        if (&(*jt) == *it)
        {
            nodes.erase(jt);
            break;
@ -491,6 +497,7 @@ void ActionsDAG::removeUnusedInput(const std::string & column_name)
    }

    inputs.erase(it);
+    return true;
 }

 ActionsDAGPtr ActionsDAG::clone() const
@ -844,7 +851,7 @@ ActionsDAGPtr ActionsDAG::merge(ActionsDAG && first, ActionsDAG && second)
    return std::make_shared<ActionsDAG>(std::move(first));
 }

-std::pair<ActionsDAGPtr, ActionsDAGPtr> ActionsDAG::split(std::unordered_set<const Node *> split_nodes) const
+ActionsDAG::SplitResult ActionsDAG::split(std::unordered_set<const Node *> split_nodes) const
 {
    /// Split DAG into two parts.
    /// (first_nodes, first_index) is a part which will have split_list in result.
@ -1045,7 +1052,7 @@ std::pair<ActionsDAGPtr, ActionsDAGPtr> ActionsDAG::split(std::unordered_set<con
    return {std::move(first_actions), std::move(second_actions)};
 }

-std::pair<ActionsDAGPtr, ActionsDAGPtr>  ActionsDAG::splitActionsBeforeArrayJoin(const NameSet & array_joined_columns) const
+ActionsDAG::SplitResult ActionsDAG::splitActionsBeforeArrayJoin(const NameSet & array_joined_columns) const
 {

    struct Frame
@ -1113,7 +1120,7 @@ std::pair<ActionsDAGPtr, ActionsDAGPtr>  ActionsDAG::splitActionsBeforeArrayJoin
    return res;
 }

-std::pair<ActionsDAGPtr, ActionsDAGPtr> ActionsDAG::splitActionsForFilter(const std::string & column_name) const
+ActionsDAG::SplitResult ActionsDAG::splitActionsForFilter(const std::string & column_name) const
 {
    auto it = index.begin();
    for (; it != index.end(); ++it)
--- a/src/Interpreters/ActionsDAG.h
+++ b/src/Interpreters/ActionsDAG.h
@ -214,9 +214,10 @@ public:

    /// If column is not in index, try to find it in nodes and insert back into index.
    bool tryRestoreColumn(const std::string & column_name);
-    /// Find column in input. Remove it from input and index.
-    /// Checks that column in inputs and has not dependent nodes.
-    void removeUnusedInput(const std::string & column_name);
+    /// Find column in result. Remove it from index.
+    /// If columns is in inputs and has no dependent nodes, remove it from inputs too.
+    /// Return true if column was removed from inputs.
+    bool removeUnusedResult(const std::string & column_name);

    void projectInput() { settings.project_input = true; }
    void removeUnusedActions(const Names & required_names);
@ -255,18 +256,20 @@ public:
    /// Otherwise, any two actions may be combined.
    static ActionsDAGPtr merge(ActionsDAG && first, ActionsDAG && second);

+    using SplitResult = std::pair<ActionsDAGPtr, ActionsDAGPtr>;
+
    /// Split ActionsDAG into two DAGs, where first part contains all nodes from split_nodes and their children.
    /// Execution of first then second parts on block is equivalent to execution of initial DAG.
    /// First DAG and initial DAG have equal inputs, second DAG and initial DAG has equal index (outputs).
    /// Second DAG inputs may contain less inputs then first DAG (but also include other columns).
-    std::pair<ActionsDAGPtr, ActionsDAGPtr> split(std::unordered_set<const Node *> split_nodes) const;
+    SplitResult split(std::unordered_set<const Node *> split_nodes) const;

    /// Splits actions into two parts. Returned first half may be swapped with ARRAY JOIN.
-    std::pair<ActionsDAGPtr, ActionsDAGPtr> splitActionsBeforeArrayJoin(const NameSet & array_joined_columns) const;
+    SplitResult splitActionsBeforeArrayJoin(const NameSet & array_joined_columns) const;

    /// Splits actions into two parts. First part has minimal size sufficient for calculation of column_name.
    /// Index of initial actions must contain column_name.
-    std::pair<ActionsDAGPtr, ActionsDAGPtr> splitActionsForFilter(const std::string & column_name) const;
+    SplitResult splitActionsForFilter(const std::string & column_name) const;

 private:
    Node & addNode(Node node, bool can_replace = false);
--- a/src/Interpreters/Context.cpp
+++ b/src/Interpreters/Context.cpp
@ -64,6 +64,7 @@
 #include <Common/RemoteHostFilter.h>
 #include <Interpreters/DatabaseCatalog.h>
 #include <Storages/MergeTree/BackgroundJobsExecutor.h>
+#include <Storages/MergeTree/MergeTreeDataPartUUID.h>


 namespace ProfileEvents
@ -2544,4 +2545,22 @@ StorageID Context::resolveStorageIDImpl(StorageID storage_id, StorageNamespace w
    return StorageID::createEmpty();
 }

+PartUUIDsPtr Context::getPartUUIDs()
+{
+    auto lock = getLock();
+    if (!part_uuids)
+        part_uuids = std::make_shared<PartUUIDs>();
+
+    return part_uuids;
+}
+
+PartUUIDsPtr Context::getIgnoredPartUUIDs()
+{
+    auto lock = getLock();
+    if (!ignored_part_uuids)
+        ignored_part_uuids = std::make_shared<PartUUIDs>();
+
+    return ignored_part_uuids;
+}
+
 }
--- a/src/Interpreters/Context.h
+++ b/src/Interpreters/Context.h
@ -106,6 +106,8 @@ using StoragePolicyPtr = std::shared_ptr<const IStoragePolicy>;
 using StoragePoliciesMap = std::map<String, StoragePolicyPtr>;
 class StoragePolicySelector;
 using StoragePolicySelectorPtr = std::shared_ptr<const StoragePolicySelector>;
+struct PartUUIDs;
+using PartUUIDsPtr = std::shared_ptr<PartUUIDs>;
 class NuKeeperStorageDispatcher;

 class IOutputFormat;
@ -264,6 +266,9 @@ private:
    using SampleBlockCache = std::unordered_map<std::string, Block>;
    mutable SampleBlockCache sample_block_cache;

+    PartUUIDsPtr part_uuids; /// set of parts' uuids, is used for query parts deduplication
+    PartUUIDsPtr ignored_part_uuids; /// set of parts' uuids are meant to be excluded from query processing
+
    NameToNameMap query_parameters;   /// Dictionary with query parameters for prepared statements.
                                                     /// (key=name, value)

@ -737,6 +742,9 @@ public:
    };

    MySQLWireContext mysql;
+
+    PartUUIDsPtr getPartUUIDs();
+    PartUUIDsPtr getIgnoredPartUUIDs();
 private:
    std::unique_lock<std::recursive_mutex> getLock() const;

--- a/src/Interpreters/InterpreterSelectQuery.cpp
+++ b/src/Interpreters/InterpreterSelectQuery.cpp
@ -69,7 +69,6 @@
 #include <Processors/Transforms/FilterTransform.h>
 #include <Processors/Transforms/JoiningTransform.h>

-#include <Storages/MergeTree/MergeTreeData.h>
 #include <Storages/MergeTree/MergeTreeWhereOptimizer.h>
 #include <Storages/IStorage.h>
 #include <Storages/StorageView.h>
@ -390,13 +389,18 @@ InterpreterSelectQuery::InterpreterSelectQuery(
        if (try_move_to_prewhere && storage && !row_policy_filter && query.where() && !query.prewhere() && !query.final())
        {
            /// PREWHERE optimization: transfer some condition from WHERE to PREWHERE if enabled and viable
-            if (const auto * merge_tree = dynamic_cast<const MergeTreeData *>(storage.get()))
+            if (const auto & column_sizes = storage->getColumnSizes(); !column_sizes.empty())
            {
+                /// Extract column compressed sizes.
+                std::unordered_map<std::string, UInt64> column_compressed_sizes;
+                for (const auto & [name, sizes] : column_sizes)
+                    column_compressed_sizes[name] = sizes.data_compressed;
+
                SelectQueryInfo current_info;
                current_info.query = query_ptr;
                current_info.syntax_analyzer_result = syntax_analyzer_result;

-                MergeTreeWhereOptimizer{current_info, *context, *merge_tree, metadata_snapshot, syntax_analyzer_result->requiredSourceColumns(), log};
+                MergeTreeWhereOptimizer{current_info, *context, std::move(column_compressed_sizes), metadata_snapshot, syntax_analyzer_result->requiredSourceColumns(), log};
            }
        }

--- a/src/Interpreters/MarkTableIdentifiersVisitor.cpp
+++ b/src/Interpreters/MarkTableIdentifiersVisitor.cpp
@ -47,7 +47,7 @@ void MarkTableIdentifiersMatcher::visit(const ASTFunction & func, ASTPtr &, Data
    // First argument of dictGet can be a dictionary name, perhaps with a database.
    if (functionIsJoinGet(func.name) || functionIsDictGet(func.name))
    {
-        if (func.arguments->children.empty())
+        if (!func.arguments || func.arguments->children.empty())
        {
            return;
        }
--- a/src/Interpreters/RewriteSumIfFunctionVisitor.cpp
+++ b/src/Interpreters/RewriteSumIfFunctionVisitor.cpp
@ -13,18 +13,6 @@ void RewriteSumIfFunctionMatcher::visit(ASTPtr & ast, Data & data)
        visit(*func, ast, data);
 }

-static ASTPtr createNewFunctionWithOneArgument(const String & func_name, const ASTPtr & argument)
-{
-    auto new_func = std::make_shared<ASTFunction>();
-    new_func->name = func_name;
-
-    auto new_arguments = std::make_shared<ASTExpressionList>();
-    new_arguments->children.push_back(argument);
-    new_func->arguments = new_arguments;
-    new_func->children.push_back(new_arguments);
-    return new_func;
-}
-
 void RewriteSumIfFunctionMatcher::visit(const ASTFunction & func, ASTPtr & ast, Data &)
 {
    if (!func.arguments || func.arguments->children.empty())
@ -46,7 +34,7 @@ void RewriteSumIfFunctionMatcher::visit(const ASTFunction & func, ASTPtr & ast,

        if (func_arguments.size() == 2 && literal->value.get<UInt64>() == 1)
        {
-            auto new_func = createNewFunctionWithOneArgument("countIf", func_arguments[1]);
+            auto new_func = makeASTFunction("countIf", func_arguments[1]);
            new_func->setAlias(func.alias);
            ast = std::move(new_func);
            return;
@ -74,7 +62,7 @@ void RewriteSumIfFunctionMatcher::visit(const ASTFunction & func, ASTPtr & ast,
            /// sum(if(cond, 1, 0)) -> countIf(cond)
            if (first_value == 1 && second_value == 0)
            {
-                auto new_func = createNewFunctionWithOneArgument("countIf", if_arguments[0]);
+                auto new_func = makeASTFunction("countIf", if_arguments[0]);
                new_func->setAlias(func.alias);
                ast = std::move(new_func);
                return;
@ -82,8 +70,8 @@ void RewriteSumIfFunctionMatcher::visit(const ASTFunction & func, ASTPtr & ast,
            /// sum(if(cond, 0, 1)) -> countIf(not(cond))
            if (first_value == 0 && second_value == 1)
            {
-                auto not_func = createNewFunctionWithOneArgument("not", if_arguments[0]);
-                auto new_func = createNewFunctionWithOneArgument("countIf", not_func);
+                auto not_func = makeASTFunction("not", if_arguments[0]);
+                auto new_func = makeASTFunction("countIf", not_func);
                new_func->setAlias(func.alias);
                ast = std::move(new_func);
                return;
--- a/src/Interpreters/WindowDescription.cpp
+++ b/src/Interpreters/WindowDescription.cpp
@ -6,6 +6,11 @@
 namespace DB
 {

+namespace ErrorCodes
+{
+    extern const int BAD_ARGUMENTS;
+}
+
 std::string WindowFunctionDescription::dump() const
 {
    WriteBufferFromOwnString ss;
@ -33,4 +38,95 @@ std::string WindowDescription::dump() const
    return ss.str();
 }

+std::string WindowFrame::toString() const
+{
+    WriteBufferFromOwnString buf;
+    toString(buf);
+    return buf.str();
+}
+
+void WindowFrame::toString(WriteBuffer & buf) const
+{
+    buf << toString(type) << " BETWEEN ";
+    if (begin_type == BoundaryType::Current)
+    {
+        buf << "CURRENT ROW";
+    }
+    else if (begin_type == BoundaryType::Unbounded)
+    {
+        buf << "UNBOUNDED PRECEDING";
+    }
+    else
+    {
+        buf << abs(begin_offset);
+        buf << " "
+            << (begin_offset > 0 ? "FOLLOWING" : "PRECEDING");
+    }
+    buf << " AND ";
+    if (end_type == BoundaryType::Current)
+    {
+        buf << "CURRENT ROW";
+    }
+    else if (end_type == BoundaryType::Unbounded)
+    {
+        buf << "UNBOUNDED PRECEDING";
+    }
+    else
+    {
+        buf << abs(end_offset);
+        buf << " "
+            << (end_offset > 0 ? "FOLLOWING" : "PRECEDING");
+    }
+}
+
+void WindowFrame::checkValid() const
+{
+    if (begin_type == BoundaryType::Unbounded
+        || end_type == BoundaryType::Unbounded)
+    {
+        return;
+    }
+
+    if (begin_type == BoundaryType::Current
+        && end_type == BoundaryType::Offset
+        && end_offset > 0)
+    {
+        return;
+    }
+
+    if (end_type == BoundaryType::Current
+        && begin_type == BoundaryType::Offset
+        && begin_offset < 0)
+    {
+        return;
+    }
+
+    if (end_type == BoundaryType::Current
+        && begin_type == BoundaryType::Current)
+    {
+        // BETWEEN CURRENT ROW AND CURRENT ROW makes some sense for RANGE or
+        // GROUP frames, and is technically valid for ROWS frame.
+        return;
+    }
+
+    if (end_type == BoundaryType::Offset
+        && begin_type == BoundaryType::Offset)
+    {
+        if (type == FrameType::Rows)
+        {
+            if (end_offset >= begin_offset)
+            {
+                return;
+            }
+        }
+
+        // For RANGE and GROUPS, we must check that end follows begin if sorted
+        // according to ORDER BY (we don't support them yet).
+    }
+
+    throw Exception(ErrorCodes::BAD_ARGUMENTS,
+        "Window frame '{}' is invalid",
+        toString());
+}
+
 }
--- a/src/Interpreters/WindowDescription.h
+++ b/src/Interpreters/WindowDescription.h
@ -53,6 +53,13 @@ struct WindowFrame
    int64_t end_offset = 0;


+    // Throws BAD_ARGUMENTS exception if the frame definition is incorrect, e.g.
+    // the frame start comes later than the frame end.
+    void checkValid() const;
+
+    std::string toString() const;
+    void toString(WriteBuffer & buf) const;
+
    bool operator == (const WindowFrame & other) const
    {
        // We don't compare is_default because it's not a real property of the
--- a/src/Interpreters/ya.make
+++ b/src/Interpreters/ya.make
@ -145,6 +145,7 @@ SRCS(
    TranslateQualifiedNamesVisitor.cpp
    TreeOptimizer.cpp
    TreeRewriter.cpp
+    WindowDescription.cpp
    addMissingDefaults.cpp
    addTypeConversionToAST.cpp
    castColumn.cpp
--- a/src/Parsers/ExpressionElementParsers.cpp
+++ b/src/Parsers/ExpressionElementParsers.cpp
@ -46,6 +46,7 @@ namespace DB

 namespace ErrorCodes
 {
+    extern const int BAD_ARGUMENTS;
    extern const int SYNTAX_ERROR;
    extern const int LOGICAL_ERROR;
    extern const int NOT_IMPLEMENTED;
@ -558,7 +559,24 @@ static bool tryParseFrameDefinition(ASTWindowDefinition * node, IParser::Pos & p
        }
        else if (parser_literal.parse(pos, ast_literal, expected))
        {
-            node->frame.begin_offset = ast_literal->as<ASTLiteral &>().value.safeGet<Int64>();
+            const Field & value = ast_literal->as<ASTLiteral &>().value;
+            if (!isInt64FieldType(value.getType()))
+            {
+                throw Exception(ErrorCodes::BAD_ARGUMENTS,
+                    "Only integer frame offsets are supported, '{}' is not supported.",
+                    Field::Types::toString(value.getType()));
+            }
+            node->frame.begin_offset = value.get<Int64>();
+            node->frame.begin_type = WindowFrame::BoundaryType::Offset;
+            // We can easily get a UINT64_MAX here, which doesn't even fit into
+            // int64_t. Not sure what checks we are going to need here after we
+            // support floats and dates.
+            if (node->frame.begin_offset > INT_MAX || node->frame.begin_offset < INT_MIN)
+            {
+                throw Exception(ErrorCodes::BAD_ARGUMENTS,
+                    "Frame offset must be between {} and {}, but {} is given",
+                    INT_MAX, INT_MIN, node->frame.begin_offset);
+            }
        }
        else
        {
@ -567,7 +585,7 @@ static bool tryParseFrameDefinition(ASTWindowDefinition * node, IParser::Pos & p

        if (keyword_preceding.ignore(pos, expected))
        {
-            node->frame.begin_offset = - node->frame.begin_offset;
+            node->frame.begin_offset = -node->frame.begin_offset;
        }
        else if (keyword_following.ignore(pos, expected))
        {
@ -604,7 +622,22 @@ static bool tryParseFrameDefinition(ASTWindowDefinition * node, IParser::Pos & p
            }
            else if (parser_literal.parse(pos, ast_literal, expected))
            {
-                node->frame.end_offset = ast_literal->as<ASTLiteral &>().value.safeGet<Int64>();
+                const Field & value = ast_literal->as<ASTLiteral &>().value;
+                if (!isInt64FieldType(value.getType()))
+                {
+                    throw Exception(ErrorCodes::BAD_ARGUMENTS,
+                        "Only integer frame offsets are supported, '{}' is not supported.",
+                        Field::Types::toString(value.getType()));
+                }
+                node->frame.end_offset = value.get<Int64>();
+                node->frame.end_type = WindowFrame::BoundaryType::Offset;
+
+                if (node->frame.end_offset > INT_MAX || node->frame.end_offset < INT_MIN)
+                {
+                    throw Exception(ErrorCodes::BAD_ARGUMENTS,
+                        "Frame offset must be between {} and {}, but {} is given",
+                        INT_MAX, INT_MIN, node->frame.end_offset);
+                }
            }
            else
            {
@ -623,6 +656,7 @@ static bool tryParseFrameDefinition(ASTWindowDefinition * node, IParser::Pos & p
            }
            else if (keyword_following.ignore(pos, expected))
            {
+                // Positive offset or UNBOUNDED FOLLOWING.
            }
            else
            {
--- a/src/Parsers/New/CMakeLists.txt
+++ b/src/Parsers/New/CMakeLists.txt
@ -65,8 +65,6 @@ target_compile_options (clickhouse_parsers_new
        -Wno-documentation-deprecated-sync
        -Wno-shadow-field
        -Wno-unused-parameter
-
-    PUBLIC
        -Wno-extra-semi
        -Wno-inconsistent-missing-destructor-override
 )
--- a/src/Parsers/ParserDataType.cpp
+++ b/src/Parsers/ParserDataType.cpp
@ -32,8 +32,10 @@ private:
        const char * operators[] = {"=", "equals", nullptr};
        ParserLeftAssociativeBinaryOperatorList enum_parser(operators, std::make_unique<ParserLiteral>());

-        return nested_parser.parse(pos, node, expected)
-            || enum_parser.parse(pos, node, expected)
+        if (pos->type == TokenType::BareWord && std::string_view(pos->begin, pos->size()) == "Nested")
+            return nested_parser.parse(pos, node, expected);
+
+        return enum_parser.parse(pos, node, expected)
            || literal_parser.parse(pos, node, expected)
            || data_type_parser.parse(pos, node, expected);
    }
--- a/src/Parsers/ParserInsertQuery.h
+++ b/src/Parsers/ParserInsertQuery.h
@ -30,7 +30,7 @@ private:
    const char * getName() const override { return "INSERT query"; }
    bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
 public:
-    ParserInsertQuery(const char * end_) : end(end_) {}
+    explicit ParserInsertQuery(const char * end_) : end(end_) {}
 };

 /** Insert accepts an identifier and an asterisk with variants.
--- a/src/Processors/QueryPlan/Optimizations/splitFilter.cpp
+++ b/src/Processors/QueryPlan/Optimizations/splitFilter.cpp
@ -24,8 +24,9 @@ size_t trySplitFilter(QueryPlan::Node * node, QueryPlan::Nodes & nodes)
    if (split.second->trivial())
        return 0;

+    bool remove_filter = false;
    if (filter_step->removesFilterColumn())
-        split.second->removeUnusedInput(filter_step->getFilterColumnName());
+        remove_filter = split.second->removeUnusedResult(filter_step->getFilterColumnName());

    auto description = filter_step->getStepDescription();

@ -37,7 +38,7 @@ size_t trySplitFilter(QueryPlan::Node * node, QueryPlan::Nodes & nodes)
            filter_node.children.at(0)->step->getOutputStream(),
            std::move(split.first),
            filter_step->getFilterColumnName(),
-            filter_step->removesFilterColumn());
+            remove_filter);

    node->step = std::make_unique<ExpressionStep>(filter_node.step->getOutputStream(), std::move(split.second));

--- a/src/Processors/QueryPlan/WindowStep.cpp
+++ b/src/Processors/QueryPlan/WindowStep.cpp
@ -57,6 +57,7 @@ WindowStep::WindowStep(const DataStream & input_stream_,
 {
    // We don't remove any columns, only add, so probably we don't have to update
    // the output DataStream::distinct_columns.
+    window_description.frame.checkValid();
 }

 void WindowStep::transformPipeline(QueryPipeline & pipeline)
--- a/src/Processors/Transforms/WindowTransform.cpp
+++ b/src/Processors/Transforms/WindowTransform.cpp
@ -165,16 +165,197 @@ void WindowTransform::advancePartitionEnd()
    assert(!partition_ended && partition_end == blocksEnd());
 }

-void WindowTransform::advanceFrameStart() const
+auto WindowTransform::moveRowNumberNoCheck(const RowNumber & _x, int offset) const
 {
-    // Frame start is always UNBOUNDED PRECEDING for now, so we don't have to
-    // move it. It is initialized when the new partition starts.
-    if (window_description.frame.begin_type
-        != WindowFrame::BoundaryType::Unbounded)
+    RowNumber x = _x;
+
+    if (offset > 0)
    {
-        throw Exception(ErrorCodes::NOT_IMPLEMENTED,
-            "Frame start type '{}' is not implemented",
-            WindowFrame::toString(window_description.frame.begin_type));
+        for (;;)
+        {
+            assertValid(x);
+            assert(offset >= 0);
+
+            const auto block_rows = blockRowsNumber(x);
+            x.row += offset;
+            if (x.row >= block_rows)
+            {
+                offset = x.row - block_rows;
+                x.row = 0;
+                x.block++;
+
+                if (x == blocksEnd())
+                {
+                    break;
+                }
+            }
+            else
+            {
+                offset = 0;
+                break;
+            }
+        }
+    }
+    else if (offset < 0)
+    {
+        for (;;)
+        {
+            assertValid(x);
+            assert(offset <= 0);
+
+            if (x.row >= static_cast<uint64_t>(-offset))
+            {
+                x.row -= -offset;
+                offset = 0;
+                break;
+            }
+
+            // Move to the first row in current block. Note that the offset is
+            // negative.
+            offset += x.row;
+            x.row = 0;
+
+            // Move to the last row of the previous block, if we are not at the
+            // first one. Offset also is incremented by one, because we pass over
+            // the first row of this block.
+            if (x.block == first_block_number)
+            {
+                break;
+            }
+
+            --x.block;
+            offset += 1;
+            x.row = blockRowsNumber(x) - 1;
+        }
+    }
+
+    return std::tuple{x, offset};
+}
+
+auto WindowTransform::moveRowNumber(const RowNumber & _x, int offset) const
+{
+    auto [x, o] = moveRowNumberNoCheck(_x, offset);
+
+#ifndef NDEBUG
+    // Check that it was reversible.
+    auto [xx, oo] = moveRowNumberNoCheck(x, -(offset - o));
+
+//    fmt::print(stderr, "{} -> {}, result {}, {}, new offset {}, twice {}, {}\n",
+//        _x, offset, x, o, -(offset - o), xx, oo);
+    assert(xx == _x);
+    assert(oo == 0);
+#endif
+
+    return std::tuple{x, o};
+}
+
+
+void WindowTransform::advanceFrameStartRowsOffset()
+{
+    // Just recalculate it each time by walking blocks.
+    const auto [moved_row, offset_left] = moveRowNumber(current_row,
+        window_description.frame.begin_offset);
+
+    frame_start = moved_row;
+
+    assertValid(frame_start);
+
+//    fmt::print(stderr, "frame start {} left {} partition start {}\n",
+//        frame_start, offset_left, partition_start);
+
+    if (frame_start <= partition_start)
+    {
+        // Got to the beginning of partition and can't go further back.
+        frame_start = partition_start;
+        frame_started = true;
+        return;
+    }
+
+    if (partition_end <= frame_start)
+    {
+        // A FOLLOWING frame start ran into the end of partition.
+        frame_start = partition_end;
+        frame_started = partition_ended;
+        return;
+    }
+
+    // Handled the equality case above. Now the frame start is inside the
+    // partition, if we walked all the offset, it's final.
+    assert(partition_start < frame_start);
+    frame_started = offset_left == 0;
+
+    // If we ran into the start of data (offset left is negative), we won't be
+    // able to make progress. Should have handled this case above.
+    assert(offset_left >= 0);
+}
+
+void WindowTransform::advanceFrameStartChoose()
+{
+    switch (window_description.frame.begin_type)
+    {
+        case WindowFrame::BoundaryType::Unbounded:
+            // UNBOUNDED PRECEDING, just mark it valid. It is initialized when
+            // the new partition starts.
+            frame_started = true;
+            return;
+        case WindowFrame::BoundaryType::Current:
+            // CURRENT ROW differs between frame types only in how the peer
+            // groups are accounted.
+            assert(partition_start <= peer_group_start);
+            assert(peer_group_start < partition_end);
+            assert(peer_group_start <= current_row);
+            frame_start = peer_group_start;
+            frame_started = true;
+            return;
+        case WindowFrame::BoundaryType::Offset:
+            switch (window_description.frame.type)
+            {
+                case WindowFrame::FrameType::Rows:
+                    advanceFrameStartRowsOffset();
+                    return;
+                default:
+                    // Fallthrough to the "not implemented" error.
+                    break;
+            }
+            break;
+    }
+
+    throw Exception(ErrorCodes::NOT_IMPLEMENTED,
+        "Frame start type '{}' for frame '{}' is not implemented",
+        WindowFrame::toString(window_description.frame.begin_type),
+        WindowFrame::toString(window_description.frame.type));
+}
+
+void WindowTransform::advanceFrameStart()
+{
+    if (frame_started)
+    {
+        return;
+    }
+
+    const auto frame_start_before = frame_start;
+    advanceFrameStartChoose();
+    assert(frame_start_before <= frame_start);
+    if (frame_start == frame_start_before)
+    {
+        // If the frame start didn't move, this means we validated that the frame
+        // starts at the point we reached earlier but were unable to validate.
+        // This probably only happens in degenerate cases where the frame start
+        // is further than the end of partition, and the partition ends at the
+        // last row of the block, but we can only tell for sure after a new
+        // block arrives. We still have to update the state of aggregate
+        // functions when the frame start becomes valid, so we continue.
+        assert(frame_started);
+    }
+
+    assert(partition_start <= frame_start);
+    assert(frame_start <= partition_end);
+    if (partition_ended && frame_start == partition_end)
+    {
+        // Check that if the start of frame (e.g. FOLLOWING) runs into the end
+        // of partition, it is marked as valid -- we can't advance it any
+        // further.
+        assert(frame_started);
    }
 }

@ -257,18 +438,15 @@ void WindowTransform::advanceFrameEndCurrentRow()

 //    fmt::print(stderr, "first row {} last {}\n", frame_end.row, rows_end);

-    // We could retreat the frame_end here, but for some reason I am reluctant
-    // to do this... It would have better data locality.
-    auto reference = current_row;
+    // Advance frame_end while it is still peers with the current row.
    for (; frame_end.row < rows_end; ++frame_end.row)
    {
-        if (!arePeers(reference, frame_end))
+        if (!arePeers(current_row, frame_end))
        {
 //            fmt::print(stderr, "{} and {} don't match\n", reference, frame_end);
            frame_ended = true;
            return;
        }
-        reference = frame_end;
    }

    // Might have gotten to the end of the current block, have to properly
@ -291,6 +469,39 @@ void WindowTransform::advanceFrameEndUnbounded()
    frame_ended = partition_ended;
 }

+void WindowTransform::advanceFrameEndRowsOffset()
+{
+    // Walk the specified offset from the current row. The "+1" is needed
+    // because the frame_end is a past-the-end pointer.
+    const auto [moved_row, offset_left] = moveRowNumber(current_row,
+        window_description.frame.end_offset + 1);
+
+    if (partition_end <= moved_row)
+    {
+        // Clamp to the end of partition. It might not have ended yet, in which
+        // case wait for more data.
+        frame_end = partition_end;
+        frame_ended = partition_ended;
+        return;
+    }
+
+    if (moved_row <= partition_start)
+    {
+        // Clamp to the start of partition.
+        frame_end = partition_start;
+        frame_ended = true;
+        return;
+    }
+
+    // Frame end inside partition, if we walked all the offset, it's final.
+    frame_end = moved_row;
+    frame_ended = offset_left == 0;
+
+    // If we ran into the start of data (offset left is negative), we won't be
+    // able to make progress. Should have handled this case above.
+    assert(offset_left >= 0);
+}
+
 void WindowTransform::advanceFrameEnd()
 {
    // No reason for this function to be called again after it succeeded.
@ -301,16 +512,23 @@ void WindowTransform::advanceFrameEnd()
    switch (window_description.frame.end_type)
    {
        case WindowFrame::BoundaryType::Current:
-            // The only frame end we have for now is CURRENT ROW.
            advanceFrameEndCurrentRow();
            break;
        case WindowFrame::BoundaryType::Unbounded:
            advanceFrameEndUnbounded();
            break;
        case WindowFrame::BoundaryType::Offset:
-            throw Exception(ErrorCodes::NOT_IMPLEMENTED,
-                "The frame end type '{}' is not implemented",
-                WindowFrame::toString(window_description.frame.end_type));
+            switch (window_description.frame.type)
+            {
+                case WindowFrame::FrameType::Rows:
+                    advanceFrameEndRowsOffset();
+                    break;
+                default:
+                    throw Exception(ErrorCodes::NOT_IMPLEMENTED,
+                        "The frame end type '{}' is not implemented",
+                        WindowFrame::toString(window_description.frame.end_type));
+            }
+            break;
    }

 //    fmt::print(stderr, "frame_end {} -> {}\n", frame_end_before, frame_end);
@ -321,44 +539,81 @@ void WindowTransform::advanceFrameEnd()
    {
        return;
    }
+}

-    // Add the rows over which we advanced the frame to the aggregate function
-    // states. We could have advanced over at most the entire last block.
-    uint64_t rows_end = frame_end.row;
-    if (frame_end.row == 0)
+// Update the aggregation states after the frame has changed.
+void WindowTransform::updateAggregationState()
+{
+//    fmt::print(stderr, "update agg states [{}, {}) -> [{}, {})\n",
+//        prev_frame_start, prev_frame_end, frame_start, frame_end);
+
+    // Assert that the frame boundaries are known, have proper order wrt each
+    // other, and have not gone back wrt the previous frame.
+    assert(frame_started);
+    assert(frame_ended);
+    assert(frame_start <= frame_end);
+    assert(prev_frame_start <= prev_frame_end);
+    assert(prev_frame_start <= frame_start);
+    assert(prev_frame_end <= frame_end);
+
+    // We might have to reset aggregation state and/or add some rows to it.
+    // Figure out what to do.
+    bool reset_aggregation = false;
+    RowNumber rows_to_add_start;
+    RowNumber rows_to_add_end;
+    if (frame_start == prev_frame_start)
    {
-        assert(frame_end == blocksEnd());
-        rows_end = blockRowsNumber(frame_end_before);
+        // The frame start didn't change, add the tail rows.
+        reset_aggregation = false;
+        rows_to_add_start = prev_frame_end;
+        rows_to_add_end = frame_end;
    }
    else
    {
-        assert(frame_end_before.block == frame_end.block);
+        // The frame start changed, reset the state and aggregate over the
+        // entire frame. This can be made per-function after we learn to
+        // subtract rows from some types of aggregation states, but for now we
+        // always have to reset when the frame start changes.
+        reset_aggregation = true;
+        rows_to_add_start = frame_start;
+        rows_to_add_end = frame_end;
    }
-    // Equality would mean "no data to process", for which we checked above.
-    assert(frame_end_before.row < rows_end);

    for (auto & ws : workspaces)
    {
-        if (frame_end_before.block != ws.cached_block_number)
-        {
-            const auto & block
-                = blocks[frame_end_before.block - first_block_number];
-            ws.argument_columns.clear();
-            for (const auto i : ws.argument_column_indices)
-            {
-                ws.argument_columns.push_back(block.input_columns[i].get());
-            }
-            ws.cached_block_number = frame_end_before.block;
-        }
-
        const auto * a = ws.window_function.aggregate_function.get();
        auto * buf = ws.aggregate_function_state.data();
-        auto * columns = ws.argument_columns.data();
-        for (auto row = frame_end_before.row; row < rows_end; ++row)
+
+        if (reset_aggregation)
        {
-            a->add(buf, columns, row, arena.get());
+//            fmt::print(stderr, "(2) reset aggregation\n");
+            a->destroy(buf);
+            a->create(buf);
+        }
+
+        for (auto row = rows_to_add_start; row < rows_to_add_end;
+            advanceRowNumber(row))
+        {
+            if (row.block != ws.cached_block_number)
+            {
+                const auto & block
+                    = blocks[row.block - first_block_number];
+                ws.argument_columns.clear();
+                for (const auto i : ws.argument_column_indices)
+                {
+                    ws.argument_columns.push_back(block.input_columns[i].get());
+                }
+                ws.cached_block_number = row.block;
+            }
+
+//            fmt::print(stderr, "(2) add row {}\n", row);
+            auto * columns = ws.argument_columns.data();
+            a->add(buf, columns, row.row, arena.get());
        }
    }
+
+    prev_frame_start = frame_start;
+    prev_frame_end = frame_end;
 }

 void WindowTransform::writeOutCurrentRow()
@ -414,8 +669,8 @@ void WindowTransform::appendChunk(Chunk & chunk)
    for (;;)
    {
        advancePartitionEnd();
-//        fmt::print(stderr, "partition [?, {}), {}\n",
-//            partition_end, partition_ended);
+//        fmt::print(stderr, "partition [{}, {}), {}\n",
+//            partition_start, partition_end, partition_ended);

        // Either we ran out of data or we found the end of partition (maybe
        // both, but this only happens at the total end of data).
@ -430,15 +685,38 @@ void WindowTransform::appendChunk(Chunk & chunk)
        // which is precisely the definition of `partition_end`.
        while (current_row < partition_end)
        {
-            // Advance the frame start, updating the state of the aggregate
-            // functions.
-            advanceFrameStart();
-            // Advance the frame end, updating the state of the aggregate
-            // functions.
-            advanceFrameEnd();
+//            fmt::print(stderr, "(1) row {} frame [{}, {}) {}, {}\n",
+//                current_row, frame_start, frame_end,
+//                frame_started, frame_ended);

-//            fmt::print(stderr, "row {} frame [{}, {}) {}\n",
-//                current_row, frame_start, frame_end, frame_ended);
+            // We now know that the current row is valid, so we can update the
+            // peer group start.
+            if (!arePeers(peer_group_start, current_row))
+            {
+                peer_group_start = current_row;
+            }
+
+            // Advance the frame start.
+            advanceFrameStart();
+
+            if (!frame_started)
+            {
+                // Wait for more input data to find the start of frame.
+                assert(!input_is_finished);
+                assert(!partition_ended);
+                return;
+            }
+
+            // frame_end must be greater or equal than frame_start, so if the
+            // frame_start is already past the current frame_end, we can start
+            // from it to save us some work.
+            if (frame_end < frame_start)
+            {
+                frame_end = frame_start;
+            }
+
+            // Advance the frame end.
+            advanceFrameEnd();

            if (!frame_ended)
            {
@ -448,16 +726,34 @@ void WindowTransform::appendChunk(Chunk & chunk)
                return;
            }

-            // The frame shouldn't be empty (probably?).
-            assert(frame_start < frame_end);
+//            fmt::print(stderr, "(2) row {} frame [{}, {}) {}, {}\n",
+//                current_row, frame_start, frame_end,
+//                frame_started, frame_ended);
+
+            // The frame can be empty sometimes, e.g. the boundaries coincide
+            // or the start is after the partition end. But hopefully start is
+            // not after end.
+            assert(frame_started);
+            assert(frame_ended);
+            assert(frame_start <= frame_end);
+
+            // Now that we know the new frame boundaries, update the aggregation
+            // states. Theoretically we could do this simultaneously with moving
+            // the frame boundaries, but it would require some care not to
+            // perform unnecessary work while we are still looking for the frame
+            // start, so do it the simple way for now.
+            updateAggregationState();

            // Write out the aggregation results.
            writeOutCurrentRow();

            // Move to the next row. The frame will have to be recalculated.
+            // The peer group start is updated at the beginning of the loop,
+            // because current_row might now be past-the-end.
            advanceRowNumber(current_row);
            first_not_ready_row = current_row;
            frame_ended = false;
+            frame_started = false;
        }

        if (input_is_finished)
@ -478,15 +774,18 @@ void WindowTransform::appendChunk(Chunk & chunk)
        }

        // Start the next partition.
-        const auto new_partition_start = partition_end;
+        partition_start = partition_end;
        advanceRowNumber(partition_end);
        partition_ended = false;
        // We have to reset the frame when the new partition starts. This is not a
        // generally correct way to do so, but we don't really support moving frame
        // for now.
-        frame_start = new_partition_start;
-        frame_end = new_partition_start;
-        assert(current_row == new_partition_start);
+        frame_start = partition_start;
+        frame_end = partition_start;
+        prev_frame_start = partition_start;
+        prev_frame_end = partition_start;
+        assert(current_row == partition_start);
+        peer_group_start = partition_start;

 //        fmt::print(stderr, "reinitialize agg data at start of {}\n",
 //            new_partition_start);
@ -534,6 +833,15 @@ IProcessor::Status WindowTransform::prepare()
        return Status::Finished;
    }

+    if (output_data.exception)
+    {
+        // An exception occurred during processing.
+        output.pushData(std::move(output_data));
+        output.finish();
+        input.close();
+        return Status::Finished;
+    }
+
    assert(first_not_ready_row.block >= first_block_number);
    // The first_not_ready_row might be past-the-end if we have already
    // calculated the window functions for all input rows. That's why the
@ -665,6 +973,7 @@ void WindowTransform::work()
        assert(next_output_block_number >= first_block_number);
        assert(frame_start.block >= first_block_number);
        assert(current_row.block >= first_block_number);
+        assert(peer_group_start.block >= first_block_number);
    }
 }

--- a/src/Processors/Transforms/WindowTransform.h
+++ b/src/Processors/Transforms/WindowTransform.h
@ -53,6 +53,11 @@ struct RowNumber
    {
        return block == other.block && row == other.row;
    }
+
+    bool operator <= (const RowNumber & other) const
+    {
+        return *this < other || *this == other;
+    }
 };

 /*
@ -101,11 +106,15 @@ public:

 private:
    void advancePartitionEnd();
-    void advanceFrameStart() const;
-    void advanceFrameEnd();
+    void advanceFrameStart();
+    void advanceFrameStartChoose();
+    void advanceFrameStartRowsOffset();
    void advanceFrameEndCurrentRow();
    void advanceFrameEndUnbounded();
+    void advanceFrameEndRowsOffset();
+    void advanceFrameEnd();
    bool arePeers(const RowNumber & x, const RowNumber & y) const;
+    void updateAggregationState();
    void writeOutCurrentRow();

    Columns & inputAt(const RowNumber & x)
@ -169,9 +178,28 @@ private:
 #endif
    }

+    auto moveRowNumber(const RowNumber & _x, int offset) const;
+    auto moveRowNumberNoCheck(const RowNumber & _x, int offset) const;
+
+    void assertValid(const RowNumber & x) const
+    {
+        assert(x.block >= first_block_number);
+        if (x.block == first_block_number + blocks.size())
+        {
+            assert(x.row == 0);
+        }
+        else
+        {
+            assert(x.row < blockRowsNumber(x));
+        }
+    }
+
    RowNumber blocksEnd() const
    { return RowNumber{first_block_number + blocks.size(), 0}; }

+    RowNumber blocksBegin() const
+    { return RowNumber{first_block_number, 0}; }
+
 public:
    /*
     * Data (formerly) inherited from ISimpleTransform, needed for the
@ -217,18 +245,26 @@ public:
    // Used to determine which resulting blocks we can pass to the consumer.
    RowNumber first_not_ready_row;

-    // We don't keep the pointer to start of partition, because we don't really
-    // need it, and we want to be able to drop the starting blocks to save memory.
-    // The `partition_end` is past-the-end, as usual. When partition_ended = false,
-    // it still haven't ended, and partition_end is the next row to check.
+    // Boundaries of the current partition.
+    // partition_start doesn't point to a valid block, because we want to drop
+    // the blocks early to save memory. We still have to track it so that we can
+    // cut off a PRECEDING frame at the partition start.
+    // The `partition_end` is past-the-end, as usual. When
+    // partition_ended = false, it still haven't ended, and partition_end is the
+    // next row to check.
+    RowNumber partition_start;
    RowNumber partition_end;
    bool partition_ended = false;

-    // This is the row for which we are computing the window functions now.
+    // The row for which we are now computing the window functions.
    RowNumber current_row;
+    // The start of current peer group, needed for CURRENT ROW frame start.
+    // For ROWS frame, always equal to the current row, and for RANGE and GROUP
+    // frames may be earlier.
+    RowNumber peer_group_start;

-    // The frame is [frame_start, frame_end) if frame_ended, and unknown
-    // otherwise. Note that when we move to the next row, both the
+    // The frame is [frame_start, frame_end) if frame_ended && frame_started,
+    // and unknown otherwise. Note that when we move to the next row, both the
    // frame_start and the frame_end may jump forward by an unknown amount of
    // blocks, e.g. if we use a RANGE frame. This means that sometimes we don't
    // know neither frame_end nor frame_start.
@ -239,6 +275,13 @@ public:
    RowNumber frame_start;
    RowNumber frame_end;
    bool frame_ended = false;
+    bool frame_started = false;
+
+    // The previous frame boundaries that correspond to the current state of the
+    // aggregate function. We use them to determine how to update the aggregation
+    // state after we find the new frame.
+    RowNumber prev_frame_start;
+    RowNumber prev_frame_end;
 };

 }
--- a/src/Server/HTTPHandler.cpp
+++ b/src/Server/HTTPHandler.cpp
@ -219,8 +219,11 @@ void HTTPHandler::pushDelayedResults(Output & used_output)
        }
    }

-    ConcatReadBuffer concat_read_buffer(read_buffers_raw_ptr);
-    copyData(concat_read_buffer, *used_output.out_maybe_compressed);
+    if (!read_buffers_raw_ptr.empty())
+    {
+        ConcatReadBuffer concat_read_buffer(read_buffers_raw_ptr);
+        copyData(concat_read_buffer, *used_output.out_maybe_compressed);
+    }
 }


--- a/src/Server/TCPHandler.cpp
+++ b/src/Server/TCPHandler.cpp
@ -24,6 +24,7 @@
 #include <Interpreters/OpenTelemetrySpanLog.h>
 #include <Storages/StorageMemory.h>
 #include <Storages/StorageReplicatedMergeTree.h>
+#include <Storages/MergeTree/MergeTreeDataPartUUID.h>
 #include <Core/ExternalTable.h>
 #include <Storages/ColumnDefault.h>
 #include <DataTypes/DataTypeLowCardinality.h>
@ -180,10 +181,16 @@ void TCPHandler::runImpl()

            /** If Query - process it. If Ping or Cancel - go back to the beginning.
             *  There may come settings for a separate query that modify `query_context`.
+             *  It's possible to receive part uuids packet before the query, so then receivePacket has to be called twice.
             */
            if (!receivePacket())
                continue;

+            /** If part_uuids got received in previous packet, trying to read again.
+              */
+            if (state.empty() && state.part_uuids && !receivePacket())
+                continue;
+
            query_scope.emplace(*query_context);

            send_exception_with_stack_trace = query_context->getSettingsRef().calculate_text_stack_trace;
@ -528,6 +535,10 @@ void TCPHandler::processOrdinaryQuery()
    /// Pull query execution result, if exists, and send it to network.
    if (state.io.in)
    {
+
+        if (query_context->getSettingsRef().allow_experimental_query_deduplication)
+            sendPartUUIDs();
+
        /// This allows the client to prepare output format
        if (Block header = state.io.in->getHeader())
            sendData(header);
@ -592,6 +603,9 @@ void TCPHandler::processOrdinaryQueryWithProcessors()
 {
    auto & pipeline = state.io.pipeline;

+    if (query_context->getSettingsRef().allow_experimental_query_deduplication)
+        sendPartUUIDs();
+
    /// Send header-block, to allow client to prepare output format for data to send.
    {
        const auto & header = pipeline.getHeader();
@ -693,6 +707,20 @@ void TCPHandler::receiveUnexpectedTablesStatusRequest()
    throw NetException("Unexpected packet TablesStatusRequest received from client", ErrorCodes::UNEXPECTED_PACKET_FROM_CLIENT);
 }

+void TCPHandler::sendPartUUIDs()
+{
+    auto uuids = query_context->getPartUUIDs()->get();
+    if (!uuids.empty())
+    {
+        for (const auto & uuid : uuids)
+            LOG_TRACE(log, "Sending UUID: {}", toString(uuid));
+
+        writeVarUInt(Protocol::Server::PartUUIDs, *out);
+        writeVectorBinary(uuids, *out);
+        out->next();
+    }
+}
+
 void TCPHandler::sendProfileInfo(const BlockStreamProfileInfo & info)
 {
    writeVarUInt(Protocol::Server::ProfileInfo, *out);
@ -905,6 +933,10 @@ bool TCPHandler::receivePacket()

    switch (packet_type)
    {
+        case Protocol::Client::IgnoredPartUUIDs:
+            /// Part uuids packet if any comes before query.
+            receiveIgnoredPartUUIDs();
+            return true;
        case Protocol::Client::Query:
            if (!state.empty())
                receiveUnexpectedQuery();
@ -940,6 +972,16 @@ bool TCPHandler::receivePacket()
    }
 }

+void TCPHandler::receiveIgnoredPartUUIDs()
+{
+    state.part_uuids = true;
+    std::vector<UUID> uuids;
+    readVectorBinary(uuids, *in);
+
+    if (!uuids.empty())
+        query_context->getIgnoredPartUUIDs()->add(uuids);
+}
+
 void TCPHandler::receiveClusterNameAndSalt()
 {
    readStringBinary(cluster, *in);
--- a/src/Server/TCPHandler.h
+++ b/src/Server/TCPHandler.h
@ -67,6 +67,9 @@ struct QueryState
    /// Temporary tables read
    bool temporary_tables_read = false;

+    /// A state got uuids to exclude from a query
+    bool part_uuids = false;
+
    /// Request requires data from client for function input()
    bool need_receive_data_for_input = false;
    /// temporary place for incoming data block for input()
@ -173,6 +176,7 @@ private:
    void receiveHello();
    bool receivePacket();
    void receiveQuery();
+    void receiveIgnoredPartUUIDs();
    bool receiveData(bool scalar);
    bool readDataNext(const size_t & poll_interval, const int & receive_timeout);
    void readData(const Settings & connection_settings);
@ -201,6 +205,7 @@ private:
    void sendProgress();
    void sendLogs();
    void sendEndOfStream();
+    void sendPartUUIDs();
    void sendProfileInfo(const BlockStreamProfileInfo & info);
    void sendTotals(const Block & totals);
    void sendExtremes(const Block & extremes);
--- a/src/Storages/MergeTree/MergeTreeBaseSelectProcessor.cpp
+++ b/src/Storages/MergeTree/MergeTreeBaseSelectProcessor.cpp
@ -7,6 +7,7 @@
 #include <Common/typeid_cast.h>
 #include <DataTypes/DataTypeNothing.h>
 #include <DataTypes/DataTypeNullable.h>
+#include <DataTypes/DataTypeUUID.h>


 namespace DB
@ -205,6 +206,7 @@ namespace

        virtual void insertStringColumn(const ColumnPtr & column, const String & name) = 0;
        virtual void insertUInt64Column(const ColumnPtr & column, const String & name) = 0;
+        virtual void insertUUIDColumn(const ColumnPtr & column, const String & name) = 0;
    };
 }

@ -241,6 +243,16 @@ static void injectVirtualColumnsImpl(size_t rows, VirtualColumnsInserter & inser

                inserter.insertUInt64Column(column, virtual_column_name);
            }
+            else if (virtual_column_name == "_part_uuid")
+            {
+                ColumnPtr column;
+                if (rows)
+                    column = DataTypeUUID().createColumnConst(rows, task->data_part->uuid)->convertToFullColumnIfConst();
+                else
+                    column = DataTypeUUID().createColumn();
+
+                inserter.insertUUIDColumn(column, virtual_column_name);
+            }
            else if (virtual_column_name == "_partition_id")
            {
                ColumnPtr column;
@ -271,6 +283,11 @@ namespace
            block.insert({column, std::make_shared<DataTypeUInt64>(), name});
        }

+        void insertUUIDColumn(const ColumnPtr & column, const String & name) final
+        {
+            block.insert({column, std::make_shared<DataTypeUUID>(), name});
+        }
+
        Block & block;
    };

@ -288,6 +305,10 @@ namespace
            columns.push_back(column);
        }

+        void insertUUIDColumn(const ColumnPtr & column, const String &) final
+        {
+            columns.push_back(column);
+        }
        Columns & columns;
    };
 }
--- a/src/Storages/MergeTree/MergeTreeData.cpp
+++ b/src/Storages/MergeTree/MergeTreeData.cpp
@ -4,6 +4,7 @@
 #include <DataTypes/DataTypeDate.h>
 #include <DataTypes/DataTypeDateTime.h>
 #include <DataTypes/DataTypeEnum.h>
+#include <DataTypes/DataTypeUUID.h>
 #include <DataTypes/DataTypeNullable.h>
 #include <DataTypes/DataTypeLowCardinality.h>
 #include <DataTypes/NestedUtils.h>
@ -3620,6 +3621,10 @@ PartitionCommandsResultInfo MergeTreeData::freezePartitionsByMatcher(MatcherFn m
    const auto data_parts = getDataParts();

    String backup_name = (!with_name.empty() ? escapeForFileName(with_name) : toString(increment));
+    String backup_path = shadow_path + backup_name + "/";
+
+    for (const auto & disk : getStoragePolicy()->getDisks())
+        disk->onFreeze(backup_path);

    PartitionCommandsResultInfo result;

@ -3629,12 +3634,10 @@ PartitionCommandsResultInfo MergeTreeData::freezePartitionsByMatcher(MatcherFn m
        if (!matcher(part))
            continue;

-        part->volume->getDisk()->createDirectories(shadow_path);
-
-        String backup_path = shadow_path + backup_name + "/";
-
        LOG_DEBUG(log, "Freezing part {} snapshot will be placed at {}", part->name, backup_path);

+        part->volume->getDisk()->createDirectories(backup_path);
+
        String backup_part_path = backup_path + relative_data_path + part->relative_path;
        if (auto part_in_memory = asInMemoryPart(part))
            part_in_memory->flushToDisk(backup_path + relative_data_path, part->relative_path, metadata_snapshot);
@ -3949,6 +3952,7 @@ NamesAndTypesList MergeTreeData::getVirtuals() const
    return NamesAndTypesList{
        NameAndTypePair("_part", std::make_shared<DataTypeString>()),
        NameAndTypePair("_part_index", std::make_shared<DataTypeUInt64>()),
+        NameAndTypePair("_part_uuid", std::make_shared<DataTypeUUID>()),
        NameAndTypePair("_partition_id", std::make_shared<DataTypeString>()),
        NameAndTypePair("_sample_factor", std::make_shared<DataTypeFloat64>()),
    };
--- a/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
+++ b/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
@ -1234,7 +1234,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mutatePartToTempor
            if (files_to_skip.count(it->name()))
                continue;

-            String destination = new_part_tmp_path + "/";
+            String destination = new_part_tmp_path;
            String file_name = it->name();
            auto rename_it = std::find_if(files_to_rename.begin(), files_to_rename.end(), [&file_name](const auto & rename_pair) { return rename_pair.first == file_name; });
            if (rename_it != files_to_rename.end())
--- a/src/Storages/MergeTree/MergeTreeDataPartUUID.cpp
+++ b/src/Storages/MergeTree/MergeTreeDataPartUUID.cpp
@ -0,0 +1,38 @@
+#include <Storages/MergeTree/MergeTreeDataPartUUID.h>
+
+namespace DB
+{
+
+std::vector<UUID> PartUUIDs::add(const std::vector<UUID> & new_uuids)
+{
+    std::lock_guard lock(mutex);
+    std::vector<UUID> intersection;
+
+    /// First check any presence of uuids in a uuids, return duplicates back if any
+    for (const auto & uuid : new_uuids)
+    {
+        if (uuids.find(uuid) != uuids.end())
+            intersection.emplace_back(uuid);
+    }
+
+    if (intersection.empty())
+    {
+        for (const auto & uuid : new_uuids)
+            uuids.emplace(uuid);
+    }
+    return intersection;
+}
+
+std::vector<UUID> PartUUIDs::get() const
+{
+    std::lock_guard lock(mutex);
+    return std::vector<UUID>(uuids.begin(), uuids.end());
+}
+
+bool PartUUIDs::has(const UUID & uuid) const
+{
+    std::lock_guard lock(mutex);
+    return uuids.find(uuid) != uuids.end();
+}
+
+}
--- a/src/Storages/MergeTree/MergeTreeDataPartUUID.h
+++ b/src/Storages/MergeTree/MergeTreeDataPartUUID.h
@ -0,0 +1,34 @@
+#pragma once
+
+#include <memory>
+#include <mutex>
+#include <unordered_set>
+#include <Core/UUID.h>
+
+namespace DB
+{
+
+/** PartUUIDs is a uuid set to control query deduplication.
+ * The object is used in query context in both direction:
+ *  Server->Client to send all parts' UUIDs that have been read during the query
+ *  Client->Server to ignored specified parts from being processed.
+ *
+ *  Current implementation assumes a user setting allow_experimental_query_deduplication=1 is set.
+ */
+struct PartUUIDs
+{
+public:
+    /// Add new UUIDs if not duplicates found otherwise return duplicated UUIDs
+    std::vector<UUID> add(const std::vector<UUID> & uuids);
+    /// Get accumulated UUIDs
+    std::vector<UUID> get() const;
+    bool has(const UUID & uuid) const;
+
+private:
+    mutable std::mutex mutex;
+    std::unordered_set<UUID> uuids;
+};
+
+using PartUUIDsPtr = std::shared_ptr<PartUUIDs>;
+
+}
--- a/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp
+++ b/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp
@ -6,7 +6,6 @@
 #include <Poco/File.h>

 #include <Common/FieldVisitors.h>
-#include <Storages/MergeTree/PartitionPruner.h>
 #include <Storages/MergeTree/MergeTreeDataSelectExecutor.h>
 #include <Storages/MergeTree/MergeTreeSelectProcessor.h>
 #include <Storages/MergeTree/MergeTreeReverseSelectProcessor.h>
@ -15,6 +14,7 @@
 #include <Storages/MergeTree/MergeTreeIndices.h>
 #include <Storages/MergeTree/MergeTreeIndexReader.h>
 #include <Storages/MergeTree/KeyCondition.h>
+#include <Storages/MergeTree/MergeTreeDataPartUUID.h>
 #include <Storages/ReadInOrderOptimizer.h>
 #include <Parsers/ASTIdentifier.h>
 #include <Parsers/ASTLiteral.h>
@ -35,8 +35,10 @@
 #include <Processors/QueryPlan/MergingFinal.h>
 #include <Processors/QueryPlan/ReadNothingStep.h>

+#include <Core/UUID.h>
 #include <DataTypes/DataTypeDate.h>
 #include <DataTypes/DataTypeEnum.h>
+#include <DataTypes/DataTypeUUID.h>
 #include <DataTypes/DataTypesNumber.h>
 #include <Storages/VirtualColumnUtils.h>

@ -61,6 +63,7 @@ namespace ErrorCodes
    extern const int TOO_MANY_ROWS;
    extern const int CANNOT_PARSE_TEXT;
    extern const int TOO_MANY_PARTITIONS;
+    extern const int DUPLICATED_PART_UUIDS;
 }


@ -71,14 +74,27 @@ MergeTreeDataSelectExecutor::MergeTreeDataSelectExecutor(const MergeTreeData & d


 /// Construct a block consisting only of possible values of virtual columns
-static Block getBlockWithPartColumn(const MergeTreeData::DataPartsVector & parts)
+static Block getBlockWithVirtualPartColumns(const MergeTreeData::DataPartsVector & parts, bool with_uuid)
 {
-    auto column = ColumnString::create();
+    auto part_column = ColumnString::create();
+    auto part_uuid_column = ColumnUUID::create();

    for (const auto & part : parts)
-        column->insert(part->name);
+    {
+        part_column->insert(part->name);
+        if (with_uuid)
+            part_uuid_column->insert(part->uuid);
+    }

-    return Block{ColumnWithTypeAndName(std::move(column), std::make_shared<DataTypeString>(), "_part")};
+    if (with_uuid)
+    {
+        return Block(std::initializer_list<ColumnWithTypeAndName>{
+            ColumnWithTypeAndName(std::move(part_column), std::make_shared<DataTypeString>(), "_part"),
+            ColumnWithTypeAndName(std::move(part_uuid_column), std::make_shared<DataTypeUUID>(), "_part_uuid"),
+        });
+    }
+
+    return Block{ColumnWithTypeAndName(std::move(part_column), std::make_shared<DataTypeString>(), "_part")};
 }


@ -162,6 +178,7 @@ QueryPlanPtr MergeTreeDataSelectExecutor::readFromParts(
    Names real_column_names;

    bool part_column_queried = false;
+    bool part_uuid_column_queried = false;

    bool sample_factor_column_queried = false;
    Float64 used_sample_factor = 1;
@ -181,6 +198,11 @@ QueryPlanPtr MergeTreeDataSelectExecutor::readFromParts(
        {
            virt_column_names.push_back(name);
        }
+        else if (name == "_part_uuid")
+        {
+            part_uuid_column_queried = true;
+            virt_column_names.push_back(name);
+        }
        else if (name == "_sample_factor")
        {
            sample_factor_column_queried = true;
@ -198,9 +220,9 @@ QueryPlanPtr MergeTreeDataSelectExecutor::readFromParts(
    if (real_column_names.empty())
        real_column_names.push_back(ExpressionActions::getSmallestColumn(available_real_columns));

-    /// If `_part` virtual column is requested, we try to use it as an index.
-    Block virtual_columns_block = getBlockWithPartColumn(parts);
-    if (part_column_queried)
+    /// If `_part` or `_part_uuid` virtual columns are requested, we try to filter out data by them.
+    Block virtual_columns_block = getBlockWithVirtualPartColumns(parts, part_uuid_column_queried);
+    if (part_column_queried || part_uuid_column_queried)
        VirtualColumnUtils::filterBlockWithQuery(query_info.query, virtual_columns_block, context);

    auto part_values = VirtualColumnUtils::extractSingleValueFromBlock<String>(virtual_columns_block, "_part");
@ -244,40 +266,13 @@ QueryPlanPtr MergeTreeDataSelectExecutor::readFromParts(
        }
    }

-    /// Select the parts in which there can be data that satisfy `minmax_idx_condition` and that match the condition on `_part`,
-    ///  as well as `max_block_number_to_read`.
-    {
-        auto prev_parts = parts;
-        parts.clear();
+    const Context & query_context = context.hasQueryContext() ? context.getQueryContext() : context;

-        for (const auto & part : prev_parts)
-        {
-            if (part_values.find(part->name) == part_values.end())
-                continue;
+    if (query_context.getSettingsRef().allow_experimental_query_deduplication)
+        selectPartsToReadWithUUIDFilter(parts, part_values, minmax_idx_condition, partition_pruner, max_block_numbers_to_read, query_context);
+    else
+        selectPartsToRead(parts, part_values, minmax_idx_condition, partition_pruner, max_block_numbers_to_read);

-            if (part->isEmpty())
-                continue;
-
-            if (minmax_idx_condition && !minmax_idx_condition->checkInHyperrectangle(
-                    part->minmax_idx.hyperrectangle, data.minmax_idx_column_types).can_be_true)
-                continue;
-
-            if (partition_pruner)
-            {
-                if (partition_pruner->canBePruned(part))
-                    continue;
-            }
-
-            if (max_block_numbers_to_read)
-            {
-                auto blocks_iterator = max_block_numbers_to_read->find(part->info.partition_id);
-                if (blocks_iterator == max_block_numbers_to_read->end() || part->info.max_block > blocks_iterator->second)
-                    continue;
-            }
-
-            parts.push_back(part);
-        }
-    }

    /// Sampling.
    Names column_names_to_read = real_column_names;
@ -1849,5 +1844,134 @@ MarkRanges MergeTreeDataSelectExecutor::filterMarksUsingIndex(
    return res;
 }

+void MergeTreeDataSelectExecutor::selectPartsToRead(
+    MergeTreeData::DataPartsVector & parts,
+    const std::unordered_set<String> & part_values,
+    const std::optional<KeyCondition> & minmax_idx_condition,
+    std::optional<PartitionPruner> & partition_pruner,
+    const PartitionIdToMaxBlock * max_block_numbers_to_read) const
+{
+    auto prev_parts = parts;
+    parts.clear();
+
+    for (const auto & part : prev_parts)
+    {
+        if (part_values.find(part->name) == part_values.end())
+            continue;
+
+        if (part->isEmpty())
+            continue;
+
+        if (minmax_idx_condition && !minmax_idx_condition->checkInHyperrectangle(
+                part->minmax_idx.hyperrectangle, data.minmax_idx_column_types).can_be_true)
+            continue;
+
+        if (partition_pruner)
+        {
+            if (partition_pruner->canBePruned(part))
+                continue;
+        }
+
+        if (max_block_numbers_to_read)
+        {
+            auto blocks_iterator = max_block_numbers_to_read->find(part->info.partition_id);
+            if (blocks_iterator == max_block_numbers_to_read->end() || part->info.max_block > blocks_iterator->second)
+                continue;
+        }
+
+        parts.push_back(part);
+    }
+}
+
+void MergeTreeDataSelectExecutor::selectPartsToReadWithUUIDFilter(
+    MergeTreeData::DataPartsVector & parts,
+    const std::unordered_set<String> & part_values,
+    const std::optional<KeyCondition> & minmax_idx_condition,
+    std::optional<PartitionPruner> & partition_pruner,
+    const PartitionIdToMaxBlock * max_block_numbers_to_read,
+    const Context & query_context) const
+{
+    /// const_cast to add UUIDs to context. Bad practice.
+    Context & non_const_context = const_cast<Context &>(query_context);
+
+    /// process_parts prepare parts that have to be read for the query,
+    /// returns false if duplicated parts' UUID have been met
+    auto select_parts = [&] (MergeTreeData::DataPartsVector & selected_parts) -> bool
+    {
+        auto ignored_part_uuids = non_const_context.getIgnoredPartUUIDs();
+        std::unordered_set<UUID> temp_part_uuids;
+
+        auto prev_parts = selected_parts;
+        selected_parts.clear();
+
+        for (const auto & part : prev_parts)
+        {
+            if (part_values.find(part->name) == part_values.end())
+                continue;
+
+            if (part->isEmpty())
+                continue;
+
+            if (minmax_idx_condition
+                && !minmax_idx_condition->checkInHyperrectangle(part->minmax_idx.hyperrectangle, data.minmax_idx_column_types)
+                        .can_be_true)
+                continue;
+
+            if (partition_pruner)
+            {
+                if (partition_pruner->canBePruned(part))
+                    continue;
+            }
+
+            if (max_block_numbers_to_read)
+            {
+                auto blocks_iterator = max_block_numbers_to_read->find(part->info.partition_id);
+                if (blocks_iterator == max_block_numbers_to_read->end() || part->info.max_block > blocks_iterator->second)
+                    continue;
+            }
+
+            /// populate UUIDs and exclude ignored parts if enabled
+            if (part->uuid != UUIDHelpers::Nil)
+            {
+                /// Skip the part if its uuid is meant to be excluded
+                if (ignored_part_uuids->has(part->uuid))
+                    continue;
+
+                auto result = temp_part_uuids.insert(part->uuid);
+                if (!result.second)
+                    throw Exception("Found a part with the same UUID on the same replica.", ErrorCodes::LOGICAL_ERROR);
+            }
+
+            selected_parts.push_back(part);
+        }
+
+        if (!temp_part_uuids.empty())
+        {
+            auto duplicates = non_const_context.getPartUUIDs()->add(std::vector<UUID>{temp_part_uuids.begin(), temp_part_uuids.end()});
+            if (!duplicates.empty())
+            {
+                /// on a local replica with prefer_localhost_replica=1 if any duplicates appeared during the first pass,
+                /// adding them to the exclusion, so they will be skipped on second pass
+                non_const_context.getIgnoredPartUUIDs()->add(duplicates);
+                return false;
+            }
+        }
+
+        return true;
+    };
+
+    /// Process parts that have to be read for a query.
+    auto needs_retry = !select_parts(parts);
+
+    /// If any duplicated part UUIDs met during the first step, try to ignore them in second pass
+    if (needs_retry)
+    {
+        LOG_DEBUG(log, "Found duplicate uuids locally, will retry part selection without them");
+
+        /// Second attempt didn't help, throw an exception
+        if (!select_parts(parts))
+            throw Exception("Found duplicate UUIDs while processing query.", ErrorCodes::DUPLICATED_PART_UUIDS);
+    }
+}

 }
--- a/src/Storages/MergeTree/MergeTreeDataSelectExecutor.h
+++ b/src/Storages/MergeTree/MergeTreeDataSelectExecutor.h
@ -4,6 +4,7 @@
 #include <Storages/SelectQueryInfo.h>
 #include <Storages/MergeTree/MergeTreeData.h>
 #include <Storages/MergeTree/RangesInDataPart.h>
+#include <Storages/MergeTree/PartitionPruner.h>


 namespace DB
@ -113,6 +114,24 @@ private:
        const Settings & settings,
        const MergeTreeReaderSettings & reader_settings,
        Poco::Logger * log);
+
+    /// Select the parts in which there can be data that satisfy `minmax_idx_condition` and that match the condition on `_part`,
+    ///  as well as `max_block_number_to_read`.
+    void selectPartsToRead(
+        MergeTreeData::DataPartsVector & parts,
+        const std::unordered_set<String> & part_values,
+        const std::optional<KeyCondition> & minmax_idx_condition,
+        std::optional<PartitionPruner> & partition_pruner,
+        const PartitionIdToMaxBlock * max_block_numbers_to_read) const;
+
+    /// Same as previous but also skip parts uuids if any to the query context, or skip parts which uuids marked as excluded.
+    void selectPartsToReadWithUUIDFilter(
+        MergeTreeData::DataPartsVector & parts,
+        const std::unordered_set<String> & part_values,
+        const std::optional<KeyCondition> & minmax_idx_condition,
+        std::optional<PartitionPruner> & partition_pruner,
+        const PartitionIdToMaxBlock * max_block_numbers_to_read,
+        const Context & query_context) const;
 };

 }
--- a/src/Storages/MergeTree/MergeTreeWhereOptimizer.cpp
+++ b/src/Storages/MergeTree/MergeTreeWhereOptimizer.cpp
@ -30,7 +30,7 @@ static constexpr auto threshold = 2;
 MergeTreeWhereOptimizer::MergeTreeWhereOptimizer(
    SelectQueryInfo & query_info,
    const Context & context,
-    const MergeTreeData & data,
+    std::unordered_map<std::string, UInt64> column_sizes_,
    const StorageMetadataPtr & metadata_snapshot,
    const Names & queried_columns_,
    Poco::Logger * log_)
@ -39,28 +39,20 @@ MergeTreeWhereOptimizer::MergeTreeWhereOptimizer(
    , queried_columns{queried_columns_}
    , block_with_constants{KeyCondition::getBlockWithConstants(query_info.query, query_info.syntax_analyzer_result, context)}
    , log{log_}
+    , column_sizes{std::move(column_sizes_)}
 {
    const auto & primary_key = metadata_snapshot->getPrimaryKey();
    if (!primary_key.column_names.empty())
        first_primary_key_column = primary_key.column_names[0];

-    calculateColumnSizes(data, queried_columns);
+    for (const auto & [_, size] : column_sizes)
+        total_size_of_queried_columns += size;
+
    determineArrayJoinedNames(query_info.query->as<ASTSelectQuery &>());
    optimize(query_info.query->as<ASTSelectQuery &>());
 }


-void MergeTreeWhereOptimizer::calculateColumnSizes(const MergeTreeData & data, const Names & column_names)
-{
-    for (const auto & column_name : column_names)
-    {
-        UInt64 size = data.getColumnCompressedSize(column_name);
-        column_sizes[column_name] = size;
-        total_size_of_queried_columns += size;
-    }
-}
-
-
 static void collectIdentifiersNoSubqueries(const ASTPtr & ast, NameSet & set)
 {
    if (auto opt_name = tryGetIdentifierName(ast))
--- a/src/Storages/MergeTree/MergeTreeWhereOptimizer.h
+++ b/src/Storages/MergeTree/MergeTreeWhereOptimizer.h
@ -33,7 +33,7 @@ public:
    MergeTreeWhereOptimizer(
        SelectQueryInfo & query_info,
        const Context & context,
-        const MergeTreeData & data,
+        std::unordered_map<std::string, UInt64> column_sizes_,
        const StorageMetadataPtr & metadata_snapshot,
        const Names & queried_columns_,
        Poco::Logger * log_);
@ -75,8 +75,6 @@ private:
    /// Transform Conditions list to WHERE or PREWHERE expression.
    static ASTPtr reconstruct(const Conditions & conditions);

-    void calculateColumnSizes(const MergeTreeData & data, const Names & column_names);
-
    void optimizeConjunction(ASTSelectQuery & select, ASTFunction * const fun) const;

    void optimizeArbitrary(ASTSelectQuery & select) const;
--- a/src/Storages/RocksDB/StorageEmbeddedRocksDB.cpp
+++ b/src/Storages/RocksDB/StorageEmbeddedRocksDB.cpp
@ -24,6 +24,7 @@
 #include <Interpreters/Set.h>
 #include <Interpreters/PreparedSets.h>
 #include <Interpreters/TreeRewriter.h>
+#include <Interpreters/convertFieldToType.h>

 #include <Poco/File.h>
 #include <Poco/Path.h>
@ -44,9 +45,12 @@ namespace ErrorCodes
    extern const int NUMBER_OF_ARGUMENTS_DOESNT_MATCH;
 }

+using FieldVectorPtr = std::shared_ptr<FieldVector>;
+

 // returns keys may be filter by condition
-static bool traverseASTFilter(const String & primary_key, const DataTypePtr & primary_key_type, const ASTPtr & elem, const PreparedSets & sets, FieldVector & res)
+static bool traverseASTFilter(
+    const String & primary_key, const DataTypePtr & primary_key_type, const ASTPtr & elem, const PreparedSets & sets, FieldVectorPtr & res)
 {
    const auto * function = elem->as<ASTFunction>();
    if (!function)
@ -63,13 +67,9 @@ static bool traverseASTFilter(const String & primary_key, const DataTypePtr & pr
    else if (function->name == "or")
    {
        // make sure every child has the key filter condition
-        FieldVector child_res;
        for (const auto & child : function->arguments->children)
-        {
-            if (!traverseASTFilter(primary_key, primary_key_type, child, sets, child_res))
+            if (!traverseASTFilter(primary_key, primary_key_type, child, sets, res))
                return false;
-        }
-        res.insert(res.end(), child_res.begin(), child_res.end());
        return true;
    }
    else if (function->name == "equals" || function->name == "in")
@ -108,9 +108,7 @@ static bool traverseASTFilter(const String & primary_key, const DataTypePtr & pr
            prepared_set->checkColumnsNumber(1);
            const auto & set_column = *prepared_set->getSetElements()[0];
            for (size_t row = 0; row < set_column.size(); ++row)
-            {
-                res.push_back(set_column[row]);
-            }
+                res->push_back(set_column[row]);
            return true;
        }
        else
@ -125,10 +123,12 @@ static bool traverseASTFilter(const String & primary_key, const DataTypePtr & pr
            if (ident->name() != primary_key)
                return false;

-            //function->name == "equals"
+            /// function->name == "equals"
            if (const auto * literal = value->as<ASTLiteral>())
            {
-                res.push_back(literal->value);
+                auto converted_field = convertFieldToType(literal->value, *primary_key_type);
+                if (!converted_field.isNull())
+                    res->push_back(converted_field);
                return true;
            }
        }
@ -140,14 +140,14 @@ static bool traverseASTFilter(const String & primary_key, const DataTypePtr & pr
 /** Retrieve from the query a condition of the form `key = 'key'`, `key in ('xxx_'), from conjunctions in the WHERE clause.
  * TODO support key like search
  */
-static std::pair<FieldVector, bool> getFilterKeys(const String & primary_key, const DataTypePtr & primary_key_type, const SelectQueryInfo & query_info)
+static std::pair<FieldVectorPtr, bool> getFilterKeys(
+    const String & primary_key, const DataTypePtr & primary_key_type, const SelectQueryInfo & query_info)
 {
    const auto & select = query_info.query->as<ASTSelectQuery &>();
    if (!select.where())
-    {
-        return std::make_pair(FieldVector{}, true);
-    }
-    FieldVector res;
+        return {{}, true};
+
+    FieldVectorPtr res = std::make_shared<FieldVector>();
    auto matched_keys = traverseASTFilter(primary_key, primary_key_type, select.where(), query_info.sets, res);
    return std::make_pair(res, !matched_keys);
 }
@ -159,23 +159,19 @@ public:
    EmbeddedRocksDBSource(
        const StorageEmbeddedRocksDB & storage_,
        const StorageMetadataPtr & metadata_snapshot_,
-        const FieldVector & keys_,
-        const size_t start_,
-        const size_t end_,
+        FieldVectorPtr keys_,
+        FieldVector::const_iterator begin_,
+        FieldVector::const_iterator end_,
        const size_t max_block_size_)
        : SourceWithProgress(metadata_snapshot_->getSampleBlock())
        , storage(storage_)
        , metadata_snapshot(metadata_snapshot_)
-        , start(start_)
+        , keys(std::move(keys_))
+        , begin(begin_)
        , end(end_)
+        , it(begin)
        , max_block_size(max_block_size_)
    {
-        // slice the keys
-        if (end > start)
-        {
-            keys.resize(end - start);
-            std::copy(keys_.begin() + start, keys_.begin() + end, keys.begin());
-        }
    }

    String getName() const override
@ -185,27 +181,34 @@ public:

    Chunk generate() override
    {
-        if (processed_keys >= keys.size() || (start == end))
+        if (it >= end)
            return {};

-        std::vector<rocksdb::Slice> slices_keys;
-        slices_keys.reserve(keys.size());
-        std::vector<String> values;
-        std::vector<WriteBufferFromOwnString> wbs(keys.size());
+        size_t num_keys = end - begin;
+
+        std::vector<std::string> serialized_keys(num_keys);
+        std::vector<rocksdb::Slice> slices_keys(num_keys);

        const auto & sample_block = metadata_snapshot->getSampleBlock();
        const auto & key_column = sample_block.getByName(storage.primary_key);
        auto columns = sample_block.cloneEmptyColumns();
        size_t primary_key_pos = sample_block.getPositionByName(storage.primary_key);

-        for (size_t i = processed_keys; i < std::min(keys.size(), processed_keys + max_block_size); ++i)
+        size_t rows_processed = 0;
+        while (it < end && rows_processed < max_block_size)
        {
-            key_column.type->serializeBinary(keys[i], wbs[i]);
-            auto str_ref = wbs[i].stringRef();
-            slices_keys.emplace_back(str_ref.data, str_ref.size);
+            WriteBufferFromString wb(serialized_keys[rows_processed]);
+            key_column.type->serializeBinary(*it, wb);
+            wb.finalize();
+            slices_keys[rows_processed] = std::move(serialized_keys[rows_processed]);
+
+            ++it;
+            ++rows_processed;
        }

+        std::vector<String> values;
        auto statuses = storage.rocksdb_ptr->MultiGet(rocksdb::ReadOptions(), slices_keys, &values);
+
        for (size_t i = 0; i < statuses.size(); ++i)
        {
            if (statuses[i].ok())
@ -221,7 +224,6 @@ public:
                }
            }
        }
-        processed_keys += max_block_size;

        UInt64 num_rows = columns.at(0)->size();
        return Chunk(std::move(columns), num_rows);
@ -231,12 +233,11 @@ private:
    const StorageEmbeddedRocksDB & storage;

    const StorageMetadataPtr metadata_snapshot;
-    const size_t start;
-    const size_t end;
+    FieldVectorPtr keys;
+    FieldVector::const_iterator begin;
+    FieldVector::const_iterator end;
+    FieldVector::const_iterator it;
    const size_t max_block_size;
-    FieldVector keys;
-
-    size_t processed_keys = 0;
 };


@ -289,7 +290,8 @@ Pipe StorageEmbeddedRocksDB::read(
        unsigned num_streams)
 {
    metadata_snapshot->check(column_names, getVirtuals(), getStorageID());
-    FieldVector keys;
+
+    FieldVectorPtr keys;
    bool all_scan = false;

    auto primary_key_data_type = metadata_snapshot->getSampleBlock().getByName(primary_key).type;
@ -302,37 +304,34 @@ Pipe StorageEmbeddedRocksDB::read(
    }
    else
    {
-        if (keys.empty())
+        if (keys->empty())
            return {};

-        std::sort(keys.begin(), keys.end());
-        auto unique_iter = std::unique(keys.begin(), keys.end());
-        if (unique_iter != keys.end())
-            keys.erase(unique_iter, keys.end());
+        std::sort(keys->begin(), keys->end());
+        keys->erase(std::unique(keys->begin(), keys->end()), keys->end());

        Pipes pipes;
-        size_t start = 0;
-        size_t end;

-        const size_t num_threads = std::min(size_t(num_streams), keys.size());
-        const size_t batch_per_size = ceil(keys.size() * 1.0 / num_threads);
+        size_t num_keys = keys->size();
+        size_t num_threads = std::min(size_t(num_streams), keys->size());

-        for (size_t t = 0; t < num_threads; ++t)
+        assert(num_keys <= std::numeric_limits<uint32_t>::max());
+        assert(num_threads <= std::numeric_limits<uint32_t>::max());
+
+        for (size_t thread_idx = 0; thread_idx < num_threads; ++thread_idx)
        {
-            if (start >= keys.size())
-                start = end = 0;
-            else
-                end = start + batch_per_size > keys.size() ? keys.size() : start + batch_per_size;
+            size_t begin = num_keys * thread_idx / num_threads;
+            size_t end = num_keys * (thread_idx + 1) / num_threads;

-            pipes.emplace_back(
-                std::make_shared<EmbeddedRocksDBSource>(*this, metadata_snapshot, keys, start, end, max_block_size));
-            start += batch_per_size;
+            pipes.emplace_back(std::make_shared<EmbeddedRocksDBSource>(
+                    *this, metadata_snapshot, keys, keys->begin() + begin, keys->begin() + end, max_block_size));
        }
        return Pipe::unitePipes(std::move(pipes));
    }
 }

-BlockOutputStreamPtr StorageEmbeddedRocksDB::write(const ASTPtr & /*query*/, const StorageMetadataPtr & metadata_snapshot, const Context & /*context*/)
+BlockOutputStreamPtr StorageEmbeddedRocksDB::write(
+    const ASTPtr & /*query*/, const StorageMetadataPtr & metadata_snapshot, const Context & /*context*/)
 {
    return std::make_shared<EmbeddedRocksDBBlockOutputStream>(*this, metadata_snapshot);
 }
--- a/src/Storages/StorageDistributed.cpp
+++ b/src/Storages/StorageDistributed.cpp
@ -4,6 +4,7 @@
 #include <Disks/IDisk.h>

 #include <DataTypes/DataTypeFactory.h>
+#include <DataTypes/DataTypeUUID.h>
 #include <DataTypes/DataTypesNumber.h>

 #include <Storages/Distributed/DistributedBlockOutputStream.h>
@ -345,6 +346,7 @@ NamesAndTypesList StorageDistributed::getVirtuals() const
            NameAndTypePair("_table", std::make_shared<DataTypeString>()),
            NameAndTypePair("_part", std::make_shared<DataTypeString>()),
            NameAndTypePair("_part_index", std::make_shared<DataTypeUInt64>()),
+            NameAndTypePair("_part_uuid", std::make_shared<DataTypeUUID>()),
            NameAndTypePair("_partition_id", std::make_shared<DataTypeString>()),
            NameAndTypePair("_sample_factor", std::make_shared<DataTypeFloat64>()),
            NameAndTypePair("_shard_num", std::make_shared<DataTypeUInt32>()),
--- a/Show More
+++ b/Show More