Merge branch 'uniq_array' of github.com:PerformanceVision/ClickHouse into uniq_array

2024-09-20 08:40:50 +00:00 · 2019-04-17 17:24:28 +07:00 · 2019-04-17 17:24:28 +07:00 · b8bc308685
commit b8bc308685
parent a96e3c470e 340e1380f4
615 changed files with 16788 additions and 8849 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,3 +1,109 @@
+## ClickHouse release 19.5.2.6, 2019-04-15
+
+### New Features
+
+* [Hyperscan](https://github.com/intel/hyperscan) multiple regular expression matching was added (functions `multiMatchAny`, `multiMatchAnyIndex`, `multiFuzzyMatchAny`, `multiFuzzyMatchAnyIndex`). [#4780](https://github.com/yandex/ClickHouse/pull/4780), [#4841](https://github.com/yandex/ClickHouse/pull/4841) ([Danila Kutenin](https://github.com/danlark1))
+* `multiSearchFirstPosition` function was added. [#4780](https://github.com/yandex/ClickHouse/pull/4780) ([Danila Kutenin](https://github.com/danlark1))
+* Implement the predefined expression filter per row for tables. [#4792](https://github.com/yandex/ClickHouse/pull/4792) ([Ivan](https://github.com/abyss7))
+* A new type of data skipping indices based on bloom filters (can be used for `equal`, `in` and `like` functions). [#4499](https://github.com/yandex/ClickHouse/pull/4499) ([Nikita Vasilev](https://github.com/nikvas0))
+* Added `ASOF JOIN`, which allows running queries that join to the most recent known value. [#4774](https://github.com/yandex/ClickHouse/pull/4774) [#4867](https://github.com/yandex/ClickHouse/pull/4867) [#4863](https://github.com/yandex/ClickHouse/pull/4863) [#4875](https://github.com/yandex/ClickHouse/pull/4875) ([Martijn Bakker](https://github.com/Gladdy), [Artem Zuikov](https://github.com/4ertus2))
+* Rewrite multiple `COMMA JOIN` to `CROSS JOIN`. Then rewrite them to `INNER JOIN` if possible. [#4661](https://github.com/yandex/ClickHouse/pull/4661) ([Artem Zuikov](https://github.com/4ertus2))
+
+### Improvement
+
+* `topK` and `topKWeighted` now support custom `loadFactor` (fixes issue [#4252](https://github.com/yandex/ClickHouse/issues/4252)). [#4634](https://github.com/yandex/ClickHouse/pull/4634) ([Kirill Danshin](https://github.com/kirillDanshin))
+* Allow using `parallel_replicas_count > 1` even for tables without sampling (the setting is simply ignored for them). In previous versions it led to an exception. [#4637](https://github.com/yandex/ClickHouse/pull/4637) ([Alexey Elymanov](https://github.com/digitalist))
+* Support for `CREATE OR REPLACE VIEW`. Allow to create a view or set a new definition in a single statement. [#4654](https://github.com/yandex/ClickHouse/pull/4654) ([Boris Granveaud](https://github.com/bgranvea))
+* `Buffer` table engine now supports `PREWHERE`. [#4671](https://github.com/yandex/ClickHouse/pull/4671) ([Yangkuan Liu](https://github.com/LiuYangkuan))
+* Add ability to start replicated table without metadata in zookeeper in `readonly` mode. [#4691](https://github.com/yandex/ClickHouse/pull/4691) ([alesapin](https://github.com/alesapin))
+* Fixed flicker of progress bar in clickhouse-client. The issue was most noticeable when using `FORMAT Null` with streaming queries. [#4811](https://github.com/yandex/ClickHouse/pull/4811) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Allow to disable functions with `hyperscan` library on per user basis to limit potentially excessive and uncontrolled resource usage. [#4816](https://github.com/yandex/ClickHouse/pull/4816) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Add version number logging in all errors. [#4824](https://github.com/yandex/ClickHouse/pull/4824) ([proller](https://github.com/proller))
+* Added restriction to the `multiMatch` functions which requires string size to fit into `unsigned int`. Also added the number of arguments limit to the `multiSearch` functions. [#4834](https://github.com/yandex/ClickHouse/pull/4834) ([Danila Kutenin](https://github.com/danlark1))
+* Improved usage of scratch space and error handling in Hyperscan. [#4866](https://github.com/yandex/ClickHouse/pull/4866) ([Danila Kutenin](https://github.com/danlark1))
+* Fill `system.graphite_retentions` from a table config of `*GraphiteMergeTree` engine tables. [#4584](https://github.com/yandex/ClickHouse/pull/4584) ([Mikhail f. Shiryaev](https://github.com/Felixoid))
+* Rename `trigramDistance` function to `ngramDistance` and add more functions with `CaseInsensitive` and `UTF`. [#4602](https://github.com/yandex/ClickHouse/pull/4602) ([Danila Kutenin](https://github.com/danlark1))
+* Improved data skipping indices calculation. [#4640](https://github.com/yandex/ClickHouse/pull/4640) ([Nikita Vasilev](https://github.com/nikvas0))
+
+### Bug Fix
+
+* Avoid `std::terminate` in case of memory allocation failure. Now `std::bad_alloc` exception is thrown as expected. [#4665](https://github.com/yandex/ClickHouse/pull/4665) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fix capnproto reading from buffer. Sometimes files weren't loaded successfully over HTTP. [#4674](https://github.com/yandex/ClickHouse/pull/4674) ([Vladislav](https://github.com/smirnov-vs))
+* Fix error `Unknown log entry type: 0` after `OPTIMIZE TABLE FINAL` query. [#4683](https://github.com/yandex/ClickHouse/pull/4683) ([Amos Bird](https://github.com/amosbird))
+* Wrong arguments to `hasAny` or `hasAll` functions may lead to segfault. [#4698](https://github.com/yandex/ClickHouse/pull/4698) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Deadlock may happen while executing `DROP DATABASE dictionary` query. [#4701](https://github.com/yandex/ClickHouse/pull/4701) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fix undefined behavior in `median` and `quantile` functions. [#4702](https://github.com/yandex/ClickHouse/pull/4702) ([hcz](https://github.com/hczhcz))
+* Fix compression level detection when `network_compression_method` is in lowercase. Broken in v19.1. [#4706](https://github.com/yandex/ClickHouse/pull/4706) ([proller](https://github.com/proller))
+* Keep ordinary, `DEFAULT`, `MATERIALIZED` and `ALIAS` columns in a single list (fixes issue [#2867](https://github.com/yandex/ClickHouse/issues/2867)). [#4707](https://github.com/yandex/ClickHouse/pull/4707) ([Alex Zatelepin](https://github.com/ztlpn))
+* Fixed the `<timezone>UTC</timezone>` setting being ignored (fixes issue [#4658](https://github.com/yandex/ClickHouse/issues/4658)). [#4718](https://github.com/yandex/ClickHouse/pull/4718) ([proller](https://github.com/proller))
+* Fix `histogram` function behaviour with `Distributed` tables. [#4741](https://github.com/yandex/ClickHouse/pull/4741) ([olegkv](https://github.com/olegkv))
+* Fixed tsan report `destroy of a locked mutex`. [#4742](https://github.com/yandex/ClickHouse/pull/4742) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fixed TSan report on shutdown due to race condition in system logs usage. Fixed potential use-after-free on shutdown when part_log is enabled. [#4758](https://github.com/yandex/ClickHouse/pull/4758) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fix recheck parts in `ReplicatedMergeTreeAlterThread` in case of error. [#4772](https://github.com/yandex/ClickHouse/pull/4772) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
+* Arithmetic operations on intermediate aggregate function states were not working for constant arguments (such as subquery results). [#4776](https://github.com/yandex/ClickHouse/pull/4776) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Always backquote column names in metadata. Otherwise it's impossible to create a table with column named `index` (server won't restart due to malformed `ATTACH` query in metadata). [#4782](https://github.com/yandex/ClickHouse/pull/4782) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fix crash in `ALTER ... MODIFY ORDER BY` on `Distributed` table. [#4790](https://github.com/yandex/ClickHouse/pull/4790) ([TCeason](https://github.com/TCeason))
+* Fix segfault in `JOIN ON` with enabled `enable_optimize_predicate_expression`. [#4794](https://github.com/yandex/ClickHouse/pull/4794) ([Winter Zhang](https://github.com/zhang2014))
+* Fix bug with adding an extraneous row after consuming a protobuf message from Kafka. [#4808](https://github.com/yandex/ClickHouse/pull/4808) ([Vitaly Baranov](https://github.com/vitlibar))
+* Fix crash of `JOIN` on not-nullable vs nullable column. Fix `NULLs` in right keys in `ANY JOIN` + `join_use_nulls`. [#4815](https://github.com/yandex/ClickHouse/pull/4815) ([Artem Zuikov](https://github.com/4ertus2))
+* Fix segmentation fault in `clickhouse-copier`. [#4835](https://github.com/yandex/ClickHouse/pull/4835) ([proller](https://github.com/proller))
+* Fixed race condition in `SELECT` from `system.tables` if the table is renamed or altered concurrently. [#4836](https://github.com/yandex/ClickHouse/pull/4836) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fixed data race when fetching data part that is already obsolete. [#4839](https://github.com/yandex/ClickHouse/pull/4839) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fixed rare data race that can happen during `RENAME` table of MergeTree family. [#4844](https://github.com/yandex/ClickHouse/pull/4844) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fixed segmentation fault in function `arrayIntersect`. Segmentation fault could happen if function was called with mixed constant and ordinary arguments. [#4847](https://github.com/yandex/ClickHouse/pull/4847) ([Lixiang Qian](https://github.com/fancyqlx))
+* Fixed reading from `Array(LowCardinality)` column in rare case when column contained a long sequence of empty arrays. [#4850](https://github.com/yandex/ClickHouse/pull/4850) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
+* Fix crash in `FULL/RIGHT JOIN` when joining on nullable vs not nullable columns. [#4855](https://github.com/yandex/ClickHouse/pull/4855) ([Artem Zuikov](https://github.com/4ertus2))
+* Fix `No message received` exception while fetching parts between replicas. [#4856](https://github.com/yandex/ClickHouse/pull/4856) ([alesapin](https://github.com/alesapin))
+* Fixed `arrayIntersect` function wrong result in case of several repeated values in single array. [#4871](https://github.com/yandex/ClickHouse/pull/4871) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
+* Fix a race condition during concurrent `ALTER COLUMN` queries that could lead to a server crash (fixes issue [#3421](https://github.com/yandex/ClickHouse/issues/3421)). [#4592](https://github.com/yandex/ClickHouse/pull/4592) ([Alex Zatelepin](https://github.com/ztlpn))
+* Fix incorrect result in `FULL/RIGHT JOIN` with const column. [#4723](https://github.com/yandex/ClickHouse/pull/4723) ([Artem Zuikov](https://github.com/4ertus2))
+* Fix duplicates in `GLOBAL JOIN` with asterisk. [#4705](https://github.com/yandex/ClickHouse/pull/4705) ([Artem Zuikov](https://github.com/4ertus2))
+* Fix parameter deduction in `ALTER MODIFY` of column `CODEC` when column type is not specified. [#4883](https://github.com/yandex/ClickHouse/pull/4883) ([alesapin](https://github.com/alesapin))
+* Functions `cutQueryStringAndFragment()` and `queryStringAndFragment()` now work correctly when `URL` contains a fragment and no query. [#4894](https://github.com/yandex/ClickHouse/pull/4894) ([Vitaly Baranov](https://github.com/vitlibar))
+* Fix rare bug when setting `min_bytes_to_use_direct_io` is greater than zero, which occurs when a thread has to seek backward in a column file. [#4897](https://github.com/yandex/ClickHouse/pull/4897) ([alesapin](https://github.com/alesapin))
+* Fix wrong argument types for aggregate functions with `LowCardinality` arguments (fixes issue [#4919](https://github.com/yandex/ClickHouse/issues/4919)). [#4922](https://github.com/yandex/ClickHouse/pull/4922) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
+* Fix wrong name qualification in `GLOBAL JOIN`. [#4969](https://github.com/yandex/ClickHouse/pull/4969) ([Artem Zuikov](https://github.com/4ertus2))
+* Fix function `toISOWeek` result for year 1970. [#4988](https://github.com/yandex/ClickHouse/pull/4988) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fix `DROP`, `TRUNCATE` and `OPTIMIZE` queries duplication, when executed on `ON CLUSTER` for `ReplicatedMergeTree*` tables family. [#4991](https://github.com/yandex/ClickHouse/pull/4991) ([alesapin](https://github.com/alesapin))
+
+### Backward Incompatible Change
+
+* Rename setting `insert_sample_with_metadata` to setting `input_format_defaults_for_omitted_fields`. [#4771](https://github.com/yandex/ClickHouse/pull/4771) ([Artem Zuikov](https://github.com/4ertus2))
+* Added setting `max_partitions_per_insert_block` (with value 100 by default). If inserted block contains larger number of partitions, an exception is thrown. Set it to 0 if you want to remove the limit (not recommended). [#4845](https://github.com/yandex/ClickHouse/pull/4845) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Multi-search functions were renamed (`multiPosition` to `multiSearchAllPositions`, `multiSearch` to `multiSearchAny`, `firstMatch` to `multiSearchFirstIndex`). [#4780](https://github.com/yandex/ClickHouse/pull/4780) ([Danila Kutenin](https://github.com/danlark1))
+
+### Performance Improvement
+
+* Optimize Volnitsky searcher by inlining, giving about 5-10% search improvement for queries with many needles or many similar bigrams. [#4862](https://github.com/yandex/ClickHouse/pull/4862) ([Danila Kutenin](https://github.com/danlark1))
+* Fix performance issue when setting `use_uncompressed_cache` is greater than zero, which appeared when all read data was contained in cache. [#4913](https://github.com/yandex/ClickHouse/pull/4913) ([alesapin](https://github.com/alesapin))
+
+
+### Build/Testing/Packaging Improvement
+
+* Hardening debug build: more granular memory mappings and ASLR; add memory protection for mark cache and index. This allows finding more memory stomping bugs in cases where ASan and MSan cannot. [#4632](https://github.com/yandex/ClickHouse/pull/4632) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Add support for cmake variables `ENABLE_PROTOBUF`, `ENABLE_PARQUET` and `ENABLE_BROTLI`, which allow enabling/disabling the above features (same as we can do for librdkafka, mysql, etc). [#4669](https://github.com/yandex/ClickHouse/pull/4669) ([Silviu Caragea](https://github.com/silviucpp))
+* Add ability to print process list and stacktraces of all threads if some queries are hung after test run. [#4675](https://github.com/yandex/ClickHouse/pull/4675) ([alesapin](https://github.com/alesapin))
+* Add retries on `Connection loss` error in `clickhouse-test`. [#4682](https://github.com/yandex/ClickHouse/pull/4682) ([alesapin](https://github.com/alesapin))
+* Add freebsd build with vagrant and build with thread sanitizer to packager script. [#4712](https://github.com/yandex/ClickHouse/pull/4712) [#4748](https://github.com/yandex/ClickHouse/pull/4748) ([alesapin](https://github.com/alesapin))
+* The user is now asked for a password for user `'default'` during installation. [#4725](https://github.com/yandex/ClickHouse/pull/4725) ([proller](https://github.com/proller))
+* Suppress warning in `rdkafka` library. [#4740](https://github.com/yandex/ClickHouse/pull/4740) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Allow ability to build without ssl. [#4750](https://github.com/yandex/ClickHouse/pull/4750) ([proller](https://github.com/proller))
+* Add a way to launch clickhouse-server image from a custom user. [#4753](https://github.com/yandex/ClickHouse/pull/4753) ([Mikhail f. Shiryaev](https://github.com/Felixoid))
+* Upgrade contrib boost to 1.69. [#4793](https://github.com/yandex/ClickHouse/pull/4793) ([proller](https://github.com/proller))
+* Disable usage of `mremap` when compiled with Thread Sanitizer. Surprisingly enough, TSan does not intercept `mremap` (though it does intercept `mmap`, `munmap`), which leads to false positives. Fixed TSan report in stateful tests. [#4859](https://github.com/yandex/ClickHouse/pull/4859) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Add test checking using format schema via HTTP interface. [#4864](https://github.com/yandex/ClickHouse/pull/4864) ([Vitaly Baranov](https://github.com/vitlibar))
+
+## ClickHouse release 19.4.3.11, 2019-04-02
+
+### Bug Fixes
+
+* Fix crash in `FULL/RIGHT JOIN` when joining on nullable vs not nullable columns. [#4855](https://github.com/yandex/ClickHouse/pull/4855) ([Artem Zuikov](https://github.com/4ertus2))
+* Fix segmentation fault in `clickhouse-copier`. [#4835](https://github.com/yandex/ClickHouse/pull/4835) ([proller](https://github.com/proller))
+
+### Build/Testing/Packaging Improvement
+
+* Add a way to launch clickhouse-server image from a custom user. [#4753](https://github.com/yandex/ClickHouse/pull/4753) ([Mikhail f. Shiryaev](https://github.com/Felixoid))
+
 ## ClickHouse release 19.4.2.7, 2019-03-30

 ### Bug Fixes
@ -13,11 +119,11 @@
 ### New Features
 * Added full support for `Protobuf` format (input and output, nested data structures). [#4174](https://github.com/yandex/ClickHouse/pull/4174) [#4493](https://github.com/yandex/ClickHouse/pull/4493) ([Vitaly Baranov](https://github.com/vitlibar))
 * Added bitmap functions with Roaring Bitmaps. [#4207](https://github.com/yandex/ClickHouse/pull/4207) ([Andy Yang](https://github.com/andyyzh)) [#4568](https://github.com/yandex/ClickHouse/pull/4568) ([Vitaly Baranov](https://github.com/vitlibar))
-* Parquet format support [#4448](https://github.com/yandex/ClickHouse/pull/4448) ([proller](https://github.com/proller))
+* Parquet format support. [#4448](https://github.com/yandex/ClickHouse/pull/4448) ([proller](https://github.com/proller))
 * N-gram distance was added for fuzzy string comparison. It is similar to q-gram metrics in R language. [#4466](https://github.com/yandex/ClickHouse/pull/4466) ([Danila Kutenin](https://github.com/danlark1))
 * Combine rules for graphite rollup from dedicated aggregation and retention patterns. [#4426](https://github.com/yandex/ClickHouse/pull/4426) ([Mikhail f. Shiryaev](https://github.com/Felixoid))
 * Added `max_execution_speed` and `max_execution_speed_bytes` to limit resource usage. Added `min_execution_speed_bytes` setting to complement the `min_execution_speed`. [#4430](https://github.com/yandex/ClickHouse/pull/4430) ([Winter Zhang](https://github.com/zhang2014))
-* Implemented function `flatten` [#4555](https://github.com/yandex/ClickHouse/pull/4555) [#4409](https://github.com/yandex/ClickHouse/pull/4409) ([alexey-milovidov](https://github.com/alexey-milovidov), [kzon](https://github.com/kzon))
+* Implemented function `flatten`. [#4555](https://github.com/yandex/ClickHouse/pull/4555) [#4409](https://github.com/yandex/ClickHouse/pull/4409) ([alexey-milovidov](https://github.com/alexey-milovidov), [kzon](https://github.com/kzon))
 * Added functions `arrayEnumerateDenseRanked` and `arrayEnumerateUniqRanked` (it's like `arrayEnumerateUniq` but allows to fine tune array depth to look inside multidimensional arrays). [#4475](https://github.com/yandex/ClickHouse/pull/4475) ([proller](https://github.com/proller)) [#4601](https://github.com/yandex/ClickHouse/pull/4601) ([alexey-milovidov](https://github.com/alexey-milovidov))
 * Multiple JOINS with some restrictions: no asterisks, no complex aliases in ON/WHERE/GROUP BY/... [#4462](https://github.com/yandex/ClickHouse/pull/4462) ([Artem Zuikov](https://github.com/4ertus2))

@ -26,25 +132,25 @@
 * Fixed bug in data skipping indices: order of granules after INSERT was incorrect. [#4407](https://github.com/yandex/ClickHouse/pull/4407) ([Nikita Vasilev](https://github.com/nikvas0))
 * Fixed `set` index for `Nullable` and `LowCardinality` columns. Before it, `set` index with `Nullable` or `LowCardinality` column led to error `Data type must be deserialized with multiple streams` while selecting. [#4594](https://github.com/yandex/ClickHouse/pull/4594) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
 * Correctly set update_time on full `executable` dictionary update. [#4551](https://github.com/yandex/ClickHouse/pull/4551) ([Tema Novikov](https://github.com/temoon))
-* Fix broken progress bar in 19.3 [#4627](https://github.com/yandex/ClickHouse/pull/4627) ([filimonov](https://github.com/filimonov))
+* Fix broken progress bar in 19.3. [#4627](https://github.com/yandex/ClickHouse/pull/4627) ([filimonov](https://github.com/filimonov))
 * Fixed inconsistent values of MemoryTracker when memory region was shrinked, in certain cases. [#4619](https://github.com/yandex/ClickHouse/pull/4619) ([alexey-milovidov](https://github.com/alexey-milovidov))
-* Fixed undefined behaviour in ThreadPool [#4612](https://github.com/yandex/ClickHouse/pull/4612) ([alexey-milovidov](https://github.com/alexey-milovidov))
+* Fixed undefined behaviour in ThreadPool. [#4612](https://github.com/yandex/ClickHouse/pull/4612) ([alexey-milovidov](https://github.com/alexey-milovidov))
 * Fixed a very rare crash with the message `mutex lock failed: Invalid argument` that could happen when a MergeTree table was dropped concurrently with a SELECT. [#4608](https://github.com/yandex/ClickHouse/pull/4608) ([Alex Zatelepin](https://github.com/ztlpn))
-* ODBC driver compatibility with `LowCardinality` data type [#4381](https://github.com/yandex/ClickHouse/pull/4381) ([proller](https://github.com/proller))
-* FreeBSD: Fixup for `AIOcontextPool: Found io_event with unknown id 0` error [#4438](https://github.com/yandex/ClickHouse/pull/4438) ([urgordeadbeef](https://github.com/urgordeadbeef))
+* ODBC driver compatibility with `LowCardinality` data type. [#4381](https://github.com/yandex/ClickHouse/pull/4381) ([proller](https://github.com/proller))
+* FreeBSD: Fixup for `AIOcontextPool: Found io_event with unknown id 0` error. [#4438](https://github.com/yandex/ClickHouse/pull/4438) ([urgordeadbeef](https://github.com/urgordeadbeef))
 * `system.part_log` table was created regardless to configuration. [#4483](https://github.com/yandex/ClickHouse/pull/4483) ([alexey-milovidov](https://github.com/alexey-milovidov))
 * Fix undefined behaviour in `dictIsIn` function for cache dictionaries. [#4515](https://github.com/yandex/ClickHouse/pull/4515) ([alesapin](https://github.com/alesapin))
 * Fixed a deadlock when a SELECT query locks the same table multiple times (e.g. from different threads or when executing multiple subqueries) and there is a concurrent DDL query. [#4535](https://github.com/yandex/ClickHouse/pull/4535) ([Alex Zatelepin](https://github.com/ztlpn))
 * Disable compile_expressions by default until we get own `llvm` contrib and can test it with `clang` and `asan`. [#4579](https://github.com/yandex/ClickHouse/pull/4579) ([alesapin](https://github.com/alesapin))
 * Prevent `std::terminate` when `invalidate_query` for `clickhouse` external dictionary source has returned wrong resultset (empty or more than one row or more than one column). Fixed issue when the `invalidate_query` was performed every five seconds regardless to the `lifetime`. [#4583](https://github.com/yandex/ClickHouse/pull/4583) ([alexey-milovidov](https://github.com/alexey-milovidov))
 * Avoid deadlock when the `invalidate_query` for a dictionary with `clickhouse` source was involving `system.dictionaries` table or `Dictionaries` database (rare case). [#4599](https://github.com/yandex/ClickHouse/pull/4599) ([alexey-milovidov](https://github.com/alexey-milovidov))
-* Fixes for CROSS JOIN with empty WHERE [#4598](https://github.com/yandex/ClickHouse/pull/4598) ([Artem Zuikov](https://github.com/4ertus2))
+* Fixes for CROSS JOIN with empty WHERE. [#4598](https://github.com/yandex/ClickHouse/pull/4598) ([Artem Zuikov](https://github.com/4ertus2))
 * Fixed segfault in function "replicate" when constant argument is passed. [#4603](https://github.com/yandex/ClickHouse/pull/4603) ([alexey-milovidov](https://github.com/alexey-milovidov))
 * Fix lambda function with predicate optimizer. [#4408](https://github.com/yandex/ClickHouse/pull/4408) ([Winter Zhang](https://github.com/zhang2014))
 * Multiple JOINs multiple fixes. [#4595](https://github.com/yandex/ClickHouse/pull/4595) ([Artem Zuikov](https://github.com/4ertus2))

 ### Improvements
-* Support aliases in JOIN ON section for right table columns [#4412](https://github.com/yandex/ClickHouse/pull/4412) ([Artem Zuikov](https://github.com/4ertus2))
+* Support aliases in JOIN ON section for right table columns. [#4412](https://github.com/yandex/ClickHouse/pull/4412) ([Artem Zuikov](https://github.com/4ertus2))
 * Result of multiple JOINs need correct result names to be used in subselects. Replace flat aliases with source names in result. [#4474](https://github.com/yandex/ClickHouse/pull/4474) ([Artem Zuikov](https://github.com/4ertus2))
 * Improve push-down logic for joined statements. [#4387](https://github.com/yandex/ClickHouse/pull/4387) ([Ivan](https://github.com/abyss7))

@ -67,6 +173,18 @@
 * Fix compilation on Mac. [#4371](https://github.com/yandex/ClickHouse/pull/4371) ([Vitaly Baranov](https://github.com/vitlibar))
 * Build fixes for FreeBSD and various unusual build configurations. [#4444](https://github.com/yandex/ClickHouse/pull/4444) ([proller](https://github.com/proller))

+## ClickHouse release 19.3.9.1, 2019-04-02
+
+### Bug Fixes
+
+* Fix crash in `FULL/RIGHT JOIN` when joining on nullable vs not nullable columns. [#4855](https://github.com/yandex/ClickHouse/pull/4855) ([Artem Zuikov](https://github.com/4ertus2))
+* Fix segmentation fault in `clickhouse-copier`. [#4835](https://github.com/yandex/ClickHouse/pull/4835) ([proller](https://github.com/proller))
+* Fixed reading from `Array(LowCardinality)` column in rare case when column contained a long sequence of empty arrays. [#4850](https://github.com/yandex/ClickHouse/pull/4850) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
+
+### Build/Testing/Packaging Improvement
+
+* Add a way to launch clickhouse-server image from a custom user. [#4753](https://github.com/yandex/ClickHouse/pull/4753) ([Mikhail f. Shiryaev](https://github.com/Felixoid))
+

 ## ClickHouse release 19.3.7, 2019-03-12

--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@ -317,6 +317,7 @@ include (cmake/find_hdfs3.cmake) # uses protobuf
 include (cmake/find_consistent-hashing.cmake)
 include (cmake/find_base64.cmake)
 include (cmake/find_hyperscan.cmake)
+include (cmake/find_lfalloc.cmake)
 find_contrib_lib(cityhash)
 find_contrib_lib(farmhash)
 find_contrib_lib(metrohash)
--- a/README.md
+++ b/README.md
@ -10,3 +10,10 @@ ClickHouse is an open-source column-oriented database management system that all
 * [Blog](https://clickhouse.yandex/blog/en/) contains various ClickHouse-related articles, as well as announces and reports about events.
 * [Contacts](https://clickhouse.yandex/#contacts) can help to get your questions answered if there are any.
 * You can also [fill this form](https://forms.yandex.com/surveys/meet-yandex-clickhouse-team/) to meet Yandex ClickHouse team in person.
+
+## Upcoming Events
+* [ClickHouse Community Meetup in Limassol](https://www.facebook.com/events/386638262181785/) on May 7.
+* ClickHouse at [Percona Live 2019](https://www.percona.com/live/19/other-open-source-databases-track) in Austin on May 28-30.
+* [ClickHouse Community Meetup in Beijing](https://www.huodongxing.com/event/2483759276200) on June 8.
+* [ClickHouse Community Meetup in Shenzhen](https://www.huodongxing.com/event/3483759917300) on October 20.
+* [ClickHouse Community Meetup in Shanghai](https://www.huodongxing.com/event/4483760336000) on October 27.
--- a/cmake/find_lfalloc.cmake
+++ b/cmake/find_lfalloc.cmake
@ -0,0 +1,10 @@
+if (NOT SANITIZE AND NOT ARCH_ARM AND NOT ARCH_32 AND NOT ARCH_PPC64LE AND NOT OS_FREEBSD)
+    option (ENABLE_LFALLOC "Set to FALSE to disable the bundled lfalloc allocator" ${NOT_UNBUNDLED})
+endif ()
+
+if (ENABLE_LFALLOC)
+    set (USE_LFALLOC 1)
+    set (USE_LFALLOC_RANDOM_HINT 1)
+    set (LFALLOC_INCLUDE_DIR ${ClickHouse_SOURCE_DIR}/contrib/lfalloc/src)
+    message (STATUS "Using lfalloc=${USE_LFALLOC}: ${LFALLOC_INCLUDE_DIR}")
+endif ()
--- a/contrib/lfalloc/src/lf_allocX64.h
+++ b/contrib/lfalloc/src/lf_allocX64.h
--- a/contrib/lfalloc/src/lfmalloc.h
+++ b/contrib/lfalloc/src/lfmalloc.h
@ -0,0 +1,23 @@
+#pragma once
+
+#include <string.h>
+#include <stdlib.h>
+#include "util/system/compiler.h"
+
+namespace NMalloc {
+    volatile inline bool IsAllocatorCorrupted = false;
+
+    static inline void AbortFromCorruptedAllocator() {
+        IsAllocatorCorrupted = true;
+        abort();
+    }
+
+    struct TAllocHeader {
+        void* Block;
+        size_t AllocSize;
+        void Y_FORCE_INLINE Encode(void* block, size_t size, size_t signature) {
+            Block = block;
+            AllocSize = size | signature;
+        }
+    };
+}
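
The `Encode` method above stores the allocation size and a signature in a single word by OR-ing them together. The following is a minimal sketch of how such a packed field could be written and read back, assuming the signature occupies the top four bits of a 64-bit word. The names `HeaderSketch`, `kSignatureMask`, `kAliveSignature`, `Size()` and `Signature()` and the constant values are hypothetical and are not taken from this commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: the top 4 bits of the word carry the signature,
// the remaining 60 bits carry the allocation size.
constexpr std::uint64_t kSignatureMask = 0xF000000000000000ull;
// Hypothetical signature value marking a live allocation.
constexpr std::uint64_t kAliveSignature = 0x4000000000000000ull;

struct HeaderSketch {
    void* Block;
    std::uint64_t AllocSize;

    // Same shape as TAllocHeader::Encode: pack size and signature into one word.
    void Encode(void* block, std::uint64_t size, std::uint64_t signature) {
        Block = block;
        AllocSize = size | signature;
    }

    // Recover the size by clearing the signature bits.
    std::uint64_t Size() const {
        return AllocSize & ~kSignatureMask;
    }

    // Recover the signature by keeping only the signature bits.
    std::uint64_t Signature() const {
        return AllocSize & kSignatureMask;
    }
};
```

Under these assumptions, encoding a 4096-byte block with `kAliveSignature` yields a header whose `Size()` is 4096 and whose `Signature()` equals `kAliveSignature`. The packing only round-trips while the size stays below 2^60, so it never overlaps the signature bits.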
--- a/contrib/lfalloc/src/util/README.md
+++ b/contrib/lfalloc/src/util/README.md
@ -0,0 +1,33 @@
+Style guide for the util folder is a stricter version of general style guide (mostly in terms of ambiguity resolution).
+
+ * all {} must be in K&R style
+ * &, * tied closer to a type, not to variable
+ * always use `using` not `typedef`
+ * even a single line block must be in braces {}:
+   ```
+   if (A) {
+       B();
+   }
+   ```
+ * _ at the end of private data member of a class - `First_`, `Second_`
+ * every .h file must be accompanied with corresponding .cpp to avoid a leakage and check that it is self contained
+ * prohibited to use `printf`-like functions
+
+
+Things declared in the general style guide, which sometimes are missed:
+
+ * `template <`, not `template<`
+ * `noexcept`, not `throw ()` nor `throw()`, not required for destructors
+ * indents inside `namespace` same as inside `class`
+
+
+Requirements for a new code (and for corrections in an old code which involves change of behaviour) in util:
+
+ * presence of UNIT-tests
+ * presence of comments in Doxygen style
+ * accessors without Get prefix (`Length()`, but not `GetLength()`)
+
+This guide is not mandatory, as there is the general style guide.
+Nevertheless if it is not followed, then a next `ya style .` run in the util folder will undeservedly update authors of some lines of code.
+
+Thus before a commit it is recommended to run `ya style .` in the util folder.
--- a/contrib/lfalloc/src/util/system/atomic.h
+++ b/contrib/lfalloc/src/util/system/atomic.h
@ -0,0 +1,51 @@
+#pragma once
+
+#include "defaults.h"
+
+using TAtomicBase = intptr_t;
+using TAtomic = volatile TAtomicBase;
+
+#if defined(__GNUC__)
+#include "atomic_gcc.h"
+#elif defined(_MSC_VER)
+#include "atomic_win.h"
+#else
+#error unsupported platform
+#endif
+
+#if !defined(ATOMIC_COMPILER_BARRIER)
+#define ATOMIC_COMPILER_BARRIER()
+#endif
+
+static inline TAtomicBase AtomicSub(TAtomic& a, TAtomicBase v) {
+    return AtomicAdd(a, -v);
+}
+
+static inline TAtomicBase AtomicGetAndSub(TAtomic& a, TAtomicBase v) {
+    return AtomicGetAndAdd(a, -v);
+}
+
+#if defined(USE_GENERIC_SETGET)
+static inline TAtomicBase AtomicGet(const TAtomic& a) {
+    return a;
+}
+
+static inline void AtomicSet(TAtomic& a, TAtomicBase v) {
+    a = v;
+}
+#endif
+
+static inline bool AtomicTryLock(TAtomic* a) {
+    return AtomicCas(a, 1, 0);
+}
+
+static inline bool AtomicTryAndTryLock(TAtomic* a) {
+    return (AtomicGet(*a) == 0) && AtomicTryLock(a);
+}
+
+static inline void AtomicUnlock(TAtomic* a) {
+    ATOMIC_COMPILER_BARRIER();
+    AtomicSet(*a, 0);
+}
+
+#include "atomic_ops.h"
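
The lock helpers in `atomic.h` build a simple lock from a compare-and-swap: `AtomicTryLock` swaps 0 for 1, and `AtomicTryAndTryLock` first does a plain read so that contended callers avoid the more expensive CAS. The sketch below shows the same pattern with the standard `std::atomic` instead of the header's own primitives. The names `Word`, `Cas`, `TryLock`, `TryAndTryLock` and `Unlock` are illustrative stand-ins, not part of this commit.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using Word = std::atomic<std::intptr_t>;

// Same argument order as AtomicCas(a, exchange, compare):
// store `exchange` only if the current value equals `compare`.
inline bool Cas(Word* a, std::intptr_t exchange, std::intptr_t compare) {
    return a->compare_exchange_strong(compare, exchange);
}

// Acquire the lock if it is free (0 -> 1); never blocks.
inline bool TryLock(Word* a) {
    return Cas(a, 1, 0);
}

// Test-and-test-and-set: read first, attempt the CAS only if the lock looks free.
inline bool TryAndTryLock(Word* a) {
    return a->load(std::memory_order_acquire) == 0 && TryLock(a);
}

// Release the lock with a release store so earlier writes become visible.
inline void Unlock(Word* a) {
    a->store(0, std::memory_order_release);
}
```

A caller would typically loop on `TryAndTryLock` until it returns true, do its work, and then call `Unlock`. Reading before the CAS reduces cache-line contention when many threads spin on the same word.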
--- a/contrib/lfalloc/src/util/system/atomic_gcc.h
+++ b/contrib/lfalloc/src/util/system/atomic_gcc.h
@ -0,0 +1,90 @@
+#pragma once
+
+#define ATOMIC_COMPILER_BARRIER() __asm__ __volatile__("" \
+                                                       :  \
+                                                       :  \
+                                                       : "memory")
+
+static inline TAtomicBase AtomicGet(const TAtomic& a) {
+    TAtomicBase tmp;
+#if defined(_arm64_)
+    __asm__ __volatile__(
+        "ldar %x[value], %[ptr]  \n\t"
+        : [value] "=r"(tmp)
+        : [ptr] "Q"(a)
+        : "memory");
+#else
+    __atomic_load(&a, &tmp, __ATOMIC_ACQUIRE);
+#endif
+    return tmp;
+}
+
+static inline void AtomicSet(TAtomic& a, TAtomicBase v) {
+#if defined(_arm64_)
+    __asm__ __volatile__(
+        "stlr %x[value], %[ptr]  \n\t"
+        : [ptr] "=Q"(a)
+        : [value] "r"(v)
+        : "memory");
+#else
+    __atomic_store(&a, &v, __ATOMIC_RELEASE);
+#endif
+}
+
+static inline intptr_t AtomicIncrement(TAtomic& p) {
+    return __atomic_add_fetch(&p, 1, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicGetAndIncrement(TAtomic& p) {
+    return __atomic_fetch_add(&p, 1, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicDecrement(TAtomic& p) {
+    return __atomic_sub_fetch(&p, 1, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicGetAndDecrement(TAtomic& p) {
+    return __atomic_fetch_sub(&p, 1, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicAdd(TAtomic& p, intptr_t v) {
+    return __atomic_add_fetch(&p, v, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicGetAndAdd(TAtomic& p, intptr_t v) {
+    return __atomic_fetch_add(&p, v, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicSwap(TAtomic* p, intptr_t v) {
+    (void)p; // disable strange 'parameter set but not used' warning on gcc
+    intptr_t ret;
+    __atomic_exchange(p, &v, &ret, __ATOMIC_SEQ_CST);
+    return ret;
+}
+
+static inline bool AtomicCas(TAtomic* a, intptr_t exchange, intptr_t compare) {
+    (void)a; // disable strange 'parameter set but not used' warning on gcc
+    return __atomic_compare_exchange(a, &compare, &exchange, false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicGetAndCas(TAtomic* a, intptr_t exchange, intptr_t compare) {
+    (void)a; // disable strange 'parameter set but not used' warning on gcc
+    __atomic_compare_exchange(a, &compare, &exchange, false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+    return compare;
+}
+
+static inline intptr_t AtomicOr(TAtomic& a, intptr_t b) {
+    return __atomic_or_fetch(&a, b, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicXor(TAtomic& a, intptr_t b) {
+    return __atomic_xor_fetch(&a, b, __ATOMIC_SEQ_CST);
+}
+
+static inline intptr_t AtomicAnd(TAtomic& a, intptr_t b) {
+    return __atomic_and_fetch(&a, b, __ATOMIC_SEQ_CST);
+}
+
+static inline void AtomicBarrier() {
+    __sync_synchronize();
+}
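Review note on `atomic_gcc.h`: two conventions in this header are easy to mix up. `AtomicAdd` returns the new value while `AtomicGetAndAdd` returns the old one, and `AtomicCas` takes `(target, exchange, compare)`, which is the reverse of the `expected, desired` order of the underlying builtin. The sketch below shows both, assuming a GCC or Clang toolchain that provides the `__atomic_*` builtins. The `Sketch*` names are hypothetical and exist only so the example does not clash with the real functions.

```cpp
#include <cstdint>

// Returns the value after the addition, like AtomicAdd.
inline std::intptr_t SketchAdd(volatile std::intptr_t& p, std::intptr_t v) {
    return __atomic_add_fetch(&p, v, __ATOMIC_SEQ_CST);
}

// Returns the value before the addition, like AtomicGetAndAdd.
inline std::intptr_t SketchGetAndAdd(volatile std::intptr_t& p, std::intptr_t v) {
    return __atomic_fetch_add(&p, v, __ATOMIC_SEQ_CST);
}

// Same argument order as AtomicCas: (target, exchange, compare).
// Stores `exchange` only if *a currently equals `compare`.
inline bool SketchCas(volatile std::intptr_t* a, std::intptr_t exchange, std::intptr_t compare) {
    return __atomic_compare_exchange(a, &compare, &exchange, false,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```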
--- a/contrib/lfalloc/src/util/system/atomic_ops.h
+++ b/contrib/lfalloc/src/util/system/atomic_ops.h
@ -0,0 +1,189 @@
+#pragma once
+
+#include <type_traits>
+
+template <typename T>
+inline TAtomic* AsAtomicPtr(T volatile* target) {
+    return reinterpret_cast<TAtomic*>(target);
+}
+
+template <typename T>
+inline const TAtomic* AsAtomicPtr(T const volatile* target) {
+    return reinterpret_cast<const TAtomic*>(target);
+}
+
+// integral types
+
+template <typename T>
+struct TAtomicTraits {
+    enum {
+        Castable = std::is_integral<T>::value && sizeof(T) == sizeof(TAtomicBase) && !std::is_const<T>::value,
+    };
+};
+
+template <typename T, typename TT>
+using TEnableIfCastable = std::enable_if_t<TAtomicTraits<T>::Castable, TT>;
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicGet(T const volatile& target) {
+    return static_cast<T>(AtomicGet(*AsAtomicPtr(&target)));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, void> AtomicSet(T volatile& target, TAtomicBase value) {
+    AtomicSet(*AsAtomicPtr(&target), value);
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicIncrement(T volatile& target) {
+    return static_cast<T>(AtomicIncrement(*AsAtomicPtr(&target)));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicGetAndIncrement(T volatile& target) {
+    return static_cast<T>(AtomicGetAndIncrement(*AsAtomicPtr(&target)));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicDecrement(T volatile& target) {
+    return static_cast<T>(AtomicDecrement(*AsAtomicPtr(&target)));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicGetAndDecrement(T volatile& target) {
+    return static_cast<T>(AtomicGetAndDecrement(*AsAtomicPtr(&target)));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicAdd(T volatile& target, TAtomicBase value) {
+    return static_cast<T>(AtomicAdd(*AsAtomicPtr(&target), value));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicGetAndAdd(T volatile& target, TAtomicBase value) {
+    return static_cast<T>(AtomicGetAndAdd(*AsAtomicPtr(&target), value));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicSub(T volatile& target, TAtomicBase value) {
+    return static_cast<T>(AtomicSub(*AsAtomicPtr(&target), value));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicGetAndSub(T volatile& target, TAtomicBase value) {
+    return static_cast<T>(AtomicGetAndSub(*AsAtomicPtr(&target), value));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicSwap(T volatile* target, TAtomicBase exchange) {
+    return static_cast<T>(AtomicSwap(AsAtomicPtr(target), exchange));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, bool> AtomicCas(T volatile* target, TAtomicBase exchange, TAtomicBase compare) {
+    return AtomicCas(AsAtomicPtr(target), exchange, compare);
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicGetAndCas(T volatile* target, TAtomicBase exchange, TAtomicBase compare) {
+    return static_cast<T>(AtomicGetAndCas(AsAtomicPtr(target), exchange, compare));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, bool> AtomicTryLock(T volatile* target) {
+    return AtomicTryLock(AsAtomicPtr(target));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, bool> AtomicTryAndTryLock(T volatile* target) {
+    return AtomicTryAndTryLock(AsAtomicPtr(target));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, void> AtomicUnlock(T volatile* target) {
+    AtomicUnlock(AsAtomicPtr(target));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicOr(T volatile& target, TAtomicBase value) {
+    return static_cast<T>(AtomicOr(*AsAtomicPtr(&target), value));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicAnd(T volatile& target, TAtomicBase value) {
+    return static_cast<T>(AtomicAnd(*AsAtomicPtr(&target), value));
+}
+
+template <typename T>
+inline TEnableIfCastable<T, T> AtomicXor(T volatile& target, TAtomicBase value) {
+    return static_cast<T>(AtomicXor(*AsAtomicPtr(&target), value));
+}
+
+// pointer types
+
+template <typename T>
+inline T* AtomicGet(T* const volatile& target) {
+    return reinterpret_cast<T*>(AtomicGet(*AsAtomicPtr(&target)));
+}
+
+template <typename T>
+inline void AtomicSet(T* volatile& target, T* value) {
+    AtomicSet(*AsAtomicPtr(&target), reinterpret_cast<TAtomicBase>(value));
+}
+
+using TNullPtr = decltype(nullptr);
+
+template <typename T>
+inline void AtomicSet(T* volatile& target, TNullPtr) {
+    AtomicSet(*AsAtomicPtr(&target), 0);
+}
+
+template <typename T>
+inline T* AtomicSwap(T* volatile* target, T* exchange) {
+    return reinterpret_cast<T*>(AtomicSwap(AsAtomicPtr(target), reinterpret_cast<TAtomicBase>(exchange)));
+}
+
+template <typename T>
+inline T* AtomicSwap(T* volatile* target, TNullPtr) {
+    return reinterpret_cast<T*>(AtomicSwap(AsAtomicPtr(target), 0));
+}
+
+template <typename T>
+inline bool AtomicCas(T* volatile* target, T* exchange, T* compare) {
+    return AtomicCas(AsAtomicPtr(target), reinterpret_cast<TAtomicBase>(exchange), reinterpret_cast<TAtomicBase>(compare));
+}
+
+template <typename T>
+inline T* AtomicGetAndCas(T* volatile* target, T* exchange, T* compare) {
+    return reinterpret_cast<T*>(AtomicGetAndCas(AsAtomicPtr(target), reinterpret_cast<TAtomicBase>(exchange), reinterpret_cast<TAtomicBase>(compare)));
+}
+
+template <typename T>
+inline bool AtomicCas(T* volatile* target, T* exchange, TNullPtr) {
+    return AtomicCas(AsAtomicPtr(target), reinterpret_cast<TAtomicBase>(exchange), 0);
+}
+
+template <typename T>
+inline T* AtomicGetAndCas(T* volatile* target, T* exchange, TNullPtr) {
+    return reinterpret_cast<T*>(AtomicGetAndCas(AsAtomicPtr(target), reinterpret_cast<TAtomicBase>(exchange), 0));
+}
+
+template <typename T>
+inline bool AtomicCas(T* volatile* target, TNullPtr, T* compare) {
+    return AtomicCas(AsAtomicPtr(target), 0, reinterpret_cast<TAtomicBase>(compare));
+}
+
+template <typename T>
+inline T* AtomicGetAndCas(T* volatile* target, TNullPtr, T* compare) {
+    return reinterpret_cast<T*>(AtomicGetAndCas(AsAtomicPtr(target), 0, reinterpret_cast<TAtomicBase>(compare)));
+}
+
+template <typename T>
+inline bool AtomicCas(T* volatile* target, TNullPtr, TNullPtr) {
+    return AtomicCas(AsAtomicPtr(target), 0, 0);
+}
+
+template <typename T>
+inline T* AtomicGetAndCas(T* volatile* target, TNullPtr, TNullPtr) {
+    return reinterpret_cast<T*>(AtomicGetAndCas(AsAtomicPtr(target), 0, 0));
+}
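Review note on `atomic_ops.h`: the integral overloads are enabled only for types that `TAtomicTraits` marks as castable, that is, non-const integral types with the same size as `TAtomicBase`. The sketch below reproduces that predicate in isolation to make it easier to see which types qualify. `SketchBase` and `SketchTraits` are hypothetical names, and a `static constexpr bool` is used here instead of the enum in the original.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for TAtomicBase.
using SketchBase = std::intptr_t;

// Same condition as TAtomicTraits<T>::Castable in the header above.
template <typename T>
struct SketchTraits {
    static constexpr bool Castable = std::is_integral<T>::value &&
                                     sizeof(T) == sizeof(SketchBase) &&
                                     !std::is_const<T>::value;
};

// Pointer-sized unsigned integers qualify; narrower, const, or
// non-integral types do not and fall out of overload resolution.
static_assert(SketchTraits<std::uintptr_t>::Castable, "pointer-sized integer is castable");
static_assert(!SketchTraits<char>::Castable, "size mismatch is rejected");
```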
--- a/contrib/lfalloc/src/util/system/atomic_win.h
+++ b/contrib/lfalloc/src/util/system/atomic_win.h
@ -0,0 +1,114 @@
+#pragma once
+
+#include <intrin.h>
+
+#define USE_GENERIC_SETGET
+
+#if defined(_i386_)
+
+#pragma intrinsic(_InterlockedIncrement)
+#pragma intrinsic(_InterlockedDecrement)
+#pragma intrinsic(_InterlockedExchangeAdd)
+#pragma intrinsic(_InterlockedExchange)
+#pragma intrinsic(_InterlockedCompareExchange)
+
+static inline intptr_t AtomicIncrement(TAtomic& a) {
+    return _InterlockedIncrement((volatile long*)&a);
+}
+
+static inline intptr_t AtomicGetAndIncrement(TAtomic& a) {
+    return _InterlockedIncrement((volatile long*)&a) - 1;
+}
+
+static inline intptr_t AtomicDecrement(TAtomic& a) {
+    return _InterlockedDecrement((volatile long*)&a);
+}
+
+static inline intptr_t AtomicGetAndDecrement(TAtomic& a) {
+    return _InterlockedDecrement((volatile long*)&a) + 1;
+}
+
+static inline intptr_t AtomicAdd(TAtomic& a, intptr_t b) {
+    return _InterlockedExchangeAdd((volatile long*)&a, b) + b;
+}
+
+static inline intptr_t AtomicGetAndAdd(TAtomic& a, intptr_t b) {
+    return _InterlockedExchangeAdd((volatile long*)&a, b);
+}
+
+static inline intptr_t AtomicSwap(TAtomic* a, intptr_t b) {
+    return _InterlockedExchange((volatile long*)a, b);
+}
+
+static inline bool AtomicCas(TAtomic* a, intptr_t exchange, intptr_t compare) {
+    return _InterlockedCompareExchange((volatile long*)a, exchange, compare) == compare;
+}
+
+static inline intptr_t AtomicGetAndCas(TAtomic* a, intptr_t exchange, intptr_t compare) {
+    return _InterlockedCompareExchange((volatile long*)a, exchange, compare);
+}
+
+#else // _x86_64_
+
+#pragma intrinsic(_InterlockedIncrement64)
+#pragma intrinsic(_InterlockedDecrement64)
+#pragma intrinsic(_InterlockedExchangeAdd64)
+#pragma intrinsic(_InterlockedExchange64)
+#pragma intrinsic(_InterlockedCompareExchange64)
+
+static inline intptr_t AtomicIncrement(TAtomic& a) {
+    return _InterlockedIncrement64((volatile __int64*)&a);
+}
+
+static inline intptr_t AtomicGetAndIncrement(TAtomic& a) {
+    return _InterlockedIncrement64((volatile __int64*)&a) - 1;
+}
+
+static inline intptr_t AtomicDecrement(TAtomic& a) {
+    return _InterlockedDecrement64((volatile __int64*)&a);
+}
+
+static inline intptr_t AtomicGetAndDecrement(TAtomic& a) {
+    return _InterlockedDecrement64((volatile __int64*)&a) + 1;
+}
+
+static inline intptr_t AtomicAdd(TAtomic& a, intptr_t b) {
+    return _InterlockedExchangeAdd64((volatile __int64*)&a, b) + b;
+}
+
+static inline intptr_t AtomicGetAndAdd(TAtomic& a, intptr_t b) {
+    return _InterlockedExchangeAdd64((volatile __int64*)&a, b);
+}
+
+static inline intptr_t AtomicSwap(TAtomic* a, intptr_t b) {
+    return _InterlockedExchange64((volatile __int64*)a, b);
+}
+
+static inline bool AtomicCas(TAtomic* a, intptr_t exchange, intptr_t compare) {
+    return _InterlockedCompareExchange64((volatile __int64*)a, exchange, compare) == compare;
+}
+
+static inline intptr_t AtomicGetAndCas(TAtomic* a, intptr_t exchange, intptr_t compare) {
+    return _InterlockedCompareExchange64((volatile __int64*)a, exchange, compare);
+}
+
+static inline intptr_t AtomicOr(TAtomic& a, intptr_t b) {
+    return _InterlockedOr64(&a, b) | b;
+}
+
+static inline intptr_t AtomicAnd(TAtomic& a, intptr_t b) {
+    return _InterlockedAnd64(&a, b) & b;
+}
+
+static inline intptr_t AtomicXor(TAtomic& a, intptr_t b) {
+    return _InterlockedXor64(&a, b) ^ b;
+}
+
+#endif // _x86_
+
+//TODO
+static inline void AtomicBarrier() {
+    TAtomic val = 0;
+
+    AtomicSwap(&val, 0);
+}
--- a/contrib/lfalloc/src/util/system/compiler.h
+++ b/contrib/lfalloc/src/util/system/compiler.h
@ -0,0 +1,617 @@
+#pragma once
+
+// useful cross-platform definitions for compilers
+
+/**
+ * @def Y_FUNC_SIGNATURE
+ *
+ * Use this macro to get pretty function name (see example).
+ *
+ * @code
+ * void Hi() {
+ *     Cout << Y_FUNC_SIGNATURE << Endl;
+ * }
+
+ * template <typename T>
+ * void Do() {
+ *     Cout << Y_FUNC_SIGNATURE << Endl;
+ * }
+
+ * int main() {
+ *    Hi();         // void Hi()
+ *    Do<int>();    // void Do() [T = int]
+ *    Do<TString>(); // void Do() [T = TString]
+ * }
+ * @endcode
+ */
+#if defined(__GNUC__)
+#define Y_FUNC_SIGNATURE __PRETTY_FUNCTION__
+#elif defined(_MSC_VER)
+#define Y_FUNC_SIGNATURE __FUNCSIG__
+#else
+#define Y_FUNC_SIGNATURE ""
+#endif
+
+#ifdef __GNUC__
+#define Y_PRINTF_FORMAT(n, m) __attribute__((__format__(__printf__, n, m)))
+#endif
+
+#ifndef Y_PRINTF_FORMAT
+#define Y_PRINTF_FORMAT(n, m)
+#endif
+
+#if defined(__clang__)
+#define Y_NO_SANITIZE(...) __attribute__((no_sanitize(__VA_ARGS__)))
+#endif
+
+#if !defined(Y_NO_SANITIZE)
+#define Y_NO_SANITIZE(...)
+#endif
+
+/**
+ * @def Y_DECLARE_UNUSED
+ *
+ * Macro is needed to silence compiler warning about unused entities (e.g. function or argument).
+ *
+ * @code
+ * Y_DECLARE_UNUSED int FunctionUsedSolelyForDebugPurposes();
+ * assert(FunctionUsedSolelyForDebugPurposes() == 42);
+ *
+ * void Foo(const int argumentUsedOnlyForDebugPurposes Y_DECLARE_UNUSED) {
+ *     assert(argumentUsedOnlyForDebugPurposes == 42);
+ *     // however you may as well omit `Y_DECLARE_UNUSED` and use the `Y_UNUSED` macro instead
+ *     Y_UNUSED(argumentUsedOnlyForDebugPurposes);
+ * }
+ * @endcode
+ */
+#ifdef __GNUC__
+#define Y_DECLARE_UNUSED __attribute__((unused))
+#endif
+
+#ifndef Y_DECLARE_UNUSED
+#define Y_DECLARE_UNUSED
+#endif
+
+#if defined(__GNUC__)
+#define Y_LIKELY(Cond) __builtin_expect(!!(Cond), 1)
+#define Y_UNLIKELY(Cond) __builtin_expect(!!(Cond), 0)
+#define Y_PREFETCH_READ(Pointer, Priority) __builtin_prefetch((const void*)(Pointer), 0, Priority)
+#define Y_PREFETCH_WRITE(Pointer, Priority) __builtin_prefetch((const void*)(Pointer), 1, Priority)
+#endif
+
+/**
+ * @def Y_FORCE_INLINE
+ *
+ * Macro to use in place of 'inline' in function declaration/definition to force
+ * it to be inlined.
+ */
+#if !defined(Y_FORCE_INLINE)
+#if defined(CLANG_COVERAGE)
+#/* excessive __always_inline__ might significantly slow down compilation of an instrumented unit */
+#define Y_FORCE_INLINE inline
+#elif defined(_MSC_VER)
+#define Y_FORCE_INLINE __forceinline
+#elif defined(__GNUC__)
+#/* Clang also defines __GNUC__ (as 4) */
+#define Y_FORCE_INLINE inline __attribute__((__always_inline__))
+#else
+#define Y_FORCE_INLINE inline
+#endif
+#endif
+
+/**
+ * @def Y_NO_INLINE
+ *
+ * Macro to use in place of 'inline' in function declaration/definition to
+ * prevent it from being inlined.
+ */
+#if !defined(Y_NO_INLINE)
+#if defined(_MSC_VER)
+#define Y_NO_INLINE __declspec(noinline)
+#elif defined(__GNUC__) || defined(__INTEL_COMPILER)
+#/* Clang also defines __GNUC__ (as 4) */
+#define Y_NO_INLINE __attribute__((__noinline__))
+#else
+#define Y_NO_INLINE
+#endif
+#endif
+
+//to cheat compiler about strict aliasing or similar problems
+#if defined(__GNUC__)
+#define Y_FAKE_READ(X)                  \
+    do {                                \
+        __asm__ __volatile__(""         \
+                             :          \
+                             : "m"(X)); \
+    } while (0)
+
+#define Y_FAKE_WRITE(X)                  \
+    do {                                 \
+        __asm__ __volatile__(""          \
+                             : "=m"(X)); \
+    } while (0)
+#endif
+
+#if !defined(Y_FAKE_READ)
+#define Y_FAKE_READ(X)
+#endif
+
+#if !defined(Y_FAKE_WRITE)
+#define Y_FAKE_WRITE(X)
+#endif
+
+#ifndef Y_PREFETCH_READ
+#define Y_PREFETCH_READ(Pointer, Priority) (void)(const void*)(Pointer), (void)Priority
+#endif
+
+#ifndef Y_PREFETCH_WRITE
+#define Y_PREFETCH_WRITE(Pointer, Priority) (void)(const void*)(Pointer), (void)Priority
+#endif
+
+#ifndef Y_LIKELY
+#define Y_LIKELY(Cond) (Cond)
+#define Y_UNLIKELY(Cond) (Cond)
+#endif
+
+#ifdef __GNUC__
+#define _packed __attribute__((packed))
+#else
+#define _packed
+#endif
+
+#if defined(__GNUC__)
+#define Y_WARN_UNUSED_RESULT __attribute__((warn_unused_result))
+#endif
+
+#ifndef Y_WARN_UNUSED_RESULT
+#define Y_WARN_UNUSED_RESULT
+#endif
+
+#if defined(__GNUC__)
+#define Y_HIDDEN __attribute__((visibility("hidden")))
+#endif
+
+#if !defined(Y_HIDDEN)
+#define Y_HIDDEN
+#endif
+
+#if defined(__GNUC__)
+#define Y_PUBLIC __attribute__((visibility("default")))
+#endif
+
+#if !defined(Y_PUBLIC)
+#define Y_PUBLIC
+#endif
+
+#if !defined(Y_UNUSED) && !defined(__cplusplus)
+#define Y_UNUSED(var) (void)(var)
+#endif
+#if !defined(Y_UNUSED) && defined(__cplusplus)
+template <class... Types>
+constexpr Y_FORCE_INLINE int Y_UNUSED(Types&&...) {
+    return 0;
+};
+#endif
+
+/**
+ * @def Y_ASSUME
+ *
+ * Macro that tells the compiler that it can generate optimized code
+ * as if the given expression will always evaluate true.
+ * The behavior is undefined if it ever evaluates false.
+ *
+ * @code
+ * // factored into a function so that it's testable
+ * inline int Avg(int x, int y) {
+ *     if (x >= 0 && y >= 0) {
+ *         return (static_cast<unsigned>(x) + static_cast<unsigned>(y)) >> 1;
+ *     } else {
+ *         // a slower implementation
+ *     }
+ * }
+ *
+ * // we know that xs and ys are non-negative from domain knowledge,
+ * // but we can't change the types of xs and ys because of API constraints
+ * TVector<int> Foo(const TVector<int>& xs, const TVector<int>& ys) {
+ *     TVector<int> avgs;
+ *     avgs.resize(xs.size());
+ *     for (size_t i = 0; i < xs.size(); ++i) {
+ *         auto x = xs[i];
+ *         auto y = ys[i];
+ *         Y_ASSUME(x >= 0);
+ *         Y_ASSUME(y >= 0);
+ *         avgs[i] = Avg(x, y);
+ *     }
+ *     return avgs;
+ * }
+ * @endcode
+ */
+#if defined(__GNUC__)
+#define Y_ASSUME(condition) ((condition) ? (void)0 : __builtin_unreachable())
+#elif defined(_MSC_VER)
+#define Y_ASSUME(condition) __assume(condition)
+#else
+#define Y_ASSUME(condition) Y_UNUSED(condition)
+#endif
+
+#ifdef __cplusplus
+[[noreturn]]
+#endif
+Y_HIDDEN void _YandexAbort();
+
+/**
+ * @def Y_UNREACHABLE
+ *
+ * Macro that marks the rest of the code branch unreachable.
+ * The behavior is undefined if it's ever reached.
+ *
+ * @code
+ * switch (i % 3) {
+ * case 0:
+ *     return foo;
+ * case 1:
+ *     return bar;
+ * case 2:
+ *     return baz;
+ * default:
+ *     Y_UNREACHABLE();
+ * }
+ * @endcode
+ */
+#if defined(__GNUC__) || defined(_MSC_VER)
+#define Y_UNREACHABLE() Y_ASSUME(0)
+#else
+#define Y_UNREACHABLE() _YandexAbort()
+#endif
+
+#if defined(undefined_sanitizer_enabled)
+#define _ubsan_enabled_
+#endif
+
+#ifdef __clang__
+
+#if __has_feature(thread_sanitizer)
+#define _tsan_enabled_
+#endif
+#if __has_feature(memory_sanitizer)
+#define _msan_enabled_
+#endif
+#if __has_feature(address_sanitizer)
+#define _asan_enabled_
+#endif
+
+#else
+
+#if defined(thread_sanitizer_enabled) || defined(__SANITIZE_THREAD__)
+#define _tsan_enabled_
+#endif
+#if defined(memory_sanitizer_enabled)
+#define _msan_enabled_
+#endif
+#if defined(address_sanitizer_enabled) || defined(__SANITIZE_ADDRESS__)
+#define _asan_enabled_
+#endif
+
+#endif
+
+#if defined(_asan_enabled_) || defined(_msan_enabled_) || defined(_tsan_enabled_) || defined(_ubsan_enabled_)
+#define _san_enabled_
+#endif
+
+#if defined(_MSC_VER)
+#define __PRETTY_FUNCTION__ __FUNCSIG__
+#endif
+
+#if defined(__GNUC__)
+#define Y_WEAK __attribute__((weak))
+#else
+#define Y_WEAK
+#endif
+
+#if defined(__CUDACC_VER_MAJOR__)
+#define Y_CUDA_AT_LEAST(x, y) (__CUDACC_VER_MAJOR__ > x || (__CUDACC_VER_MAJOR__ == x && __CUDACC_VER_MINOR__ >= y))
+#else
+#define Y_CUDA_AT_LEAST(x, y) 0
+#endif
+
+// NVidia CUDA C++ Compiler did not know about noexcept keyword until version 9.0
+#if !Y_CUDA_AT_LEAST(9, 0)
+#if defined(__CUDACC__) && !defined(noexcept)
+#define noexcept throw ()
+#endif
+#endif
+
+#if defined(__GNUC__)
+#define Y_COLD __attribute__((cold))
+#define Y_LEAF __attribute__((leaf))
+#define Y_WRAPPER __attribute__((artificial))
+#else
+#define Y_COLD
+#define Y_LEAF
+#define Y_WRAPPER
+#endif
+
+/**
+ * @def Y_PRAGMA
+ *
+ * Macro for use in other macros to define compiler pragma
+ * See below for other usage examples
+ *
+ * @code
+ * #if defined(__clang__) || defined(__GNUC__)
+ * #define Y_PRAGMA_NO_WSHADOW \
+ *     Y_PRAGMA("GCC diagnostic ignored \"-Wshadow\"")
+ * #elif defined(_MSC_VER)
+ * #define Y_PRAGMA_NO_WSHADOW \
+ *     Y_PRAGMA(warning(disable : 4456 4457))
+ * #else
+ * #define Y_PRAGMA_NO_WSHADOW
+ * #endif
+ * @endcode
+ */
+#if defined(__clang__) || defined(__GNUC__)
+#define Y_PRAGMA(x) _Pragma(x)
+#elif defined(_MSC_VER)
+#define Y_PRAGMA(x) __pragma(x)
+#else
+#define Y_PRAGMA(x)
+#endif
+
+/**
+ * @def Y_PRAGMA_DIAGNOSTIC_PUSH
+ *
+ * Cross-compiler pragma to save diagnostic settings
+ *
+ * @see
+ *     GCC: https://gcc.gnu.org/onlinedocs/gcc/Diagnostic-Pragmas.html
+ *     MSVC: https://msdn.microsoft.com/en-us/library/2c8f766e.aspx
+ *     Clang: https://clang.llvm.org/docs/UsersManual.html#controlling-diagnostics-via-pragmas
+ *
+ * @code
+ * Y_PRAGMA_DIAGNOSTIC_PUSH
+ * @endcode
+ */
+#if defined(__clang__) || defined(__GNUC__)
+#define Y_PRAGMA_DIAGNOSTIC_PUSH \
+    Y_PRAGMA("GCC diagnostic push")
+#elif defined(_MSC_VER)
+#define Y_PRAGMA_DIAGNOSTIC_PUSH \
+    Y_PRAGMA(warning(push))
+#else
+#define Y_PRAGMA_DIAGNOSTIC_PUSH
+#endif
+
+/**
+ * @def Y_PRAGMA_DIAGNOSTIC_POP
+ *
+ * Cross-compiler pragma to restore diagnostic settings
+ *
+ * @see
+ *     GCC: https://gcc.gnu.org/onlinedocs/gcc/Diagnostic-Pragmas.html
+ *     MSVC: https://msdn.microsoft.com/en-us/library/2c8f766e.aspx
+ *     Clang: https://clang.llvm.org/docs/UsersManual.html#controlling-diagnostics-via-pragmas
+ *
+ * @code
+ * Y_PRAGMA_DIAGNOSTIC_POP
+ * @endcode
+ */
+#if defined(__clang__) || defined(__GNUC__)
+#define Y_PRAGMA_DIAGNOSTIC_POP \
+    Y_PRAGMA("GCC diagnostic pop")
+#elif defined(_MSC_VER)
+#define Y_PRAGMA_DIAGNOSTIC_POP \
+    Y_PRAGMA(warning(pop))
+#else
+#define Y_PRAGMA_DIAGNOSTIC_POP
+#endif
+
+/**
+ * @def Y_PRAGMA_NO_WSHADOW
+ *
+ * Cross-compiler pragma to disable warnings about shadowing variables
+ *
+ * @code
+ * Y_PRAGMA_DIAGNOSTIC_PUSH
+ * Y_PRAGMA_NO_WSHADOW
+ *
+ * // some code which uses variable shadowing, e.g.:
+ *
+ * for (int i = 0; i < 100; ++i) {
+ *   Use(i);
+ *
+ *   for (int i = 42; i < 100500; ++i) { // this i is shadowing previous i
+ *       AnotherUse(i);
+ *    }
+ * }
+ *
+ * Y_PRAGMA_DIAGNOSTIC_POP
+ * @endcode
+ */
+#if defined(__clang__) || defined(__GNUC__)
+#define Y_PRAGMA_NO_WSHADOW \
+    Y_PRAGMA("GCC diagnostic ignored \"-Wshadow\"")
+#elif defined(_MSC_VER)
+#define Y_PRAGMA_NO_WSHADOW \
+    Y_PRAGMA(warning(disable : 4456 4457))
+#else
+#define Y_PRAGMA_NO_WSHADOW
+#endif
+
+/**
+ * @def Y_PRAGMA_NO_UNUSED_FUNCTION
+ *
+ * Cross-compiler pragma to disable warnings about unused functions
+ *
+ * @see
+ *     GCC: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
+ *     Clang: https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function
+ *     MSVC: there is no such warning
+ *
+ * @code
+ * Y_PRAGMA_DIAGNOSTIC_PUSH
+ * Y_PRAGMA_NO_UNUSED_FUNCTION
+ *
+ * // some code which introduces a function which later will not be used, e.g.:
+ *
+ * void Foo() {
+ * }
+ *
+ * int main() {
+ *     return 0; // Foo() never called
+ * }
+ *
+ * Y_PRAGMA_DIAGNOSTIC_POP
+ * @endcode
+ */
+#if defined(__clang__) || defined(__GNUC__)
+#define Y_PRAGMA_NO_UNUSED_FUNCTION \
+    Y_PRAGMA("GCC diagnostic ignored \"-Wunused-function\"")
+#else
+#define Y_PRAGMA_NO_UNUSED_FUNCTION
+#endif
+
+/**
+ * @def Y_PRAGMA_NO_UNUSED_PARAMETER
+ *
+ * Cross-compiler pragma to disable warnings about unused function parameters
+ *
+ * @see
+ *     GCC: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
+ *     Clang: https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-parameter
+ *     MSVC: https://msdn.microsoft.com/en-us/library/26kb9fy0.aspx
+ *
+ * @code
+ * Y_PRAGMA_DIAGNOSTIC_PUSH
+ * Y_PRAGMA_NO_UNUSED_PARAMETER
+ *
+ * // some code which introduces a function with unused parameter, e.g.:
+ *
+ * void foo(int a) {
+ *     // a is not referenced
+ * }
+ *
+ * int main() {
+ *     foo(1);
+ *     return 0;
+ * }
+ *
+ * Y_PRAGMA_DIAGNOSTIC_POP
+ * @endcode
+ */
+#if defined(__clang__) || defined(__GNUC__)
+#define Y_PRAGMA_NO_UNUSED_PARAMETER \
+    Y_PRAGMA("GCC diagnostic ignored \"-Wunused-parameter\"")
+#elif defined(_MSC_VER)
+#define Y_PRAGMA_NO_UNUSED_PARAMETER \
+    Y_PRAGMA(warning(disable : 4100))
+#else
+#define Y_PRAGMA_NO_UNUSED_PARAMETER
+#endif
+
+/**
+ * @def Y_PRAGMA_NO_DEPRECATED
+ *
+ * Cross-compiler pragma to disable warnings and errors about deprecated entities
+ *
+ * @see
+ *     GCC: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
+ *     Clang: https://clang.llvm.org/docs/DiagnosticsReference.html#wdeprecated
+ *     MSVC: https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-3-c4996?view=vs-2017
+ *
+ * @code
+ * Y_PRAGMA_DIAGNOSTIC_PUSH
+ * Y_PRAGMA_NO_DEPRECATED
+ *
+ * [[deprecated]] void foo() {
+ *     // ...
+ * }
+ *
+ * int main() {
+ *     foo();
+ *     return 0;
+ * }
+ *
+ * Y_PRAGMA_DIAGNOSTIC_POP
+ * @endcode
+ */
+#if defined(__clang__) || defined(__GNUC__)
+#define Y_PRAGMA_NO_DEPRECATED \
+    Y_PRAGMA("GCC diagnostic ignored \"-Wdeprecated\"")
+#elif defined(_MSC_VER)
+#define Y_PRAGMA_NO_DEPRECATED \
+    Y_PRAGMA(warning(disable : 4996))
+#else
+#define Y_PRAGMA_NO_DEPRECATED
+#endif
+
+#if defined(__clang__) || defined(__GNUC__)
+/**
+ * @def Y_CONST_FUNCTION
+   Methods and functions marked with this attribute promise that they:
+     1. have no side effects
+     2. do not read global memory
+   NOTE: this attribute can't be set for methods that depend on data pointed to by `this`
+   This allows compilers to optimize such functions aggressively
+   NOTE: in the common case this attribute can't be set if the method has pointer arguments
+   NOTE: as a result there is no reason to discard the result of such a method
+*/
+#define Y_CONST_FUNCTION [[gnu::const]]
+#endif
+
+#if !defined(Y_CONST_FUNCTION)
+#define Y_CONST_FUNCTION
+#endif
+
+#if defined(__clang__) || defined(__GNUC__)
+/**
+ * @def Y_PURE_FUNCTION
+   Methods and functions marked with this attribute promise that:
+     1. they have no side effects
+     2. their result is the same as long as global memory has not changed
+   This allows compilers to optimize such functions aggressively
+   NOTE: as a result there is no reason to discard the result of such a method
+*/
+#define Y_PURE_FUNCTION [[gnu::pure]]
+#endif
+
+#if !defined(Y_PURE_FUNCTION)
+#define Y_PURE_FUNCTION
+#endif
+
+/**
+ * @def Y_HAVE_INT128
+ *
+ * Defined when the compiler supports __int128 extension
+ *
+ * @code
+ *
+ * #if defined(Y_HAVE_INT128)
+ *     __int128 myVeryBigInt = 12345678901234567890;
+ * #endif
+ *
+ * @endcode
+ */
+#if defined(__SIZEOF_INT128__)
+#define Y_HAVE_INT128 1
+#endif
+
+/**
+ * XRAY macro must be passed to compiler if XRay is enabled.
+ *
+ * Define everything XRay-specific as a macro so that it doesn't cause errors
+ * for compilers that don't support XRay.
+ */
+#if defined(XRAY) && defined(__cplusplus)
+#include <xray/xray_interface.h>
+#define Y_XRAY_ALWAYS_INSTRUMENT [[clang::xray_always_instrument]]
+#define Y_XRAY_NEVER_INSTRUMENT [[clang::xray_never_instrument]]
+#define Y_XRAY_CUSTOM_EVENT(__string, __length) \
+    do {                                        \
+        __xray_customevent(__string, __length); \
+    } while (0)
+#else
+#define Y_XRAY_ALWAYS_INSTRUMENT
+#define Y_XRAY_NEVER_INSTRUMENT
+#define Y_XRAY_CUSTOM_EVENT(__string, __length) \
+    do {                                        \
+    } while (0)
+#endif
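Review note on `compiler.h`: the `Y_PRAGMA_DIAGNOSTIC_PUSH` / `Y_PRAGMA_NO_WSHADOW` / `Y_PRAGMA_DIAGNOSTIC_POP` trio is meant to bracket a small region of code. The sketch below shows the same push, ignore, pop sequence using only `_Pragma`, assuming GCC or Clang. On other compilers the hypothetical `SKETCH_PRAGMA` macro expands to nothing, whereas the real header uses `__pragma` for MSVC.

```cpp
#if defined(__clang__) || defined(__GNUC__)
#define SKETCH_PRAGMA(x) _Pragma(x)
#else
#define SKETCH_PRAGMA(x)
#endif

SKETCH_PRAGMA("GCC diagnostic push")
SKETCH_PRAGMA("GCC diagnostic ignored \"-Wshadow\"")

// The inner loop variable deliberately shadows the outer one;
// -Wshadow stays silent only between the push and the pop.
inline int SketchShadowSum() {
    int total = 0;
    for (int i = 0; i < 3; ++i) {
        for (int i = 10; i < 12; ++i) {
            total += i; // adds 10 + 11 on each outer iteration
        }
    }
    return total;
}

SKETCH_PRAGMA("GCC diagnostic pop")
```

Keeping the pop immediately after the affected function restores the previous diagnostic state for the rest of the translation unit.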
--- a/contrib/lfalloc/src/util/system/defaults.h
+++ b/contrib/lfalloc/src/util/system/defaults.h
@ -0,0 +1,168 @@
+#pragma once
+
+#include "platform.h"
+
+#if defined _unix_
+#define LOCSLASH_C '/'
+#define LOCSLASH_S "/"
+#else
+#define LOCSLASH_C '\\'
+#define LOCSLASH_S "\\"
+#endif // _unix_
+
+#if defined(__INTEL_COMPILER) && defined(__cplusplus)
+#include <new>
+#endif
+
+// low and high parts of integers
+#if !defined(_win_)
+#include <sys/param.h>
+#endif
+
+#if defined(BSD) || defined(_android_)
+
+#if defined(BSD)
+#include <machine/endian.h>
+#endif
+
+#if defined(_android_)
+#include <endian.h>
+#endif
+
+#if (BYTE_ORDER == LITTLE_ENDIAN)
+#define _little_endian_
+#elif (BYTE_ORDER == BIG_ENDIAN)
+#define _big_endian_
+#else
+#error unknown endian not supported
+#endif
+
+#elif (defined(_sun_) && !defined(__i386__)) || defined(_hpux_) || defined(WHATEVER_THAT_HAS_BIG_ENDIAN)
+#define _big_endian_
+#else
+#define _little_endian_
+#endif
+
+// alignment
+#if (defined(_sun_) && !defined(__i386__)) || defined(_hpux_) || defined(__alpha__) || defined(__ia64__) || defined(WHATEVER_THAT_NEEDS_ALIGNING_QUADS)
+#define _must_align8_
+#endif
+
+#if (defined(_sun_) && !defined(__i386__)) || defined(_hpux_) || defined(__alpha__) || defined(__ia64__) || defined(WHATEVER_THAT_NEEDS_ALIGNING_LONGS)
+#define _must_align4_
+#endif
+
+#if (defined(_sun_) && !defined(__i386__)) || defined(_hpux_) || defined(__alpha__) || defined(__ia64__) || defined(WHATEVER_THAT_NEEDS_ALIGNING_SHORTS)
+#define _must_align2_
+#endif
+
+#if defined(__GNUC__)
+#define alias_hack __attribute__((__may_alias__))
+#endif
+
+#ifndef alias_hack
+#define alias_hack
+#endif
+
+#include "types.h"
+
+#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
+#define PRAGMA(x) _Pragma(#x)
+#define RCSID(idstr) PRAGMA(comment(exestr, idstr))
+#else
+#define RCSID(idstr) static const char rcsid[] = idstr
+#endif
+
+#include "compiler.h"
+
+#ifdef _win_
+#include <malloc.h>
+#elif defined(_sun_)
+#include <alloca.h>
+#endif
+
+#ifdef NDEBUG
+#define Y_IF_DEBUG(X)
+#else
+#define Y_IF_DEBUG(X) X
+#endif
+
+/**
+ * @def Y_ARRAY_SIZE
+ *
+ * This macro is needed to get number of elements in a statically allocated fixed size array. The
+ * expression is a compile-time constant and therefore can be used in compile time computations.
+ *
+ * @code
+ * enum ENumbers {
+ *     EN_ONE,
+ *     EN_TWO,
+ *     EN_SIZE
+ * };
+ *
+ * const char* NAMES[] = {
+ *     "one",
+ *     "two"
+ * };
+ *
+ * static_assert(Y_ARRAY_SIZE(NAMES) == EN_SIZE, "you should define `NAME` for each enumeration");
+ * @endcode
+ *
+ * This macro also catches type errors. If you see a compiler error like "warning: division by zero
+ * is undefined" when using `Y_ARRAY_SIZE` then you are probably giving it a pointer.
+ *
+ * Since all of our code is expected to work on a 64 bit platform where pointers are 8 bytes we may
+ * falsely accept pointers to types of sizes that are divisors of 8 (1, 2, 4 and 8).
+ */
+#if defined(__cplusplus)
+namespace NArraySizePrivate {
+    template <class T>
+    struct TArraySize;
+
+    template <class T, size_t N>
+    struct TArraySize<T[N]> {
+        enum {
+            Result = N
+        };
+    };
+
+    template <class T, size_t N>
+    struct TArraySize<T (&)[N]> {
+        enum {
+            Result = N
+        };
+    };
+}
+
+#define Y_ARRAY_SIZE(arr) ((size_t)::NArraySizePrivate::TArraySize<decltype(arr)>::Result)
+#else
+#undef Y_ARRAY_SIZE
+#define Y_ARRAY_SIZE(arr) \
+    ((sizeof(arr) / sizeof((arr)[0])) / (size_t)(!(sizeof(arr) % sizeof((arr)[0]))))
+#endif
+
+#undef Y_ARRAY_BEGIN
+#define Y_ARRAY_BEGIN(arr) (arr)
+
+#undef Y_ARRAY_END
+#define Y_ARRAY_END(arr) ((arr) + Y_ARRAY_SIZE(arr))
+
+/**
+ * Concatenates two symbols, even if one of them is itself a macro.
+ */
+#define Y_CAT(X, Y) Y_CAT_I(X, Y)
+#define Y_CAT_I(X, Y) Y_CAT_II(X, Y)
+#define Y_CAT_II(X, Y) X##Y
+
+#define Y_STRINGIZE(X) UTIL_PRIVATE_STRINGIZE_AUX(X)
+#define UTIL_PRIVATE_STRINGIZE_AUX(X) #X
+
+#if defined(__COUNTER__)
+#define Y_GENERATE_UNIQUE_ID(N) Y_CAT(N, __COUNTER__)
+#endif
+
+#if !defined(Y_GENERATE_UNIQUE_ID)
+#define Y_GENERATE_UNIQUE_ID(N) Y_CAT(N, __LINE__)
+#endif
+
+#define NPOS ((size_t)-1)
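
Reviewer sketch for the `Y_ARRAY_SIZE` doc comment above: a minimal, standalone version of the template approach, assuming C++11 or later. The names `ArraySizeOf` and `SKETCH_ARRAY_SIZE` are hypothetical stand-ins for `NArraySizePrivate::TArraySize` and `Y_ARRAY_SIZE`, not part of the imported header.

```cpp
#include <cstddef>

// Hypothetical stand-in for NArraySizePrivate::TArraySize. The primary
// template is declared but never defined, so passing a pointer (or any
// other non-array type) is a hard compile error.
template <class T>
struct ArraySizeOf;

// Matches decltype of a named array, e.g. `int[5]`.
template <class T, std::size_t N>
struct ArraySizeOf<T[N]> {
    static constexpr std::size_t value = N;
};

// Matches decltype of a reference to an array, e.g. `int (&)[5]`.
template <class T, std::size_t N>
struct ArraySizeOf<T (&)[N]> {
    static constexpr std::size_t value = N;
};

#define SKETCH_ARRAY_SIZE(arr) (ArraySizeOf<decltype(arr)>::value)

enum ENumbers { EN_ONE, EN_TWO, EN_SIZE };

static const char* const NAMES[] = {"one", "two"};

// Evaluated at compile time: fails to build if a name is missing.
static_assert(SKETCH_ARRAY_SIZE(NAMES) == static_cast<std::size_t>(EN_SIZE),
              "one name per enumerator");
```

Unlike the `sizeof`-division fallback used for plain C, this form rejects every pointer argument, including pointers to 1, 2, 4 and 8 byte types.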
--- a/contrib/lfalloc/src/util/system/platform.h
+++ b/contrib/lfalloc/src/util/system/platform.h
@ -0,0 +1,242 @@
+#pragma once
+
+// What OS ?
+// our definition has the form _{osname}_
+
+#if defined(_WIN64)
+#define _win64_
+#define _win32_
+#elif defined(__WIN32__) || defined(_WIN32) // _WIN32 is also defined by the 64-bit compiler for backward compatibility
+#define _win32_
+#else
+#define _unix_
+#if defined(__sun__) || defined(sun) || defined(sparc) || defined(__sparc)
+#define _sun_
+#endif
+#if defined(__hpux__)
+#define _hpux_
+#endif
+#if defined(__linux__)
+#define _linux_
+#endif
+#if defined(__FreeBSD__)
+#define _freebsd_
+#endif
+#if defined(__CYGWIN__)
+#define _cygwin_
+#endif
+#if defined(__APPLE__)
+#define _darwin_
+#endif
+#if defined(__ANDROID__)
+#define _android_
+#endif
+#endif
+
+#if defined(__IOS__)
+#define _ios_
+#endif
+
+#if defined(_linux_)
+#if defined(_musl_)
+//nothing to do
+#elif defined(_android_)
+#define _bionic_
+#else
+#define _glibc_
+#endif
+#endif
+
+#if defined(_darwin_)
+#define unix
+#define __unix__
+#endif
+
+#if defined(_win32_) || defined(_win64_)
+#define _win_
+#endif
+
+#if defined(__arm__) || defined(__ARM__) || defined(__ARM_NEON) || defined(__aarch64__) || defined(_M_ARM)
+#if defined(__arm64) || defined(__arm64__) || defined(__aarch64__)
+#define _arm64_
+#else
+#define _arm32_
+#endif
+#endif
+
+#if defined(_arm64_) || defined(_arm32_)
+#define _arm_
+#endif
+
+/* __ia64__ and __x86_64__      - defined by GNU C.
+ * _M_IA64, _M_X64, _M_AMD64    - defined by Visual Studio.
+ *
+ * Microsoft can define _M_IX86, _M_AMD64 (before Visual Studio 8)
+ * or _M_X64 (starting in Visual Studio 8).
+ */
+#if defined(__x86_64__) || defined(_M_X64) || defined(_M_AMD64)
+#define _x86_64_
+#endif
+
+#if defined(__i386__) || defined(_M_IX86)
+#define _i386_
+#endif
+
+#if defined(__ia64__) || defined(_M_IA64)
+#define _ia64_
+#endif
+
+#if defined(__powerpc__)
+#define _ppc_
+#endif
+
+#if defined(__powerpc64__)
+#define _ppc64_
+#endif
+
+#if !defined(sparc) && !defined(__sparc) && !defined(__hpux__) && !defined(__alpha__) && !defined(_ia64_) && !defined(_x86_64_) && !defined(_arm_) && !defined(_i386_) && !defined(_ppc_) && !defined(_ppc64_)
+#error "platform not defined, please, define one"
+#endif
+
+#if defined(_x86_64_) || defined(_i386_)
+#define _x86_
+#endif
+
+#if defined(__MIC__)
+#define _mic_
+#define _k1om_
+#endif
+
+// stdio or MessageBox
+#if defined(__CONSOLE__) || defined(_CONSOLE)
+#define _console_
+#endif
+#if (defined(_win_) && !defined(_console_))
+#define _windows_
+#elif !defined(_console_)
+#define _console_
+#endif
+
+#if defined(__SSE__) || defined(SSE_ENABLED)
+#define _sse_
+#endif
+
+#if defined(__SSE2__) || defined(SSE2_ENABLED)
+#define _sse2_
+#endif
+
+#if defined(__SSE3__) || defined(SSE3_ENABLED)
+#define _sse3_
+#endif
+
+#if defined(__SSSE3__) || defined(SSSE3_ENABLED)
+#define _ssse3_
+#endif
+
+#if defined(POPCNT_ENABLED)
+#define _popcnt_
+#endif
+
+#if defined(__DLL__) || defined(_DLL)
+#define _dll_
+#endif
+
+// 16, 32 or 64
+#if defined(__sparc_v9__) || defined(_x86_64_) || defined(_ia64_) || defined(_arm64_) || defined(_ppc64_)
+#define _64_
+#else
+#define _32_
+#endif
+
+/* All modern 64-bit Unix systems use scheme LP64 (long, pointers are 64-bit).
+ * Microsoft uses a different scheme: LLP64 (long long, pointers are 64-bit).
+ *
+ * Scheme          LP64   LLP64
+ * char              8      8
+ * short            16     16
+ * int              32     32
+ * long             64     32
+ * long long        64     64
+ * pointer          64     64
+ */
+
+#if defined(_32_)
+#define SIZEOF_PTR 4
+#elif defined(_64_)
+#define SIZEOF_PTR 8
+#endif
+
+#define PLATFORM_DATA_ALIGN SIZEOF_PTR
+
+#if !defined(SIZEOF_PTR)
+#error "SIZEOF_PTR is not defined: neither _32_ nor _64_ is set"
+#endif
+
+#define SIZEOF_CHAR 1
+#define SIZEOF_UNSIGNED_CHAR 1
+#define SIZEOF_SHORT 2
+#define SIZEOF_UNSIGNED_SHORT 2
+#define SIZEOF_INT 4
+#define SIZEOF_UNSIGNED_INT 4
+
+#if defined(_32_)
+#define SIZEOF_LONG 4
+#define SIZEOF_UNSIGNED_LONG 4
+#elif defined(_64_)
+#if defined(_win_)
+#define SIZEOF_LONG 4
+#define SIZEOF_UNSIGNED_LONG 4
+#else
+#define SIZEOF_LONG 8
+#define SIZEOF_UNSIGNED_LONG 8
+#endif // _win_
+#endif // _32_
+
+#if !defined(SIZEOF_LONG)
+#error "SIZEOF_LONG is not defined: neither _32_ nor _64_ is set"
+#endif
+
+#define SIZEOF_LONG_LONG 8
+#define SIZEOF_UNSIGNED_LONG_LONG 8
+
+#undef SIZEOF_SIZE_T // in case we include <Python.h> which defines it, too
+#define SIZEOF_SIZE_T SIZEOF_PTR
+
+#if defined(__INTEL_COMPILER)
+#pragma warning(disable 1292)
+#pragma warning(disable 1469)
+#pragma warning(disable 193)
+#pragma warning(disable 271)
+#pragma warning(disable 383)
+#pragma warning(disable 424)
+#pragma warning(disable 444)
+#pragma warning(disable 584)
+#pragma warning(disable 593)
+#pragma warning(disable 981)
+#pragma warning(disable 1418)
+#pragma warning(disable 304)
+#pragma warning(disable 810)
+#pragma warning(disable 1029)
+#pragma warning(disable 1419)
+#pragma warning(disable 177)
+#pragma warning(disable 522)
+#pragma warning(disable 858)
+#pragma warning(disable 111)
+#pragma warning(disable 1599)
+#pragma warning(disable 411)
+#pragma warning(disable 304)
+#pragma warning(disable 858)
+#pragma warning(disable 444)
+#pragma warning(disable 913)
+#pragma warning(disable 310)
+#pragma warning(disable 167)
+#pragma warning(disable 180)
+#pragma warning(disable 1572)
+#endif
+
+#if defined(_MSC_VER)
+#undef _WINSOCKAPI_
+#define _WINSOCKAPI_
+#undef NOMINMAX
+#define NOMINMAX
+#endif
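
Reviewer sketch for the LP64/LLP64 table in `platform.h`: a small helper that names the data model from the byte sizes of `long` and a pointer. `dataModelName` and `currentDataModel` are hypothetical names for illustration; `ILP32` is the conventional name for the 32-bit case, which the header covers with `SIZEOF_LONG 4` and `SIZEOF_PTR 4`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: classify a data model from sizeof(long) and
// sizeof(void*), both given in bytes.
inline std::string dataModelName(std::size_t sizeof_long, std::size_t sizeof_ptr) {
    if (sizeof_ptr == 8 && sizeof_long == 8)
        return "LP64";   // 64-bit Unix: long and pointers are 64-bit
    if (sizeof_ptr == 8 && sizeof_long == 4)
        return "LLP64";  // 64-bit Windows: only long long and pointers are 64-bit
    if (sizeof_ptr == 4 && sizeof_long == 4)
        return "ILP32";  // typical 32-bit targets
    return "unknown";
}

// Data model of the platform this translation unit is compiled for.
inline std::string currentDataModel() {
    return dataModelName(sizeof(long), sizeof(void*));
}
```

The header makes the same distinction at preprocessing time, where `sizeof` is not available, which is why it relies on the `_32_`, `_64_` and `_win_` macros instead.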
--- a/contrib/lfalloc/src/util/system/types.h
+++ b/contrib/lfalloc/src/util/system/types.h
@ -0,0 +1,117 @@
+#pragma once
+
+// DO_NOT_STYLE
+
+#include "platform.h"
+
+#include <inttypes.h>
+
+typedef int8_t i8;
+typedef int16_t i16;
+typedef uint8_t ui8;
+typedef uint16_t ui16;
+
+typedef int yssize_t;
+#define PRIYSZT "d"
+
+#if defined(_darwin_) && defined(_32_)
+typedef unsigned long ui32;
+typedef long i32;
+#else
+typedef uint32_t ui32;
+typedef int32_t i32;
+#endif
+
+#if defined(_darwin_) && defined(_64_)
+typedef unsigned long ui64;
+typedef long i64;
+#else
+typedef uint64_t ui64;
+typedef int64_t i64;
+#endif
+
+#define LL(number) INT64_C(number)
+#define ULL(number) UINT64_C(number)
+
+// Macro for size_t and ptrdiff_t types
+#if defined(_32_)
+#   if defined(_darwin_)
+#       define PRISZT "lu"
+#       undef PRIi32
+#       define PRIi32 "li"
+#       undef SCNi32
+#       define SCNi32 "li"
+#       undef PRId32
+#       define PRId32 "li"
+#       undef SCNd32
+#       define SCNd32 "li"
+#       undef PRIu32
+#       define PRIu32 "lu"
+#       undef SCNu32
+#       define SCNu32 "lu"
+#       undef PRIx32
+#       define PRIx32 "lx"
+#       undef SCNx32
+#       define SCNx32 "lx"
+#   elif !defined(_cygwin_)
+#       define PRISZT PRIu32
+#   else
+#       define PRISZT "u"
+#   endif
+#   define SCNSZT SCNu32
+#   define PRIPDT PRIi32
+#   define SCNPDT SCNi32
+#   define PRITMT PRIi32
+#   define SCNTMT SCNi32
+#elif defined(_64_)
+#   if defined(_darwin_)
+#       define PRISZT "lu"
+#       undef PRIu64
+#       define PRIu64 PRISZT
+#       undef PRIx64
+#       define PRIx64 "lx"
+#       undef PRIX64
+#       define PRIX64 "lX"
+#       undef PRId64
+#       define PRId64 "ld"
+#       undef PRIi64
+#       define PRIi64 "li"
+#       undef SCNi64
+#       define SCNi64 "li"
+#       undef SCNu64
+#       define SCNu64 "lu"
+#       undef SCNx64
+#       define SCNx64 "lx"
+#   else
+#       define PRISZT PRIu64
+#   endif
+#   define SCNSZT SCNu64
+#   define PRIPDT PRIi64
+#   define SCNPDT SCNi64
+#   define PRITMT PRIi64
+#   define SCNTMT SCNi64
+#else
+#   error "Unsupported platform"
+#endif
+
+// SUPERLONG
+#if !defined(DONT_USE_SUPERLONG) && !defined(SUPERLONG_MAX)
+#define SUPERLONG_MAX ~LL(0)
+typedef i64 SUPERLONG;
+#endif
+
+// UNICODE
+// UCS-2, native byteorder
+typedef ui16 wchar16;
+// internal symbol type: UTF-16LE
+typedef wchar16 TChar;
+typedef ui32 wchar32;
+
+#if defined(_MSC_VER)
+#include <basetsd.h>
+typedef SSIZE_T ssize_t;
+#define HAVE_SSIZE_T 1
+#include <wchar.h>
+#endif
+
+#include <sys/types.h>
--- a/contrib/poco
+++ b/contrib/poco
@ -1 +1 @@
-Subproject commit fe5505e56c27b6ecb0dcbc40c49dc2caf4e9637f
+Subproject commit 29439cf7fa32c1a2d62d925bb6d6a3f14668a4a2
--- a/dbms/CMakeLists.txt
+++ b/dbms/CMakeLists.txt
@ -155,7 +155,6 @@ if (USE_EMBEDDED_COMPILER)
    target_include_directories (dbms SYSTEM BEFORE PUBLIC ${LLVM_INCLUDE_DIRS})
 endif ()

-
 if (CMAKE_BUILD_TYPE_UC STREQUAL "RELEASE" OR CMAKE_BUILD_TYPE_UC STREQUAL "RELWITHDEBINFO" OR CMAKE_BUILD_TYPE_UC STREQUAL "MINSIZEREL")
    # Won't generate debug info for files with heavy template instantiation to achieve faster linking and lower size.
    set_source_files_properties(
@ -214,6 +213,10 @@ target_link_libraries (clickhouse_common_io

 target_include_directories(clickhouse_common_io SYSTEM BEFORE PUBLIC ${RE2_INCLUDE_DIR})

+if (USE_LFALLOC)
+    target_include_directories (clickhouse_common_io SYSTEM BEFORE PUBLIC ${LFALLOC_INCLUDE_DIR})
+endif ()
+
 if(CPUID_LIBRARY)
    target_link_libraries(clickhouse_common_io PRIVATE ${CPUID_LIBRARY})
 endif()
@ -223,8 +226,9 @@ if(CPUINFO_LIBRARY)
 endif()

 target_link_libraries (dbms
-        PRIVATE
+        PUBLIC
    clickhouse_compression
+        PRIVATE
    clickhouse_parsers
    clickhouse_common_config
        PUBLIC
--- a/dbms/programs/benchmark/Benchmark.cpp
+++ b/dbms/programs/benchmark/Benchmark.cpp
@ -439,7 +439,7 @@ int mainEntryClickHouseBenchmark(int argc, char ** argv)
            ("help",                                                            "produce help message")
            ("concurrency,c", value<unsigned>()->default_value(1),              "number of parallel queries")
            ("delay,d",       value<double>()->default_value(1),                "delay between intermediate reports in seconds (set 0 to disable reports)")
-            ("stage",         value<std::string>()->default_value("complete"),  "request query processing up to specified stage")
+            ("stage",         value<std::string>()->default_value("complete"),  "request query processing up to specified stage: complete,fetch_columns,with_mergeable_state")
            ("iterations,i",  value<size_t>()->default_value(0),                "amount of queries to be executed")
            ("timelimit,t",   value<double>()->default_value(0.),               "stop launch of queries after specified time limit")
            ("randomize,r",   value<bool>()->default_value(false),              "randomize order of execution")
--- a/dbms/src/Columns/IColumn.h
+++ b/dbms/src/Columns/IColumn.h
@ -250,7 +250,7 @@ public:

    /// Size of memory, allocated for column.
    /// This is greater or equals to byteSize due to memory reservation in containers.
-    /// Zero, if could be determined.
+    /// Zero, if could not be determined.
    virtual size_t allocatedBytes() const = 0;

    /// Make memory region readonly with mprotect if it is large enough.
--- a/dbms/src/Common/ErrorCodes.cpp
+++ b/dbms/src/Common/ErrorCodes.cpp
@ -424,6 +424,8 @@ namespace ErrorCodes
    extern const int HYPERSCAN_CANNOT_SCAN_TEXT = 447;
    extern const int BROTLI_READ_FAILED = 448;
    extern const int BROTLI_WRITE_FAILED = 449;
+    extern const int BAD_TTL_EXPRESSION = 450;
+    extern const int BAD_TTL_FILE = 451;

    extern const int KEEPER_EXCEPTION = 999;
    extern const int POCO_EXCEPTION = 1000;
--- a/dbms/src/Common/LFAllocator.cpp
+++ b/dbms/src/Common/LFAllocator.cpp
@ -0,0 +1,53 @@
+#include <Common/config.h>
+
+#if USE_LFALLOC
+#include "LFAllocator.h"
+
+#include <cstring>
+#include <lf_allocX64.h>
+
+namespace DB
+{
+
+void * LFAllocator::alloc(size_t size, size_t alignment)
+{
+    if (alignment == 0)
+        return LFAlloc(size);
+    else
+    {
+        void * ptr;
+        int res = LFPosixMemalign(&ptr, alignment, size);
+        return res ? nullptr : ptr;
+    }
+}
+
+void LFAllocator::free(void * buf, size_t)
+{
+    LFFree(buf);
+}
+
+void * LFAllocator::realloc(void * old_ptr, size_t, size_t new_size, size_t alignment)
+{
+    if (old_ptr == nullptr)
+    {
+        void * result = LFAllocator::alloc(new_size, alignment);
+        return result;
+    }
+    if (new_size == 0)
+    {
+        LFFree(old_ptr);
+        return nullptr;
+    }
+
+    void * new_ptr = LFAllocator::alloc(new_size, alignment);
+    if (new_ptr == nullptr)
+        return nullptr;
+    size_t old_size = LFGetSize(old_ptr);
+    memcpy(new_ptr, old_ptr, ((old_size < new_size) ? old_size : new_size));
+    LFFree(old_ptr);
+    return new_ptr;
+}
+
+}
+
+#endif
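
Reviewer sketch of the allocate-copy-free pattern used by `LFAllocator::realloc` above. This is not the lfalloc API: `std::malloc` and `std::free` stand in for `LFAlloc` and `LFFree`, the caller passes `old_size` explicitly instead of querying `LFGetSize`, and `reallocByCopy` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Sketch only: emulates realloc with a fresh allocation plus a copy.
inline void* reallocByCopy(void* old_ptr, std::size_t old_size, std::size_t new_size) {
    if (old_ptr == nullptr)
        return std::malloc(new_size);  // nothing to copy, behaves like alloc

    if (new_size == 0) {
        std::free(old_ptr);            // shrinking to zero releases the block
        return nullptr;
    }

    void* new_ptr = std::malloc(new_size);
    if (new_ptr == nullptr)
        return nullptr;                // old block is untouched and still owned by the caller

    // Copy only the bytes that exist in both blocks.
    std::memcpy(new_ptr, old_ptr, std::min(old_size, new_size));
    std::free(old_ptr);
    return new_ptr;
}
```

As in the diff, a failed allocation leaves the old buffer valid, so a caller that overwrites its only pointer with the result would leak it.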
--- a/dbms/src/Common/LFAllocator.h
+++ b/dbms/src/Common/LFAllocator.h
@ -0,0 +1,22 @@
+#pragma once
+
+#include <Common/config.h>
+
+#if !USE_LFALLOC
+#error "do not include this file unless USE_LFALLOC is set to 1"
+#endif
+
+#include <cstddef>
+
+namespace DB
+{
+struct LFAllocator
+{
+    static void * alloc(size_t size, size_t alignment = 0);
+
+    static void free(void * buf, size_t);
+
+    static void * realloc(void * buf, size_t, size_t new_size, size_t alignment = 0);
+};
+
+}
--- a/dbms/src/Common/RadixSort.h
+++ b/dbms/src/Common/RadixSort.h
@ -64,15 +64,15 @@ struct RadixSortFloatTransform
 };


-template <typename Float>
+template <typename _Element, typename _Key = _Element>
 struct RadixSortFloatTraits
 {
-    using Element = Float;        /// The type of the element. It can be a structure with a key and some other payload. Or just a key.
-    using Key = Float;            /// The key to sort.
+    using Element = _Element;     /// The type of the element. It can be a structure with a key and some other payload. Or just a key.
+    using Key = _Key;             /// The key to sort.
    using CountType = uint32_t;   /// Type for calculating histograms. In the case of a known small number of elements, it can be less than size_t.

    /// The type to which the key is transformed to do bit operations. This UInt is the same size as the key.
-    using KeyBits = std::conditional_t<sizeof(Float) == 8, uint64_t, uint32_t>;
+    using KeyBits = std::conditional_t<sizeof(_Key) == 8, uint64_t, uint32_t>;

    static constexpr size_t PART_SIZE_BITS = 8;    /// With what pieces of the key, in bits, to do one pass - reshuffle of the array.

@ -85,7 +85,13 @@ struct RadixSortFloatTraits
    using Allocator = RadixSortMallocAllocator;

    /// The function to get the key from an array element.
-    static Key & extractKey(Element & elem) { return elem; }
+    static Key & extractKey(Element & elem)
+    {
+        if constexpr (std::is_same_v<Element, Key>)
+            return elem;
+        else
+            return *reinterpret_cast<Key *>(&elem);
+    }
 };


@ -109,13 +115,13 @@ struct RadixSortSignedTransform
 };


-template <typename UInt>
+template <typename _Element, typename _Key = _Element>
 struct RadixSortUIntTraits
 {
-    using Element = UInt;
-    using Key = UInt;
+    using Element = _Element;
+    using Key = _Key;
    using CountType = uint32_t;
-    using KeyBits = UInt;
+    using KeyBits = _Key;

    static constexpr size_t PART_SIZE_BITS = 8;

@ -123,16 +129,22 @@ struct RadixSortUIntTraits
    using Allocator = RadixSortMallocAllocator;

    /// The function to get the key from an array element.
-    static Key & extractKey(Element & elem) { return elem; }
+    static Key & extractKey(Element & elem)
+    {
+        if constexpr (std::is_same_v<Element, Key>)
+            return elem;
+        else
+            return *reinterpret_cast<Key *>(&elem);
+    }
 };

-template <typename Int>
+template <typename _Element, typename _Key = _Element>
 struct RadixSortIntTraits
 {
-    using Element = Int;
-    using Key = Int;
+    using Element = _Element;
+    using Key = _Key;
    using CountType = uint32_t;
-    using KeyBits = std::make_unsigned_t<Int>;
+    using KeyBits = std::make_unsigned_t<_Key>;

    static constexpr size_t PART_SIZE_BITS = 8;

@ -140,7 +152,13 @@ struct RadixSortIntTraits
    using Allocator = RadixSortMallocAllocator;

    /// The function to get the key from an array element.
-    static Key & extractKey(Element & elem) { return elem; }
+    static Key & extractKey(Element & elem)
+    {
+        if constexpr (std::is_same_v<Element, Key>)
+            return elem;
+        else
+            return *reinterpret_cast<Key *>(&elem);
+    }
 };


@ -261,3 +279,16 @@ radixSort(T * arr, size_t size)
    return RadixSort<RadixSortFloatTraits<T>>::execute(arr, size);
 }

+template <typename _Element, typename _Key>
+std::enable_if_t<std::is_integral_v<_Key>, void>
+radixSort(_Element * arr, size_t size)
+{
+    return RadixSort<RadixSortUIntTraits<_Element, _Key>>::execute(arr, size);
+}
+
+template <typename _Element, typename _Key>
+std::enable_if_t<std::is_floating_point_v<_Key>, void>
+radixSort(_Element * arr, size_t size)
+{
+    return RadixSort<RadixSortFloatTraits<_Element, _Key>>::execute(arr, size);
+}
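
Reviewer sketch of the new `extractKey` behaviour in the traits above, assuming C++17. `KeyedRow` is a hypothetical element type; the `reinterpret_cast` is only meaningful when the key is the first member of a standard-layout struct, an assumption the traits themselves do not check.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical element: the sort key must be the first member so that a
// pointer to the element can be reinterpreted as a pointer to the key.
struct KeyedRow {
    std::uint32_t key;
    std::uint32_t payload;
};
static_assert(std::is_standard_layout<KeyedRow>::value,
              "the key must sit at offset 0 of a standard-layout type");

// Same shape as the extractKey added to the RadixSort traits.
template <typename Element, typename Key = Element>
Key& extractKey(Element& elem) {
    if constexpr (std::is_same_v<Element, Key>)
        return elem;                            // element is the key itself
    else
        return *reinterpret_cast<Key*>(&elem);  // key is the leading member
}
```

With this in place a call such as `radixSort<KeyedRow, std::uint32_t>(rows, n)` sorts whole rows by their leading key, which is what the two new `radixSort` overloads enable.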
--- a/dbms/src/Common/config.h.in
+++ b/dbms/src/Common/config.h.in
@ -25,6 +25,8 @@
 #cmakedefine01 USE_BROTLI
 #cmakedefine01 USE_SSL
 #cmakedefine01 USE_HYPERSCAN
+#cmakedefine01 USE_LFALLOC
+#cmakedefine01 USE_LFALLOC_RANDOM_HINT

 #cmakedefine01 CLICKHOUSE_SPLIT_BINARY
 #cmakedefine01 LLVM_HAS_RTTI
--- a/dbms/src/Core/Settings.h
+++ b/dbms/src/Core/Settings.h
@ -93,9 +93,12 @@ struct Settings
    M(SettingBool, optimize_skip_unused_shards, false, "Assumes that data is distributed by sharding_key. Optimization to skip unused shards if SELECT query filters by sharding_key.") \
    \
    M(SettingUInt64, merge_tree_min_rows_for_concurrent_read, (20 * 8192), "If at least as many lines are read from one file, the reading can be parallelized.") \
+    M(SettingUInt64, merge_tree_min_bytes_for_concurrent_read, (100 * 1024 * 1024), "If at least as many bytes are read from one file, the reading can be parallelized.") \
    M(SettingUInt64, merge_tree_min_rows_for_seek, 0, "You can skip reading more than that number of rows at the price of one seek per file.") \
+    M(SettingUInt64, merge_tree_min_bytes_for_seek, 0, "You can skip reading more than that number of bytes at the price of one seek per file.") \
    M(SettingUInt64, merge_tree_coarse_index_granularity, 8, "If the index segment can contain the required keys, divide it into as many parts and recursively check them.") \
    M(SettingUInt64, merge_tree_max_rows_to_use_cache, (1024 * 1024), "The maximum number of rows per request, to use the cache of uncompressed data. If the request is large, the cache is not used. (For large queries not to flush out the cache.)") \
+    M(SettingUInt64, merge_tree_max_bytes_to_use_cache, (600 * 1024 * 1024), "The maximum number of bytes per request, to use the cache of uncompressed data. If the request is large, the cache is not used. (For large queries not to flush out the cache.)") \
    \
    M(SettingBool, merge_tree_uniform_read_distribution, true, "Distribute read from MergeTree over threads evenly, ensuring stable average execution time of each thread within one read operation.") \
    \
--- a/dbms/src/DataStreams/BlocksListBlockInputStream.h
+++ b/dbms/src/DataStreams/BlocksListBlockInputStream.h
@ -23,6 +23,8 @@ public:
    String getName() const override { return "BlocksList"; }

 protected:
+    Block getHeader() const override { return list.empty() ? Block() : *list.begin(); }
+
    Block readImpl() override
    {
        if (it == end)
--- a/dbms/src/DataStreams/CollapsingSortedBlockInputStream.cpp
+++ b/dbms/src/DataStreams/CollapsingSortedBlockInputStream.cpp
@ -40,7 +40,7 @@ void CollapsingSortedBlockInputStream::reportIncorrectData()
 }


-void CollapsingSortedBlockInputStream::insertRows(MutableColumns & merged_columns, size_t & merged_rows)
+void CollapsingSortedBlockInputStream::insertRows(MutableColumns & merged_columns, size_t block_size, MergeStopCondition & condition)
 {
    if (count_positive == 0 && count_negative == 0)
    {
@ -52,7 +52,7 @@ void CollapsingSortedBlockInputStream::insertRows(MutableColumns & merged_column
    {
        if (count_positive <= count_negative)
        {
-            ++merged_rows;
+            condition.addRowWithGranularity(block_size);
            for (size_t i = 0; i < num_columns; ++i)
                merged_columns[i]->insertFrom(*(*first_negative.columns)[i], first_negative.row_num);

@ -62,7 +62,7 @@ void CollapsingSortedBlockInputStream::insertRows(MutableColumns & merged_column

        if (count_positive >= count_negative)
        {
-            ++merged_rows;
+            condition.addRowWithGranularity(block_size);
            for (size_t i = 0; i < num_columns; ++i)
                merged_columns[i]->insertFrom(*(*last_positive.columns)[i], last_positive.row_num);

@ -106,12 +106,14 @@ Block CollapsingSortedBlockInputStream::readImpl()

 void CollapsingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::priority_queue<SortCursor> & queue)
 {
-    size_t merged_rows = 0;

+    MergeStopCondition stop_condition(average_block_sizes, max_block_size);
+    size_t current_block_granularity;
    /// Take rows in correct order and put them into `merged_columns` until the rows no more than `max_block_size`
    for (; !queue.empty(); ++current_pos)
    {
        SortCursor current = queue.top();
+        current_block_granularity = current->rows;

        if (current_key.empty())
            setPrimaryKeyRef(current_key, current);
@ -122,7 +124,7 @@ void CollapsingSortedBlockInputStream::merge(MutableColumns & merged_columns, st
        bool key_differs = next_key != current_key;

        /// if there are enough rows and the last one is calculated completely
-        if (key_differs && merged_rows >= max_block_size)
+        if (key_differs && stop_condition.checkStop())
        {
            ++blocks_written;
            return;
@ -133,7 +135,7 @@ void CollapsingSortedBlockInputStream::merge(MutableColumns & merged_columns, st
        if (key_differs)
        {
            /// We write data for the previous primary key.
-            insertRows(merged_columns, merged_rows);
+            insertRows(merged_columns, current_block_granularity, stop_condition);

            current_key.swap(next_key);

@ -167,7 +169,7 @@ void CollapsingSortedBlockInputStream::merge(MutableColumns & merged_columns, st
                first_negative_pos = current_pos;
            }

-            if (!blocks_written && !merged_rows)
+            if (!blocks_written && stop_condition.empty())
            {
                setRowRef(last_negative, current);
                last_negative_pos = current_pos;
@ -193,7 +195,7 @@ void CollapsingSortedBlockInputStream::merge(MutableColumns & merged_columns, st
    }

    /// Write data for last primary key.
-    insertRows(merged_columns, merged_rows);
+    insertRows(merged_columns, /*some_granularity*/ 0, stop_condition);

    finished = true;
 }
--- a/dbms/src/DataStreams/CollapsingSortedBlockInputStream.h
+++ b/dbms/src/DataStreams/CollapsingSortedBlockInputStream.h
@ -26,8 +26,9 @@ class CollapsingSortedBlockInputStream : public MergingSortedBlockInputStream
 public:
    CollapsingSortedBlockInputStream(
            BlockInputStreams inputs_, const SortDescription & description_,
-            const String & sign_column, size_t max_block_size_, WriteBuffer * out_row_sources_buf_ = nullptr)
-        : MergingSortedBlockInputStream(inputs_, description_, max_block_size_, 0, out_row_sources_buf_)
+            const String & sign_column, size_t max_block_size_,
+            WriteBuffer * out_row_sources_buf_ = nullptr, bool average_block_sizes_ = false)
+        : MergingSortedBlockInputStream(inputs_, description_, max_block_size_, 0, out_row_sources_buf_, false, average_block_sizes_)
    {
        sign_column_number = header.getPositionByName(sign_column);
    }
@ -75,7 +76,7 @@ private:
    void merge(MutableColumns & merged_columns, std::priority_queue<SortCursor> & queue);

    /// Output to result rows for the current primary key.
-    void insertRows(MutableColumns & merged_columns, size_t & merged_rows);
+    void insertRows(MutableColumns & merged_columns, size_t block_size, MergeStopCondition & condition);

    void reportIncorrectData();
 };
--- a/dbms/src/DataStreams/MarkInCompressedFile.h
+++ b/dbms/src/DataStreams/MarkInCompressedFile.h
@ -6,6 +6,10 @@
 #include <IO/WriteHelpers.h>
 #include <Common/PODArray.h>

+#include <Common/config.h>
+#if USE_LFALLOC
+#include <Common/LFAllocator.h>
+#endif

 namespace DB
 {
@ -32,8 +36,16 @@ struct MarkInCompressedFile
    {
        return "(" + DB::toString(offset_in_compressed_file) + "," + DB::toString(offset_in_decompressed_block) + ")";
    }
+
+    String toStringWithRows(size_t rows_num)
+    {
+        return "(" + DB::toString(offset_in_compressed_file) + "," + DB::toString(offset_in_decompressed_block) + "," + DB::toString(rows_num) + ")";
+    }
+
 };
-
+#if USE_LFALLOC
+using MarksInCompressedFile = PODArray<MarkInCompressedFile, 4096, LFAllocator>;
+#else
 using MarksInCompressedFile = PODArray<MarkInCompressedFile>;
-
+#endif
 }
--- a/dbms/src/DataStreams/MergingSortedBlockInputStream.cpp
+++ b/dbms/src/DataStreams/MergingSortedBlockInputStream.cpp
@ -18,9 +18,10 @@ namespace ErrorCodes

 MergingSortedBlockInputStream::MergingSortedBlockInputStream(
    const BlockInputStreams & inputs_, const SortDescription & description_,
-    size_t max_block_size_, UInt64 limit_, WriteBuffer * out_row_sources_buf_, bool quiet_)
+    size_t max_block_size_, UInt64 limit_, WriteBuffer * out_row_sources_buf_, bool quiet_, bool average_block_sizes_)
    : description(description_), max_block_size(max_block_size_), limit(limit_), quiet(quiet_)
-    , source_blocks(inputs_.size()), cursors(inputs_.size()), out_row_sources_buf(out_row_sources_buf_)
+    , average_block_sizes(average_block_sizes_), source_blocks(inputs_.size())
+    , cursors(inputs_.size()), out_row_sources_buf(out_row_sources_buf_)
 {
    children.insert(children.end(), inputs_.begin(), inputs_.end());
    header = children.at(0)->getHeader();
@ -116,7 +117,7 @@ Block MergingSortedBlockInputStream::readImpl()
 template <typename TSortCursor>
 void MergingSortedBlockInputStream::fetchNextBlock(const TSortCursor & current, std::priority_queue<TSortCursor> & queue)
 {
-    size_t order = current.impl->order;
+    size_t order = current->order;
    size_t size = cursors.size();

    if (order >= size || &cursors[order] != current.impl)
@ -132,6 +133,19 @@ void MergingSortedBlockInputStream::fetchNextBlock(const TSortCursor & current,
    }
 }

+
+bool MergingSortedBlockInputStream::MergeStopCondition::checkStop() const
+{
+    if (!count_average)
+        return sum_rows_count == max_block_size;
+
+    if (sum_rows_count == 0)
+        return false;
+
+    size_t average = sum_blocks_granularity / sum_rows_count;
+    return sum_rows_count >= average;
+}
+
 template
 void MergingSortedBlockInputStream::fetchNextBlock<SortCursor>(const SortCursor & current, std::priority_queue<SortCursor> & queue);

@ -144,10 +158,11 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::
 {
    size_t merged_rows = 0;

+    MergeStopCondition stop_condition(average_block_sizes, max_block_size);
    /** Increase row counters.
      * Return true if it's time to finish generating the current data block.
      */
-    auto count_row_and_check_limit = [&, this]()
+    auto count_row_and_check_limit = [&, this](size_t current_granularity)
    {
        ++total_merged_rows;
        if (limit && total_merged_rows == limit)
@ -159,19 +174,15 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::
        }

        ++merged_rows;
-        if (merged_rows == max_block_size)
-        {
-    //        std::cerr << "max_block_size reached\n";
-            return true;
-        }
-
-        return false;
+        stop_condition.addRowWithGranularity(current_granularity);
+        return stop_condition.checkStop();
    };

    /// Take rows in required order and put them into `merged_columns`, while the rows are no more than `max_block_size`
    while (!queue.empty())
    {
        TSortCursor current = queue.top();
+        size_t current_block_granularity = current->rows;
        queue.pop();

        while (true)
@ -179,20 +190,20 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::
            /** And what if the block is totally less or equal than the rest for the current cursor?
              * Or is there only one data source left in the queue? Then you can take the entire block on current cursor.
              */
-            if (current.impl->isFirst() && (queue.empty() || current.totallyLessOrEquals(queue.top())))
+            if (current->isFirst() && (queue.empty() || current.totallyLessOrEquals(queue.top())))
            {
    //            std::cerr << "current block is totally less or equals\n";

                /// If there are already data in the current block, we first return it. We'll get here again the next time we call the merge function.
                if (merged_rows != 0)
                {
-    //                std::cerr << "merged rows is non-zero\n";
+                    //std::cerr << "merged rows is non-zero\n";
                    queue.push(current);
                    return;
                }

-                /// Actually, current.impl->order stores source number (i.e. cursors[current.impl->order] == current.impl)
-                size_t source_num = current.impl->order;
+                /// Actually, current->order stores source number (i.e. cursors[current->order] == current)
+                size_t source_num = current->order;

                if (source_num >= cursors.size())
                    throw Exception("Logical error in MergingSortedBlockInputStream", ErrorCodes::LOGICAL_ERROR);
@ -204,6 +215,7 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::

                merged_rows = merged_columns.at(0)->size();

+                /// Limit output
                if (limit && total_merged_rows + merged_rows > limit)
                {
                    merged_rows = limit - total_merged_rows;
@ -217,6 +229,8 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::
                    finished = true;
                }

+                /// Write order of rows for other columns
+                /// this data will be used in gather stream
                if (out_row_sources_buf)
                {
                    RowSourcePart row_source(source_num);
@ -224,7 +238,7 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::
                        out_row_sources_buf->write(row_source.data);
                }

-    //            std::cerr << "fetching next block\n";
+                //std::cerr << "fetching next block\n";

                total_merged_rows += merged_rows;
                fetchNextBlock(current, queue);
@ -239,7 +253,7 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::
            if (out_row_sources_buf)
            {
                /// Actually, current.impl->order stores source number (i.e. cursors[current.impl->order] == current.impl)
-                RowSourcePart row_source(current.impl->order);
+                RowSourcePart row_source(current->order);
                out_row_sources_buf->write(row_source.data);
            }

@ -250,7 +264,7 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::

                if (queue.empty() || !(current.greater(queue.top())))
                {
-                    if (count_row_and_check_limit())
+                    if (count_row_and_check_limit(current_block_granularity))
                    {
    //                    std::cerr << "pushing back to queue\n";
                        queue.push(current);
@ -277,7 +291,7 @@ void MergingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::
            break;
        }

-        if (count_row_and_check_limit())
+        if (count_row_and_check_limit(current_block_granularity))
            return;
    }

--- a/dbms/src/DataStreams/MergingSortedBlockInputStream.h
+++ b/dbms/src/DataStreams/MergingSortedBlockInputStream.h
@ -68,7 +68,7 @@ public:
      */
    MergingSortedBlockInputStream(
        const BlockInputStreams & inputs_, const SortDescription & description_, size_t max_block_size_,
-        UInt64 limit_ = 0, WriteBuffer * out_row_sources_buf_ = nullptr, bool quiet_ = false);
+        UInt64 limit_ = 0, WriteBuffer * out_row_sources_buf_ = nullptr, bool quiet_ = false, bool average_block_sizes_ = false);

    String getName() const override { return "MergingSorted"; }

@ -116,6 +116,38 @@ protected:
        size_t size() const { return empty() ? 0 : columns->size(); }
    };

+    /// Simple class that checks the stop condition during the merge process.
+    /// In the simple case it just compares the number of merged rows with max_block_size;
+    /// in the `count_average` case it compares the number of merged rows with a linear combination
+    /// of the sizes of the blocks from which these rows were taken.
+    struct MergeStopCondition
+    {
+        size_t sum_blocks_granularity = 0;
+        size_t sum_rows_count = 0;
+        bool count_average;
+        size_t max_block_size;
+
+        MergeStopCondition(bool count_average_, size_t max_block_size_)
+            : count_average(count_average_)
+            , max_block_size(max_block_size_)
+        {}
+
+        /// Add a single row taken from a block of size `granularity`
+        void addRowWithGranularity(size_t granularity)
+        {
+            sum_blocks_granularity += granularity;
+            sum_rows_count++;
+        }
+
+        /// Check whether enough rows (sum_rows_count) have been accumulated to stop
+        bool checkStop() const;
+
+        bool empty() const
+        {
+            return sum_blocks_granularity == 0;
+        }
+    };
+

    Block readImpl() override;

@ -139,6 +171,7 @@ protected:
    bool first = true;
    bool has_collation = false;
    bool quiet = false;
+    bool average_block_sizes = false;

    /// May be smaller or equal to max_block_size. To do 'reserve' for columns.
    size_t expected_block_size = 0;
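
Note on `MergeStopCondition`: the header above only declares `checkStop()`, and its definition is not part of this hunk. The following is a minimal sketch of one plausible definition, assuming that in `count_average` mode the merge stops once the number of accumulated rows reaches the integer average of the source block sizes. `StopConditionSketch` and `rowsUntilStop` are hypothetical names used for illustration only, and the use of `>=` in the plain mode is likewise an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

/// Hypothetical stand-in for MergeStopCondition. The body of checkStop()
/// is an assumption, not the definition from the commit.
struct StopConditionSketch
{
    std::size_t sum_blocks_granularity = 0;
    std::size_t sum_rows_count = 0;
    bool count_average;
    std::size_t max_block_size;

    StopConditionSketch(bool count_average_, std::size_t max_block_size_)
        : count_average(count_average_), max_block_size(max_block_size_)
    {}

    /// Account for one merged row that came from a block of `granularity` rows.
    void addRowWithGranularity(std::size_t granularity)
    {
        sum_blocks_granularity += granularity;
        ++sum_rows_count;
    }

    bool checkStop() const
    {
        /// Plain mode: stop once max_block_size rows are accumulated.
        if (!count_average)
            return sum_rows_count >= max_block_size;

        if (sum_rows_count == 0)
            return false;

        /// Average mode: stop once the row count reaches the average
        /// size of the blocks the rows were taken from.
        std::size_t average = sum_blocks_granularity / sum_rows_count;
        return sum_rows_count >= average;
    }
};

/// Feeds per-row granularities in merge order and returns how many rows
/// were consumed when the stop condition first fired (or all of them).
inline std::size_t rowsUntilStop(StopConditionSketch cond, const std::vector<std::size_t> & granularities)
{
    for (std::size_t granularity : granularities)
    {
        assert(granularity > 0);
        cond.addRowWithGranularity(granularity);
        if (cond.checkStop())
            break;
    }
    return cond.sum_rows_count;
}
```

Under this assumption, feeding five rows of granularity 5, two of granularity 10 and one of granularity 21 (in interleaved merge order) stops after 8 rows, since 66 / 8 = 8. That is consistent with the 8-row first block expected by `SimpleBlockSizeTest`, although the real `merge()` contains additional logic that this sketch ignores.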
--- a/dbms/src/DataStreams/ReplacingSortedBlockInputStream.cpp
+++ b/dbms/src/DataStreams/ReplacingSortedBlockInputStream.cpp
@ -12,7 +12,7 @@ namespace ErrorCodes
 }


-void ReplacingSortedBlockInputStream::insertRow(MutableColumns & merged_columns, size_t & merged_rows)
+void ReplacingSortedBlockInputStream::insertRow(MutableColumns & merged_columns)
 {
    if (out_row_sources_buf)
    {
@ -24,7 +24,6 @@ void ReplacingSortedBlockInputStream::insertRow(MutableColumns & merged_columns,
        current_row_sources.resize(0);
    }

-    ++merged_rows;
    for (size_t i = 0; i < num_columns; ++i)
        merged_columns[i]->insertFrom(*(*selected_row.columns)[i], selected_row.row_num);
 }
@ -51,12 +50,12 @@ Block ReplacingSortedBlockInputStream::readImpl()

 void ReplacingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::priority_queue<SortCursor> & queue)
 {
-    size_t merged_rows = 0;
-
+    MergeStopCondition stop_condition(average_block_sizes, max_block_size);
    /// Take the rows in needed order and put them into `merged_columns` until rows no more than `max_block_size`
    while (!queue.empty())
    {
        SortCursor current = queue.top();
+        size_t current_block_granularity = current->rows;

        if (current_key.empty())
            setPrimaryKeyRef(current_key, current);
@ -66,7 +65,7 @@ void ReplacingSortedBlockInputStream::merge(MutableColumns & merged_columns, std
        bool key_differs = next_key != current_key;

        /// if there are enough rows and the last one is calculated completely
-        if (key_differs && merged_rows >= max_block_size)
+        if (key_differs && stop_condition.checkStop())
            return;

        queue.pop();
@ -74,7 +73,8 @@ void ReplacingSortedBlockInputStream::merge(MutableColumns & merged_columns, std
        if (key_differs)
        {
            /// Write the data for the previous primary key.
-            insertRow(merged_columns, merged_rows);
+            insertRow(merged_columns);
+            stop_condition.addRowWithGranularity(current_block_granularity);
            selected_row.reset();
            current_key.swap(next_key);
        }
@ -110,7 +110,7 @@ void ReplacingSortedBlockInputStream::merge(MutableColumns & merged_columns, std

    /// We will write the data for the last primary key.
    if (!selected_row.empty())
-        insertRow(merged_columns, merged_rows);
+        insertRow(merged_columns);

    finished = true;
 }
--- a/dbms/src/DataStreams/ReplacingSortedBlockInputStream.h
+++ b/dbms/src/DataStreams/ReplacingSortedBlockInputStream.h
@ -18,8 +18,9 @@ class ReplacingSortedBlockInputStream : public MergingSortedBlockInputStream
 public:
    ReplacingSortedBlockInputStream(
        const BlockInputStreams & inputs_, const SortDescription & description_,
-        const String & version_column, size_t max_block_size_, WriteBuffer * out_row_sources_buf_ = nullptr)
-        : MergingSortedBlockInputStream(inputs_, description_, max_block_size_, 0, out_row_sources_buf_)
+        const String & version_column, size_t max_block_size_, WriteBuffer * out_row_sources_buf_ = nullptr,
+        bool average_block_sizes_ = false)
+        : MergingSortedBlockInputStream(inputs_, description_, max_block_size_, 0, out_row_sources_buf_, false, average_block_sizes_)
    {
        if (!version_column.empty())
            version_column_number = header.getPositionByName(version_column);
@ -54,7 +55,7 @@ private:
    void merge(MutableColumns & merged_columns, std::priority_queue<SortCursor> & queue);

    /// Output into result the rows for current primary key.
-    void insertRow(MutableColumns & merged_columns, size_t & merged_rows);
+    void insertRow(MutableColumns & merged_columns);
 };

 }
--- a/dbms/src/DataStreams/TTLBlockInputStream.cpp
+++ b/dbms/src/DataStreams/TTLBlockInputStream.cpp
@ -0,0 +1,208 @@
+#include <DataStreams/TTLBlockInputStream.h>
+#include <DataTypes/DataTypeDate.h>
+#include <Interpreters/evaluateMissingDefaults.h>
+#include <Interpreters/SyntaxAnalyzer.h>
+#include <Interpreters/ExpressionAnalyzer.h>
+
+namespace DB
+{
+
+namespace ErrorCodes
+{
+    extern const int LOGICAL_ERROR;
+}
+
+
+TTLBlockInputStream::TTLBlockInputStream(
+    const BlockInputStreamPtr & input_,
+    const MergeTreeData & storage_,
+    const MergeTreeData::MutableDataPartPtr & data_part_,
+    time_t current_time_)
+    : storage(storage_)
+    , data_part(data_part_)
+    , current_time(current_time_)
+    , old_ttl_infos(data_part->ttl_infos)
+    , log(&Logger::get(storage.getLogName() + " (TTLBlockInputStream)"))
+    , date_lut(DateLUT::instance())
+{
+    children.push_back(input_);
+
+    const auto & column_defaults = storage.getColumns().getDefaults();
+    ASTPtr default_expr_list = std::make_shared<ASTExpressionList>();
+    for (const auto & [name, ttl_info] : old_ttl_infos.columns_ttl)
+    {
+        if (ttl_info.min <= current_time)
+        {
+            new_ttl_infos.columns_ttl.emplace(name, MergeTreeDataPart::TTLInfo{});
+            empty_columns.emplace(name);
+
+            auto it = column_defaults.find(name);
+
+            if (it != column_defaults.end())
+                default_expr_list->children.emplace_back(
+                    setAlias(it->second.expression, it->first));
+        }
+        else
+            new_ttl_infos.columns_ttl.emplace(name, ttl_info);
+    }
+
+    if (old_ttl_infos.table_ttl.min > current_time)
+        new_ttl_infos.table_ttl = old_ttl_infos.table_ttl;
+
+    if (!default_expr_list->children.empty())
+    {
+        auto syntax_result = SyntaxAnalyzer(storage.global_context).analyze(
+            default_expr_list, storage.getColumns().getAllPhysical());
+        defaults_expression = ExpressionAnalyzer{default_expr_list, syntax_result, storage.global_context}.getActions(true);
+    }
+}
+
+
+Block TTLBlockInputStream::getHeader() const
+{
+    return children.at(0)->getHeader();
+}
+
+Block TTLBlockInputStream::readImpl()
+{
+    Block block = children.at(0)->read();
+    if (!block)
+        return block;
+
+    if (storage.hasTableTTL())
+    {
+        /// Skip all data if table ttl is expired for part
+        if (old_ttl_infos.table_ttl.max <= current_time)
+        {
+            rows_removed = data_part->rows_count;
+            return {};
+        }
+
+        if (old_ttl_infos.table_ttl.min <= current_time)
+            removeRowsWithExpiredTableTTL(block);
+    }
+
+    removeValuesWithExpiredColumnTTL(block);
+
+    return block;
+}
+
+void TTLBlockInputStream::readSuffixImpl()
+{
+    for (const auto & elem : new_ttl_infos.columns_ttl)
+        new_ttl_infos.updatePartMinTTL(elem.second.min);
+
+    new_ttl_infos.updatePartMinTTL(new_ttl_infos.table_ttl.min);
+
+    data_part->ttl_infos = std::move(new_ttl_infos);
+    data_part->empty_columns = std::move(empty_columns);
+
+    if (rows_removed)
+        LOG_INFO(log, "Removed " << rows_removed << " rows with expired ttl from part " << data_part->name);
+}
+
+void TTLBlockInputStream::removeRowsWithExpiredTableTTL(Block & block)
+{
+    storage.ttl_table_entry.expression->execute(block);
+
+    const auto & current = block.getByName(storage.ttl_table_entry.result_column);
+    const IColumn * ttl_column = current.column.get();
+
+    MutableColumns result_columns;
+    result_columns.reserve(getHeader().columns());
+    for (const auto & name : storage.getColumns().getNamesOfPhysical())
+    {
+        auto & column_with_type = block.getByName(name);
+        const IColumn * values_column = column_with_type.column.get();
+        MutableColumnPtr result_column = values_column->cloneEmpty();
+        result_column->reserve(block.rows());
+
+        for (size_t i = 0; i < block.rows(); ++i)
+        {
+            UInt32 cur_ttl = getTimestampByIndex(ttl_column, i);
+            if (cur_ttl > current_time)
+            {
+                new_ttl_infos.table_ttl.update(cur_ttl);
+                result_column->insertFrom(*values_column, i);
+            }
+            else
+                ++rows_removed;
+        }
+        result_columns.emplace_back(std::move(result_column));
+    }
+
+    block = getHeader().cloneWithColumns(std::move(result_columns));
+}
+
+void TTLBlockInputStream::removeValuesWithExpiredColumnTTL(Block & block)
+{
+    Block block_with_defaults;
+    if (defaults_expression)
+    {
+        block_with_defaults = block;
+        defaults_expression->execute(block_with_defaults);
+    }
+
+    for (const auto & [name, ttl_entry] : storage.ttl_entries_by_name)
+    {
+        const auto & old_ttl_info = old_ttl_infos.columns_ttl[name];
+        auto & new_ttl_info = new_ttl_infos.columns_ttl[name];
+
+        if (old_ttl_info.min > current_time)
+            continue;
+
+        if (old_ttl_info.max <= current_time)
+            continue;
+
+        if (!block.has(ttl_entry.result_column))
+            ttl_entry.expression->execute(block);
+
+        ColumnPtr default_column = nullptr;
+        if (block_with_defaults.has(name))
+            default_column = block_with_defaults.getByName(name).column->convertToFullColumnIfConst();
+
+        auto & column_with_type = block.getByName(name);
+        const IColumn * values_column = column_with_type.column.get();
+        MutableColumnPtr result_column = values_column->cloneEmpty();
+        result_column->reserve(block.rows());
+
+        const auto & current = block.getByName(ttl_entry.result_column);
+        const IColumn * ttl_column = current.column.get();
+
+        for (size_t i = 0; i < block.rows(); ++i)
+        {
+            UInt32 cur_ttl = getTimestampByIndex(ttl_column, i);
+
+            if (cur_ttl <= current_time)
+            {
+                if (default_column)
+                    result_column->insertFrom(*default_column, i);
+                else
+                    result_column->insertDefault();
+            }
+            else
+            {
+                new_ttl_info.update(cur_ttl);
+                empty_columns.erase(name);
+                result_column->insertFrom(*values_column, i);
+            }
+        }
+        column_with_type.column = std::move(result_column);
+    }
+
+    for (const auto & elem : storage.ttl_entries_by_name)
+        if (block.has(elem.second.result_column))
+            block.erase(elem.second.result_column);
+}
+
+UInt32 TTLBlockInputStream::getTimestampByIndex(const IColumn * column, size_t ind)
+{
+    if (const ColumnUInt16 * column_date = typeid_cast<const ColumnUInt16 *>(column))
+        return date_lut.fromDayNum(DayNum(column_date->getData()[ind]));
+    else if (const ColumnUInt32 * column_date_time = typeid_cast<const ColumnUInt32 *>(column))
+        return column_date_time->getData()[ind];
+    else
+        throw Exception("Unexpected type of result ttl column", ErrorCodes::LOGICAL_ERROR);
+}
+
+}
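
Note on `removeValuesWithExpiredColumnTTL`: the per-row logic above can be illustrated in isolation. The sketch below assumes plain `std::vector` columns instead of `IColumn`, a single fixed default value instead of a defaults expression, and a simplified `TtlInfoSketch` whose `update` keeps the minimum and maximum surviving timestamp. All names in it are hypothetical, and the exact behavior of `MergeTreeDataPart::TTLInfo::update` is not shown in this diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <ctime>
#include <vector>

/// Hypothetical simplification of a per-column TTL summary:
/// tracks the smallest and largest TTL among values that survived.
struct TtlInfoSketch
{
    std::time_t min = 0;
    std::time_t max = 0;

    void update(std::time_t time)
    {
        min = (min == 0) ? time : std::min(min, time);
        max = std::max(max, time);
    }
};

/// Returns a copy of `values` in which every value whose TTL timestamp is
/// <= `now` is replaced by `default_value`. TTLs of surviving values are
/// recorded in `info`, mirroring the loop in the diff above.
inline std::vector<int> expireValues(
    const std::vector<int> & values,
    const std::vector<std::uint32_t> & ttl,
    std::time_t now,
    int default_value,
    TtlInfoSketch & info)
{
    std::vector<int> result;
    result.reserve(values.size());

    const std::size_t rows = std::min(values.size(), ttl.size());
    for (std::size_t i = 0; i < rows; ++i)
    {
        const std::time_t cur_ttl = static_cast<std::time_t>(ttl[i]);
        if (cur_ttl <= now)
        {
            result.push_back(default_value);   /// expired: write the default
        }
        else
        {
            info.update(cur_ttl);              /// still alive: remember its TTL
            result.push_back(values[i]);
        }
    }
    return result;
}
```

With values `{10, 20, 30}`, TTL timestamps `{100, 200, 300}` and `now = 150`, only the first value is replaced by the default, and the summary ends up with `min = 200` and `max = 300`.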
--- a/dbms/src/DataStreams/TTLBlockInputStream.h
+++ b/dbms/src/DataStreams/TTLBlockInputStream.h
@ -0,0 +1,60 @@
+#pragma once
+#include <DataStreams/IBlockInputStream.h>
+#include <Storages/MergeTree/MergeTreeData.h>
+#include <Storages/MergeTree/MergeTreeDataPart.h>
+#include <Core/Block.h>
+
+#include <common/DateLUT.h>
+
+namespace DB
+{
+
+class TTLBlockInputStream : public IBlockInputStream
+{
+public:
+    TTLBlockInputStream(
+        const BlockInputStreamPtr & input_,
+        const MergeTreeData & storage_,
+        const MergeTreeData::MutableDataPartPtr & data_part_,
+        time_t current_time
+    );
+
+    String getName() const override { return "TTLBlockInputStream"; }
+
+    Block getHeader() const override;
+
+protected:
+    Block readImpl() override;
+
+    /// Finalizes ttl infos and updates data part
+    void readSuffixImpl() override;
+
+private:
+    const MergeTreeData & storage;
+
+    /// ttl_infos and empty_columns of the part are updated while reading
+    const MergeTreeData::MutableDataPartPtr & data_part;
+
+    time_t current_time;
+
+    MergeTreeDataPart::TTLInfos old_ttl_infos;
+    MergeTreeDataPart::TTLInfos new_ttl_infos;
+    NameSet empty_columns;
+
+    size_t rows_removed = 0;
+    Logger * log;
+    DateLUTImpl date_lut;
+
+    std::unordered_map<String, String> defaults_result_column;
+    ExpressionActionsPtr defaults_expression;
+private:
+    /// Removes values with expired ttl and computes new min_ttl and empty_columns for part
+    void removeValuesWithExpiredColumnTTL(Block & block);
+
+    /// Removes rows with expired table ttl and computes new min_ttl for part
+    void removeRowsWithExpiredTableTTL(Block & block);
+
+    UInt32 getTimestampByIndex(const IColumn * column, size_t ind);
+};
+
+}
--- a/dbms/src/DataStreams/VersionedCollapsingSortedBlockInputStream.cpp
+++ b/dbms/src/DataStreams/VersionedCollapsingSortedBlockInputStream.cpp
@ -16,8 +16,8 @@ namespace ErrorCodes
 VersionedCollapsingSortedBlockInputStream::VersionedCollapsingSortedBlockInputStream(
    const BlockInputStreams & inputs_, const SortDescription & description_,
    const String & sign_column_, size_t max_block_size_,
-    WriteBuffer * out_row_sources_buf_)
-    : MergingSortedBlockInputStream(inputs_, description_, max_block_size_, 0, out_row_sources_buf_)
+    WriteBuffer * out_row_sources_buf_, bool average_block_sizes_)
+    : MergingSortedBlockInputStream(inputs_, description_, max_block_size_, 0, out_row_sources_buf_, false, average_block_sizes_)
    , max_rows_in_queue(std::min(std::max<size_t>(3, max_block_size_), MAX_ROWS_IN_MULTIVERSION_QUEUE) - 2)
    , current_keys(max_rows_in_queue + 1)
 {
@ -83,7 +83,7 @@ Block VersionedCollapsingSortedBlockInputStream::readImpl()

 void VersionedCollapsingSortedBlockInputStream::merge(MutableColumns & merged_columns, std::priority_queue<SortCursor> & queue)
 {
-    size_t merged_rows = 0;
+    MergeStopCondition stop_condition(average_block_sizes, max_block_size);

    auto update_queue = [this, & queue](SortCursor & cursor)
    {
@ -108,6 +108,7 @@ void VersionedCollapsingSortedBlockInputStream::merge(MutableColumns & merged_co
    while (!queue.empty())
    {
        SortCursor current = queue.top();
+        size_t current_block_granularity = current->rows;

        RowRef next_key;

@ -154,10 +155,10 @@ void VersionedCollapsingSortedBlockInputStream::merge(MutableColumns & merged_co

            current_keys.popFront();

-            ++merged_rows;
+            stop_condition.addRowWithGranularity(current_block_granularity);
            --rows_to_merge;

-            if (merged_rows >= max_block_size)
+            if (stop_condition.checkStop())
            {
                ++blocks_written;
                return;
@ -173,7 +174,6 @@ void VersionedCollapsingSortedBlockInputStream::merge(MutableColumns & merged_co
        insertRow(gap, row, merged_columns);

        current_keys.popFront();
-        ++merged_rows;
    }

    /// Write information about last collapsed rows.
--- a/dbms/src/DataStreams/VersionedCollapsingSortedBlockInputStream.h
+++ b/dbms/src/DataStreams/VersionedCollapsingSortedBlockInputStream.h
@ -178,7 +178,7 @@ public:
    VersionedCollapsingSortedBlockInputStream(
        const BlockInputStreams & inputs_, const SortDescription & description_,
        const String & sign_column_, size_t max_block_size_,
-        WriteBuffer * out_row_sources_buf_ = nullptr);
+        WriteBuffer * out_row_sources_buf_ = nullptr, bool average_block_sizes_ = false);

    String getName() const override { return "VersionedCollapsingSorted"; }

--- a/dbms/src/DataStreams/tests/gtest_blocks_size_merging_streams.cpp
+++ b/dbms/src/DataStreams/tests/gtest_blocks_size_merging_streams.cpp
@ -0,0 +1,138 @@
+#pragma GCC diagnostic ignored "-Wsign-compare"
+#ifdef __clang__
+#pragma clang diagnostic ignored "-Wzero-as-null-pointer-constant"
+#pragma clang diagnostic ignored "-Wundef"
+#endif
+
+#include <gtest/gtest.h>
+#include <Core/Block.h>
+#include <Columns/ColumnVector.h>
+#include <DataStreams/MergingSortedBlockInputStream.h>
+#include <DataStreams/BlocksListBlockInputStream.h>
+#include <DataTypes/DataTypesNumber.h>
+#include <Columns/ColumnsNumber.h>
+
+using namespace DB;
+
+Block getBlockWithSize(const std::vector<std::string> & columns, size_t rows, size_t stride, size_t & start)
+{
+
+    ColumnsWithTypeAndName cols;
+    size_t size_of_row_in_bytes = columns.size() * sizeof(UInt64);
+    for (size_t i = 0; i * sizeof(UInt64) < size_of_row_in_bytes; i++)
+    {
+        auto column = ColumnUInt64::create(rows, 0);
+        for (size_t j = 0; j < rows; ++j)
+        {
+            column->getElement(j) = start;
+            start += stride;
+        }
+        cols.emplace_back(std::move(column), std::make_shared<DataTypeUInt64>(), columns[i]);
+    }
+    return Block(cols);
+}
+
+
+BlockInputStreams getInputStreams(const std::vector<std::string> & column_names, const std::vector<std::tuple<size_t, size_t, size_t>> & block_sizes)
+{
+    BlockInputStreams result;
+    for (auto [block_size_in_bytes, blocks_count, stride] : block_sizes)
+    {
+        BlocksList blocks;
+        size_t start = stride;
+        while (blocks_count--)
+            blocks.push_back(getBlockWithSize(column_names, block_size_in_bytes, stride, start));
+        result.push_back(std::make_shared<BlocksListBlockInputStream>(std::move(blocks)));
+    }
+    return result;
+
+}
+
+
+BlockInputStreams getInputStreamsEqualStride(const std::vector<std::string> & column_names, const std::vector<std::tuple<size_t, size_t, size_t>> & block_sizes)
+{
+    BlockInputStreams result;
+    size_t i = 0;
+    for (auto [block_size_in_bytes, blocks_count, stride] : block_sizes)
+    {
+        BlocksList blocks;
+        size_t start = i;
+        while (blocks_count--)
+            blocks.push_back(getBlockWithSize(column_names, block_size_in_bytes, stride, start));
+        result.push_back(std::make_shared<BlocksListBlockInputStream>(std::move(blocks)));
+        i++;
+    }
+    return result;
+
+}
+
+
+SortDescription getSortDescription(const std::vector<std::string> & column_names)
+{
+    SortDescription descr;
+    for (const auto & column : column_names)
+    {
+        descr.emplace_back(column, 1, 1);
+    }
+    return descr;
+}
+
+TEST(MergingSortedTest, SimpleBlockSizeTest)
+{
+    std::vector<std::string> key_columns{"K1", "K2", "K3"};
+    auto sort_description = getSortDescription(key_columns);
+    auto streams = getInputStreams(key_columns, {{5, 1, 1}, {10, 1, 2}, {21, 1, 3}});
+
+    EXPECT_EQ(streams.size(), 3);
+
+    MergingSortedBlockInputStream stream(streams, sort_description, DEFAULT_MERGE_BLOCK_SIZE, 0, nullptr, false, true);
+
+    size_t total_rows = 0;
+    auto block1 = stream.read();
+    auto block2 = stream.read();
+    auto block3 = stream.read();
+
+    EXPECT_EQ(stream.read(), Block());
+
+    for (auto & block : {block1, block2, block3})
+        total_rows += block.rows();
+    /**
+      * First block consists of 1 row from block3 with 21 rows + 2 rows from block2 with 10 rows
+      * + 5 rows from block 1 with 5 rows granularity
+      */
+    EXPECT_EQ(block1.rows(), 8);
+    /**
+      * Combination of 10 and 21 rows blocks
+      */
+    EXPECT_EQ(block2.rows(), 14);
+    /**
+      * Combination of 10 and 21 rows blocks
+      */
+    EXPECT_EQ(block3.rows(), 14);
+
+    EXPECT_EQ(total_rows, 5 + 10 + 21);
+}
+
+
+TEST(MergingSortedTest, MoreInterestingBlockSizes)
+{
+    std::vector<std::string> key_columns{"K1", "K2", "K3"};
+    auto sort_description = getSortDescription(key_columns);
+    auto streams = getInputStreamsEqualStride(key_columns, {{1000, 1, 3}, {1500, 1, 3}, {1400, 1, 3}});
+
+    EXPECT_EQ(streams.size(), 3);
+
+    MergingSortedBlockInputStream stream(streams, sort_description, DEFAULT_MERGE_BLOCK_SIZE, 0, nullptr, false, true);
+
+    auto block1 = stream.read();
+    auto block2 = stream.read();
+    auto block3 = stream.read();
+
+    EXPECT_EQ(stream.read(), Block());
+
+    EXPECT_EQ(block1.rows(), (1000 + 1500 + 1400) / 3);
+    EXPECT_EQ(block2.rows(), (1000 + 1500 + 1400) / 3);
+    EXPECT_EQ(block3.rows(), (1000 + 1500 + 1400) / 3);
+
+    EXPECT_EQ(block1.rows() + block2.rows() + block3.rows(), 1000 + 1500 + 1400);
+}
--- a/dbms/src/Databases/DatabaseDictionary.cpp
+++ b/dbms/src/Databases/DatabaseDictionary.cpp
@ -65,21 +65,12 @@ StoragePtr DatabaseDictionary::tryGetTable(
    const Context & context,
    const String & table_name) const
 {
-    auto objects_map = context.getExternalDictionaries().getObjectsMap();
-    const auto & dictionaries = objects_map.get();
-
+    auto dict_ptr = context.getExternalDictionaries().tryGetDictionary(table_name);
+    if (dict_ptr)
    {
-        auto it = dictionaries.find(table_name);
-        if (it != dictionaries.end())
-        {
-            const auto & dict_ptr = std::static_pointer_cast<IDictionaryBase>(it->second.loadable);
-            if (dict_ptr)
-            {
-                const DictionaryStructure & dictionary_structure = dict_ptr->getStructure();
-                auto columns = StorageDictionary::getNamesAndTypes(dictionary_structure);
-                return StorageDictionary::create(table_name, ColumnsDescription{columns}, context, true, table_name);
-            }
-        }
+        const DictionaryStructure & dictionary_structure = dict_ptr->getStructure();
+        auto columns = StorageDictionary::getNamesAndTypes(dictionary_structure);
+        return StorageDictionary::create(table_name, ColumnsDescription{columns}, context, true, table_name);
    }

    return {};
--- a/dbms/src/Functions/isValidUTF8.cpp
+++ b/dbms/src/Functions/isValidUTF8.cpp
@ -283,8 +283,7 @@ SOFTWARE.
        memset(buf + len + 1, 0, 16);
        check_packed(_mm_loadu_si128(reinterpret_cast<__m128i *>(buf + 1)));

-        /* Reduce error vector, error_reduced = 0xFFFF if error == 0 */
-        return _mm_movemask_epi8(_mm_cmpeq_epi8(error, _mm_set1_epi8(0))) == 0xFFFF;
+        return _mm_testz_si128(error, error);
    }
 #endif

--- a/dbms/src/IO/UncompressedCache.h
+++ b/dbms/src/IO/UncompressedCache.h
@ -6,6 +6,11 @@
 #include <Common/ProfileEvents.h>
 #include <IO/BufferWithOwnMemory.h>

+#include <Common/config.h>
+#if USE_LFALLOC
+#include <Common/LFAllocator.h>
+#endif
+

 namespace ProfileEvents
 {
@ -20,7 +25,11 @@ namespace DB

 struct UncompressedCacheCell
 {
+#if USE_LFALLOC
+    Memory<LFAllocator> data;
+#else
    Memory<> data;
+#endif
    size_t compressed_size;
    UInt32 additional_bytes;
 };
--- a/dbms/src/Interpreters/ClientInfo.h
+++ b/dbms/src/Interpreters/ClientInfo.h
@ -37,7 +37,7 @@ public:
    {
        NO_QUERY = 0,            /// Uninitialized object.
        INITIAL_QUERY = 1,
-        SECONDARY_QUERY = 2,    /// Query that was initiated by another query for distributed query execution.
+        SECONDARY_QUERY = 2,    /// Query that was initiated by another query for distributed or ON CLUSTER query execution.
    };


--- a/dbms/src/Interpreters/Compiler.cpp
+++ b/dbms/src/Interpreters/Compiler.cpp
@ -2,6 +2,7 @@
 #include <Poco/Util/Application.h>
 #include <ext/unlock_guard.h>
 #include <Common/ClickHouseRevision.h>
+#include <Common/config.h>
 #include <Common/SipHash.h>
 #include <Common/ShellCommand.h>
 #include <Common/StringUtils/StringUtils.h>
@ -261,6 +262,9 @@ void Compiler::compile(
            " -I " << compiler_headers << "/dbms/src/"
            " -isystem " << compiler_headers << "/contrib/cityhash102/include/"
            " -isystem " << compiler_headers << "/contrib/libpcg-random/include/"
+        #if USE_LFALLOC
+            " -isystem " << compiler_headers << "/contrib/lfalloc/src/"
+        #endif
            " -isystem " << compiler_headers << INTERNAL_DOUBLE_CONVERSION_INCLUDE_DIR
            " -isystem " << compiler_headers << INTERNAL_Poco_Foundation_INCLUDE_DIR
            " -isystem " << compiler_headers << INTERNAL_Boost_INCLUDE_DIRS
--- a/dbms/src/Interpreters/Context.cpp
+++ b/dbms/src/Interpreters/Context.cpp
@ -104,7 +104,7 @@ struct ContextShared
    mutable std::recursive_mutex mutex;
    /// Separate mutex for access of dictionaries. Separate mutex to avoid locks when server doing request to itself.
    mutable std::mutex embedded_dictionaries_mutex;
-    mutable std::mutex external_dictionaries_mutex;
+    mutable std::recursive_mutex external_dictionaries_mutex;
    mutable std::mutex external_models_mutex;
    /// Separate mutex for re-initialization of zookeer session. This operation could take a long time and must not interfere with another operations.
    mutable std::mutex zookeeper_mutex;
@ -1240,44 +1240,38 @@ EmbeddedDictionaries & Context::getEmbeddedDictionariesImpl(const bool throw_on_

 ExternalDictionaries & Context::getExternalDictionariesImpl(const bool throw_on_error) const
 {
+    {
+        std::lock_guard lock(shared->external_dictionaries_mutex);
+        if (shared->external_dictionaries)
+            return *shared->external_dictionaries;
+    }
+
    const auto & config = getConfigRef();
-
    std::lock_guard lock(shared->external_dictionaries_mutex);
-
    if (!shared->external_dictionaries)
    {
        if (!this->global_context)
            throw Exception("Logical error: there is no global context", ErrorCodes::LOGICAL_ERROR);

        auto config_repository = shared->runtime_components_factory->createExternalDictionariesConfigRepository();
-
-        shared->external_dictionaries.emplace(
-            std::move(config_repository),
-            config,
-            *this->global_context,
-            throw_on_error);
+        shared->external_dictionaries.emplace(std::move(config_repository), config, *this->global_context);
+        shared->external_dictionaries->init(throw_on_error);
    }
-
    return *shared->external_dictionaries;
 }

 ExternalModels & Context::getExternalModelsImpl(bool throw_on_error) const
 {
    std::lock_guard lock(shared->external_models_mutex);
-
    if (!shared->external_models)
    {
        if (!this->global_context)
            throw Exception("Logical error: there is no global context", ErrorCodes::LOGICAL_ERROR);

        auto config_repository = shared->runtime_components_factory->createExternalModelsConfigRepository();
-
-        shared->external_models.emplace(
-            std::move(config_repository),
-            *this->global_context,
-            throw_on_error);
+        shared->external_models.emplace(std::move(config_repository), *this->global_context);
+        shared->external_models->init(throw_on_error);
    }
-
    return *shared->external_models;
 }

--- a/dbms/src/Interpreters/DDLWorker.cpp
+++ b/dbms/src/Interpreters/DDLWorker.cpp
@ -547,6 +547,7 @@ bool DDLWorker::tryExecuteQuery(const String & query, const DDLTask & task, Exec
    try
    {
        current_context = std::make_unique<Context>(context);
+        current_context->getClientInfo().query_kind = ClientInfo::QueryKind::SECONDARY_QUERY;
        current_context->setCurrentQueryId(""); // generate random query_id
        executeQuery(istr, ostr, false, *current_context, {}, {});
    }
--- a/dbms/src/Interpreters/ExternalDictionaries.cpp
+++ b/dbms/src/Interpreters/ExternalDictionaries.cpp
@ -30,8 +30,7 @@ namespace
 ExternalDictionaries::ExternalDictionaries(
    std::unique_ptr<IExternalLoaderConfigRepository> config_repository,
    const Poco::Util::AbstractConfiguration & config,
-    Context & context,
-    bool throw_on_error)
+    Context & context)
        : ExternalLoader(config,
                         externalDictionariesUpdateSettings,
                         getExternalDictionariesConfigSettings(),
@ -40,11 +39,11 @@ ExternalDictionaries::ExternalDictionaries(
                         "external dictionary"),
        context(context)
 {
-    init(throw_on_error);
 }

+
 std::unique_ptr<IExternalLoadable> ExternalDictionaries::create(
-        const std::string & name, const Configuration & config, const std::string & config_prefix)
+        const std::string & name, const Configuration & config, const std::string & config_prefix) const
 {
    return DictionaryFactory::instance().create(name, config, config_prefix, context);
 }
--- a/dbms/src/Interpreters/ExternalDictionaries.h
+++ b/dbms/src/Interpreters/ExternalDictionaries.h
@ -21,8 +21,7 @@ public:
    ExternalDictionaries(
        std::unique_ptr<IExternalLoaderConfigRepository> config_repository,
        const Poco::Util::AbstractConfiguration & config,
-        Context & context,
-        bool throw_on_error);
+        Context & context);

    /// Forcibly reloads specified dictionary.
    void reloadDictionary(const std::string & name) { reload(name); }
@ -40,7 +39,7 @@ public:
 protected:

    std::unique_ptr<IExternalLoadable> create(const std::string & name, const Configuration & config,
-                                              const std::string & config_prefix) override;
+                                              const std::string & config_prefix) const override;

    using ExternalLoader::getObjectsMap;

--- a/dbms/src/Interpreters/ExternalLoader.cpp
+++ b/dbms/src/Interpreters/ExternalLoader.cpp
@ -57,12 +57,16 @@ ExternalLoader::ExternalLoader(const Poco::Util::AbstractConfiguration & config_
 {
 }

+
 void ExternalLoader::init(bool throw_on_error)
 {
-    if (is_initialized)
-        return;
+    std::call_once(is_initialized_flag, &ExternalLoader::initImpl, this, throw_on_error);
+}

-    is_initialized = true;
+
+void ExternalLoader::initImpl(bool throw_on_error)
+{
+    std::lock_guard all_lock(all_mutex);

    {
        /// During synchronous loading of external dictionaries at moment of query execution,
@ -87,13 +91,13 @@ ExternalLoader::~ExternalLoader()

 void ExternalLoader::reloadAndUpdate(bool throw_on_error)
 {
+    std::lock_guard all_lock(all_mutex);
+
    reloadFromConfigFiles(throw_on_error);

    /// list of recreated loadable objects to perform delayed removal from unordered_map
    std::list<std::string> recreated_failed_loadable_objects;

-    std::lock_guard all_lock(all_mutex);
-
    /// retry loading failed loadable objects
    for (auto & failed_loadable_object : failed_loadable_objects)
    {
@ -250,15 +254,17 @@ void ExternalLoader::reloadAndUpdate(bool throw_on_error)
    }
 }

+
 void ExternalLoader::reloadFromConfigFiles(const bool throw_on_error, const bool force_reload, const std::string & only_dictionary)
 {
-    const auto config_paths = config_repository->list(config_main, config_settings.path_setting_name);
+    std::lock_guard all_lock{all_mutex};

+    const auto config_paths = config_repository->list(config_main, config_settings.path_setting_name);
    for (const auto & config_path : config_paths)
    {
        try
        {
-            reloadFromConfigFile(config_path, throw_on_error, force_reload, only_dictionary);
+            reloadFromConfigFile(config_path, force_reload, only_dictionary);
        }
        catch (...)
        {
@ -270,30 +276,34 @@ void ExternalLoader::reloadFromConfigFiles(const bool throw_on_error, const bool
    }

    /// erase removed from config loadable objects
-    std::lock_guard lock{map_mutex};
-
-    std::list<std::string> removed_loadable_objects;
-    for (const auto & loadable : loadable_objects)
    {
-        const auto & current_config = loadable_objects_defined_in_config[loadable.second.origin];
-        if (current_config.find(loadable.first) == std::end(current_config))
-            removed_loadable_objects.emplace_back(loadable.first);
+        std::lock_guard lock{map_mutex};
+        std::list<std::string> removed_loadable_objects;
+        for (const auto & loadable : loadable_objects)
+        {
+            const auto & current_config = loadable_objects_defined_in_config[loadable.second.origin];
+            if (current_config.find(loadable.first) == std::end(current_config))
+                removed_loadable_objects.emplace_back(loadable.first);
+        }
+        for (const auto & name : removed_loadable_objects)
+            loadable_objects.erase(name);
    }
-    for (const auto & name : removed_loadable_objects)
-        loadable_objects.erase(name);
+
+    /// create all loadable objects which were read from config
+    finishAllReloads(throw_on_error);
 }

-void ExternalLoader::reloadFromConfigFile(const std::string & config_path, const bool throw_on_error,
-                                          const bool force_reload, const std::string & loadable_name)
+
+void ExternalLoader::reloadFromConfigFile(const std::string & config_path, const bool force_reload, const std::string & loadable_name)
 {
+    // We assume `all_mutex` is already locked.
+
    if (config_path.empty() || !config_repository->exists(config_path))
    {
        LOG_WARNING(log, "config file '" + config_path + "' does not exist");
    }
    else
    {
-        std::lock_guard all_lock(all_mutex);
-
        auto modification_time_it = last_modification_times.find(config_path);
        if (modification_time_it == std::end(last_modification_times))
            modification_time_it = last_modification_times.emplace(config_path, Poco::Timestamp{0}).first;
@ -329,103 +339,24 @@ void ExternalLoader::reloadFromConfigFile(const std::string & config_path, const
                    continue;
                }

-                try
+                name = loaded_config->getString(key + "." + config_settings.external_name);
+                if (name.empty())
                {
-                    name = loaded_config->getString(key + "." + config_settings.external_name);
-                    if (name.empty())
-                    {
-                        LOG_WARNING(log, config_path << ": " + config_settings.external_name + " name cannot be empty");
-                        continue;
-                    }
-
-                    loadable_objects_defined_in_config[config_path].emplace(name);
-                    if (!loadable_name.empty() && name != loadable_name)
-                        continue;
-
-                    decltype(loadable_objects.begin()) object_it;
-                    {
-                        std::lock_guard lock{map_mutex};
-                        object_it = loadable_objects.find(name);
-
-                        /// Object with the same name was declared in other config file.
-                        if (object_it != std::end(loadable_objects) && object_it->second.origin != config_path)
-                            throw Exception(object_name + " '" + name + "' from file " + config_path
-                                            + " already declared in file " + object_it->second.origin,
-                                            ErrorCodes::EXTERNAL_LOADABLE_ALREADY_EXISTS);
-                    }
-
-                    auto object_ptr = create(name, *loaded_config, key);
-
-                    /// If the object could not be loaded.
-                    if (const auto exception_ptr = object_ptr->getCreationException())
-                    {
-                        std::chrono::seconds delay(update_settings.backoff_initial_sec);
-                        const auto failed_dict_it = failed_loadable_objects.find(name);
-                        FailedLoadableInfo info{std::move(object_ptr), std::chrono::system_clock::now() + delay, 0};
-                        if (failed_dict_it != std::end(failed_loadable_objects))
-                            (*failed_dict_it).second = std::move(info);
-                        else
-                            failed_loadable_objects.emplace(name, std::move(info));
-
-                        std::rethrow_exception(exception_ptr);
-                    }
-                    else if (object_ptr->supportUpdates())
-                    {
-                        const auto & lifetime = object_ptr->getLifetime();
-                        if (lifetime.min_sec != 0 && lifetime.max_sec != 0)
-                        {
-                            std::uniform_int_distribution<UInt64> distribution(lifetime.min_sec, lifetime.max_sec);
-
-                            update_times[name] = std::chrono::system_clock::now() +
-                                                 std::chrono::seconds{distribution(rnd_engine)};
-                        }
-                    }
-
-                    std::lock_guard lock{map_mutex};
-
-                    /// add new loadable object or update an existing version
-                    if (object_it == std::end(loadable_objects))
-                        loadable_objects.emplace(name, LoadableInfo{std::move(object_ptr), config_path, {}});
-                    else
-                    {
-                        if (object_it->second.loadable)
-                            object_it->second.loadable.reset();
-                        object_it->second.loadable = std::move(object_ptr);
-
-                        /// erase stored exception on success
-                        object_it->second.exception = std::exception_ptr{};
-                        failed_loadable_objects.erase(name);
-                    }
+                    LOG_WARNING(log, config_path << ": " + config_settings.external_name + " name cannot be empty");
+                    continue;
                }
-                catch (...)
-                {
-                    if (!name.empty())
-                    {
-                        /// If the loadable object could not load data or even failed to initialize from the config.
-                        /// - all the same we insert information into the `loadable_objects`, with the zero pointer `loadable`.

-                        std::lock_guard lock{map_mutex};
+                loadable_objects_defined_in_config[config_path].emplace(name);
+                if (!loadable_name.empty() && name != loadable_name)
+                    continue;

-                        const auto exception_ptr = std::current_exception();
-                        const auto loadable_it = loadable_objects.find(name);
-                        if (loadable_it == std::end(loadable_objects))
-                            loadable_objects.emplace(name, LoadableInfo{nullptr, config_path, exception_ptr});
-                        else
-                            loadable_it->second.exception = exception_ptr;
-                    }
-
-                    tryLogCurrentException(log, "Cannot create " + object_name + " '"
-                                                + name + "' from config path " + config_path);
-
-                    /// propagate exception
-                    if (throw_on_error)
-                        throw;
-                }
+                objects_to_reload.emplace(name, LoadableCreationInfo{name, loaded_config, config_path, key});
            }
        }
    }
 }

+
 void ExternalLoader::reload()
 {
    reloadFromConfigFiles(true, true);
@ -441,10 +372,149 @@ void ExternalLoader::reload(const std::string & name)
        throw Exception("Failed to load " + object_name + " '" + name + "' during the reload process", ErrorCodes::BAD_ARGUMENTS);
 }

+
+void ExternalLoader::finishReload(const std::string & loadable_name, bool throw_on_error) const
+{
+    // We assume `all_mutex` is already locked.
+
+    auto it = objects_to_reload.find(loadable_name);
+    if (it == objects_to_reload.end())
+        return;
+
+    LoadableCreationInfo creation_info = std::move(it->second);
+    objects_to_reload.erase(it);
+    finishReloadImpl(creation_info, throw_on_error);
+}
+
+
+void ExternalLoader::finishAllReloads(bool throw_on_error) const
+{
+    // We assume `all_mutex` is already locked.
+
+    // We cannot just go through the map `objects_to_reload` from begin to end and create every object
+    // because these objects can depend on each other.
+    // For example, if the first object's data depends on the second object's data then
+    // creating the first object will cause creating the second object too.
+    while (!objects_to_reload.empty())
+    {
+        auto it = objects_to_reload.begin();
+        LoadableCreationInfo creation_info = std::move(it->second);
+        objects_to_reload.erase(it);
+
+        try
+        {
+            finishReloadImpl(creation_info, throw_on_error);
+        }
+        catch (...)
+        {
+            objects_to_reload.clear(); // no more objects to create after an exception
+            throw;
+        }
+    }
+}
+
+
+void ExternalLoader::finishReloadImpl(const LoadableCreationInfo & creation_info, bool throw_on_error) const
+{
+    // We assume `all_mutex` is already locked.
+
+    const std::string & name = creation_info.name;
+    const std::string & config_path = creation_info.config_path;
+
+    try
+    {
+        ObjectsMap::iterator object_it;
+        {
+            std::lock_guard lock{map_mutex};
+            object_it = loadable_objects.find(name);
+
+            /// Object with the same name was declared in other config file.
+            if (object_it != std::end(loadable_objects) && object_it->second.origin != config_path)
+                throw Exception(object_name + " '" + name + "' from file " + config_path
+                                + " already declared in file " + object_it->second.origin,
+                                ErrorCodes::EXTERNAL_LOADABLE_ALREADY_EXISTS);
+        }
+
+        auto object_ptr = create(name, *creation_info.config, creation_info.config_prefix);
+
+        /// If the object could not be loaded.
+        if (const auto exception_ptr = object_ptr->getCreationException())
+        {
+            std::chrono::seconds delay(update_settings.backoff_initial_sec);
+            const auto failed_dict_it = failed_loadable_objects.find(name);
+            FailedLoadableInfo info{std::move(object_ptr), std::chrono::system_clock::now() + delay, 0};
+            if (failed_dict_it != std::end(failed_loadable_objects))
+                (*failed_dict_it).second = std::move(info);
+            else
+                failed_loadable_objects.emplace(name, std::move(info));
+
+            std::rethrow_exception(exception_ptr);
+        }
+        else if (object_ptr->supportUpdates())
+        {
+            const auto & lifetime = object_ptr->getLifetime();
+            if (lifetime.min_sec != 0 && lifetime.max_sec != 0)
+            {
+                std::uniform_int_distribution<UInt64> distribution(lifetime.min_sec, lifetime.max_sec);
+
+                update_times[name] = std::chrono::system_clock::now() +
+                                     std::chrono::seconds{distribution(rnd_engine)};
+            }
+        }
+
+        std::lock_guard lock{map_mutex};
+
+        /// add new loadable object or update an existing version
+        if (object_it == std::end(loadable_objects))
+            loadable_objects.emplace(name, LoadableInfo{std::move(object_ptr), config_path, {}});
+        else
+        {
+            if (object_it->second.loadable)
+                object_it->second.loadable.reset();
+            object_it->second.loadable = std::move(object_ptr);
+
+            /// erase stored exception on success
+            object_it->second.exception = std::exception_ptr{};
+            failed_loadable_objects.erase(name);
+        }
+    }
+    catch (...)
+    {
+        if (!name.empty())
+        {
+            /// If the loadable object could not load data or even failed to initialize from the config.
+            /// - all the same we insert information into the `loadable_objects`, with the zero pointer `loadable`.
+
+            std::lock_guard lock{map_mutex};
+
+            const auto exception_ptr = std::current_exception();
+            const auto loadable_it = loadable_objects.find(name);
+            if (loadable_it == std::end(loadable_objects))
+                loadable_objects.emplace(name, LoadableInfo{nullptr, config_path, exception_ptr});
+            else
+                loadable_it->second.exception = exception_ptr;
+        }
+
+        tryLogCurrentException(log, "Cannot create " + object_name + " '"
+                                    + name + "' from config path " + config_path);
+
+        /// propagate exception
+        if (throw_on_error)
+            throw;
+    }
+}
+
+
 ExternalLoader::LoadablePtr ExternalLoader::getLoadableImpl(const std::string & name, bool throw_on_error) const
 {
-    std::lock_guard lock{map_mutex};
+    /// We try to finish the reloading of the object `name` here, before searching for it in the map `loadable_objects` later in this function.
+    /// If some other thread is already doing this reload work we don't want to wait until it finishes, because it's faster to just use
+    /// the current version of this loadable object. That's why we use try_lock() instead of lock() here.
+    std::unique_lock all_lock{all_mutex, std::defer_lock};
+    if (all_lock.try_lock())
+        finishReload(name, throw_on_error);

+    std::lock_guard lock{map_mutex};
    const auto it = loadable_objects.find(name);
    if (it == std::end(loadable_objects))
    {
--- a/dbms/src/Interpreters/ExternalLoader.h
+++ b/dbms/src/Interpreters/ExternalLoader.h
@ -89,7 +89,7 @@ public:
    using Configuration = Poco::Util::AbstractConfiguration;
    using ObjectsMap = std::unordered_map<std::string, LoadableInfo>;

-    /// Objects will be loaded immediately and then will be updated in separate thread, each 'reload_period' seconds.
+    /// Call init() after constructing the instance of any derived class.
    ExternalLoader(const Configuration & config_main,
                   const ExternalLoaderUpdateSettings & update_settings,
                   const ExternalLoaderConfigSettings & config_settings,
@ -97,6 +97,11 @@ public:
                   Logger * log, const std::string & loadable_object_name);
    virtual ~ExternalLoader();

+    /// Should be called after creating an instance of a derived class.
+    /// Loads the objects immediately and starts a separate thread to update them every 'reload_period' seconds.
+    /// This function does nothing if called again.
+    void init(bool throw_on_error);
+
    /// Forcibly reloads all loadable objects.
    void reload();

@ -108,7 +113,7 @@ public:

 protected:
    virtual std::unique_ptr<IExternalLoadable> create(const std::string & name, const Configuration & config,
-                                                      const std::string & config_prefix) = 0;
+                                                      const std::string & config_prefix) const = 0;

    class LockedObjectsMap
    {
@ -123,12 +128,8 @@ protected:
    /// Direct access to objects.
    LockedObjectsMap getObjectsMap() const;

-    /// Should be called in derived constructor (to avoid pure virtual call).
-    void init(bool throw_on_error);
-
 private:
-
-    bool is_initialized = false;
+    std::once_flag is_initialized_flag;

    /// Protects only objects map.
    /** Reading and assignment of "loadable" should be done under mutex.
@ -136,22 +137,35 @@ private:
      */
    mutable std::mutex map_mutex;

-    /// Protects all data, currently used to avoid races between updating thread and SYSTEM queries
-    mutable std::mutex all_mutex;
+    /// Protects all data, currently used to avoid races between updating thread and SYSTEM queries.
+    /// The mutex is recursive because object creation might be recursive, i.e.
+    /// creating an object might cause other objects to be created.
+    mutable std::recursive_mutex all_mutex;

    /// name -> loadable.
-    ObjectsMap loadable_objects;
+    mutable ObjectsMap loadable_objects;
+
+    struct LoadableCreationInfo
+    {
+        std::string name;
+        Poco::AutoPtr<Poco::Util::AbstractConfiguration> config;
+        std::string config_path;
+        std::string config_prefix;
+    };
+
+    /// Objects which should be reloaded soon.
+    mutable std::unordered_map<std::string, LoadableCreationInfo> objects_to_reload;

    /// Here are loadable objects, that has been never loaded successfully.
    /// They are also in 'loadable_objects', but with nullptr as 'loadable'.
-    std::unordered_map<std::string, FailedLoadableInfo> failed_loadable_objects;
+    mutable std::unordered_map<std::string, FailedLoadableInfo> failed_loadable_objects;

    /// Both for loadable_objects and failed_loadable_objects.
-    std::unordered_map<std::string, std::chrono::system_clock::time_point> update_times;
+    mutable std::unordered_map<std::string, std::chrono::system_clock::time_point> update_times;

    std::unordered_map<std::string, std::unordered_set<std::string>> loadable_objects_defined_in_config;

-    pcg64 rnd_engine{randomSeed()};
+    mutable pcg64 rnd_engine{randomSeed()};

    const Configuration & config_main;
    const ExternalLoaderUpdateSettings & update_settings;
@ -168,17 +182,22 @@ private:

    std::unordered_map<std::string, Poco::Timestamp> last_modification_times;

+    void initImpl(bool throw_on_error);
+
    /// Check objects definitions in config files and reload or/and add new ones if the definition is changed
    /// If loadable_name is not empty, load only loadable object with name loadable_name
    void reloadFromConfigFiles(bool throw_on_error, bool force_reload = false, const std::string & loadable_name = "");
-    void reloadFromConfigFile(const std::string & config_path, const bool throw_on_error,
-                                const bool force_reload, const std::string & loadable_name);
+    void reloadFromConfigFile(const std::string & config_path, const bool force_reload, const std::string & loadable_name);

    /// Check config files and update expired loadable objects
    void reloadAndUpdate(bool throw_on_error = false);

    void reloadPeriodically();

+    void finishReload(const std::string & loadable_name, bool throw_on_error) const;
+    void finishAllReloads(bool throw_on_error) const;
+    void finishReloadImpl(const LoadableCreationInfo & creation_info, bool throw_on_error) const;
+
    LoadablePtr getLoadableImpl(const std::string & name, bool throw_on_error) const;
 };

--- a/dbms/src/Interpreters/ExternalModels.cpp
+++ b/dbms/src/Interpreters/ExternalModels.cpp
@ -33,8 +33,7 @@ namespace

 ExternalModels::ExternalModels(
    std::unique_ptr<IExternalLoaderConfigRepository> config_repository,
-    Context & context,
-    bool throw_on_error)
+    Context & context)
        : ExternalLoader(context.getConfigRef(),
                         externalModelsUpdateSettings,
                         getExternalModelsConfigSettings(),
@ -43,11 +42,10 @@ ExternalModels::ExternalModels(
                         "external model"),
          context(context)
 {
-    init(throw_on_error);
 }

 std::unique_ptr<IExternalLoadable> ExternalModels::create(
-        const std::string & name, const Configuration & config, const std::string & config_prefix)
+        const std::string & name, const Configuration & config, const std::string & config_prefix) const
 {
    String type = config.getString(config_prefix + ".type");
    ExternalLoadableLifetime lifetime(config, config_prefix + ".lifetime");
--- a/dbms/src/Interpreters/ExternalModels.h
+++ b/dbms/src/Interpreters/ExternalModels.h
@ -20,8 +20,7 @@ public:
    /// Models will be loaded immediately and then will be updated in separate thread, each 'reload_period' seconds.
    ExternalModels(
        std::unique_ptr<IExternalLoaderConfigRepository> config_repository,
-        Context & context,
-        bool throw_on_error);
+        Context & context);

    /// Forcibly reloads specified model.
    void reloadModel(const std::string & name) { reload(name); }
@ -34,7 +33,7 @@ public:
 protected:

    std::unique_ptr<IExternalLoadable> create(const std::string & name, const Configuration & config,
-                                              const std::string & config_prefix) override;
+                                              const std::string & config_prefix) const override;

    using ExternalLoader::getObjectsMap;

--- a/dbms/src/Interpreters/InterpreterCreateQuery.cpp
+++ b/dbms/src/Interpreters/InterpreterCreateQuery.cpp
@ -233,6 +233,9 @@ ASTPtr InterpreterCreateQuery::formatColumns(const ColumnsDescription & columns)
            column_declaration->codec = parseQuery(codec_p, codec_desc_pos, codec_desc_end, "column codec", 0);
        }

+        if (column.ttl)
+            column_declaration->ttl = column.ttl;
+
        columns_list->children.push_back(column_declaration_ptr);
    }

@ -347,6 +350,9 @@ ColumnsDescription InterpreterCreateQuery::getColumnsDescription(const ASTExpres
        if (col_decl.codec)
            column.codec = CompressionCodecFactory::instance().get(col_decl.codec, column.type);

+        if (col_decl.ttl)
+            column.ttl = col_decl.ttl;
+
        res.add(std::move(column));
    }

--- a/dbms/src/Interpreters/InterpreterDescribeQuery.cpp
+++ b/dbms/src/Interpreters/InterpreterDescribeQuery.cpp
@ -52,6 +52,9 @@ Block InterpreterDescribeQuery::getSampleBlock()
    col.name = "codec_expression";
    block.insert(col);

+    col.name = "ttl_expression";
+    block.insert(col);
+
    return block;
 }

@ -118,6 +121,11 @@ BlockInputStreamPtr InterpreterDescribeQuery::executeImpl()
            res_columns[5]->insert(column.codec->getCodecDesc());
        else
            res_columns[5]->insertDefault();
+
+        if (column.ttl)
+            res_columns[6]->insert(queryToString(column.ttl));
+        else
+            res_columns[6]->insertDefault();
    }

    return std::make_shared<OneBlockInputStream>(sample_block.cloneWithColumns(std::move(res_columns)));
--- a/dbms/src/Interpreters/InterpreterSelectQuery.cpp
+++ b/dbms/src/Interpreters/InterpreterSelectQuery.cpp
@ -760,11 +760,11 @@ void InterpreterSelectQuery::executeImpl(Pipeline & pipeline, const BlockInputSt
                executeExpression(pipeline, expressions.before_order_and_select);
                executeDistinct(pipeline, true, expressions.selected_columns);

-                need_second_distinct_pass = query.distinct && pipeline.hasMoreThanOneStream();
+                need_second_distinct_pass = query.distinct && pipeline.hasMixedStreams();
            }
            else
            {
-                need_second_distinct_pass = query.distinct && pipeline.hasMoreThanOneStream();
+                need_second_distinct_pass = query.distinct && pipeline.hasMixedStreams();

                if (query.group_by_with_totals && !aggregate_final)
                {
@ -1533,6 +1533,7 @@ void InterpreterSelectQuery::executeUnion(Pipeline & pipeline)
        pipeline.firstStream() = std::make_shared<UnionBlockInputStream>(pipeline.streams, pipeline.stream_with_non_joined_data, max_streams);
        pipeline.stream_with_non_joined_data = nullptr;
        pipeline.streams.resize(1);
+        pipeline.union_stream = true;
    }
    else if (pipeline.stream_with_non_joined_data)
    {
--- a/dbms/src/Interpreters/InterpreterSelectQuery.h
+++ b/dbms/src/Interpreters/InterpreterSelectQuery.h
@ -100,6 +100,7 @@ private:
          * It is appended to the main streams in UnionBlockInputStream or ParallelAggregatingBlockInputStream.
          */
        BlockInputStreamPtr stream_with_non_joined_data;
+        bool union_stream = false;

        BlockInputStreamPtr & firstStream() { return streams.at(0); }

@ -117,6 +118,12 @@ private:
        {
            return streams.size() + (stream_with_non_joined_data ? 1 : 0) > 1;
        }
+
+        /// Resulting stream is a mix of other streams' data. Distinct and/or order guarantees are broken.
+        bool hasMixedStreams() const
+        {
+            return hasMoreThanOneStream() || union_stream;
+        }
    };

    void executeImpl(Pipeline & pipeline, const BlockInputStreamPtr & prepared_input, bool dry_run);
--- a/dbms/src/Interpreters/RowRefs.h
+++ b/dbms/src/Interpreters/RowRefs.h
@ -1,5 +1,6 @@
 #pragma once

+#include <Common/RadixSort.h>
 #include <Columns/IColumn.h>

 #include <optional>
@ -39,11 +40,11 @@ struct RowRefList : RowRef
 * references that can be returned by the lookup methods
 */

-template <typename T>
+template <typename _Entry, typename _Key>
 class SortedLookupVector
 {
 public:
-    using Base = std::vector<T>;
+    using Base = std::vector<_Entry>;

    // First stage, insertions into the vector
    template <typename U, typename ... TAllocatorParams>
@ -54,7 +55,7 @@ public:
    }

    // Transition into second stage, ensures that the vector is sorted
-    typename Base::const_iterator upper_bound(const T & k)
+    typename Base::const_iterator upper_bound(const _Entry & k)
    {
        sort();
        return std::upper_bound(array.cbegin(), array.cend(), k);
@ -81,7 +82,15 @@ private:
            std::lock_guard<std::mutex> l(lock);
            if (!sorted.load(std::memory_order_relaxed))
            {
-                std::sort(array.begin(), array.end());
+                /// TODO: It has been tested only for UInt32 so far. UInt64 and Float32/64 still need to be checked.
+                if constexpr (std::is_same_v<_Key, UInt32>)
+                {
+                    if (!array.empty())
+                        radixSort<_Entry, _Key>(&array[0], array.size());
+                }
+                else
+                    std::sort(array.begin(), array.end());
+
                sorted.store(true, std::memory_order_release);
            }
        }
@ -94,7 +103,7 @@ public:
    template <typename T>
    struct Entry
    {
-        using LookupType = SortedLookupVector<Entry<T>>;
+        using LookupType = SortedLookupVector<Entry<T>, T>;
        using LookupPtr = std::unique_ptr<LookupType>;
        T asof_value;
        RowRef row_ref;
--- a/dbms/src/Parsers/ASTAlterQuery.cpp
+++ b/dbms/src/Parsers/ASTAlterQuery.cpp
@ -40,6 +40,11 @@ ASTPtr ASTAlterCommand::clone() const
        res->predicate = predicate->clone();
        res->children.push_back(res->predicate);
    }
+    if (ttl)
+    {
+        res->ttl = ttl->clone();
+        res->children.push_back(res->ttl);
+    }

    return res;
 }
@ -174,6 +179,11 @@ void ASTAlterCommand::formatImpl(
        settings.ostr << " " << (settings.hilite ? hilite_none : "");
        comment->formatImpl(settings, state, frame);
    }
+    else if (type == ASTAlterCommand::MODIFY_TTL)
+    {
+        settings.ostr << (settings.hilite ? hilite_keyword : "") << indent_str << "MODIFY TTL " << (settings.hilite ? hilite_none : "");
+        ttl->formatImpl(settings, state, frame);
+    }
    else
        throw Exception("Unexpected type of ALTER", ErrorCodes::UNEXPECTED_AST_STRUCTURE);
 }
--- a/dbms/src/Parsers/ASTAlterQuery.h
+++ b/dbms/src/Parsers/ASTAlterQuery.h
@ -27,6 +27,7 @@ public:
        MODIFY_COLUMN,
        COMMENT_COLUMN,
        MODIFY_ORDER_BY,
+        MODIFY_TTL,

        ADD_INDEX,
        DROP_INDEX,
@ -84,6 +85,9 @@ public:
    /// A column comment
    ASTPtr comment;

+    /// For MODIFY TTL query
+    ASTPtr ttl;
+
    bool detach = false;        /// true for DETACH PARTITION

    bool part = false;          /// true for ATTACH PART
--- a/dbms/src/Parsers/ASTColumnDeclaration.cpp
+++ b/dbms/src/Parsers/ASTColumnDeclaration.cpp
@ -33,6 +33,12 @@ ASTPtr ASTColumnDeclaration::clone() const
        res->children.push_back(res->comment);
    }

+    if (ttl)
+    {
+        res->ttl = ttl->clone();
+        res->children.push_back(res->ttl);
+    }
+
    return res;
 }

@ -69,6 +75,12 @@ void ASTColumnDeclaration::formatImpl(const FormatSettings & settings, FormatSta
        settings.ostr << ' ';
        codec->formatImpl(settings, state, frame);
    }
+
+    if (ttl)
+    {
+        settings.ostr << ' ' << (settings.hilite ? hilite_keyword : "") << "TTL" << (settings.hilite ? hilite_none : "") << ' ';
+        ttl->formatImpl(settings, state, frame);
+    }
 }

 }
--- a/dbms/src/Parsers/ASTColumnDeclaration.h
+++ b/dbms/src/Parsers/ASTColumnDeclaration.h
@ -17,6 +17,7 @@ public:
    ASTPtr default_expression;
    ASTPtr codec;
    ASTPtr comment;
+    ASTPtr ttl;

    String getID(char delim) const override { return "ColumnDeclaration" + (delim + name); }

--- a/dbms/src/Parsers/ASTCreateQuery.cpp
+++ b/dbms/src/Parsers/ASTCreateQuery.cpp
@ -23,6 +23,8 @@ ASTPtr ASTStorage::clone() const
        res->set(res->order_by, order_by->clone());
    if (sample_by)
        res->set(res->sample_by, sample_by->clone());
+    if (ttl_table)
+        res->set(res->ttl_table, ttl_table->clone());

    if (settings)
        res->set(res->settings, settings->clone());
@ -57,6 +59,11 @@ void ASTStorage::formatImpl(const FormatSettings & s, FormatState & state, Forma
        s.ostr << (s.hilite ? hilite_keyword : "") << s.nl_or_ws << "SAMPLE BY " << (s.hilite ? hilite_none : "");
        sample_by->formatImpl(s, state, frame);
    }
+    if (ttl_table)
+    {
+        s.ostr << (s.hilite ? hilite_keyword : "") << s.nl_or_ws << "TTL " << (s.hilite ? hilite_none : "");
+        ttl_table->formatImpl(s, state, frame);
+    }
    if (settings)
    {
        s.ostr << (s.hilite ? hilite_keyword : "") << s.nl_or_ws << "SETTINGS " << (s.hilite ? hilite_none : "");
--- a/dbms/src/Parsers/ASTCreateQuery.h
+++ b/dbms/src/Parsers/ASTCreateQuery.h
@ -18,6 +18,7 @@ public:
    IAST * primary_key = nullptr;
    IAST * order_by = nullptr;
    IAST * sample_by = nullptr;
+    IAST * ttl_table = nullptr;
    ASTSetQuery * settings = nullptr;

    String getID(char) const override { return "Storage definition"; }
--- a/dbms/src/Parsers/ParserAlterQuery.cpp
+++ b/dbms/src/Parsers/ParserAlterQuery.cpp
@ -27,6 +27,7 @@ bool ParserAlterCommand::parseImpl(Pos & pos, ASTPtr & node, Expected & expected
    ParserKeyword s_modify_column("MODIFY COLUMN");
    ParserKeyword s_comment_column("COMMENT COLUMN");
    ParserKeyword s_modify_order_by("MODIFY ORDER BY");
+    ParserKeyword s_modify_ttl("MODIFY TTL");

    ParserKeyword s_add_index("ADD INDEX");
    ParserKeyword s_drop_index("DROP INDEX");
@ -282,6 +283,12 @@ bool ParserAlterCommand::parseImpl(Pos & pos, ASTPtr & node, Expected & expected

        command->type = ASTAlterCommand::COMMENT_COLUMN;
    }
+    else if (s_modify_ttl.ignore(pos, expected))
+    {
+        if (!parser_exp_elem.parse(pos, command->ttl, expected))
+            return false;
+        command->type = ASTAlterCommand::MODIFY_TTL;
+    }
    else
        return false;

@ -299,6 +306,8 @@ bool ParserAlterCommand::parseImpl(Pos & pos, ASTPtr & node, Expected & expected
        command->children.push_back(command->update_assignments);
    if (command->comment)
        command->children.push_back(command->comment);
+    if (command->ttl)
+        command->children.push_back(command->ttl);

    return true;
 }
--- a/dbms/src/Parsers/ParserCreateQuery.cpp
+++ b/dbms/src/Parsers/ParserCreateQuery.cpp
@ -210,6 +210,7 @@ bool ParserStorage::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
    ParserKeyword s_primary_key("PRIMARY KEY");
    ParserKeyword s_order_by("ORDER BY");
    ParserKeyword s_sample_by("SAMPLE BY");
+    ParserKeyword s_ttl("TTL");
    ParserKeyword s_settings("SETTINGS");

    ParserIdentifierWithOptionalParameters ident_with_optional_params_p;
@ -221,6 +222,7 @@ bool ParserStorage::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
    ASTPtr primary_key;
    ASTPtr order_by;
    ASTPtr sample_by;
+    ASTPtr ttl_table;
    ASTPtr settings;

    if (!s_engine.ignore(pos, expected))
@ -265,6 +267,14 @@ bool ParserStorage::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
                return false;
        }

+        if (!ttl_table && s_ttl.ignore(pos, expected))
+        {
+            if (expression_p.parse(pos, ttl_table, expected))
+                continue;
+            else
+                return false;
+        }
+
        if (s_settings.ignore(pos, expected))
        {
            if (!settings_p.parse(pos, settings, expected))
@ -280,6 +290,7 @@ bool ParserStorage::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
    storage->set(storage->primary_key, primary_key);
    storage->set(storage->order_by, order_by);
    storage->set(storage->sample_by, sample_by);
+    storage->set(storage->ttl_table, ttl_table);

    storage->set(storage->settings, settings);

--- a/dbms/src/Parsers/ParserCreateQuery.h
+++ b/dbms/src/Parsers/ParserCreateQuery.h
@ -123,9 +123,11 @@ bool IParserColumnDeclaration<NameParser>::parseImpl(Pos & pos, ASTPtr & node, E
    ParserKeyword s_alias{"ALIAS"};
    ParserKeyword s_comment{"COMMENT"};
    ParserKeyword s_codec{"CODEC"};
+    ParserKeyword s_ttl{"TTL"};
    ParserTernaryOperatorExpression expr_parser;
    ParserStringLiteral string_literal_parser;
    ParserCodec codec_parser;
+    ParserExpression expression_parser;

    /// mandatory column name
    ASTPtr name;
@ -140,6 +142,7 @@ bool IParserColumnDeclaration<NameParser>::parseImpl(Pos & pos, ASTPtr & node, E
    ASTPtr default_expression;
    ASTPtr comment_expression;
    ASTPtr codec_expression;
+    ASTPtr ttl_expression;

    if (!s_default.check_without_moving(pos, expected) &&
        !s_materialized.check_without_moving(pos, expected) &&
@ -178,6 +181,12 @@ bool IParserColumnDeclaration<NameParser>::parseImpl(Pos & pos, ASTPtr & node, E
            return false;
    }

+    if (s_ttl.ignore(pos, expected))
+    {
+        if (!expression_parser.parse(pos, ttl_expression, expected))
+            return false;
+    }
+
    const auto column_declaration = std::make_shared<ASTColumnDeclaration>();
    node = column_declaration;
    getIdentifierName(name, column_declaration->name);
@ -207,6 +216,12 @@ bool IParserColumnDeclaration<NameParser>::parseImpl(Pos & pos, ASTPtr & node, E
        column_declaration->children.push_back(std::move(codec_expression));
    }

+    if (ttl_expression)
+    {
+        column_declaration->ttl = ttl_expression;
+        column_declaration->children.push_back(std::move(ttl_expression));
+    }
+
    return true;
 }

--- a/dbms/src/Storages/AlterCommands.cpp
+++ b/dbms/src/Storages/AlterCommands.cpp
@ -17,6 +17,8 @@
 #include <Common/typeid_cast.h>
 #include <Compression/CompressionFactory.h>

+#include <Parsers/queryToString.h>
+

 namespace DB
 {
@ -64,6 +66,9 @@ std::optional<AlterCommand> AlterCommand::parse(const ASTAlterCommand * command_
        if (command_ast->column)
            command.after_column = *getIdentifierName(command_ast->column);

+        if (ast_col_decl.ttl)
+            command.ttl = ast_col_decl.ttl;
+
        command.if_not_exists = command_ast->if_not_exists;

        return command;
@ -104,6 +109,9 @@ std::optional<AlterCommand> AlterCommand::parse(const ASTAlterCommand * command_
            command.comment = ast_comment.value.get<String>();
        }

+        if (ast_col_decl.ttl)
+            command.ttl = ast_col_decl.ttl;
+
        if (ast_col_decl.codec)
            command.codec = compression_codec_factory.get(ast_col_decl.codec, command.data_type);

@ -157,13 +165,20 @@ std::optional<AlterCommand> AlterCommand::parse(const ASTAlterCommand * command_

        return command;
    }
+    else if (command_ast->type == ASTAlterCommand::MODIFY_TTL)
+    {
+        AlterCommand command;
+        command.type = AlterCommand::MODIFY_TTL;
+        command.ttl = command_ast->ttl;
+        return command;
+    }
    else
        return {};
 }


 void AlterCommand::apply(ColumnsDescription & columns_description, IndicesDescription & indices_description,
-        ASTPtr & order_by_ast, ASTPtr & primary_key_ast) const
+        ASTPtr & order_by_ast, ASTPtr & primary_key_ast, ASTPtr & ttl_table_ast) const
 {
    if (type == ADD_COLUMN)
    {
@ -175,6 +190,7 @@ void AlterCommand::apply(ColumnsDescription & columns_description, IndicesDescri
        }
        column.comment = comment;
        column.codec = codec;
+        column.ttl = ttl;

        columns_description.add(column, after_column);

@ -204,6 +220,9 @@ void AlterCommand::apply(ColumnsDescription & columns_description, IndicesDescri
            return;
        }

+        if (ttl)
+            column.ttl = ttl;
+
        column.type = data_type;

        column.default_desc.kind = default_kind;
@ -278,6 +297,10 @@ void AlterCommand::apply(ColumnsDescription & columns_description, IndicesDescri

        indices_description.indices.erase(erase_it);
    }
+    else if (type == MODIFY_TTL)
+    {
+        ttl_table_ast = ttl;
+    }
    else
        throw Exception("Wrong parameter type in ALTER query", ErrorCodes::LOGICAL_ERROR);
 }
@ -293,20 +316,22 @@ bool AlterCommand::is_mutable() const
 }

 void AlterCommands::apply(ColumnsDescription & columns_description, IndicesDescription & indices_description,
-        ASTPtr & order_by_ast, ASTPtr & primary_key_ast) const
+        ASTPtr & order_by_ast, ASTPtr & primary_key_ast, ASTPtr & ttl_table_ast) const
 {
    auto new_columns_description = columns_description;
    auto new_indices_description = indices_description;
    auto new_order_by_ast = order_by_ast;
    auto new_primary_key_ast = primary_key_ast;
+    auto new_ttl_table_ast = ttl_table_ast;

    for (const AlterCommand & command : *this)
        if (!command.ignore)
-            command.apply(new_columns_description, new_indices_description, new_order_by_ast, new_primary_key_ast);
+            command.apply(new_columns_description, new_indices_description, new_order_by_ast, new_primary_key_ast, new_ttl_table_ast);
    columns_description = std::move(new_columns_description);
    indices_description = std::move(new_indices_description);
    order_by_ast = std::move(new_order_by_ast);
    primary_key_ast = std::move(new_primary_key_ast);
+    ttl_table_ast = std::move(new_ttl_table_ast);
 }

 void AlterCommands::validate(const IStorage & table, const Context & context)
@ -493,7 +518,8 @@ void AlterCommands::apply(ColumnsDescription & columns_description) const
    IndicesDescription indices_description;
    ASTPtr out_order_by;
    ASTPtr out_primary_key;
-    apply(out_columns_description, indices_description, out_order_by, out_primary_key);
+    ASTPtr out_ttl_table;
+    apply(out_columns_description, indices_description, out_order_by, out_primary_key, out_ttl_table);

    if (out_order_by)
        throw Exception("Storage doesn't support modifying ORDER BY expression", ErrorCodes::NOT_IMPLEMENTED);
@ -501,6 +527,8 @@ void AlterCommands::apply(ColumnsDescription & columns_description) const
        throw Exception("Storage doesn't support modifying PRIMARY KEY expression", ErrorCodes::NOT_IMPLEMENTED);
    if (!indices_description.indices.empty())
        throw Exception("Storage doesn't support modifying indices", ErrorCodes::NOT_IMPLEMENTED);
+    if (out_ttl_table)
+        throw Exception("Storage doesn't support modifying TTL expression", ErrorCodes::NOT_IMPLEMENTED);

    columns_description = std::move(out_columns_description);
 }
--- a/dbms/src/Storages/AlterCommands.h
+++ b/dbms/src/Storages/AlterCommands.h
@ -24,6 +24,7 @@ struct AlterCommand
        MODIFY_ORDER_BY,
        ADD_INDEX,
        DROP_INDEX,
+        MODIFY_TTL,
        UKNOWN_TYPE,
    };

@ -60,6 +61,9 @@ struct AlterCommand
    /// For ADD/DROP INDEX
    String index_name;

+    /// For MODIFY TTL
+    ASTPtr ttl;
+
    /// indicates that this command should not be applied, for example in case of if_exists=true and column doesn't exist.
    bool ignore = false;

@ -79,7 +83,7 @@ struct AlterCommand
    static std::optional<AlterCommand> parse(const ASTAlterCommand * command);

    void apply(ColumnsDescription & columns_description, IndicesDescription & indices_description,
-            ASTPtr & order_by_ast, ASTPtr & primary_key_ast) const;
+            ASTPtr & order_by_ast, ASTPtr & primary_key_ast, ASTPtr & ttl_table_ast) const;
    /// Checks that not only metadata touched by that command
    bool is_mutable() const;
 };
@ -91,7 +95,7 @@ class AlterCommands : public std::vector<AlterCommand>
 {
 public:
    void apply(ColumnsDescription & columns_description, IndicesDescription & indices_description, ASTPtr & order_by_ast,
-            ASTPtr & primary_key_ast) const;
+            ASTPtr & primary_key_ast, ASTPtr & ttl_table_ast) const;

    /// For storages that don't support MODIFY_ORDER_BY.
    void apply(ColumnsDescription & columns_description) const;
--- a/dbms/src/Storages/ColumnsDescription.cpp
+++ b/dbms/src/Storages/ColumnsDescription.cpp
@ -37,12 +37,14 @@ namespace ErrorCodes
 bool ColumnDescription::operator==(const ColumnDescription & other) const
 {
    auto codec_str = [](const CompressionCodecPtr & codec_ptr) { return codec_ptr ? codec_ptr->getCodecDesc() : String(); };
+    auto ttl_str = [](const ASTPtr & ttl_ast) { return ttl_ast ? queryToString(ttl_ast) : String{}; };

    return name == other.name
        && type->equals(*other.type)
        && default_desc == other.default_desc
        && comment == other.comment
-        && codec_str(codec) == codec_str(other.codec);
+        && codec_str(codec) == codec_str(other.codec)
+        && ttl_str(ttl) == ttl_str(other.ttl);
 }

 void ColumnDescription::writeText(WriteBuffer & buf) const
@ -74,6 +76,13 @@ void ColumnDescription::writeText(WriteBuffer & buf) const
        DB::writeText(")", buf);
    }

+    if (ttl)
+    {
+        writeChar('\t', buf);
+        DB::writeText("TTL ", buf);
+        DB::writeText(queryToString(ttl), buf);
+    }
+
    writeChar('\n', buf);
 }

@ -99,6 +108,9 @@ void ColumnDescription::readText(ReadBuffer & buf)

        if (col_ast->codec)
            codec = CompressionCodecFactory::instance().get(col_ast->codec, type);
+
+        if (col_ast->ttl)
+            ttl = col_ast->ttl;
    }
    else
        throw Exception("Cannot parse column description", ErrorCodes::CANNOT_PARSE_TEXT);
@ -388,6 +400,18 @@ CompressionCodecPtr ColumnsDescription::getCodecOrDefault(const String & column_
    return getCodecOrDefault(column_name, CompressionCodecFactory::instance().getDefaultCodec());
 }

+ColumnsDescription::ColumnTTLs ColumnsDescription::getColumnTTLs() const
+{
+    ColumnTTLs ret;
+    for (const auto & column : columns)
+    {
+        if (column.ttl)
+            ret.emplace(column.name, column.ttl);
+    }
+
+    return ret;
+}
+

 String ColumnsDescription::toString() const
 {
--- a/dbms/src/Storages/ColumnsDescription.h
+++ b/dbms/src/Storages/ColumnsDescription.h
@ -18,6 +18,7 @@ struct ColumnDescription
    ColumnDefault default_desc;
    String comment;
    CompressionCodecPtr codec;
+    ASTPtr ttl;

    ColumnDescription() = default;
    ColumnDescription(String name_, DataTypePtr type_) : name(std::move(name_)), type(std::move(type_)) {}
@ -58,6 +59,9 @@ public:
    /// ordinary + materialized + aliases.
    NamesAndTypesList getAll() const;

+    using ColumnTTLs = std::unordered_map<String, ASTPtr>;
+    ColumnTTLs getColumnTTLs() const;
+
    bool has(const String & column_name) const;
    bool hasNested(const String & column_name) const;
    ColumnDescription & get(const String & column_name);
--- a/dbms/src/Storages/MergeTree/MergeList.cpp
+++ b/dbms/src/Storages/MergeTree/MergeList.cpp
@ -28,7 +28,8 @@ MergeListElement::MergeListElement(const std::string & database, const std::stri
        std::shared_lock<std::shared_mutex> part_lock(source_part->columns_lock);

        total_size_bytes_compressed += source_part->bytes_on_disk;
-        total_size_marks += source_part->marks_count;
+        total_size_marks += source_part->getMarksCount();
+        total_rows_count += source_part->index_granularity.getTotalRows();
    }

    if (!future_part.parts.empty())
@ -60,6 +61,7 @@ MergeInfo MergeListElement::getInfo() const
    res.num_parts = num_parts;
    res.total_size_bytes_compressed = total_size_bytes_compressed;
    res.total_size_marks = total_size_marks;
+    res.total_rows_count = total_rows_count;
    res.bytes_read_uncompressed = bytes_read_uncompressed.load(std::memory_order_relaxed);
    res.bytes_written_uncompressed = bytes_written_uncompressed.load(std::memory_order_relaxed);
    res.rows_read = rows_read.load(std::memory_order_relaxed);
--- a/dbms/src/Storages/MergeTree/MergeList.h
+++ b/dbms/src/Storages/MergeTree/MergeList.h
@ -36,6 +36,7 @@ struct MergeInfo
    UInt64 num_parts;
    UInt64 total_size_bytes_compressed;
    UInt64 total_size_marks;
+    UInt64 total_rows_count;
    UInt64 bytes_read_uncompressed;
    UInt64 bytes_written_uncompressed;
    UInt64 rows_read;
@ -67,6 +68,7 @@ struct MergeListElement : boost::noncopyable

    UInt64 total_size_bytes_compressed{};
    UInt64 total_size_marks{};
+    UInt64 total_rows_count{};
    std::atomic<UInt64> bytes_read_uncompressed{};
    std::atomic<UInt64> bytes_written_uncompressed{};

--- a/dbms/src/Storages/MergeTree/MergeSelector.h
+++ b/dbms/src/Storages/MergeTree/MergeSelector.h
@ -39,6 +39,9 @@ public:

        /// Opaque pointer to avoid dependencies (it is not possible to do forward declaration of typedef).
        const void * data;
+
+        /// Earliest time at which some data in this part must be deleted.
+        time_t min_ttl;
    };

    /// Parts are belong to partitions. Only parts within same partition could be merged.
--- a/dbms/src/Storages/MergeTree/MergeTreeBaseSelectBlockInputStream.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeBaseSelectBlockInputStream.cpp
@ -5,6 +5,7 @@
 #include <Columns/FilterDescription.h>
 #include <Columns/ColumnArray.h>
 #include <Common/typeid_cast.h>
+#include <Common/StackTrace.h>
 #include <ext/range.h>
 #include <DataTypes/DataTypeNothing.h>

@ -40,8 +41,7 @@ MergeTreeBaseSelectBlockInputStream::MergeTreeBaseSelectBlockInputStream(
    max_read_buffer_size(max_read_buffer_size),
    use_uncompressed_cache(use_uncompressed_cache),
    save_marks_in_cache(save_marks_in_cache),
-    virt_column_names(virt_column_names),
-    max_block_size_marks(max_block_size_rows / storage.index_granularity)
+    virt_column_names(virt_column_names)
 {
 }

@ -76,7 +76,7 @@ Block MergeTreeBaseSelectBlockInputStream::readFromPart()
    const auto current_max_block_size_rows = max_block_size_rows;
    const auto current_preferred_block_size_bytes = preferred_block_size_bytes;
    const auto current_preferred_max_column_in_block_size_bytes = preferred_max_column_in_block_size_bytes;
-    const auto index_granularity = storage.index_granularity;
+    const auto & index_granularity = task->data_part->index_granularity;
    const double min_filtration_ratio = 0.00001;

    auto estimateNumRows = [current_preferred_block_size_bytes, current_max_block_size_rows,
@ -87,11 +87,12 @@ Block MergeTreeBaseSelectBlockInputStream::readFromPart()
            return current_max_block_size_rows;

        /// Calculates number of rows will be read using preferred_block_size_bytes.
-        /// Can't be less than index_granularity.
+        /// Can't be less than the number of rows in the current granule.
        UInt64 rows_to_read = current_task.size_predictor->estimateNumRows(current_preferred_block_size_bytes);
        if (!rows_to_read)
            return rows_to_read;
-        rows_to_read = std::max<UInt64>(index_granularity, rows_to_read);
+        UInt64 total_row_in_current_granule = current_reader.numRowsInCurrentGranule();
+        rows_to_read = std::max<UInt64>(total_row_in_current_granule, rows_to_read);

        if (current_preferred_max_column_in_block_size_bytes)
        {
@ -102,7 +103,7 @@ Block MergeTreeBaseSelectBlockInputStream::readFromPart()
            auto rows_to_read_for_max_size_column_with_filtration
                = static_cast<UInt64>(rows_to_read_for_max_size_column / filtration_ratio);

-            /// If preferred_max_column_in_block_size_bytes is used, number of rows to read can be less than index_granularity.
+            /// If preferred_max_column_in_block_size_bytes is used, the number of rows to read can be less than the number of rows in the current granule.
            rows_to_read = std::min(rows_to_read, rows_to_read_for_max_size_column_with_filtration);
        }

@ -110,8 +111,7 @@ Block MergeTreeBaseSelectBlockInputStream::readFromPart()
        if (unread_rows_in_current_granule >= rows_to_read)
            return rows_to_read;

-        UInt64 granule_to_read = (rows_to_read + current_reader.numReadRowsInCurrentGranule() + index_granularity / 2) / index_granularity;
-        return index_granularity * granule_to_read - current_reader.numReadRowsInCurrentGranule();
+        return index_granularity.countMarksForRows(current_reader.currentMark(), rows_to_read, current_reader.numReadRowsInCurrentGranule());
    };

    if (!task->range_reader.isInitialized())
@ -121,28 +121,33 @@ Block MergeTreeBaseSelectBlockInputStream::readFromPart()
            if (reader->getColumns().empty())
            {
                task->range_reader = MergeTreeRangeReader(
-                    pre_reader.get(), index_granularity, nullptr,
+                    pre_reader.get(), nullptr,
                    prewhere_info->alias_actions, prewhere_info->prewhere_actions,
                    &prewhere_info->prewhere_column_name, &task->ordered_names,
                    task->should_reorder, task->remove_prewhere_column, true);
            }
            else
            {
-                task->pre_range_reader = MergeTreeRangeReader(
-                    pre_reader.get(), index_granularity, nullptr,
-                    prewhere_info->alias_actions, prewhere_info->prewhere_actions,
-                    &prewhere_info->prewhere_column_name, &task->ordered_names,
-                    task->should_reorder, task->remove_prewhere_column, false);
+                MergeTreeRangeReader * pre_reader_ptr = nullptr;
+                if (pre_reader != nullptr)
+                {
+                    task->pre_range_reader = MergeTreeRangeReader(
+                        pre_reader.get(), nullptr,
+                        prewhere_info->alias_actions, prewhere_info->prewhere_actions,
+                        &prewhere_info->prewhere_column_name, &task->ordered_names,
+                        task->should_reorder, task->remove_prewhere_column, false);
+                    pre_reader_ptr = &task->pre_range_reader;
+                }

                task->range_reader = MergeTreeRangeReader(
-                    reader.get(), index_granularity, &task->pre_range_reader, nullptr, nullptr,
+                    reader.get(), pre_reader_ptr, nullptr, nullptr,
                    nullptr, &task->ordered_names, true, false, true);
            }
        }
        else
        {
            task->range_reader = MergeTreeRangeReader(
-                reader.get(), index_granularity, nullptr, nullptr, nullptr,
+                reader.get(), nullptr, nullptr, nullptr,
                nullptr, &task->ordered_names, task->should_reorder, false, true);
        }
    }
--- a/dbms/src/Storages/MergeTree/MergeTreeBaseSelectBlockInputStream.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeBaseSelectBlockInputStream.h
@ -71,8 +71,6 @@ protected:
    using MergeTreeReaderPtr = std::unique_ptr<MergeTreeReader>;
    MergeTreeReaderPtr reader;
    MergeTreeReaderPtr pre_reader;
-
-    UInt64 max_block_size_marks;
 };

 }
--- a/dbms/src/Storages/MergeTree/MergeTreeData.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeData.cpp
@ -15,6 +15,7 @@
 #include <Parsers/parseQuery.h>
 #include <Parsers/queryToString.h>
 #include <DataStreams/ExpressionBlockInputStream.h>
+#include <DataStreams/MarkInCompressedFile.h>
 #include <Formats/ValuesRowInputStream.h>
 #include <DataStreams/copyData.h>
 #include <IO/WriteBufferFromFile.h>
@ -84,6 +85,7 @@ namespace ErrorCodes
    extern const int CANNOT_ALLOCATE_MEMORY;
    extern const int CANNOT_MUNMAP;
    extern const int CANNOT_MREMAP;
+    extern const int BAD_TTL_EXPRESSION;
 }


@ -97,17 +99,19 @@ MergeTreeData::MergeTreeData(
    const ASTPtr & order_by_ast_,
    const ASTPtr & primary_key_ast_,
    const ASTPtr & sample_by_ast_,
+    const ASTPtr & ttl_table_ast_,
    const MergingParams & merging_params_,
    const MergeTreeSettings & settings_,
    bool require_part_metadata_,
    bool attach,
    BrokenPartCallback broken_part_callback_)
    : global_context(context_),
+    index_granularity_info(settings_),
    merging_params(merging_params_),
-    index_granularity(settings_.index_granularity),
    settings(settings_),
    partition_by_ast(partition_by_ast_),
    sample_by_ast(sample_by_ast_),
+    ttl_table_ast(ttl_table_ast_),
    require_part_metadata(require_part_metadata_),
    database_name(database_), table_name(table_),
    full_path(full_path_),
@ -133,7 +137,6 @@ MergeTreeData::MergeTreeData(
        columns_required_for_sampling = ExpressionAnalyzer(sample_by_ast, syntax, global_context)
            .getRequiredSourceColumns();
    }
-
    MergeTreeDataFormatVersion min_format_version(0);
    if (!date_column_name.empty())
    {
@ -159,6 +162,8 @@ MergeTreeData::MergeTreeData(
        min_format_version = MERGE_TREE_DATA_MIN_FORMAT_VERSION_WITH_CUSTOM_PARTITIONING;
    }

+    setTTLExpressions(columns_.getColumnTTLs(), ttl_table_ast_);
+
    auto path_exists = Poco::File(full_path).exists();
    /// Creating directories, if not exist.
    Poco::File(full_path).createDirectories();
@ -187,9 +192,33 @@ MergeTreeData::MergeTreeData(
        format_version = 0;

    if (format_version < min_format_version)
-        throw Exception(
-            "MergeTree data format version on disk doesn't support custom partitioning",
-            ErrorCodes::METADATA_MISMATCH);
+    {
+        if (min_format_version == MERGE_TREE_DATA_MIN_FORMAT_VERSION_WITH_CUSTOM_PARTITIONING.toUnderType())
+            throw Exception(
+                "MergeTree data format version on disk doesn't support custom partitioning",
+                ErrorCodes::METADATA_MISMATCH);
+    }
+
+}
+
+
+MergeTreeData::IndexGranularityInfo::IndexGranularityInfo(const MergeTreeSettings & settings)
+    : fixed_index_granularity(settings.index_granularity)
+    , index_granularity_bytes(settings.index_granularity_bytes)
+{
+    /// Granularity is fixed
+    if (index_granularity_bytes == 0)
+    {
+        is_adaptive = false;
+        mark_size_in_bytes = sizeof(UInt64) * 2;
+        marks_file_extension = ".mrk";
+    }
+    else
+    {
+        is_adaptive = true;
+        mark_size_in_bytes = sizeof(UInt64) * 3;
+        marks_file_extension = ".mrk2";
+    }
 }


@ -490,6 +519,98 @@ void MergeTreeData::initPartitionKey()
    }
 }

+namespace
+{
+
+void checkTTLExpression(const ExpressionActionsPtr & ttl_expression, const String & result_column_name)
+{
+    for (const auto & action : ttl_expression->getActions())
+    {
+        if (action.type == ExpressionAction::APPLY_FUNCTION)
+        {
+            IFunctionBase & func = *action.function_base;
+            if (!func.isDeterministic())
+                throw Exception("TTL expression cannot contain non-deterministic functions, "
+                    "but contains function " + func.getName(), ErrorCodes::BAD_ARGUMENTS);
+        }
+    }
+
+    bool has_date_column = false;
+    for (const auto & elem : ttl_expression->getRequiredColumnsWithTypes())
+    {
+        if (typeid_cast<const DataTypeDateTime *>(elem.type.get()) || typeid_cast<const DataTypeDate *>(elem.type.get()))
+        {
+            has_date_column = true;
+            break;
+        }
+    }
+
+    if (!has_date_column)
+        throw Exception("TTL expression should use at least one Date or DateTime column", ErrorCodes::BAD_TTL_EXPRESSION);
+
+    const auto & result_column = ttl_expression->getSampleBlock().getByName(result_column_name);
+
+    if (!typeid_cast<const DataTypeDateTime *>(result_column.type.get())
+        && !typeid_cast<const DataTypeDate *>(result_column.type.get()))
+    {
+        throw Exception("TTL expression result column should have DateTime or Date type, but has "
+            + result_column.type->getName(), ErrorCodes::BAD_TTL_EXPRESSION);
+    }
+}
+
+}
+
+
+void MergeTreeData::setTTLExpressions(const ColumnsDescription::ColumnTTLs & new_column_ttls,
+        const ASTPtr & new_ttl_table_ast, bool only_check)
+{
+    auto create_ttl_entry = [this](ASTPtr ttl_ast) -> TTLEntry
+    {
+        auto syntax_result = SyntaxAnalyzer(global_context).analyze(ttl_ast, getColumns().getAllPhysical());
+        auto expr = ExpressionAnalyzer(ttl_ast, syntax_result, global_context).getActions(false);
+
+        String result_column = ttl_ast->getColumnName();
+        checkTTLExpression(expr, result_column);
+
+        return {expr, result_column};
+    };
+
+    if (!new_column_ttls.empty())
+    {
+        NameSet columns_ttl_forbidden;
+
+        if (partition_key_expr)
+            for (const auto & col : partition_key_expr->getRequiredColumns())
+                columns_ttl_forbidden.insert(col);
+
+        if (sorting_key_expr)
+            for (const auto & col : sorting_key_expr->getRequiredColumns())
+                columns_ttl_forbidden.insert(col);
+
+        for (const auto & [name, ast] : new_column_ttls)
+        {
+            if (columns_ttl_forbidden.count(name))
+                throw Exception("Trying to set TTL for key column " + name, ErrorCodes::ILLEGAL_COLUMN);
+            else
+            {
+                auto new_ttl_entry = create_ttl_entry(ast);
+                if (!only_check)
+                    ttl_entries_by_name.emplace(name, new_ttl_entry);
+            }
+        }
+    }
+
+    if (new_ttl_table_ast)
+    {
+        auto new_ttl_table_entry = create_ttl_entry(new_ttl_table_ast);
+        if (!only_check)
+        {
+            ttl_table_ast = new_ttl_table_ast;
+            ttl_table_entry = new_ttl_table_entry;
+        }
+    }
+}
+

 void MergeTreeData::MergingParams::check(const NamesAndTypesList & columns) const
 {
@ -1059,7 +1180,8 @@ void MergeTreeData::checkAlter(const AlterCommands & commands, const Context & c
    auto new_indices = getIndicesDescription();
    ASTPtr new_order_by_ast = order_by_ast;
    ASTPtr new_primary_key_ast = primary_key_ast;
-    commands.apply(new_columns, new_indices, new_order_by_ast, new_primary_key_ast);
+    ASTPtr new_ttl_table_ast = ttl_table_ast;
+    commands.apply(new_columns, new_indices, new_order_by_ast, new_primary_key_ast, new_ttl_table_ast);

    if (getIndicesDescription().empty() && !new_indices.empty() &&
            !context.getSettingsRef().allow_experimental_data_skipping_indices)
@ -1145,11 +1267,12 @@ void MergeTreeData::checkAlter(const AlterCommands & commands, const Context & c
    setPrimaryKeyIndicesAndColumns(new_order_by_ast, new_primary_key_ast,
            new_columns, new_indices, /* only_check = */ true);

+    setTTLExpressions(new_columns.getColumnTTLs(), new_ttl_table_ast, /* only_check = */ true);
+
    /// Check that type conversions are possible.
    ExpressionActionsPtr unused_expression;
    NameToNameMap unused_map;
    bool unused_bool;
-
    createConvertExpression(nullptr, getColumns().getAllPhysical(), new_columns.getAllPhysical(),
            getIndicesDescription().indices, new_indices.indices, unused_expression, unused_map, unused_bool);
 }
@ -1181,7 +1304,7 @@ void MergeTreeData::createConvertExpression(const DataPartPtr & part, const Name
        if (!new_indices_set.count(index.name))
        {
            out_rename_map["skp_idx_" + index.name + ".idx"] = "";
-            out_rename_map["skp_idx_" + index.name + ".mrk"] = "";
+            out_rename_map["skp_idx_" + index.name + index_granularity_info.marks_file_extension] = "";
        }
    }

@ -1210,7 +1333,7 @@ void MergeTreeData::createConvertExpression(const DataPartPtr & part, const Name
                    if (--stream_counts[file_name] == 0)
                    {
                        out_rename_map[file_name + ".bin"] = "";
-                        out_rename_map[file_name + ".mrk"] = "";
+                        out_rename_map[file_name + index_granularity_info.marks_file_extension] = "";
                    }
                }, {});
            }
@ -1285,7 +1408,7 @@ void MergeTreeData::createConvertExpression(const DataPartPtr & part, const Name
                    String temporary_file_name = IDataType::getFileNameForStream(temporary_column_name, substream_path);

                    out_rename_map[temporary_file_name + ".bin"] = original_file_name + ".bin";
-                    out_rename_map[temporary_file_name + ".mrk"] = original_file_name + ".mrk";
+                    out_rename_map[temporary_file_name + index_granularity_info.marks_file_extension] = original_file_name + index_granularity_info.marks_file_extension;
                }, {});
        }

@ -1404,7 +1527,14 @@ MergeTreeData::AlterDataPartTransactionPtr MergeTreeData::alterDataPart(
          */
        IMergedBlockOutputStream::WrittenOffsetColumns unused_written_offsets;
        MergedColumnOnlyOutputStream out(
-            *this, in.getHeader(), full_path + part->name + '/', true /* sync */, compression_codec, true /* skip_offsets */, unused_written_offsets);
+            *this,
+            in.getHeader(),
+            full_path + part->name + '/',
+            true /* sync */,
+            compression_codec,
+            true /* skip_offsets */,
+            unused_written_offsets,
+            part->index_granularity);

        in.readPrefix();
        out.writePrefix();
@ -1446,6 +1576,32 @@ MergeTreeData::AlterDataPartTransactionPtr MergeTreeData::alterDataPart(
    return transaction;
 }

+void MergeTreeData::removeEmptyColumnsFromPart(MergeTreeData::MutableDataPartPtr & data_part)
+{
+    auto & empty_columns = data_part->empty_columns;
+    if (empty_columns.empty())
+        return;
+
+    NamesAndTypesList new_columns;
+    for (const auto & [name, type] : data_part->columns)
+        if (!empty_columns.count(name))
+            new_columns.emplace_back(name, type);
+
+    std::stringstream log_message;
+    for (auto it = empty_columns.begin(); it != empty_columns.end(); ++it)
+    {
+        if (it != empty_columns.begin())
+            log_message << ", ";
+        log_message << *it;
+    }
+
+    LOG_INFO(log, "Removing empty columns: " << log_message.str() << " from part " << data_part->name);
+
+    if (auto transaction = alterDataPart(data_part, new_columns, getIndicesDescription().indices, false))
+        transaction->commit();
+    empty_columns.clear();
+}
+
 void MergeTreeData::freezeAll(const String & with_name, const Context & context)
 {
    freezePartitionsByMatcher([] (const DataPartPtr &){ return true; }, with_name, context);
@ -2150,8 +2306,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeData::loadPartAndFixMetadata(const St
    /// Check the data while we are at it.
    if (part->checksums.empty())
    {
-        part->checksums = checkDataPart(full_part_path, index_granularity, false, primary_key_data_types, skip_indices);
-
+        part->checksums = checkDataPart(part, false, primary_key_data_types, skip_indices);
        {
            WriteBufferFromFile out(full_part_path + "checksums.txt.tmp", 4096);
            part->checksums.write(out);
--- a/dbms/src/Storages/MergeTree/MergeTreeData.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeData.h
@ -285,6 +285,32 @@ public:
        String getModeName() const;
    };

+    /// Meta information about index granularity
+    struct IndexGranularityInfo
+    {
+        /// Marks file extension '.mrk' or '.mrk2'
+        String marks_file_extension;
+
+        /// Size of one mark in the file: two or three size_t numbers
+        UInt8 mark_size_in_bytes;
+
+        /// Is the stride in rows between marks non-fixed?
+        bool is_adaptive;
+
+        /// Fixed size in rows of one granule if index_granularity_bytes is zero
+        size_t fixed_index_granularity;
+
+        /// Approximate size of one granule in bytes
+        size_t index_granularity_bytes;
+
+        IndexGranularityInfo(const MergeTreeSettings & settings);
+
+        String getMarksFilePath(const String & column_path) const
+        {
+            return column_path + marks_file_extension;
+        }
+    };
+

    /// Attach the table corresponding to the directory in full_path (must end with /), with the given columns.
    /// Correctness of names and paths is not checked.
@ -312,6 +338,7 @@ public:
                  const ASTPtr & order_by_ast_,
                  const ASTPtr & primary_key_ast_,
                  const ASTPtr & sample_by_ast_, /// nullptr, if sampling is not supported.
+                  const ASTPtr & ttl_table_ast_,
                  const MergingParams & merging_params_,
                  const MergeTreeSettings & settings_,
                  bool require_part_metadata_,
@ -494,6 +521,9 @@ public:
        const IndicesASTs & new_indices,
        bool skip_sanity_checks);

+    /// Remove columns that have been marked as empty after zeroing values with expired TTL
+    void removeEmptyColumnsFromPart(MergeTreeData::MutableDataPartPtr & data_part);
+
    /// Freezes all parts.
    void freezeAll(const String & with_name, const Context & context);

@ -514,6 +544,7 @@ public:
    bool hasSortingKey() const { return !sorting_key_columns.empty(); }
    bool hasPrimaryKey() const { return !primary_key_columns.empty(); }
    bool hasSkipIndices() const { return !skip_indices.empty(); }
+    bool hasTableTTL() const { return ttl_table_ast != nullptr; }

    ASTPtr getSortingKeyAST() const { return sorting_key_expr_ast; }
    ASTPtr getPrimaryKeyAST() const { return primary_key_expr_ast; }
@ -569,6 +600,7 @@ public:
    MergeTreeDataFormatVersion format_version;

    Context global_context;
+    IndexGranularityInfo index_granularity_info;

    /// Merging params - what additional actions to perform during merge.
    const MergingParams merging_params;
@ -601,10 +633,20 @@ public:
    Block primary_key_sample;
    DataTypes primary_key_data_types;

+    struct TTLEntry
+    {
+        ExpressionActionsPtr expression;
+        String result_column;
+    };
+
+    using TTLEntriesByName = std::unordered_map<String, TTLEntry>;
+    TTLEntriesByName ttl_entries_by_name;
+
+    TTLEntry ttl_table_entry;
+
    String sampling_expr_column_name;
    Names columns_required_for_sampling;

-    const size_t index_granularity;
    const MergeTreeSettings settings;

    /// Limiting parallel sends per one table, used in DataPartsExchange
@ -625,6 +667,7 @@ private:
    ASTPtr order_by_ast;
    ASTPtr primary_key_ast;
    ASTPtr sample_by_ast;
+    ASTPtr ttl_table_ast;

    bool require_part_metadata;

@ -735,6 +778,9 @@ private:

    void initPartitionKey();

+    void setTTLExpressions(const ColumnsDescription::ColumnTTLs & new_column_ttls,
+                           const ASTPtr & new_ttl_table_ast, bool only_check = false);
+
    /// Expression for column type conversion.
    /// If no conversions are needed, out_expression=nullptr.
    /// out_rename_map maps column files for the out_expression onto new table files.
--- a/dbms/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
@ -4,9 +4,11 @@
 #include <Storages/MergeTree/DiskSpaceMonitor.h>
 #include <Storages/MergeTree/SimpleMergeSelector.h>
 #include <Storages/MergeTree/AllMergeSelector.h>
+#include <Storages/MergeTree/TTLMergeSelector.h>
 #include <Storages/MergeTree/MergeList.h>
 #include <Storages/MergeTree/StorageFromMergeTreeDataPart.h>
 #include <Storages/MergeTree/BackgroundProcessingPool.h>
+#include <DataStreams/TTLBlockInputStream.h>
 #include <DataStreams/DistinctSortedBlockInputStream.h>
 #include <DataStreams/ExpressionBlockInputStream.h>
 #include <DataStreams/MergingSortedBlockInputStream.h>
@ -176,6 +178,7 @@ bool MergeTreeDataMergerMutator::selectPartsToMerge(

    const String * prev_partition_id = nullptr;
    const MergeTreeData::DataPartPtr * prev_part = nullptr;
+    bool has_part_with_expired_ttl = false;
    for (const MergeTreeData::DataPartPtr & part : data_parts)
    {
        const String & partition_id = part->info.partition_id;
@ -191,6 +194,10 @@ bool MergeTreeDataMergerMutator::selectPartsToMerge(
        part_info.age = current_time - part->modification_time;
        part_info.level = part->info.level;
        part_info.data = &part;
+        part_info.min_ttl = part->ttl_infos.part_min_ttl;
+
+        if (part_info.min_ttl && part_info.min_ttl <= current_time)
+            has_part_with_expired_ttl = true;

        partitions.back().emplace_back(part_info);

@ -210,8 +217,17 @@ bool MergeTreeDataMergerMutator::selectPartsToMerge(
    if (aggressive)
        merge_settings.base = 1;

+    bool can_merge_with_ttl =
+        (current_time - last_merge_with_ttl > data.settings.merge_with_ttl_timeout);
+
    /// NOTE Could allow selection of different merge strategy.
-    merge_selector = std::make_unique<SimpleMergeSelector>(merge_settings);
+    if (can_merge_with_ttl && has_part_with_expired_ttl)
+    {
+        merge_selector = std::make_unique<TTLMergeSelector>(current_time);
+        last_merge_with_ttl = current_time;
+    }
+    else
+        merge_selector = std::make_unique<SimpleMergeSelector>(merge_settings);

    IMergeSelector::PartsInPartition parts_to_merge = merge_selector->select(
        partitions,
@ -224,7 +240,8 @@ bool MergeTreeDataMergerMutator::selectPartsToMerge(
        return false;
    }

-    if (parts_to_merge.size() == 1)
+    /// Allow a part to be "merged" with itself if we need to remove some values with expired TTL
+    if (parts_to_merge.size() == 1 && !has_part_with_expired_ttl)
        throw Exception("Logical error: merge selector returned only one part to merge", ErrorCodes::LOGICAL_ERROR);

    MergeTreeData::DataPartsVector parts;
@ -536,9 +553,18 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
    new_data_part->relative_path = TMP_PREFIX + future_part.name;
    new_data_part->is_temp = true;

-    size_t sum_input_rows_upper_bound = merge_entry->total_size_marks * data.index_granularity;
+    size_t sum_input_rows_upper_bound = merge_entry->total_rows_count;

-    MergeAlgorithm merge_alg = chooseMergeAlgorithm(parts, sum_input_rows_upper_bound, gathering_columns, deduplicate);
+    bool need_remove_expired_values = false;
+    for (const MergeTreeData::DataPartPtr & part : parts)
+        new_data_part->ttl_infos.update(part->ttl_infos);
+
+    const auto & part_min_ttl = new_data_part->ttl_infos.part_min_ttl;
+    if (part_min_ttl && part_min_ttl <= time_of_merge)
+        need_remove_expired_values = true;
+
+
+    MergeAlgorithm merge_alg = chooseMergeAlgorithm(parts, sum_input_rows_upper_bound, gathering_columns, deduplicate, need_remove_expired_values);

    LOG_DEBUG(log, "Selected MergeAlgorithm: " << ((merge_alg == MergeAlgorithm::Vertical) ? "Vertical" : "Horizontal"));

@ -599,6 +625,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor

    MergeStageProgress horizontal_stage_progress(
        merge_alg == MergeAlgorithm::Horizontal ? 1.0 : column_sizes.keyColumnsWeight());
+
    for (const auto & part : parts)
    {
        auto input = std::make_unique<MergeTreeSequentialBlockInputStream>(
@ -629,16 +656,19 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
    ///  that is going in insertion order.
    std::shared_ptr<IBlockInputStream> merged_stream;

+    /// If merge is vertical we cannot calculate it
+    bool blocks_are_granules_size = (merge_alg == MergeAlgorithm::Vertical);
+
    switch (data.merging_params.mode)
    {
        case MergeTreeData::MergingParams::Ordinary:
            merged_stream = std::make_unique<MergingSortedBlockInputStream>(
-                src_streams, sort_description, DEFAULT_MERGE_BLOCK_SIZE, 0, rows_sources_write_buf.get(), true);
+                src_streams, sort_description, DEFAULT_MERGE_BLOCK_SIZE, 0, rows_sources_write_buf.get(), true, blocks_are_granules_size);
            break;

        case MergeTreeData::MergingParams::Collapsing:
            merged_stream = std::make_unique<CollapsingSortedBlockInputStream>(
-                src_streams, sort_description, data.merging_params.sign_column, DEFAULT_MERGE_BLOCK_SIZE, rows_sources_write_buf.get());
+                src_streams, sort_description, data.merging_params.sign_column, DEFAULT_MERGE_BLOCK_SIZE, rows_sources_write_buf.get(), blocks_are_granules_size);
            break;

        case MergeTreeData::MergingParams::Summing:
@ -653,7 +683,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor

        case MergeTreeData::MergingParams::Replacing:
            merged_stream = std::make_unique<ReplacingSortedBlockInputStream>(
-                src_streams, sort_description, data.merging_params.version_column, DEFAULT_MERGE_BLOCK_SIZE, rows_sources_write_buf.get());
+                src_streams, sort_description, data.merging_params.version_column, DEFAULT_MERGE_BLOCK_SIZE, rows_sources_write_buf.get(), blocks_are_granules_size);
            break;

        case MergeTreeData::MergingParams::Graphite:
@ -664,15 +694,24 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor

        case MergeTreeData::MergingParams::VersionedCollapsing:
            merged_stream = std::make_unique<VersionedCollapsingSortedBlockInputStream>(
-                src_streams, sort_description, data.merging_params.sign_column, DEFAULT_MERGE_BLOCK_SIZE, rows_sources_write_buf.get());
+                src_streams, sort_description, data.merging_params.sign_column, DEFAULT_MERGE_BLOCK_SIZE, rows_sources_write_buf.get(), blocks_are_granules_size);
            break;
    }

    if (deduplicate)
        merged_stream = std::make_shared<DistinctSortedBlockInputStream>(merged_stream, SizeLimits(), 0 /*limit_hint*/, Names());

+    if (need_remove_expired_values)
+        merged_stream = std::make_shared<TTLBlockInputStream>(merged_stream, data, new_data_part, time_of_merge);
+
    MergedBlockOutputStream to{
-        data, new_part_tmp_path, merging_columns, compression_codec, merged_column_to_size, data.settings.min_merge_bytes_to_use_direct_io};
+        data,
+        new_part_tmp_path,
+        merging_columns,
+        compression_codec,
+        merged_column_to_size,
+        data.settings.min_merge_bytes_to_use_direct_io,
+        blocks_are_granules_size};

    merged_stream->readPrefix();
    to.writePrefix();
@ -684,6 +723,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
    while (!actions_blocker.isCancelled() && (block = merged_stream->read()))
    {
        rows_written += block.rows();
+
        to.write(block);

        merge_entry->rows_written = merged_stream->getProfileInfo().rows;
@ -758,7 +798,15 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
            rows_sources_read_buf.seek(0, 0);
            ColumnGathererStream column_gathered_stream(column_name, column_part_streams, rows_sources_read_buf);
            MergedColumnOnlyOutputStream column_to(
-                data, column_gathered_stream.getHeader(), new_part_tmp_path, false, compression_codec, false, written_offset_columns);
+                data,
+                column_gathered_stream.getHeader(),
+                new_part_tmp_path,
+                false,
+                compression_codec,
+                false,
+                written_offset_columns,
+                to.getIndexGranularity()
+            );
            size_t column_elems_written = 0;

            column_to.writePrefix();
@ -857,6 +905,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mutatePartToTempor
        data, future_part.name, future_part.part_info);
    new_data_part->relative_path = "tmp_mut_" + future_part.name;
    new_data_part->is_temp = true;
+    new_data_part->ttl_infos = source_part->ttl_infos;

    String new_part_tmp_path = new_data_part->getFullPath();

@ -936,7 +985,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mutatePartToTempor
            {
                String stream_name = IDataType::getFileNameForStream(entry.name, substream_path);
                files_to_skip.insert(stream_name + ".bin");
-                files_to_skip.insert(stream_name + ".mrk");
+                files_to_skip.insert(stream_name + data.index_granularity_info.marks_file_extension);
            };

            IDataType::SubstreamPath stream_path;
@ -959,7 +1008,15 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mutatePartToTempor

        IMergedBlockOutputStream::WrittenOffsetColumns unused_written_offsets;
        MergedColumnOnlyOutputStream out(
-            data, in_header, new_part_tmp_path, /* sync = */ false, compression_codec, /* skip_offsets = */ false, unused_written_offsets);
+            data,
+            in_header,
+            new_part_tmp_path,
+            /* sync = */ false,
+            compression_codec,
+            /* skip_offsets = */ false,
+            unused_written_offsets,
+            source_part->index_granularity
+        );

        in->readPrefix();
        out.writePrefix();
@ -1002,7 +1059,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mutatePartToTempor
        }

        new_data_part->rows_count = source_part->rows_count;
-        new_data_part->marks_count = source_part->marks_count;
+        new_data_part->index_granularity = source_part->index_granularity;
        new_data_part->index = source_part->index;
        new_data_part->partition.assign(source_part->partition);
        new_data_part->minmax_idx = source_part->minmax_idx;
@ -1016,12 +1073,14 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mutatePartToTempor

 MergeTreeDataMergerMutator::MergeAlgorithm MergeTreeDataMergerMutator::chooseMergeAlgorithm(
    const MergeTreeData::DataPartsVector & parts, size_t sum_rows_upper_bound,
-    const NamesAndTypesList & gathering_columns, bool deduplicate) const
+    const NamesAndTypesList & gathering_columns, bool deduplicate, bool need_remove_expired_values) const
 {
    if (deduplicate)
        return MergeAlgorithm::Horizontal;
    if (data.settings.enable_vertical_merge_algorithm == 0)
        return MergeAlgorithm::Horizontal;
+    if (need_remove_expired_values)
+        return MergeAlgorithm::Horizontal;

    bool is_supported_storage =
        data.merging_params.mode == MergeTreeData::MergingParams::Ordinary ||
@ -1093,7 +1152,6 @@ MergeTreeData::DataPartPtr MergeTreeDataMergerMutator::renameMergedTemporaryPart
    return new_data_part;
 }

-
 size_t MergeTreeDataMergerMutator::estimateNeededDiskSpace(const MergeTreeData::DataPartsVector & source_parts)
 {
    size_t res = 0;
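
Note on the selector choice in `selectPartsToMerge` above: the TTL selector is used only when some part has an expired minimum TTL and more than `merge_with_ttl_timeout` seconds have passed since the previous TTL merge. The following is a minimal standalone sketch of that decision, not the actual ClickHouse code. `SelectorKind` and `TTLSelectorChooser` are hypothetical names introduced for illustration, and parts are reduced to their `part_min_ttl` values (0 meaning "no TTL").

```cpp
#include <cassert>
#include <ctime>
#include <vector>

// Hypothetical stand-in for "which IMergeSelector implementation gets constructed".
enum class SelectorKind { Simple, TTL };

// Hypothetical helper mirroring the decision made in the diff above.
struct TTLSelectorChooser
{
    time_t merge_with_ttl_timeout;   // plays the role of settings.merge_with_ttl_timeout
    time_t last_merge_with_ttl = 0;  // last time the TTL selector was chosen

    SelectorKind choose(const std::vector<time_t> & parts_min_ttl, time_t current_time)
    {
        // A part qualifies if it has a TTL at all (non-zero) and that TTL has already passed.
        bool has_part_with_expired_ttl = false;
        for (time_t min_ttl : parts_min_ttl)
            if (min_ttl && min_ttl <= current_time)
                has_part_with_expired_ttl = true;

        // Rate limit: TTL merges are allowed only once per timeout window.
        bool can_merge_with_ttl = (current_time - last_merge_with_ttl > merge_with_ttl_timeout);

        if (can_merge_with_ttl && has_part_with_expired_ttl)
        {
            last_merge_with_ttl = current_time;
            return SelectorKind::TTL;
        }
        return SelectorKind::Simple;
    }
};
```

In this sketch, a chooser built as `TTLSelectorChooser chooser{3600};` picks the TTL selector at most once per hour, and only while at least one part carries an expired TTL.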
--- a/dbms/src/Storages/MergeTree/MergeTreeDataMergerMutator.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataMergerMutator.h
@ -127,7 +127,7 @@ private:

    MergeAlgorithm chooseMergeAlgorithm(
        const MergeTreeData::DataPartsVector & parts,
-        size_t rows_upper_bound, const NamesAndTypesList & gathering_columns, bool deduplicate) const;
+        size_t rows_upper_bound, const NamesAndTypesList & gathering_columns, bool deduplicate, bool need_remove_expired_values) const;

 private:
    MergeTreeData & data;
@ -137,6 +137,9 @@ private:

    /// When the last time you wrote to the log that the disk space was running out (not to write about this too often).
    time_t disk_space_warning_time = 0;
+
+    /// Last time TTLMergeSelector was used
+    time_t last_merge_with_ttl = 0;
 };


--- a/dbms/src/Storages/MergeTree/MergeTreeDataPart.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataPart.cpp
@ -22,9 +22,7 @@
 #include <Poco/DirectoryIterator.h>

 #include <common/logger_useful.h>
-
-#define MERGE_TREE_MARK_SIZE (2 * sizeof(UInt64))
-
+#include <common/JSON.h>

 namespace DB
 {
@ -37,6 +35,7 @@ namespace ErrorCodes
    extern const int CORRUPTED_DATA;
    extern const int NOT_FOUND_EXPECTED_DATA_PART;
    extern const int BAD_SIZE_OF_FILE_IN_DATA_PART;
+    extern const int BAD_TTL_FILE;
 }


@ -137,10 +136,20 @@ void MergeTreeDataPart::MinMaxIndex::merge(const MinMaxIndex & other)


 MergeTreeDataPart::MergeTreeDataPart(MergeTreeData & storage_, const String & name_)
-    : storage(storage_), name(name_), info(MergeTreePartInfo::fromPartName(name_, storage.format_version))
+    : storage(storage_)
+    , name(name_)
+    , info(MergeTreePartInfo::fromPartName(name_, storage.format_version))
 {
 }

+MergeTreeDataPart::MergeTreeDataPart(const MergeTreeData & storage_, const String & name_, const MergeTreePartInfo & info_)
+    : storage(storage_)
+    , name(name_)
+    , info(info_)
+{
+}
+
+
 /// Takes into account the fact that several columns can e.g. share their .size substreams.
 /// When calculating totals these should be counted only once.
 MergeTreeDataPart::ColumnSize MergeTreeDataPart::getColumnSizeImpl(
@ -164,7 +173,7 @@ MergeTreeDataPart::ColumnSize MergeTreeDataPart::getColumnSizeImpl(
            size.data_uncompressed += bin_checksum->second.uncompressed_size;
        }

-        auto mrk_checksum = checksums.files.find(file_name + ".mrk");
+        auto mrk_checksum = checksums.files.find(file_name + storage.index_granularity_info.marks_file_extension);
        if (mrk_checksum != checksums.files.end())
            size.marks += mrk_checksum->second.file_size;
    }, {});
@ -198,7 +207,6 @@ size_t MergeTreeDataPart::getFileSizeOrZero(const String & file_name) const
    return checksum->second.file_size;
 }

-
 /** Returns the name of a column with minimum compressed size (as returned by getColumnSize()).
  * If no checksums are present returns the name of the first physically existing column.
  */
@ -459,6 +467,11 @@ void MergeTreeDataPart::renameToDetached(const String & prefix) const
 }


+UInt64 MergeTreeDataPart::getMarksCount() const
+{
+    return index_granularity.getMarksCount();
+}
+
 void MergeTreeDataPart::makeCloneInDetached(const String & prefix) const
 {
    Poco::Path src(getFullPath());
@ -476,24 +489,55 @@ void MergeTreeDataPart::loadColumnsChecksumsIndexes(bool require_columns_checksu

    loadColumns(require_columns_checksums);
    loadChecksums(require_columns_checksums);
-    loadIndex();
-    loadRowsCount(); /// Must be called after loadIndex() as it uses the value of `marks_count`.
+    loadIndexGranularity();
+    loadIndex(); /// Must be called after loadIndexGranularity as it uses the value of `index_granularity`
+    loadRowsCount(); /// Must be called after loadIndexGranularity() as it uses the value of `index_granularity`.
    loadPartitionAndMinMaxIndex();
+    loadTTLInfos();
    if (check_consistency)
        checkConsistency(require_columns_checksums);
 }

+void MergeTreeDataPart::loadIndexGranularity()
+{
+    if (columns.empty())
+        throw Exception("No columns in part " + name, ErrorCodes::NO_FILE_IN_DATA_PART);
+    const auto & granularity_info = storage.index_granularity_info;
+
+    /// We can use any column, it doesn't matter
+    std::string marks_file_path = granularity_info.getMarksFilePath(getFullPath() + escapeForFileName(columns.front().name));
+    if (!Poco::File(marks_file_path).exists())
+        throw Exception("Marks file '" + marks_file_path + "' doesn't exist", ErrorCodes::NO_FILE_IN_DATA_PART);
+
+    size_t marks_file_size = Poco::File(marks_file_path).getSize();
+
+    /// old version of marks with static index granularity
+    if (!granularity_info.is_adaptive)
+    {
+        size_t marks_count = marks_file_size / granularity_info.mark_size_in_bytes;
+        index_granularity.resizeWithFixedGranularity(marks_count, granularity_info.fixed_index_granularity); /// all the same
+    }
+    else
+    {
+        ReadBufferFromFile buffer(marks_file_path, marks_file_size, -1);
+        while (!buffer.eof())
+        {
+            buffer.seek(sizeof(size_t) * 2, SEEK_CUR); /// skip offset_in_compressed_file and offset_in_decompressed_block
+            size_t granularity;
+            readIntBinary(granularity, buffer);
+            index_granularity.appendMark(granularity);
+        }
+        if (index_granularity.getMarksCount() * granularity_info.mark_size_in_bytes != marks_file_size)
+            throw Exception("Cannot read all marks from file " + marks_file_path, ErrorCodes::CANNOT_READ_ALL_DATA);
+    }
+    index_granularity.setInitialized();
+}

 void MergeTreeDataPart::loadIndex()
 {
-    if (!marks_count)
-    {
-        if (columns.empty())
-            throw Exception("No columns in part " + name, ErrorCodes::NO_FILE_IN_DATA_PART);
-
-        marks_count = Poco::File(getFullPath() + escapeForFileName(columns.front().name) + ".mrk")
-            .getSize() / MERGE_TREE_MARK_SIZE;
-    }
+    /// It can be empty in case of mutations
+    if (!index_granularity.isInitialized())
+        throw Exception("Index granularity is not loaded before index loading", ErrorCodes::LOGICAL_ERROR);

    size_t key_size = storage.primary_key_columns.size();

@ -505,22 +549,22 @@ void MergeTreeDataPart::loadIndex()
        for (size_t i = 0; i < key_size; ++i)
        {
            loaded_index[i] = storage.primary_key_data_types[i]->createColumn();
-            loaded_index[i]->reserve(marks_count);
+            loaded_index[i]->reserve(index_granularity.getMarksCount());
        }

        String index_path = getFullPath() + "primary.idx";
        ReadBufferFromFile index_file = openForReading(index_path);

-        for (size_t i = 0; i < marks_count; ++i)    //-V756
+        for (size_t i = 0; i < index_granularity.getMarksCount(); ++i)    //-V756
            for (size_t j = 0; j < key_size; ++j)
                storage.primary_key_data_types[j]->deserializeBinary(*loaded_index[j], index_file);

        for (size_t i = 0; i < key_size; ++i)
        {
            loaded_index[i]->protect();
-            if (loaded_index[i]->size() != marks_count)
+            if (loaded_index[i]->size() != index_granularity.getMarksCount())
                throw Exception("Cannot read all data from index file " + index_path
-                    + "(expected size: " + toString(marks_count) + ", read: " + toString(loaded_index[i]->size()) + ")",
+                    + "(expected size: " + toString(index_granularity.getMarksCount()) + ", read: " + toString(loaded_index[i]->size()) + ")",
                    ErrorCodes::CANNOT_READ_ALL_DATA);
        }

@ -585,7 +629,7 @@ void MergeTreeDataPart::loadChecksums(bool require)

 void MergeTreeDataPart::loadRowsCount()
 {
-    if (marks_count == 0)
+    if (index_granularity.empty())
    {
        rows_count = 0;
    }
@ -601,8 +645,6 @@ void MergeTreeDataPart::loadRowsCount()
    }
    else
    {
-        size_t rows_approx = storage.index_granularity * marks_count;
-
        for (const NameAndTypePair & column : columns)
        {
            ColumnPtr column_col = column.type->createColumn();
@ -624,10 +666,12 @@ void MergeTreeDataPart::loadRowsCount()
                    ErrorCodes::LOGICAL_ERROR);
            }

-            if (!(rows_count <= rows_approx && rows_approx < rows_count + storage.index_granularity))
+            size_t last_mark_index_granularity = index_granularity.getLastMarkRows();
+            size_t rows_approx = index_granularity.getTotalRows();
+            if (!(rows_count <= rows_approx && rows_approx < rows_count + last_mark_index_granularity))
                throw Exception(
                    "Unexpected size of column " + column.name + ": " + toString(rows_count) + " rows, expected "
-                    + toString(rows_approx) + "+-" + toString(storage.index_granularity) + " rows according to the index",
+                    + toString(rows_approx) + "+-" + toString(last_mark_index_granularity) + " rows according to the index",
                    ErrorCodes::LOGICAL_ERROR);

            return;
@ -637,6 +681,33 @@ void MergeTreeDataPart::loadRowsCount()
    }
 }

+void MergeTreeDataPart::loadTTLInfos()
+{
+    String path = getFullPath() + "ttl.txt";
+    if (Poco::File(path).exists())
+    {
+        ReadBufferFromFile in = openForReading(path);
+        assertString("ttl format version: ", in);
+        size_t format_version;
+        readText(format_version, in);
+        assertChar('\n', in);
+
+        if (format_version == 1)
+        {
+            try
+            {
+                ttl_infos.read(in);
+            }
+            catch (const JSONException &)
+            {
+                throw Exception("Error while parsing file ttl.txt in part: " + name, ErrorCodes::BAD_TTL_FILE);
+            }
+        }
+        else
+            throw Exception("Unknown ttl format version: " + toString(format_version), ErrorCodes::BAD_TTL_FILE);
+    }
+}
+
 void MergeTreeDataPart::accumulateColumnSizes(ColumnToSize & column_to_size) const
 {
    std::shared_lock<std::shared_mutex> part_lock(columns_lock);
@ -699,7 +770,7 @@ void MergeTreeDataPart::checkConsistency(bool require_part_metadata)
                name_type.type->enumerateStreams([&](const IDataType::SubstreamPath & substream_path)
                {
                    String file_name = IDataType::getFileNameForStream(name_type.name, substream_path);
-                    String mrk_file_name = file_name + ".mrk";
+                    String mrk_file_name = file_name + storage.index_granularity_info.marks_file_extension;
                    String bin_file_name = file_name + ".bin";
                    if (!checksums.files.count(mrk_file_name))
                        throw Exception("No " + mrk_file_name + " file checksum for column " + name_type.name + " in part " + path,
@ -763,7 +834,7 @@ void MergeTreeDataPart::checkConsistency(bool require_part_metadata)
        {
            name_type.type->enumerateStreams([&](const IDataType::SubstreamPath & substream_path)
            {
-                Poco::File file(IDataType::getFileNameForStream(name_type.name, substream_path) + ".mrk");
+                Poco::File file(IDataType::getFileNameForStream(name_type.name, substream_path) + storage.index_granularity_info.marks_file_extension);

                /// Missing file is Ok for case when new column was added.
                if (file.exists())
@ -794,7 +865,7 @@ bool MergeTreeDataPart::hasColumnFiles(const String & column) const

    String escaped_column = escapeForFileName(column);
    return Poco::File(prefix + escaped_column + ".bin").exists()
-        && Poco::File(prefix + escaped_column + ".mrk").exists();
+        && Poco::File(prefix + escaped_column + storage.index_granularity_info.marks_file_extension).exists();
 }
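
Note on the adaptive branch of `loadIndexGranularity()` above: each mark in an adaptive marks file is read as three numbers (offset in the compressed file, offset in the decompressed block, rows in the granule), and the loader keeps only the third. The sketch below shows that layout on an in-memory buffer instead of a `ReadBufferFromFile`. It assumes a 64-bit `size_t` and host byte order. `Mark`, `packMarks` and `readAdaptiveGranularities` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory view of one adaptive mark (three 64-bit numbers).
struct Mark
{
    uint64_t offset_in_compressed_file;
    uint64_t offset_in_decompressed_block;
    uint64_t rows;  // granularity of this mark
};

constexpr size_t MARK_SIZE = 3 * sizeof(uint64_t);

// Serializes marks the way the sketch expects to read them back (host byte order).
std::vector<unsigned char> packMarks(const std::vector<Mark> & marks)
{
    std::vector<unsigned char> bytes(marks.size() * MARK_SIZE);
    size_t pos = 0;
    for (const Mark & mark : marks)
    {
        const uint64_t fields[3] = {mark.offset_in_compressed_file, mark.offset_in_decompressed_block, mark.rows};
        std::memcpy(bytes.data() + pos, fields, MARK_SIZE);
        pos += MARK_SIZE;
    }
    return bytes;
}

// Skips the two offsets of every mark and collects the per-mark row counts.
// Throws if the buffer does not hold a whole number of marks, which is the
// same condition the size check in the diff guards against.
std::vector<uint64_t> readAdaptiveGranularities(const std::vector<unsigned char> & bytes)
{
    if (bytes.size() % MARK_SIZE != 0)
        throw std::runtime_error("Cannot read all marks: truncated marks data");

    std::vector<uint64_t> rows_per_mark;
    rows_per_mark.reserve(bytes.size() / MARK_SIZE);
    for (size_t pos = 0; pos < bytes.size(); pos += MARK_SIZE)
    {
        uint64_t granularity = 0;
        std::memcpy(&granularity, bytes.data() + pos + 2 * sizeof(uint64_t), sizeof(granularity));
        rows_per_mark.push_back(granularity);
    }
    return rows_per_mark;
}
```

In the non-adaptive case no parsing is needed: the mark count is the file size divided by the size of a two-number mark, and every mark gets the same fixed granularity.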


--- a/dbms/src/Storages/MergeTree/MergeTreeDataPart.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataPart.h
@ -4,10 +4,12 @@
 #include <Core/Block.h>
 #include <Core/Types.h>
 #include <Core/NamesAndTypes.h>
+#include <Storages/MergeTree/MergeTreeIndexGranularity.h>
 #include <Storages/MergeTree/MergeTreeIndices.h>
 #include <Storages/MergeTree/MergeTreePartInfo.h>
 #include <Storages/MergeTree/MergeTreePartition.h>
 #include <Storages/MergeTree/MergeTreeDataPartChecksum.h>
+#include <Storages/MergeTree/MergeTreeDataPartTTLInfo.h>
 #include <Storages/MergeTree/KeyCondition.h>
 #include <Columns/IColumn.h>

@ -28,10 +30,7 @@ struct MergeTreeDataPart
    using Checksums = MergeTreeDataPartChecksums;
    using Checksum = MergeTreeDataPartChecksums::Checksum;

-    MergeTreeDataPart(const MergeTreeData & storage_, const String & name_, const MergeTreePartInfo & info_)
-        : storage(storage_), name(name_), info(info_)
-    {
-    }
+    MergeTreeDataPart(const MergeTreeData & storage_, const String & name_, const MergeTreePartInfo & info_);

    MergeTreeDataPart(MergeTreeData & storage_, const String & name_);

@ -94,7 +93,6 @@ struct MergeTreeDataPart
    mutable String relative_path;

    size_t rows_count = 0;
-    size_t marks_count = 0;
    std::atomic<UInt64> bytes_on_disk {0};  /// 0 - if not counted;
                                            /// Is used from several threads without locks (it is changed with ALTER).
                                            /// May not contain size of checksums.txt and columns.txt
@ -129,6 +127,11 @@ struct MergeTreeDataPart
        Deleting        /// not active data part with identity refcounter, it is deleting right now by a cleaner
    };

+    using TTLInfo = MergeTreeDataPartTTLInfo;
+    using TTLInfos = MergeTreeDataPartTTLInfos;
+
+    TTLInfos ttl_infos;
+
    /// Current state of the part. If the part is in working set already, it should be accessed via data_parts mutex
    mutable State state{State::Temporary};

@ -181,6 +184,10 @@ struct MergeTreeDataPart

    MergeTreePartition partition;

+    /// Number of rows between marks.
+    /// Like the index, always loaded into memory.
+    MergeTreeIndexGranularity index_granularity;
+
    /// Index that for each part stores min and max values of a set of columns. This allows quickly excluding
    /// parts based on conditions on these columns imposed by a query.
    /// Currently this index is built using only columns required by partition expression, but in principle it
@ -216,6 +223,9 @@ struct MergeTreeDataPart
    /// Columns description.
    NamesAndTypesList columns;

+    /// Columns whose values have all been zeroed by expired TTL
+    NameSet empty_columns;
+
    using ColumnToSize = std::map<std::string, UInt64>;

    /** It is blocked for writing when changing columns, checksums or any part files.
@ -266,6 +276,7 @@ struct MergeTreeDataPart
    /// For data in RAM ('index')
    UInt64 getIndexSizeInBytes() const;
    UInt64 getIndexSizeInAllocatedBytes() const;
+    UInt64 getMarksCount() const;

 private:
    /// Reads columns names and types from columns.txt
@ -274,13 +285,19 @@ private:
    /// If checksums.txt exists, reads files' checksums (and sizes) from it
    void loadChecksums(bool require);

-    /// Loads index file. Also calculates this->marks_count if marks_count = 0
+    /// Loads marks index granularity into memory
+    void loadIndexGranularity();
+
+    /// Loads index file.
    void loadIndex();

    /// Load rows count for this part from disk (for the newer storage format version).
    /// For the older format version calculates rows count from the size of a column with a fixed size.
    void loadRowsCount();

+    /// Loads TTL infos in JSON format from the file ttl.txt. If the file doesn't exist, assigns TTL infos with all zeros
+    void loadTTLInfos();
+
    void loadPartitionAndMinMaxIndex();

    void checkConsistency(bool require_part_metadata);
--- a/dbms/src/Storages/MergeTree/MergeTreeDataPartTTLInfo.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataPartTTLInfo.cpp
@ -0,0 +1,88 @@
+#include <Storages/MergeTree/MergeTreeDataPartTTLInfo.h>
+#include <IO/ReadHelpers.h>
+#include <IO/WriteHelpers.h>
+
+#include <common/JSON.h>
+
+namespace DB
+{
+
+void MergeTreeDataPartTTLInfos::update(const MergeTreeDataPartTTLInfos & other_infos)
+{
+    for (const auto & [name, ttl_info] : other_infos.columns_ttl)
+    {
+        columns_ttl[name].update(ttl_info);
+        updatePartMinTTL(ttl_info.min);
+    }
+
+    table_ttl.update(other_infos.table_ttl);
+    updatePartMinTTL(table_ttl.min);
+}
+
+void MergeTreeDataPartTTLInfos::read(ReadBuffer & in)
+{
+    String json_str;
+    readString(json_str, in);
+    assertEOF(in);
+
+    JSON json(json_str);
+    if (json.has("columns"))
+    {
+        JSON columns = json["columns"];
+        for (auto col : columns)
+        {
+            MergeTreeDataPartTTLInfo ttl_info;
+            ttl_info.min = col["min"].getUInt();
+            ttl_info.max = col["max"].getUInt();
+            String name = col["name"].getString();
+            columns_ttl.emplace(name, ttl_info);
+
+            updatePartMinTTL(ttl_info.min);
+        }
+    }
+    if (json.has("table"))
+    {
+        JSON table = json["table"];
+        table_ttl.min = table["min"].getUInt();
+        table_ttl.max = table["max"].getUInt();
+
+        updatePartMinTTL(table_ttl.min);
+    }
+}
+
+void MergeTreeDataPartTTLInfos::write(WriteBuffer & out) const
+{
+    writeString("ttl format version: 1\n", out);
+    writeString("{", out);
+    if (!columns_ttl.empty())
+    {
+        writeString("\"columns\":[", out);
+        for (auto it = columns_ttl.begin(); it != columns_ttl.end(); ++it)
+        {
+            if (it != columns_ttl.begin())
+                writeString(",", out);
+
+            writeString("{\"name\":\"", out);
+            writeString(it->first, out);
+            writeString("\",\"min\":", out);
+            writeIntText(it->second.min, out);
+            writeString(",\"max\":", out);
+            writeIntText(it->second.max, out);
+            writeString("}", out);
+        }
+        writeString("]", out);
+    }
+    if (table_ttl.min)
+    {
+        if (!columns_ttl.empty())
+            writeString(",", out);
+        writeString("\"table\":{\"min\":", out);
+        writeIntText(table_ttl.min, out);
+        writeString(",\"max\":", out);
+        writeIntText(table_ttl.max, out);
+        writeString("}", out);
+    }
+    writeString("}", out);
+}
+
+}
--- a/dbms/src/Storages/MergeTree/MergeTreeDataPartTTLInfo.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataPartTTLInfo.h
@ -0,0 +1,51 @@
+#pragma once
+#include <IO/WriteBufferFromFile.h>
+#include <IO/ReadBufferFromFile.h>
+
+#include <unordered_map>
+
+namespace DB
+{
+
+/// Minimal and maximal ttl for column or table
+struct MergeTreeDataPartTTLInfo
+{
+    time_t min = 0;
+    time_t max = 0;
+
+    void update(time_t time)
+    {
+        if (time && (!min || time < min))
+            min = time;
+
+        max = std::max(time, max);
+    }
+
+    void update(const MergeTreeDataPartTTLInfo & other_info)
+    {
+        if (other_info.min && (!min || other_info.min < min))
+            min = other_info.min;
+
+        max = std::max(other_info.max, max);
+    }
+};
+
+/// PartTTLInfo for all columns and table with minimal ttl for whole part
+struct MergeTreeDataPartTTLInfos
+{
+    std::unordered_map<String, MergeTreeDataPartTTLInfo> columns_ttl;
+    MergeTreeDataPartTTLInfo table_ttl;
+    time_t part_min_ttl = 0;
+
+    void read(ReadBuffer & in);
+    void write(WriteBuffer & out) const;
+    void update(const MergeTreeDataPartTTLInfos & other_infos);
+
+    void updatePartMinTTL(time_t time)
+    {
+        if (time && (!part_min_ttl || time < part_min_ttl))
+            part_min_ttl = time;
+    }
+};
+
+}
--- a/dbms/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp
@ -93,7 +93,7 @@ static Block getBlockWithPartColumn(const MergeTreeData::DataPartsVector & parts
 size_t MergeTreeDataSelectExecutor::getApproximateTotalRowsToRead(
    const MergeTreeData::DataPartsVector & parts, const KeyCondition & key_condition, const Settings & settings) const
 {
-    size_t full_marks_count = 0;
+    size_t rows_count = 0;

    /// We will find out how many rows we would have read without sampling.
    LOG_DEBUG(log, "Preliminary index scan with condition: " << key_condition.toString());
@ -101,7 +101,7 @@ size_t MergeTreeDataSelectExecutor::getApproximateTotalRowsToRead(
    for (size_t i = 0; i < parts.size(); ++i)
    {
        const MergeTreeData::DataPartPtr & part = parts[i];
-        MarkRanges ranges = markRangesFromPKRange(part->index, key_condition, settings);
+        MarkRanges ranges = markRangesFromPKRange(part, key_condition, settings);

        /** In order to get a lower bound on the number of rows that match the condition on PK,
          *  consider only guaranteed full marks.
@ -109,10 +109,11 @@ size_t MergeTreeDataSelectExecutor::getApproximateTotalRowsToRead(
          */
        for (size_t j = 0; j < ranges.size(); ++j)
            if (ranges[j].end - ranges[j].begin > 2)
-                full_marks_count += ranges[j].end - ranges[j].begin - 2;
+                rows_count += part->index_granularity.getRowsCountInRange({ranges[j].begin + 1, ranges[j].end - 1});
+
    }

-    return full_marks_count * data.index_granularity;
+    return rows_count;
 }


@ -533,9 +534,9 @@ BlockInputStreams MergeTreeDataSelectExecutor::readFromParts(
        RangesInDataPart ranges(part, part_index++);

        if (data.hasPrimaryKey())
-            ranges.ranges = markRangesFromPKRange(part->index, key_condition, settings);
+            ranges.ranges = markRangesFromPKRange(part, key_condition, settings);
        else
-            ranges.ranges = MarkRanges{MarkRange{0, part->marks_count}};
+            ranges.ranges = MarkRanges{MarkRange{0, part->getMarksCount()}};

        for (const auto & index_and_condition : useful_indices)
            ranges.ranges = filterMarksUsingIndex(
@ -616,6 +617,28 @@ BlockInputStreams MergeTreeDataSelectExecutor::readFromParts(
    return res;
 }

+namespace
+{
+
+size_t roundRowsOrBytesToMarks(
+    size_t rows_setting,
+    size_t bytes_setting,
+    const MergeTreeData::IndexGranularityInfo & granularity_info)
+{
+    if (!granularity_info.is_adaptive)
+    {
+        size_t fixed_index_granularity = granularity_info.fixed_index_granularity;
+        return (rows_setting + fixed_index_granularity - 1) / fixed_index_granularity;
+    }
+    else
+    {
+        size_t index_granularity_bytes = granularity_info.index_granularity_bytes;
+        return (bytes_setting + index_granularity_bytes - 1) / index_granularity_bytes;
+    }
+}
+
+}
+

 BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreams(
    RangesInDataParts && parts,
@ -627,16 +650,23 @@ BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreams(
    const Names & virt_columns,
    const Settings & settings) const
 {
-    const size_t min_marks_for_concurrent_read =
-        (settings.merge_tree_min_rows_for_concurrent_read + data.index_granularity - 1) / data.index_granularity;
-    const size_t max_marks_to_use_cache =
-        (settings.merge_tree_max_rows_to_use_cache + data.index_granularity - 1) / data.index_granularity;
+    const size_t max_marks_to_use_cache = roundRowsOrBytesToMarks(
+        settings.merge_tree_max_rows_to_use_cache,
+        settings.merge_tree_max_bytes_to_use_cache,
+        data.index_granularity_info);
+
+    const size_t min_marks_for_concurrent_read = roundRowsOrBytesToMarks(
+        settings.merge_tree_min_rows_for_concurrent_read,
+        settings.merge_tree_min_bytes_for_concurrent_read,
+        data.index_granularity_info);

    /// Count marks for each part.
    std::vector<size_t> sum_marks_in_parts(parts.size());
    size_t sum_marks = 0;
+    size_t total_rows = 0;
    for (size_t i = 0; i < parts.size(); ++i)
    {
+        total_rows += parts[i].getRowsCount();
        /// Let the ranges be listed from right to left so that the leftmost range can be dropped using `pop_back()`.
        std::reverse(parts[i].ranges.begin(), parts[i].ranges.end());

@ -662,7 +692,6 @@ BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreams(
            column_names, MergeTreeReadPool::BackoffSettings(settings), settings.preferred_block_size_bytes, false);

        /// Let's estimate total number of rows for progress bar.
-        const size_t total_rows = data.index_granularity * sum_marks;
        LOG_TRACE(log, "Reading approx. " << total_rows << " rows with " << num_streams << " streams");

        for (size_t i = 0; i < num_streams; ++i)
@ -769,8 +798,10 @@ BlockInputStreams MergeTreeDataSelectExecutor::spreadMarkRangesAmongStreamsFinal
    const Names & virt_columns,
    const Settings & settings) const
 {
-    const size_t max_marks_to_use_cache =
-        (settings.merge_tree_max_rows_to_use_cache + data.index_granularity - 1) / data.index_granularity;
+    const size_t max_marks_to_use_cache = roundRowsOrBytesToMarks(
+        settings.merge_tree_max_rows_to_use_cache,
+        settings.merge_tree_max_bytes_to_use_cache,
+        data.index_granularity_info);

    size_t sum_marks = 0;
    for (size_t i = 0; i < parts.size(); ++i)
@ -870,11 +901,12 @@ void MergeTreeDataSelectExecutor::createPositiveSignCondition(
 /// Calculates a set of mark ranges, that could possibly contain keys, required by condition.
 /// In other words, it removes subranges from whole range, that definitely could not contain required keys.
 MarkRanges MergeTreeDataSelectExecutor::markRangesFromPKRange(
-    const MergeTreeData::DataPart::Index & index, const KeyCondition & key_condition, const Settings & settings) const
+    const MergeTreeData::DataPartPtr & part, const KeyCondition & key_condition, const Settings & settings) const
 {
    MarkRanges res;

-    size_t marks_count = index.at(0)->size();
+    size_t marks_count = part->index_granularity.getMarksCount();
+    const auto & index = part->index;
    if (marks_count == 0)
        return res;

@ -886,7 +918,10 @@ MarkRanges MergeTreeDataSelectExecutor::markRangesFromPKRange(
    else
    {
        size_t used_key_size = key_condition.getMaxKeyColumn() + 1;
-        size_t min_marks_for_seek = (settings.merge_tree_min_rows_for_seek + data.index_granularity - 1) / data.index_granularity;
+        size_t min_marks_for_seek = roundRowsOrBytesToMarks(
+            settings.merge_tree_min_rows_for_seek,
+            settings.merge_tree_min_bytes_for_seek,
+            data.index_granularity_info);

        /** There will always be disjoint suspicious segments on the stack, the leftmost one at the top (back).
            * At each step, take the left segment and check if it fits.
@ -968,13 +1003,16 @@ MarkRanges MergeTreeDataSelectExecutor::filterMarksUsingIndex(
        return ranges;
    }

-    const size_t min_marks_for_seek = (settings.merge_tree_min_rows_for_seek + data.index_granularity - 1) / data.index_granularity;
+    const size_t min_marks_for_seek = roundRowsOrBytesToMarks(
+        settings.merge_tree_min_rows_for_seek,
+        settings.merge_tree_min_bytes_for_seek,
+        data.index_granularity_info);

    size_t granules_dropped = 0;

    MergeTreeIndexReader reader(
            index, part,
-            ((part->marks_count + index->granularity - 1) / index->granularity),
+            ((part->getMarksCount() + index->granularity - 1) / index->granularity),
            ranges);

    MarkRanges res;
@ -1021,4 +1059,5 @@ MarkRanges MergeTreeDataSelectExecutor::filterMarksUsingIndex(
    return res;
 }

+
 }
--- a/dbms/src/Storages/MergeTree/MergeTreeDataSelectExecutor.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataSelectExecutor.h
@ -78,7 +78,7 @@ private:
        const Context & context) const;

    MarkRanges markRangesFromPKRange(
-        const MergeTreeData::DataPart::Index & index,
+        const MergeTreeData::DataPartPtr & part,
        const KeyCondition & key_condition,
        const Settings & settings) const;

--- a/dbms/src/Storages/MergeTree/MergeTreeDataWriter.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeDataWriter.cpp
@ -4,8 +4,11 @@
 #include <Common/Exception.h>
 #include <Interpreters/AggregationCommon.h>
 #include <IO/HashingWriteBuffer.h>
+#include <DataTypes/DataTypeDateTime.h>
+#include <DataTypes/DataTypeDate.h>
 #include <IO/WriteHelpers.h>
 #include <Poco/File.h>
+#include <Common/typeid_cast.h>


 namespace ProfileEvents
@ -71,6 +74,34 @@ void buildScatterSelector(
    }
 }

+/// Computes ttls and updates ttl infos
+void updateTTL(const MergeTreeData::TTLEntry & ttl_entry, MergeTreeDataPart::TTLInfos & ttl_infos, Block & block, const String & column_name)
+{
+    if (!block.has(ttl_entry.result_column))
+        ttl_entry.expression->execute(block);
+
+    auto & ttl_info = (column_name.empty() ? ttl_infos.table_ttl : ttl_infos.columns_ttl[column_name]);
+
+    const auto & current = block.getByName(ttl_entry.result_column);
+
+    const IColumn * column = current.column.get();
+    if (const ColumnUInt16 * column_date = typeid_cast<const ColumnUInt16 *>(column))
+    {
+        const auto & date_lut = DateLUT::instance();
+        for (const auto & val : column_date->getData())
+            ttl_info.update(date_lut.fromDayNum(DayNum(val)));
+    }
+    else if (const ColumnUInt32 * column_date_time = typeid_cast<const ColumnUInt32 *>(column))
+    {
+        for (const auto & val : column_date_time->getData())
+            ttl_info.update(val);
+    }
+    else
+        throw Exception("Unexpected type of result ttl column", ErrorCodes::LOGICAL_ERROR);
+
+    ttl_infos.updatePartMinTTL(ttl_info.min);
+}
+
 }

 BlocksWithPartition MergeTreeDataWriter::splitBlockIntoParts(const Block & block, size_t max_parts)
@ -213,6 +244,12 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(BlockWithPa
            ProfileEvents::increment(ProfileEvents::MergeTreeDataWriterBlocksAlreadySorted);
    }

+    if (data.hasTableTTL())
+        updateTTL(data.ttl_table_entry, new_data_part->ttl_infos, block, "");
+
+    for (const auto & [name, ttl_entry] : data.ttl_entries_by_name)
+        updateTTL(ttl_entry, new_data_part->ttl_infos, block, name);
+
    /// This effectively chooses minimal compression method:
    ///  either default lz4 or compression method with zero thresholds on absolute and relative part size.
    auto compression_codec = data.global_context.chooseCompressionCodec(0, 0);
--- a/dbms/src/Storages/MergeTree/MergeTreeIndexGranularity.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeIndexGranularity.cpp
@ -0,0 +1,102 @@
+#include <Storages/MergeTree/MergeTreeIndexGranularity.h>
+#include <Common/Exception.h>
+#include <IO/WriteHelpers.h>
+
+
+namespace DB
+{
+namespace ErrorCodes
+{
+    extern const int LOGICAL_ERROR;
+}
+
+MergeTreeIndexGranularity::MergeTreeIndexGranularity(const std::vector<size_t> & marks_rows_partial_sums_)
+    : marks_rows_partial_sums(marks_rows_partial_sums_)
+{
+}
+
+
+MergeTreeIndexGranularity::MergeTreeIndexGranularity(size_t marks_count, size_t fixed_granularity)
+    : marks_rows_partial_sums(marks_count, fixed_granularity)
+{
+}
+
+size_t MergeTreeIndexGranularity::getMarkStartingRow(size_t mark_index) const
+{
+    if (mark_index == 0)
+        return 0;
+    return marks_rows_partial_sums[mark_index - 1];
+}
+
+size_t MergeTreeIndexGranularity::getMarksCount() const
+{
+    return marks_rows_partial_sums.size();
+}
+
+size_t MergeTreeIndexGranularity::getTotalRows() const
+{
+    if (marks_rows_partial_sums.empty())
+        return 0;
+    return marks_rows_partial_sums.back();
+}
+
+void MergeTreeIndexGranularity::appendMark(size_t rows_count)
+{
+    if (marks_rows_partial_sums.empty())
+        marks_rows_partial_sums.push_back(rows_count);
+    else
+        marks_rows_partial_sums.push_back(marks_rows_partial_sums.back() + rows_count);
+}
+
+size_t MergeTreeIndexGranularity::getRowsCountInRange(size_t begin, size_t end) const
+{
+    size_t subtrahend = 0;
+    if (begin != 0)
+        subtrahend = marks_rows_partial_sums[begin - 1];
+    return marks_rows_partial_sums[end - 1] - subtrahend;
+}
+
+size_t MergeTreeIndexGranularity::getRowsCountInRange(const MarkRange & range) const
+{
+    return getRowsCountInRange(range.begin, range.end);
+}
+
+size_t MergeTreeIndexGranularity::getRowsCountInRanges(const MarkRanges & ranges) const
+{
+    size_t total = 0;
+    for (const auto & range : ranges)
+        total += getRowsCountInRange(range);
+
+    return total;
+}
+
+
+size_t MergeTreeIndexGranularity::countMarksForRows(size_t from_mark, size_t number_of_rows, size_t offset_in_rows) const
+{
+    size_t rows_before_mark = getMarkStartingRow(from_mark);
+    size_t last_row_pos = rows_before_mark + offset_in_rows + number_of_rows;
+    auto position = std::upper_bound(marks_rows_partial_sums.begin(), marks_rows_partial_sums.end(), last_row_pos);
+    size_t to_mark;
+    if (position == marks_rows_partial_sums.end())
+        to_mark = marks_rows_partial_sums.size();
+    else
+        to_mark = position - marks_rows_partial_sums.begin();
+
+    return getRowsCountInRange(from_mark, std::max(1UL, to_mark)) - offset_in_rows;
+
+}
+
+void MergeTreeIndexGranularity::resizeWithFixedGranularity(size_t size, size_t fixed_granularity)
+{
+    marks_rows_partial_sums.resize(size);
+
+    size_t prev = 0;
+    for (size_t i = 0; i < size; ++i)
+    {
+        marks_rows_partial_sums[i] = fixed_granularity + prev;
+        prev = marks_rows_partial_sums[i];
+    }
+}
+
+
+}
--- a/dbms/src/Storages/MergeTree/MergeTreeIndexGranularity.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeIndexGranularity.h
@ -0,0 +1,86 @@
+#pragma once
+#include <vector>
+#include <Storages/MergeTree/MarkRange.h>
+
+namespace DB
+{
+
+/// Stores the index granularity, in rows, of a MergeTreeDataPart.
+/// Internally it keeps a vector of partial sums of rows per mark:
+/// |-----|---|----|----|
+/// |  5  | 8 | 12 | 16 |
+/// If the user doesn't specify the adaptive_index_granularity_bytes setting for a MergeTree* table,
+/// all values in the inner vector have a constant stride (default 8192).
+class MergeTreeIndexGranularity
+{
+private:
+    std::vector<size_t> marks_rows_partial_sums;
+    bool initialized = false;
+
+public:
+    MergeTreeIndexGranularity() = default;
+    explicit MergeTreeIndexGranularity(const std::vector<size_t> & marks_rows_partial_sums_);
+    MergeTreeIndexGranularity(size_t marks_count, size_t fixed_granularity);
+
+
+    /// Return count of rows between marks
+    size_t getRowsCountInRange(const MarkRange & range) const;
+    /// Return count of rows between marks
+    size_t getRowsCountInRange(size_t begin, size_t end) const;
+    /// Return sum of rows between all ranges
+    size_t getRowsCountInRanges(const MarkRanges & ranges) const;
+
+    /// Return the number of marks that contain `number_of_rows` rows, starting from
+    /// `from_mark` and possibly some `offset_in_rows` after `from_mark`
+    ///                                     1    2  <- answer
+    /// |-----|---------------------------|----|----|
+    ///       ^------------------------^-----------^
+    ///   from_mark  offset_in_rows    number_of_rows
+    size_t countMarksForRows(size_t from_mark, size_t number_of_rows, size_t offset_in_rows=0) const;
+
+    /// Total marks
+    size_t getMarksCount() const;
+    /// Total rows
+    size_t getTotalRows() const;
+
+    /// Rows after mark to next mark
+    inline size_t getMarkRows(size_t mark_index) const
+    {
+        if (mark_index == 0)
+            return marks_rows_partial_sums[0];
+        else
+            return marks_rows_partial_sums[mark_index] - marks_rows_partial_sums[mark_index - 1];
+    }
+
+    /// Return amount of rows before mark
+    size_t getMarkStartingRow(size_t mark_index) const;
+
+    /// Amount of rows after last mark
+    size_t getLastMarkRows() const
+    {
+        size_t last = marks_rows_partial_sums.size() - 1;
+        return getMarkRows(last);
+    }
+
+    bool empty() const
+    {
+        return marks_rows_partial_sums.empty();
+    }
+
+    bool isInitialized() const
+    {
+        return initialized;
+    }
+
+    void setInitialized()
+    {
+        initialized = true;
+    }
+    /// Add new mark with rows_count
+    void appendMark(size_t rows_count);
+
+    /// Add `size` of marks with `fixed_granularity` rows
+    void resizeWithFixedGranularity(size_t size, size_t fixed_granularity);
+};
+
+}
--- a/dbms/src/Storages/MergeTree/MergeTreeIndexReader.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeIndexReader.cpp
@ -10,6 +10,7 @@ MergeTreeIndexReader::MergeTreeIndexReader(
        part->getFullPath() + index->getFileName(), ".idx", marks_count,
        all_mark_ranges, nullptr, false, nullptr,
        part->getFileSizeOrZero(index->getFileName() + ".idx"), 0, DBMS_DEFAULT_BUFFER_SIZE,
+        &part->storage.index_granularity_info,
        ReadBufferFromFileBase::ProfileCallback{}, CLOCK_MONOTONIC_COARSE)
 {
    stream.seekToStart();
--- a/dbms/src/Storages/MergeTree/MergeTreeRangeReader.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeRangeReader.cpp
@ -13,16 +13,18 @@ namespace DB
 {

 MergeTreeRangeReader::DelayedStream::DelayedStream(
-        size_t from_mark, size_t index_granularity, MergeTreeReader * merge_tree_reader)
+        size_t from_mark, MergeTreeReader * merge_tree_reader)
        : current_mark(from_mark), current_offset(0), num_delayed_rows(0)
-        , index_granularity(index_granularity), merge_tree_reader(merge_tree_reader)
+        , merge_tree_reader(merge_tree_reader)
+        , index_granularity(&(merge_tree_reader->data_part->index_granularity))
        , continue_reading(false), is_finished(false)
 {
 }

 size_t MergeTreeRangeReader::DelayedStream::position() const
 {
-    return current_mark * index_granularity + current_offset + num_delayed_rows;
+    size_t num_rows_before_current_mark = index_granularity->getMarkStartingRow(current_mark);
+    return num_rows_before_current_mark + current_offset + num_delayed_rows;
 }

 size_t MergeTreeRangeReader::DelayedStream::readRows(Block & block, size_t num_rows)
@ -32,7 +34,7 @@ size_t MergeTreeRangeReader::DelayedStream::readRows(Block & block, size_t num_r
        size_t rows_read = merge_tree_reader->readRows(current_mark, continue_reading, num_rows, block);
        continue_reading = true;

-        /// Zero rows_read my be either because reading has finished
+        /// Zero rows_read may be either because reading has finished
        ///  or because there is no columns we can read in current part (for example, all columns are default).
        /// In the last case we can't finish reading, but it's also ok for the first case
        ///  because we can finish reading by calculation the number of pending rows.
@ -47,7 +49,11 @@ size_t MergeTreeRangeReader::DelayedStream::readRows(Block & block, size_t num_r

 size_t MergeTreeRangeReader::DelayedStream::read(Block & block, size_t from_mark, size_t offset, size_t num_rows)
 {
-    if (position() == from_mark * index_granularity + offset)
+    size_t num_rows_before_from_mark = index_granularity->getMarkStartingRow(from_mark);
+    /// We are already exactly at the required position.
+    /// Because the stream is lazy, we don't read anything
+    /// and only increase num_delayed_rows.
+    if (position() == num_rows_before_from_mark + offset)
    {
        num_delayed_rows += num_rows;
        return 0;
@ -67,12 +73,25 @@ size_t MergeTreeRangeReader::DelayedStream::read(Block & block, size_t from_mark

 size_t MergeTreeRangeReader::DelayedStream::finalize(Block & block)
 {
+    /// We need to skip some rows before reading
    if (current_offset && !continue_reading)
    {
-        size_t granules_to_skip = current_offset / index_granularity;
-        current_mark += granules_to_skip;
-        current_offset -= granules_to_skip * index_granularity;
+        for (size_t mark_num : ext::range(current_mark, index_granularity->getMarksCount()))
+        {
+            size_t mark_index_granularity = index_granularity->getMarkRows(mark_num);
+            if (current_offset >= mark_index_granularity)
+            {
+                current_offset -= mark_index_granularity;
+                current_mark++;
+            }
+            else
+                break;

+        }
+
+        /// Skip some rows from the beginning of the granule.
+        /// We don't know the size of rows in a compressed granule,
+        /// so we have to read them and throw them away.
        if (current_offset)
        {
            Block temp_block;
@ -89,11 +108,22 @@ size_t MergeTreeRangeReader::DelayedStream::finalize(Block & block)


 MergeTreeRangeReader::Stream::Stream(
-        size_t from_mark, size_t to_mark, size_t index_granularity, MergeTreeReader * merge_tree_reader)
+        size_t from_mark, size_t to_mark, MergeTreeReader * merge_tree_reader)
        : current_mark(from_mark), offset_after_current_mark(0)
-        , index_granularity(index_granularity), last_mark(to_mark)
-        , stream(from_mark, index_granularity, merge_tree_reader)
+        , last_mark(to_mark)
+        , merge_tree_reader(merge_tree_reader)
+        , index_granularity(&(merge_tree_reader->data_part->index_granularity))
+        , current_mark_index_granularity(index_granularity->getMarkRows(from_mark))
+        , stream(from_mark, merge_tree_reader)
 {
+    size_t marks_count = index_granularity->getMarksCount();
+    if (from_mark >= marks_count)
+        throw Exception("Trying to create stream to read from mark №" + toString(current_mark) + " but total marks count is "
+            + toString(marks_count), ErrorCodes::LOGICAL_ERROR);
+
+    if (last_mark > marks_count)
+        throw Exception("Trying to create stream to read to mark №" + toString(last_mark) + " but total marks count is "
+            + toString(marks_count), ErrorCodes::LOGICAL_ERROR);
 }

 void MergeTreeRangeReader::Stream::checkNotFinished() const
@ -104,7 +134,7 @@ void MergeTreeRangeReader::Stream::checkNotFinished() const

 void MergeTreeRangeReader::Stream::checkEnoughSpaceInCurrentGranule(size_t num_rows) const
 {
-    if (num_rows + offset_after_current_mark > index_granularity)
+    if (num_rows + offset_after_current_mark > current_mark_index_granularity)
        throw Exception("Cannot read from granule more than index_granularity.", ErrorCodes::LOGICAL_ERROR);
 }

@ -118,6 +148,21 @@ size_t MergeTreeRangeReader::Stream::readRows(Block & block, size_t num_rows)
    return rows_read;
 }

+void MergeTreeRangeReader::Stream::toNextMark()
+{
+    ++current_mark;
+
+    size_t total_marks_count = index_granularity->getMarksCount();
+    if (current_mark < total_marks_count)
+        current_mark_index_granularity = index_granularity->getMarkRows(current_mark);
+    else if (current_mark == total_marks_count)
+        current_mark_index_granularity = 0; /// HACK?
+    else
+        throw Exception("Trying to read from mark " + toString(current_mark) + ", but total marks count " + toString(total_marks_count), ErrorCodes::LOGICAL_ERROR);
+
+    offset_after_current_mark = 0;
+}
+
 size_t MergeTreeRangeReader::Stream::read(Block & block, size_t num_rows, bool skip_remaining_rows_in_current_granule)
 {
    checkEnoughSpaceInCurrentGranule(num_rows);
@ -127,14 +172,12 @@ size_t MergeTreeRangeReader::Stream::read(Block & block, size_t num_rows, bool s
        checkNotFinished();

        size_t read_rows = readRows(block, num_rows);
+
        offset_after_current_mark += num_rows;

-        if (offset_after_current_mark == index_granularity || skip_remaining_rows_in_current_granule)
-        {
-            /// Start new granule; skipped_rows_after_offset is already zero.
-            ++current_mark;
-            offset_after_current_mark = 0;
-        }
+        /// Start new granule; skipped_rows_after_offset is already zero.
+        if (offset_after_current_mark == current_mark_index_granularity || skip_remaining_rows_in_current_granule)
+            toNextMark();

        return read_rows;
    }
@ -145,9 +188,7 @@ size_t MergeTreeRangeReader::Stream::read(Block & block, size_t num_rows, bool s
        {
            /// Skip the rest of the rows in granule and start new one.
            checkNotFinished();
-
-            ++current_mark;
-            offset_after_current_mark = 0;
+            toNextMark();
        }

        return 0;
@ -163,11 +204,10 @@ void MergeTreeRangeReader::Stream::skip(size_t num_rows)

        offset_after_current_mark += num_rows;

-        if (offset_after_current_mark == index_granularity)
+        if (offset_after_current_mark == current_mark_index_granularity)
        {
            /// Start new granule; skipped_rows_after_offset is already zero.
-            ++current_mark;
-            offset_after_current_mark = 0;
+            toNextMark();
        }
    }
 }
@ -198,7 +238,7 @@ void MergeTreeRangeReader::ReadResult::adjustLastGranule()

    if (num_rows_to_subtract > rows_per_granule.back())
        throw Exception("Can't adjust last granule because it has " + toString(rows_per_granule.back())
-                        + "rows, but try to subtract " + toString(num_rows_to_subtract) + " rows.",
+                        + " rows, but try to subtract " + toString(num_rows_to_subtract) + " rows.",
                        ErrorCodes::LOGICAL_ERROR);

    rows_per_granule.back() -= num_rows_to_subtract;
@ -366,11 +406,11 @@ void MergeTreeRangeReader::ReadResult::setFilter(const ColumnPtr & new_filter)


 MergeTreeRangeReader::MergeTreeRangeReader(
-        MergeTreeReader * merge_tree_reader, size_t index_granularity, MergeTreeRangeReader * prev_reader,
+        MergeTreeReader * merge_tree_reader, MergeTreeRangeReader * prev_reader,
        ExpressionActionsPtr alias_actions, ExpressionActionsPtr prewhere_actions,
        const String * prewhere_column_name, const Names * ordered_names,
        bool always_reorder, bool remove_prewhere_column, bool last_reader_in_chain)
-        : index_granularity(index_granularity), merge_tree_reader(merge_tree_reader)
+        : merge_tree_reader(merge_tree_reader), index_granularity(&(merge_tree_reader->data_part->index_granularity))
        , prev_reader(prev_reader), prewhere_column_name(prewhere_column_name)
        , ordered_names(ordered_names), alias_actions(alias_actions), prewhere_actions(std::move(prewhere_actions))
        , always_reorder(always_reorder), remove_prewhere_column(remove_prewhere_column)
@ -393,8 +433,34 @@ size_t MergeTreeRangeReader::numPendingRowsInCurrentGranule() const
        return prev_reader->numPendingRowsInCurrentGranule();

    auto pending_rows = stream.numPendingRowsInCurrentGranule();
+
+    if (pending_rows)
+        return pending_rows;
+
+    return numRowsInCurrentGranule();
+}
+
+
+size_t MergeTreeRangeReader::numRowsInCurrentGranule() const
+{
    /// If pending_rows is zero, than stream is not initialized.
-    return pending_rows ? pending_rows : index_granularity;
+    if (stream.current_mark_index_granularity)
+        return stream.current_mark_index_granularity;
+
+    /// We haven't read anything yet, so return the granularity of the first mark to read.
+    size_t first_mark = merge_tree_reader->getFirstMarkToRead();
+    return index_granularity->getMarkRows(first_mark);
+}
+
+size_t MergeTreeRangeReader::currentMark() const
+{
+    return stream.currentMark();
+}
+
+size_t MergeTreeRangeReader::Stream::numPendingRows() const
+{
+    size_t rows_between_marks = index_granularity->getRowsCountInRange(current_mark, last_mark);
+    return rows_between_marks - offset_after_current_mark;
 }

 bool MergeTreeRangeReader::isCurrentRangeFinished() const
@ -515,7 +581,7 @@ MergeTreeRangeReader::ReadResult MergeTreeRangeReader::startReadingChain(size_t
            if (stream.isFinished())
            {
                result.addRows(stream.finalize(result.block));
-                stream = Stream(ranges.back().begin, ranges.back().end, index_granularity, merge_tree_reader);
+                stream = Stream(ranges.back().begin, ranges.back().end, merge_tree_reader);
                result.addRange(ranges.back());
                ranges.pop_back();
            }
@ -563,7 +629,7 @@ Block MergeTreeRangeReader::continueReadingChain(ReadResult & result)
            added_rows += stream.finalize(block);
            auto & range = started_ranges[next_range_to_start].range;
            ++next_range_to_start;
-            stream = Stream(range.begin, range.end, index_granularity, merge_tree_reader);
+            stream = Stream(range.begin, range.end, merge_tree_reader);
        }

        bool last = i + 1 == size;
--- a/dbms/src/Storages/MergeTree/MergeTreeRangeReader.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeRangeReader.h
@ -19,7 +19,7 @@ class MergeTreeReader;
 class MergeTreeRangeReader
 {
 public:
-    MergeTreeRangeReader(MergeTreeReader * merge_tree_reader, size_t index_granularity, MergeTreeRangeReader * prev_reader,
+    MergeTreeRangeReader(MergeTreeReader * merge_tree_reader, MergeTreeRangeReader * prev_reader,
                         ExpressionActionsPtr alias_actions, ExpressionActionsPtr prewhere_actions,
                         const String * prewhere_column_name, const Names * ordered_names,
                         bool always_reorder, bool remove_prewhere_column, bool last_reader_in_chain);
@ -30,6 +30,8 @@ public:

    size_t numReadRowsInCurrentGranule() const;
    size_t numPendingRowsInCurrentGranule() const;
+    size_t numRowsInCurrentGranule() const;
+    size_t currentMark() const;

    bool isCurrentRangeFinished() const;
    bool isInitialized() const { return is_initialized; }
@ -38,35 +40,44 @@ public:
    {
    public:
        DelayedStream() = default;
-        DelayedStream(size_t from_mark, size_t index_granularity, MergeTreeReader * merge_tree_reader);
+        DelayedStream(size_t from_mark, MergeTreeReader * merge_tree_reader);

+        /// Read @num_rows rows from @from_mark starting from @offset row
        /// Returns the number of rows added to block.
        /// NOTE: have to return number of rows because block has broken invariant:
        ///       some columns may have different size (for example, default columns may be zero size).
        size_t read(Block & block, size_t from_mark, size_t offset, size_t num_rows);
+
+        /// Skip extra rows to current_offset and perform actual reading
        size_t finalize(Block & block);

        bool isFinished() const { return is_finished; }

    private:
        size_t current_mark = 0;
+        /// Offset from current mark in rows
        size_t current_offset = 0;
+        /// Number of rows we have to read
        size_t num_delayed_rows = 0;

-        size_t index_granularity = 0;
+        /// Actual reader of data from disk
        MergeTreeReader * merge_tree_reader = nullptr;
+        const MergeTreeIndexGranularity * index_granularity = nullptr;
        bool continue_reading = false;
        bool is_finished = true;

+        /// Current position from the beginning of the file, in rows
        size_t position() const;
        size_t readRows(Block & block, size_t num_rows);
    };

+    /// Very thin wrapper for DelayedStream.
+    /// Checks bounds of read ranges and makes steps between marks.
    class Stream
    {
    public:
        Stream() = default;
-        Stream(size_t from_mark, size_t to_mark, size_t index_granularity, MergeTreeReader * merge_tree_reader);
+        Stream(size_t from_mark, size_t to_mark, MergeTreeReader * merge_tree_reader);

        /// Returns the number of rows added to block.
        size_t read(Block & block, size_t num_rows, bool skip_remaining_rows_in_current_granule);
@ -77,23 +88,31 @@ public:
        bool isFinished() const { return current_mark >= last_mark; }

        size_t numReadRowsInCurrentGranule() const { return offset_after_current_mark; }
-        size_t numPendingRowsInCurrentGranule() const { return index_granularity - numReadRowsInCurrentGranule(); }
-        size_t numRendingGranules() const { return last_mark - current_mark; }
-        size_t numPendingRows() const { return numRendingGranules() * index_granularity - offset_after_current_mark; }
+        size_t numPendingRowsInCurrentGranule() const
+        {
+            return current_mark_index_granularity - numReadRowsInCurrentGranule();
+        }
+        size_t numPendingGranules() const { return last_mark - current_mark; }
+        size_t numPendingRows() const;
+        size_t currentMark() const { return current_mark; }

-    private:
        size_t current_mark = 0;
        /// Invariant: offset_after_current_mark + skipped_rows_after_offset < index_granularity
        size_t offset_after_current_mark = 0;

-        size_t index_granularity = 0;
        size_t last_mark = 0;

+        MergeTreeReader * merge_tree_reader = nullptr;
+        const MergeTreeIndexGranularity * index_granularity = nullptr;
+
+        size_t current_mark_index_granularity = 0;
+
        DelayedStream stream;

        void checkNotFinished() const;
        void checkEnoughSpaceInCurrentGranule(size_t num_rows) const;
        size_t readRows(Block & block, size_t num_rows);
+        void toNextMark();
    };

    /// Statistics after next reading step.
@ -142,6 +161,8 @@ public:
    private:
        RangesInfo started_ranges;
        /// The number of rows read from each granule.
+        /// A granule here is not the number of rows between two marks;
+        /// it is the number of rows read in a single read operation.
        NumRows rows_per_granule;
        /// Sum(rows_per_granule)
        size_t total_rows_per_granule = 0;
@ -169,8 +190,8 @@ private:
    void executePrewhereActionsAndFilterColumns(ReadResult & result);
    void filterBlock(Block & block, const IColumn::Filter & filter) const;

-    size_t index_granularity = 0;
    MergeTreeReader * merge_tree_reader = nullptr;
+    const MergeTreeIndexGranularity * index_granularity = nullptr;
    MergeTreeRangeReader * prev_reader = nullptr; /// If not nullptr, read from prev_reader firstly.

    const String * prewhere_column_name = nullptr;
@ -187,4 +208,3 @@ private:
 };

 }
-
--- a/dbms/src/Storages/MergeTree/MergeTreeReader.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeReader.cpp
@ -38,9 +38,9 @@ MergeTreeReader::MergeTreeReader(const String & path,
    size_t aio_threshold, size_t max_read_buffer_size, const ValueSizeMap & avg_value_size_hints,
    const ReadBufferFromFileBase::ProfileCallback & profile_callback,
    clockid_t clock_type)
-    : avg_value_size_hints(avg_value_size_hints), path(path), data_part(data_part), columns(columns)
+    : data_part(data_part), avg_value_size_hints(avg_value_size_hints), path(path), columns(columns)
    , uncompressed_cache(uncompressed_cache), mark_cache(mark_cache), save_marks_in_cache(save_marks_in_cache), storage(storage)
-    , all_mark_ranges(all_mark_ranges), aio_threshold(aio_threshold), max_read_buffer_size(max_read_buffer_size), index_granularity(storage.index_granularity)
+    , all_mark_ranges(all_mark_ranges), aio_threshold(aio_threshold), max_read_buffer_size(max_read_buffer_size)
 {
    try
    {
@ -172,10 +172,12 @@ void MergeTreeReader::addStreams(const String & name, const IDataType & type,
            return;

        streams.emplace(stream_name, std::make_unique<MergeTreeReaderStream>(
-            path + stream_name, DATA_FILE_EXTENSION, data_part->marks_count,
+            path + stream_name, DATA_FILE_EXTENSION, data_part->getMarksCount(),
            all_mark_ranges, mark_cache, save_marks_in_cache,
            uncompressed_cache, data_part->getFileSizeOrZero(stream_name + DATA_FILE_EXTENSION),
-            aio_threshold, max_read_buffer_size, profile_callback, clock_type));
+            aio_threshold, max_read_buffer_size,
+            &storage.index_granularity_info,
+            profile_callback, clock_type));
    };

    IDataType::SubstreamPath substream_path;
--- a/dbms/src/Storages/MergeTree/MergeTreeReader.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeReader.h
@ -51,6 +51,12 @@ public:
    /// If continue_reading is true, continue reading from last state, otherwise seek to from_mark
    size_t readRows(size_t from_mark, bool continue_reading, size_t max_rows_to_read, Block & res);

+    MergeTreeData::DataPartPtr data_part;
+
+    size_t getFirstMarkToRead() const
+    {
+        return all_mark_ranges.back().begin;
+    }
 private:
    using FileStreams = std::map<std::string, std::unique_ptr<MergeTreeReaderStream>>;

@ -60,7 +66,6 @@ private:
    DeserializeBinaryBulkStateMap deserialize_binary_bulk_state_map;
    /// Path to the directory containing the part
    String path;
-    MergeTreeData::DataPartPtr data_part;

    FileStreams streams;

@ -76,7 +81,6 @@ private:
    MarkRanges all_mark_ranges;
    size_t aio_threshold;
    size_t max_read_buffer_size;
-    size_t index_granularity;

    void addStreams(const String & name, const IDataType & type,
        const ReadBufferFromFileBase::ProfileCallback & profile_callback, clockid_t clock_type);
--- a/dbms/src/Storages/MergeTree/MergeTreeReaderStream.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeReaderStream.cpp
@ -15,14 +15,16 @@ namespace ErrorCodes


 MergeTreeReaderStream::MergeTreeReaderStream(
-        const String & path_prefix_, const String & extension_, size_t marks_count_,
+        const String & path_prefix_, const String & data_file_extension_, size_t marks_count_,
        const MarkRanges & all_mark_ranges,
        MarkCache * mark_cache_, bool save_marks_in_cache_,
        UncompressedCache * uncompressed_cache,
        size_t file_size, size_t aio_threshold, size_t max_read_buffer_size,
+        const GranularityInfo * index_granularity_info_,
        const ReadBufferFromFileBase::ProfileCallback & profile_callback, clockid_t clock_type)
-        : path_prefix(path_prefix_), extension(extension_), marks_count(marks_count_)
+        : path_prefix(path_prefix_), data_file_extension(data_file_extension_), marks_count(marks_count_)
        , mark_cache(mark_cache_), save_marks_in_cache(save_marks_in_cache_)
+        , index_granularity_info(index_granularity_info_)
 {
    /// Compute the size of the buffer.
    size_t max_mark_range_bytes = 0;
@ -77,7 +79,7 @@ MergeTreeReaderStream::MergeTreeReaderStream(
    if (uncompressed_cache)
    {
        auto buffer = std::make_unique<CachedCompressedReadBuffer>(
-            path_prefix + extension, uncompressed_cache, sum_mark_range_bytes, aio_threshold, buffer_size);
+            path_prefix + data_file_extension, uncompressed_cache, sum_mark_range_bytes, aio_threshold, buffer_size);

        if (profile_callback)
            buffer->setProfileCallback(profile_callback, clock_type);
@ -88,7 +90,7 @@ MergeTreeReaderStream::MergeTreeReaderStream(
    else
    {
        auto buffer = std::make_unique<CompressedReadBufferFromFile>(
-            path_prefix + extension, sum_mark_range_bytes, aio_threshold, buffer_size);
+            path_prefix + data_file_extension, sum_mark_range_bytes, aio_threshold, buffer_size);

        if (profile_callback)
            buffer->setProfileCallback(profile_callback, clock_type);
@ -109,7 +111,7 @@ const MarkInCompressedFile & MergeTreeReaderStream::getMark(size_t index)

 void MergeTreeReaderStream::loadMarks()
 {
-    std::string mrk_path = path_prefix + ".mrk";
+    std::string mrk_path = index_granularity_info->getMarksFilePath(path_prefix);

    auto load = [&]() -> MarkCache::MappedPtr
    {
@ -117,7 +119,7 @@ void MergeTreeReaderStream::loadMarks()
        auto temporarily_disable_memory_tracker = getCurrentMemoryTrackerActionLock();

        size_t file_size = Poco::File(mrk_path).getSize();
-        size_t expected_file_size = sizeof(MarkInCompressedFile) * marks_count;
+        size_t expected_file_size = index_granularity_info->mark_size_in_bytes * marks_count;
        if (expected_file_size != file_size)
            throw Exception(
                "Bad size of marks file '" + mrk_path + "': " + std::to_string(file_size) + ", must be: " + std::to_string(expected_file_size),
@ -125,12 +127,28 @@ void MergeTreeReaderStream::loadMarks()

        auto res = std::make_shared<MarksInCompressedFile>(marks_count);

-        /// Read directly to marks.
-        ReadBufferFromFile buffer(mrk_path, file_size, -1, reinterpret_cast<char *>(res->data()));
-
-        if (buffer.eof() || buffer.buffer().size() != file_size)
-            throw Exception("Cannot read all marks from file " + mrk_path, ErrorCodes::CANNOT_READ_ALL_DATA);
+        if (!index_granularity_info->is_adaptive)
+        {
+            /// Read directly to marks.
+            ReadBufferFromFile buffer(mrk_path, file_size, -1, reinterpret_cast<char *>(res->data()));

+            if (buffer.eof() || buffer.buffer().size() != file_size)
+                throw Exception("Cannot read all marks from file " + mrk_path, ErrorCodes::CANNOT_READ_ALL_DATA);
+        }
+        else
+        {
+            ReadBufferFromFile buffer(mrk_path, file_size, -1);
+            size_t i = 0;
+            while (!buffer.eof())
+            {
+                readIntBinary((*res)[i].offset_in_compressed_file, buffer);
+                readIntBinary((*res)[i].offset_in_decompressed_block, buffer);
+                buffer.seek(sizeof(size_t), SEEK_CUR);
+                ++i;
+            }
+            if (i * index_granularity_info->mark_size_in_bytes != file_size)
+                throw Exception("Cannot read all marks from file " + mrk_path, ErrorCodes::CANNOT_READ_ALL_DATA);
+        }
        res->protect();
        return res;
    };
--- a/dbms/src/Storages/MergeTree/MergeTreeReaderStream.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeReaderStream.h
@ -13,12 +13,14 @@ namespace DB
 class MergeTreeReaderStream
 {
 public:
+    using GranularityInfo = MergeTreeData::IndexGranularityInfo;
    MergeTreeReaderStream(
-            const String & path_prefix_, const String & extension_, size_t marks_count_,
+            const String & path_prefix_, const String & data_file_extension_, size_t marks_count_,
            const MarkRanges & all_mark_ranges,
            MarkCache * mark_cache, bool save_marks_in_cache,
            UncompressedCache * uncompressed_cache,
            size_t file_size, size_t aio_threshold, size_t max_read_buffer_size,
+            const GranularityInfo * index_granularity_info_,
            const ReadBufferFromFileBase::ProfileCallback & profile_callback, clockid_t clock_type);

    void seekToMark(size_t index);
@ -34,7 +36,7 @@ private:
    void loadMarks();

    std::string path_prefix;
-    std::string extension;
+    std::string data_file_extension;

    size_t marks_count;

@ -42,6 +44,8 @@ private:
    bool save_marks_in_cache;
    MarkCache::MappedPtr marks;

+    const GranularityInfo * index_granularity_info;
+
    std::unique_ptr<CachedCompressedReadBuffer> cached_buffer;
    std::unique_ptr<CompressedReadBufferFromFile> non_cached_buffer;
 };
--- a/dbms/src/Storages/MergeTree/MergeTreeSelectBlockInputStream.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeSelectBlockInputStream.cpp
@ -46,15 +46,15 @@ MergeTreeSelectBlockInputStream::MergeTreeSelectBlockInputStream(
    for (const auto & range : all_mark_ranges)
        total_marks_count += range.end - range.begin;

-    size_t total_rows = total_marks_count * storage.index_granularity;
+    size_t total_rows = data_part->index_granularity.getTotalRows();

    if (!quiet)
        LOG_TRACE(log, "Reading " << all_mark_ranges.size() << " ranges from part " << data_part->name
        << ", approx. " << total_rows
        << (all_mark_ranges.size() > 1
-        ? ", up to " + toString((all_mark_ranges.back().end - all_mark_ranges.front().begin) * storage.index_granularity)
+        ? ", up to " + toString(data_part->index_granularity.getRowsCountInRanges(all_mark_ranges))
        : "")
-        << " rows starting from " << all_mark_ranges.front().begin * storage.index_granularity);
+        << " rows starting from " << data_part->index_granularity.getMarkStartingRow(all_mark_ranges.front().begin));

    addTotalRowsApprox(total_rows);

--- a/dbms/src/Storages/MergeTree/MergeTreeSequentialBlockInputStream.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeSequentialBlockInputStream.cpp
@ -25,7 +25,7 @@ MergeTreeSequentialBlockInputStream::MergeTreeSequentialBlockInputStream(
    if (!quiet)
    {
        std::stringstream message;
-        message << "Reading " << data_part->marks_count << " marks from part " << data_part->name
+        message << "Reading " << data_part->getMarksCount() << " marks from part " << data_part->name
            << ", total " << data_part->rows_count
            << " rows starting from the beginning of the part, columns: ";
        for (size_t i = 0, size = columns_to_read.size(); i < size; ++i)
@ -56,7 +56,7 @@ MergeTreeSequentialBlockInputStream::MergeTreeSequentialBlockInputStream(
    reader = std::make_unique<MergeTreeReader>(
        data_part->getFullPath(), data_part, columns_for_reader, /* uncompressed_cache = */ nullptr,
        mark_cache.get(), /* save_marks_in_cache = */ false, storage,
-        MarkRanges{MarkRange(0, data_part->marks_count)},
+        MarkRanges{MarkRange(0, data_part->getMarksCount())},
        /* bytes to use AIO (this is hack) */
        read_with_direct_io ? 1UL : std::numeric_limits<size_t>::max(),
        DBMS_DEFAULT_BUFFER_SIZE);
@ -91,15 +91,16 @@ try
    Block res;
    if (!isCancelled() && current_row < data_part->rows_count)
    {
+        size_t rows_to_read = data_part->index_granularity.getMarkRows(current_mark);
        bool continue_reading = (current_mark != 0);
-        size_t rows_readed = reader->readRows(current_mark, continue_reading, storage.index_granularity, res);
+        size_t rows_readed = reader->readRows(current_mark, continue_reading, rows_to_read, res);

        if (res)
        {
            res.checkNumberOfRows();

            current_row += rows_readed;
-            current_mark += (rows_readed / storage.index_granularity);
+            current_mark += (rows_to_read == rows_readed);

            bool should_reorder = false, should_evaluate_missing_defaults = false;
            reader->fillMissingColumns(res, should_reorder, should_evaluate_missing_defaults, res.rows());
--- a/dbms/src/Storages/MergeTree/MergeTreeSettings.h
+++ b/dbms/src/Storages/MergeTree/MergeTreeSettings.h
@ -160,7 +160,13 @@ struct MergeTreeSettings
    M(SettingUInt64, finished_mutations_to_keep, 100)                                                         \
                                                                                                              \
    /** Minimal amount of bytes to enable O_DIRECT in merge (0 - disabled) */                                 \
-    M(SettingUInt64, min_merge_bytes_to_use_direct_io, 10ULL * 1024 * 1024 * 1024)
+    M(SettingUInt64, min_merge_bytes_to_use_direct_io, 10ULL * 1024 * 1024 * 1024)                            \
+                                                                                                              \
+    /** Approximate amount of bytes in single granule (0 - disabled) */                                       \
+    M(SettingUInt64, index_granularity_bytes, 0)                                                              \
+                                                                                                              \
+    /** Minimal time in seconds, when merge with TTL can be repeated */                                       \
+    M(SettingInt64, merge_with_ttl_timeout, 3600 * 24)

    /// Settings that should not change after the creation of a table.
 #define APPLY_FOR_IMMUTABLE_MERGE_TREE_SETTINGS(M) \
--- a/dbms/src/Storages/MergeTree/MergeTreeThreadSelectBlockInputStream.cpp
+++ b/dbms/src/Storages/MergeTree/MergeTreeThreadSelectBlockInputStream.cpp
@ -27,10 +27,13 @@ MergeTreeThreadSelectBlockInputStream::MergeTreeThreadSelectBlockInputStream(
    pool{pool}
 {
    /// round min_marks_to_read up to nearest multiple of block_size expressed in marks
-    if (max_block_size_rows)
+    /// If granularity is adaptive, this rounding doesn't make sense.
+    /// Maybe it will make sense to add a `max_block_size_bytes` setting.
+    if (max_block_size_rows && !storage.index_granularity_info.is_adaptive)
    {
-        min_marks_to_read = (min_marks_to_read_ * storage.index_granularity + max_block_size_rows - 1)
-                            / max_block_size_rows * max_block_size_rows / storage.index_granularity;
+        size_t fixed_index_granularity = storage.index_granularity_info.fixed_index_granularity;
+        min_marks_to_read = (min_marks_to_read_ * fixed_index_granularity + max_block_size_rows - 1)
+            / max_block_size_rows * max_block_size_rows / fixed_index_granularity;
    }
    else
        min_marks_to_read = min_marks_to_read_;
--- a/dbms/src/Storages/MergeTree/MergedBlockOutputStream.cpp
+++ b/dbms/src/Storages/MergeTree/MergedBlockOutputStream.cpp
@ -2,6 +2,7 @@
 #include <IO/createWriteBufferFromFileBase.h>
 #include <Common/escapeForFileName.h>
 #include <DataTypes/NestedUtils.h>
+#include <DataStreams/MarkInCompressedFile.h>
 #include <Common/StringUtils/StringUtils.h>
 #include <Common/typeid_cast.h>
 #include <Common/MemoryTracker.h>
@ -20,7 +21,6 @@ namespace
 {

 constexpr auto DATA_FILE_EXTENSION = ".bin";
-constexpr auto MARKS_FILE_EXTENSION = ".mrk";
 constexpr auto INDEX_FILE_EXTENSION = ".idx";

 }
@ -33,13 +33,22 @@ IMergedBlockOutputStream::IMergedBlockOutputStream(
    size_t min_compress_block_size_,
    size_t max_compress_block_size_,
    CompressionCodecPtr codec_,
-    size_t aio_threshold_)
-    : storage(storage_),
-    min_compress_block_size(min_compress_block_size_),
-    max_compress_block_size(max_compress_block_size_),
-    aio_threshold(aio_threshold_),
-    codec(std::move(codec_))
+    size_t aio_threshold_,
+    bool blocks_are_granules_size_,
+    const MergeTreeIndexGranularity & index_granularity_)
+    : storage(storage_)
+    , min_compress_block_size(min_compress_block_size_)
+    , max_compress_block_size(max_compress_block_size_)
+    , aio_threshold(aio_threshold_)
+    , marks_file_extension(storage.index_granularity_info.marks_file_extension)
+    , mark_size_in_bytes(storage.index_granularity_info.mark_size_in_bytes)
+    , blocks_are_granules_size(blocks_are_granules_size_)
+    , index_granularity(index_granularity_)
+    , compute_granularity(index_granularity.empty())
+    , codec(std::move(codec_))
 {
+    if (blocks_are_granules_size && !index_granularity.empty())
+        throw Exception("Can't take information about index granularity from blocks when a non-empty index_granularity array is specified", ErrorCodes::LOGICAL_ERROR);
 }


@ -65,7 +74,7 @@ void IMergedBlockOutputStream::addStreams(
        column_streams[stream_name] = std::make_unique<ColumnStream>(
            stream_name,
            path + stream_name, DATA_FILE_EXTENSION,
-            path + stream_name, MARKS_FILE_EXTENSION,
+            path + stream_name, marks_file_extension,
            effective_codec,
            max_compress_block_size,
            estimated_size,
@ -96,61 +105,68 @@ IDataType::OutputStreamGetter IMergedBlockOutputStream::createStreamGetter(
    };
 }

+void fillIndexGranularityImpl(
+    const Block & block,
+    size_t index_granularity_bytes,
+    size_t fixed_index_granularity_rows,
+    bool blocks_are_granules,
+    size_t index_offset,
+    MergeTreeIndexGranularity & index_granularity)
+{
+    size_t rows_in_block = block.rows();
+    size_t index_granularity_for_block;
+    if (index_granularity_bytes == 0)
+        index_granularity_for_block = fixed_index_granularity_rows;
+    else
+    {
+        size_t block_size_in_memory = block.bytes();
+        if (blocks_are_granules)
+            index_granularity_for_block = rows_in_block;
+        else if (block_size_in_memory >= index_granularity_bytes)
+        {
+            size_t granules_in_block = block_size_in_memory / index_granularity_bytes;
+            index_granularity_for_block = rows_in_block / granules_in_block;
+        }
+        else
+        {
+            size_t size_of_row_in_bytes = block_size_in_memory / rows_in_block;
+            index_granularity_for_block = index_granularity_bytes / size_of_row_in_bytes;
+        }
+    }
+    if (index_granularity_for_block == 0) /// Very rare case: index_granularity_bytes is less than the size of a single row.
+        index_granularity_for_block = 1;

-void IMergedBlockOutputStream::writeData(
+    for (size_t current_row = index_offset; current_row < rows_in_block; current_row += index_granularity_for_block)
+        index_granularity.appendMark(index_granularity_for_block);
+
+}
+
+void IMergedBlockOutputStream::fillIndexGranularity(const Block & block)
+{
+    fillIndexGranularityImpl(
+        block,
+        storage.index_granularity_info.index_granularity_bytes,
+        storage.index_granularity_info.fixed_index_granularity,
+        blocks_are_granules_size,
+        index_offset,
+        index_granularity);
+}
+
+size_t IMergedBlockOutputStream::writeSingleGranule(
    const String & name,
    const IDataType & type,
    const IColumn & column,
    WrittenOffsetColumns & offset_columns,
    bool skip_offsets,
-    IDataType::SerializeBinaryBulkStatePtr & serialization_state)
+    IDataType::SerializeBinaryBulkStatePtr & serialization_state,
+    IDataType::SerializeBinaryBulkSettings & serialize_settings,
+    size_t from_row,
+    size_t number_of_rows,
+    bool write_marks)
 {
-    auto & settings = storage.global_context.getSettingsRef();
-    IDataType::SerializeBinaryBulkSettings serialize_settings;
-    serialize_settings.getter = createStreamGetter(name, offset_columns, skip_offsets);
-    serialize_settings.low_cardinality_max_dictionary_size = settings.low_cardinality_max_dictionary_size;
-    serialize_settings.low_cardinality_use_single_dictionary_for_part = settings.low_cardinality_use_single_dictionary_for_part != 0;
-
-    size_t size = column.size();
-    size_t prev_mark = 0;
-    while (prev_mark < size)
+    if (write_marks)
    {
-        UInt64 limit = 0;
-
-        /// If there is `index_offset`, then the first mark goes not immediately, but after this number of rows.
-        if (prev_mark == 0 && index_offset != 0)
-            limit = index_offset;
-        else
-        {
-            limit = storage.index_granularity;
-
-            /// Write marks.
-            type.enumerateStreams([&] (const IDataType::SubstreamPath & substream_path)
-            {
-                bool is_offsets = !substream_path.empty() && substream_path.back().type == IDataType::Substream::ArraySizes;
-                if (is_offsets && skip_offsets)
-                    return;
-
-                String stream_name = IDataType::getFileNameForStream(name, substream_path);
-
-                /// Don't write offsets more than one time for Nested type.
-                if (is_offsets && offset_columns.count(stream_name))
-                    return;
-
-                ColumnStream & stream = *column_streams[stream_name];
-
-                /// There could already be enough data to compress into the new block.
-                if (stream.compressed.offset() >= min_compress_block_size)
-                    stream.compressed.next();
-
-                writeIntBinary(stream.plain_hashing.count(), stream.marks);
-                writeIntBinary(stream.compressed.offset(), stream.marks);
-            }, serialize_settings.path);
-        }
-
-        type.serializeBinaryBulkWithMultipleStreams(column, prev_mark, limit, serialize_settings, serialization_state);
-
-        /// So that instead of the marks pointing to the end of the compressed block, there were marks pointing to the beginning of the next one.
+        /// Write marks.
        type.enumerateStreams([&] (const IDataType::SubstreamPath & substream_path)
        {
            bool is_offsets = !substream_path.empty() && substream_path.back().type == IDataType::Substream::ArraySizes;
@ -163,10 +179,94 @@ void IMergedBlockOutputStream::writeData(
            if (is_offsets && offset_columns.count(stream_name))
                return;

-            column_streams[stream_name]->compressed.nextIfAtEnd();
-        }, serialize_settings.path);
+            ColumnStream & stream = *column_streams[stream_name];

-        prev_mark += limit;
+            /// There could already be enough data to compress into the new block.
+            if (stream.compressed.offset() >= min_compress_block_size)
+                stream.compressed.next();
+
+            writeIntBinary(stream.plain_hashing.count(), stream.marks);
+            writeIntBinary(stream.compressed.offset(), stream.marks);
+            if (storage.index_granularity_info.is_adaptive)
+                writeIntBinary(number_of_rows, stream.marks);
+        }, serialize_settings.path);
+    }
+
+    type.serializeBinaryBulkWithMultipleStreams(column, from_row, number_of_rows, serialize_settings, serialization_state);
+
+    /// So that instead of the marks pointing to the end of the compressed block, there were marks pointing to the beginning of the next one.
+    type.enumerateStreams([&] (const IDataType::SubstreamPath & substream_path)
+    {
+        bool is_offsets = !substream_path.empty() && substream_path.back().type == IDataType::Substream::ArraySizes;
+        if (is_offsets && skip_offsets)
+            return;
+
+        String stream_name = IDataType::getFileNameForStream(name, substream_path);
+
+        /// Don't write offsets more than one time for Nested type.
+        if (is_offsets && offset_columns.count(stream_name))
+            return;
+
+        column_streams[stream_name]->compressed.nextIfAtEnd();
+    }, serialize_settings.path);
+
+    return from_row + number_of_rows;
+}
+
+std::pair<size_t, size_t> IMergedBlockOutputStream::writeColumn(
+    const String & name,
+    const IDataType & type,
+    const IColumn & column,
+    WrittenOffsetColumns & offset_columns,
+    bool skip_offsets,
+    IDataType::SerializeBinaryBulkStatePtr & serialization_state,
+    size_t from_mark)
+{
+    auto & settings = storage.global_context.getSettingsRef();
+    IDataType::SerializeBinaryBulkSettings serialize_settings;
+    serialize_settings.getter = createStreamGetter(name, offset_columns, skip_offsets);
+    serialize_settings.low_cardinality_max_dictionary_size = settings.low_cardinality_max_dictionary_size;
+    serialize_settings.low_cardinality_use_single_dictionary_for_part = settings.low_cardinality_use_single_dictionary_for_part != 0;
+
+    size_t total_rows = column.size();
+    size_t current_row = 0;
+    size_t current_column_mark = from_mark;
+    while (current_row < total_rows)
+    {
+        size_t rows_to_write;
+        bool write_marks = true;
+
+        /// If `index_offset` is non-zero, the first mark is written not immediately, but after that many rows.
+        if (current_row == 0 && index_offset != 0)
+        {
+            write_marks = false;
+            rows_to_write = index_offset;
+        }
+        else
+        {
+            if (index_granularity.getMarksCount() <= current_column_mark)
+                throw Exception(
+                    "Incorrect size of index granularity: expected mark " + toString(current_column_mark) + ", but total marks count is " + toString(index_granularity.getMarksCount()),
+                    ErrorCodes::LOGICAL_ERROR);
+
+            rows_to_write = index_granularity.getMarkRows(current_column_mark);
+        }
+
+        current_row = writeSingleGranule(
+            name,
+            type,
+            column,
+            offset_columns,
+            skip_offsets,
+            serialization_state,
+            serialize_settings,
+            current_row,
+            rows_to_write,
+            write_marks
+        );
+
+        if (write_marks)
+            current_column_mark++;
    }

    /// Memoize offsets for Nested types, that are already written. They will not be written again for next columns of Nested structure.
@ -179,6 +279,8 @@ void IMergedBlockOutputStream::writeData(
            offset_columns.insert(stream_name);
        }
    }, serialize_settings.path);
+
+    return std::make_pair(current_column_mark, current_row - total_rows);
 }


@ -237,11 +339,14 @@ MergedBlockOutputStream::MergedBlockOutputStream(
    MergeTreeData & storage_,
    String part_path_,
    const NamesAndTypesList & columns_list_,
-    CompressionCodecPtr default_codec_)
+    CompressionCodecPtr default_codec_,
+    bool blocks_are_granules_size_)
    : IMergedBlockOutputStream(
        storage_, storage_.global_context.getSettings().min_compress_block_size,
        storage_.global_context.getSettings().max_compress_block_size, default_codec_,
-        storage_.global_context.getSettings().min_bytes_to_use_direct_io),
+        storage_.global_context.getSettings().min_bytes_to_use_direct_io,
+        blocks_are_granules_size_,
+        {}),
    columns_list(columns_list_), part_path(part_path_)
 {
    init();
@ -258,11 +363,12 @@ MergedBlockOutputStream::MergedBlockOutputStream(
    const NamesAndTypesList & columns_list_,
    CompressionCodecPtr default_codec_,
    const MergeTreeData::DataPart::ColumnToSize & merged_column_to_size_,
-    size_t aio_threshold_)
+    size_t aio_threshold_,
+    bool blocks_are_granules_size_)
    : IMergedBlockOutputStream(
        storage_, storage_.global_context.getSettings().min_compress_block_size,
        storage_.global_context.getSettings().max_compress_block_size, default_codec_,
-        aio_threshold_),
+        aio_threshold_, blocks_are_granules_size_, {}),
    columns_list(columns_list_), part_path(part_path_)
 {
    init();
@ -392,6 +498,16 @@ void MergedBlockOutputStream::writeSuffixAndFinalizePart(
        checksums.files["count.txt"].file_hash = count_out_hashing.getHash();
    }

+    if (new_part->ttl_infos.part_min_ttl)
+    {
+        /// Write a file with ttl infos in json format.
+        WriteBufferFromFile out(part_path + "ttl.txt", 4096);
+        HashingWriteBuffer out_hashing(out);
+        new_part->ttl_infos.write(out_hashing);
+        checksums.files["ttl.txt"].file_size = out_hashing.count();
+        checksums.files["ttl.txt"].file_hash = out_hashing.getHash();
+    }
+
    {
        /// Write a file with a description of columns.
        WriteBufferFromFile out(part_path + "columns.txt", 4096);
@ -405,12 +521,12 @@ void MergedBlockOutputStream::writeSuffixAndFinalizePart(
    }

    new_part->rows_count = rows_count;
-    new_part->marks_count = marks_count;
    new_part->modification_time = time(nullptr);
    new_part->columns = *total_column_list;
    new_part->index.assign(std::make_move_iterator(index_columns.begin()), std::make_move_iterator(index_columns.end()));
    new_part->checksums = checksums;
    new_part->bytes_on_disk = checksums.getTotalSizeOnDisk();
+    new_part->index_granularity = index_granularity;
 }

 void MergedBlockOutputStream::init()
@ -431,7 +547,7 @@ void MergedBlockOutputStream::init()
                std::make_unique<ColumnStream>(
                        stream_name,
                        part_path + stream_name, INDEX_FILE_EXTENSION,
-                        part_path + stream_name, MARKS_FILE_EXTENSION,
+                        part_path + stream_name, marks_file_extension,
                        codec, max_compress_block_size,
                        0, aio_threshold));
        skip_indices_aggregators.push_back(index->createIndexAggregator());
@ -444,6 +560,14 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
 {
    block.checkNumberOfRows();
    size_t rows = block.rows();
+    if (!rows)
+        return;
+
+    /// Fill index granularity for this block
+    /// if it's unknown (on data insert or horizontal merge,
+    /// but not on vertical merge)
+    if (compute_granularity)
+        fillIndexGranularity(block);

    /// The set of written offset columns so that you do not write shared offsets of nested structures columns several times
    WrittenOffsetColumns offset_columns;
@ -509,6 +633,7 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
        }
    }

+    size_t new_index_offset = 0;
    /// Now write the data.
    auto it = columns_list.begin();
    for (size_t i = 0; i < columns_list.size(); ++i, ++it)
@ -522,23 +647,23 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
            if (primary_key_column_name_to_position.end() != primary_column_it)
            {
                const auto & primary_column = *primary_key_columns[primary_column_it->second].column;
-                writeData(column.name, *column.type, primary_column, offset_columns, false, serialization_states[i]);
+                std::tie(std::ignore, new_index_offset) = writeColumn(column.name, *column.type, primary_column, offset_columns, false, serialization_states[i], current_mark);
            }
            else if (skip_indexes_column_name_to_position.end() != skip_index_column_it)
            {
                const auto & index_column = *skip_indexes_columns[skip_index_column_it->second].column;
-                writeData(column.name, *column.type, index_column, offset_columns, false, serialization_states[i]);
+                writeColumn(column.name, *column.type, index_column, offset_columns, false, serialization_states[i], current_mark);
            }
            else
            {
                /// We rearrange the columns that are not included in the primary key here; Then the result is released - to save RAM.
                ColumnPtr permuted_column = column.column->permute(*permutation, 0);
-                writeData(column.name, *column.type, *permuted_column, offset_columns, false, serialization_states[i]);
+                std::tie(std::ignore, new_index_offset) = writeColumn(column.name, *column.type, *permuted_column, offset_columns, false, serialization_states[i], current_mark);
            }
        }
        else
        {
-            writeData(column.name, *column.type, *column.column, offset_columns, false, serialization_states[i]);
+            std::tie(std::ignore, new_index_offset) = writeColumn(column.name, *column.type, *column.column, offset_columns, false, serialization_states[i], current_mark);
        }
    }

@ -547,13 +672,14 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
    {
        /// Creating block for update
        Block indices_update_block(skip_indexes_columns);
-        /// Filling and writing skip indices like in IMergedBlockOutputStream::writeData
+        /// Filling and writing skip indices like in IMergedBlockOutputStream::writeColumn
        for (size_t i = 0; i < storage.skip_indices.size(); ++i)
        {
            const auto index = storage.skip_indices[i];
            auto & stream = *skip_indices_streams[i];
            size_t prev_pos = 0;

+            size_t current_mark = 0;
            while (prev_pos < rows)
            {
                UInt64 limit = 0;
@ -563,7 +689,7 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
                }
                else
                {
-                    limit = storage.index_granularity;
+                    limit = index_granularity.getMarkRows(current_mark);
                    if (skip_indices_aggregators[i]->empty())
                    {
                        skip_indices_aggregators[i] = index->createIndexAggregator();
@ -574,6 +700,10 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm

                        writeIntBinary(stream.plain_hashing.count(), stream.marks);
                        writeIntBinary(stream.compressed.offset(), stream.marks);
+                        /// Actually these numbers are redundant, but we have to store them
+                        /// to be compatible with the normal .mrk2 file format
+                        if (storage.index_granularity_info.is_adaptive)
+                            writeIntBinary(1UL, stream.marks);
                    }
                }

@ -592,6 +722,7 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
                    }
                }
                prev_pos = pos;
+                current_mark++;
            }
        }
    }
@ -606,7 +737,7 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
        auto temporarily_disable_memory_tracker = getCurrentMemoryTrackerActionLock();

        /// Write index. The index contains Primary Key value for each `index_granularity` row.
-        for (size_t i = index_offset; i < rows; i += storage.index_granularity)
+        for (size_t i = index_offset; i < rows;)
        {
            if (storage.hasPrimaryKey())
            {
@ -618,12 +749,15 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
                }
            }

-            ++marks_count;
+            ++current_mark;
+            if (current_mark < index_granularity.getMarksCount())
+                i += index_granularity.getMarkRows(current_mark);
+            else
+                break;
        }
    }

-    size_t written_for_last_mark = (storage.index_granularity - index_offset + rows) % storage.index_granularity;
-    index_offset = (storage.index_granularity - written_for_last_mark) % storage.index_granularity;
+    index_offset = new_index_offset;
 }


@ -632,11 +766,14 @@ void MergedBlockOutputStream::writeImpl(const Block & block, const IColumn::Perm
 MergedColumnOnlyOutputStream::MergedColumnOnlyOutputStream(
    MergeTreeData & storage_, const Block & header_, String part_path_, bool sync_,
    CompressionCodecPtr default_codec_, bool skip_offsets_,
-    WrittenOffsetColumns & already_written_offset_columns)
+    WrittenOffsetColumns & already_written_offset_columns,
+    const MergeTreeIndexGranularity & index_granularity_)
    : IMergedBlockOutputStream(
        storage_, storage_.global_context.getSettings().min_compress_block_size,
        storage_.global_context.getSettings().max_compress_block_size, default_codec_,
-        storage_.global_context.getSettings().min_bytes_to_use_direct_io),
+        storage_.global_context.getSettings().min_bytes_to_use_direct_io,
+        false,
+        index_granularity_),
    header(header_), part_path(part_path_), sync(sync_), skip_offsets(skip_offsets_),
    already_written_offset_columns(already_written_offset_columns)
 {
@ -666,17 +803,17 @@ void MergedColumnOnlyOutputStream::write(const Block & block)
        initialized = true;
    }

-    size_t rows = block.rows();
-
+    size_t new_index_offset = 0;
+    size_t new_current_mark = 0;
    WrittenOffsetColumns offset_columns = already_written_offset_columns;
    for (size_t i = 0; i < block.columns(); ++i)
    {
        const ColumnWithTypeAndName & column = block.safeGetByPosition(i);
-        writeData(column.name, *column.type, *column.column, offset_columns, skip_offsets, serialization_states[i]);
+        std::tie(new_current_mark, new_index_offset) = writeColumn(column.name, *column.type, *column.column, offset_columns, skip_offsets, serialization_states[i], current_mark);
    }

-    size_t written_for_last_mark = (storage.index_granularity - index_offset + rows) % storage.index_granularity;
-    index_offset = (storage.index_granularity - written_for_last_mark) % storage.index_granularity;
+    index_offset = new_index_offset;
+    current_mark = new_current_mark;
 }

 void MergedColumnOnlyOutputStream::writeSuffix()
--- a/dbms/src/Storages/MergeTree/MergedBlockOutputStream.h
+++ b/dbms/src/Storages/MergeTree/MergedBlockOutputStream.h
@ -1,5 +1,6 @@
 #pragma once

+#include <Storages/MergeTree/MergeTreeIndexGranularity.h>
 #include <IO/WriteBufferFromFile.h>
 #include <Compression/CompressedWriteBuffer.h>
 #include <IO/HashingWriteBuffer.h>
@ -21,7 +22,9 @@ public:
        size_t min_compress_block_size_,
        size_t max_compress_block_size_,
        CompressionCodecPtr default_codec_,
-        size_t aio_threshold_);
+        size_t aio_threshold_,
+        bool blocks_are_granules_size_,
+        const MergeTreeIndexGranularity & index_granularity_);

    using WrittenOffsetColumns = std::set<std::string>;

@ -72,8 +75,33 @@ protected:
    IDataType::OutputStreamGetter createStreamGetter(const String & name, WrittenOffsetColumns & offset_columns, bool skip_offsets);

    /// Write data of one column.
-    void writeData(const String & name, const IDataType & type, const IColumn & column, WrittenOffsetColumns & offset_columns,
-                   bool skip_offsets, IDataType::SerializeBinaryBulkStatePtr & serialization_state);
+    /// Returns the index of the mark following the last one written and
+    /// the number of rows still missing from the last granule (the next `index_offset`).
+    std::pair<size_t, size_t> writeColumn(
+        const String & name,
+        const IDataType & type,
+        const IColumn & column,
+        WrittenOffsetColumns & offset_columns,
+        bool skip_offsets,
+        IDataType::SerializeBinaryBulkStatePtr & serialization_state,
+        size_t from_mark
+    );
+
+    /// Write single granule of one column (rows between 2 marks)
+    size_t writeSingleGranule(
+        const String & name,
+        const IDataType & type,
+        const IColumn & column,
+        WrittenOffsetColumns & offset_columns,
+        bool skip_offsets,
+        IDataType::SerializeBinaryBulkStatePtr & serialization_state,
+        IDataType::SerializeBinaryBulkSettings & serialize_settings,
+        size_t from_row,
+        size_t number_of_rows,
+        bool write_marks);
+
+    /// Compute index granularity for the block and store it in `index_granularity`
+    void fillIndexGranularity(const Block & block);

    MergeTreeData & storage;

@ -87,6 +115,15 @@ protected:

    size_t aio_threshold;

+    size_t current_mark = 0;
+
+    const std::string marks_file_extension;
+    const size_t mark_size_in_bytes;
+    const bool blocks_are_granules_size;
+
+    MergeTreeIndexGranularity index_granularity;
+
+    const bool compute_granularity;
    CompressionCodecPtr codec;
 };

@ -101,7 +138,8 @@ public:
        MergeTreeData & storage_,
        String part_path_,
        const NamesAndTypesList & columns_list_,
-        CompressionCodecPtr default_codec_);
+        CompressionCodecPtr default_codec_,
+        bool blocks_are_granules_size_ = false);

    MergedBlockOutputStream(
        MergeTreeData & storage_,
@ -109,7 +147,8 @@ public:
        const NamesAndTypesList & columns_list_,
        CompressionCodecPtr default_codec_,
        const MergeTreeData::DataPart::ColumnToSize & merged_column_to_size_,
-        size_t aio_threshold_);
+        size_t aio_threshold_,
+        bool blocks_are_granules_size_ = false);

    std::string getPartPath() const;

@ -125,11 +164,17 @@ public:

    void writeSuffix() override;

+    /// Finalize writing the part and fill inner structures
    void writeSuffixAndFinalizePart(
            MergeTreeData::MutableDataPartPtr & new_part,
            const NamesAndTypesList * total_columns_list = nullptr,
            MergeTreeData::DataPart::Checksums * additional_column_checksums = nullptr);

+    const MergeTreeIndexGranularity & getIndexGranularity() const
+    {
+        return index_granularity;
+    }
+
 private:
    void init();

@ -144,7 +189,6 @@ private:
    String part_path;

    size_t rows_count = 0;
-    size_t marks_count = 0;

    std::unique_ptr<WriteBufferFromFile> index_file_stream;
    std::unique_ptr<HashingWriteBuffer> index_stream;
@ -166,7 +210,8 @@ public:
    MergedColumnOnlyOutputStream(
        MergeTreeData & storage_, const Block & header_, String part_path_, bool sync_,
        CompressionCodecPtr default_codec_, bool skip_offsets_,
-        WrittenOffsetColumns & already_written_offset_columns);
+        WrittenOffsetColumns & already_written_offset_columns,
+        const MergeTreeIndexGranularity & index_granularity_);

    Block getHeader() const override { return header; }
    void write(const Block & block) override;