mirror of https://github.com/ClickHouse/ClickHouse.git
synced 2024-11-22 07:31:57 +00:00

Merge branch 'master' into FawnD2-switch-upstream-for-arrow-submodule

This commit is contained in:
commit 252f2ae876

@ -1,4 +1,4 @@
-## system.table_name {#system-tables_table-name}
+# system.table_name {#system-tables_table-name}

Description.

@ -16,4 +16,6 @@ You can also use the following database engines:

- [Lazy](../../engines/database-engines/lazy.md)

+- [MaterializeMySQL](../../engines/database-engines/materialize-mysql.md)

[Original article](https://clickhouse.tech/docs/en/database_engines/) <!--hide-->

156 docs/en/engines/database-engines/materialize-mysql.md Normal file
@ -0,0 +1,156 @@

---
toc_priority: 29
toc_title: MaterializeMySQL
---

# MaterializeMySQL {#materialize-mysql}

Creates a ClickHouse database with all the tables existing in MySQL, and all the data in those tables.

The ClickHouse server works as a MySQL replica. It reads the binlog and performs DDL and DML queries.

## Creating a Database {#creating-a-database}

``` sql
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MaterializeMySQL('host:port', ['database' | database], 'user', 'password') [SETTINGS ...]
```

**Engine Parameters**

- `host:port` — MySQL server endpoint.
- `database` — MySQL database name.
- `user` — MySQL user.
- `password` — User password.

## Virtual columns {#virtual-columns}

When working with the `MaterializeMySQL` database engine, [ReplacingMergeTree](../../engines/table-engines/mergetree-family/replacingmergetree.md) tables are used with virtual `_sign` and `_version` columns.

- `_version` — Transaction counter. Type [UInt64](../../sql-reference/data-types/int-uint.md).
- `_sign` — Deletion mark. Type [Int8](../../sql-reference/data-types/int-uint.md). Possible values:
    - `1` — Row is not deleted,
    - `-1` — Row is deleted.
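
A hedged sketch of inspecting these columns explicitly (the `mysql.test` table comes from the examples further below):

``` sql
SELECT a, b, _sign, _version FROM mysql.test;
```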

## Data Types Support {#data_types-support}

| MySQL                   | ClickHouse                                                   |
|-------------------------|--------------------------------------------------------------|
| TINY                    | [Int8](../../sql-reference/data-types/int-uint.md)           |
| SHORT                   | [Int16](../../sql-reference/data-types/int-uint.md)          |
| INT24                   | [Int32](../../sql-reference/data-types/int-uint.md)          |
| LONG                    | [UInt32](../../sql-reference/data-types/int-uint.md)         |
| LONGLONG                | [UInt64](../../sql-reference/data-types/int-uint.md)         |
| FLOAT                   | [Float32](../../sql-reference/data-types/float.md)           |
| DOUBLE                  | [Float64](../../sql-reference/data-types/float.md)           |
| DECIMAL, NEWDECIMAL     | [Decimal](../../sql-reference/data-types/decimal.md)         |
| DATE, NEWDATE           | [Date](../../sql-reference/data-types/date.md)               |
| DATETIME, TIMESTAMP     | [DateTime](../../sql-reference/data-types/datetime.md)       |
| DATETIME2, TIMESTAMP2   | [DateTime64](../../sql-reference/data-types/datetime64.md)   |
| STRING                  | [String](../../sql-reference/data-types/string.md)           |
| VARCHAR, VAR_STRING     | [String](../../sql-reference/data-types/string.md)           |
| BLOB                    | [String](../../sql-reference/data-types/string.md)           |

Other types are not supported. If a MySQL table contains a column of an unsupported type, ClickHouse throws the "Unhandled data type" exception and stops replication.

[Nullable](../../sql-reference/data-types/nullable.md) is supported.

## Specifics and Recommendations {#specifics-and-recommendations}

### DDL Queries {#ddl-queries}

MySQL DDL queries are converted into the corresponding ClickHouse DDL queries ([ALTER](../../sql-reference/statements/alter/index.md), [CREATE](../../sql-reference/statements/create/index.md), [DROP](../../sql-reference/statements/drop.md), [RENAME](../../sql-reference/statements/rename.md)). If ClickHouse cannot parse some DDL query, the query is ignored.

### Data Replication {#data-replication}

MaterializeMySQL does not support direct `INSERT`, `DELETE` and `UPDATE` queries. However, they are supported in terms of data replication:

- A MySQL `INSERT` query is converted into an `INSERT` with `_sign=1`.

- A MySQL `DELETE` query is converted into an `INSERT` with `_sign=-1`.

- A MySQL `UPDATE` query is converted into an `INSERT` with `_sign=-1` and an `INSERT` with `_sign=1`.

### Selecting from MaterializeMySQL Tables {#select}

A `SELECT` query from MaterializeMySQL tables has some specifics:

- If `_version` is not specified in the `SELECT` query, the [FINAL](../../sql-reference/statements/select/from.md#select-from-final) modifier is used, so only rows with `MAX(_version)` are selected.

- If `_sign` is not specified in the `SELECT` query, `WHERE _sign=1` is used by default, so deleted rows are not included in the result set.
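
In other words, the following two queries are equivalent; a minimal sketch using the `mysql.test` table from the examples below:

``` sql
SELECT a, b FROM mysql.test;
SELECT a, b FROM mysql.test FINAL WHERE _sign = 1;
```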

### Index Conversion {#index-conversion}

MySQL `PRIMARY KEY` and `INDEX` clauses are converted into `ORDER BY` tuples in ClickHouse tables.

ClickHouse has only one physical order, which is determined by the `ORDER BY` clause. To create a new physical order, use [materialized views](../../sql-reference/statements/create/view.md#materialized).
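
For instance, a hedged sketch of a materialized view that maintains a second physical order over the replicated table (the view name and ordering column are illustrative):

``` sql
CREATE MATERIALIZED VIEW test_sorted_by_b
ENGINE = MergeTree ORDER BY b
AS SELECT * FROM mysql.test;
```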

**Notes**

- Rows with `_sign=-1` are not deleted physically from the tables.
- Cascade `UPDATE/DELETE` queries are not supported by the `MaterializeMySQL` engine.
- Replication can be easily broken.
- Manual operations on the database and tables are forbidden.

## Examples of Use {#examples-of-use}

Queries in MySQL:

``` sql
mysql> CREATE DATABASE db;
mysql> CREATE TABLE db.test (a INT PRIMARY KEY, b INT);
mysql> INSERT INTO db.test VALUES (1, 11), (2, 22);
mysql> DELETE FROM db.test WHERE a=1;
mysql> ALTER TABLE db.test ADD COLUMN c VARCHAR(16);
mysql> UPDATE db.test SET c='Wow!', b=222;
mysql> SELECT * FROM db.test;
```

```text
+---+------+------+
| a | b    | c    |
+---+------+------+
| 2 | 222  | Wow! |
+---+------+------+
```

Database in ClickHouse, exchanging data with the MySQL server:

The database and the table created:

``` sql
CREATE DATABASE mysql ENGINE = MaterializeMySQL('localhost:3306', 'db', 'user', '***');
SHOW TABLES FROM mysql;
```

``` text
┌─name─┐
│ test │
└──────┘
```

After inserting data:

``` sql
SELECT * FROM mysql.test;
```

``` text
┌─a─┬──b─┐
│ 1 │ 11 │
│ 2 │ 22 │
└───┴────┘
```

After deleting data, adding the column and updating:

``` sql
SELECT * FROM mysql.test;
```

``` text
┌─a─┬───b─┬─c────┐
│ 2 │ 222 │ Wow! │
└───┴─────┴──────┘
```

[Original article](https://clickhouse.tech/docs/en/database_engines/materialize-mysql/) <!--hide-->

317 docs/en/getting-started/example-datasets/recipes.md Normal file
@ -0,0 +1,317 @@

---
toc_priority: 16
toc_title: Recipes Dataset
---

# Recipes Dataset

The RecipeNLG dataset is available for download [here](https://recipenlg.cs.put.poznan.pl/dataset). It contains 2.2 million recipes. The size is slightly less than 1 GB.

## Download and unpack the dataset

Accept the Terms and Conditions and download the dataset [here](https://recipenlg.cs.put.poznan.pl/dataset). Unpack the zip file with `unzip`. You will get the `full_dataset.csv` file.

## Create a table

Run `clickhouse-client` and execute the following `CREATE TABLE` query:

```
CREATE TABLE recipes
(
    title String,
    ingredients Array(String),
    directions Array(String),
    link String,
    source LowCardinality(String),
    NER Array(String)
) ENGINE = MergeTree ORDER BY title;
```

## Insert the data

Run the following command:

```
clickhouse-client --query "
    INSERT INTO recipes
    SELECT
        title,
        JSONExtract(ingredients, 'Array(String)'),
        JSONExtract(directions, 'Array(String)'),
        link,
        source,
        JSONExtract(NER, 'Array(String)')
    FROM input('num UInt32, title String, ingredients String, directions String, link String, source LowCardinality(String), NER String')
    FORMAT CSVWithNames
" --input_format_with_names_use_header 0 --format_csv_allow_single_quote 0 --input_format_allow_errors_num 10 < full_dataset.csv
```

This is a showcase of how to parse custom CSV, as it requires several tuning parameters.

Explanation:

- the dataset is in CSV format, but it requires some preprocessing on insertion; we use the table function [input](../../sql-reference/table-functions/input/) to perform the preprocessing;
- the structure of the CSV file is specified in the argument of the table function `input`;
- the field `num` (row number) is unneeded; we parse it from the file and ignore it;
- we use `FORMAT CSVWithNames`, but the header in the CSV is ignored (via the command-line parameter `--input_format_with_names_use_header 0`), because the header does not contain the name of the first field;
- the file uses only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and a single quote must not be parsed as a string delimiter, which is why we also add the `--format_csv_allow_single_quote 0` parameter;
- some strings from the CSV cannot be parsed, because they contain the `\M/` sequence at the beginning of a value; the only value that can start with a backslash in CSV is `\N`, which is parsed as SQL NULL; we add the `--input_format_allow_errors_num 10` parameter, so up to ten malformed records can be skipped;
- the ingredients, directions and NER fields are arrays; these arrays are represented in an unusual form: they are serialized into a string as JSON and then placed into the CSV, so we parse them as String and then use the [JSONExtract](../../sql-reference/functions/json-functions/) function to transform them into Array (see the sketch below).
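
As a minimal sketch of that last step, `JSONExtract` turns a JSON-array string into a ClickHouse `Array(String)` directly (the literal below is illustrative):

```
SELECT JSONExtract('["eggs", "salt", "flour"]', 'Array(String)') AS parsed
-- parsed = ['eggs','salt','flour']
```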

## Validate the inserted data

Check the row count:

```
SELECT count() FROM recipes

┌─count()─┐
│ 2231141 │
└─────────┘
```

## Example queries

### Top components by the number of recipes

```
SELECT
    arrayJoin(NER) AS k,
    count() AS c
FROM recipes
GROUP BY k
ORDER BY c DESC
LIMIT 50

┌─k────────────────────┬──────c─┐
│ salt                 │ 890741 │
│ sugar                │ 620027 │
│ butter               │ 493823 │
│ flour                │ 466110 │
│ eggs                 │ 401276 │
│ onion                │ 372469 │
│ garlic               │ 358364 │
│ milk                 │ 346769 │
│ water                │ 326092 │
│ vanilla              │ 270381 │
│ olive oil            │ 197877 │
│ pepper               │ 179305 │
│ brown sugar          │ 174447 │
│ tomatoes             │ 163933 │
│ egg                  │ 160507 │
│ baking powder        │ 148277 │
│ lemon juice          │ 146414 │
│ Salt                 │ 122557 │
│ cinnamon             │ 117927 │
│ sour cream           │ 116682 │
│ cream cheese         │ 114423 │
│ margarine            │ 112742 │
│ celery               │ 112676 │
│ baking soda          │ 110690 │
│ parsley              │ 102151 │
│ chicken              │ 101505 │
│ onions               │  98903 │
│ vegetable oil        │  91395 │
│ oil                  │  85600 │
│ mayonnaise           │  84822 │
│ pecans               │  79741 │
│ nuts                 │  78471 │
│ potatoes             │  75820 │
│ carrots              │  75458 │
│ pineapple            │  74345 │
│ soy sauce            │  70355 │
│ black pepper         │  69064 │
│ thyme                │  68429 │
│ mustard              │  65948 │
│ chicken broth        │  65112 │
│ bacon                │  64956 │
│ honey                │  64626 │
│ oregano              │  64077 │
│ ground beef          │  64068 │
│ unsalted butter      │  63848 │
│ mushrooms            │  61465 │
│ Worcestershire sauce │  59328 │
│ cornstarch           │  58476 │
│ green pepper         │  58388 │
│ Cheddar cheese       │  58354 │
└──────────────────────┴────────┘

50 rows in set. Elapsed: 0.112 sec. Processed 2.23 million rows, 361.57 MB (19.99 million rows/s., 3.24 GB/s.)
```

In this example, we use the [arrayJoin](../../sql-reference/functions/array-join/) function to expand each row into multiple rows, one per array element.
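
On its own, `arrayJoin` works like this (a minimal sketch):

```
SELECT arrayJoin(['salt', 'sugar']) AS k

┌─k─────┐
│ salt  │
│ sugar │
└───────┘
```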

### The most complex recipes with strawberry

```
SELECT
    title,
    length(NER),
    length(directions)
FROM recipes
WHERE has(NER, 'strawberry')
ORDER BY length(directions) DESC
LIMIT 10

┌─title────────────────────────────────────────────────────────────┬─length(NER)─┬─length(directions)─┐
│ Chocolate-Strawberry-Orange Wedding Cake                         │          24 │                126 │
│ Strawberry Cream Cheese Crumble Tart                             │          19 │                 47 │
│ Charlotte-Style Ice Cream                                        │          11 │                 45 │
│ Sinfully Good a Million Layers Chocolate Layer Cake, With Strawb │          31 │                 45 │
│ Sweetened Berries With Elderflower Sherbet                       │          24 │                 44 │
│ Chocolate-Strawberry Mousse Cake                                 │          15 │                 42 │
│ Rhubarb Charlotte with Strawberries and Rum                      │          20 │                 42 │
│ Chef Joey's Strawberry Vanilla Tart                              │           7 │                 37 │
│ Old-Fashioned Ice Cream Sundae Cake                              │          17 │                 37 │
│ Watermelon Cake                                                  │          16 │                 36 │
└──────────────────────────────────────────────────────────────────┴─────────────┴────────────────────┘

10 rows in set. Elapsed: 0.215 sec. Processed 2.23 million rows, 1.48 GB (10.35 million rows/s., 6.86 GB/s.)
```

In this example, we use the [has](../../sql-reference/functions/array-functions/#hasarr-elem) function to filter by array elements and sort by the number of directions.
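
A minimal sketch of `has` by itself:

```
SELECT has(['strawberry', 'sugar'], 'strawberry') AS found

┌─found─┐
│     1 │
└───────┘
```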

There is a wedding cake that requires a whole 126 steps to produce!

Let's look at those directions:

```
SELECT arrayJoin(directions)
FROM recipes
WHERE title = 'Chocolate-Strawberry-Orange Wedding Cake'

┌─arrayJoin(directions)───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Position 1 rack in center and 1 rack in bottom third of oven and preheat to 350F. │
│ Butter one 5-inch-diameter cake pan with 2-inch-high sides, one 8-inch-diameter cake pan with 2-inch-high sides and one 12-inch-diameter cake pan with 2-inch-high sides. │
│ Dust pans with flour; line bottoms with parchment. │
│ Combine 1/3 cup orange juice and 2 ounces unsweetened chocolate in heavy small saucepan. │
│ Stir mixture over medium-low heat until chocolate melts. │
│ Remove from heat. │
│ Gradually mix in 1 2/3 cups orange juice. │
│ Sift 3 cups flour, 2/3 cup cocoa, 2 teaspoons baking soda, 1 teaspoon salt and 1/2 teaspoon baking powder into medium bowl. │
│ using electric mixer, beat 1 cup (2 sticks) butter and 3 cups sugar in large bowl until blended (mixture will look grainy). │
│ Add 4 eggs, 1 at a time, beating to blend after each. │
│ Beat in 1 tablespoon orange peel and 1 tablespoon vanilla extract. │
│ Add dry ingredients alternately with orange juice mixture in 3 additions each, beating well after each addition. │
│ Mix in 1 cup chocolate chips. │
│ Transfer 1 cup plus 2 tablespoons batter to prepared 5-inch pan, 3 cups batter to prepared 8-inch pan and remaining batter (about 6 cups) to 12-inch pan. │
│ Place 5-inch and 8-inch pans on center rack of oven. │
│ Place 12-inch pan on lower rack of oven. │
│ Bake cakes until tester inserted into center comes out clean, about 35 minutes. │
│ Transfer cakes in pans to racks and cool completely. │
│ Mark 4-inch diameter circle on one 6-inch-diameter cardboard cake round. │
│ Cut out marked circle. │
│ Mark 7-inch-diameter circle on one 8-inch-diameter cardboard cake round. │
│ Cut out marked circle. │
│ Mark 11-inch-diameter circle on one 12-inch-diameter cardboard cake round. │
│ Cut out marked circle. │
│ Cut around sides of 5-inch-cake to loosen. │
│ Place 4-inch cardboard over pan. │
│ Hold cardboard and pan together; turn cake out onto cardboard. │
│ Peel off parchment.Wrap cakes on its cardboard in foil. │
│ Repeat turning out, peeling off parchment and wrapping cakes in foil, using 7-inch cardboard for 8-inch cake and 11-inch cardboard for 12-inch cake. │
│ Using remaining ingredients, make 1 more batch of cake batter and bake 3 more cake layers as described above. │
│ Cool cakes in pans. │
│ Cover cakes in pans tightly with foil. │
│ (Can be prepared ahead. │
│ Let stand at room temperature up to 1 day or double-wrap all cake layers and freeze up to 1 week. │
│ Bring cake layers to room temperature before using.) │
│ Place first 12-inch cake on its cardboard on work surface. │
│ Spread 2 3/4 cups ganache over top of cake and all the way to edge. │
│ Spread 2/3 cup jam over ganache, leaving 1/2-inch chocolate border at edge. │
│ Drop 1 3/4 cups white chocolate frosting by spoonfuls over jam. │
│ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
│ Rub some cocoa powder over second 12-inch cardboard. │
│ Cut around sides of second 12-inch cake to loosen. │
│ Place cardboard, cocoa side down, over pan. │
│ Turn cake out onto cardboard. │
│ Peel off parchment. │
│ Carefully slide cake off cardboard and onto filling on first 12-inch cake. │
│ Refrigerate. │
│ Place first 8-inch cake on its cardboard on work surface. │
│ Spread 1 cup ganache over top all the way to edge. │
│ Spread 1/4 cup jam over, leaving 1/2-inch chocolate border at edge. │
│ Drop 1 cup white chocolate frosting by spoonfuls over jam. │
│ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
│ Rub some cocoa over second 8-inch cardboard. │
│ Cut around sides of second 8-inch cake to loosen. │
│ Place cardboard, cocoa side down, over pan. │
│ Turn cake out onto cardboard. │
│ Peel off parchment. │
│ Slide cake off cardboard and onto filling on first 8-inch cake. │
│ Refrigerate. │
│ Place first 5-inch cake on its cardboard on work surface. │
│ Spread 1/2 cup ganache over top of cake and all the way to edge. │
│ Spread 2 tablespoons jam over, leaving 1/2-inch chocolate border at edge. │
│ Drop 1/3 cup white chocolate frosting by spoonfuls over jam. │
│ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
│ Rub cocoa over second 6-inch cardboard. │
│ Cut around sides of second 5-inch cake to loosen. │
│ Place cardboard, cocoa side down, over pan. │
│ Turn cake out onto cardboard. │
│ Peel off parchment. │
│ Slide cake off cardboard and onto filling on first 5-inch cake. │
│ Chill all cakes 1 hour to set filling. │
│ Place 12-inch tiered cake on its cardboard on revolving cake stand. │
│ Spread 2 2/3 cups frosting over top and sides of cake as a first coat. │
│ Refrigerate cake. │
│ Place 8-inch tiered cake on its cardboard on cake stand. │
│ Spread 1 1/4 cups frosting over top and sides of cake as a first coat. │
│ Refrigerate cake. │
│ Place 5-inch tiered cake on its cardboard on cake stand. │
│ Spread 3/4 cup frosting over top and sides of cake as a first coat. │
│ Refrigerate all cakes until first coats of frosting set, about 1 hour. │
│ (Cakes can be made to this point up to 1 day ahead; cover and keep refrigerate.) │
│ Prepare second batch of frosting, using remaining frosting ingredients and following directions for first batch. │
│ Spoon 2 cups frosting into pastry bag fitted with small star tip. │
│ Place 12-inch cake on its cardboard on large flat platter. │
│ Place platter on cake stand. │
│ Using icing spatula, spread 2 1/2 cups frosting over top and sides of cake; smooth top. │
│ Using filled pastry bag, pipe decorative border around top edge of cake. │
│ Refrigerate cake on platter. │
│ Place 8-inch cake on its cardboard on cake stand. │
│ Using icing spatula, spread 1 1/2 cups frosting over top and sides of cake; smooth top. │
│ Using pastry bag, pipe decorative border around top edge of cake. │
│ Refrigerate cake on its cardboard. │
│ Place 5-inch cake on its cardboard on cake stand. │
│ Using icing spatula, spread 3/4 cup frosting over top and sides of cake; smooth top. │
│ Using pastry bag, pipe decorative border around top edge of cake, spooning more frosting into bag if necessary. │
│ Refrigerate cake on its cardboard. │
│ Keep all cakes refrigerated until frosting sets, about 2 hours. │
│ (Can be prepared 2 days ahead. │
│ Cover loosely; keep refrigerated.) │
│ Place 12-inch cake on platter on work surface. │
│ Press 1 wooden dowel straight down into and completely through center of cake. │
│ Mark dowel 1/4 inch above top of frosting. │
│ Remove dowel and cut with serrated knife at marked point. │
│ Cut 4 more dowels to same length. │
│ Press 1 cut dowel back into center of cake. │
│ Press remaining 4 cut dowels into cake, positioning 3 1/2 inches inward from cake edges and spacing evenly. │
│ Place 8-inch cake on its cardboard on work surface. │
│ Press 1 dowel straight down into and completely through center of cake. │
│ Mark dowel 1/4 inch above top of frosting. │
│ Remove dowel and cut with serrated knife at marked point. │
│ Cut 3 more dowels to same length. │
│ Press 1 cut dowel back into center of cake. │
│ Press remaining 3 cut dowels into cake, positioning 2 1/2 inches inward from edges and spacing evenly. │
│ Using large metal spatula as aid, place 8-inch cake on its cardboard atop dowels in 12-inch cake, centering carefully. │
│ Gently place 5-inch cake on its cardboard atop dowels in 8-inch cake, centering carefully. │
│ Using citrus stripper, cut long strips of orange peel from oranges. │
│ Cut strips into long segments. │
│ To make orange peel coils, wrap peel segment around handle of wooden spoon; gently slide peel off handle so that peel keeps coiled shape. │
│ Garnish cake with orange peel coils, ivy or mint sprigs, and some berries. │
│ (Assembled cake can be made up to 8 hours ahead. │
│ Let stand at cool room temperature.) │
│ Remove top and middle cake tiers. │
│ Remove dowels from cakes. │
│ Cut top and middle cakes into slices. │
│ To cut 12-inch cake: Starting 3 inches inward from edge and inserting knife straight down, cut through from top to bottom to make 6-inch-diameter circle in center of cake. │
│ Cut outer portion of cake into slices; cut inner portion into slices and serve with strawberries. │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

126 rows in set. Elapsed: 0.011 sec. Processed 8.19 thousand rows, 5.34 MB (737.75 thousand rows/s., 480.59 MB/s.)
```

### Online playground

The dataset is also available in the [Playground](https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUCiAgICBhcnJheUpvaW4oTkVSKSBBUyBrLAogICAgY291bnQoKSBBUyBjCkZST00gcmVjaXBlcwpHUk9VUCBCWSBrCk9SREVSIEJZIGMgREVTQwpMSU1JVCA1MA==).

@ -11,9 +11,9 @@ ClickHouse provides a native command-line client: `clickhouse-client`. The clien

``` bash
$ clickhouse-client
-ClickHouse client version 19.17.1.1579 (official build).
+ClickHouse client version 20.13.1.5273 (official build).
Connecting to localhost:9000 as user default.
-Connected to ClickHouse server version 19.17.1 revision 54428.
+Connected to ClickHouse server version 20.13.1 revision 54442.

:)
```

@ -127,6 +127,8 @@ You can pass parameters to `clickhouse-client` (all parameters have a default va
- `--history_file` — Path to a file containing command history.
- `--param_<name>` — Value for a [query with parameters](#cli-queries-with-parameters).

+Since version 20.5, `clickhouse-client` has automatic syntax highlighting (always enabled).

### Configuration Files {#configuration_files}

`clickhouse-client` uses the first existing file of the following:

@ -60,5 +60,7 @@ toc_title: Client Libraries
- [pillar](https://github.com/sofakingworld/pillar)
- Nim
    - [nim-clickhouse](https://github.com/leonardoce/nim-clickhouse)
+- Haskell
+    - [hdbc-clickhouse](https://github.com/zaneli/hdbc-clickhouse)

[Original article](https://clickhouse.tech/docs/en/interfaces/third-party/client_libraries/) <!--hide-->

@ -53,6 +53,11 @@ Merges the intermediate aggregation states in the same way as the -Merge combina

Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array items and returns an array of results. For example, `sumForEach` for the arrays `[1, 2]`, `[3, 4, 5]` and `[6, 7]` returns the result `[10, 13, 5]` after adding together the corresponding array items.
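
A minimal runnable sketch of that example, with the arrays supplied via a subquery:

``` sql
SELECT sumForEach(arr)
FROM (SELECT arrayJoin([[1, 2], [3, 4, 5], [6, 7]]) AS arr)
-- returns [10, 13, 5]
```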

## -Distinct {#agg-functions-combinator-distinct}

Every unique combination of arguments is aggregated only once. Repeated values are ignored.
Examples: `sum(DISTINCT x)`, `groupArray(DISTINCT x)`, `corrStableDistinct(DISTINCT x, y)` and so on.
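
For instance, a sketch over a generated sequence:

``` sql
SELECT sum(DISTINCT number % 3) FROM numbers(10)
-- the distinct values of number % 3 are 0, 1 and 2, so the result is 3
```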

## -OrDefault {#agg-functions-combinator-ordefault}

Changes the behavior of an aggregate function.
@ -244,4 +249,5 @@ FROM people
└────────┴───────────────────────────┘
```

[Original article](https://clickhouse.tech/docs/en/query_language/agg_functions/combinators/) <!--hide-->

@ -11,9 +11,9 @@ ClickHouse provides its own command-line client

``` bash
$ clickhouse-client
-ClickHouse client version 19.17.1.1579 (official build).
+ClickHouse client version 20.13.1.5273 (official build).
Connecting to localhost:9000 as user default.
-Connected to ClickHouse server version 19.17.1 revision 54428.
+Connected to ClickHouse server version 20.13.1 revision 54442.

:)
```

@ -131,6 +131,8 @@ $ clickhouse-client --param_tuple_in_tuple="(10, ('dt', 10))" -q "SELECT * FROM
- `--secure` — if specified, a secure channel is used.
- `--param_<name>` — the value of a parameter for a [query with parameters](#cli-queries-with-parameters).

+Starting with version 20.5, `clickhouse-client` has automatic syntax highlighting (always enabled).

### Configuration files {#konfiguratsionnye-faily}

`clickhouse-client` uses the first existing file of the following:

@ -51,6 +51,11 @@ toc_title: "\u041a\u043e\u043c\u0431\u0438\u043d\u0430\u0442\u043e\u0440\u044b\u

Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array elements and returns an array of results. For example, `sumForEach` for the arrays `[1, 2]`, `[3, 4, 5]` and `[6, 7]` returns the result `[10, 13, 5]` after adding together the corresponding array elements.

## -Distinct {#agg-functions-combinator-distinct}

With the Distinct combinator, every unique combination of argument values is taken into account by the aggregate function only once.
Examples: `sum(DISTINCT x)`, `groupArray(DISTINCT x)`, `corrStableDistinct(DISTINCT x, y)` and so on.

## -OrDefault {#agg-functions-combinator-ordefault}

Changes the behavior of an aggregate function.
@ -59,7 +64,7 @@ toc_title: "\u041a\u043e\u043c\u0431\u0438\u043d\u0430\u0442\u043e\u0440\u044b\u

`-OrDefault` can be used with other combinators.

**Syntax**

``` sql
<aggFunction>OrDefault(x)
@ -69,8 +74,8 @@ toc_title: "\u041a\u043e\u043c\u0431\u0438\u043d\u0430\u0442\u043e\u0440\u044b\u

- `x` — Aggregate function parameters.

**Returned values**

Returns the default value of the aggregate function's return type if there is nothing to aggregate.

The data type depends on the aggregate function used.
@ -116,11 +121,11 @@ FROM

Changes the behavior of an aggregate function.

The combinator converts the result of an aggregate function to the [Nullable](../data-types/nullable.md) data type. If the aggregate function has no input data, with the combinator it returns [NULL](../syntax.md#null-literal).

`-OrNull` can be used with other combinators.

**Syntax**

``` sql
<aggFunction>OrNull(x)
@ -129,8 +134,8 @@ FROM
**Parameters**

- `x` — Aggregate function parameters.

**Returned values**

- The result of the aggregate function, converted to the `Nullable` data type.
- `NULL` if the aggregate function has no input data.

@ -376,7 +376,7 @@ bool ContextAccess::checkAccessImpl2(const AccessFlags & flags, const Args &...
        return true;
    };

-    auto access_denied = [&](const String & error_msg, int error_code)
+    auto access_denied = [&](const String & error_msg, int error_code [[maybe_unused]])
    {
        if (trace_log)
            LOG_TRACE(trace_log, "Access denied: {}{}", (AccessRightsElement{flags, args...}.toString()),

@ -558,7 +558,7 @@ bool ContextAccess::checkAdminOptionImpl2(const Container & role_ids, const GetN
    if (!std::size(role_ids) || is_full_access)
        return true;

-    auto show_error = [this](const String & msg, int error_code)
+    auto show_error = [this](const String & msg, int error_code [[maybe_unused]])
    {
        UNUSED(this);
        if constexpr (throw_if_denied)

@ -13,12 +13,12 @@ struct MultiEnum
    MultiEnum() = default;

    template <typename ... EnumValues, typename = std::enable_if_t<std::conjunction_v<std::is_same<EnumTypeT, EnumValues>...>>>
-    explicit MultiEnum(EnumValues ... v)
+    constexpr explicit MultiEnum(EnumValues ... v)
        : MultiEnum((toBitFlag(v) | ... | 0u))
    {}

    template <typename ValueType, typename = std::enable_if_t<std::is_convertible_v<ValueType, StorageType>>>
-    explicit MultiEnum(ValueType v)
+    constexpr explicit MultiEnum(ValueType v)
        : bitset(v)
    {
        static_assert(std::is_unsigned_v<ValueType>);

@ -95,5 +95,5 @@ struct MultiEnum
private:
    StorageType bitset = 0;

-    static StorageType toBitFlag(EnumType v) { return StorageType{1} << static_cast<StorageType>(v); }
+    static constexpr StorageType toBitFlag(EnumType v) { return StorageType{1} << static_cast<StorageType>(v); }
};

@ -14,6 +14,7 @@
#include <DataTypes/DataTypeString.h>
#include <DataTypes/DataTypeDateTime.h>
#include <DataTypes/DataTypeDateTime64.h>
+#include <DataTypes/DataTypeEnum.h>
#include <DataTypes/DataTypesNumber.h>
#include <DataTypes/DataTypesDecimal.h>

@ -360,9 +361,9 @@ DataTypePtr getLeastSupertype(const DataTypes & types)
            maximize(max_bits_of_unsigned_integer, 64);
        else if (typeid_cast<const DataTypeUInt256 *>(type.get()))
            maximize(max_bits_of_unsigned_integer, 256);
-        else if (typeid_cast<const DataTypeInt8 *>(type.get()))
+        else if (typeid_cast<const DataTypeInt8 *>(type.get()) || typeid_cast<const DataTypeEnum8 *>(type.get()))
            maximize(max_bits_of_signed_integer, 8);
-        else if (typeid_cast<const DataTypeInt16 *>(type.get()))
+        else if (typeid_cast<const DataTypeInt16 *>(type.get()) || typeid_cast<const DataTypeEnum16 *>(type.get()))
            maximize(max_bits_of_signed_integer, 16);
        else if (typeid_cast<const DataTypeInt32 *>(type.get()))
            maximize(max_bits_of_signed_integer, 32);

@ -27,8 +27,14 @@ void ExpressionInfoMatcher::visit(const ASTFunction & ast_function, const ASTPtr
        const auto & function = FunctionFactory::instance().tryGet(ast_function.name, data.context);

        /// Skip lambda, tuple and other special functions
-        if (function && function->isStateful())
-            data.is_stateful_function = true;
+        if (function)
+        {
+            if (function->isStateful())
+                data.is_stateful_function = true;
+
+            if (!function->isDeterministicInScopeOfQuery())
+                data.is_deterministic_function = false;
+        }
    }
}

@ -21,6 +21,7 @@ struct ExpressionInfoMatcher
    bool is_array_join = false;
    bool is_stateful_function = false;
    bool is_aggregate_function = false;
+    bool is_deterministic_function = true;
    std::unordered_set<size_t> unique_reference_tables_pos = {};
};

@ -34,6 +34,7 @@
#include <Interpreters/QueryLog.h>
#include <Interpreters/TranslateQualifiedNamesVisitor.h>
#include <Interpreters/getTableExpressions.h>
+#include <Interpreters/processColumnTransformers.h>

namespace
{

@ -52,7 +53,6 @@ namespace ErrorCodes
    extern const int LOGICAL_ERROR;
}

InterpreterInsertQuery::InterpreterInsertQuery(
    const ASTPtr & query_ptr_, const Context & context_, bool allow_materialized_, bool no_squash_, bool no_destination_)
    : query_ptr(query_ptr_)

@ -95,27 +95,7 @@ Block InterpreterInsertQuery::getSampleBlock(

    Block table_sample = metadata_snapshot->getSampleBlock();

-    /// Process column transformers (e.g. * EXCEPT(a)), asterisks and qualified columns.
-    const auto & columns = metadata_snapshot->getColumns();
-    auto names_and_types = columns.getOrdinary();
-    removeDuplicateColumns(names_and_types);
-    auto table_expr = std::make_shared<ASTTableExpression>();
-    table_expr->database_and_table_name = createTableIdentifier(table->getStorageID());
-    table_expr->children.push_back(table_expr->database_and_table_name);
-    TablesWithColumns tables_with_columns;
-    tables_with_columns.emplace_back(DatabaseAndTableWithAlias(*table_expr, context.getCurrentDatabase()), names_and_types);
-
-    tables_with_columns[0].addHiddenColumns(columns.getMaterialized());
-    tables_with_columns[0].addHiddenColumns(columns.getAliases());
-    tables_with_columns[0].addHiddenColumns(table->getVirtuals());
-
-    NameSet source_columns_set;
-    for (const auto & identifier : query.columns->children)
-        source_columns_set.insert(identifier->getColumnName());
-    TranslateQualifiedNamesVisitor::Data visitor_data(source_columns_set, tables_with_columns);
-    TranslateQualifiedNamesVisitor visitor(visitor_data);
-    auto columns_ast = query.columns->clone();
-    visitor.visit(columns_ast);
+    const auto columns_ast = processColumnTransformers(context.getCurrentDatabase(), table, metadata_snapshot, query.columns);

    /// Form the block based on the column names from the query
    Block res;

@ -5,13 +5,18 @@
#include <Interpreters/InterpreterOptimizeQuery.h>
#include <Access/AccessRightsElement.h>
#include <Common/typeid_cast.h>
#include <Parsers/ASTExpressionList.h>

#include <Interpreters/processColumnTransformers.h>

#include <memory>

namespace DB
{

namespace ErrorCodes
{
    extern const int THERE_IS_NO_COLUMN;
}

@ -27,7 +32,44 @@ BlockIO InterpreterOptimizeQuery::execute()
    auto table_id = context.resolveStorageID(ast, Context::ResolveOrdinary);
    StoragePtr table = DatabaseCatalog::instance().getTable(table_id, context);
    auto metadata_snapshot = table->getInMemoryMetadataPtr();
-    table->optimize(query_ptr, metadata_snapshot, ast.partition, ast.final, ast.deduplicate, context);
+
+    // An empty list of names means we deduplicate by all columns, but the user can explicitly state which columns to use.
+    Names column_names;
+    if (ast.deduplicate_by_columns)
+    {
+        // The user requested a custom set of columns for deduplication, possibly with a Column Transformer expression.
+        {
+            // Expand asterisk, column transformers, etc. into a list of column names.
+            const auto cols = processColumnTransformers(context.getCurrentDatabase(), table, metadata_snapshot, ast.deduplicate_by_columns);
+            for (const auto & col : cols->children)
+                column_names.emplace_back(col->getColumnName());
+        }
+
+        metadata_snapshot->check(column_names, NamesAndTypesList{}, table_id);
+        Names required_columns;
+        {
+            required_columns = metadata_snapshot->getColumnsRequiredForSortingKey();
+            const auto partitioning_cols = metadata_snapshot->getColumnsRequiredForPartitionKey();
+            required_columns.reserve(required_columns.size() + partitioning_cols.size());
+            required_columns.insert(required_columns.end(), partitioning_cols.begin(), partitioning_cols.end());
+        }
+        for (const auto & required_col : required_columns)
+        {
+            // Deduplication is performed only for adjacent rows in a block,
+            // and all rows in a block are in the sorting key order within a single partition,
+            // hence deduplication always implicitly takes sorting keys and partition keys into account.
+            // So we just explicitly state that limitation in order to avoid confusion.
+            if (std::find(column_names.begin(), column_names.end(), required_col) == column_names.end())
+                throw Exception(ErrorCodes::THERE_IS_NO_COLUMN,
+                    "DEDUPLICATE BY expression must include all columns used in table's"
+                    " ORDER BY, PRIMARY KEY, or PARTITION BY but '{}' is missing."
+                    " Expanded DEDUPLICATE BY columns expression: ['{}']",
+                    required_col, fmt::join(column_names, "', '"));
+        }
+    }
+
+    table->optimize(query_ptr, metadata_snapshot, ast.partition, ast.final, ast.deduplicate, column_names, context);

    return {};
}
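
For reference, a hedged sketch of the SQL shape this interpreter now handles (table and column names are illustrative; the parser tests below list the accepted forms):

``` sql
OPTIMIZE TABLE t FINAL DEDUPLICATE BY * EXCEPT value;
```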

@ -90,8 +90,8 @@ std::vector<ASTs> PredicateExpressionsOptimizer::extractTablesPredicates(const A
        ExpressionInfoVisitor::Data expression_info{.context = context, .tables = tables_with_columns};
        ExpressionInfoVisitor(expression_info).visit(predicate_expression);

-        if (expression_info.is_stateful_function)
-            return {}; /// give up the optimization when the predicate contains stateful function
+        if (expression_info.is_stateful_function || !expression_info.is_deterministic_function)
+            return {}; /// Not optimized when the predicate contains a stateful or non-deterministic function

        if (!expression_info.is_array_join)
        {

49 src/Interpreters/processColumnTransformers.cpp Normal file
@ -0,0 +1,49 @@

#include <Interpreters/processColumnTransformers.h>

#include <Interpreters/DatabaseAndTableWithAlias.h>
#include <Interpreters/TranslateQualifiedNamesVisitor.h>
#include <Interpreters/getTableExpressions.h>
#include <Parsers/ASTIdentifier.h>
#include <Parsers/ASTTablesInSelectQuery.h>
#include <Parsers/IAST.h>
#include <Storages/IStorage.h>
#include <Storages/StorageInMemoryMetadata.h>

namespace DB
{

ASTPtr processColumnTransformers(
    const String & current_database,
    const StoragePtr & table,
    const StorageMetadataPtr & metadata_snapshot,
    ASTPtr query_columns)
{
    const auto & columns = metadata_snapshot->getColumns();
    auto names_and_types = columns.getOrdinary();
    removeDuplicateColumns(names_and_types);

    TablesWithColumns tables_with_columns;
    {
        auto table_expr = std::make_shared<ASTTableExpression>();
        table_expr->database_and_table_name = createTableIdentifier(table->getStorageID());
        table_expr->children.push_back(table_expr->database_and_table_name);
        tables_with_columns.emplace_back(DatabaseAndTableWithAlias(*table_expr, current_database), names_and_types);
    }

    tables_with_columns[0].addHiddenColumns(columns.getMaterialized());
    tables_with_columns[0].addHiddenColumns(columns.getAliases());
    tables_with_columns[0].addHiddenColumns(table->getVirtuals());

    NameSet source_columns_set;
    for (const auto & identifier : query_columns->children)
        source_columns_set.insert(identifier->getColumnName());

    TranslateQualifiedNamesVisitor::Data visitor_data(source_columns_set, tables_with_columns);
    TranslateQualifiedNamesVisitor visitor(visitor_data);
    auto columns_ast = query_columns->clone();
    visitor.visit(columns_ast);

    return columns_ast;
}

}

19 src/Interpreters/processColumnTransformers.h Normal file
@ -0,0 +1,19 @@

#pragma once

#include <Parsers/IAST_fwd.h>
#include <Storages/IStorage_fwd.h>

namespace DB
{

struct StorageInMemoryMetadata;
using StorageMetadataPtr = std::shared_ptr<const StorageInMemoryMetadata>;

/// Process column transformers (e.g. * EXCEPT(a)), asterisks and qualified columns.
ASTPtr processColumnTransformers(
    const String & current_database,
    const StoragePtr & table,
    const StorageMetadataPtr & metadata_snapshot,
    ASTPtr query_columns);

}

@ -157,6 +157,7 @@ SRCS(
    interpretSubquery.cpp
    join_common.cpp
    loadMetadata.cpp
+    processColumnTransformers.cpp
    sortBlock.cpp

)

@ -23,6 +23,12 @@ void ASTOptimizeQuery::formatQueryImpl(const FormatSettings & settings, FormatSt

    if (deduplicate)
        settings.ostr << (settings.hilite ? hilite_keyword : "") << " DEDUPLICATE" << (settings.hilite ? hilite_none : "");

+    if (deduplicate_by_columns)
+    {
+        settings.ostr << (settings.hilite ? hilite_keyword : "") << " BY " << (settings.hilite ? hilite_none : "");
+        deduplicate_by_columns->formatImpl(settings, state, frame);
+    }
}

}

@ -16,9 +16,11 @@ public:
    /// The partition to optimize can be specified.
    ASTPtr partition;
    /// A flag can be specified - perform optimization "to the end" instead of one step.
-    bool final;
+    bool final = false;
    /// Do deduplicate (default: false)
-    bool deduplicate;
+    bool deduplicate = false;
+    /// Deduplicate by columns.
+    ASTPtr deduplicate_by_columns;

    /** Get the text that identifies this element. */
    String getID(char delim) const override

@ -37,6 +39,12 @@ public:
            res->children.push_back(res->partition);
        }

+        if (deduplicate_by_columns)
+        {
+            res->deduplicate_by_columns = deduplicate_by_columns->clone();
+            res->children.push_back(res->deduplicate_by_columns);
+        }
+
        return res;
    }

@ -1259,7 +1259,7 @@ bool ParserColumnsMatcher::parseImpl(Pos & pos, ASTPtr & node, Expected & expect
        res->children.push_back(regex_node);
    }

-    ParserColumnsTransformers transformers_p;
+    ParserColumnsTransformers transformers_p(allowed_transformers);
    ASTPtr transformer;
    while (transformers_p.parse(pos, transformer, expected))
    {

@ -1278,7 +1278,7 @@ bool ParserColumnsTransformers::parseImpl(Pos & pos, ASTPtr & node, Expected & e
    ParserKeyword as("AS");
    ParserKeyword strict("STRICT");

-    if (apply.ignore(pos, expected))
+    if (allowed_transformers.isSet(ColumnTransformer::APPLY) && apply.ignore(pos, expected))
    {
        bool with_open_round_bracket = false;

@ -1331,7 +1331,7 @@ bool ParserColumnsTransformers::parseImpl(Pos & pos, ASTPtr & node, Expected & e
        node = std::move(res);
        return true;
    }
-    else if (except.ignore(pos, expected))
+    else if (allowed_transformers.isSet(ColumnTransformer::EXCEPT) && except.ignore(pos, expected))
    {
        if (strict.ignore(pos, expected))
            is_strict = true;

@ -1371,7 +1371,7 @@ bool ParserColumnsTransformers::parseImpl(Pos & pos, ASTPtr & node, Expected & e
        node = std::move(res);
        return true;
    }
-    else if (replace.ignore(pos, expected))
+    else if (allowed_transformers.isSet(ColumnTransformer::REPLACE) && replace.ignore(pos, expected))
    {
        if (strict.ignore(pos, expected))
            is_strict = true;

@ -1434,7 +1434,7 @@ bool ParserAsterisk::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
    ++pos;
    auto asterisk = std::make_shared<ASTAsterisk>();
-    ParserColumnsTransformers transformers_p;
+    ParserColumnsTransformers transformers_p(allowed_transformers);
    ASTPtr transformer;
    while (transformers_p.parse(pos, transformer, expected))
    {

@ -1,6 +1,7 @@
#pragma once

#include <Core/Field.h>
+#include <Core/MultiEnum.h>
#include <Parsers/IParserBase.h>

@ -70,12 +71,47 @@ protected:
    bool allow_query_parameter;
};

+/** *, t.*, db.table.*, COLUMNS('<regular expression>') APPLY(...) or EXCEPT(...) or REPLACE(...)
+  */
+class ParserColumnsTransformers : public IParserBase
+{
+public:
+    enum class ColumnTransformer : UInt8
+    {
+        APPLY,
+        EXCEPT,
+        REPLACE,
+    };
+    using ColumnTransformers = MultiEnum<ColumnTransformer, UInt8>;
+    static constexpr auto AllTransformers = ColumnTransformers{ColumnTransformer::APPLY, ColumnTransformer::EXCEPT, ColumnTransformer::REPLACE};
+
+    ParserColumnsTransformers(ColumnTransformers allowed_transformers_ = AllTransformers, bool is_strict_ = false)
+        : allowed_transformers(allowed_transformers_)
+        , is_strict(is_strict_)
+    {}
+
+protected:
+    const char * getName() const override { return "COLUMNS transformers"; }
+    bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
+    ColumnTransformers allowed_transformers;
+    bool is_strict;
+};
+
/// Just *
class ParserAsterisk : public IParserBase
{
public:
+    using ColumnTransformers = ParserColumnsTransformers::ColumnTransformers;
+    ParserAsterisk(ColumnTransformers allowed_transformers_ = ParserColumnsTransformers::AllTransformers)
+        : allowed_transformers(allowed_transformers_)
+    {}
+
protected:
    const char * getName() const override { return "asterisk"; }
    bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
+
+    ColumnTransformers allowed_transformers;
};

/** Something like t.* or db.table.*

@ -91,21 +127,17 @@ protected:
  */
class ParserColumnsMatcher : public IParserBase
{
public:
+    using ColumnTransformers = ParserColumnsTransformers::ColumnTransformers;
+    ParserColumnsMatcher(ColumnTransformers allowed_transformers_ = ParserColumnsTransformers::AllTransformers)
+        : allowed_transformers(allowed_transformers_)
+    {}
+
protected:
    const char * getName() const override { return "COLUMNS matcher"; }
    bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
};

-/** *, t.*, db.table.*, COLUMNS('<regular expression>') APPLY(...) or EXCEPT(...) or REPLACE(...)
-  */
-class ParserColumnsTransformers : public IParserBase
-{
-public:
-    ParserColumnsTransformers(bool is_strict_ = false): is_strict(is_strict_) {}
-protected:
-    const char * getName() const override { return "COLUMNS transformers"; }
-    bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
-    bool is_strict;
-};

/** A function, for example, f(x, y + 1, g(z)).

@ -4,11 +4,24 @@

#include <Parsers/ASTOptimizeQuery.h>
#include <Parsers/ASTIdentifier.h>
+#include <Parsers/ExpressionListParsers.h>

namespace DB
{

+bool ParserOptimizeQueryColumnsSpecification::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
+{
+    // Do not allow APPLY and REPLACE transformers.
+    // We use Columns Transformers only to get a list of columns,
+    // and we can't actually modify the content of the columns for deduplication.
+    const auto allowed_transformers = ParserColumnsTransformers::ColumnTransformers{ParserColumnsTransformers::ColumnTransformer::EXCEPT};
+
+    return ParserColumnsMatcher(allowed_transformers).parse(pos, node, expected)
+        || ParserAsterisk(allowed_transformers).parse(pos, node, expected)
+        || ParserIdentifier(false).parse(pos, node, expected);
+}
+
bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{

@ -16,6 +29,7 @@ bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expecte
    ParserKeyword s_partition("PARTITION");
    ParserKeyword s_final("FINAL");
    ParserKeyword s_deduplicate("DEDUPLICATE");
+    ParserKeyword s_by("BY");
    ParserToken s_dot(TokenType::Dot);
    ParserIdentifier name_p;
    ParserPartition partition_p;

@ -55,6 +69,14 @@ bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expecte
    if (s_deduplicate.ignore(pos, expected))
        deduplicate = true;

+    ASTPtr deduplicate_by_columns;
+    if (deduplicate && s_by.ignore(pos, expected))
+    {
+        if (!ParserList(std::make_unique<ParserOptimizeQueryColumnsSpecification>(), std::make_unique<ParserToken>(TokenType::Comma), false)
+            .parse(pos, deduplicate_by_columns, expected))
+            return false;
+    }
+
    auto query = std::make_shared<ASTOptimizeQuery>();
    node = query;

@ -66,6 +88,7 @@ bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expecte
        query->children.push_back(partition);
    query->final = final;
    query->deduplicate = deduplicate;
+    query->deduplicate_by_columns = deduplicate_by_columns;

    return true;
}

@ -7,6 +7,13 @@
namespace DB
{

+class ParserOptimizeQueryColumnsSpecification : public IParserBase
+{
+protected:
+    const char * getName() const override { return "column specification for OPTIMIZE ... DEDUPLICATE BY"; }
+    bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
+};
+
/** Query OPTIMIZE TABLE [db.]name [PARTITION partition] [FINAL] [DEDUPLICATE]
  */
class ParserOptimizeQuery : public IParserBase

134 src/Parsers/tests/gtest_Parser.cpp Normal file
@ -0,0 +1,134 @@
|
||||
#include <Parsers/ParserOptimizeQuery.h>
|
||||
|
||||
#include <Parsers/ParserQueryWithOutput.h>
|
||||
#include <Parsers/parseQuery.h>
|
||||
#include <Parsers/formatAST.h>
|
||||
#include <IO/WriteBufferFromOStream.h>
|
||||
|
||||
#include <string_view>
|
||||
|
||||
#include <gtest/gtest.h>
|
namespace
{

using namespace DB;
using namespace std::literals;

}


struct ParserTestCase
{
    std::shared_ptr<IParser> parser;
    const std::string_view input_text;
    const char * expected_ast = nullptr;
};

std::ostream & operator<<(std::ostream & ostr, const ParserTestCase & test_case)
{
    return ostr << "parser: " << test_case.parser->getName() << ", input: " << test_case.input_text;
}


class ParserTest : public ::testing::TestWithParam<ParserTestCase>
{};

TEST_P(ParserTest, parseQuery)
{
    const auto & [parser, input_text, expected_ast] = GetParam();

    ASSERT_NE(nullptr, parser);

    if (expected_ast)
    {
        ASTPtr ast;
        ASSERT_NO_THROW(ast = parseQuery(*parser, input_text.begin(), input_text.end(), 0, 0));
        EXPECT_EQ(expected_ast, serializeAST(*ast->clone(), false));
    }
    else
    {
        ASSERT_THROW(parseQuery(*parser, input_text.begin(), input_text.end(), 0, 0), DB::Exception);
    }
}


INSTANTIATE_TEST_SUITE_P(ParserOptimizeQuery, ParserTest, ::testing::Values(
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('a, b')",
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('a, b')"
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]')",
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]')"
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') EXCEPT b",
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') EXCEPT b"
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') EXCEPT (a, b)",
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') EXCEPT (a, b)"
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY a, b, c",
        "OPTIMIZE TABLE table_name DEDUPLICATE BY a, b, c"
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY *",
        "OPTIMIZE TABLE table_name DEDUPLICATE BY *"
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT a",
        "OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT a"
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT (a, b)",
        "OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT (a, b)"
    }
));

INSTANTIATE_TEST_SUITE_P(ParserOptimizeQuery_FAIL, ParserTest, ::testing::Values(
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY",
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') APPLY(x)",
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') REPLACE(y)",
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY * APPLY(x)",
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY * REPLACE(y)",
    },
    ParserTestCase
    {
        std::make_shared<ParserOptimizeQuery>(),
        "OPTIMIZE TABLE table_name DEDUPLICATE BY db.a, db.b, db.c",
    }
));
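In SQL terms, the two suites above pin the grammar down to roughly the following forms (a sketch; `t`, `a`, `b`, `x` are placeholder names):

``` sql
-- Accepted: a column list, an asterisk, or COLUMNS('regex'), each optionally with EXCEPT.
OPTIMIZE TABLE t DEDUPLICATE BY a, b;
OPTIMIZE TABLE t DEDUPLICATE BY *;
OPTIMIZE TABLE t DEDUPLICATE BY * EXCEPT (a, b);
OPTIMIZE TABLE t DEDUPLICATE BY COLUMNS('[ab]') EXCEPT a;

-- Rejected: an empty BY list, APPLY/REPLACE column transformers, qualified column names.
OPTIMIZE TABLE t DEDUPLICATE BY;             -- syntax error
OPTIMIZE TABLE t DEDUPLICATE BY * APPLY(x);  -- syntax error
OPTIMIZE TABLE t DEDUPLICATE BY db.a, db.b;  -- syntax error
```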
@@ -380,6 +380,7 @@ public:
         const ASTPtr & /*partition*/,
         bool /*final*/,
         bool /*deduplicate*/,
+        const Names & /* deduplicate_by_columns */,
         const Context & /*context*/)
     {
         throw Exception("Method optimize is not supported by storage " + getName(), ErrorCodes::NOT_IMPLEMENTED);
@@ -652,7 +652,8 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
     time_t time_of_merge,
     const Context & context,
     const ReservationPtr & space_reservation,
-    bool deduplicate)
+    bool deduplicate,
+    const Names & deduplicate_by_columns)
 {
     static const String TMP_PREFIX = "tmp_merge_";

@@ -667,6 +668,13 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
     const MergeTreeData::DataPartsVector & parts = future_part.parts;

     LOG_DEBUG(log, "Merging {} parts: from {} to {} into {}", parts.size(), parts.front()->name, parts.back()->name, future_part.type.toString());
+    if (deduplicate)
+    {
+        if (deduplicate_by_columns.empty())
+            LOG_DEBUG(log, "DEDUPLICATE BY all columns");
+        else
+            LOG_DEBUG(log, "DEDUPLICATE BY ('{}')", fmt::join(deduplicate_by_columns, "', '"));
+    }

     auto disk = space_reservation->getDisk();
     String part_path = data.relative_data_path;

@@ -891,7 +899,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
     BlockInputStreamPtr merged_stream = std::make_shared<PipelineExecutingBlockInputStream>(std::move(pipeline));

     if (deduplicate)
-        merged_stream = std::make_shared<DistinctSortedBlockInputStream>(merged_stream, sort_description, SizeLimits(), 0 /*limit_hint*/, Names());
+        merged_stream = std::make_shared<DistinctSortedBlockInputStream>(merged_stream, sort_description, SizeLimits(), 0 /*limit_hint*/, deduplicate_by_columns);

     if (need_remove_expired_values)
         merged_stream = std::make_shared<TTLBlockInputStream>(merged_stream, data, metadata_snapshot, new_data_part, time_of_merge, force_ttl);
@@ -127,7 +127,8 @@ public:
         time_t time_of_merge,
         const Context & context,
         const ReservationPtr & space_reservation,
-        bool deduplicate);
+        bool deduplicate,
+        const Names & deduplicate_by_columns);

     /// Mutate a single data part with the specified commands. Will create and return a temporary part.
     MergeTreeData::MutableDataPartPtr mutatePartToTemporaryPart(
@@ -6,6 +6,7 @@
 #include <IO/ReadBufferFromString.h>
 #include <IO/WriteBufferFromString.h>
+#include <IO/ReadHelpers.h>
 #include <IO/WriteHelpers.h>


 namespace DB

@@ -16,15 +17,29 @@ namespace ErrorCodes
     extern const int LOGICAL_ERROR;
 }

+enum FormatVersion : UInt8
+{
+    FORMAT_WITH_CREATE_TIME = 2,
+    FORMAT_WITH_BLOCK_ID = 3,
+    FORMAT_WITH_DEDUPLICATE = 4,
+    FORMAT_WITH_UUID = 5,
+    FORMAT_WITH_DEDUPLICATE_BY_COLUMNS = 6,
+
+    FORMAT_LAST
+};
+
+
 void ReplicatedMergeTreeLogEntryData::writeText(WriteBuffer & out) const
 {
-    UInt8 format_version = 4;
+    UInt8 format_version = FORMAT_WITH_DEDUPLICATE;

+    if (!deduplicate_by_columns.empty())
+        format_version = std::max<UInt8>(format_version, FORMAT_WITH_DEDUPLICATE_BY_COLUMNS);
+
     /// Conditionally bump format_version only when uuid has been assigned.
     /// If some other feature requires bumping format_version to >= 5 then this code becomes no-op.
     if (new_part_uuid != UUIDHelpers::Nil)
-        format_version = std::max(format_version, static_cast<UInt8>(5));
+        format_version = std::max<UInt8>(format_version, FORMAT_WITH_UUID);

     out << "format version: " << format_version << "\n"
         << "create_time: " << LocalDateTime(create_time ? create_time : time(nullptr)) << "\n"

@@ -50,6 +65,17 @@ void ReplicatedMergeTreeLogEntryData::writeText(WriteBuffer & out) const
             if (new_part_uuid != UUIDHelpers::Nil)
                 out << "\ninto_uuid: " << new_part_uuid;

+            if (!deduplicate_by_columns.empty())
+            {
+                out << "\ndeduplicate_by_columns: ";
+                for (size_t i = 0; i < deduplicate_by_columns.size(); ++i)
+                {
+                    out << quote << deduplicate_by_columns[i];
+                    if (i != deduplicate_by_columns.size() - 1)
+                        out << ",";
+                }
+            }
+
             break;

         case DROP_RANGE:

@@ -129,10 +155,10 @@ void ReplicatedMergeTreeLogEntryData::readText(ReadBuffer & in)

     in >> "format version: " >> format_version >> "\n";

-    if (format_version < 1 || format_version > 5)
+    if (format_version < 1 || format_version >= FORMAT_LAST)
         throw Exception("Unknown ReplicatedMergeTreeLogEntry format version: " + DB::toString(format_version), ErrorCodes::UNKNOWN_FORMAT_VERSION);

-    if (format_version >= 2)
+    if (format_version >= FORMAT_WITH_CREATE_TIME)
     {
         LocalDateTime create_time_dt;
         in >> "create_time: " >> create_time_dt >> "\n";

@@ -141,7 +167,7 @@ void ReplicatedMergeTreeLogEntryData::readText(ReadBuffer & in)

     in >> "source replica: " >> source_replica >> "\n";

-    if (format_version >= 3)
+    if (format_version >= FORMAT_WITH_BLOCK_ID)
     {
         in >> "block_id: " >> escape >> block_id >> "\n";
     }

@@ -167,7 +193,7 @@ void ReplicatedMergeTreeLogEntryData::readText(ReadBuffer & in)
             }
             in >> new_part_name;

-            if (format_version >= 4)
+            if (format_version >= FORMAT_WITH_DEDUPLICATE)
             {
                 in >> "\ndeduplicate: " >> deduplicate;

@@ -184,6 +210,20 @@ void ReplicatedMergeTreeLogEntryData::readText(ReadBuffer & in)
                 }
                 else if (checkString("into_uuid: ", in))
                     in >> new_part_uuid;
+                else if (checkString("deduplicate_by_columns: ", in))
+                {
+                    Strings new_deduplicate_by_columns;
+                    for (;;)
+                    {
+                        String tmp_column_name;
+                        in >> quote >> tmp_column_name;
+                        new_deduplicate_by_columns.emplace_back(std::move(tmp_column_name));
+                        if (!checkString(",", in))
+                            break;
+                    }
+
+                    deduplicate_by_columns = std::move(new_deduplicate_by_columns);
+                }
                 else
                     trailing_newline_found = true;
             }
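Read together, writeText/readText define a line-oriented text format. For illustration, a format-version-6 MERGE_PARTS entry would serialize roughly like this (a hand-assembled sketch based only on the fields read back above, not output captured from a real replication queue; the part names, replica and timestamp are invented, and the UUID is the one used by the tests below):

``` text
format version: 6
create_time: 2020-01-02 03:04:05
source replica: r1
block_id: 
merge
all_1_1_0
all_2_2_0
into
all_1_2_1
deduplicate: 1
into_uuid: 00000000-075b-cd15-0000-093233447e0c
deduplicate_by_columns: 'foo','bar','qux'
```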
@@ -81,6 +81,7 @@ struct ReplicatedMergeTreeLogEntryData

     Strings source_parts;
     bool deduplicate = false; /// Do deduplicate on merge
+    Strings deduplicate_by_columns = {}; // Which columns should be checked for duplicates, empty means 'all' (default).
     MergeType merge_type = MergeType::REGULAR;
     String column_name;
     String index_name;

@@ -111,10 +112,10 @@ struct ReplicatedMergeTreeLogEntryData
     /// Version of metadata which will be set after this alter
     /// Also present in MUTATE_PART command, to track mutations
     /// required for complete alter execution.
-    int alter_version; /// May be equal to -1, if it's normal mutation, not metadata update.
+    int alter_version = -1; /// May be equal to -1, if it's normal mutation, not metadata update.

     /// only ALTER METADATA command
-    bool have_mutation; /// If this alter requires additional mutation step, for data update
+    bool have_mutation = false; /// If this alter requires additional mutation step, for data update

     String columns_str; /// New columns data corresponding to alter_version
     String metadata_str; /// New metadata corresponding to alter_version
@@ -0,0 +1,348 @@
#include <Storages/MergeTree/ReplicatedMergeTreeLogEntry.h>

#include <IO/ReadBufferFromString.h>

#include <Core/iostream_debug_helpers.h>

#include <type_traits>
#include <regex>

#include <gtest/gtest.h>

namespace DB
{

std::ostream & operator<<(std::ostream & ostr, const MergeTreeDataPartType & type)
{
    return ostr << type.toString();
}

std::ostream & operator<<(std::ostream & ostr, const UInt128 & v)
{
    return ostr << v.toHexString();
}

template <typename T, typename Tag>
std::ostream & operator<<(std::ostream & ostr, const StrongTypedef<T, Tag> & v)
{
    return ostr << v.toUnderType();
}

std::ostream & operator<<(std::ostream & ostr, const MergeType & v)
{
    return ostr << toString(v);
}

}

namespace std
{

std::ostream & operator<<(std::ostream & ostr, const std::exception_ptr & exception)
{
    try
    {
        if (exception)
        {
            std::rethrow_exception(exception);
        }
        return ostr << "<NULL EXCEPTION>";
    }
    catch (const std::exception & e)
    {
        return ostr << e.what();
    }
}

template <typename T>
inline std::ostream & operator<<(std::ostream & ostr, const std::vector<T> & v)
{
    ostr << "[";
    for (size_t i = 0; i < v.size(); ++i)
    {
        ostr << v[i];
        if (i != v.size() - 1)
            ostr << ", ";
    }
    return ostr << "] (" << v.size() << ") items";
}

}

namespace
{

using namespace DB;

template <typename T>
void compareAttributes(::testing::AssertionResult & result, const char * name, const T & expected_value, const T & actual_value);

#define CMP_ATTRIBUTE(attribute) compareAttributes(result, #attribute, expected.attribute, actual.attribute)

::testing::AssertionResult compare(
    const ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry & expected,
    const ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry & actual)
{
    auto result = ::testing::AssertionSuccess();

    CMP_ATTRIBUTE(drop_range_part_name);
    CMP_ATTRIBUTE(from_database);
    CMP_ATTRIBUTE(from_table);
    CMP_ATTRIBUTE(src_part_names);
    CMP_ATTRIBUTE(new_part_names);
    CMP_ATTRIBUTE(part_names_checksums);
    CMP_ATTRIBUTE(columns_version);

    return result;
}

template <typename T>
bool compare(const T & expected, const T & actual)
{
    return expected == actual;
}

template <typename T>
::testing::AssertionResult compare(const std::shared_ptr<T> & expected, const std::shared_ptr<T> & actual)
{
    if (!!expected != !!actual)
        return ::testing::AssertionFailure()
                << "expected : " << static_cast<const void *>(expected.get())
                << "\nactual   : " << static_cast<const void *>(actual.get());

    if (expected && actual)
        return compare(*expected, *actual);

    return ::testing::AssertionSuccess();
}

template <typename T>
void compareAttributes(::testing::AssertionResult & result, const char * name, const T & expected_value, const T & actual_value)
{
    const auto cmp_result = compare(expected_value, actual_value);
    if (cmp_result == false)
    {
        if (result)
            result = ::testing::AssertionFailure();

        result << "\nMismatching attribute: \"" << name << "\"";
        if constexpr (std::is_same_v<std::decay_t<decltype(cmp_result)>, ::testing::AssertionResult>)
            result << "\n" << cmp_result.message();
        else
            result << "\n\texpected: " << expected_value
                   << "\n\tactual  : " << actual_value;
    }
}

::testing::AssertionResult compare(const ReplicatedMergeTreeLogEntryData & expected, const ReplicatedMergeTreeLogEntryData & actual)
{
    ::testing::AssertionResult result = ::testing::AssertionSuccess();

    CMP_ATTRIBUTE(znode_name);
    CMP_ATTRIBUTE(type);
    CMP_ATTRIBUTE(source_replica);
    CMP_ATTRIBUTE(new_part_name);
    CMP_ATTRIBUTE(new_part_type);
    CMP_ATTRIBUTE(block_id);
    CMP_ATTRIBUTE(actual_new_part_name);
    CMP_ATTRIBUTE(new_part_uuid);
    CMP_ATTRIBUTE(source_parts);
    CMP_ATTRIBUTE(deduplicate);
    CMP_ATTRIBUTE(deduplicate_by_columns);
    CMP_ATTRIBUTE(merge_type);
    CMP_ATTRIBUTE(column_name);
    CMP_ATTRIBUTE(index_name);
    CMP_ATTRIBUTE(detach);
    CMP_ATTRIBUTE(replace_range_entry);
    CMP_ATTRIBUTE(alter_version);
    CMP_ATTRIBUTE(have_mutation);
    CMP_ATTRIBUTE(columns_str);
    CMP_ATTRIBUTE(metadata_str);
    CMP_ATTRIBUTE(currently_executing);
    CMP_ATTRIBUTE(removed_by_other_entry);
    CMP_ATTRIBUTE(num_tries);
    CMP_ATTRIBUTE(exception);
    CMP_ATTRIBUTE(last_attempt_time);
    CMP_ATTRIBUTE(num_postponed);
    CMP_ATTRIBUTE(postpone_reason);
    CMP_ATTRIBUTE(last_postpone_time);
    CMP_ATTRIBUTE(create_time);
    CMP_ATTRIBUTE(quorum);

    return result;
}

}


class ReplicatedMergeTreeLogEntryDataTest : public ::testing::TestWithParam<std::tuple<ReplicatedMergeTreeLogEntryData, const char * /* serialized RE */>>
{};

TEST_P(ReplicatedMergeTreeLogEntryDataTest, transcode)
{
    const auto & [expected, match_regex] = GetParam();
    const auto str = expected.toString();

    if (match_regex)
    {
        try
        {
            // egrep since "." matches newline and we can also use "\n" explicitly
            std::regex re(match_regex, std::regex::egrep);
            EXPECT_TRUE(std::regex_match(str, re))
                << "Failed to match serialized ReplicatedMergeTreeLogEntryData: {\n"
                << str << "} \nwith regex: \"" << match_regex << "\"\n";
        }
        catch (const std::regex_error & e)
        {
            FAIL() << e.what()
                   << " on regex: " << match_regex
                   << " (" << strlen(match_regex) << " bytes)" << std::endl;
        }
        catch (...)
        {
            throw;
        }
    }

    ReplicatedMergeTreeLogEntryData actual;
    {
        DB::ReadBufferFromString buffer(str);
        EXPECT_NO_THROW(actual.readText(buffer)) << "While reading:\n" << str;
    }

    ASSERT_TRUE(compare(expected, actual)) << "Via text:\n" << str;
}

// Enabling this warning would ruin test brevity without adding anything in return,
// since most of the fields have default constructors or will be zero-initialized as per the standard,
// so values are predictable and stable across runs.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmissing-field-initializers"

INSTANTIATE_TEST_SUITE_P(Merge, ReplicatedMergeTreeLogEntryDataTest,
    ::testing::ValuesIn(std::initializer_list<std::tuple<ReplicatedMergeTreeLogEntryData, const char *>>{
        {
            {
                // Basic: minimal set of attributes.
                .type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
                .new_part_type = MergeTreeDataPartType::WIDE,
                .create_time = 123, // 0 means 'now' which could cause flaky tests.
            },
            R"re(^format version: 4.+merge.+into.+deduplicate: 0.+$)re"
        },
        {
            {
                .type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
                .new_part_type = MergeTreeDataPartType::WIDE,

                // Format version 4
                .deduplicate = true,

                .create_time = 123,
            },
            R"re(^format version: 4.+merge.+into.+deduplicate: 1.+$)re"
        },
        {
            {
                .type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
                .new_part_type = MergeTreeDataPartType::WIDE,

                // Format version 5
                .new_part_uuid = UUID(UInt128(123456789, 10111213141516)),

                .create_time = 123,
            },
            R"re(^format version: 5.+merge.+into.+deduplicate: 0.+into_uuid: 00000000-075b-cd15-0000-093233447e0c.+$)re"
        },
        {
            {
                .type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
                .new_part_type = MergeTreeDataPartType::WIDE,

                // Format version 6
                .deduplicate = true,
                .deduplicate_by_columns = {"foo", "bar", "qux"},

                .create_time = 123,
            },
            R"re(^format version: 6.+merge.+into.+deduplicate: 1.+deduplicate_by_columns: 'foo','bar','qux'.*$)re"
        },
        {
            {
                .type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
                .new_part_type = MergeTreeDataPartType::WIDE,

                // Mixing features
                .new_part_uuid = UUID(UInt128(123456789, 10111213141516)),
                .deduplicate = true,
                .deduplicate_by_columns = {"foo", "bar", "qux"},

                .create_time = 123,
            },
            R"re(^format version: 6.+merge.+into.+deduplicate: 1.+into_uuid: 00000000-075b-cd15-0000-093233447e0c.+deduplicate_by_columns: 'foo','bar','qux'.*$)re"
        },
        {
            // Validate that exotic column names are serialized/deserialized properly
            {
                .type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
                .new_part_type = MergeTreeDataPartType::WIDE,

                // Mixing features
                .new_part_uuid = UUID(UInt128(123456789, 10111213141516)),
                .deduplicate = true,
                .deduplicate_by_columns = {"name with space", "\"column\"", "'column'", "колонка", "\u30ab\u30e9\u30e0", "\x01\x03 column \x10\x11\x12"},

                .create_time = 123,
            },
            R"re(^format version: 6.+merge.+deduplicate_by_columns: 'name with space','"column"','\\'column\\'','колонка')re"
            ",'\u30ab\u30e9\u30e0','\x01\x03 column \x10\x11\x12'.*$"
        },
    }));

#pragma GCC diagnostic pop

// This is just an example of how to set all fields. Can't be used as is since, depending on type,
// only some fields are serialized/deserialized, and even if everything works perfectly,
// some fields in the deserialized object would be unset (hence differ from expected).
// INSTANTIATE_TEST_SUITE_P(Full, ReplicatedMergeTreeLogEntryDataTest,
//     ::testing::ValuesIn(std::initializer_list<ReplicatedMergeTreeLogEntryData>{
//         {
//             .znode_name = "znode name",
//             .type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
//             .source_replica = "source replica",
//             .new_part_name = "new part name",
//             .new_part_type = MergeTreeDataPartType::WIDE,
//             .block_id = "block id",
//             .actual_new_part_name = "new part name",
//             .new_part_uuid = UUID(UInt128(123456789, 10111213141516)),
//             .source_parts = {"part1", "part2"},
//             .deduplicate = true,
//             .deduplicate_by_columns = {"col1", "col2"},
//             .merge_type = MergeType::REGULAR,
//             .column_name = "column name",
//             .index_name = "index name",
//             .detach = false,
//             .replace_range_entry = std::make_shared<ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry>(
//                 ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry
//                 {
//                     .drop_range_part_name = "drop range part name",
//                     .from_database = "from database",
//                     .src_part_names = {"src part name1", "src part name2"},
//                     .new_part_names = {"new part name1", "new part name2"},
//                     .columns_version = 123456,
//                 }),
//             .alter_version = 56789,
//             .have_mutation = false,
//             .columns_str = "columns str",
//             .metadata_str = "metadata str",
//             // Those attributes are not serialized to string, hence it makes no sense to set them.
//             // .currently_executing
//             // .removed_by_other_entry
//             // .num_tries
//             // .exception
//             // .last_attempt_time
//             // .num_postponed
//             // .postpone_reason
//             // .last_postpone_time,
//             .create_time = static_cast<time_t>(123456789),
//             .quorum = 321,
//         },
//     }));
@@ -583,7 +583,7 @@ void StorageBuffer::shutdown()

    try
    {
-        optimize(nullptr /*query*/, getInMemoryMetadataPtr(), {} /*partition*/, false /*final*/, false /*deduplicate*/, global_context);
+        optimize(nullptr /*query*/, getInMemoryMetadataPtr(), {} /*partition*/, false /*final*/, false /*deduplicate*/, {}, global_context);
    }
    catch (...)
    {

@@ -608,6 +608,7 @@ bool StorageBuffer::optimize(
    const ASTPtr & partition,
    bool final,
    bool deduplicate,
+    const Names & /* deduplicate_by_columns */,
    const Context & /*context*/)
{
    if (partition)

@@ -906,7 +907,7 @@ void StorageBuffer::alter(const AlterCommands & params, const Context & context,
    /// Flush all buffers to storages, so that no non-empty blocks of the old
    /// structure remain. Structure of empty blocks will be updated during first
    /// insert.
-    optimize({} /*query*/, metadata_snapshot, {} /*partition_id*/, false /*final*/, false /*deduplicate*/, context);
+    optimize({} /*query*/, metadata_snapshot, {} /*partition_id*/, false /*final*/, false /*deduplicate*/, {}, context);

    StorageInMemoryMetadata new_metadata = *metadata_snapshot;
    params.apply(new_metadata, context);
@@ -87,6 +87,7 @@ public:
        const ASTPtr & partition,
        bool final,
        bool deduplicate,
+        const Names & deduplicate_by_columns,
        const Context & context) override;

    bool supportsSampling() const override { return true; }
@@ -233,12 +233,13 @@ bool StorageMaterializedView::optimize(
    const ASTPtr & partition,
    bool final,
    bool deduplicate,
+    const Names & deduplicate_by_columns,
    const Context & context)
{
    checkStatementCanBeForwarded();
    auto storage_ptr = getTargetTable();
    auto metadata_snapshot = storage_ptr->getInMemoryMetadataPtr();
-    return getTargetTable()->optimize(query, metadata_snapshot, partition, final, deduplicate, context);
+    return getTargetTable()->optimize(query, metadata_snapshot, partition, final, deduplicate, deduplicate_by_columns, context);
}

void StorageMaterializedView::alter(
@@ -46,6 +46,7 @@ public:
        const ASTPtr & partition,
        bool final,
        bool deduplicate,
+        const Names & deduplicate_by_columns,
        const Context & context) override;

    void alter(const AlterCommands & params, const Context & context, TableLockHolder & table_lock_holder) override;
@@ -741,6 +741,7 @@ bool StorageMergeTree::merge(
    const String & partition_id,
    bool final,
    bool deduplicate,
+    const Names & deduplicate_by_columns,
    String * out_disable_reason,
    bool optimize_skip_merged_partitions)
{

@@ -758,10 +759,15 @@ bool StorageMergeTree::merge(
    if (!merge_mutate_entry)
        return false;

-    return mergeSelectedParts(metadata_snapshot, deduplicate, *merge_mutate_entry, table_lock_holder);
+    return mergeSelectedParts(metadata_snapshot, deduplicate, deduplicate_by_columns, *merge_mutate_entry, table_lock_holder);
}

-bool StorageMergeTree::mergeSelectedParts(const StorageMetadataPtr & metadata_snapshot, bool deduplicate, MergeMutateSelectedEntry & merge_mutate_entry, TableLockHolder & table_lock_holder)
+bool StorageMergeTree::mergeSelectedParts(
+    const StorageMetadataPtr & metadata_snapshot,
+    bool deduplicate,
+    const Names & deduplicate_by_columns,
+    MergeMutateSelectedEntry & merge_mutate_entry,
+    TableLockHolder & table_lock_holder)
{
    auto & future_part = merge_mutate_entry.future_part;
    Stopwatch stopwatch;

@@ -786,7 +792,7 @@ bool StorageMergeTree::mergeSelectedParts(const StorageMetadataPtr & metadata_sn
    {
        new_part = merger_mutator.mergePartsToTemporaryPart(
            future_part, metadata_snapshot, *(merge_list_entry), table_lock_holder, time(nullptr),
-            global_context, merge_mutate_entry.tagger->reserved_space, deduplicate);
+            global_context, merge_mutate_entry.tagger->reserved_space, deduplicate, deduplicate_by_columns);

        merger_mutator.renameMergedTemporaryPart(new_part, future_part.parts, nullptr);
        write_part_log({});

@@ -953,7 +959,7 @@ std::optional<JobAndPool> StorageMergeTree::getDataProcessingJob()
    return JobAndPool{[this, metadata_snapshot, merge_entry, mutate_entry, share_lock] () mutable
    {
        if (merge_entry)
-            mergeSelectedParts(metadata_snapshot, false, *merge_entry, share_lock);
+            mergeSelectedParts(metadata_snapshot, false, {}, *merge_entry, share_lock);
        else if (mutate_entry)
            mutateSelectedPart(metadata_snapshot, *mutate_entry, share_lock);
    }, PoolType::MERGE_MUTATE};

@@ -1036,8 +1042,17 @@ bool StorageMergeTree::optimize(
    const ASTPtr & partition,
    bool final,
    bool deduplicate,
+    const Names & deduplicate_by_columns,
    const Context & context)
{
+    if (deduplicate)
+    {
+        if (deduplicate_by_columns.empty())
+            LOG_DEBUG(log, "DEDUPLICATE BY all columns");
+        else
+            LOG_DEBUG(log, "DEDUPLICATE BY ('{}')", fmt::join(deduplicate_by_columns, "', '"));
+    }
+
    String disable_reason;
    if (!partition && final)
    {

@@ -1049,7 +1064,7 @@ bool StorageMergeTree::optimize(

        for (const String & partition_id : partition_ids)
        {
-            if (!merge(true, partition_id, true, deduplicate, &disable_reason, context.getSettingsRef().optimize_skip_merged_partitions))
+            if (!merge(true, partition_id, true, deduplicate, deduplicate_by_columns, &disable_reason, context.getSettingsRef().optimize_skip_merged_partitions))
            {
                constexpr const char * message = "Cannot OPTIMIZE table: {}";
                if (disable_reason.empty())

@@ -1068,7 +1083,7 @@ bool StorageMergeTree::optimize(
        if (partition)
            partition_id = getPartitionIDFromQuery(partition, context);

-        if (!merge(true, partition_id, final, deduplicate, &disable_reason, context.getSettingsRef().optimize_skip_merged_partitions))
+        if (!merge(true, partition_id, final, deduplicate, deduplicate_by_columns, &disable_reason, context.getSettingsRef().optimize_skip_merged_partitions))
        {
            constexpr const char * message = "Cannot OPTIMIZE table: {}";
            if (disable_reason.empty())
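As a quick sanity check of the logging added above, the query-to-log mapping is direct (a sketch; `t`, `pk`, `sk` are placeholder names, and the log lines are what the format strings above would produce, not captured server output):

``` sql
OPTIMIZE TABLE t FINAL DEDUPLICATE BY pk, sk;
-- expected debug log line: DEDUPLICATE BY ('pk', 'sk')

OPTIMIZE TABLE t FINAL DEDUPLICATE;
-- expected debug log line: DEDUPLICATE BY all columns
```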
@@ -70,6 +70,7 @@ public:
        const ASTPtr & partition,
        bool final,
        bool deduplicate,
+        const Names & deduplicate_by_columns,
        const Context & context) override;

    void mutate(const MutationCommands & commands, const Context & context) override;

@@ -132,7 +133,7 @@ private:
     * If aggressive - when selecting parts, don't take into account their ratio size and novelty (used for the OPTIMIZE query).
     * Returns true if merge is finished successfully.
     */
-    bool merge(bool aggressive, const String & partition_id, bool final, bool deduplicate, String * out_disable_reason = nullptr, bool optimize_skip_merged_partitions = false);
+    bool merge(bool aggressive, const String & partition_id, bool final, bool deduplicate, const Names & deduplicate_by_columns, String * out_disable_reason = nullptr, bool optimize_skip_merged_partitions = false);

    ActionLock stopMergesAndWait();

@@ -183,7 +184,8 @@ private:
        TableLockHolder & table_lock_holder,
        bool optimize_skip_merged_partitions = false,
        SelectPartsDecision * select_decision_out = nullptr);
-    bool mergeSelectedParts(const StorageMetadataPtr & metadata_snapshot, bool deduplicate, MergeMutateSelectedEntry & entry, TableLockHolder & table_lock_holder);
+
+    bool mergeSelectedParts(const StorageMetadataPtr & metadata_snapshot, bool deduplicate, const Names & deduplicate_by_columns, MergeMutateSelectedEntry & entry, TableLockHolder & table_lock_holder);

    std::shared_ptr<MergeMutateSelectedEntry> selectPartsToMutate(const StorageMetadataPtr & metadata_snapshot, String * disable_reason, TableLockHolder & table_lock_holder);
    bool mutateSelectedPart(const StorageMetadataPtr & metadata_snapshot, MergeMutateSelectedEntry & entry, TableLockHolder & table_lock_holder);
@@ -122,9 +122,10 @@ public:
        const ASTPtr & partition,
        bool final,
        bool deduplicate,
+        const Names & deduplicate_by_columns,
        const Context & context) override
    {
-        return getNested()->optimize(query, metadata_snapshot, partition, final, deduplicate, context);
+        return getNested()->optimize(query, metadata_snapshot, partition, final, deduplicate, deduplicate_by_columns, context);
    }

    void mutate(const MutationCommands & commands, const Context & context) override { getNested()->mutate(commands, context); }
@@ -1508,7 +1508,7 @@ bool StorageReplicatedMergeTree::tryExecuteMerge(const LogEntry & entry)
    {
        part = merger_mutator.mergePartsToTemporaryPart(
            future_merged_part, metadata_snapshot, *merge_entry,
-            table_lock, entry.create_time, global_context, reserved_space, entry.deduplicate);
+            table_lock, entry.create_time, global_context, reserved_space, entry.deduplicate, entry.deduplicate_by_columns);

        merger_mutator.renameMergedTemporaryPart(part, parts, &transaction);

@@ -2712,6 +2712,7 @@ void StorageReplicatedMergeTree::mergeSelectingTask()

    const auto storage_settings_ptr = getSettings();
    const bool deduplicate = false; /// TODO: read deduplicate option from table config
+    const Names deduplicate_by_columns = {};
    CreateMergeEntryResult create_result = CreateMergeEntryResult::Other;

    try

@@ -2762,6 +2763,7 @@ void StorageReplicatedMergeTree::mergeSelectingTask()
                    future_merged_part.uuid,
                    future_merged_part.type,
                    deduplicate,
+                    deduplicate_by_columns,
                    nullptr,
                    merge_pred.getVersion(),
                    future_merged_part.merge_type);

@@ -2851,6 +2853,7 @@ StorageReplicatedMergeTree::CreateMergeEntryResult StorageReplicatedMergeTree::c
    const UUID & merged_part_uuid,
    const MergeTreeDataPartType & merged_part_type,
    bool deduplicate,
+    const Names & deduplicate_by_columns,
    ReplicatedMergeTreeLogEntryData * out_log_entry,
    int32_t log_version,
    MergeType merge_type)

@@ -2888,6 +2891,7 @@ StorageReplicatedMergeTree::CreateMergeEntryResult StorageReplicatedMergeTree::c
    entry.new_part_type = merged_part_type;
    entry.merge_type = merge_type;
    entry.deduplicate = deduplicate;
+    entry.deduplicate_by_columns = deduplicate_by_columns;
    entry.merge_type = merge_type;
    entry.create_time = time(nullptr);

@@ -3862,6 +3866,7 @@ BlockOutputStreamPtr StorageReplicatedMergeTree::write(const ASTPtr & /*query*/,
    const Settings & query_settings = context.getSettingsRef();
    bool deduplicate = storage_settings_ptr->replicated_deduplication_window != 0 && query_settings.insert_deduplicate;

+    // TODO: should we also somehow pass list of columns to deduplicate on to the ReplicatedMergeTreeBlockOutputStream ?
    return std::make_shared<ReplicatedMergeTreeBlockOutputStream>(
        *this, metadata_snapshot, query_settings.insert_quorum,
        query_settings.insert_quorum_timeout.totalMilliseconds(),

@@ -3878,6 +3883,7 @@ bool StorageReplicatedMergeTree::optimize(
    const ASTPtr & partition,
    bool final,
    bool deduplicate,
+    const Names & deduplicate_by_columns,
    const Context & query_context)
{
    assertNotReadonly();

@@ -3935,7 +3941,8 @@ bool StorageReplicatedMergeTree::optimize(
            ReplicatedMergeTreeLogEntryData merge_entry;
            CreateMergeEntryResult create_result = createLogEntryToMergeParts(
                zookeeper, future_merged_part.parts,
-                future_merged_part.name, future_merged_part.uuid, future_merged_part.type, deduplicate,
+                future_merged_part.name, future_merged_part.uuid, future_merged_part.type,
+                deduplicate, deduplicate_by_columns,
                &merge_entry, can_merge.getVersion(), future_merged_part.merge_type);

            if (create_result == CreateMergeEntryResult::MissingPart)

@@ -3996,7 +4003,8 @@ bool StorageReplicatedMergeTree::optimize(
            ReplicatedMergeTreeLogEntryData merge_entry;
            CreateMergeEntryResult create_result = createLogEntryToMergeParts(
                zookeeper, future_merged_part.parts,
-                future_merged_part.name, future_merged_part.uuid, future_merged_part.type, deduplicate,
+                future_merged_part.name, future_merged_part.uuid, future_merged_part.type,
+                deduplicate, deduplicate_by_columns,
                &merge_entry, can_merge.getVersion(), future_merged_part.merge_type);

            if (create_result == CreateMergeEntryResult::MissingPart)
@@ -120,6 +120,7 @@ public:
        const ASTPtr & partition,
        bool final,
        bool deduplicate,
+        const Names & deduplicate_by_columns,
        const Context & query_context) override;

    void alter(const AlterCommands & commands, const Context & query_context, TableLockHolder & table_lock_holder) override;

@@ -470,6 +471,7 @@ private:
        const UUID & merged_part_uuid,
        const MergeTreeDataPartType & merged_part_type,
        bool deduplicate,
+        const Names & deduplicate_by_columns,
        ReplicatedMergeTreeLogEntryData * out_log_entry,
        int32_t log_version,
        MergeType merge_type);
@@ -37,7 +37,7 @@
    <substitution>
        <name>table</name>
        <values>
-            <value>numbers(200000)</value>
+            <value>numbers(2000000)</value>
        </values>
    </substitution>
    <substitution>
@@ -9,7 +9,7 @@ ${CLICKHOUSE_CLIENT} -q "SELECT 'CREATE TABLE test_' || hex(randomPrintableASCII
function stress()
{
    while true; do
-        "${CURDIR}"/01526_client_start_and_exit.expect | grep -v -P 'ClickHouse client|Connecting|Connected|:\) Bye\.|^\s*$|spawn bash|^0\s*$'
+        "${CURDIR}"/01526_client_start_and_exit.expect | grep -v -P 'ClickHouse client|Connecting|Connected|:\) Bye\.|new year|^\s*$|spawn bash|^0\s*$'
    done
}
@@ -0,0 +1,39 @@
TOTAL rows 4
OLD DEDUPLICATE
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY *
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY * EXCEPT mat
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY pk,sk,val,mat,partition_key
0 0 0 1
1 1 2 1
1 1 3 1
Can not remove full duplicates
OLD DEDUPLICATE
4
DEDUPLICATE BY pk,sk,val,mat
4
Remove partial duplicates
DEDUPLICATE BY *
3
DEDUPLICATE BY * EXCEPT mat
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY COLUMNS("*") EXCEPT mat
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY pk,sk
0 0 0 1
1 1 2 1
DEDUPLICATE BY COLUMNS(".*k")
0 0 0 1
1 1 2 1
tests/queries/0_stateless/01581_deduplicate_by_columns_local.sql (new file)
@@ -0,0 +1,122 @@
--- See also tests/queries/0_stateless/01581_deduplicate_by_columns_replicated.sql

--- local case

-- Just in case previous test runs left some stuff behind.
DROP TABLE IF EXISTS source_data;

CREATE TABLE source_data (
    pk Int32, sk Int32, val UInt32, partition_key UInt32 DEFAULT 1,
    PRIMARY KEY (pk)
) ENGINE=MergeTree
ORDER BY (pk, sk);

INSERT INTO source_data (pk, sk, val) VALUES (0, 0, 0), (0, 0, 0), (1, 1, 2), (1, 1, 3);

SELECT 'TOTAL rows', count() FROM source_data;

DROP TABLE IF EXISTS full_duplicates;
-- table with duplicates on MATERIALIZED columns
CREATE TABLE full_duplicates (
    pk Int32, sk Int32, val UInt32, partition_key UInt32, mat UInt32 MATERIALIZED 12345, alias UInt32 ALIAS 2,
    PRIMARY KEY (pk)
) ENGINE=MergeTree
PARTITION BY (partition_key + 1) -- ensure that a column inside an expression is properly handled when deduplicating. See [1] below.
ORDER BY (pk, toString(sk * 10)); -- silly order key to ensure that a key column is checked even when it is part of an expression. See [1] below.

-- ERROR cases
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY pk, sk, val, mat, alias; -- { serverError 16 } -- alias column is present
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY sk, val; -- { serverError 8 } -- primary key column is missing
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY * EXCEPT(pk, sk, val, mat, alias, partition_key); -- { serverError 51 } -- list is empty
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY * EXCEPT(pk); -- { serverError 8 } -- primary key column is missing [1]
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY * EXCEPT(sk); -- { serverError 8 } -- sorting key column is missing [1]
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY * EXCEPT(partition_key); -- { serverError 8 } -- partitioning column is missing [1]

OPTIMIZE TABLE full_duplicates DEDUPLICATE BY; -- { clientError 62 } -- empty list is a syntax error
OPTIMIZE TABLE partial_duplicates DEDUPLICATE BY pk,sk,val,mat EXCEPT mat; -- { clientError 62 } -- invalid syntax
OPTIMIZE TABLE partial_duplicates DEDUPLICATE BY pk APPLY(pk + 1); -- { clientError 62 } -- APPLY column transformer is not supported
OPTIMIZE TABLE partial_duplicates DEDUPLICATE BY pk REPLACE(pk + 1); -- { clientError 62 } -- REPLACE column transformer is not supported

-- Valid cases
-- NOTE: here and below we need FINAL to force deduplication on such a small data set that fits in only 1 part.

SELECT 'OLD DEDUPLICATE';
INSERT INTO full_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE full_duplicates FINAL DEDUPLICATE;
SELECT * FROM full_duplicates;
TRUNCATE full_duplicates;

SELECT 'DEDUPLICATE BY *';
INSERT INTO full_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE full_duplicates FINAL DEDUPLICATE BY *;
SELECT * FROM full_duplicates;
TRUNCATE full_duplicates;

SELECT 'DEDUPLICATE BY * EXCEPT mat';
INSERT INTO full_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE full_duplicates FINAL DEDUPLICATE BY * EXCEPT mat;
SELECT * FROM full_duplicates;
TRUNCATE full_duplicates;

SELECT 'DEDUPLICATE BY pk,sk,val,mat,partition_key';
INSERT INTO full_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE full_duplicates FINAL DEDUPLICATE BY pk,sk,val,mat,partition_key;
SELECT * FROM full_duplicates;
TRUNCATE full_duplicates;

--DROP TABLE full_duplicates;

-- Now on to the partial duplicates, where the MATERIALIZED column always has a unique value.
DROP TABLE IF EXISTS partial_duplicates;
CREATE TABLE partial_duplicates (
    pk Int32, sk Int32, val UInt32, partition_key UInt32 DEFAULT 1, mat UInt32 MATERIALIZED rand(), alias UInt32 ALIAS 2,
    PRIMARY KEY (pk)
) ENGINE=MergeTree
ORDER BY (pk, sk);

SELECT 'Can not remove full duplicates';

-- should not remove anything
SELECT 'OLD DEDUPLICATE';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE;
SELECT count() FROM partial_duplicates;
TRUNCATE partial_duplicates;

SELECT 'DEDUPLICATE BY pk,sk,val,mat';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY pk,sk,val,mat;
SELECT count() FROM partial_duplicates;
TRUNCATE partial_duplicates;

SELECT 'Remove partial duplicates';

SELECT 'DEDUPLICATE BY *'; -- all except MATERIALIZED columns, hence will reduce the number of rows.
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY *;
SELECT count() FROM partial_duplicates;
TRUNCATE partial_duplicates;

SELECT 'DEDUPLICATE BY * EXCEPT mat';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY * EXCEPT mat;
SELECT * FROM partial_duplicates;
TRUNCATE partial_duplicates;

SELECT 'DEDUPLICATE BY COLUMNS("*") EXCEPT mat';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY COLUMNS('.*') EXCEPT mat;
SELECT * FROM partial_duplicates;
TRUNCATE partial_duplicates;

SELECT 'DEDUPLICATE BY pk,sk';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY pk,sk;
SELECT * FROM partial_duplicates;
TRUNCATE partial_duplicates;

SELECT 'DEDUPLICATE BY COLUMNS(".*k")';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY COLUMNS('.*k');
SELECT * FROM partial_duplicates;
TRUNCATE partial_duplicates;
@@ -0,0 +1,47 @@
check that we have a data
r1 1 1001 3 2 2
r1 1 2001 1 1 1
r1 2 1002 1 1 1
r1 2 2002 1 1 1
r1 3 1003 2 2 2
r1 4 1004 2 2 2
r1 5 2005 2 2 1
r1 9 1002 1 1 1
r2 1 1001 3 2 2
r2 1 2001 1 1 1
r2 2 1002 1 1 1
r2 2 2002 1 1 1
r2 3 1003 2 2 2
r2 4 1004 2 2 2
r2 5 2005 2 2 1
r2 9 1002 1 1 1
after old OPTIMIZE DEDUPLICATE
r1 1 1001 3 2 2
r1 1 2001 1 1 1
r1 2 1002 1 1 1
r1 2 2002 1 1 1
r1 3 1003 2 2 2
r1 4 1004 2 2 2
r1 5 2005 2 2 1
r1 9 1002 1 1 1
r2 1 1001 3 2 2
r2 1 2001 1 1 1
r2 2 1002 1 1 1
r2 2 2002 1 1 1
r2 3 1003 2 2 2
r2 4 1004 2 2 2
r2 5 2005 2 2 1
r2 9 1002 1 1 1
check data again after multiple deduplications with new syntax
r1 1 1001 1 1 1
r1 2 1002 1 1 1
r1 3 1003 1 1 1
r1 4 1004 1 1 1
r1 5 2005 1 1 1
r1 9 1002 1 1 1
r2 1 1001 1 1 1
r2 2 1002 1 1 1
r2 3 1003 1 1 1
r2 4 1004 1 1 1
r2 5 2005 1 1 1
r2 9 1002 1 1 1
@@ -0,0 +1,60 @@
--- See also tests/queries/0_stateless/01581_deduplicate_by_columns_local.sql

--- replicated case

-- Just in case previous test runs left some stuff behind.
DROP TABLE IF EXISTS replicated_deduplicate_by_columns_r1;
DROP TABLE IF EXISTS replicated_deduplicate_by_columns_r2;

SET replication_alter_partitions_sync = 2;

-- In real life insert_replica_id would be filled from the hostname.
CREATE TABLE IF NOT EXISTS replicated_deduplicate_by_columns_r1 (
    id Int32, val UInt32, unique_value UInt64 MATERIALIZED rowNumberInBlock(), insert_replica_id UInt8 MATERIALIZED randConstant()
) ENGINE=ReplicatedMergeTree('/clickhouse/tables/test_01581/replicated_deduplicate', 'r1') ORDER BY id;

CREATE TABLE IF NOT EXISTS replicated_deduplicate_by_columns_r2 (
    id Int32, val UInt32, unique_value UInt64 MATERIALIZED rowNumberInBlock(), insert_replica_id UInt8 MATERIALIZED randConstant()
) ENGINE=ReplicatedMergeTree('/clickhouse/tables/test_01581/replicated_deduplicate', 'r2') ORDER BY id;


SYSTEM STOP REPLICATED SENDS;
SYSTEM STOP FETCHES;
SYSTEM STOP REPLICATION QUEUES;

-- insert some data; 2 records, (3, 1003) and (4, 1004), are duplicated and differ in unique_value / insert_replica_id,
-- while (1, 1001) and (5, 2005) have full duplicates
INSERT INTO replicated_deduplicate_by_columns_r1 VALUES (1, 1001), (1, 1001), (2, 1002), (3, 1003), (4, 1004), (1, 2001), (9, 1002);
INSERT INTO replicated_deduplicate_by_columns_r2 VALUES (1, 1001), (2, 2002), (3, 1003), (4, 1004), (5, 2005), (5, 2005);

SYSTEM START REPLICATION QUEUES;
SYSTEM START FETCHES;
SYSTEM START REPLICATED SENDS;

-- wait for the replicas to sync
SYSTEM SYNC REPLICA replicated_deduplicate_by_columns_r2;
SYSTEM SYNC REPLICA replicated_deduplicate_by_columns_r1;

SELECT 'check that we have a data';
SELECT 'r1', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r1 GROUP BY id, val ORDER BY id, val;
SELECT 'r2', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r2 GROUP BY id, val ORDER BY id, val;

-- NOTE: here and below we need FINAL to force deduplication on such a small data set that fits in only 1 part.
-- this should remove the full duplicates
OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE;

SELECT 'after old OPTIMIZE DEDUPLICATE';
SELECT 'r1', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r1 GROUP BY id, val ORDER BY id, val;
SELECT 'r2', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r2 GROUP BY id, val ORDER BY id, val;

OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY id, val;
OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY COLUMNS('[id, val]');
OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY COLUMNS('[i]') EXCEPT(unique_value, insert_replica_id);

SELECT 'check data again after multiple deduplications with new syntax';
SELECT 'r1', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r1 GROUP BY id, val ORDER BY id, val;
SELECT 'r2', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r2 GROUP BY id, val ORDER BY id, val;

-- cleanup the mess
DROP TABLE replicated_deduplicate_by_columns_r1;
DROP TABLE replicated_deduplicate_by_columns_r2;
@@ -0,0 +1,11 @@
SELECT count()
FROM
(
    SELECT number
    FROM
    (
        SELECT number
        FROM numbers(1000000)
    )
    WHERE rand64() < (0.01 * 18446744073709552000.)
)

@@ -0,0 +1 @@
EXPLAIN SYNTAX SELECT count(*) FROM ( SELECT number FROM ( SELECT number FROM numbers(1000000) ) WHERE rand64() < (0.01 * 18446744073709552000.));
@@ -0,0 +1 @@
one

@@ -0,0 +1,4 @@
drop table if exists enum;
create table enum engine MergeTree order by enum as select cast(1, 'Enum8(\'zero\'=0, \'one\'=1)') AS enum;
select * from enum where enum = 1;
drop table if exists enum;