Merge branch 'master' into FawnD2-switch-upstream-for-arrow-submodule

This commit is contained in:
Alexey Milovidov 2020-12-21 10:22:14 +03:00
commit 252f2ae876
52 changed files with 1609 additions and 92 deletions

View File

@ -1,4 +1,4 @@
## system.table_name {#system-tables_table-name}
# system.table_name {#system-tables_table-name}
Description.

View File

@ -16,4 +16,6 @@ You can also use the following database engines:
- [Lazy](../../engines/database-engines/lazy.md)
- [MaterializeMySQL](../../engines/database-engines/materialize-mysql.md)
[Original article](https://clickhouse.tech/docs/en/database_engines/) <!--hide-->

View File

@ -0,0 +1,156 @@
---
toc_priority: 29
toc_title: MaterializeMySQL
---
# MaterializeMySQL {#materialize-mysql}
Creates a ClickHouse database with all the tables existing in MySQL, and all the data in those tables.
The ClickHouse server works as a MySQL replica. It reads the binlog and performs DDL and DML queries.
## Creating a Database {#creating-a-database}
``` sql
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MaterializeMySQL('host:port', ['database' | database], 'user', 'password') [SETTINGS ...]
```
**Engine Parameters**
- `host:port` — MySQL server endpoint.
- `database` — MySQL database name.
- `user` — MySQL user.
- `password` — User password.
## Virtual columns {#virtual-columns}
When working with the `MaterializeMySQL` database engine, [ReplacingMergeTree](../../engines/table-engines/mergetree-family/replacingmergetree.md) tables are used with virtual `_sign` and `_version` columns.
- `_version` — Transaction counter. Type [UInt64](../../sql-reference/data-types/int-uint.md).
- `_sign` — Deletion mark. Type [Int8](../../sql-reference/data-types/int-uint.md). Possible values:
- `1` — Row is not deleted,
- `-1` — Row is deleted.
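For illustration, here is a sketch of reading the virtual columns directly (it assumes the `mysql.test` table from the example at the end of this page; referencing `_sign` or `_version` disables the default filtering described below, so deleted row versions appear as well):
``` sql
SELECT a, b, _sign, _version FROM mysql.test;
```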
## Data Types Support {#data_types-support}
| MySQL | ClickHouse |
|-------------------------|--------------------------------------------------------------|
| TINY | [Int8](../../sql-reference/data-types/int-uint.md) |
| SHORT | [Int16](../../sql-reference/data-types/int-uint.md) |
| INT24 | [Int32](../../sql-reference/data-types/int-uint.md) |
| LONG | [UInt32](../../sql-reference/data-types/int-uint.md) |
| LONGLONG | [UInt64](../../sql-reference/data-types/int-uint.md) |
| FLOAT | [Float32](../../sql-reference/data-types/float.md) |
| DOUBLE | [Float64](../../sql-reference/data-types/float.md) |
| DECIMAL, NEWDECIMAL | [Decimal](../../sql-reference/data-types/decimal.md) |
| DATE, NEWDATE | [Date](../../sql-reference/data-types/date.md) |
| DATETIME, TIMESTAMP | [DateTime](../../sql-reference/data-types/datetime.md) |
| DATETIME2, TIMESTAMP2 | [DateTime64](../../sql-reference/data-types/datetime64.md) |
| STRING | [String](../../sql-reference/data-types/string.md) |
| VARCHAR, VAR_STRING | [String](../../sql-reference/data-types/string.md) |
| BLOB | [String](../../sql-reference/data-types/string.md) |
Other types are not supported. If a MySQL table contains a column of an unsupported type, ClickHouse throws the "Unhandled data type" exception and stops replication.
[Nullable](../../sql-reference/data-types/nullable.md) is supported.
## Specifics and Recommendations {#specifics-and-recommendations}
### DDL Queries {#ddl-queries}
MySQL DDL queries are converted into the corresponding ClickHouse DDL queries ([ALTER](../../sql-reference/statements/alter/index.md), [CREATE](../../sql-reference/statements/create/index.md), [DROP](../../sql-reference/statements/drop.md), [RENAME](../../sql-reference/statements/rename.md)). If ClickHouse cannot parse some DDL query, the query is ignored.
### Data Replication {#data-replication}
MaterializeMySQL does not support direct `INSERT`, `DELETE`, and `UPDATE` queries. However, they are supported in terms of data replication:
- A MySQL `INSERT` query is converted into an `INSERT` with `_sign=1`.
- A MySQL `DELETE` query is converted into an `INSERT` with `_sign=-1`.
- A MySQL `UPDATE` query is converted into an `INSERT` with `_sign=-1` and an `INSERT` with `_sign=1`.
### Selecting from MaterializeMySQL Tables {#select}
A `SELECT` query from MaterializeMySQL tables has some specifics:
- If `_version` is not specified in the `SELECT` query, the [FINAL](../../sql-reference/statements/select/from.md#select-from-final) modifier is used, so only rows with `MAX(_version)` are selected.
- If `_sign` is not specified in the `SELECT` query, `WHERE _sign=1` is used by default, so deleted rows are not included in the result set.
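In other words, the first query below behaves roughly like the explicit second form (a sketch, assuming the `mysql.test` table from the example at the end of this page):
``` sql
SELECT a, b FROM mysql.test;                        -- implicit FINAL and _sign = 1
SELECT a, b FROM mysql.test FINAL WHERE _sign = 1;  -- roughly equivalent explicit form
```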
### Index Conversion {#index-conversion}
MySQL `PRIMARY KEY` and `INDEX` clauses are converted into `ORDER BY` tuples in ClickHouse tables.
ClickHouse has only one physical order, which is determined by the `ORDER BY` clause. To create a new physical order, use [materialized views](../../sql-reference/statements/create/view.md#materialized).
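A sketch of that approach, with a hypothetical view name (not part of the original example):
``` sql
CREATE MATERIALIZED VIEW mysql_test_by_b
ENGINE = MergeTree ORDER BY b
AS SELECT a, b FROM mysql.test;
```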
**Notes**
- Rows with `_sign=-1` are not deleted physically from the tables.
- Cascade `UPDATE/DELETE` queries are not supported by the `MaterializeMySQL` engine.
- Replication can be easily broken.
- Manual operations on database and tables are forbidden.
## Examples of Use {#examples-of-use}
Queries in MySQL:
``` sql
mysql> CREATE DATABASE db;
mysql> CREATE TABLE db.test (a INT PRIMARY KEY, b INT);
mysql> INSERT INTO db.test VALUES (1, 11), (2, 22);
mysql> DELETE FROM db.test WHERE a=1;
mysql> ALTER TABLE db.test ADD COLUMN c VARCHAR(16);
mysql> UPDATE db.test SET c='Wow!', b=222;
mysql> SELECT * FROM db.test;
```
```text
+---+------+------+
| a | b | c |
+---+------+------+
| 2 | 222 | Wow! |
+---+------+------+
```
The database in ClickHouse, exchanging data with the MySQL server:
The database and the table are created:
``` sql
CREATE DATABASE mysql ENGINE = MaterializeMySQL('localhost:3306', 'db', 'user', '***');
SHOW TABLES FROM mysql;
```
``` text
┌─name─┐
│ test │
└──────┘
```
After inserting data:
``` sql
SELECT * FROM mysql.test;
```
``` text
┌─a─┬──b─┐
│ 1 │ 11 │
│ 2 │ 22 │
└───┴────┘
```
After deleting data, adding the column and updating:
``` sql
SELECT * FROM mysql.test;
```
``` text
┌─a─┬───b─┬─c────┐
│ 2 │ 222 │ Wow! │
└───┴─────┴──────┘
```
[Original article](https://clickhouse.tech/docs/en/database_engines/materialize-mysql/) <!--hide-->

View File

@ -0,0 +1,317 @@
---
toc_priority: 16
toc_title: Recipes Dataset
---
# Recipes Dataset
The RecipeNLG dataset is available for download [here](https://recipenlg.cs.put.poznan.pl/dataset). It contains 2.2 million recipes and is slightly less than 1 GB in size.
## Download and unpack the dataset
Accept the Terms and Conditions and download the dataset [here](https://recipenlg.cs.put.poznan.pl/dataset). Unpack the zip archive with `unzip`; you will get the `full_dataset.csv` file.
## Create a table
Run `clickhouse-client` and execute the following `CREATE` query:
``` sql
CREATE TABLE recipes
(
title String,
ingredients Array(String),
directions Array(String),
link String,
source LowCardinality(String),
NER Array(String)
) ENGINE = MergeTree ORDER BY title;
```
## Insert the data
Run the following command:
``` bash
clickhouse-client --query "
INSERT INTO recipes
SELECT
title,
JSONExtract(ingredients, 'Array(String)'),
JSONExtract(directions, 'Array(String)'),
link,
source,
JSONExtract(NER, 'Array(String)')
FROM input('num UInt32, title String, ingredients String, directions String, link String, source LowCardinality(String), NER String')
FORMAT CSVWithNames
" --input_format_with_names_use_header 0 --format_csv_allow_single_quote 0 --input_format_allow_errors_num 10 < full_dataset.csv
```
This is a showcase of how to parse a custom CSV file, as it requires tuning several settings.
Explanation:
- the dataset is in CSV format, but it requires some preprocessing on insertion; we use the table function [input](../../sql-reference/table-functions/input/) to perform the preprocessing;
- the structure of the CSV file is specified in the argument of the table function `input`;
- the field `num` (row number) is unneeded; we parse it from the file and ignore it;
- we use `FORMAT CSVWithNames`, but the header in the CSV will be ignored (via the command line parameter `--input_format_with_names_use_header 0`), because the header does not contain the name of the first field;
- the file uses only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and a single quote must not be parsed as a string delimiter, which is why we also add the `--format_csv_allow_single_quote 0` parameter;
- some strings from the CSV cannot be parsed, because they contain the `\M/` sequence at the beginning of the value; the only value that can start with a backslash in CSV is `\N`, which is parsed as SQL NULL. We add the `--input_format_allow_errors_num 10` parameter, which allows up to ten malformed records to be skipped;
- there are arrays for the ingredients, directions and NER fields; these arrays are represented in an unusual form: they are serialized into a string as JSON and then placed in the CSV. We parse them as String and then use the [JSONExtract](../../sql-reference/functions/json-functions/) function to transform them to Array (see the sketch after this list).
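A minimal standalone illustration of that last step, using a string literal in place of the CSV column:
``` sql
SELECT JSONExtract('["flour", "sugar"]', 'Array(String)') AS ingredients;
```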
## Validate the inserted data
By checking the row count:
```
SELECT count() FROM recipes
┌─count()─┐
│ 2231141 │
└─────────┘
```
## Example queries
### Top components by the number of recipes:
```
SELECT
arrayJoin(NER) AS k,
count() AS c
FROM recipes
GROUP BY k
ORDER BY c DESC
LIMIT 50
┌─k────────────────────┬──────c─┐
│ salt │ 890741 │
│ sugar │ 620027 │
│ butter │ 493823 │
│ flour │ 466110 │
│ eggs │ 401276 │
│ onion │ 372469 │
│ garlic │ 358364 │
│ milk │ 346769 │
│ water │ 326092 │
│ vanilla │ 270381 │
│ olive oil │ 197877 │
│ pepper │ 179305 │
│ brown sugar │ 174447 │
│ tomatoes │ 163933 │
│ egg │ 160507 │
│ baking powder │ 148277 │
│ lemon juice │ 146414 │
│ Salt │ 122557 │
│ cinnamon │ 117927 │
│ sour cream │ 116682 │
│ cream cheese │ 114423 │
│ margarine │ 112742 │
│ celery │ 112676 │
│ baking soda │ 110690 │
│ parsley │ 102151 │
│ chicken │ 101505 │
│ onions │ 98903 │
│ vegetable oil │ 91395 │
│ oil │ 85600 │
│ mayonnaise │ 84822 │
│ pecans │ 79741 │
│ nuts │ 78471 │
│ potatoes │ 75820 │
│ carrots │ 75458 │
│ pineapple │ 74345 │
│ soy sauce │ 70355 │
│ black pepper │ 69064 │
│ thyme │ 68429 │
│ mustard │ 65948 │
│ chicken broth │ 65112 │
│ bacon │ 64956 │
│ honey │ 64626 │
│ oregano │ 64077 │
│ ground beef │ 64068 │
│ unsalted butter │ 63848 │
│ mushrooms │ 61465 │
│ Worcestershire sauce │ 59328 │
│ cornstarch │ 58476 │
│ green pepper │ 58388 │
│ Cheddar cheese │ 58354 │
└──────────────────────┴────────┘
50 rows in set. Elapsed: 0.112 sec. Processed 2.23 million rows, 361.57 MB (19.99 million rows/s., 3.24 GB/s.)
```
In this example, we learn how to use the [arrayJoin](../../sql-reference/functions/array-join/) function to expand each row into multiple rows, one per array element.
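A minimal standalone illustration of `arrayJoin`:
``` sql
SELECT arrayJoin([1, 2, 3]) AS x; -- produces three rows: 1, 2, 3
```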
### The most complex recipes with strawberry
```
SELECT
title,
length(NER),
length(directions)
FROM recipes
WHERE has(NER, 'strawberry')
ORDER BY length(directions) DESC
LIMIT 10
┌─title────────────────────────────────────────────────────────────┬─length(NER)─┬─length(directions)─┐
│ Chocolate-Strawberry-Orange Wedding Cake │ 24 │ 126 │
│ Strawberry Cream Cheese Crumble Tart │ 19 │ 47 │
│ Charlotte-Style Ice Cream │ 11 │ 45 │
│ Sinfully Good a Million Layers Chocolate Layer Cake, With Strawb │ 31 │ 45 │
│ Sweetened Berries With Elderflower Sherbet │ 24 │ 44 │
│ Chocolate-Strawberry Mousse Cake │ 15 │ 42 │
│ Rhubarb Charlotte with Strawberries and Rum │ 20 │ 42 │
│ Chef Joey's Strawberry Vanilla Tart │ 7 │ 37 │
│ Old-Fashioned Ice Cream Sundae Cake │ 17 │ 37 │
│ Watermelon Cake │ 16 │ 36 │
└──────────────────────────────────────────────────────────────────┴─────────────┴────────────────────┘
10 rows in set. Elapsed: 0.215 sec. Processed 2.23 million rows, 1.48 GB (10.35 million rows/s., 6.86 GB/s.)
```
In this example, we use the [has](../../sql-reference/functions/array-functions/#hasarr-elem) function to filter by array elements and sort by the number of directions.
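A minimal standalone illustration of `has`:
``` sql
SELECT has(['salt', 'sugar'], 'salt'); -- returns 1
```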
There is a wedding cake that requires a full 126 steps to produce!
Let's look at those directions:
```
SELECT arrayJoin(directions)
FROM recipes
WHERE title = 'Chocolate-Strawberry-Orange Wedding Cake'
┌─arrayJoin(directions)───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Position 1 rack in center and 1 rack in bottom third of oven and preheat to 350F. │
│ Butter one 5-inch-diameter cake pan with 2-inch-high sides, one 8-inch-diameter cake pan with 2-inch-high sides and one 12-inch-diameter cake pan with 2-inch-high sides. │
│ Dust pans with flour; line bottoms with parchment. │
│ Combine 1/3 cup orange juice and 2 ounces unsweetened chocolate in heavy small saucepan. │
│ Stir mixture over medium-low heat until chocolate melts. │
│ Remove from heat. │
│ Gradually mix in 1 2/3 cups orange juice. │
│ Sift 3 cups flour, 2/3 cup cocoa, 2 teaspoons baking soda, 1 teaspoon salt and 1/2 teaspoon baking powder into medium bowl. │
│ using electric mixer, beat 1 cup (2 sticks) butter and 3 cups sugar in large bowl until blended (mixture will look grainy). │
│ Add 4 eggs, 1 at a time, beating to blend after each. │
│ Beat in 1 tablespoon orange peel and 1 tablespoon vanilla extract. │
│ Add dry ingredients alternately with orange juice mixture in 3 additions each, beating well after each addition. │
│ Mix in 1 cup chocolate chips. │
│ Transfer 1 cup plus 2 tablespoons batter to prepared 5-inch pan, 3 cups batter to prepared 8-inch pan and remaining batter (about 6 cups) to 12-inch pan. │
│ Place 5-inch and 8-inch pans on center rack of oven. │
│ Place 12-inch pan on lower rack of oven. │
│ Bake cakes until tester inserted into center comes out clean, about 35 minutes. │
│ Transfer cakes in pans to racks and cool completely. │
│ Mark 4-inch diameter circle on one 6-inch-diameter cardboard cake round. │
│ Cut out marked circle. │
│ Mark 7-inch-diameter circle on one 8-inch-diameter cardboard cake round. │
│ Cut out marked circle. │
│ Mark 11-inch-diameter circle on one 12-inch-diameter cardboard cake round. │
│ Cut out marked circle. │
│ Cut around sides of 5-inch-cake to loosen. │
│ Place 4-inch cardboard over pan. │
│ Hold cardboard and pan together; turn cake out onto cardboard. │
│ Peel off parchment.Wrap cakes on its cardboard in foil. │
│ Repeat turning out, peeling off parchment and wrapping cakes in foil, using 7-inch cardboard for 8-inch cake and 11-inch cardboard for 12-inch cake. │
│ Using remaining ingredients, make 1 more batch of cake batter and bake 3 more cake layers as described above. │
│ Cool cakes in pans. │
│ Cover cakes in pans tightly with foil. │
│ (Can be prepared ahead. │
│ Let stand at room temperature up to 1 day or double-wrap all cake layers and freeze up to 1 week. │
│ Bring cake layers to room temperature before using.) │
│ Place first 12-inch cake on its cardboard on work surface. │
│ Spread 2 3/4 cups ganache over top of cake and all the way to edge. │
│ Spread 2/3 cup jam over ganache, leaving 1/2-inch chocolate border at edge. │
│ Drop 1 3/4 cups white chocolate frosting by spoonfuls over jam. │
│ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
│ Rub some cocoa powder over second 12-inch cardboard. │
│ Cut around sides of second 12-inch cake to loosen. │
│ Place cardboard, cocoa side down, over pan. │
│ Turn cake out onto cardboard. │
│ Peel off parchment. │
│ Carefully slide cake off cardboard and onto filling on first 12-inch cake. │
│ Refrigerate. │
│ Place first 8-inch cake on its cardboard on work surface. │
│ Spread 1 cup ganache over top all the way to edge. │
│ Spread 1/4 cup jam over, leaving 1/2-inch chocolate border at edge. │
│ Drop 1 cup white chocolate frosting by spoonfuls over jam. │
│ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
│ Rub some cocoa over second 8-inch cardboard. │
│ Cut around sides of second 8-inch cake to loosen. │
│ Place cardboard, cocoa side down, over pan. │
│ Turn cake out onto cardboard. │
│ Peel off parchment. │
│ Slide cake off cardboard and onto filling on first 8-inch cake. │
│ Refrigerate. │
│ Place first 5-inch cake on its cardboard on work surface. │
│ Spread 1/2 cup ganache over top of cake and all the way to edge. │
│ Spread 2 tablespoons jam over, leaving 1/2-inch chocolate border at edge. │
│ Drop 1/3 cup white chocolate frosting by spoonfuls over jam. │
│ Gently spread frosting over jam, leaving 1/2-inch chocolate border at edge. │
│ Rub cocoa over second 6-inch cardboard. │
│ Cut around sides of second 5-inch cake to loosen. │
│ Place cardboard, cocoa side down, over pan. │
│ Turn cake out onto cardboard. │
│ Peel off parchment. │
│ Slide cake off cardboard and onto filling on first 5-inch cake. │
│ Chill all cakes 1 hour to set filling. │
│ Place 12-inch tiered cake on its cardboard on revolving cake stand. │
│ Spread 2 2/3 cups frosting over top and sides of cake as a first coat. │
│ Refrigerate cake. │
│ Place 8-inch tiered cake on its cardboard on cake stand. │
│ Spread 1 1/4 cups frosting over top and sides of cake as a first coat. │
│ Refrigerate cake. │
│ Place 5-inch tiered cake on its cardboard on cake stand. │
│ Spread 3/4 cup frosting over top and sides of cake as a first coat. │
│ Refrigerate all cakes until first coats of frosting set, about 1 hour. │
│ (Cakes can be made to this point up to 1 day ahead; cover and keep refrigerate.) │
│ Prepare second batch of frosting, using remaining frosting ingredients and following directions for first batch. │
│ Spoon 2 cups frosting into pastry bag fitted with small star tip. │
│ Place 12-inch cake on its cardboard on large flat platter. │
│ Place platter on cake stand. │
│ Using icing spatula, spread 2 1/2 cups frosting over top and sides of cake; smooth top. │
│ Using filled pastry bag, pipe decorative border around top edge of cake. │
│ Refrigerate cake on platter. │
│ Place 8-inch cake on its cardboard on cake stand. │
│ Using icing spatula, spread 1 1/2 cups frosting over top and sides of cake; smooth top. │
│ Using pastry bag, pipe decorative border around top edge of cake. │
│ Refrigerate cake on its cardboard. │
│ Place 5-inch cake on its cardboard on cake stand. │
│ Using icing spatula, spread 3/4 cup frosting over top and sides of cake; smooth top. │
│ Using pastry bag, pipe decorative border around top edge of cake, spooning more frosting into bag if necessary. │
│ Refrigerate cake on its cardboard. │
│ Keep all cakes refrigerated until frosting sets, about 2 hours. │
│ (Can be prepared 2 days ahead. │
│ Cover loosely; keep refrigerated.) │
│ Place 12-inch cake on platter on work surface. │
│ Press 1 wooden dowel straight down into and completely through center of cake. │
│ Mark dowel 1/4 inch above top of frosting. │
│ Remove dowel and cut with serrated knife at marked point. │
│ Cut 4 more dowels to same length. │
│ Press 1 cut dowel back into center of cake. │
│ Press remaining 4 cut dowels into cake, positioning 3 1/2 inches inward from cake edges and spacing evenly. │
│ Place 8-inch cake on its cardboard on work surface. │
│ Press 1 dowel straight down into and completely through center of cake. │
│ Mark dowel 1/4 inch above top of frosting. │
│ Remove dowel and cut with serrated knife at marked point. │
│ Cut 3 more dowels to same length. │
│ Press 1 cut dowel back into center of cake. │
│ Press remaining 3 cut dowels into cake, positioning 2 1/2 inches inward from edges and spacing evenly. │
│ Using large metal spatula as aid, place 8-inch cake on its cardboard atop dowels in 12-inch cake, centering carefully. │
│ Gently place 5-inch cake on its cardboard atop dowels in 8-inch cake, centering carefully. │
│ Using citrus stripper, cut long strips of orange peel from oranges. │
│ Cut strips into long segments. │
│ To make orange peel coils, wrap peel segment around handle of wooden spoon; gently slide peel off handle so that peel keeps coiled shape. │
│ Garnish cake with orange peel coils, ivy or mint sprigs, and some berries. │
│ (Assembled cake can be made up to 8 hours ahead. │
│ Let stand at cool room temperature.) │
│ Remove top and middle cake tiers. │
│ Remove dowels from cakes. │
│ Cut top and middle cakes into slices. │
│ To cut 12-inch cake: Starting 3 inches inward from edge and inserting knife straight down, cut through from top to bottom to make 6-inch-diameter circle in center of cake. │
│ Cut outer portion of cake into slices; cut inner portion into slices and serve with strawberries. │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
126 rows in set. Elapsed: 0.011 sec. Processed 8.19 thousand rows, 5.34 MB (737.75 thousand rows/s., 480.59 MB/s.)
```
### Online playground
The dataset is also available in the [Playground](https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUCiAgICBhcnJheUpvaW4oTkVSKSBBUyBrLAogICAgY291bnQoKSBBUyBjCkZST00gcmVjaXBlcwpHUk9VUCBCWSBrCk9SREVSIEJZIGMgREVTQwpMSU1JVCA1MA==).

View File

@ -11,9 +11,9 @@ ClickHouse provides a native command-line client: `clickhouse-client`. The clien
``` bash
$ clickhouse-client
ClickHouse client version 19.17.1.1579 (official build).
ClickHouse client version 20.13.1.5273 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.17.1 revision 54428.
Connected to ClickHouse server version 20.13.1 revision 54442.
:)
```
@ -127,6 +127,8 @@ You can pass parameters to `clickhouse-client` (all parameters have a default va
- `--history_file` — Path to a file containing command history.
- `--param_<name>` — Value for a [query with parameters](#cli-queries-with-parameters).
Since version 20.5, `clickhouse-client` has automatic syntax highlighting (always enabled).
### Configuration Files {#configuration_files}
`clickhouse-client` uses the first existing file of the following:

View File

@ -60,5 +60,7 @@ toc_title: Client Libraries
- [pillar](https://github.com/sofakingworld/pillar)
- Nim
- [nim-clickhouse](https://github.com/leonardoce/nim-clickhouse)
- Haskell
- [hdbc-clickhouse](https://github.com/zaneli/hdbc-clickhouse)
[Original article](https://clickhouse.tech/docs/en/interfaces/third-party/client_libraries/) <!--hide-->

View File

@ -53,6 +53,11 @@ Merges the intermediate aggregation states in the same way as the -Merge combina
Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array items and returns an array of results. For example, `sumForEach` for the arrays `[1, 2]`, `[3, 4, 5]` and `[6, 7]` returns the result `[10, 13, 5]` after adding together the corresponding array items.
## -Distinct {#agg-functions-combinator-distinct}
Every unique combination of arguments will be aggregated only once. Repeating values are ignored.
Examples: `sum(DISTINCT x)`, `groupArray(DISTINCT x)`, `corrStableDistinct(DISTINCT x, y)` and so on.
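A quick sketch of the effect:
``` sql
SELECT sum(DISTINCT x) FROM (SELECT arrayJoin([1, 1, 2, 3]) AS x); -- returns 6
```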
## -OrDefault {#agg-functions-combinator-ordefault}
Changes the behavior of an aggregate function.
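A quick sketch of the idea, applying the combinator to `avg` over an empty input (`-OrDefault` yields the default value of the return type, `-OrNull` yields `NULL`):
``` sql
SELECT avgOrDefault(number), avgOrNull(number) FROM numbers(0); -- 0 and NULL
```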
@ -244,4 +249,5 @@ FROM people
└────────┴───────────────────────────┘
```
[Original article](https://clickhouse.tech/docs/en/query_language/agg_functions/combinators/) <!--hide-->

View File

@ -11,9 +11,9 @@ ClickHouse provides its own command-line client
``` bash
$ clickhouse-client
ClickHouse client version 19.17.1.1579 (official build).
ClickHouse client version 20.13.1.5273 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.17.1 revision 54428.
Connected to ClickHouse server version 20.13.1 revision 54442.
:)
```
@ -131,6 +131,8 @@ $ clickhouse-client --param_tuple_in_tuple="(10, ('dt', 10))" -q "SELECT * FROM
- `--secure` — If specified, a secure connection to the server will be used.
- `--param_<name>` — Value of a parameter for a [query with parameters](#cli-queries-with-parameters).
Since version 20.5, `clickhouse-client` has automatic syntax highlighting (always enabled).
### Configuration Files {#konfiguratsionnye-faily}
`clickhouse-client` uses the first existing file of the following:

View File

@ -51,6 +51,11 @@ toc_title: "\u041a\u043e\u043c\u0431\u0438\u043d\u0430\u0442\u043e\u0440\u044b\u
Converts an aggregate function for tables into an aggregate function for arrays that applies the aggregation to the corresponding array elements and returns an array of results. For example, `sumForEach` for the arrays `[1, 2]`, `[3, 4, 5]` and `[6, 7]` gives the result `[10, 13, 5]` after adding together the corresponding array elements.
## -Distinct {#agg-functions-combinator-distinct}
With the `Distinct` combinator, every unique combination of argument values is taken into account by the aggregate function only once.
Examples: `sum(DISTINCT x)`, `groupArray(DISTINCT x)`, `corrStableDistinct(DISTINCT x, y)` and so on.
## -OrDefault {#agg-functions-combinator-ordefault}
Changes the behavior of an aggregate function.
@ -59,7 +64,7 @@ toc_title: "\u041a\u043e\u043c\u0431\u0438\u043d\u0430\u0442\u043e\u0440\u044b\u
`-OrDefault` can be used with other combinators.
**Syntax**
``` sql
<aggFunction>OrDefault(x)
@ -69,8 +74,8 @@ toc_title: "\u041a\u043e\u043c\u0431\u0438\u043d\u0430\u0442\u043e\u0440\u044b\u
- `x` — Parameters of the aggregate function.
**Returned values**
Returns the default value of the aggregate function's return type if there is nothing to aggregate.
The data type depends on the aggregate function used.
@ -116,11 +121,11 @@ FROM
Changes the behavior of an aggregate function.
The combinator converts the result of an aggregate function to the [Nullable](../data-types/nullable.md) data type. If the aggregate function has no input data, with the combinator it returns [NULL](../syntax.md#null-literal).
`-OrNull` can be used with other combinators.
**Syntax**
``` sql
<aggFunction>OrNull(x)
@ -129,8 +134,8 @@ FROM
**Parameters**
- `x` — Parameters of the aggregate function.
**Returned values**
- The result of the aggregate function, converted to the `Nullable` data type.
- `NULL`, if the aggregate function has no input data.

View File

@ -376,7 +376,7 @@ bool ContextAccess::checkAccessImpl2(const AccessFlags & flags, const Args &...
return true;
};
auto access_denied = [&](const String & error_msg, int error_code)
auto access_denied = [&](const String & error_msg, int error_code [[maybe_unused]])
{
if (trace_log)
LOG_TRACE(trace_log, "Access denied: {}{}", (AccessRightsElement{flags, args...}.toString()),
@ -558,7 +558,7 @@ bool ContextAccess::checkAdminOptionImpl2(const Container & role_ids, const GetN
if (!std::size(role_ids) || is_full_access)
return true;
auto show_error = [this](const String & msg, int error_code)
auto show_error = [this](const String & msg, int error_code [[maybe_unused]])
{
UNUSED(this);
if constexpr (throw_if_denied)

View File

@ -13,12 +13,12 @@ struct MultiEnum
MultiEnum() = default;
template <typename ... EnumValues, typename = std::enable_if_t<std::conjunction_v<std::is_same<EnumTypeT, EnumValues>...>>>
explicit MultiEnum(EnumValues ... v)
constexpr explicit MultiEnum(EnumValues ... v)
: MultiEnum((toBitFlag(v) | ... | 0u))
{}
template <typename ValueType, typename = std::enable_if_t<std::is_convertible_v<ValueType, StorageType>>>
explicit MultiEnum(ValueType v)
constexpr explicit MultiEnum(ValueType v)
: bitset(v)
{
static_assert(std::is_unsigned_v<ValueType>);
@ -95,5 +95,5 @@ struct MultiEnum
private:
StorageType bitset = 0;
static StorageType toBitFlag(EnumType v) { return StorageType{1} << static_cast<StorageType>(v); }
static constexpr StorageType toBitFlag(EnumType v) { return StorageType{1} << static_cast<StorageType>(v); }
};

View File

@ -14,6 +14,7 @@
#include <DataTypes/DataTypeString.h>
#include <DataTypes/DataTypeDateTime.h>
#include <DataTypes/DataTypeDateTime64.h>
#include <DataTypes/DataTypeEnum.h>
#include <DataTypes/DataTypesNumber.h>
#include <DataTypes/DataTypesDecimal.h>
@ -360,9 +361,9 @@ DataTypePtr getLeastSupertype(const DataTypes & types)
maximize(max_bits_of_unsigned_integer, 64);
else if (typeid_cast<const DataTypeUInt256 *>(type.get()))
maximize(max_bits_of_unsigned_integer, 256);
else if (typeid_cast<const DataTypeInt8 *>(type.get()))
else if (typeid_cast<const DataTypeInt8 *>(type.get()) || typeid_cast<const DataTypeEnum8 *>(type.get()))
maximize(max_bits_of_signed_integer, 8);
else if (typeid_cast<const DataTypeInt16 *>(type.get()))
else if (typeid_cast<const DataTypeInt16 *>(type.get()) || typeid_cast<const DataTypeEnum16 *>(type.get()))
maximize(max_bits_of_signed_integer, 16);
else if (typeid_cast<const DataTypeInt32 *>(type.get()))
maximize(max_bits_of_signed_integer, 32);
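For context, this change lets `Enum8`/`Enum16` participate in common-supertype deduction as 8/16-bit signed integers. A hedged sketch of a query that relies on it, with a hypothetical table and columns:
``` sql
-- CREATE TABLE t (e Enum8('a' = 1, 'b' = 2), i Int8) ENGINE = Memory;
SELECT if(i > 0, e, i) FROM t; -- the branches now resolve to the common supertype Int8
```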

View File

@ -27,8 +27,14 @@ void ExpressionInfoMatcher::visit(const ASTFunction & ast_function, const ASTPtr
const auto & function = FunctionFactory::instance().tryGet(ast_function.name, data.context);
/// Skip lambda, tuple and other special functions
if (function && function->isStateful())
data.is_stateful_function = true;
if (function)
{
if (function->isStateful())
data.is_stateful_function = true;
if (!function->isDeterministicInScopeOfQuery())
data.is_deterministic_function = false;
}
}
}

View File

@ -21,6 +21,7 @@ struct ExpressionInfoMatcher
bool is_array_join = false;
bool is_stateful_function = false;
bool is_aggregate_function = false;
bool is_deterministic_function = true;
std::unordered_set<size_t> unique_reference_tables_pos = {};
};

View File

@ -34,6 +34,7 @@
#include <Interpreters/QueryLog.h>
#include <Interpreters/TranslateQualifiedNamesVisitor.h>
#include <Interpreters/getTableExpressions.h>
#include <Interpreters/processColumnTransformers.h>
namespace
{
@ -52,7 +53,6 @@ namespace ErrorCodes
extern const int LOGICAL_ERROR;
}
InterpreterInsertQuery::InterpreterInsertQuery(
const ASTPtr & query_ptr_, const Context & context_, bool allow_materialized_, bool no_squash_, bool no_destination_)
: query_ptr(query_ptr_)
@ -95,27 +95,7 @@ Block InterpreterInsertQuery::getSampleBlock(
Block table_sample = metadata_snapshot->getSampleBlock();
/// Process column transformers (e.g. * EXCEPT(a)), asterisks and qualified columns.
const auto & columns = metadata_snapshot->getColumns();
auto names_and_types = columns.getOrdinary();
removeDuplicateColumns(names_and_types);
auto table_expr = std::make_shared<ASTTableExpression>();
table_expr->database_and_table_name = createTableIdentifier(table->getStorageID());
table_expr->children.push_back(table_expr->database_and_table_name);
TablesWithColumns tables_with_columns;
tables_with_columns.emplace_back(DatabaseAndTableWithAlias(*table_expr, context.getCurrentDatabase()), names_and_types);
tables_with_columns[0].addHiddenColumns(columns.getMaterialized());
tables_with_columns[0].addHiddenColumns(columns.getAliases());
tables_with_columns[0].addHiddenColumns(table->getVirtuals());
NameSet source_columns_set;
for (const auto & identifier : query.columns->children)
source_columns_set.insert(identifier->getColumnName());
TranslateQualifiedNamesVisitor::Data visitor_data(source_columns_set, tables_with_columns);
TranslateQualifiedNamesVisitor visitor(visitor_data);
auto columns_ast = query.columns->clone();
visitor.visit(columns_ast);
const auto columns_ast = processColumnTransformers(context.getCurrentDatabase(), table, metadata_snapshot, query.columns);
/// Form the block based on the column names from the query
Block res;

View File

@ -5,13 +5,18 @@
#include <Interpreters/InterpreterOptimizeQuery.h>
#include <Access/AccessRightsElement.h>
#include <Common/typeid_cast.h>
#include <Parsers/ASTExpressionList.h>
#include <Interpreters/processColumnTransformers.h>
#include <memory>
namespace DB
{
namespace ErrorCodes
{
extern const int THERE_IS_NO_COLUMN;
}
@ -27,7 +32,44 @@ BlockIO InterpreterOptimizeQuery::execute()
auto table_id = context.resolveStorageID(ast, Context::ResolveOrdinary);
StoragePtr table = DatabaseCatalog::instance().getTable(table_id, context);
auto metadata_snapshot = table->getInMemoryMetadataPtr();
table->optimize(query_ptr, metadata_snapshot, ast.partition, ast.final, ast.deduplicate, context);
// An empty list of names means we deduplicate by all columns, but the user can explicitly state which columns to use.
Names column_names;
if (ast.deduplicate_by_columns)
{
// User requested custom set of columns for deduplication, possibly with Column Transformer expression.
{
// Expand asterisk, column transformers, etc into list of column names.
const auto cols = processColumnTransformers(context.getCurrentDatabase(), table, metadata_snapshot, ast.deduplicate_by_columns);
for (const auto & col : cols->children)
column_names.emplace_back(col->getColumnName());
}
metadata_snapshot->check(column_names, NamesAndTypesList{}, table_id);
Names required_columns;
{
required_columns = metadata_snapshot->getColumnsRequiredForSortingKey();
const auto partitioning_cols = metadata_snapshot->getColumnsRequiredForPartitionKey();
required_columns.reserve(required_columns.size() + partitioning_cols.size());
required_columns.insert(required_columns.end(), partitioning_cols.begin(), partitioning_cols.end());
}
for (const auto & required_col : required_columns)
{
// Deduplication is performed only for adjacent rows in a block,
// and all rows in a block are in the sorting key order within a single partition,
// hence deduplication always implicitly takes sorting keys and partition keys into account.
// So we just explicitly state that limitation in order to avoid confusion.
if (std::find(column_names.begin(), column_names.end(), required_col) == column_names.end())
throw Exception(ErrorCodes::THERE_IS_NO_COLUMN,
"DEDUPLICATE BY expression must include all columns used in table's"
" ORDER BY, PRIMARY KEY, or PARTITION BY but '{}' is missing."
" Expanded DEDUPLICATE BY columns expression: ['{}']",
required_col, fmt::join(column_names, "', '"));
}
}
table->optimize(query_ptr, metadata_snapshot, ast.partition, ast.final, ast.deduplicate, column_names, context);
return {};
}
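For context, the syntax this interpreter handles, as exercised by the parser tests later in this commit (table and column names are placeholders):
``` sql
-- The explicit column list must cover the table's ORDER BY / PRIMARY KEY /
-- PARTITION BY columns, per the check above:
OPTIMIZE TABLE table_name DEDUPLICATE BY a, b, c;
-- Column transformers may build the list, but only EXCEPT is allowed:
OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT (c);
```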

View File

@ -90,8 +90,8 @@ std::vector<ASTs> PredicateExpressionsOptimizer::extractTablesPredicates(const A
ExpressionInfoVisitor::Data expression_info{.context = context, .tables = tables_with_columns};
ExpressionInfoVisitor(expression_info).visit(predicate_expression);
if (expression_info.is_stateful_function)
return {}; /// give up the optimization when the predicate contains stateful function
if (expression_info.is_stateful_function || !expression_info.is_deterministic_function)
return {}; /// Not optimized when the predicate contains a stateful or non-deterministic function
if (!expression_info.is_array_join)
{

View File

@ -0,0 +1,49 @@
#include <Interpreters/processColumnTransformers.h>
#include <Interpreters/DatabaseAndTableWithAlias.h>
#include <Interpreters/TranslateQualifiedNamesVisitor.h>
#include <Interpreters/getTableExpressions.h>
#include <Parsers/ASTIdentifier.h>
#include <Parsers/ASTTablesInSelectQuery.h>
#include <Parsers/IAST.h>
#include <Storages/IStorage.h>
#include <Storages/StorageInMemoryMetadata.h>
namespace DB
{
ASTPtr processColumnTransformers(
const String & current_database,
const StoragePtr & table,
const StorageMetadataPtr & metadata_snapshot,
ASTPtr query_columns)
{
const auto & columns = metadata_snapshot->getColumns();
auto names_and_types = columns.getOrdinary();
removeDuplicateColumns(names_and_types);
TablesWithColumns tables_with_columns;
{
auto table_expr = std::make_shared<ASTTableExpression>();
table_expr->database_and_table_name = createTableIdentifier(table->getStorageID());
table_expr->children.push_back(table_expr->database_and_table_name);
tables_with_columns.emplace_back(DatabaseAndTableWithAlias(*table_expr, current_database), names_and_types);
}
tables_with_columns[0].addHiddenColumns(columns.getMaterialized());
tables_with_columns[0].addHiddenColumns(columns.getAliases());
tables_with_columns[0].addHiddenColumns(table->getVirtuals());
NameSet source_columns_set;
for (const auto & identifier : query_columns->children)
source_columns_set.insert(identifier->getColumnName());
TranslateQualifiedNamesVisitor::Data visitor_data(source_columns_set, tables_with_columns);
TranslateQualifiedNamesVisitor visitor(visitor_data);
auto columns_ast = query_columns->clone();
visitor.visit(columns_ast);
return columns_ast;
}
}

View File

@ -0,0 +1,19 @@
#pragma once
#include <Parsers/IAST_fwd.h>
#include <Storages/IStorage_fwd.h>
namespace DB
{
struct StorageInMemoryMetadata;
using StorageMetadataPtr = std::shared_ptr<const StorageInMemoryMetadata>;
/// Process column transformers (e.g. * EXCEPT(a)), asterisks and qualified columns.
ASTPtr processColumnTransformers(
const String & current_database,
const StoragePtr & table,
const StorageMetadataPtr & metadata_snapshot,
ASTPtr query_columns);
}

View File

@ -157,6 +157,7 @@ SRCS(
interpretSubquery.cpp
join_common.cpp
loadMetadata.cpp
processColumnTransformers.cpp
sortBlock.cpp
)

View File

@ -23,6 +23,12 @@ void ASTOptimizeQuery::formatQueryImpl(const FormatSettings & settings, FormatSt
if (deduplicate)
settings.ostr << (settings.hilite ? hilite_keyword : "") << " DEDUPLICATE" << (settings.hilite ? hilite_none : "");
if (deduplicate_by_columns)
{
settings.ostr << (settings.hilite ? hilite_keyword : "") << " BY " << (settings.hilite ? hilite_none : "");
deduplicate_by_columns->formatImpl(settings, state, frame);
}
}
}

View File

@ -16,9 +16,11 @@ public:
/// The partition to optimize can be specified.
ASTPtr partition;
/// A flag can be specified - perform optimization "to the end" instead of one step.
bool final;
bool final = false;
/// Do deduplicate (default: false)
bool deduplicate;
bool deduplicate = false;
/// Deduplicate by columns.
ASTPtr deduplicate_by_columns;
/** Get the text that identifies this element. */
String getID(char delim) const override
@ -37,6 +39,12 @@ public:
res->children.push_back(res->partition);
}
if (deduplicate_by_columns)
{
res->deduplicate_by_columns = deduplicate_by_columns->clone();
res->children.push_back(res->deduplicate_by_columns);
}
return res;
}

View File

@ -1259,7 +1259,7 @@ bool ParserColumnsMatcher::parseImpl(Pos & pos, ASTPtr & node, Expected & expect
res->children.push_back(regex_node);
}
ParserColumnsTransformers transformers_p;
ParserColumnsTransformers transformers_p(allowed_transformers);
ASTPtr transformer;
while (transformers_p.parse(pos, transformer, expected))
{
@ -1278,7 +1278,7 @@ bool ParserColumnsTransformers::parseImpl(Pos & pos, ASTPtr & node, Expected & e
ParserKeyword as("AS");
ParserKeyword strict("STRICT");
if (apply.ignore(pos, expected))
if (allowed_transformers.isSet(ColumnTransformer::APPLY) && apply.ignore(pos, expected))
{
bool with_open_round_bracket = false;
@ -1331,7 +1331,7 @@ bool ParserColumnsTransformers::parseImpl(Pos & pos, ASTPtr & node, Expected & e
node = std::move(res);
return true;
}
else if (except.ignore(pos, expected))
else if (allowed_transformers.isSet(ColumnTransformer::EXCEPT) && except.ignore(pos, expected))
{
if (strict.ignore(pos, expected))
is_strict = true;
@ -1371,7 +1371,7 @@ bool ParserColumnsTransformers::parseImpl(Pos & pos, ASTPtr & node, Expected & e
node = std::move(res);
return true;
}
else if (replace.ignore(pos, expected))
else if (allowed_transformers.isSet(ColumnTransformer::REPLACE) && replace.ignore(pos, expected))
{
if (strict.ignore(pos, expected))
is_strict = true;
@ -1434,7 +1434,7 @@ bool ParserAsterisk::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
++pos;
auto asterisk = std::make_shared<ASTAsterisk>();
ParserColumnsTransformers transformers_p;
ParserColumnsTransformers transformers_p(allowed_transformers);
ASTPtr transformer;
while (transformers_p.parse(pos, transformer, expected))
{

View File

@ -1,6 +1,7 @@
#pragma once
#include <Core/Field.h>
#include <Core/MultiEnum.h>
#include <Parsers/IParserBase.h>
@ -70,12 +71,47 @@ protected:
bool allow_query_parameter;
};
/** *, t.*, db.table.*, COLUMNS('<regular expression>') APPLY(...) or EXCEPT(...) or REPLACE(...)
*/
class ParserColumnsTransformers : public IParserBase
{
public:
enum class ColumnTransformer : UInt8
{
APPLY,
EXCEPT,
REPLACE,
};
using ColumnTransformers = MultiEnum<ColumnTransformer, UInt8>;
static constexpr auto AllTransformers = ColumnTransformers{ColumnTransformer::APPLY, ColumnTransformer::EXCEPT, ColumnTransformer::REPLACE};
ParserColumnsTransformers(ColumnTransformers allowed_transformers_ = AllTransformers, bool is_strict_ = false)
: allowed_transformers(allowed_transformers_)
, is_strict(is_strict_)
{}
protected:
const char * getName() const override { return "COLUMNS transformers"; }
bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
ColumnTransformers allowed_transformers;
bool is_strict;
};
/// Just *
class ParserAsterisk : public IParserBase
{
public:
using ColumnTransformers = ParserColumnsTransformers::ColumnTransformers;
ParserAsterisk(ColumnTransformers allowed_transformers_ = ParserColumnsTransformers::AllTransformers)
: allowed_transformers(allowed_transformers_)
{}
protected:
const char * getName() const override { return "asterisk"; }
bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
ColumnTransformers allowed_transformers;
};
/** Something like t.* or db.table.*
@ -91,21 +127,17 @@ protected:
*/
class ParserColumnsMatcher : public IParserBase
{
public:
using ColumnTransformers = ParserColumnsTransformers::ColumnTransformers;
ParserColumnsMatcher(ColumnTransformers allowed_transformers_ = ParserColumnsTransformers::AllTransformers)
: allowed_transformers(allowed_transformers_)
{}
protected:
const char * getName() const override { return "COLUMNS matcher"; }
bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
};
/** *, t.*, db.table.*, COLUMNS('<regular expression>') APPLY(...) or EXCEPT(...) or REPLACE(...)
*/
class ParserColumnsTransformers : public IParserBase
{
public:
ParserColumnsTransformers(bool is_strict_ = false): is_strict(is_strict_) {}
protected:
const char * getName() const override { return "COLUMNS transformers"; }
bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
bool is_strict;
ColumnTransformers allowed_transformers;
};
/** A function, for example, f(x, y + 1, g(z)).
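For reference, these transformers appear in queries like the following (a sketch with a hypothetical table `t`):
``` sql
SELECT * EXCEPT (b) FROM t;
SELECT COLUMNS('^c') APPLY (sum) FROM t;
SELECT * REPLACE (a + 1 AS a) FROM t;
```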

View File

@ -4,11 +4,24 @@
#include <Parsers/ASTOptimizeQuery.h>
#include <Parsers/ASTIdentifier.h>
#include <Parsers/ExpressionListParsers.h>
namespace DB
{
bool ParserOptimizeQueryColumnsSpecification::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
// Do not allow APPLY and REPLACE transformers.
// Since we use Columns Transformers only to get a list of columns,
// and we can't actually modify the content of the columns for deduplication.
const auto allowed_transformers = ParserColumnsTransformers::ColumnTransformers{ParserColumnsTransformers::ColumnTransformer::EXCEPT};
return ParserColumnsMatcher(allowed_transformers).parse(pos, node, expected)
|| ParserAsterisk(allowed_transformers).parse(pos, node, expected)
|| ParserIdentifier(false).parse(pos, node, expected);
}
bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
@ -16,6 +29,7 @@ bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expecte
ParserKeyword s_partition("PARTITION");
ParserKeyword s_final("FINAL");
ParserKeyword s_deduplicate("DEDUPLICATE");
ParserKeyword s_by("BY");
ParserToken s_dot(TokenType::Dot);
ParserIdentifier name_p;
ParserPartition partition_p;
@ -55,6 +69,14 @@ bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expecte
if (s_deduplicate.ignore(pos, expected))
deduplicate = true;
ASTPtr deduplicate_by_columns;
if (deduplicate && s_by.ignore(pos, expected))
{
if (!ParserList(std::make_unique<ParserOptimizeQueryColumnsSpecification>(), std::make_unique<ParserToken>(TokenType::Comma), false)
.parse(pos, deduplicate_by_columns, expected))
return false;
}
auto query = std::make_shared<ASTOptimizeQuery>();
node = query;
@ -66,6 +88,7 @@ bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expecte
query->children.push_back(partition);
query->final = final;
query->deduplicate = deduplicate;
query->deduplicate_by_columns = deduplicate_by_columns;
return true;
}

View File

@ -7,6 +7,13 @@
namespace DB
{
class ParserOptimizeQueryColumnsSpecification : public IParserBase
{
protected:
const char * getName() const override { return "column specification for OPTIMIZE ... DEDUPLICATE BY"; }
bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override;
};
/** Query OPTIMIZE TABLE [db.]name [PARTITION partition] [FINAL] [DEDUPLICATE]
*/
class ParserOptimizeQuery : public IParserBase

View File

@ -0,0 +1,134 @@
#include <Parsers/ParserOptimizeQuery.h>
#include <Parsers/ParserQueryWithOutput.h>
#include <Parsers/parseQuery.h>
#include <Parsers/formatAST.h>
#include <IO/WriteBufferFromOStream.h>
#include <string_view>
#include <gtest/gtest.h>
namespace
{
using namespace DB;
using namespace std::literals;
}
struct ParserTestCase
{
std::shared_ptr<IParser> parser;
const std::string_view input_text;
const char * expected_ast = nullptr;
};
std::ostream & operator<<(std::ostream & ostr, const ParserTestCase & test_case)
{
return ostr << "parser: " << test_case.parser->getName() << ", input: " << test_case.input_text;
}
class ParserTest : public ::testing::TestWithParam<ParserTestCase>
{};
TEST_P(ParserTest, parseQuery)
{
const auto & [parser, input_text, expected_ast] = GetParam();
ASSERT_NE(nullptr, parser);
if (expected_ast)
{
ASTPtr ast;
ASSERT_NO_THROW(ast = parseQuery(*parser, input_text.begin(), input_text.end(), 0, 0));
EXPECT_EQ(expected_ast, serializeAST(*ast->clone(), false));
}
else
{
ASSERT_THROW(parseQuery(*parser, input_text.begin(), input_text.end(), 0, 0), DB::Exception);
}
}
INSTANTIATE_TEST_SUITE_P(ParserOptimizeQuery, ParserTest, ::testing::Values(
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('a, b')",
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('a, b')"
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]')",
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]')"
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') EXCEPT b",
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') EXCEPT b"
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') EXCEPT (a, b)",
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') EXCEPT (a, b)"
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY a, b, c",
"OPTIMIZE TABLE table_name DEDUPLICATE BY a, b, c"
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY *",
"OPTIMIZE TABLE table_name DEDUPLICATE BY *"
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT a",
"OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT a"
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT (a, b)",
"OPTIMIZE TABLE table_name DEDUPLICATE BY * EXCEPT (a, b)"
}
));
INSTANTIATE_TEST_SUITE_P(ParserOptimizeQuery_FAIL, ParserTest, ::testing::Values(
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY",
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') APPLY(x)",
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY COLUMNS('[a]') REPLACE(y)",
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY * APPLY(x)",
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY * REPLACE(y)",
},
ParserTestCase
{
std::make_shared<ParserOptimizeQuery>(),
"OPTIMIZE TABLE table_name DEDUPLICATE BY db.a, db.b, db.c",
}
));

View File

@ -380,6 +380,7 @@ public:
const ASTPtr & /*partition*/,
bool /*final*/,
bool /*deduplicate*/,
const Names & /* deduplicate_by_columns */,
const Context & /*context*/)
{
throw Exception("Method optimize is not supported by storage " + getName(), ErrorCodes::NOT_IMPLEMENTED);

View File

@ -652,7 +652,8 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
time_t time_of_merge,
const Context & context,
const ReservationPtr & space_reservation,
bool deduplicate)
bool deduplicate,
const Names & deduplicate_by_columns)
{
static const String TMP_PREFIX = "tmp_merge_";
@ -667,6 +668,13 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
const MergeTreeData::DataPartsVector & parts = future_part.parts;
LOG_DEBUG(log, "Merging {} parts: from {} to {} into {}", parts.size(), parts.front()->name, parts.back()->name, future_part.type.toString());
if (deduplicate)
{
if (deduplicate_by_columns.empty())
LOG_DEBUG(log, "DEDUPLICATE BY all columns");
else
LOG_DEBUG(log, "DEDUPLICATE BY ('{}')", fmt::join(deduplicate_by_columns, "', '"));
}
auto disk = space_reservation->getDisk();
String part_path = data.relative_data_path;
@ -891,7 +899,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
BlockInputStreamPtr merged_stream = std::make_shared<PipelineExecutingBlockInputStream>(std::move(pipeline));
if (deduplicate)
merged_stream = std::make_shared<DistinctSortedBlockInputStream>(merged_stream, sort_description, SizeLimits(), 0 /*limit_hint*/, Names());
merged_stream = std::make_shared<DistinctSortedBlockInputStream>(merged_stream, sort_description, SizeLimits(), 0 /*limit_hint*/, deduplicate_by_columns);
if (need_remove_expired_values)
merged_stream = std::make_shared<TTLBlockInputStream>(merged_stream, data, metadata_snapshot, new_data_part, time_of_merge, force_ttl);

View File

@ -127,7 +127,8 @@ public:
time_t time_of_merge,
const Context & context,
const ReservationPtr & space_reservation,
bool deduplicate);
bool deduplicate,
const Names & deduplicate_by_columns);
/// Mutate a single data part with the specified commands. Will create and return a temporary part.
MergeTreeData::MutableDataPartPtr mutatePartToTemporaryPart(

View File

@ -6,6 +6,7 @@
#include <IO/ReadBufferFromString.h>
#include <IO/WriteBufferFromString.h>
#include <IO/ReadHelpers.h>
#include <IO/WriteHelpers.h>
namespace DB
@ -16,15 +17,29 @@ namespace ErrorCodes
extern const int LOGICAL_ERROR;
}
enum FormatVersion : UInt8
{
FORMAT_WITH_CREATE_TIME = 2,
FORMAT_WITH_BLOCK_ID = 3,
FORMAT_WITH_DEDUPLICATE = 4,
FORMAT_WITH_UUID = 5,
FORMAT_WITH_DEDUPLICATE_BY_COLUMNS = 6,
FORMAT_LAST
};
void ReplicatedMergeTreeLogEntryData::writeText(WriteBuffer & out) const
{
UInt8 format_version = 4;
UInt8 format_version = FORMAT_WITH_DEDUPLICATE;
if (!deduplicate_by_columns.empty())
format_version = std::max<UInt8>(format_version, FORMAT_WITH_DEDUPLICATE_BY_COLUMNS);
/// Conditionally bump format_version only when uuid has been assigned.
/// If some other feature requires bumping format_version to >= 5 then this code becomes no-op.
if (new_part_uuid != UUIDHelpers::Nil)
format_version = std::max(format_version, static_cast<UInt8>(5));
format_version = std::max<UInt8>(format_version, FORMAT_WITH_UUID);
out << "format version: " << format_version << "\n"
<< "create_time: " << LocalDateTime(create_time ? create_time : time(nullptr)) << "\n"
@ -50,6 +65,17 @@ void ReplicatedMergeTreeLogEntryData::writeText(WriteBuffer & out) const
if (new_part_uuid != UUIDHelpers::Nil)
out << "\ninto_uuid: " << new_part_uuid;
if (!deduplicate_by_columns.empty())
{
out << "\ndeduplicate_by_columns: ";
for (size_t i = 0; i < deduplicate_by_columns.size(); ++i)
{
out << quote << deduplicate_by_columns[i];
if (i != deduplicate_by_columns.size() - 1)
out << ",";
}
}
break;
case DROP_RANGE:
@ -129,10 +155,10 @@ void ReplicatedMergeTreeLogEntryData::readText(ReadBuffer & in)
in >> "format version: " >> format_version >> "\n";
if (format_version < 1 || format_version > 5)
if (format_version < 1 || format_version >= FORMAT_LAST)
throw Exception("Unknown ReplicatedMergeTreeLogEntry format version: " + DB::toString(format_version), ErrorCodes::UNKNOWN_FORMAT_VERSION);
if (format_version >= 2)
if (format_version >= FORMAT_WITH_CREATE_TIME)
{
LocalDateTime create_time_dt;
in >> "create_time: " >> create_time_dt >> "\n";
@ -141,7 +167,7 @@ void ReplicatedMergeTreeLogEntryData::readText(ReadBuffer & in)
in >> "source replica: " >> source_replica >> "\n";
if (format_version >= 3)
if (format_version >= FORMAT_WITH_BLOCK_ID)
{
in >> "block_id: " >> escape >> block_id >> "\n";
}
@ -167,7 +193,7 @@ void ReplicatedMergeTreeLogEntryData::readText(ReadBuffer & in)
}
in >> new_part_name;
if (format_version >= 4)
if (format_version >= FORMAT_WITH_DEDUPLICATE)
{
in >> "\ndeduplicate: " >> deduplicate;
@ -184,6 +210,20 @@ void ReplicatedMergeTreeLogEntryData::readText(ReadBuffer & in)
}
else if (checkString("into_uuid: ", in))
in >> new_part_uuid;
else if (checkString("deduplicate_by_columns: ", in))
{
Strings new_deduplicate_by_columns;
for (;;)
{
String tmp_column_name;
in >> quote >> tmp_column_name;
new_deduplicate_by_columns.emplace_back(std::move(tmp_column_name));
if (!checkString(",", in))
break;
}
deduplicate_by_columns = std::move(new_deduplicate_by_columns);
}
else
trailing_newline_found = true;
}

View File

@ -81,6 +81,7 @@ struct ReplicatedMergeTreeLogEntryData
Strings source_parts;
bool deduplicate = false; /// Do deduplicate on merge
Strings deduplicate_by_columns = {}; // Which columns should be checked for duplicates, empty means 'all' (default).
MergeType merge_type = MergeType::REGULAR;
String column_name;
String index_name;
@ -111,10 +112,10 @@ struct ReplicatedMergeTreeLogEntryData
/// Version of metadata which will be set after this alter
/// Also present in MUTATE_PART command, to track mutations
/// required for complete alter execution.
int alter_version; /// May be equal to -1, if it's normal mutation, not metadata update.
int alter_version = -1; /// May be equal to -1, if it's normal mutation, not metadata update.
/// only ALTER METADATA command
bool have_mutation; /// If this alter requires additional mutation step, for data update
bool have_mutation = false; /// If this alter requires additional mutation step, for data update
String columns_str; /// New columns data corresponding to alter_version
String metadata_str; /// New metadata corresponding to alter_version

View File

@ -0,0 +1,348 @@
#include <Storages/MergeTree/ReplicatedMergeTreeLogEntry.h>
#include <IO/ReadBufferFromString.h>
#include <Core/iostream_debug_helpers.h>
#include <type_traits>
#include <regex>
#include <gtest/gtest.h>
namespace DB
{
std::ostream & operator<<(std::ostream & ostr, const MergeTreeDataPartType & type)
{
return ostr << type.toString();
}
std::ostream & operator<<(std::ostream & ostr, const UInt128 & v)
{
return ostr << v.toHexString();
}
template <typename T, typename Tag>
std::ostream & operator<<(std::ostream & ostr, const StrongTypedef<T, Tag> & v)
{
return ostr << v.toUnderType();
}
std::ostream & operator<<(std::ostream & ostr, const MergeType & v)
{
return ostr << toString(v);
}
}
namespace std
{
std::ostream & operator<<(std::ostream & ostr, const std::exception_ptr & exception)
{
try
{
if (exception)
{
std::rethrow_exception(exception);
}
return ostr << "<NULL EXCEPTION>";
}
catch (const std::exception& e)
{
return ostr << e.what();
}
}
template <typename T>
inline std::ostream& operator<<(std::ostream & ostr, const std::vector<T> & v)
{
ostr << "[";
for (size_t i = 0; i < v.size(); ++i)
{
ostr << v[i];
if (i != v.size() - 1)
ostr << ", ";
}
return ostr << "] (" << v.size() << ") items";
}
}
namespace
{
using namespace DB;
template <typename T>
void compareAttributes(::testing::AssertionResult & result, const char * name, const T & expected_value, const T & actual_value);
#define CMP_ATTRIBUTE(attribute) compareAttributes(result, #attribute, expected.attribute, actual.attribute)
::testing::AssertionResult compare(
const ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry & expected,
const ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry & actual)
{
auto result = ::testing::AssertionSuccess();
CMP_ATTRIBUTE(drop_range_part_name);
CMP_ATTRIBUTE(from_database);
CMP_ATTRIBUTE(from_table);
CMP_ATTRIBUTE(src_part_names);
CMP_ATTRIBUTE(new_part_names);
CMP_ATTRIBUTE(part_names_checksums);
CMP_ATTRIBUTE(columns_version);
return result;
}
template <typename T>
bool compare(const T & expected, const T & actual)
{
return expected == actual;
}
template <typename T>
::testing::AssertionResult compare(const std::shared_ptr<T> & expected, const std::shared_ptr<T> & actual)
{
if (!!expected != !!actual)
return ::testing::AssertionFailure()
<< "expected : " << static_cast<const void*>(expected.get())
<< "\nactual : " << static_cast<const void*>(actual.get());
if (expected && actual)
return compare(*expected, *actual);
return ::testing::AssertionSuccess();
}
template <typename T>
void compareAttributes(::testing::AssertionResult & result, const char * name, const T & expected_value, const T & actual_value)
{
const auto cmp_result = compare(expected_value, actual_value);
if (!cmp_result)
{
if (result)
result = ::testing::AssertionFailure();
result << "\nMismatching attribute: \"" << name << "\"";
if constexpr (std::is_same_v<std::decay_t<decltype(cmp_result)>, ::testing::AssertionResult>)
result << "\n" << cmp_result.message();
else
result << "\n\texpected: " << expected_value
<< "\n\tactual : " << actual_value;
}
}
::testing::AssertionResult compare(const ReplicatedMergeTreeLogEntryData & expected, const ReplicatedMergeTreeLogEntryData & actual)
{
::testing::AssertionResult result = ::testing::AssertionSuccess();
CMP_ATTRIBUTE(znode_name);
CMP_ATTRIBUTE(type);
CMP_ATTRIBUTE(source_replica);
CMP_ATTRIBUTE(new_part_name);
CMP_ATTRIBUTE(new_part_type);
CMP_ATTRIBUTE(block_id);
CMP_ATTRIBUTE(actual_new_part_name);
CMP_ATTRIBUTE(new_part_uuid);
CMP_ATTRIBUTE(source_parts);
CMP_ATTRIBUTE(deduplicate);
CMP_ATTRIBUTE(deduplicate_by_columns);
CMP_ATTRIBUTE(merge_type);
CMP_ATTRIBUTE(column_name);
CMP_ATTRIBUTE(index_name);
CMP_ATTRIBUTE(detach);
CMP_ATTRIBUTE(replace_range_entry);
CMP_ATTRIBUTE(alter_version);
CMP_ATTRIBUTE(have_mutation);
CMP_ATTRIBUTE(columns_str);
CMP_ATTRIBUTE(metadata_str);
CMP_ATTRIBUTE(currently_executing);
CMP_ATTRIBUTE(removed_by_other_entry);
CMP_ATTRIBUTE(num_tries);
CMP_ATTRIBUTE(exception);
CMP_ATTRIBUTE(last_attempt_time);
CMP_ATTRIBUTE(num_postponed);
CMP_ATTRIBUTE(postpone_reason);
CMP_ATTRIBUTE(last_postpone_time);
CMP_ATTRIBUTE(create_time);
CMP_ATTRIBUTE(quorum);
return result;
}
}
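The `CMP_ATTRIBUTE` helper above collects every mismatching field into a single assertion message rather than failing on the first difference. A stripped-down sketch of the same pattern without gtest; the struct, field names, and helper are illustrative only:

``` cpp
#include <iostream>
#include <sstream>
#include <string>

// Illustrative only: compare two structs field by field via a macro that
// stringifies the field name, collecting all mismatches into one report.
struct Entry
{
    int version = 0;
    std::string name;
};

#define CMP_FIELD(field) \
    do \
    { \
        if (!(expected.field == actual.field)) \
            report << "mismatching field \"" #field "\"\n"; \
    } while (false)

std::string compareEntries(const Entry & expected, const Entry & actual)
{
    std::ostringstream report;
    CMP_FIELD(version);
    CMP_FIELD(name);
    return report.str();
}

int main()
{
    const Entry expected{1, "merge"};
    const Entry actual{2, "mutate"};
    std::cout << compareEntries(expected, actual);
    // prints:
    // mismatching field "version"
    // mismatching field "name"
}
```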
class ReplicatedMergeTreeLogEntryDataTest : public ::testing::TestWithParam<std::tuple<ReplicatedMergeTreeLogEntryData, const char* /* serialized RE*/>>
{};
TEST_P(ReplicatedMergeTreeLogEntryDataTest, transcode)
{
const auto & [expected, match_regex] = GetParam();
const auto str = expected.toString();
if (match_regex)
{
try
{
// Use the egrep grammar: there "." also matches newline, and "\n" can be used explicitly.
std::regex re(match_regex, std::regex::egrep);
EXPECT_TRUE(std::regex_match(str, re))
<< "Failed to match serialized ReplicatedMergeTreeLogEntryData: {\n"
<< str << "} \nwith regex: \"" << match_regex << "\"\n";
}
catch (const std::regex_error &e)
{
FAIL() << e.what()
<< " on regex: " << match_regex
<< " (" << strlen(match_regex) << " bytes)" << std::endl;
}
catch (...)
{
throw;
}
}
ReplicatedMergeTreeLogEntryData actual;
{
DB::ReadBufferFromString buffer(str);
EXPECT_NO_THROW(actual.readText(buffer)) << "While reading:\n" << str;
}
ASSERT_TRUE(compare(expected, actual)) << "Via text:\n" << str;
}
// Enabling this warning would ruin test brevity without adding anything in return,
// since most of the fields have default initializers or will be zero-initialized per the standard,
// so values are predictable and stable across runs.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmissing-field-initializers"
INSTANTIATE_TEST_SUITE_P(Merge, ReplicatedMergeTreeLogEntryDataTest,
::testing::ValuesIn(std::initializer_list<std::tuple<ReplicatedMergeTreeLogEntryData, const char*>>{
{
{
// Basic: minimal set of attributes.
.type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
.new_part_type = MergeTreeDataPartType::WIDE,
.create_time = 123, // 0 means 'now' which could cause flaky tests.
},
R"re(^format version: 4.+merge.+into.+deduplicate: 0.+$)re"
},
{
{
.type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
.new_part_type = MergeTreeDataPartType::WIDE,
// Format version 4
.deduplicate = true,
.create_time = 123,
},
R"re(^format version: 4.+merge.+into.+deduplicate: 1.+$)re"
},
{
{
.type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
.new_part_type = MergeTreeDataPartType::WIDE,
// Format version 5
.new_part_uuid = UUID(UInt128(123456789, 10111213141516)),
.create_time = 123,
},
R"re(^format version: 5.+merge.+into.+deduplicate: 0.+into_uuid: 00000000-075b-cd15-0000-093233447e0c.+$)re"
},
{
{
.type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
.new_part_type = MergeTreeDataPartType::WIDE,
// Format version 6
.deduplicate = true,
.deduplicate_by_columns = {"foo", "bar", "qux"},
.create_time = 123,
},
R"re(^format version: 6.+merge.+into.+deduplicate: 1.+deduplicate_by_columns: 'foo','bar','qux'.*$)re"
},
{
{
.type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
.new_part_type = MergeTreeDataPartType::WIDE,
// Mixing features
.new_part_uuid = UUID(UInt128(123456789, 10111213141516)),
.deduplicate = true,
.deduplicate_by_columns = {"foo", "bar", "qux"},
.create_time = 123,
},
R"re(^format version: 6.+merge.+into.+deduplicate: 1.+into_uuid: 00000000-075b-cd15-0000-093233447e0c.+deduplicate_by_columns: 'foo','bar','qux'.*$)re"
},
{
// Validate that exotic column names are serialized/deserialized properly
{
.type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
.new_part_type = MergeTreeDataPartType::WIDE,
// Mixing features
.new_part_uuid = UUID(UInt128(123456789, 10111213141516)),
.deduplicate = true,
.deduplicate_by_columns = {"name with space", "\"column\"", "'column'", "колонка", "\u30ab\u30e9\u30e0", "\x01\x03 column \x10\x11\x12"},
.create_time = 123,
},
R"re(^format version: 6.+merge.+deduplicate_by_columns: 'name with space','"column"','\\'column\\'','колонка')re"
",'\u30ab\u30e9\u30e0','\x01\x03 column \x10\x11\x12'.*$"
},
}));
#pragma GCC diagnostic pop
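As a sanity check on the `into_uuid` strings expected above: assuming the serializer simply prints the two 64-bit halves of the UInt128 in the usual 8-4-4-4-12 hex grouping, `UInt128(123456789, 10111213141516)` renders as `00000000-075b-cd15-0000-093233447e0c`, because 123456789 == 0x075BCD15 and 10111213141516 == 0x093233447E0C:

``` cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Assumed formatting: high and low 64-bit halves printed in 8-4-4-4-12 grouping.
    const uint64_t hi = 123456789;       // 0x00000000075BCD15
    const uint64_t lo = 10111213141516;  // 0x0000093233447E0C
    std::printf("%08x-%04x-%04x-%04x-%012llx\n",
        static_cast<uint32_t>(hi >> 32),
        static_cast<uint32_t>((hi >> 16) & 0xFFFF),
        static_cast<uint32_t>(hi & 0xFFFF),
        static_cast<uint32_t>(lo >> 48),
        static_cast<unsigned long long>(lo & 0xFFFFFFFFFFFFULL));
    // prints: 00000000-075b-cd15-0000-093233447e0c
}
```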
// This is just an example of how to set all fields. It can't be used as is since, depending on type,
// only some fields are serialized/deserialized, and even if everything works perfectly,
// some fields in the deserialized object would stay unset (and hence differ from the expected ones).
// INSTANTIATE_TEST_SUITE_P(Full, ReplicatedMergeTreeLogEntryDataTest,
// ::testing::ValuesIn(std::initializer_list<ReplicatedMergeTreeLogEntryData>{
// {
// .znode_name = "znode name",
// .type = ReplicatedMergeTreeLogEntryData::MERGE_PARTS,
// .source_replica = "source replica",
// .new_part_name = "new part name",
// .new_part_type = MergeTreeDataPartType::WIDE,
// .block_id = "block id",
// .actual_new_part_name = "new part name",
// .new_part_uuid = UUID(UInt128(123456789, 10111213141516)),
// .source_parts = {"part1", "part2"},
// .deduplicate = true,
// .deduplicate_by_columns = {"col1", "col2"},
// .merge_type = MergeType::REGULAR,
// .column_name = "column name",
// .index_name = "index name",
// .detach = false,
// .replace_range_entry = std::make_shared<ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry>(
// ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry
// {
// .drop_range_part_name = "drop range part name",
// .from_database = "from database",
// .src_part_names = {"src part name1", "src part name2"},
// .new_part_names = {"new part name1", "new part name2"},
// .columns_version = 123456,
// }),
// .alter_version = 56789,
// .have_mutation = false,
// .columns_str = "columns str",
// .metadata_str = "metadata str",
// // Those attributes are not serialized to string, hence it makes no sense to set.
// // .currently_executing
// // .removed_by_other_entry
// // .num_tries
// // .exception
// // .last_attempt_time
// // .num_postponed
// // .postpone_reason
// // .last_postpone_time,
// .create_time = static_cast<time_t>(123456789),
// .quorum = 321,
// },
// }));

View File

@ -583,7 +583,7 @@ void StorageBuffer::shutdown()
try
{
optimize(nullptr /*query*/, getInMemoryMetadataPtr(), {} /*partition*/, false /*final*/, false /*deduplicate*/, global_context);
optimize(nullptr /*query*/, getInMemoryMetadataPtr(), {} /*partition*/, false /*final*/, false /*deduplicate*/, {}, global_context);
}
catch (...)
{
@ -608,6 +608,7 @@ bool StorageBuffer::optimize(
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & /* deduplicate_by_columns */,
const Context & /*context*/)
{
if (partition)
@ -906,7 +907,7 @@ void StorageBuffer::alter(const AlterCommands & params, const Context & context,
/// Flush all buffers to storages, so that no non-empty blocks of the old
/// structure remain. Structure of empty blocks will be updated during first
/// insert.
optimize({} /*query*/, metadata_snapshot, {} /*partition_id*/, false /*final*/, false /*deduplicate*/, context);
optimize({} /*query*/, metadata_snapshot, {} /*partition_id*/, false /*final*/, false /*deduplicate*/, {}, context);
StorageInMemoryMetadata new_metadata = *metadata_snapshot;
params.apply(new_metadata, context);

View File

@ -87,6 +87,7 @@ public:
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
const Context & context) override;
bool supportsSampling() const override { return true; }

View File

@ -233,12 +233,13 @@ bool StorageMaterializedView::optimize(
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
const Context & context)
{
checkStatementCanBeForwarded();
auto storage_ptr = getTargetTable();
auto metadata_snapshot = storage_ptr->getInMemoryMetadataPtr();
return getTargetTable()->optimize(query, metadata_snapshot, partition, final, deduplicate, context);
return getTargetTable()->optimize(query, metadata_snapshot, partition, final, deduplicate, deduplicate_by_columns, context);
}
void StorageMaterializedView::alter(

View File

@ -46,6 +46,7 @@ public:
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
const Context & context) override;
void alter(const AlterCommands & params, const Context & context, TableLockHolder & table_lock_holder) override;

View File

@ -741,6 +741,7 @@ bool StorageMergeTree::merge(
const String & partition_id,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
String * out_disable_reason,
bool optimize_skip_merged_partitions)
{
@ -758,10 +759,15 @@ bool StorageMergeTree::merge(
if (!merge_mutate_entry)
return false;
return mergeSelectedParts(metadata_snapshot, deduplicate, *merge_mutate_entry, table_lock_holder);
return mergeSelectedParts(metadata_snapshot, deduplicate, deduplicate_by_columns, *merge_mutate_entry, table_lock_holder);
}
bool StorageMergeTree::mergeSelectedParts(const StorageMetadataPtr & metadata_snapshot, bool deduplicate, MergeMutateSelectedEntry & merge_mutate_entry, TableLockHolder & table_lock_holder)
bool StorageMergeTree::mergeSelectedParts(
const StorageMetadataPtr & metadata_snapshot,
bool deduplicate,
const Names & deduplicate_by_columns,
MergeMutateSelectedEntry & merge_mutate_entry,
TableLockHolder & table_lock_holder)
{
auto & future_part = merge_mutate_entry.future_part;
Stopwatch stopwatch;
@ -786,7 +792,7 @@ bool StorageMergeTree::mergeSelectedParts(const StorageMetadataPtr & metadata_sn
{
new_part = merger_mutator.mergePartsToTemporaryPart(
future_part, metadata_snapshot, *(merge_list_entry), table_lock_holder, time(nullptr),
global_context, merge_mutate_entry.tagger->reserved_space, deduplicate);
global_context, merge_mutate_entry.tagger->reserved_space, deduplicate, deduplicate_by_columns);
merger_mutator.renameMergedTemporaryPart(new_part, future_part.parts, nullptr);
write_part_log({});
@ -953,7 +959,7 @@ std::optional<JobAndPool> StorageMergeTree::getDataProcessingJob()
return JobAndPool{[this, metadata_snapshot, merge_entry, mutate_entry, share_lock] () mutable
{
if (merge_entry)
mergeSelectedParts(metadata_snapshot, false, *merge_entry, share_lock);
mergeSelectedParts(metadata_snapshot, false, {}, *merge_entry, share_lock);
else if (mutate_entry)
mutateSelectedPart(metadata_snapshot, *mutate_entry, share_lock);
}, PoolType::MERGE_MUTATE};
@ -1036,8 +1042,17 @@ bool StorageMergeTree::optimize(
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
const Context & context)
{
if (deduplicate)
{
if (deduplicate_by_columns.empty())
LOG_DEBUG(log, "DEDUPLICATE BY all columns");
else
LOG_DEBUG(log, "DEDUPLICATE BY ('{}')", fmt::join(deduplicate_by_columns, "', '"));
}
String disable_reason;
if (!partition && final)
{
@ -1049,7 +1064,7 @@ bool StorageMergeTree::optimize(
for (const String & partition_id : partition_ids)
{
if (!merge(true, partition_id, true, deduplicate, &disable_reason, context.getSettingsRef().optimize_skip_merged_partitions))
if (!merge(true, partition_id, true, deduplicate, deduplicate_by_columns, &disable_reason, context.getSettingsRef().optimize_skip_merged_partitions))
{
constexpr const char * message = "Cannot OPTIMIZE table: {}";
if (disable_reason.empty())
@ -1068,7 +1083,7 @@ bool StorageMergeTree::optimize(
if (partition)
partition_id = getPartitionIDFromQuery(partition, context);
if (!merge(true, partition_id, final, deduplicate, &disable_reason, context.getSettingsRef().optimize_skip_merged_partitions))
if (!merge(true, partition_id, final, deduplicate, deduplicate_by_columns, &disable_reason, context.getSettingsRef().optimize_skip_merged_partitions))
{
constexpr const char * message = "Cannot OPTIMIZE table: {}";
if (disable_reason.empty())
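The new `LOG_DEBUG` above formats the column list with `fmt::join`. A standalone check of that formatting, assuming a {fmt} version where `fmt::join` comes from `<fmt/ranges.h>`:

``` cpp
#include <fmt/ranges.h>  // fmt::join
#include <string>
#include <vector>

int main()
{
    const std::vector<std::string> cols{"pk", "sk", "val"};
    // Same shape as the LOG_DEBUG above: join with "', '" inside surrounding ('…').
    fmt::print("DEDUPLICATE BY ('{}')\n", fmt::join(cols, "', '"));
    // prints: DEDUPLICATE BY ('pk', 'sk', 'val')
}
```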

View File

@ -70,6 +70,7 @@ public:
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
const Context & context) override;
void mutate(const MutationCommands & commands, const Context & context) override;
@ -132,7 +133,7 @@ private:
* If aggressive, parts are selected without taking their size ratio and novelty into account (used for the OPTIMIZE query).
* Returns true if merge is finished successfully.
*/
bool merge(bool aggressive, const String & partition_id, bool final, bool deduplicate, String * out_disable_reason = nullptr, bool optimize_skip_merged_partitions = false);
bool merge(bool aggressive, const String & partition_id, bool final, bool deduplicate, const Names & deduplicate_by_columns, String * out_disable_reason = nullptr, bool optimize_skip_merged_partitions = false);
ActionLock stopMergesAndWait();
@ -183,7 +184,8 @@ private:
TableLockHolder & table_lock_holder,
bool optimize_skip_merged_partitions = false,
SelectPartsDecision * select_decision_out = nullptr);
bool mergeSelectedParts(const StorageMetadataPtr & metadata_snapshot, bool deduplicate, MergeMutateSelectedEntry & entry, TableLockHolder & table_lock_holder);
bool mergeSelectedParts(const StorageMetadataPtr & metadata_snapshot, bool deduplicate, const Names & deduplicate_by_columns, MergeMutateSelectedEntry & entry, TableLockHolder & table_lock_holder);
std::shared_ptr<MergeMutateSelectedEntry> selectPartsToMutate(const StorageMetadataPtr & metadata_snapshot, String * disable_reason, TableLockHolder & table_lock_holder);
bool mutateSelectedPart(const StorageMetadataPtr & metadata_snapshot, MergeMutateSelectedEntry & entry, TableLockHolder & table_lock_holder);

View File

@ -122,9 +122,10 @@ public:
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
const Context & context) override
{
return getNested()->optimize(query, metadata_snapshot, partition, final, deduplicate, context);
return getNested()->optimize(query, metadata_snapshot, partition, final, deduplicate, deduplicate_by_columns, context);
}
void mutate(const MutationCommands & commands, const Context & context) override { getNested()->mutate(commands, context); }
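The `optimize()` hunks in this commit all follow one mechanical pattern: the new `const Names & deduplicate_by_columns` parameter is threaded through every override, with wrapper storages (Buffer, MaterializedView, the nested proxy here) forwarding it untouched while only the MergeTree storages act on it. A toy sketch of that shape; class and method names are illustrative, not the real IStorage interface:

``` cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

using Names = std::vector<std::string>;

struct IStorageLike
{
    virtual ~IStorageLike() = default;
    virtual bool optimize(bool deduplicate, const Names & deduplicate_by_columns) = 0;
};

struct MergeTreeLike : IStorageLike
{
    bool optimize(bool deduplicate, const Names & deduplicate_by_columns) override
    {
        // Only the terminal storage interprets the new argument.
        if (deduplicate && deduplicate_by_columns.empty())
            std::cout << "DEDUPLICATE BY all columns\n";
        else if (deduplicate)
            std::cout << "DEDUPLICATE BY " << deduplicate_by_columns.size() << " columns\n";
        return true;
    }
};

struct ProxyLike : IStorageLike
{
    std::shared_ptr<IStorageLike> nested;
    bool optimize(bool deduplicate, const Names & deduplicate_by_columns) override
    {
        // Wrappers forward the argument untouched, like the overrides in this commit.
        return nested->optimize(deduplicate, deduplicate_by_columns);
    }
};

int main()
{
    ProxyLike proxy;
    proxy.nested = std::make_shared<MergeTreeLike>();
    proxy.optimize(/*deduplicate=*/ true, /*deduplicate_by_columns=*/ {"pk", "sk"});
}
```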

View File

@ -1508,7 +1508,7 @@ bool StorageReplicatedMergeTree::tryExecuteMerge(const LogEntry & entry)
{
part = merger_mutator.mergePartsToTemporaryPart(
future_merged_part, metadata_snapshot, *merge_entry,
table_lock, entry.create_time, global_context, reserved_space, entry.deduplicate);
table_lock, entry.create_time, global_context, reserved_space, entry.deduplicate, entry.deduplicate_by_columns);
merger_mutator.renameMergedTemporaryPart(part, parts, &transaction);
@ -2712,6 +2712,7 @@ void StorageReplicatedMergeTree::mergeSelectingTask()
const auto storage_settings_ptr = getSettings();
const bool deduplicate = false; /// TODO: read deduplicate option from table config
const Names deduplicate_by_columns = {};
CreateMergeEntryResult create_result = CreateMergeEntryResult::Other;
try
@ -2762,6 +2763,7 @@ void StorageReplicatedMergeTree::mergeSelectingTask()
future_merged_part.uuid,
future_merged_part.type,
deduplicate,
deduplicate_by_columns,
nullptr,
merge_pred.getVersion(),
future_merged_part.merge_type);
@ -2851,6 +2853,7 @@ StorageReplicatedMergeTree::CreateMergeEntryResult StorageReplicatedMergeTree::c
const UUID & merged_part_uuid,
const MergeTreeDataPartType & merged_part_type,
bool deduplicate,
const Names & deduplicate_by_columns,
ReplicatedMergeTreeLogEntryData * out_log_entry,
int32_t log_version,
MergeType merge_type)
@ -2888,6 +2891,7 @@ StorageReplicatedMergeTree::CreateMergeEntryResult StorageReplicatedMergeTree::c
entry.new_part_type = merged_part_type;
entry.merge_type = merge_type;
entry.deduplicate = deduplicate;
entry.deduplicate_by_columns = deduplicate_by_columns;
entry.merge_type = merge_type;
entry.create_time = time(nullptr);
@ -3862,6 +3866,7 @@ BlockOutputStreamPtr StorageReplicatedMergeTree::write(const ASTPtr & /*query*/,
const Settings & query_settings = context.getSettingsRef();
bool deduplicate = storage_settings_ptr->replicated_deduplication_window != 0 && query_settings.insert_deduplicate;
// TODO: should we also somehow pass the list of columns to deduplicate by to ReplicatedMergeTreeBlockOutputStream?
return std::make_shared<ReplicatedMergeTreeBlockOutputStream>(
*this, metadata_snapshot, query_settings.insert_quorum,
query_settings.insert_quorum_timeout.totalMilliseconds(),
@ -3878,6 +3883,7 @@ bool StorageReplicatedMergeTree::optimize(
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
const Context & query_context)
{
assertNotReadonly();
@ -3935,7 +3941,8 @@ bool StorageReplicatedMergeTree::optimize(
ReplicatedMergeTreeLogEntryData merge_entry;
CreateMergeEntryResult create_result = createLogEntryToMergeParts(
zookeeper, future_merged_part.parts,
future_merged_part.name, future_merged_part.uuid, future_merged_part.type, deduplicate,
future_merged_part.name, future_merged_part.uuid, future_merged_part.type,
deduplicate, deduplicate_by_columns,
&merge_entry, can_merge.getVersion(), future_merged_part.merge_type);
if (create_result == CreateMergeEntryResult::MissingPart)
@ -3996,7 +4003,8 @@ bool StorageReplicatedMergeTree::optimize(
ReplicatedMergeTreeLogEntryData merge_entry;
CreateMergeEntryResult create_result = createLogEntryToMergeParts(
zookeeper, future_merged_part.parts,
future_merged_part.name, future_merged_part.uuid, future_merged_part.type, deduplicate,
future_merged_part.name, future_merged_part.uuid, future_merged_part.type,
deduplicate, deduplicate_by_columns,
&merge_entry, can_merge.getVersion(), future_merged_part.merge_type);
if (create_result == CreateMergeEntryResult::MissingPart)

View File

@ -120,6 +120,7 @@ public:
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
const Context & query_context) override;
void alter(const AlterCommands & commands, const Context & query_context, TableLockHolder & table_lock_holder) override;
@ -470,6 +471,7 @@ private:
const UUID & merged_part_uuid,
const MergeTreeDataPartType & merged_part_type,
bool deduplicate,
const Names & deduplicate_by_columns,
ReplicatedMergeTreeLogEntryData * out_log_entry,
int32_t log_version,
MergeType merge_type);

View File

@ -37,7 +37,7 @@
<substitution>
<name>table</name>
<values>
<value>numbers(200000)</value>
<value>numbers(2000000)</value>
</values>
</substitution>
<substitution>

View File

@ -9,7 +9,7 @@ ${CLICKHOUSE_CLIENT} -q "SELECT 'CREATE TABLE test_' || hex(randomPrintableASCII
function stress()
{
while true; do
"${CURDIR}"/01526_client_start_and_exit.expect | grep -v -P 'ClickHouse client|Connecting|Connected|:\) Bye\.|^\s*$|spawn bash|^0\s*$'
"${CURDIR}"/01526_client_start_and_exit.expect | grep -v -P 'ClickHouse client|Connecting|Connected|:\) Bye\.|new year|^\s*$|spawn bash|^0\s*$'
done
}

View File

@ -0,0 +1,39 @@
TOTAL rows 4
OLD DEDUPLICATE
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY *
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY * EXCEPT mat
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY pk,sk,val,mat,partition_key
0 0 0 1
1 1 2 1
1 1 3 1
Can not remove full duplicates
OLD DEDUPLICATE
4
DEDUPLICATE BY pk,sk,val,mat
4
Remove partial duplicates
DEDUPLICATE BY *
3
DEDUPLICATE BY * EXCEPT mat
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY COLUMNS("*") EXCEPT mat
0 0 0 1
1 1 2 1
1 1 3 1
DEDUPLICATE BY pk,sk
0 0 0 1
1 1 2 1
DEDUPLICATE BY COLUMNS(".*k")
0 0 0 1
1 1 2 1

View File

@ -0,0 +1,122 @@
--- See also tests/queries/0_stateless/01581_deduplicate_by_columns_replicated.sql
--- local case
-- Just in case previous test runs left some stuff behind.
DROP TABLE IF EXISTS source_data;
CREATE TABLE source_data (
pk Int32, sk Int32, val UInt32, partition_key UInt32 DEFAULT 1,
PRIMARY KEY (pk)
) ENGINE=MergeTree
ORDER BY (pk, sk);
INSERT INTO source_data (pk, sk, val) VALUES (0, 0, 0), (0, 0, 0), (1, 1, 2), (1, 1, 3);
SELECT 'TOTAL rows', count() FROM source_data;
DROP TABLE IF EXISTS full_duplicates;
-- table with duplicates on MATERIALIZED columns
CREATE TABLE full_duplicates (
pk Int32, sk Int32, val UInt32, partition_key UInt32, mat UInt32 MATERIALIZED 12345, alias UInt32 ALIAS 2,
PRIMARY KEY (pk)
) ENGINE=MergeTree
PARTITION BY (partition_key + 1) -- ensure that a column used inside an expression is properly handled when deduplicating. See [1] below.
ORDER BY (pk, toString(sk * 10)); -- silly order key to ensure that a key column is checked even when it is part of an expression. See [1] below.
-- ERROR cases
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY pk, sk, val, mat, alias; -- { serverError 16 } -- alias column is present
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY sk, val; -- { serverError 8 } -- primary key column is missing
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY * EXCEPT(pk, sk, val, mat, alias, partition_key); -- { serverError 51 } -- list is empty
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY * EXCEPT(pk); -- { serverError 8 } -- primary key column is missing [1]
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY * EXCEPT(sk); -- { serverError 8 } -- sorting key column is missing [1]
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY * EXCEPT(partition_key); -- { serverError 8 } -- partitioning column is missing [1]
OPTIMIZE TABLE full_duplicates DEDUPLICATE BY; -- { clientError 62 } -- empty list is a syntax error
OPTIMIZE TABLE partial_duplicates DEDUPLICATE BY pk,sk,val,mat EXCEPT mat; -- { clientError 62 } -- invalid syntax
OPTIMIZE TABLE partial_duplicates DEDUPLICATE BY pk APPLY(pk + 1); -- { clientError 62 } -- APPLY column transformer is not supported
OPTIMIZE TABLE partial_duplicates DEDUPLICATE BY pk REPLACE(pk + 1); -- { clientError 62 } -- REPLACE column transformer is not supported
-- Valid cases
-- NOTE: here and below we need FINAL to force deduplication of such a small data set, which fits in only 1 part.
SELECT 'OLD DEDUPLICATE';
INSERT INTO full_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE full_duplicates FINAL DEDUPLICATE;
SELECT * FROM full_duplicates;
TRUNCATE full_duplicates;
SELECT 'DEDUPLICATE BY *';
INSERT INTO full_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE full_duplicates FINAL DEDUPLICATE BY *;
SELECT * FROM full_duplicates;
TRUNCATE full_duplicates;
SELECT 'DEDUPLICATE BY * EXCEPT mat';
INSERT INTO full_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE full_duplicates FINAL DEDUPLICATE BY * EXCEPT mat;
SELECT * FROM full_duplicates;
TRUNCATE full_duplicates;
SELECT 'DEDUPLICATE BY pk,sk,val,mat,partition_key';
INSERT INTO full_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE full_duplicates FINAL DEDUPLICATE BY pk,sk,val,mat,partition_key;
SELECT * FROM full_duplicates;
TRUNCATE full_duplicates;
--DROP TABLE full_duplicates;
-- Now on to partial duplicates, where the MATERIALIZED column always has a unique value.
DROP TABLE IF EXISTS partial_duplicates;
CREATE TABLE partial_duplicates (
pk Int32, sk Int32, val UInt32, partition_key UInt32 DEFAULT 1, mat UInt32 MATERIALIZED rand(), alias UInt32 ALIAS 2,
PRIMARY KEY (pk)
) ENGINE=MergeTree
ORDER BY (pk, sk);
SELECT 'Can not remove full duplicates';
-- should not remove anything
SELECT 'OLD DEDUPLICATE';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE;
SELECT count() FROM partial_duplicates;
TRUNCATE partial_duplicates;
SELECT 'DEDUPLICATE BY pk,sk,val,mat';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY pk,sk,val,mat;
SELECT count() FROM partial_duplicates;
TRUNCATE partial_duplicates;
SELECT 'Remove partial duplicates';
SELECT 'DEDUPLICATE BY *'; -- all except MATERIALIZED columns, hence this will reduce the number of rows.
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY *;
SELECT count() FROM partial_duplicates;
TRUNCATE partial_duplicates;
SELECT 'DEDUPLICATE BY * EXCEPT mat';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY * EXCEPT mat;
SELECT * FROM partial_duplicates;
TRUNCATE partial_duplicates;
SELECT 'DEDUPLICATE BY COLUMNS("*") EXCEPT mat';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY COLUMNS('.*') EXCEPT mat;
SELECT * FROM partial_duplicates;
TRUNCATE partial_duplicates;
SELECT 'DEDUPLICATE BY pk,sk';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY pk,sk;
SELECT * FROM partial_duplicates;
TRUNCATE partial_duplicates;
SELECT 'DEDUPLICATE BY COLUMNS(".*k")';
INSERT INTO partial_duplicates SELECT * FROM source_data;
OPTIMIZE TABLE partial_duplicates FINAL DEDUPLICATE BY COLUMNS('.*k');
SELECT * FROM partial_duplicates;
TRUNCATE partial_duplicates;

View File

@ -0,0 +1,47 @@
check that we have data
r1 1 1001 3 2 2
r1 1 2001 1 1 1
r1 2 1002 1 1 1
r1 2 2002 1 1 1
r1 3 1003 2 2 2
r1 4 1004 2 2 2
r1 5 2005 2 2 1
r1 9 1002 1 1 1
r2 1 1001 3 2 2
r2 1 2001 1 1 1
r2 2 1002 1 1 1
r2 2 2002 1 1 1
r2 3 1003 2 2 2
r2 4 1004 2 2 2
r2 5 2005 2 2 1
r2 9 1002 1 1 1
after old OPTIMIZE DEDUPLICATE
r1 1 1001 3 2 2
r1 1 2001 1 1 1
r1 2 1002 1 1 1
r1 2 2002 1 1 1
r1 3 1003 2 2 2
r1 4 1004 2 2 2
r1 5 2005 2 2 1
r1 9 1002 1 1 1
r2 1 1001 3 2 2
r2 1 2001 1 1 1
r2 2 1002 1 1 1
r2 2 2002 1 1 1
r2 3 1003 2 2 2
r2 4 1004 2 2 2
r2 5 2005 2 2 1
r2 9 1002 1 1 1
check data again after multiple deduplications with new syntax
r1 1 1001 1 1 1
r1 2 1002 1 1 1
r1 3 1003 1 1 1
r1 4 1004 1 1 1
r1 5 2005 1 1 1
r1 9 1002 1 1 1
r2 1 1001 1 1 1
r2 2 1002 1 1 1
r2 3 1003 1 1 1
r2 4 1004 1 1 1
r2 5 2005 1 1 1
r2 9 1002 1 1 1

View File

@ -0,0 +1,60 @@
--- See also tests/queries/0_stateless/01581_deduplicate_by_columns_local.sql
--- replicated case
-- Just in case previous test runs left some stuff behind.
DROP TABLE IF EXISTS replicated_deduplicate_by_columns_r1;
DROP TABLE IF EXISTS replicated_deduplicate_by_columns_r2;
SET replication_alter_partitions_sync = 2;
-- In real life, insert_replica_id would be derived from the hostname.
CREATE TABLE IF NOT EXISTS replicated_deduplicate_by_columns_r1 (
id Int32, val UInt32, unique_value UInt64 MATERIALIZED rowNumberInBlock(), insert_replica_id UInt8 MATERIALIZED randConstant()
) ENGINE=ReplicatedMergeTree('/clickhouse/tables/test_01581/replicated_deduplicate', 'r1') ORDER BY id;
CREATE TABLE IF NOT EXISTS replicated_deduplicate_by_columns_r2 (
id Int32, val UInt32, unique_value UInt64 MATERIALIZED rowNumberInBlock(), insert_replica_id UInt8 MATERIALIZED randConstant()
) ENGINE=ReplicatedMergeTree('/clickhouse/tables/test_01581/replicated_deduplicate', 'r2') ORDER BY id;
SYSTEM STOP REPLICATED SENDS;
SYSTEM STOP FETCHES;
SYSTEM STOP REPLICATION QUEUES;
-- Insert some data. Two records, (3, 1003) and (4, 1004), are duplicated and differ in unique_value / insert_replica_id;
-- (1, 1001) and (5, 2005) have full duplicates.
INSERT INTO replicated_deduplicate_by_columns_r1 VALUES (1, 1001), (1, 1001), (2, 1002), (3, 1003), (4, 1004), (1, 2001), (9, 1002);
INSERT INTO replicated_deduplicate_by_columns_r2 VALUES (1, 1001), (2, 2002), (3, 1003), (4, 1004), (5, 2005), (5, 2005);
SYSTEM START REPLICATION QUEUES;
SYSTEM START FETCHES;
SYSTEM START REPLICATED SENDS;
-- wait for replicas to sync
SYSTEM SYNC REPLICA replicated_deduplicate_by_columns_r2;
SYSTEM SYNC REPLICA replicated_deduplicate_by_columns_r1;
SELECT 'check that we have data';
SELECT 'r1', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r1 GROUP BY id, val ORDER BY id, val;
SELECT 'r2', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r2 GROUP BY id, val ORDER BY id, val;
-- NOTE: here and below we need FINAL to force deduplication of such a small data set, which fits in only 1 part.
-- that should remove full duplicates
OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE;
SELECT 'after old OPTIMIZE DEDUPLICATE';
SELECT 'r1', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r1 GROUP BY id, val ORDER BY id, val;
SELECT 'r2', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r2 GROUP BY id, val ORDER BY id, val;
OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY id, val;
OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY COLUMNS('[id, val]');
OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY COLUMNS('[i]') EXCEPT(unique_value, insert_replica_id);
SELECT 'check data again after multiple deduplications with new syntax';
SELECT 'r1', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r1 GROUP BY id, val ORDER BY id, val;
SELECT 'r2', id, val, count(), uniqExact(unique_value), uniqExact(insert_replica_id) FROM replicated_deduplicate_by_columns_r2 GROUP BY id, val ORDER BY id, val;
-- cleanup the mess
DROP TABLE replicated_deduplicate_by_columns_r1;
DROP TABLE replicated_deduplicate_by_columns_r2;

View File

@ -0,0 +1,11 @@
SELECT count()
FROM
(
SELECT number
FROM
(
SELECT number
FROM numbers(1000000)
)
WHERE rand64() < (0.01 * 18446744073709552000.)
)

View File

@ -0,0 +1 @@
EXPLAIN SYNTAX SELECT count(*) FROM ( SELECT number FROM ( SELECT number FROM numbers(1000000) ) WHERE rand64() < (0.01 * 18446744073709552000.));
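The predicate in this test keeps each row with probability ≈ 0.01: `rand64()` is uniform over [0, 2^64), and `18446744073709552000.` is 2^64 rounded to the nearest double. A quick standalone simulation of the same transform:

``` cpp
#include <cstdint>
#include <iostream>
#include <random>

int main()
{
    // Comparing a uniform 64-bit random value against p * 2^64 keeps
    // each row with probability ~p, as in the WHERE clause above.
    std::mt19937_64 rng(42);
    const double p = 0.01;
    const double threshold = p * 18446744073709551616.0;  // p * 2^64

    uint64_t kept = 0;
    const uint64_t total = 1'000'000;
    for (uint64_t i = 0; i < total; ++i)
        if (static_cast<double>(rng()) < threshold)
            ++kept;
    std::cout << "kept " << kept << " of " << total << " (~1%)\n";
}
```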

View File

@ -0,0 +1 @@
one

View File

@ -0,0 +1,4 @@
drop table if exists enum;
create table enum engine MergeTree order by enum as select cast(1, 'Enum8(\'zero\'=0, \'one\'=1)') AS enum;
select * from enum where enum = 1;
drop table if exists enum;