diff --git a/docs/en/operations/table_engines/file.md b/docs/en/operations/table_engines/file.md index 816b8d94cd6..a394db256db 100644 --- a/docs/en/operations/table_engines/file.md +++ b/docs/en/operations/table_engines/file.md @@ -60,7 +60,7 @@ SELECT * FROM file_engine_table ## Usage in Clickhouse-local -In [clickhouse-local](../utils/clickhouse-local.md +In [clickhouse-local](../utils/clickhouse-local.md) File engine accepts file path in addition to `Format`. Default input/output streams can be specified using numeric or human-readable names like `0` or `stdin`, `1` or `stdout`. **Example:** ```bash diff --git a/docs/en/operations/table_engines/mergetree.md b/docs/en/operations/table_engines/mergetree.md index f8b2a00cf83..0aba93e4591 100644 --- a/docs/en/operations/table_engines/mergetree.md +++ b/docs/en/operations/table_engines/mergetree.md @@ -79,7 +79,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa In the example, we set partitioning by month. -We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users. +We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md#sample) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users. `index_granularity` could be omitted because 8192 is the default value. 
diff --git a/docs/ru/operations/table_engines/mergetree.md b/docs/ru/operations/table_engines/mergetree.md index 94143ce9d2a..2828a893fe5 100644 --- a/docs/ru/operations/table_engines/mergetree.md +++ b/docs/ru/operations/table_engines/mergetree.md @@ -78,7 +78,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa В примере мы устанавливаем партиционирование по месяцам. -Также мы задаем выражение для сэмплирования в виде хэша по идентификатору посетителя. Это позволяет псевдослучайным образом перемешать данные в таблице для каждого `CounterID` и `EventDate`. Если при выборке данных задать секцию [SAMPLE](../../query_language/select.md +Также мы задаем выражение для сэмплирования в виде хэша по идентификатору посетителя. Это позволяет псевдослучайным образом перемешать данные в таблице для каждого `CounterID` и `EventDate`. Если при выборке данных задать секцию [SAMPLE](../../query_language/select.md#sample), то ClickHouse вернёт равномерно-псевдослучайную выборку данных для подмножества посетителей. `index_granularity` можно было не указывать, поскольку 8192 — это значение по умолчанию.
Устаревший способ создания таблицы @@ -189,7 +189,7 @@ ClickHouse не требует уникального первичного кл В этом сценарии имеет смысл оставить в первичном ключе всего несколько столбцов, которые обеспечат эффективную фильтрацию по индексу, а остальные столбцы-измерения добавить в выражение ключа сортировки. -[ALTER ключа сортировки](../../query_language/alter.mdоперация, так как при одновременном добавлении нового столбца в таблицу и ключ сортировки не нужно изменять +[ALTER ключа сортировки](../../query_language/alter.md) — легкая операция, так как при одновременном добавлении нового столбца в таблицу и ключ сортировки не нужно изменять данные кусков (они остаются упорядоченными и по новому выражению ключа). ### Использование индексов и партиций в запросах diff --git a/docs/zh/operations/table_engines/file.md b/docs/zh/operations/table_engines/file.md index 816b8d94cd6..a394db256db 100644 --- a/docs/zh/operations/table_engines/file.md +++ b/docs/zh/operations/table_engines/file.md @@ -60,7 +60,7 @@ SELECT * FROM file_engine_table ## Usage in Clickhouse-local -In [clickhouse-local](../utils/clickhouse-local.md +In [clickhouse-local](../utils/clickhouse-local.md) File engine accepts file path in addition to `Format`. Default input/output streams can be specified using numeric or human-readable names like `0` or `stdin`, `1` or `stdout`. **Example:** ```bash diff --git a/docs/zh/operations/table_engines/mergetree.md b/docs/zh/operations/table_engines/mergetree.md index 12ed0cc094f..0aba93e4591 100644 --- a/docs/zh/operations/table_engines/mergetree.md +++ b/docs/zh/operations/table_engines/mergetree.md @@ -39,6 +39,7 @@ CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] ) ENGINE = MergeTree() [PARTITION BY expr] [ORDER BY expr] +[PRIMARY KEY expr] [SAMPLE BY expr] [SETTINGS name=value, ...] ``` @@ -49,16 +50,23 @@ For a description of request parameters, see [request description](../../query_l - `ENGINE` - Name and parameters of the engine. 
`ENGINE = MergeTree()`. `MergeTree` engine does not have parameters. -- `ORDER BY` — Primary key. - - A tuple of columns or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`. -If a sampling expression is used, the primary key must contain it. Example: `ORDER BY (CounerID, EventDate, intHash32(UserID))`. - - `PARTITION BY` — The [partitioning key](custom_partitioning_key.md). For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](../../data_types/date.md). The partition names here have the `"YYYYMM"` format. -- `SAMPLE BY` — An expression for sampling. Example: `intHash32(UserID))`. +- `ORDER BY` — The sorting key. + + A tuple of columns or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`. + +- `PRIMARY KEY` - The primary key if it [differs from the sorting key](mergetree.md). + + By default the primary key is the same as the sorting key (which is specified by the `ORDER BY` clause). + Thus in most cases it is unnecessary to specify a separate `PRIMARY KEY` clause. + +- `SAMPLE BY` — An expression for sampling. + + If a sampling expression is used, the primary key must contain it. Example: + `SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID))`. - `SETTINGS` — Additional parameters that control the behavior of the `MergeTree`: - `index_granularity` — The granularity of an index. The number of data rows between the "marks" of an index. By default, 8192. @@ -71,7 +79,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa In the example, we set partitioning by month. -We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users. 
+We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md#sample) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users. `index_granularity` could be omitted because 8192 is the default value. @@ -133,7 +141,7 @@ If the data query specifies: - `CounterID in ('a', 'h')`, the server reads the data in the ranges of marks `[0, 3)` and `[6, 8)`. - `CounterID IN ('a', 'h') AND Date = 3`, the server reads the data in the ranges of marks `[1, 3)` and `[7, 8)`. -- `Date = 3`, the server reads the data in the range of marks `[1, 10)`. +- `Date = 3`, the server reads the data in the range of marks `[1, 10]`. The examples above show that it is always more effective to use an index than a full scan. @@ -159,10 +167,32 @@ The number of columns in the primary key is not explicitly limited. Depending on - Provide additional logic when data parts merging in the [CollapsingMergeTree](collapsingmergetree.md#table_engine-collapsingmergetree) and [SummingMergeTree](summingmergetree.md) engines. - You may need many fields in the primary key even if they are not necessary for the previous steps. + In this case it makes sense to specify a *sorting key* that differs from the primary key. A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during `SELECT` queries. + +### Choosing the Primary Key that differs from the Sorting Key + +It is possible to specify a primary key (an expression whose values are written into the index file +for each mark) that differs from the sorting key (the expression used to sort the rows in data parts). +In this case, the primary key expression tuple must be a prefix of the sorting key expression tuple. 
+ +This feature is helpful when using the [SummingMergeTree](summingmergetree.md) and +[AggregatingMergeTree](aggregatingmergetree.md) table engines. Commonly, a table that uses these engines +has two types of columns: *dimensions* and *measures*. Typical queries aggregate values of measure +columns with arbitrary `GROUP BY` and filtering by dimensions. Because SummingMergeTree and AggregatingMergeTree +aggregate rows with the same value of the sorting key, it is natural to add all dimensions to it. As a result, +the key expression consists of a long list of columns, and this list must be frequently updated with newly +added dimensions. + +In this case it makes sense to leave only a few columns in the primary key that will provide efficient +range scans and add the remaining dimension columns to the sorting key tuple. + +[ALTER of the sorting key](../../query_language/alter.md) is a +lightweight operation because when a new column is simultaneously added to the table and to the sorting key, +the data parts need not be changed (they remain sorted by the new sorting key expression). + ### Use of Indexes and Partitions in Queries For`SELECT` queries, ClickHouse analyzes whether an index can be used. An index can be used if the `WHERE/PREWHERE` clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has `IN` or `LIKE` with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions.