more docs fixes

This commit is contained in:
Ivan Blinkov 2018-12-12 19:52:43 +03:00
parent 79f0c7feb9
commit 06e02bfd86
5 changed files with 44 additions and 14 deletions

View File

@ -60,7 +60,7 @@ SELECT * FROM file_engine_table
## Usage in Clickhouse-local ## Usage in Clickhouse-local
In [clickhouse-local](../utils/clickhouse-local.md In [clickhouse-local](../utils/clickhouse-local.md) File engine accepts file path in addition to `Format`. Default input/output streams can be specified using numeric or human-readable names like `0` or `stdin`, `1` or `stdout`.
**Example:** **Example:**
```bash ```bash

View File

@ -79,7 +79,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa
In the example, we set partitioning by month. In the example, we set partitioning by month.
We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users. We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md#sample) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users.
`index_granularity` could be omitted because 8192 is the default value. `index_granularity` could be omitted because 8192 is the default value.

View File

@ -78,7 +78,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa
В примере мы устанавливаем партиционирование по месяцам. В примере мы устанавливаем партиционирование по месяцам.
Также мы задаем выражение для сэмплирования в виде хэша по идентификатору посетителя. Это позволяет псевдослучайным образом перемешать данные в таблице для каждого `CounterID` и `EventDate`. Если при выборке данных задать секцию [SAMPLE](../../query_language/select.md Также мы задаем выражение для сэмплирования в виде хэша по идентификатору посетителя. Это позволяет псевдослучайным образом перемешать данные в таблице для каждого `CounterID` и `EventDate`. Если при выборке данных задать секцию [SAMPLE](../../query_language/select.md#sample], то ClickHouse вернёт равномерно-псевдослучайную выборку данных для подмножества посетителей.
`index_granularity` можно было не указывать, поскольку 8192 — это значение по умолчанию. `index_granularity` можно было не указывать, поскольку 8192 — это значение по умолчанию.
<details markdown="1"><summary>Устаревший способ создания таблицы</summary> <details markdown="1"><summary>Устаревший способ создания таблицы</summary>
@ -189,7 +189,7 @@ ClickHouse не требует уникального первичного кл
В этом сценарии имеет смысл оставить в первичном ключе всего несколько столбцов, которые обеспечат эффективную В этом сценарии имеет смысл оставить в первичном ключе всего несколько столбцов, которые обеспечат эффективную
фильтрацию по индексу, а остальные столбцы-измерения добавить в выражение ключа сортировки. фильтрацию по индексу, а остальные столбцы-измерения добавить в выражение ключа сортировки.
[ALTER ключа сортировки](../../query_language/alter.mdоперация, так как при одновременном добавлении нового столбца в таблицу и ключ сортировки не нужно изменять [ALTER ключа сортировки](../../query_language/alter.md) — легкая операция, так как при одновременном добавлении нового столбца в таблицу и ключ сортировки не нужно изменять
данные кусков (они остаются упорядоченными и по новому выражению ключа). данные кусков (они остаются упорядоченными и по новому выражению ключа).
### Использование индексов и партиций в запросах ### Использование индексов и партиций в запросах

View File

@ -60,7 +60,7 @@ SELECT * FROM file_engine_table
## Usage in Clickhouse-local ## Usage in Clickhouse-local
In [clickhouse-local](../utils/clickhouse-local.md In [clickhouse-local](../utils/clickhouse-local.md) File engine accepts file path in addition to `Format`. Default input/output streams can be specified using numeric or human-readable names like `0` or `stdin`, `1` or `stdout`.
**Example:** **Example:**
```bash ```bash

View File

@ -39,6 +39,7 @@ CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
) ENGINE = MergeTree() ) ENGINE = MergeTree()
[PARTITION BY expr] [PARTITION BY expr]
[ORDER BY expr] [ORDER BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr] [SAMPLE BY expr]
[SETTINGS name=value, ...] [SETTINGS name=value, ...]
``` ```
@ -49,16 +50,23 @@ For a description of request parameters, see [request description](../../query_l
- `ENGINE` - Name and parameters of the engine. `ENGINE = MergeTree()`. `MergeTree` engine does not have parameters. - `ENGINE` - Name and parameters of the engine. `ENGINE = MergeTree()`. `MergeTree` engine does not have parameters.
- `ORDER BY` — Primary key.
A tuple of columns or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`.
If a sampling expression is used, the primary key must contain it. Example: `ORDER BY (CounerID, EventDate, intHash32(UserID))`.
- `PARTITION BY` — The [partitioning key](custom_partitioning_key.md). - `PARTITION BY` — The [partitioning key](custom_partitioning_key.md).
For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](../../data_types/date.md). The partition names here have the `"YYYYMM"` format. For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](../../data_types/date.md). The partition names here have the `"YYYYMM"` format.
- `SAMPLE BY` — An expression for sampling. Example: `intHash32(UserID))`. - `ORDER BY` — The sorting key.
A tuple of columns or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`.
- `PRIMARY KEY` - The primary key if it [differs from the sorting key](mergetree.md).
By default the primary key is the same as the sorting key (which is specified by the `ORDER BY` clause).
Thus in most cases it is unnecessary to specify a separate `PRIMARY KEY` clause.
- `SAMPLE BY` — An expression for sampling.
If a sampling expression is used, the primary key must contain it. Example:
`SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID))`.
- `SETTINGS` — Additional parameters that control the behavior of the `MergeTree`: - `SETTINGS` — Additional parameters that control the behavior of the `MergeTree`:
- `index_granularity` — The granularity of an index. The number of data rows between the "marks" of an index. By default, 8192. - `index_granularity` — The granularity of an index. The number of data rows between the "marks" of an index. By default, 8192.
@ -71,7 +79,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa
In the example, we set partitioning by month. In the example, we set partitioning by month.
We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users. We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md#sample) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users.
`index_granularity` could be omitted because 8192 is the default value. `index_granularity` could be omitted because 8192 is the default value.
@ -133,7 +141,7 @@ If the data query specifies:
- `CounterID in ('a', 'h')`, the server reads the data in the ranges of marks `[0, 3)` and `[6, 8)`. - `CounterID in ('a', 'h')`, the server reads the data in the ranges of marks `[0, 3)` and `[6, 8)`.
- `CounterID IN ('a', 'h') AND Date = 3`, the server reads the data in the ranges of marks `[1, 3)` and `[7, 8)`. - `CounterID IN ('a', 'h') AND Date = 3`, the server reads the data in the ranges of marks `[1, 3)` and `[7, 8)`.
- `Date = 3`, the server reads the data in the range of marks `[1, 10)`. - `Date = 3`, the server reads the data in the range of marks `[1, 10]`.
The examples above show that it is always more effective to use an index than a full scan. The examples above show that it is always more effective to use an index than a full scan.
@ -159,10 +167,32 @@ The number of columns in the primary key is not explicitly limited. Depending on
- Provide additional logic when data parts merging in the [CollapsingMergeTree](collapsingmergetree.md#table_engine-collapsingmergetree) and [SummingMergeTree](summingmergetree.md) engines. - Provide additional logic when data parts merging in the [CollapsingMergeTree](collapsingmergetree.md#table_engine-collapsingmergetree) and [SummingMergeTree](summingmergetree.md) engines.
You may need many fields in the primary key even if they are not necessary for the previous steps. In this case it makes sense to specify the *sorting key* that is different from the primary key.
A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during `SELECT` queries. A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during `SELECT` queries.
### Choosing the Primary Key that differs from the Sorting Key
It is possible to specify the primary key (the expression, values of which are written into the index file
for each mark) that is different from the sorting key (the expression for sorting the rows in data parts).
In this case the primary key expression tuple must be a prefix of the sorting key expression tuple.
This feature is helpful when using the [SummingMergeTree](summingmergetree.md) and
[AggregatingMergeTree](aggregatingmergetree.md) table engines. In a common case when using these engines the
table has two types of columns: *dimensions* and *measures*. Typical queries aggregate values of measure
columns with arbitrary `GROUP BY` and filtering by dimensions. As SummingMergeTree and AggregatingMergeTree
aggregate rows with the same value of the sorting key, it is natural to add all dimensions to it. As a result
the key expression consists of a long list of columns and this list must be frequently updated with newly
added dimensions.
In this case it makes sense to leave only a few columns in the primary key that will provide efficient
range scans and add the remaining dimension columns to the sorting key tuple.
[ALTER of the sorting key](../../query_language/alter.md) is a
lightweight operation because when a new column is simultaneously added to the table and to the sorting key
data parts need not be changed (they remain sorted by the new sorting key expression).
### Use of Indexes and Partitions in Queries ### Use of Indexes and Partitions in Queries
For`SELECT` queries, ClickHouse analyzes whether an index can be used. An index can be used if the `WHERE/PREWHERE` clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has `IN` or `LIKE` with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions. For`SELECT` queries, ClickHouse analyzes whether an index can be used. An index can be used if the `WHERE/PREWHERE` clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has `IN` or `LIKE` with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions.