Mirror of https://github.com/ClickHouse/ClickHouse.git, synced 2024-10-07 09:00:49 +00:00
more docs fixes
parent 79f0c7feb9
commit 06e02bfd86
@@ -60,7 +60,7 @@ SELECT * FROM file_engine_table
 
 ## Usage in Clickhouse-local
 
-In [clickhouse-local](../utils/clickhouse-local.md
+In [clickhouse-local](../utils/clickhouse-local.md) File engine accepts file path in addition to `Format`. Default input/output streams can be specified using numeric or human-readable names like `0` or `stdin`, `1` or `stdout`.
 **Example:**
 
 ```bash
@@ -79,7 +79,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa
 
 In the example, we set partitioning by month.
 
-We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users.
+We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md#sample) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users.
 
 `index_granularity` could be omitted because 8192 is the default value.
 
@@ -78,7 +78,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa
 
 In the example, we set partitioning by month.
 
-We also set an expression for sampling as a hash by the visitor ID. This allows you to pseudorandomly shuffle the data in the table for each `CounterID` and `EventDate`. If, when selecting data, you specify a [SAMPLE](../../query_language/select.md
+We also set an expression for sampling as a hash by the visitor ID. This allows you to pseudorandomly shuffle the data in the table for each `CounterID` and `EventDate`. If, when selecting data, you specify a [SAMPLE](../../query_language/select.md#sample] clause, ClickHouse will return an evenly pseudorandom data sample for a subset of visitors.
 `index_granularity` could be omitted because 8192 is the default value.
 
 <details markdown="1"><summary>Deprecated Method for Creating a Table</summary>
@@ -189,7 +189,7 @@ ClickHouse does not require a unique primary ke
 In this scenario it makes sense to leave only a few columns in the primary key that will provide efficient
 filtering by the index, and to add the remaining dimension columns to the sorting key expression.
 
-[ALTER of the sorting key](../../query_language/alter.md[…] operation, because when a new column is simultaneously added to the table and to the sorting key, there is no need to change
+[ALTER of the sorting key](../../query_language/alter.md) is a lightweight operation, because when a new column is simultaneously added to the table and to the sorting key, there is no need to change
 the data of the parts (they remain sorted by the new key expression as well).
 
 ### Use of Indexes and Partitions in Queries
@@ -60,7 +60,7 @@ SELECT * FROM file_engine_table
 
 ## Usage in Clickhouse-local
 
-In [clickhouse-local](../utils/clickhouse-local.md
+In [clickhouse-local](../utils/clickhouse-local.md) File engine accepts file path in addition to `Format`. Default input/output streams can be specified using numeric or human-readable names like `0` or `stdin`, `1` or `stdout`.
 **Example:**
 
 ```bash
@@ -39,6 +39,7 @@ CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
 ) ENGINE = MergeTree()
 [PARTITION BY expr]
 [ORDER BY expr]
+[PRIMARY KEY expr]
 [SAMPLE BY expr]
 [SETTINGS name=value, ...]
 ```
@@ -49,16 +50,23 @@ For a description of request parameters, see [request description](../../query_l
 
 - `ENGINE` - Name and parameters of the engine. `ENGINE = MergeTree()`. `MergeTree` engine does not have parameters.
 
-- `ORDER BY` — Primary key.
-
-A tuple of columns or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`.
-If a sampling expression is used, the primary key must contain it. Example: `ORDER BY (CounerID, EventDate, intHash32(UserID))`.
-
 - `PARTITION BY` — The [partitioning key](custom_partitioning_key.md).
 
 For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](../../data_types/date.md). The partition names here have the `"YYYYMM"` format.
 
-- `SAMPLE BY` — An expression for sampling. Example: `intHash32(UserID))`.
+- `ORDER BY` — The sorting key.
+
+A tuple of columns or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`.
+
+- `PRIMARY KEY` - The primary key if it [differs from the sorting key](mergetree.md).
+
+By default the primary key is the same as the sorting key (which is specified by the `ORDER BY` clause).
+Thus in most cases it is unnecessary to specify a separate `PRIMARY KEY` clause.
+
+- `SAMPLE BY` — An expression for sampling.
+
+If a sampling expression is used, the primary key must contain it. Example:
+`SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID))`.
 
 - `SETTINGS` — Additional parameters that control the behavior of the `MergeTree`:
 - `index_granularity` — The granularity of an index. The number of data rows between the "marks" of an index. By default, 8192.
@@ -71,7 +79,7 @@ ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDa
 
 In the example, we set partitioning by month.
 
-We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users.
+We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If, when selecting the data, you define a [SAMPLE](../../query_language/select.md#sample) clause, ClickHouse will return an evenly pseudorandom data sample for a subset of users.
 
 `index_granularity` could be omitted because 8192 is the default value.
 
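The sampling behaviour described in the hunk above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not ClickHouse's implementation: `zlib.crc32` stands in for `intHash32`, the helper names `sample_hash` and `in_sample` are hypothetical, and the sketch assumes that a sample keeps the rows whose hash falls in the lowest fraction of the hash range. It shows why hashing the user ID yields all rows for a subset of users rather than a random subset of rows.

```python
import zlib

HASH_RANGE = 2 ** 32  # crc32 yields an unsigned 32-bit value


def sample_hash(user_id: int) -> int:
    """Stand-in for intHash32(UserID); crc32 is an assumption, not the ClickHouse function."""
    return zlib.crc32(str(user_id).encode("ascii"))


def in_sample(user_id: int, fraction: float) -> bool:
    """Keep a user when its hash falls in the lowest `fraction` of the hash range."""
    return sample_hash(user_id) < fraction * HASH_RANGE


# Three events for each of 1000 hypothetical users.
rows = [(user_id, event_no) for user_id in range(1000) for event_no in range(3)]

# Roughly what a 10% sample would keep in this model.
sampled = [row for row in rows if in_sample(row[0], 0.1)]
sampled_users = {user_id for user_id, _ in sampled}

# Count the surviving rows per sampled user: a user is kept whole or dropped whole.
rows_per_user = {
    user: sum(1 for user_id, _ in sampled if user_id == user)
    for user in sampled_users
}
```

Because the decision depends only on the user ID, repeating the query with the same fraction selects the same users each time.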
@@ -133,7 +141,7 @@ If the data query specifies:
 
 - `CounterID in ('a', 'h')`, the server reads the data in the ranges of marks `[0, 3)` and `[6, 8)`.
 - `CounterID IN ('a', 'h') AND Date = 3`, the server reads the data in the ranges of marks `[1, 3)` and `[7, 8)`.
-- `Date = 3`, the server reads the data in the range of marks `[1, 10)`.
+- `Date = 3`, the server reads the data in the range of marks `[1, 10]`.
 
 The examples above show that it is always more effective to use an index than a full scan.
 
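The first two mark ranges quoted in the hunk above can be reproduced with a small Python model of a sparse index. This is a sketch under assumptions: the `MARKS` values are hypothetical, chosen so that the results match the quoted ranges, and `granule_may_contain` and `mark_ranges` are illustrative helpers rather than ClickHouse internals. Each mark stores the `(CounterID, Date)` key of the first row of a granule, and a granule is read when the requested key may lie between its mark and the next one.

```python
from typing import List, Sequence, Tuple

# Hypothetical index: the (CounterID, Date) key stored at marks 0..10.
MARKS = [("a", 1), ("a", 2), ("a", 3), ("b", 3), ("e", 2), ("e", 3),
         ("g", 1), ("h", 2), ("i", 1), ("i", 3), ("l", 3)]


def granule_may_contain(marks: Sequence[tuple], i: int, key: tuple) -> bool:
    """True if granule i (rows from mark i up to mark i + 1) may hold `key`.

    `key` may be a prefix of the full primary key, e.g. ("a",) or ("a", 3).
    """
    n = len(key)
    if key < marks[i][:n]:
        return False  # the granule starts after the key
    if i + 1 == len(marks):
        return True   # the last granule has no upper bound in the index
    return key <= marks[i + 1][:n]


def mark_ranges(marks: Sequence[tuple], keys: Sequence[tuple]) -> List[Tuple[int, int]]:
    """Merge the matching granules into half-open ranges of mark numbers."""
    hits = sorted({i for i in range(len(marks))
                   for key in keys if granule_may_contain(marks, i, key)})
    ranges: List[Tuple[int, int]] = []
    for i in hits:
        if ranges and ranges[-1][1] == i:
            ranges[-1] = (ranges[-1][0], i + 1)  # extend the current range
        else:
            ranges.append((i, i + 1))
    return ranges


# CounterID IN ('a', 'h')
print(mark_ranges(MARKS, [("a",), ("h",)]))      # [(0, 3), (6, 8)]
# CounterID IN ('a', 'h') AND Date = 3
print(mark_ranges(MARKS, [("a", 3), ("h", 3)]))  # [(1, 3), (7, 8)]
```

The model covers only point lookups on a key prefix. The `Date = 3` case, which filters on the second key column alone, needs a different analysis and is not modelled here.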
@@ -159,10 +167,32 @@ The number of columns in the primary key is not explicitly limited. Depending on
 
 - Provide additional logic when data parts merging in the [CollapsingMergeTree](collapsingmergetree.md#table_engine-collapsingmergetree) and [SummingMergeTree](summingmergetree.md) engines.
 
-You may need many fields in the primary key even if they are not necessary for the previous steps.
+In this case it makes sense to specify the *sorting key* that is different from the primary key.
 
 A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during `SELECT` queries.
 
+
+### Choosing the Primary Key that differs from the Sorting Key
+
+It is possible to specify the primary key (the expression, values of which are written into the index file
+for each mark) that is different from the sorting key (the expression for sorting the rows in data parts).
+In this case the primary key expression tuple must be a prefix of the sorting key expression tuple.
+
+This feature is helpful when using the [SummingMergeTree](summingmergetree.md) and
+[AggregatingMergeTree](aggregatingmergetree.md) table engines. In a common case when using these engines the
+table has two types of columns: *dimensions* and *measures*. Typical queries aggregate values of measure
+columns with arbitrary `GROUP BY` and filtering by dimensions. As SummingMergeTree and AggregatingMergeTree
+aggregate rows with the same value of the sorting key, it is natural to add all dimensions to it. As a result
+the key expression consists of a long list of columns and this list must be frequently updated with newly
+added dimensions.
+
+In this case it makes sense to leave only a few columns in the primary key that will provide efficient
+range scans and add the remaining dimension columns to the sorting key tuple.
+
+[ALTER of the sorting key](../../query_language/alter.md) is a
+lightweight operation because when a new column is simultaneously added to the table and to the sorting key
+data parts need not be changed (they remain sorted by the new sorting key expression).
+
 ### Use of Indexes and Partitions in Queries
 
 For `SELECT` queries, ClickHouse analyzes whether an index can be used. An index can be used if the `WHERE/PREWHERE` clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has `IN` or `LIKE` with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions.
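Two statements from the added section on a primary key that differs from the sorting key lend themselves to a quick check in Python: the primary key tuple must be a prefix of the sorting key tuple, and a data part stays sorted when a newly added column is appended to the sorting key. The sketch below is illustrative only. The helpers `is_prefix` and `is_sorted_by`, the sample rows, and the `Region` column are hypothetical, and it assumes that existing rows all receive the same default value for the new column, which is one way to see why the order is preserved.

```python
from typing import Dict, List, Sequence


def is_prefix(primary_key: Sequence[str], sorting_key: Sequence[str]) -> bool:
    """True if the primary key columns are the leading columns of the sorting key."""
    return list(sorting_key[: len(primary_key)]) == list(primary_key)


def is_sorted_by(rows: List[Dict], key_columns: Sequence[str]) -> bool:
    """True if the rows are in non-decreasing order of the given key columns."""
    keys = [tuple(row[column] for column in key_columns) for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# A hypothetical data part, already sorted by (CounterID, EventDate).
part = [
    {"CounterID": 1, "EventDate": "2018-01-01"},
    {"CounterID": 1, "EventDate": "2018-01-02"},
    {"CounterID": 2, "EventDate": "2018-01-01"},
]

old_sorting_key = ["CounterID", "EventDate"]
new_sorting_key = ["CounterID", "EventDate", "Region"]

# Assumption: adding the column gives every existing row the same default value,
# so the appended key column cannot change the relative order of the rows.
for row in part:
    row["Region"] = 0

print(is_prefix(["CounterID"], new_sorting_key))   # True
print(is_prefix(["EventDate"], new_sorting_key))   # False
print(is_sorted_by(part, new_sorting_key))         # True
```

Under the same assumption, the check also explains why the primary key can stay short: it only needs to match the leading columns of whatever the sorting key grows into.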