Added english docs on Aggregating, Replacing and SummingMergeTree.

This commit is contained in:
BayoNet 2018-10-02 17:42:33 +03:00
parent e5183df038
commit 78e044906c
14 changed files with 353 additions and 122 deletions


<a name="data_type-aggregatefunction"></a>
# AggregateFunction(name, types_of_arguments...)
The intermediate state of an aggregate function. To get it, use aggregate functions with the `-State` suffix. To work with aggregated data in the future, you must use the same aggregate functions with the `-Merge` suffix.
`AggregateFunction` — parametric data type.
**Parameters**
- Name of the aggregate function. If the function is parametric, specify its parameters too.
- Types of the aggregate function arguments.
**Example**
```sql
CREATE TABLE t
(
column1 AggregateFunction(uniq, UInt64),
column2 AggregateFunction(anyIf, String, UInt8),
column3 AggregateFunction(quantiles(0.5, 0.9), UInt64)
) ENGINE = ...
```
[uniq](../../query_language/agg_functions/reference.md#agg_function-uniq), anyIf ([any](../../query_language/agg_functions/reference.md#agg_function-any)+[If](../../query_language/agg_functions/combinators.md#agg-functions-combine-if)) and [quantiles](../../query_language/agg_functions/reference.md#agg_function-quantiles) are aggregate functions supported in ClickHouse.
## Usage
### Data Insertion
To insert data, use `INSERT SELECT` with aggregate `-State` functions.
**Function examples**
```
uniqState(UserID)
quantilesState(0.5, 0.9)(SendTiming)
```
In contrast to the corresponding functions `uniq` and `quantiles`, `-State` functions return the state instead of the final value. In other words, they return a value of the `AggregateFunction` type.
In the results of a `SELECT` query, values of the `AggregateFunction` type have an implementation-specific binary representation for all ClickHouse output formats. If you dump data into, for example, the `TabSeparated` format with a `SELECT` query, this dump can be loaded back using an `INSERT` query.
### Data Selection
When selecting data from an `AggregatingMergeTree` table, use a `GROUP BY` clause and the same aggregate functions as when inserting data, but with the `-Merge` suffix.
An aggregate function with the `-Merge` suffix takes a set of states, combines them, and returns the result of complete data aggregation.
For example, the following two queries return the same result:
```sql
SELECT uniq(UserID) FROM table
SELECT uniqMerge(state) FROM (SELECT uniqState(UserID) AS state FROM table GROUP BY RegionID)
```
## Usage Example
See [AggregatingMergeTree](../../operations/table_engines/aggregatingmergetree.md#table_engine-aggregatingmergetree) engine description.


<a name="table_engine-aggregatingmergetree"></a>
# AggregatingMergeTree
The engine inherits from [MergeTree](mergetree.md#table_engines-mergetree), altering the logic for merging data parts. ClickHouse replaces all rows with the same primary key with a single row (within one data part) that stores a combination of states of aggregate functions.
You can use `AggregatingMergeTree` tables for incremental data aggregation, including for aggregated materialized views.
The engine processes all columns with the [AggregateFunction](../../data_types/nested_data_structures/aggregatefunction.md#data_type-aggregatefunction) type.
It is appropriate to use `AggregatingMergeTree` if it reduces the number of rows by orders of magnitude.
## Creating a Table
```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE = AggregatingMergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]
[SETTINGS name=value, ...]
```
For a description of query parameters, see the [query description](../../query_language/create.md#query_language-queries-create_table).
**Query clauses**
When creating an `AggregatingMergeTree` table, the same [clauses](mergetree.md#table_engines-mergetree-configuring) are required as when creating a `MergeTree` table.
### Deprecated Method for Creating a Table
!!! attention
    Do not use this method in new projects and, if possible, switch old projects to the method described above.
```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE [=] AggregatingMergeTree(date-column [, sampling_expression], (primary, key), index_granularity)
```
All of the parameters have the same meaning as in `MergeTree`.
## SELECT and INSERT
To insert data, use an `INSERT SELECT` query with aggregate `-State` functions.
When selecting data from an `AggregatingMergeTree` table, use a `GROUP BY` clause and the same aggregate functions as when inserting data, but with the `-Merge` suffix.
In the results of a `SELECT` query, values of the `AggregateFunction` type have an implementation-specific binary representation for all ClickHouse output formats. If you dump data into, for example, the `TabSeparated` format with a `SELECT` query, this dump can be loaded back using an `INSERT` query.
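For example, a minimal round trip might look like this (a sketch with hypothetical `agg_table` and `src` names, assuming `src` has `RegionID` and `UserID` columns):

```sql
CREATE TABLE agg_table
(
    RegionID UInt64,
    Users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY RegionID;

-- Insert intermediate states using a -State function.
INSERT INTO agg_table
SELECT RegionID, uniqState(UserID) AS Users
FROM src
GROUP BY RegionID;

-- Finish aggregation using the matching -Merge function.
SELECT RegionID, uniqMerge(Users) AS Users
FROM agg_table
GROUP BY RegionID;
```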
## Example of an Aggregated Materialized View
An `AggregatingMergeTree` materialized view that watches the `test.visits` table:
```sql
CREATE MATERIALIZED VIEW test.basic
ENGINE = AggregatingMergeTree() PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate)
AS SELECT
    CounterID,
    StartDate,
    ...
FROM test.visits
GROUP BY CounterID, StartDate;
```
Insert data into the `test.visits` table.
```sql
INSERT INTO test.visits ...
```
The data are inserted in both the table and the view `test.basic`, which will perform the aggregation.
To get the aggregated data, we need to execute a query such as `SELECT ... GROUP BY ...` from the view `test.basic`:
```sql
SELECT
    ...
GROUP BY StartDate
ORDER BY StartDate;
```


The `MergeTree` engine and other engines of this family (`*MergeTree`) are the most robust ClickHouse table engines.
!!! info
    The [Merge](merge.md#table_engine-merge) engine does not belong to the `*MergeTree` family.
Main features:
If necessary, you can set the data sampling method in the table.
<a name="table_engines-mergetree-configuring"></a>
## Creating a Table
```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE = MergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]
[SETTINGS name=value, ...]
```
For a description of query parameters, see the [query description](../../query_language/create.md#query_language-queries-create_table).
**Query clauses**
- `ENGINE` — Name and parameters of the engine. `ENGINE = MergeTree()`. The `MergeTree` engine does not have parameters.
- `ORDER BY` — Primary key.
    A tuple of columns or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`.
    If a sampling expression is used, the primary key must contain it. Example: `ORDER BY (CounterID, EventDate, intHash32(UserID))`.
- `PARTITION BY` — The [partitioning key](custom_partitioning_key.md#table_engines-custom_partitioning_key).
    For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](../../data_types/date.md#data_type-date). The partition names here have the `"YYYYMM"` format.
- `SAMPLE BY` — An expression for sampling. Example: `SAMPLE BY intHash32(UserID)`.
- `SETTINGS` — Additional parameters that control the behavior of the `MergeTree`:
    - `index_granularity` — The granularity of an index. The number of data rows between the "marks" of an index. By default, 8192.
**Example**
```
ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity=8192
```
We also set an expression for sampling as a hash by the user ID.
`index_granularity` could be omitted because 8192 is the default value.
### Deprecated Method for Creating a Table
!!! attention
    Do not use this method in new projects and, if possible, switch old projects to the method described above.
```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE [=] MergeTree(date-column [, sampling_expression], (primary, key), index_granularity)
```
**MergeTree() parameters**
A table consists of data *parts* sorted by primary key.
When data is inserted in a table, separate data parts are created and each of them is lexicographically sorted by primary key. For example, if the primary key is `(CounterID, Date)`, the data in the part is sorted by `CounterID`, and within each `CounterID`, it is ordered by `Date`.
Data belonging to different partitions are separated into different parts. In the background, ClickHouse merges data parts for more efficient storage. Parts belonging to different partitions are not merged. The merge mechanism does not guarantee that all rows with the same primary key will be in the same data part.
For each data part, ClickHouse creates an index file that contains the primary key value for each index row ("mark"). Index row numbers are defined as `n * index_granularity`. The maximum value `n` is equal to the integer part of dividing the total number of rows by the `index_granularity`. For each column, the "marks" are also written for the same index rows as the primary key. These "marks" allow you to find the data directly in the columns.
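As an illustrative calculation (the numbers here are chosen for the example, not taken from the original text): with the default `index_granularity` of 8192, a data part containing 1,000,000 rows gets a mark every 8192 rows:

```
max n = floor(1 000 000 / 8192) = 122
marks are written for rows 0, 8192, 16384, ..., 999 424
```

So the index for such a part stores on the order of a hundred entries rather than a million, which is why it fits in memory.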
The number of columns in the primary key is not explicitly limited.
ClickHouse sorts data by primary key, so the higher the consistency, the better the compression.
- To provide additional logic when merging data parts in the [CollapsingMergeTree](collapsingmergetree.md#table_engine-collapsingmergetree) and [SummingMergeTree](summingmergetree.md#table_engine-summingmergetree) engines.
You may need many fields in the primary key even if they are not necessary for the previous steps.
A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during `SELECT` queries.
### Use of Indexes and Partitions in Queries
For `SELECT` queries, ClickHouse analyzes whether an index can be used. An index can be used if the `WHERE/PREWHERE` clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has `IN` or `LIKE` with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions.
The key for partitioning by month allows reading only those data blocks which contain dates from the appropriate range.
For concurrent table access, we use multi-versioning. In other words, when a table is simultaneously read and updated, data is read from a set of parts that is current at the time of the query. There are no lengthy locks. Inserts do not get in the way of read operations.
Reading from a table is automatically parallelized.


# ReplacingMergeTree
The engine differs from [MergeTree](mergetree.md#table_engines-mergetree) in that it removes duplicate entries with the same primary key value.
Data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can't plan for it. Some of the data may remain unprocessed. Although you can run an unscheduled merge using the `OPTIMIZE` query, don't count on using it, because the `OPTIMIZE` query will read and write a large amount of data.
Thus, `ReplacingMergeTree` is suitable for clearing out duplicate data in the background in order to save space, but it doesn't guarantee the absence of duplicates.
## Creating a Table
```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE = ReplacingMergeTree([ver])
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]
[SETTINGS name=value, ...]
```
For a description of query parameters, see the [query description](../../query_language/create.md#query_language-queries-create_table).
**ReplacingMergeTree Parameters**
- `ver` — Column with the version. Type `UInt*`, `Date`, or `DateTime`. Optional parameter.
When merging, `ReplacingMergeTree` leaves only one row out of all the rows with the same primary key:
- The last in the selection, if `ver` is not set.
- The one with the maximum version, if `ver` is specified.
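A minimal sketch of this behavior (hypothetical table and column names; `OPTIMIZE ... FINAL` is used here only to force the merge, keeping in mind the note above about its cost):

```sql
CREATE TABLE replacing_demo
(
    key UInt32,
    value String,
    ver UInt32
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY key;

INSERT INTO replacing_demo VALUES (1, 'first', 1), (1, 'second', 2);

-- Force an unscheduled merge; after it, only the row with the
-- maximum ver for key = 1 remains, i.e. (1, 'second', 2).
OPTIMIZE TABLE replacing_demo FINAL;
SELECT * FROM replacing_demo;
```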
**Query clauses**
When creating a `ReplacingMergeTree` table, the same [clauses](mergetree.md#table_engines-mergetree-configuring) are required as when creating a `MergeTree` table.
### Deprecated Method for Creating a Table
!!! attention
    Do not use this method in new projects and, if possible, switch old projects to the method described above.
```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE [=] ReplacingMergeTree(date-column [, sampling_expression], (primary, key), index_granularity, [ver])
```
All of the parameters except `ver` have the same meaning as in `MergeTree`.
- `ver` — Column with the version. Optional parameter. For a description, see the text above.


# SummingMergeTree
The engine inherits from [MergeTree](mergetree.md#table_engines-mergetree). The difference is that when merging data parts for `SummingMergeTree` tables, ClickHouse replaces all rows with the same primary key with one row that contains summarized values for the columns with a numeric data type. If the primary key is composed in a way that a single key value corresponds to a large number of rows, this significantly reduces storage volume and speeds up data selection.
We recommend using this engine together with `MergeTree`. Store complete data in a `MergeTree` table, and use `SummingMergeTree` for storing aggregated data, for example, when preparing reports. Such an approach will prevent you from losing valuable data due to an incorrectly composed primary key.
## Creating a Table
```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE = SummingMergeTree([columns])
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]
[SETTINGS name=value, ...]
```
For a description of query parameters, see the [query description](../../query_language/create.md#query_language-queries-create_table).
**Parameters of SummingMergeTree**
- `columns` — A tuple with the names of columns whose values will be summarized. Optional parameter.
    The columns must be of a numeric type and must not be in the primary key.
    If `columns` is not specified, ClickHouse summarizes the values in all columns with a numeric data type that are not in the primary key.
**Query clauses**
When creating a `SummingMergeTree` table, the same [clauses](mergetree.md#table_engines-mergetree-configuring) are required as when creating a `MergeTree` table.
### Deprecated Method for Creating a Table
!!! attention
    Do not use this method in new projects and, if possible, switch old projects to the method described above.
```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE [=] SummingMergeTree(date-column [, sampling_expression], (primary, key), index_granularity, [columns])
```
All of the parameters except `columns` have the same meaning as in `MergeTree`.
- `columns` — Tuple with names of columns whose values will be summarized. Optional parameter. For a description, see the text above.
## Usage Example
Consider the following table:
```sql
CREATE TABLE summtt
(
    key UInt32,
    value UInt32
)
ENGINE = SummingMergeTree()
ORDER BY key
```
Insert data into it:
```
:) INSERT INTO summtt Values(1,1),(1,2),(2,1)
```
ClickHouse may sum the rows incompletely ([see below](#summingmergetree-data-processing)), so we use the aggregate function `sum` and a `GROUP BY` clause in the query.
```sql
SELECT key, sum(value) FROM summtt GROUP BY key
```
```
┌─key─┬─sum(value)─┐
│ 2 │ 1 │
│ 1 │ 3 │
└─────┴────────────┘
```
<a name="summingmergetree-data-processing"></a>
## Data Processing
When data are inserted into a table, they are saved as-is. ClickHouse merges the inserted data parts periodically, and this is when rows with the same primary key are summed and replaced with one row for each resulting data part.
ClickHouse can merge the data parts so that different resulting data parts can consist of rows with the same primary key, i.e. the summation will be incomplete. Therefore, an aggregate function [sum()](../../query_language/agg_functions/reference.md#agg_function-sum) and a `GROUP BY` clause should be used in a `SELECT` query, as described in the example above.
### Common Rules for Summation
The values in the columns with the numeric data type are summarized. The set of columns is defined by the parameter `columns`.
If the values were 0 in all of the columns for summation, the row is deleted.
If a column is not in the primary key and is not summarized, an arbitrary value is selected from the existing ones.
The values are not summarized for columns in the primary key.
### The Summation in the AggregateFunction Columns
For columns of the [AggregateFunction type](../../data_types/nested_data_structures/aggregatefunction.md#data_type-aggregatefunction), ClickHouse behaves as the [AggregatingMergeTree](aggregatingmergetree.md#table_engine-aggregatingmergetree) engine, aggregating according to the function.
### Nested Structures
The table can have nested data structures that are processed in a special way.
If the name of a nested table ends with `Map` and it contains at least two columns that meet the following criteria:
- the first column is numeric `(*Int*, Date, DateTime)`; let's call it `key`,
- the other columns are arithmetic `(*Int*, Float32/64)`; let's call them `(values...)`,
then this nested table is interpreted as a mapping of `key => (values...)`, and when merging its rows, the elements of two data sets are merged by `key` with a summation of the corresponding `(values...)`.
Examples:

```
[(1, 100), (2, 150)] + [(1, -100)] -> [(2, 150)]
```
When requesting data, use the [sumMap(key, value)](../../query_language/agg_functions/reference.md#agg_function-summary) function for aggregation of `Map`.
For a nested data structure, you do not need to specify its columns in the tuple of columns for summation.
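A sketch of such a nested structure (hypothetical names; the nested column name ends with `Map`, so its rows are merged by `key` as described above):

```sql
CREATE TABLE summap_demo
(
    OrderID UInt64,
    StatsMap Nested
    (
        key UInt32,
        value UInt64
    )
)
ENGINE = SummingMergeTree()
ORDER BY OrderID;

-- When reading, aggregate the map explicitly:
SELECT OrderID, sumMap(StatsMap.key, StatsMap.value)
FROM summap_demo
GROUP BY OrderID;
```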


The name of an aggregate function can have a suffix appended to it. This changes the way the aggregate function works.
<a name="agg-functions-combinator-if"></a>
## -If ## -If
The suffix `-If` can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument: a condition (`UInt8` type). The aggregate function processes only the rows that trigger the condition. If the condition was not triggered even once, it returns a default value (usually zeros or empty strings).
Example 2: `uniqArray(arr)` — Count the number of unique elements in all `arr` arrays.
## -State
If you apply this combinator, the aggregate function doesn't return the resulting value (such as the number of unique values for the `uniq` function), but an intermediate state of the aggregation (for `uniq`, this is the hash table for calculating the number of unique values). This is an `AggregateFunction(...)` that can be used for further processing or stored in a table to finish aggregating later. See the sections "AggregatingMergeTree" and "Functions for working with intermediate aggregation states".
## -Merge
## -ForEach
Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array items and returns an array of results. For example, `sumForEach` for the arrays `[1, 2]`, `[3, 4, 5]` and `[6, 7]` returns the result `[10, 13, 5]` after adding together the corresponding array items.
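The example from the text can be reproduced with literal arrays in a subquery (a sketch; shorter arrays behave as if padded with default values, which is why the third element is `5`):

```sql
SELECT sumForEach(arr)
FROM
(
    SELECT [1, 2] AS arr
    UNION ALL SELECT [3, 4, 5]
    UNION ALL SELECT [6, 7]
)
-- returns [10, 13, 5], as described above
```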


A `SELECT count() FROM table` query is not optimized, because the number of entries in the table is not stored separately. It will select some small column from the table and count the number of values in it.
<a name="agg_function-any"></a>
## any(x)
Selects the first encountered value.
Calculates the 'arg' value for a maximum 'val' value. If there are several different values of 'arg' for maximum values of 'val', the first of these values encountered is output.
<a name="agg_function-sum"></a>
## sum(x)
Calculates the sum.
Only works for numbers.
<a name="agg_function-summap"></a>
## sumMap(key, value)
Totals the 'value' array according to the keys specified in the 'key' array. Returns a tuple of two arrays: the keys in sorted order, and the values summed for the corresponding keys.
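A minimal sketch of the semantics, using literal arrays instead of real columns (keys sorted, values summed per key):

```sql
SELECT sumMap(k, v)
FROM
(
    SELECT [1, 2] AS k, [10, 10] AS v
    UNION ALL SELECT [2, 3], [10, 10]
)
-- expected: ([1, 2, 3], [10, 20, 10])
```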
Calculates the average.
Only works for numbers.
The result is always Float64.
<a name="agg_function-uniq"></a>
## uniq(x)
Calculates the approximate number of different values of the argument. Works for numbers, strings, dates, date-with-time, and for multiple arguments and tuple arguments.
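For example (the `visits` table and `UserID` column are placeholders):

```sql
SELECT
    uniq(UserID)      AS approx_users, -- approximate count, bounded memory
    uniqExact(UserID) AS exact_users   -- exact count, memory grows with cardinality
FROM visits
```

Use `uniqExact` when an exact answer is required and the extra memory consumption is acceptable.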
Approximates the quantile level using the [t-digest](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf) algorithm. The maximum error is 1%. Memory consumption by State is proportional to the logarithm of the number of passed values.
The performance of the function is lower than for `quantile` or `quantileTiming`. In terms of the ratio of State size to precision, this function is much better than `quantile`.
The result depends on the order of running the query, and is nondeterministic.


In all cases, if `IF NOT EXISTS` is specified, the query won't return an error if the table already exists. In this case, the query won't do anything.
There can be other clauses after the `ENGINE` clause in the query. See detailed documentation on how to create tables in the descriptions of [table engines](../operations/table_engines/index.md#table_engines).
### Default Values
The column description can specify an expression for a default value, in one of the following ways: `DEFAULT expr`, `MATERIALIZED expr`, `ALIAS expr`.
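A hedged illustration of the three forms (the table and the expressions in it are hypothetical):

```sql
CREATE TABLE example
(
    EventDate Date,
    EventTime DateTime DEFAULT now(),                    -- used when the column is omitted in INSERT
    EventDay UInt8 MATERIALIZED toDayOfMonth(EventTime), -- always computed; cannot be specified in INSERT
    DateText String ALIAS toString(EventDate)            -- computed on SELECT; not stored at all
) ENGINE = MergeTree() ORDER BY EventDate
```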


- Name of the aggregate function.

    If the function is parametric, specify its parameters too.

- Types of the aggregate function arguments.


```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE = AggregatingMergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]
```
**Query clauses**

When creating an `AggregatingMergeTree` table, the same [clauses](mergetree.md#table_engines-mergetree-configuring) are used as when creating a `MergeTree` table.

### Deprecated Method for Creating a Table

!!!attention
    Do not use this method in new projects and, if possible, switch old projects to the method described above.


**Query clauses**

When creating a `ReplacingMergeTree` table, the same [clauses](mergetree.md#table_engines-mergetree-configuring) are used as when creating a `MergeTree` table.
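For instance, a sketch of a typical table definition (the names are illustrative; `ver` is the optional version column that decides which duplicate survives):

```sql
CREATE TABLE latest_state
(
    Key UInt64,
    Value String,
    ver UInt32
) ENGINE = ReplacingMergeTree(ver)
ORDER BY Key
```

Note that deduplication happens during background merges at an unpredictable time, so a `SELECT` may still see duplicate keys until the parts are merged.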
### Deprecated Method for Creating a Table


We recommend using the engine together with `MergeTree`. Store complete data in `MergeTree`, and use `SummingMergeTree` to store aggregated data, for example, when preparing reports. This approach will prevent you from losing valuable data due to an incorrectly chosen primary key.
## Creating a Table
```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE = SummingMergeTree([columns])
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]
[SETTINGS name=value, ...]
```
For a description of query parameters, see the [query description](../../query_language/create.md#query_language-queries-create_table).
**SummingMergeTree parameters**

- `columns` — a tuple with the names of columns where values will be summed. Optional parameter.

    The columns must be of a numeric type and must not be part of the primary key.

    If `columns` is not specified, ClickHouse sums the values in all columns with a numeric data type that are not part of the primary key.
**Query clauses**

When creating a `SummingMergeTree` table, the same [clauses](mergetree.md#table_engines-mergetree-configuring) are used as when creating a `MergeTree` table.
### Deprecated Method for Creating a Table
!!!attention
    Do not use this method in new projects and, if possible, switch old projects to the method described above.
```sql
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE [=] SummingMergeTree(date-column [, sampling_expression], (primary, key), index_granularity, [columns])
```
All of the parameters, except `columns`, have the same meaning as in `MergeTree`.
## Data Processing

When data is inserted into a table, it is saved as-is. ClickHouse periodically merges the inserted parts of data, and it is during these merges that rows with the same primary key are summed and replaced with a single row for each resulting data part.

ClickHouse can merge the data parts in such a way that not all of the rows with the same primary key end up in the same final data part, i.e. the summation may be incomplete. Therefore, when selecting data (`SELECT`), you must use the aggregate function [sum()](../../query_language/agg_functions/reference.md#agg_function-sum) and a `GROUP BY` clause, as described in the example above.
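The behavior above can be sketched with a minimal `summtt` table (mirroring the `key`/`value` names that appear in the original example query):

```sql
CREATE TABLE summtt
(
    key UInt32,
    value UInt32
) ENGINE = SummingMergeTree()
ORDER BY key;

INSERT INTO summtt VALUES (1, 1), (1, 2), (2, 1);

-- Background merges may not have collapsed all duplicate keys yet,
-- so aggregate explicitly when selecting:
SELECT key, sum(value) FROM summtt GROUP BY key;
```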
### Common Rules for Summation

Values in columns with a numeric data type are summed. The set of columns is defined by the `columns` parameter.

If the values in all of the columns for summation are zero, the row is deleted.

For columns that are not part of the primary key and are not summed, an arbitrary value is selected from the existing ones.

Values in columns that are part of the primary key are not summed.

### Summation in AggregateFunction Columns


## -State
If you apply this combinator, the aggregate function returns not the final value (for the `uniq` function, for example, the number of unique values) but an intermediate aggregation state (for `uniq`, the hash table for calculating the number of unique values). It has the type `AggregateFunction(...)` and can be used for further processing, or it can be saved to a table for completing the aggregation later. See the sections "AggregatingMergeTree" and "Functions for working with intermediate aggregation states".
## -Merge


In all cases, if `IF NOT EXISTS` is specified, the query won't return an error if the table already exists. In this case, the query won't do anything.

After the `ENGINE` clause, other clauses can be used in the query, depending on the engine. See the detailed documentation on how to create tables in the descriptions of [table engines](../operations/table_engines/index.md#table_engines).

### Default Values