Merge pull request #5272 from BayoNet/DOCAPI-4197-limitations-settings

DOCAPI-4197: Limitation settings
alexey-milovidov 2019-06-16 03:03:29 +03:00 committed by GitHub
commit 9b9122f868
3 changed files with 150 additions and 21 deletions

View File

@ -64,7 +64,7 @@ Maximum number of bytes (uncompressed data) that can be read from a table when r
What to do when the volume of data read exceeds one of the limits: 'throw' or 'break'. By default, throw.
## max_rows_to_group_by
## max_rows_to_group_by {#settings-max_rows_to_group_by}
Maximum number of unique keys received from aggregation. This setting lets you limit memory consumption when aggregating.
@ -73,6 +73,17 @@ Maximum number of unique keys received from aggregation. This setting lets you l
What to do when the number of unique keys for aggregation exceeds the limit: 'throw', 'break', or 'any'. By default, throw.
Using the 'any' value lets you run an approximation of GROUP BY. The quality of this approximation depends on the statistical nature of the data.
## max_bytes_before_external_group_by {#settings-max_bytes_before_external_group_by}
Enables or disables execution of `GROUP BY` clauses in external memory. See [GROUP BY in external memory](../../query_language/select.md#select-group-by-in-external-memory).
Possible values:
- Maximum volume of RAM (in bytes) that can be used by a single [GROUP BY](../../query_language/select.md#select-group-by-clause) operation.
- 0 — `GROUP BY` in external memory disabled.
Default value: 0.
## max_rows_to_sort
Maximum number of rows before sorting. This allows you to limit memory consumption when sorting.
@ -194,12 +205,64 @@ Maximum number of bytes (uncompressed data) that can be passed to a remote serve
What to do when the amount of data exceeds one of the limits: 'throw' or 'break'. By default, throw.
## max_rows_in_join {#settings-max_rows_in_join}
Limits the number of rows in the hash table that is used when joining tables.
This setting applies to [SELECT ... JOIN](../../query_language/select.md#select-join) operations and the [Join table engine](../table_engines/join.md).
If a query contains multiple joins, ClickHouse checks this setting for every intermediate result.
ClickHouse can proceed with different actions when the limit is reached. Use the [join_overflow_mode](#settings-join_overflow_mode) setting to choose the action.
Possible values:
- Positive integer.
- 0 — Unlimited number of rows.
Default value: 0.
## max_bytes_in_join {#settings-max_bytes_in_join}
Limits the size in bytes of the hash table that is used when joining tables.
This setting applies to [SELECT ... JOIN](../../query_language/select.md#select-join) operations and the [Join table engine](../table_engines/join.md).
If a query contains multiple joins, ClickHouse checks this setting for every intermediate result.
ClickHouse can proceed with different actions when the limit is reached. Use the [join_overflow_mode](#settings-join_overflow_mode) setting to choose the action.
Possible values:
- Positive integer.
- 0 — Memory control is disabled.
Default value: 0.
## join_overflow_mode {#settings-join_overflow_mode}
Defines the action that ClickHouse performs when any of the following join limits is reached:
- [max_bytes_in_join](#settings-max_bytes_in_join)
- [max_rows_in_join](#settings-max_rows_in_join)
Possible values:
- `THROW` — ClickHouse throws an exception and breaks the operation.
- `BREAK` — ClickHouse breaks the operation and doesn't throw an exception.
Default value: `THROW`.
**See Also**
- [JOIN clause](../../query_language/select.md#select-join)
- [Join table engine](../table_engines/join.md)
## max_partitions_per_insert_block
Limits the maximum number of partitions in a single inserted block.
Possible values:
- Positive integer.
- 0 — Unlimited number of partitions.
@ -211,6 +274,4 @@ When inserting data, ClickHouse calculates the number of partitions in the inser
> "Too many partitions for single INSERT block (more than " + toString(max_parts) + "). The limit is controlled by 'max_partitions_per_insert_block' setting. Large number of partitions is a common misconception. It will lead to severe negative performance impact, including slow server startup, slow INSERT queries and slow SELECT queries. Recommended total number of partitions for a table is under 1000..10000. Please note, that partitioning is not intended to speed up SELECT queries (ORDER BY key is sufficient to make range queries fast). Partitions are intended for data manipulation (DROP PARTITION, etc)."
[Original article](https://clickhouse.yandex/docs/en/operations/settings/query_complexity/) <!--hide-->

View File

@ -258,6 +258,25 @@ Sets default strictness for [JOIN clauses](../../query_language/select.md#select
**Default value**: `ALL`
## join_any_take_last_row {#settings-join_any_take_last_row}
Changes the behavior of join operations with `ANY` strictness.
!!! note "Attention"
This setting applies only to the [Join](../table_engines/join.md) table engine.
Possible values:
- 0 — If the right table has more than one matching row, only the first one found is joined.
- 1 — If the right table has more than one matching row, only the last one found is joined.
Default value: 0.
**See Also**
- [JOIN clause](../../query_language/select.md#select-join)
- [Join table engine](../table_engines/join.md)
- [join_default_strictness](#settings-join_default_strictness)
## join_use_nulls {#settings-join_use_nulls}
@ -674,6 +693,49 @@ When sequential consistency is enabled, ClickHouse allows the client to execute
- [insert_quorum](#settings-insert_quorum)
- [insert_quorum_timeout](#settings-insert_quorum_timeout)
## max_network_bytes {#settings-max_network_bytes}
Limits the data volume (in bytes) that is received or transmitted over the network when executing a query. This setting applies to every individual query.
Possible values:
- Positive integer.
- 0 — Data volume control is disabled.
Default value: 0.
## max_network_bandwidth {#settings-max_network_bandwidth}
Limits the speed of data exchange over the network in bytes per second. This setting applies to every individual query.
Possible values:
- Positive integer.
- 0 — Bandwidth control is disabled.
Default value: 0.
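For illustration, the two per-query limits could be set for a session as follows. The numbers are arbitrary example values, not recommendations.

``` sql
-- Throttle each query to about 10 MB/s of network traffic (example value).
SET max_network_bandwidth = 10000000;
-- Let each query transfer at most about 1 GB over the network (example value).
SET max_network_bytes = 1000000000;
```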
## max_network_bandwidth_for_user {#settings-max_network_bandwidth_for_user}
Limits the speed of data exchange over the network in bytes per second. This setting applies to all concurrently running queries performed by a single user.
Possible values:
- Positive integer.
- 0 — Control of the data speed is disabled.
Default value: 0.
## max_network_bandwidth_for_all_users {#settings-max_network_bandwidth_for_all_users}
Limits the speed of data exchange over the network in bytes per second. This setting applies to all concurrently running queries on the server.
Possible values:
- Positive integer.
- 0 — Control of the data speed is disabled.
Default value: 0.
## allow_experimental_cross_to_join_conversion {#settings-allow_experimental_cross_to_join_conversion}
Enables or disables:

View File

@ -428,7 +428,6 @@ Joins the data in the normal [SQL JOIN](https://en.wikipedia.org/wiki/Join_(SQL)
!!! info "Note"
Not related to [ARRAY JOIN](#select-array-join-clause).
``` sql
SELECT <expr_list>
FROM <left_subquery>
@ -485,8 +484,6 @@ Be careful when using `GLOBAL`. For more information, see the section [Distribut
#### Usage Recommendations
All columns that are not needed for the `JOIN` are deleted from the subquery.
When running a `JOIN`, there is no optimization of the order of execution in relation to other stages of the query. The join (a search in the right table) is run before filtering in `WHERE` and before aggregation. In order to explicitly set the processing order, we recommend running a `JOIN` subquery with a subquery.
Example:
@ -542,7 +539,16 @@ Each time a query is run with the same `JOIN`, the subquery is run again because
In some cases, it is more efficient to use `IN` instead of `JOIN`.
Among the various types of `JOIN`, the most efficient is `ANY LEFT JOIN`, then `ANY INNER JOIN`. The least efficient are `ALL LEFT JOIN` and `ALL INNER JOIN`.
If you need a `JOIN` for joining with dimension tables (these are relatively small tables that contain dimension properties, such as names for advertising campaigns), a `JOIN` might not be very convenient due to the bulky syntax and the fact that the right table is re-accessed for every query. For such cases, there is an "external dictionaries" feature that you should use instead of `JOIN`. For more information, see the section [External dictionaries](dicts/external_dicts.md).
If you need a `JOIN` for joining with dimension tables (these are relatively small tables that contain dimension properties, such as names for advertising campaigns), a `JOIN` might not be very convenient due to the fact that the right table is re-accessed for every query. For such cases, there is an "external dictionaries" feature that you should use instead of `JOIN`. For more information, see the section [External dictionaries](dicts/external_dicts.md).
**Memory Limitations**
ClickHouse uses the [hash join](https://en.wikipedia.org/wiki/Hash_join) algorithm. ClickHouse takes the `<right_subquery>` and creates a hash table for it in RAM. If you need to restrict the memory consumption of join operations, use the following settings:
- [max_rows_in_join](../operations/settings/query_complexity.md#settings-max_rows_in_join) — Limits the number of rows in the hash table.
- [max_bytes_in_join](../operations/settings/query_complexity.md#settings-max_bytes_in_join) — Limits the size of the hash table.
When any of these limits is reached, ClickHouse acts as the [join_overflow_mode](../operations/settings/query_complexity.md#settings-join_overflow_mode) setting instructs.
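A minimal sketch of how these settings might be combined in a session before running a query with a `JOIN`. The limit values are arbitrary and chosen only for illustration.

``` sql
-- Cap the join hash table at 1,000,000 rows or about 1 GB (example values).
SET max_rows_in_join = 1000000;
SET max_bytes_in_join = 1000000000;
-- Break the operation when a limit is reached instead of throwing an exception.
SET join_overflow_mode = 'break';
```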
#### Processing of Empty or NULL Cells
@ -683,21 +689,21 @@ You can use WITH TOTALS in subqueries, including subqueries in the JOIN clause (
#### GROUP BY in External Memory {#select-group-by-in-external-memory}
You can enable dumping temporary data to the disk to restrict memory usage during GROUP BY.
The `max_bytes_before_external_group_by` setting determines the threshold RAM consumption for dumping GROUP BY temporary data to the file system. If set to 0 (the default), it is disabled.
You can enable dumping temporary data to the disk to restrict memory usage during `GROUP BY`.
The [max_bytes_before_external_group_by](../operations/settings/settings.md#settings-max_bytes_before_external_group_by) setting determines the threshold RAM consumption for dumping `GROUP BY` temporary data to the file system. If set to 0 (the default), it is disabled.
When using `max_bytes_before_external_group_by`, we recommend that you set max_memory_usage about twice as high. This is necessary because there are two stages to aggregation: reading the date and forming intermediate data (1) and merging the intermediate data (2). Dumping data to the file system can only occur during stage 1. If the temporary data wasn't dumped, then stage 2 might require up to the same amount of memory as in stage 1.
When using `max_bytes_before_external_group_by`, we recommend that you set `max_memory_usage` about twice as high. This is necessary because there are two stages to aggregation: reading the data and forming intermediate data (1) and merging the intermediate data (2). Dumping data to the file system can only occur during stage 1. If the temporary data wasn't dumped, then stage 2 might require up to the same amount of memory as in stage 1.
For example, if `max_memory_usage` was set to 10000000000 and you want to use external aggregation, it makes sense to set `max_bytes_before_external_group_by` to 10000000000, and max_memory_usage to 20000000000. When external aggregation is triggered (if there was at least one dump of temporary data), maximum consumption of RAM is only slightly more than ` max_bytes_before_external_group_by`.
For example, if [max_memory_usage](../operations/settings/settings.md#settings_max_memory_usage) was set to 10000000000 and you want to use external aggregation, it makes sense to set `max_bytes_before_external_group_by` to 10000000000, and `max_memory_usage` to 20000000000. When external aggregation is triggered (if there was at least one dump of temporary data), maximum consumption of RAM is only slightly more than `max_bytes_before_external_group_by`.
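As a sketch of the example above, the two settings could be applied at the session level with `SET` before running the aggregation. The values are the ones from the text. Applying them per session rather than in a settings profile is an assumption made for illustration.

``` sql
-- Start dumping GROUP BY temporary data to disk at about 10 GB of RAM.
SET max_bytes_before_external_group_by = 10000000000;
-- Allow about twice that for the whole query, so the merge stage has room.
SET max_memory_usage = 20000000000;
```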
With distributed query processing, external aggregation is performed on remote servers. In order for the requestor server to use only a small amount of RAM, set ` distributed_aggregation_memory_efficient` to 1.
With distributed query processing, external aggregation is performed on remote servers. In order for the requester server to use only a small amount of RAM, set `distributed_aggregation_memory_efficient` to 1.
When merging data flushed to the disk, as well as when merging results from remote servers when the ` distributed_aggregation_memory_efficient` setting is enabled, consumes up to 1/256 \* the number of threads from the total amount of RAM.
Merging data flushed to the disk, as well as merging results from remote servers when the `distributed_aggregation_memory_efficient` setting is enabled, consumes up to `1/256 * the_number_of_threads` of the total amount of RAM.
When external aggregation is enabled, if there was less than ` max_bytes_before_external_group_by` of data (i.e. data was not flushed), the query runs just as fast as without external aggregation. If any temporary data was flushed, the run time will be several times longer (approximately three times).
When external aggregation is enabled, if there was less than `max_bytes_before_external_group_by` of data (i.e. data was not flushed), the query runs just as fast as without external aggregation. If any temporary data was flushed, the run time will be several times longer (approximately three times).
If you have an ORDER BY with a small LIMIT after GROUP BY, then the ORDER BY CLAUSE will not use significant amounts of RAM.
But if the ORDER BY doesn't have LIMIT, don't forget to enable external sorting (`max_bytes_before_external_sort`).
If you have an `ORDER BY` with a small `LIMIT` after `GROUP BY`, then the `ORDER BY` clause will not use significant amounts of RAM.
But if the `ORDER BY` doesn't have `LIMIT`, don't forget to enable external sorting (`max_bytes_before_external_sort`).
### LIMIT BY Clause
@ -1145,11 +1151,11 @@ It also makes sense to specify a local table in the `GLOBAL IN` clause, in case
In addition to results, you can also get minimum and maximum values for the results columns. To do this, set the **extremes** setting to 1. Minimums and maximums are calculated for numeric types, dates, and dates with times. For other columns, the default values are output.
An extra two rows are calculated the minimums and maximums, respectively. These extra two rows are output in JSON\*, TabSeparated\*, and Pretty\* formats, separate from the other rows. They are not output for other formats.
Two extra rows are calculated: the minimums and the maximums, respectively. These two extra rows are output in `JSON*`, `TabSeparated*`, and `Pretty*` formats, separate from the other rows. They are not output for other formats.
In JSON\* formats, the extreme values are output in a separate 'extremes' field. In TabSeparated\* formats, the row comes after the main result, and after 'totals' if present. It is preceded by an empty row (after the other data). In Pretty\* formats, the row is output as a separate table after the main result, and after 'totals' if present.
In `JSON*` formats, the extreme values are output in a separate 'extremes' field. In `TabSeparated*` formats, the row comes after the main result, and after 'totals' if present. It is preceded by an empty row (after the other data). In `Pretty*` formats, the row is output as a separate table after the main result, and after 'totals' if present.
Extreme values are calculated for rows that have passed through LIMIT. However, when using 'LIMIT offset, size', the rows before 'offset' are included in 'extremes'. In stream requests, the result may also include a small number of rows that passed through LIMIT.
Extreme values are calculated for rows before `LIMIT`, but after `LIMIT BY`. However, when using `LIMIT offset, size`, the rows before `offset` are included in `extremes`. In stream requests, the result may also include a small number of rows that passed through `LIMIT`.
### Notes