mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-11-24 16:42:05 +00:00
Expand max_parallel_replicas documentation
Based on discussion in #17807
This commit is contained in:
parent
542dd454aa
commit
17a8a4e966
@ -1093,9 +1093,14 @@ See the section “WITH TOTALS modifier”.
|
||||
|
||||
## max_parallel_replicas {#settings-max_parallel_replicas}
|
||||
|
||||
The maximum number of replicas for each shard when executing a query.
|
||||
For consistency (to get different parts of the same data split), this option only works when the sampling key is set.
|
||||
Replica lag is not controlled.
|
||||
The maximum number of replicas for each shard when executing a query. In limited circumstances, this can make a query faster by executing it on more servers. This setting is only useful for replicated tables with a sampling key. There are cases where performance will not improve or even worsen:
|
||||
|
||||
- the position of the sampling key in the partitioning key's order doesn't allow efficient range scans
|
||||
- adding a sampling key to the table makes filtering by other columns less efficient
|
||||
- the sampling key is an expression that is expensive to calculate
|
||||
- the cluster's latency distribution has a long tail, so that querying more servers increases the query's overall latency
|
||||
|
||||
In addition, this setting will produce incorrect results when joins or subqueries are involved, and all tables don't meet certain conditions. See [Distributed Subqueries and max_parallel_replicas](../../sql-reference/operators/in.md/#max_parallel_replica-subqueries) for more details.
|
||||
|
||||
## compile {#compile}
|
||||
|
||||
|
@ -197,3 +197,25 @@ This is more optimal than using the normal IN. However, keep the following point
|
||||
5. If you need to use GLOBAL IN often, plan the location of the ClickHouse cluster so that a single group of replicas resides in no more than one data center with a fast network between them, so that a query can be processed entirely within a single data center.
|
||||
|
||||
It also makes sense to specify a local table in the `GLOBAL IN` clause, in case this local table is only available on the requestor server and you want to use data from it on remote servers.
|
||||
|
||||
### Distributed Subqueries and max_parallel_replicas {#max_parallel_replica-subqueries}
|
||||
|
||||
When max_parallel_replicas is greater than 1, distributed queries are further transformed. For example, the following:
|
||||
|
||||
```sql
|
||||
SEELECT CounterID, count() FROM distributed_table_1 WHERE UserID IN (SELECT UserID FROM local_table_2 WHERE CounterID < 100)
|
||||
SETTINGS max_parallel_replicas=3
|
||||
```
|
||||
|
||||
is transformed on each server into
|
||||
|
||||
```sql
|
||||
SELECT CounterID, count() FROM local_table_1 WHERE UserID IN (SELECT UserID FROM local_table_2 WHERE CounterID < 100)
|
||||
SETTINGS parallel_replicas_count=3, parallel_replicas_offset=M
|
||||
```
|
||||
|
||||
where M is between 1 and 3 depending on which replica the local query is executing on. These settings affect every MergeTree-family table in the query and have the same effect as applying `SAMPLE 1/3 OFFSET (M-1)/3` on each table.
|
||||
|
||||
Therefore adding the max_parallel_replicas setting will only produce correct results if both tables have the same replication scheme and are sampled by UserID or a subkey of it. In particular, if local_table_2 does not have a sampling key, incorrect results will be produced. The same rule applies to JOIN.
|
||||
|
||||
One workaround if local_table_2 doesn't meet the requirements, is to use `GLOBAL IN` or `GLOBAL JOIN`.
|
||||
|
Loading…
Reference in New Issue
Block a user