Expand max_parallel_replicas documentation

Based on discussion in #17807
2024-11-24 16:42:05 +00:00 · 2020-12-12 13:57:01 -08:00 · 2020-12-12 13:57:01 -08:00 · 17a8a4e966
commit 17a8a4e966
parent 542dd454aa
2 changed files with 30 additions and 3 deletions
--- a/docs/en/operations/settings/settings.md
+++ b/docs/en/operations/settings/settings.md
@ -1093,9 +1093,14 @@ See the section “WITH TOTALS modifier”.

 ## max_parallel_replicas {#settings-max_parallel_replicas}

-The maximum number of replicas for each shard when executing a query.
-For consistency (to get different parts of the same data split), this option only works when the sampling key is set.
-Replica lag is not controlled.
+The maximum number of replicas for each shard when executing a query. In limited circumstances, this can make a query faster by executing it on more servers. This setting is only useful for replicated tables with a sampling key. There are cases where performance will not improve or even worsen:
+
+- the position of the sampling key in the partitioning key's order doesn't allow efficient range scans
+- adding a sampling key to the table makes filtering by other columns less efficient
+- the sampling key is an expression that is expensive to calculate
+- the cluster's latency distribution has a long tail, so that querying more servers increases the query's overall latency
+
+In addition, this setting will produce incorrect results when joins or subqueries are involved, and all tables don't meet certain conditions. See [Distributed Subqueries and max_parallel_replicas](../../sql-reference/operators/in.md/#max_parallel_replica-subqueries) for more details.

 ## compile {#compile}

--- a/docs/en/sql-reference/operators/in.md
+++ b/docs/en/sql-reference/operators/in.md
@ -197,3 +197,25 @@ This is more optimal than using the normal IN. However, keep the following point
 5.  If you need to use GLOBAL IN often, plan the location of the ClickHouse cluster so that a single group of replicas resides in no more than one data center with a fast network between them, so that a query can be processed entirely within a single data center.

 It also makes sense to specify a local table in the `GLOBAL IN` clause, in case this local table is only available on the requestor server and you want to use data from it on remote servers.
+
+### Distributed Subqueries and max_parallel_replicas {#max_parallel_replica-subqueries}
+
+When max_parallel_replicas is greater than 1, distributed queries are further transformed. For example, the following:
+
+```sql
+SEELECT CounterID, count() FROM distributed_table_1 WHERE UserID IN (SELECT UserID FROM local_table_2 WHERE CounterID < 100)
+SETTINGS max_parallel_replicas=3
+```
+
+is transformed on each server into
+
+```sql
+SELECT CounterID, count() FROM local_table_1 WHERE UserID IN (SELECT UserID FROM local_table_2 WHERE CounterID < 100)
+SETTINGS parallel_replicas_count=3, parallel_replicas_offset=M
+```
+
+where M is between 1 and 3 depending on which replica the local query is executing on. These settings affect every MergeTree-family table in the query and have the same effect as applying `SAMPLE 1/3 OFFSET (M-1)/3` on each table.
+
+Therefore adding the max_parallel_replicas setting will only produce correct results if both tables have the same replication scheme and are sampled by UserID or a subkey of it. In particular, if local_table_2 does not have a sampling key, incorrect results will be produced. The same rule applies to JOIN.
+
+One workaround if local_table_2 doesn't meet the requirements, is to use `GLOBAL IN` or `GLOBAL JOIN`.