update docs for async insert deduplication

Han Fei 2023-01-20 14:42:11 +01:00
parent 3e08a98f16
commit 5fc4998f10
2 changed files with 70 additions and 1 deletions


@ -176,6 +176,59 @@ Similar to [replicated_deduplication_window](#replicated-deduplication-window),
The time is relative to the time of the most recent record, not to the wall time. If it's the only record it will be stored forever.
## replicated_deduplication_window_for_async_inserts {#replicated-deduplication-window-for-async-inserts}
The number of most recent async-inserted blocks for which ClickHouse Keeper stores hash sums to check for duplicates.
Possible values:
- Any positive integer.
- 0 (disable deduplication for async inserts).
Default value: 10000.
An [Async Insert](./settings.md#async-insert) command is cached in one or more blocks (parts). For [insert deduplication](../../engines/table-engines/mergetree-family/replication.md), when writing into replicated tables, ClickHouse writes the hash sum of each insert into ClickHouse Keeper. Hash sums are stored only for the most recent `replicated_deduplication_window_for_async_inserts` blocks. The oldest hash sums are removed from ClickHouse Keeper.
A large value of `replicated_deduplication_window_for_async_inserts` slows down async inserts because more entries have to be compared.
The hash sum is calculated from the composition of the field names and types and the data of the insert (stream of bytes).
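The count-based window described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `DedupWindow` class is hypothetical, SHA-256 stands in for whatever hash ClickHouse actually uses, and an in-memory dictionary stands in for ClickHouse Keeper.

```python
import hashlib
from collections import OrderedDict


class DedupWindow:
    """Toy model of a count-bounded deduplication window (not ClickHouse code)."""

    def __init__(self, window):
        self.window = window          # plays the role of the setting value
        self.hashes = OrderedDict()   # stands in for hash sums kept in Keeper

    def insert(self, columns, data):
        """Return True if the insert is accepted, False if it is a duplicate."""
        if self.window == 0:
            return True  # 0 disables deduplication
        # Assumption: hash over (name, type) pairs plus the raw bytes of the insert.
        digest = hashlib.sha256(repr(columns).encode() + data).hexdigest()
        if digest in self.hashes:
            return False
        self.hashes[digest] = None
        # Drop the oldest hash sums once the window is exceeded.
        while len(self.hashes) > self.window:
            self.hashes.popitem(last=False)
        return True
```

With a window of 2, a repeated insert is rejected while its hash sum is still stored, and accepted again once two newer inserts have pushed it out of the window.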
## replicated_deduplication_window_seconds_for_async_inserts {#replicated-deduplication-window-seconds-for-async_inserts}
The number of seconds after which the hash sums of the async inserts are removed from ClickHouse Keeper.
Possible values:
- Any positive integer.
Default value: 604800 (1 week).
Similar to [replicated_deduplication_window_for_async_inserts](#replicated-deduplication-window-for-async-inserts), `replicated_deduplication_window_seconds_for_async_inserts` specifies how long to store hash sums of blocks for async insert deduplication. Hash sums older than `replicated_deduplication_window_seconds_for_async_inserts` are removed from ClickHouse Keeper, even if there are fewer of them than `replicated_deduplication_window_for_async_inserts`.
The time is relative to the time of the most recent record, not to the wall time. If it is the only record, it will be stored forever.
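The "relative to the most recent record" rule can be sketched as follows. The `prune_by_age` function and the `(timestamp, hash)` tuples are hypothetical and only illustrate the rule as worded above; they do not reflect how ClickHouse stores or expires entries internally.

```python
def prune_by_age(records, window_seconds):
    """Keep records no older than window_seconds relative to the newest record.

    records: list of (timestamp_seconds, hash_sum) tuples (hypothetical layout).
    """
    if not records:
        return []
    # Age is measured against the most recent record, not against wall time.
    newest = max(ts for ts, _ in records)
    return [(ts, h) for ts, h in records if newest - ts <= window_seconds]
```

Because age is measured against the newest record, a single record always has age zero and is never pruned, which matches the "stored forever" remark.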
## use_async_block_ids_cache {#use-async-block-ids-cache}
If true, we cache the hash sums of the async inserts.
Possible values:
- true, false
Default value: false.
A block bearing multiple async inserts will generate multiple hash sums. When some of the inserts are duplicated, ClickHouse Keeper returns only one duplicated hash sum per RPC, which causes unnecessary RPC retries. This cache watches the hash sums path in ClickHouse Keeper. When it observes updates there, the cache refreshes as soon as possible, so that duplicated inserts can be filtered in memory.
## async_block_ids_cache_min_update_interval_ms
The minimum interval (in milliseconds) between updates of the `use_async_block_ids_cache`.
Possible values:
- Any positive integer.
Default value: 100.
Normally, the `use_async_block_ids_cache` updates as soon as there are updates in the watched ClickHouse Keeper path. However, the cache updates might be too frequent and become a heavy burden. This minimum interval prevents the cache from updating too fast. Note that if this value is set too high, a block with duplicated inserts will have a longer retry time.
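The effect of a minimum update interval can be sketched with a toy throttled cache. Everything here is an assumption made for illustration: the `ThrottledCache` class, its `fetch` callback (standing in for reading the hash sums path from ClickHouse Keeper), and the explicit `now_ms` argument are hypothetical and not part of ClickHouse.

```python
class ThrottledCache:
    """Toy cache that refreshes at most once per min_interval_ms (not ClickHouse code)."""

    def __init__(self, min_interval_ms, fetch):
        self.min_interval_ms = min_interval_ms
        self.fetch = fetch            # hypothetical callback returning current hash sums
        self.hashes = set()
        self.last_update_ms = None
        self.pending = False          # an update was seen but not yet applied

    def on_watch_event(self, now_ms):
        """Handle an update notification; return True if the cache was refreshed."""
        too_soon = (
            self.last_update_ms is not None
            and now_ms - self.last_update_ms < self.min_interval_ms
        )
        if too_soon:
            self.pending = True       # refresh is deferred, cache stays stale
            return False
        self.hashes = set(self.fetch())
        self.last_update_ms = now_ms
        self.pending = False
        return True
```

While a refresh is deferred, the cache does not yet know about the newest hash sums, which is why a larger interval can lengthen the retry time for blocks containing duplicated inserts.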
## max_replicated_logs_to_keep
How many records may be in the ClickHouse Keeper log if there is an inactive replica. An inactive replica becomes lost when this number is exceeded.
@ -745,4 +798,4 @@ You can see which parts of `s` were stored using the sparse serialization:
│ id │ Default │
│ s │ Sparse │
└────────┴────────────────────┘
```
```


@ -1394,6 +1394,22 @@ By default, blocks inserted into replicated tables by the `INSERT` statement are
For replicated tables, by default only the 100 most recent blocks for each partition are deduplicated (see [replicated_deduplication_window](merge-tree-settings.md/#replicated-deduplication-window), [replicated_deduplication_window_seconds](merge-tree-settings.md/#replicated-deduplication-window-seconds)).
For non-replicated tables, see [non_replicated_deduplication_window](merge-tree-settings.md/#non-replicated-deduplication-window).
## async_insert_deduplicate {#settings-async-insert-deduplicate}
Enables or disables insert deduplication of `ASYNC INSERT` (for Replicated\* tables).
Possible values:
- 0 — Disabled.
- 1 — Enabled.
Default value: 1.
By default, async inserts into replicated tables made by the `INSERT` statement with [async_insert](#async-insert) enabled are deduplicated (see [Data Replication](../../engines/table-engines/mergetree-family/replication.md)).
For replicated tables, by default only the 10000 most recent inserts for each partition are deduplicated (see [replicated_deduplication_window_for_async_inserts](merge-tree-settings.md/#replicated-deduplication-window-async-inserts), [replicated_deduplication_window_seconds_for_async_inserts](merge-tree-settings.md/#replicated-deduplication-window-seconds-async-inserts)).
We recommend enabling [use_async_block_ids_cache](merge-tree-settings.md/#use-async-block-ids-cache) to increase the efficiency of deduplication.
This setting has no effect on non-replicated tables.
## deduplicate_blocks_in_dependent_materialized_views {#settings-deduplicate-blocks-in-dependent-materialized-views}
Enables or disables the deduplication check for materialized views that receive data from Replicated\* tables.