From 9123cce70307c87cf74d1f4ee03db607efb607ac Mon Sep 17 00:00:00 2001
From: BayoNet
Date: Fri, 6 Dec 2019 09:52:30 +0300
Subject: [PATCH] DOCS-282: DirectoryMonitor settings description. (#7801)

* Doc links fix.
* More links fix.
* CLICKHOUSEDOCS-282: Settings descriptions.
* CLICKHOUSEDOCS-282: Edits after the talking with developer.
* CLICKHOUSEDOCS-282: Fixes.
* Update docs/en/operations/settings/settings.md

Co-Authored-By: Ivan Blinkov

* Update docs/en/operations/settings/settings.md

Co-Authored-By: Ivan Blinkov

* Update docs/en/operations/settings/settings.md

Co-Authored-By: Ivan Blinkov

* Update docs/en/operations/settings/settings.md

Co-Authored-By: Ivan Blinkov

* CLICKHOUSEDOCS-282: Clarification.
---
 docs/en/operations/settings/settings.md      | 35 +++++++++++++++++++
 .../operations/table_engines/distributed.md  |  9 ++---
 2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/docs/en/operations/settings/settings.md b/docs/en/operations/settings/settings.md
index 3011b575600..ab3f5b95a56 100644
--- a/docs/en/operations/settings/settings.md
+++ b/docs/en/operations/settings/settings.md
@@ -1007,6 +1007,41 @@ Error count of each replica is capped at this value, preventing a single replica
 - [Table engine Distributed](../../operations/table_engines/distributed.md)
 - [`distributed_replica_error_half_life`](#settings-distributed_replica_error_half_life)
 
+
+## distributed_directory_monitor_sleep_time_ms {#distributed_directory_monitor_sleep_time_ms}
+
+Base interval for the [Distributed](../table_engines/distributed.md) table engine to send data. The actual interval grows exponentially if errors occur.
+
+Possible values:
+
+- Positive integer number of milliseconds.
+
+Default value: 100 milliseconds.
+
+
+## distributed_directory_monitor_max_sleep_time_ms {#distributed_directory_monitor_max_sleep_time_ms}
+
+Maximum interval for the [Distributed](../table_engines/distributed.md) table engine to send data. Limits the exponential growth of the interval set in the [distributed_directory_monitor_sleep_time_ms](#distributed_directory_monitor_sleep_time_ms) setting.
+
+Possible values:
+
+- Positive integer number of milliseconds.
+
+Default value: 30000 milliseconds (30 seconds).
+
+## distributed_directory_monitor_batch_inserts {#distributed_directory_monitor_batch_inserts}
+
+Enables/disables sending of inserted data in batches.
+
+When batch sending is enabled, the [Distributed](../table_engines/distributed.md) table engine tries to send multiple files of inserted data in one operation instead of sending them separately. Batch sending improves cluster performance by better utilizing server and network resources.
+
+Possible values:
+
+- 1 — Enabled.
+- 0 — Disabled.
+
+Default value: 0.
+
 ## os_thread_priority {#setting-os_thread_priority}
 
 Sets the priority ([nice](https://en.wikipedia.org/wiki/Nice_(Unix))) for threads that execute queries. The OS scheduler considers this priority when choosing the next thread to run on each available CPU core.
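The three settings added above are ordinary user-level settings, so the quickest way to experiment with them is to read and override them in a client session. Below is a minimal sketch, assuming a recent server; the overridden values are purely illustrative, and in production they would more typically live in a settings profile (users.xml). Depending on the server version, the background sender may read these settings from the server-wide profile rather than from the session, so a profile is the more reliable place to change them.

```sql
-- Inspect the current values of the DirectoryMonitor-related settings.
SELECT name, value, changed
FROM system.settings
WHERE name LIKE 'distributed_directory_monitor%';

-- Override them for the current session; the numbers are arbitrary examples.
SET distributed_directory_monitor_sleep_time_ms = 200;        -- base interval between send attempts
SET distributed_directory_monitor_max_sleep_time_ms = 60000;  -- cap for the exponential backoff
SET distributed_directory_monitor_batch_inserts = 1;          -- send queued files in batches
```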
diff --git a/docs/en/operations/table_engines/distributed.md b/docs/en/operations/table_engines/distributed.md
index 38d085da568..a22fd43b34f 100644
--- a/docs/en/operations/table_engines/distributed.md
+++ b/docs/en/operations/table_engines/distributed.md
@@ -87,12 +87,9 @@ The Distributed engine requires writing clusters to the config file. Clusters fr
 
 There are two methods for writing data to a cluster:
 
-First, you can define which servers to write which data to, and perform the write directly on each shard. In other words, perform INSERT in the tables that the distributed table "looks at".
-This is the most flexible solution – you can use any sharding scheme, which could be non-trivial due to the requirements of the subject area.
-This is also the most optimal solution, since data can be written to different shards completely independently.
+First, you can define which servers to write which data to, and perform the write directly on each shard. In other words, perform INSERT in the tables that the distributed table "looks at". This is the most flexible solution – you can use any sharding scheme, which could be non-trivial due to the requirements of the subject area. This is also the most optimal solution, since data can be written to different shards completely independently.
 
-Second, you can perform INSERT in a Distributed table. In this case, the table will distribute the inserted data across servers itself.
-In order to write to a Distributed table, it must have a sharding key set (the last parameter). In addition, if there is only one shard, the write operation works without specifying the sharding key, since it doesn't have any meaning in this case.
+Second, you can perform INSERT in a Distributed table. In this case, the table will distribute the inserted data across servers itself. In order to write to a Distributed table, it must have a sharding key set (the last parameter). In addition, if there is only one shard, the write operation works without specifying the sharding key, since it doesn't have any meaning in this case.
 
 Each shard can have a weight defined in the config file. By default, the weight is equal to one. Data is distributed across shards in the amount proportional to the shard weight. For example, if there are two shards and the first has a weight of 9 while the second has a weight of 10, the first will be sent 9 / 19 parts of the rows, and the second will be sent 10 / 19.
 
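The second write method in the hunk above (inserting through the Distributed table and letting it route rows by the sharding key) can be illustrated with a short sketch. Everything here is hypothetical: the cluster `example_cluster` must already be declared in the server config, and the database, table, and columns are made up for the example.

```sql
-- A local table that exists on every shard of the hypothetical cluster.
CREATE TABLE default.hits_local ON CLUSTER example_cluster
(
    user_id UInt64,
    event_time DateTime,
    url String
)
ENGINE = MergeTree()
ORDER BY (user_id, event_time);

-- A Distributed table that "looks at" hits_local.
-- Engine parameters: cluster, database, table, and the sharding key (last argument).
CREATE TABLE default.hits_all AS default.hits_local
ENGINE = Distributed(example_cluster, default, hits_local, rand());

-- Writing through the Distributed table: the engine routes each row to a shard itself.
INSERT INTO default.hits_all VALUES (42, now(), 'https://example.com');
```

With `rand()` as the sharding key, rows are spread roughly evenly; a deterministic expression such as `user_id` keeps all rows for one user on the same shard.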
@@ -115,7 +112,7 @@ You should be concerned about the sharding scheme in the following cases:
 
 - Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
 - A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners). In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard. Alternatively, as we've done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into "layers", where a layer may consist of multiple shards. Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them. Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
 
-Data is written asynchronously. For an INSERT to a Distributed table, the data block is just written to the local file system. The data is sent to the remote servers in the background as soon as possible. You should check whether data is sent successfully by checking the list of files (data waiting to be sent) in the table directory: /var/lib/clickhouse/data/database/table/.
+Data is written asynchronously. When data is inserted into the table, the data block is just written to the local file system. The data is sent to the remote servers in the background as soon as possible. The period of data sending is managed by the [distributed_directory_monitor_sleep_time_ms](../settings/settings.md#distributed_directory_monitor_sleep_time_ms) and [distributed_directory_monitor_max_sleep_time_ms](../settings/settings.md#distributed_directory_monitor_max_sleep_time_ms) settings. The `Distributed` engine sends each file with inserted data separately, but you can enable batch sending of files with the [distributed_directory_monitor_batch_inserts](../settings/settings.md#distributed_directory_monitor_batch_inserts) setting. This setting improves cluster performance by better utilizing local server and network resources. You should check whether data is sent successfully by checking the list of files (data waiting to be sent) in the table directory: `/var/lib/clickhouse/data/database/table/`.
 
 If the server ceased to exist or had a rough restart (for example, after a device failure) after an INSERT to a Distributed table, the inserted data might be lost. If a damaged data part is detected in the table directory, it is transferred to the 'broken' subdirectory and no longer used.
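Because delivery is asynchronous, a successful INSERT only guarantees that the block reached the local queue directory, not the remote shards. The sketch below shows one way to make that behavior visible and to force delivery, assuming a server version that supports the `SYSTEM ... DISTRIBUTED` statements and reusing the hypothetical `hits_all` table from the earlier sketch.

```sql
-- Pause background sending for the Distributed table.
SYSTEM STOP DISTRIBUTED SENDS default.hits_all;

-- This INSERT returns as soon as the block is written to the local queue
-- directory (the table directory /var/lib/clickhouse/data/<database>/<table>/
-- in a default installation).
INSERT INTO default.hits_all VALUES (7, now(), 'https://example.org');

-- Push everything queued for this table to the remote shards synchronously,
-- then re-enable background sending.
SYSTEM FLUSH DISTRIBUTED default.hits_all;
SYSTEM START DISTRIBUTED SENDS default.hits_all;
```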