---
slug: /ru/operations/utilities/clickhouse-copier
sidebar_position: 59
sidebar_label: clickhouse-copier
---

# clickhouse-copier {#clickhouse-copier}

Copies data from tables in one cluster to tables in another (or the same) cluster.

You can run several `clickhouse-copier` instances on different servers to perform the same task. ZooKeeper is used to synchronize the processes.

After starting, `clickhouse-copier`:

- Connects to ZooKeeper and receives:

    - Copying tasks.
    - The state of the copying tasks.

- Performs the tasks.

Each running process chooses the "closest" shard of the source cluster and copies the data into the destination cluster, resharding it if necessary.

`clickhouse-copier` tracks changes in ZooKeeper and applies them on the fly.

To reduce network traffic, we recommend running `clickhouse-copier` on the same server where the source data is located.

## Running clickhouse-copier {#zapusk-clickhouse-copier}

The utility should be run manually:

``` bash
$ clickhouse-copier --daemon --config zookeeper.xml --task-path /task/path --base-dir /path/to/dir
```

Launch parameters:

- `daemon` - starts `clickhouse-copier` in daemon mode.
- `config` - the path to the `zookeeper.xml` file with the parameters for connecting to ZooKeeper.
- `task-path` - the path to the ZooKeeper node. The node is used to synchronize `clickhouse-copier` processes and to store tasks. Tasks are stored in `$task-path/description`.
- `task-file` - an optional path to a file with the task configuration to upload to ZooKeeper.
- `task-upload-force` - upload `task-file` to ZooKeeper even if it has already been uploaded.
- `base-dir` - the path to logs and auxiliary files. On startup, `clickhouse-copier` creates `clickhouse-copier_YYYYMMHHSS_<PID>` subdirectories in `$base-dir`. If this parameter is omitted, the directories are created in the directory where `clickhouse-copier` was launched.

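As a rough sketch of how the parameters above combine, the following script assembles a command line that also passes `task-file`, so that the task description is uploaded to ZooKeeper on startup. The file name `task.xml`, the other sample paths, and the `DRY_RUN` switch are assumptions made for illustration; by default the script only prints the command instead of running it.

```bash
#!/bin/sh
# Hypothetical file names and paths: adjust them to your installation.
CONFIG=zookeeper.xml      # ZooKeeper connection settings
TASK_PATH=/task/path      # ZooKeeper node used for tasks and synchronization
TASK_FILE=task.xml        # local task description to upload (assumed name)
BASE_DIR=/path/to/dir     # logs and auxiliary files

# Print the full command line so it can be reviewed before anything is started.
copier_command() {
    echo "clickhouse-copier --daemon --config $CONFIG --task-path $TASK_PATH --task-file $TASK_FILE --base-dir $BASE_DIR"
}

# DRY_RUN is an illustrative switch of this script, not a clickhouse-copier option.
if [ "${DRY_RUN:-1}" = "1" ]; then
    copier_command
else
    # Word splitting is intended: the sample values contain no spaces.
    $(copier_command)
fi
```

Running the script with `DRY_RUN=0` starts the copier. If the ZooKeeper node already holds a description that should be overwritten, the `task-upload-force` parameter described above is the one to add.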
## Format of zookeeper.xml {#format-zookeeper-xml}

``` xml
<clickhouse>
    <logger>
        <level>trace</level>
        <size>100M</size>
        <count>3</count>
    </logger>

    <zookeeper>
        <node index="1">
            <host>127.0.0.1</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>
```

## Configuration of Copying Tasks {#konfiguratsiia-zadanii-na-kopirovanie}

``` xml
<clickhouse>
    <!-- Configuration of clusters as in an ordinary server config -->
    <remote_servers>
        <source_cluster>
            <!--
                source cluster & destination clusters accept exactly the same
                parameters as parameters for the usual Distributed table
                see https://clickhouse.com/docs/ru/engines/table-engines/special/distributed/
            -->
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>127.0.0.1</host>
                    <port>9000</port>
                    <!--
                    <user>default</user>
                    <password>default</password>
                    <secure>1</secure>
                    -->
                </replica>
            </shard>
            ...
        </source_cluster>

        <destination_cluster>
            ...
        </destination_cluster>
    </remote_servers>

    <!-- How many simultaneously active workers are possible. If you run more workers, superfluous workers will sleep. -->
    <max_workers>2</max_workers>

    <!-- Settings used to fetch (pull) data from source cluster tables -->
    <settings_pull>
        <readonly>1</readonly>
    </settings_pull>

    <!-- Settings used to insert (push) data to destination cluster tables -->
    <settings_push>
        <readonly>0</readonly>
    </settings_push>

    <!-- Common settings for fetch (pull) and insert (push) operations. The copier process context also uses them.
         They are overlaid by <settings_pull/> and <settings_push/> respectively. -->
    <settings>
        <connect_timeout>3</connect_timeout>
        <!-- Sync insert is set forcibly, leave it here just in case. -->
        <distributed_foreground_insert>1</distributed_foreground_insert>
    </settings>

    <!-- Copying tasks description.
         You can specify several table tasks in the same task description (in the same ZooKeeper node); they will be performed
         sequentially.
    -->
    <tables>
        <!-- A table task, copies one table. -->
        <table_hits>
            <!-- Source cluster name (from <remote_servers/> section) and tables in it that should be copied -->
            <cluster_pull>source_cluster</cluster_pull>
            <database_pull>test</database_pull>
            <table_pull>hits</table_pull>

            <!-- Destination cluster name and tables in which the data should be inserted -->
            <cluster_push>destination_cluster</cluster_push>
            <database_push>test</database_push>
            <table_push>hits2</table_push>

            <!-- Engine of destination tables.
                 If destination tables have not been created, workers create them using the column definitions from the source tables
                 and the engine definition from here.

                 NOTE: If the first worker starts inserting data and detects that the destination partition is not empty, the partition
                 will be dropped and refilled. Take this into account if you already have some data in the destination tables. You can
                 directly specify the partitions that should be copied in <enabled_partitions/>; they should be in quoted format, like
                 the partition column of the system.parts table.
            -->
            <engine>
            ENGINE=ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/hits2', '{replica}')
            PARTITION BY toMonday(date)
            ORDER BY (CounterID, EventDate)
            </engine>

            <!-- Sharding key used to insert data to destination cluster -->
            <sharding_key>jumpConsistentHash(intHash64(UserID), 2)</sharding_key>

            <!-- Optional expression that filters data while pulling it from source servers -->
            <where_condition>CounterID != 0</where_condition>

            <!-- This section specifies partitions that should be copied; other partitions will be ignored.
                 Partition names should have the same format as
                 the partition column of the system.parts table (i.e. a quoted text).
                 Since the partition keys of the source and destination clusters could be different,
                 these partition names specify destination partitions.

                 NOTE: Although this section is optional (if it is not specified, all partitions will be copied),
                 it is strongly recommended to specify the partitions explicitly.
                 If you already have some ready partitions on the destination cluster, they
                 will be removed at the start of the copying, since they will be interpreted
                 as unfinished data from the previous copying!
            -->
            <enabled_partitions>
                <partition>'2018-02-26'</partition>
                <partition>'2018-03-05'</partition>
                ...
            </enabled_partitions>
        </table_hits>

        <!-- Next table to copy. It is not copied until the previous table has been copied. -->
        <table_visits>
        ...
        </table_visits>
        ...
    </tables>
</clickhouse>
```

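The names listed in `<enabled_partitions>` refer to destination partitions. With the `PARTITION BY toMonday(date)` key from the example above, each partition is named after the Monday of its week, so a helper along the following lines could derive the quoted name for a given date. This is only a sketch: `partition_for` is a hypothetical helper, it assumes GNU `date` (the `-d` option), and the resulting names should still be checked against the `partition` column of `system.parts`.

```bash
#!/bin/sh
# partition_for DATE: print the <partition> entry for DATE (YYYY-MM-DD),
# assuming the destination table is partitioned by toMonday(date).
# Hypothetical helper; relies on GNU date and works in UTC to avoid DST surprises.
partition_for() {
    day_of_week=$(date -u -d "$1" +%u)     # 1 = Monday ... 7 = Sunday
    days_back=$((day_of_week - 1))         # distance to the preceding Monday
    monday=$(date -u -d "$1 $days_back days ago" +%F)
    printf "<partition>'%s'</partition>\n" "$monday"
}

partition_for 2018-02-28
partition_for 2018-03-05
```

The two calls print the `<partition>` entries used in the example configuration, `'2018-02-26'` and `'2018-03-05'`.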
`clickhouse-copier` tracks changes to `/task/path/description` and applies them on the fly. For example, if you change the value of `max_workers`, the number of processes running tasks changes as well.