translate docs/zh/operations/table_engines/distributed.md (#5004)

* translate docs/zh/operations/table_engines/distributed.md

* fix indent

* Update docs/zh/operations/table_engines/distributed.md

Co-Authored-By: neverlee <neverlea@foxmail.com>

* Update docs/zh/operations/table_engines/distributed.md

Co-Authored-By: neverlee <neverlea@foxmail.com>

* Update docs/zh/operations/table_engines/distributed.md

Co-Authored-By: neverlee <neverlea@foxmail.com>

* fix error for docs/zh/operations/table_engines/distributed.md

* optimize docs/zh/operations/table_engines/distributed.md
This commit is contained in:
never lee 2019-04-16 17:51:45 +08:00 committed by Ivan Blinkov
parent c5d6fec2c2
commit 504df89c79
2 changed files with 52 additions and 52 deletions

View File

@ -63,8 +63,8 @@ Cluster names must not contain dots.
The parameters `host`, `port`, and optionally `user`, `password`, `secure`, `compression` are specified for each server:
: - `host` The address of the remote server. You can use either the domain or the IPv4 or IPv6 address. If you specify the domain, the server makes a DNS request when it starts, and the result is stored as long as the server is running. If the DNS request fails, the server doesn't start. If you change the DNS record, restart the server.
- `port` The TCP port for messenger activity ('tcp_port' in the config, usually set to 9000). Do not confuse it with http_port.
- `user` Name of the user for connecting to a remote server. Default value: default. This user must have access to connect to the specified server. Access is configured in the users.xml file. For more information, see the section "Access rights".
- `port` The TCP port for messenger activity ('tcp_port' in the config, usually set to 9000). Do not confuse it with http_port.
- `user` Name of the user for connecting to a remote server. Default value: default. This user must have access to connect to the specified server. Access is configured in the users.xml file. For more information, see the section "Access rights".
- `password` The password for connecting to a remote server (not masked). Default value: empty string.
- `secure` - Use ssl for connection, usually you also should define `port` = 9440. Server should listen on <tcp_port_secure>9440</tcp_port_secure> and have correct certificates.
- `compression` - Use data compression. Default value: true.

View File

@ -1,24 +1,24 @@
# Distributed
**The Distributed engine does not store data itself**, but allows distributed query processing on multiple servers.
Reading is automatically parallelized. During a read, the table indexes on remote servers are used, if there are any.
The Distributed engine accepts parameters: the cluster name in the server's config file, the name of a remote database, the name of a remote table, and (optionally) a sharding key.
Example:
**分布式引擎本身不存储数据**, 但可以在多个服务器上进行分布式查询。
读是自动并行的。读取时,远程服务器表的索引(如果有的话)会被使用。
分布式引擎参数:服务器配置文件中的集群名,远程数据库名,远程表名,数据分片键(可选)。
示例:
```
Distributed(logs, default, hits[, sharding_key])
```
Data will be read from all servers in the 'logs' cluster, from the default.hits table located on every server in the cluster.
Data is not only read, but is partially processed on the remote servers (to the extent that this is possible).
For example, for a query with GROUP BY, data will be aggregated on remote servers, and the intermediate states of aggregate functions will be sent to the requestor server. Then data will be further aggregated.
将会从位于“logs”集群中 default.hits 表所有服务器上读取数据。
远程服务器不仅用于读取数据,还会对尽可能数据做部分处理。
例如,对于使用 GROUP BY 的查询,数据首先在远程服务器聚合,之后返回聚合函数的中间状态给查询请求的服务器。再在请求的服务器上进一步汇总数据。
Instead of the database name, you can use a constant expression that returns a string. For example: currentDatabase().
数据库名参数除了用数据库名之外也可用返回字符串的常量表达式。例如currentDatabase()。
logs The cluster name in the server's config file.
logs 服务器配置文件中的集群名称。
Clusters are set like this:
集群示例配置如下:
```xml
<remote_servers>
@ -54,72 +54,72 @@ Clusters are set like this:
</remote_servers>
```
Here a cluster is defined with the name 'logs' that consists of two shards, each of which contains two replicas.
Shards refer to the servers that contain different parts of the data (in order to read all the data, you must access all the shards).
Replicas are duplicating servers (in order to read all the data, you can access the data on any one of the replicas).
这里定义了一个名为logs的集群它由两个分片组成每个分片包含两个副本。
分片是指包含数据不同部分的服务器(要读取所有数据,必须访问所有分片)。
副本是存储复制数据的服务器(要读取所有数据,访问任一副本上的数据即可)。
Cluster names must not contain dots.
集群名称不能包含点号。
The parameters `host`, `port`, and optionally `user`, `password`, `secure`, `compression` are specified for each server:
每个服务器需要指定 `host``port`,和可选的 `user``password``secure``compression` 的参数:
: - `host` The address of the remote server. You can use either the domain or the IPv4 or IPv6 address. If you specify the domain, the server makes a DNS request when it starts, and the result is stored as long as the server is running. If the DNS request fails, the server doesn't start. If you change the DNS record, restart the server.
- `port` The TCP port for messenger activity ('tcp_port' in the config, usually set to 9000). Do not confuse it with http_port.
- `user` Name of the user for connecting to a remote server. Default value: default. This user must have access to connect to the specified server. Access is configured in the users.xml file. For more information, see the section "Access rights".
- `password` The password for connecting to a remote server (not masked). Default value: empty string.
- `secure` - Use ssl for connection, usually you also should define `port` = 9440. Server should listen on <tcp_port_secure>9440</tcp_port_secure> and have correct certificates.
- `compression` - Use data compression. Default value: true.
: - `host` 远程服务器地址。可以域名、IPv4或IPv6。如果指定域名则服务在启动时发起一个 DNS 请求,并且请求结果会在服务器运行期间一直被记录。如果 DNS 请求失败,则服务不会启动。如果你修改了 DNS 记录,则需要重启服务。
- `port` 消息传递的 TCP 端口「tcp_port」配置通常设为 9000。不要跟 http_port 混淆。
- `user` 用于连接远程服务器的用户名。默认值default。该用户必须有权限访问该远程服务器。访问权限配置在 users.xml 文件中。更多信息,请查看“访问权限”部分。
- `password` 用于连接远程服务器的密码。默认值:空字符串。
- `secure` 是否使用ssl进行连接设为true时通常也应该设置 `port` = 9440。服务器也要监听 <tcp_port_secure>9440</tcp_port_secure> 并有正确的证书。
- `compression` - 是否使用数据压缩。默认值true。
When specifying replicas, one of the available replicas will be selected for each of the shards when reading. You can configure the algorithm for load balancing (the preference for which replica to access) see the 'load_balancing' setting.
If the connection with the server is not established, there will be an attempt to connect with a short timeout. If the connection failed, the next replica will be selected, and so on for all the replicas. If the connection attempt failed for all the replicas, the attempt will be repeated the same way, several times.
This works in favor of resiliency, but does not provide complete fault tolerance: a remote server might accept the connection, but might not work, or work poorly.
配置了副本,读取操作会从每个分片里选择一个可用的副本。可配置负载平衡算法(挑选副本的方式) - 请参阅“load_balancing”设置。
如果跟服务器的连接不可用,则在尝试短超时的重连。如果重连失败,则选择下一个副本,依此类推。如果跟所有副本的连接尝试都失败,则尝试用相同的方式再重复几次。
该机制有利于系统可用性,但不保证完全容错:如有远程服务器能够接受连接,但无法正常工作或状况不佳。
You can specify just one of the shards (in this case, query processing should be called remote, rather than distributed) or up to any number of shards. In each shard, you can specify from one to any number of replicas. You can specify a different number of replicas for each shard.
你可以配置一个(这种情况下,查询操作更应该称为远程查询,而不是分布式查询)或任意多个分片。在每个分片中,可以配置一个或任意多个副本。不同分片可配置不同数量的副本。
You can specify as many clusters as you wish in the configuration.
可以在配置中配置任意数量的集群。
To view your clusters, use the 'system.clusters' table.
要查看集群可使用“system.clusters”表。
The Distributed engine allows working with a cluster like a local server. However, the cluster is inextensible: you must write its configuration in the server config file (even better, for all the cluster's servers).
通过分布式引擎可以像使用本地服务器一样使用集群。但是,集群不是自动扩展的:你必须编写集群配置到服务器配置文件中(最好,给所有集群的服务器写上完整配置)。
There is no support for Distributed tables that look at other Distributed tables (except in cases when a Distributed table only has one shard). As an alternative, make the Distributed table look at the "final" tables.
不支持用分布式表查询别的分布式表(除非该表只有一个分片)。或者说,要用分布表查查询“最终”的数据表。
The Distributed engine requires writing clusters to the config file. Clusters from the config file are updated on the fly, without restarting the server. If you need to send a query to an unknown set of shards and replicas each time, you don't need to create a Distributed table use the 'remote' table function instead. See the section "Table functions".
分布式引擎需要将集群信息写入配置文件。配置文件中的集群信息会即时更新,无需重启服务器。如果你每次是要向不确定的一组分片和副本发送查询,则不适合创建分布式表 - 而应该使用“远程”表函数。 请参阅“表函数”部分。
There are two methods for writing data to a cluster:
向集群写数据的方法有两种:
First, you can define which servers to write which data to, and perform the write directly on each shard. In other words, perform INSERT in the tables that the distributed table "looks at".
This is the most flexible solution you can use any sharding scheme, which could be non-trivial due to the requirements of the subject area.
This is also the most optimal solution, since data can be written to different shards completely independently.
一,自已指定要将哪些数据写入哪些服务器,并直接在每个分片上执行写入。换句话说,在分布式表上“查询”,在数据表上 INSERT。
这是最灵活的解决方案 你可以使用任何分片方案,对于复杂业务特性的需求,这可能是非常重要的。
这也是最佳解决方案,因为数据可以完全独立地写入不同的分片。
Second, you can perform INSERT in a Distributed table. In this case, the table will distribute the inserted data across servers itself.
In order to write to a Distributed table, it must have a sharding key set (the last parameter). In addition, if there is only one shard, the write operation works without specifying the sharding key, since it doesn't have any meaning in this case.
二,在分布式表上执行 INSERT。在这种情况下分布式表会跨服务器分发插入数据。
为了写入分布式表,必须要配置分片键(最后一个参数)。当然,如果只有一个分片,则写操作在没有分片键的情况下也能工作,因为这种情况下分片键没有意义。
Each shard can have a weight defined in the config file. By default, the weight is equal to one. Data is distributed across shards in the amount proportional to the shard weight. For example, if there are two shards and the first has a weight of 9 while the second has a weight of 10, the first will be sent 9 / 19 parts of the rows, and the second will be sent 10 / 19.
每个分片都可以在配置文件中定义权重。默认情况下权重等于1。数据依据分片权重按比例分发到分片上。例如如果有两个分片第一个分片的权重是9而第二个分片的权重是10则发送 9 / 19 的行到第一个分片, 10 / 19 的行到第二个分片。
Each shard can have the 'internal_replication' parameter defined in the config file.
分片可在配置文件中定义 'internal_replication' 参数。
If this parameter is set to 'true', the write operation selects the first healthy replica and writes data to it. Use this alternative if the Distributed table "looks at" replicated tables. In other words, if the table where data will be written is going to replicate them itself.
此参数设置为“true”时写操作只选一个正常的副本写入数据。如果分布式表的子表是复制表(*ReplicaMergeTree),请使用此方案。换句话说,这其实是把数据的复制工作交给实际需要写入数据的表本身而不是分布式表。
If it is set to 'false' (the default), data is written to all replicas. In essence, this means that the Distributed table replicates data itself. This is worse than using replicated tables, because the consistency of replicas is not checked, and over time they will contain slightly different data.
若此参数设置为“false”默认值写操作会将数据写入所有副本。实质上这意味着要分布式表本身来复制数据。这种方式不如使用复制表的好因为不会检查副本的一致性并且随着时间的推移副本数据可能会有些不一样。
To select the shard that a row of data is sent to, the sharding expression is analyzed, and its remainder is taken from dividing it by the total weight of the shards. The row is sent to the shard that corresponds to the half-interval of the remainders from 'prev_weight' to 'prev_weights + weight', where 'prev_weights' is the total weight of the shards with the smallest number, and 'weight' is the weight of this shard. For example, if there are two shards, and the first has a weight of 9 while the second has a weight of 10, the row will be sent to the first shard for the remainders from the range \[0, 9), and to the second for the remainders from the range \[9, 19).
选择将一行数据发送到哪个分片的方法是,首先计算分片表达式,然后将这个计算结果除以所有分片的权重总和得到余数。该行会发送到那个包含该余数的从'prev_weight'到'prev_weights + weight'的半闭半开区间对应的分片上,其中 'prev_weights' 是该分片前面的所有分片的权重和,'weight' 是该分片的权重。例如如果有两个分片第一个分片权重为9而第二个分片权重为10则余数在 \[0,9) 中的行发给第一个分片,余数在 \[9,19) 中的行发给第二个分片。
The sharding expression can be any expression from constants and table columns that returns an integer. For example, you can use the expression 'rand()' for random distribution of data, or 'UserID' for distribution by the remainder from dividing the user's ID (then the data of a single user will reside on a single shard, which simplifies running IN and JOIN by users). If one of the columns is not distributed evenly enough, you can wrap it in a hash function: intHash64(UserID).
分片表达式可以是由常量和表列组成的任何返回整数表达式。例如,您可以使用表达式 'rand()' 来随机分配数据,或者使用 'UserID' 来按用户 ID 的余数分布(相同用户的数据将分配到单个分片上,这可降低带有用户信息的 IN 和 JOIN 的语句运行的复杂度。如果该列数据分布不够均匀可以将其包装在散列函数中intHash64(UserID)。
A simple remainder from division is a limited solution for sharding and isn't always appropriate. It works for medium and large volumes of data (dozens of servers), but not for very large volumes of data (hundreds of servers or more). In the latter case, use the sharding scheme required by the subject area, rather than using entries in Distributed tables.
这种简单的用余数来选择分片的方案是有局限的,并不总适用。它适用于中型和大型数据(数十台服务器)的场景,但不适用于巨量数据(数百台或更多服务器)的场景。后一种情况下,应根据业务特性需求考虑的分片方案,而不是直接用分布式表的多分片。
SELECT queries are sent to all the shards, and work regardless of how data is distributed across the shards (they can be distributed completely randomly). When you add a new shard, you don't have to transfer the old data to it. You can write new data with a heavier weight the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
SELECT 查询会被发送到所有分片,并且无论数据在分片中如何分布(即使数据完全随机分布)都可正常工作。添加新分片时,不必将旧数据传输到该分片。你可以给新分片分配大权重然后写新数据 - 数据可能会稍分布不均,但查询会正确高效地运行。
You should be concerned about the sharding scheme in the following cases:
下面的情况,你需要关注分片方案:
- Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
- A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners). In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard. Alternatively, as we've done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into "layers", where a layer may consist of multiple shards. Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them. Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
- 使用需要特定键连接数据( IN 或 JOIN )的查询。如果数据是用该键进行分片,则应使用本地 IN 或 JOIN 而不是 GLOBAL IN 或 GLOBAL JOIN这样效率更高。
- 使用大量服务器(上百或更多),但有大量小查询(个别客户的查询 - 网站,广告商或合作伙伴)。为了使小查询不影响整个集群,让单个客户的数据处于单个分片上是有意义的。或者,正如我们在 Yandex.Metrica 中所做的那样,你可以配置两级分片:将整个集群划分为“层”,一个层可以包含多个分片。单个客户的数据位于单个层上,根据需要将分片添加到层中,层中的数据随机分布。然后给每层创建分布式表,再创建一个全局的分布式表用于全局的查询。
Data is written asynchronously. For an INSERT to a Distributed table, the data block is just written to the local file system. The data is sent to the remote servers in the background as soon as possible. You should check whether data is sent successfully by checking the list of files (data waiting to be sent) in the table directory: /var/lib/clickhouse/data/database/table/.
数据是异步写入的。对于分布式表的 INSERT数据块只写本地文件系统。之后会尽快地在后台发送到远程服务器。你可以通过查看表目录中的文件列表等待发送的数据来检查数据是否成功发送/var/lib/clickhouse/data/database/table/ 。
If the server ceased to exist or had a rough restart (for example, after a device failure) after an INSERT to a Distributed table, the inserted data might be lost. If a damaged data part is detected in the table directory, it is transferred to the 'broken' subdirectory and no longer used.
如果在 INSERT 到分布式表时服务器节点丢失或重启设备故障则插入的数据可能会丢失。如果在表目录中检测到损坏的数据分片则会将其转移到“broken”子目录并不再使用。
When the max_parallel_replicas option is enabled, query processing is parallelized across all replicas within a single shard. For more information, see the section "Settings, max_parallel_replicas".
启用 max_parallel_replicas 选项后会在分表的所有副本上并行查询处理。更多信息请参阅“设置max_parallel_replicas”部分。
[Original article](https://clickhouse.yandex/docs/en/operations/table_engines/distributed/) <!--hide-->