[WIP] translate about table-engines (#3731)

* init zh/operations translate

* finish table_engines about Integrations part

* add table_engine index

* finish table_engines -- For Small Data -- log、tinylog、memory

* finish table_engines -- For Small Data -- buffer,external_data
谢磊 2018-12-05 00:12:56 +08:00 committed by Ivan Blinkov
parent 00d9a18e19
commit 5498ed56fd
5 changed files with 58 additions and 71 deletions


@@ -1,56 +1,56 @@
# Buffer
Buffers the data to write in RAM, periodically flushing it to another table. During the read operation, data is read from the buffer and the other table simultaneously.
```
Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
```
Engine parameters:
- database, table – The table to flush data to. Instead of the database name, you can use a constant expression that returns a string.
- num_layers – Parallelism layer. Physically, the table will be represented as 'num_layers' of independent buffers. Recommended value: 16.
- min_time, max_time, min_rows, max_rows, min_bytes, and max_bytes – Conditions for flushing data from the buffer.
Data is flushed from the buffer and written to the destination table if all the 'min' conditions or at least one 'max' condition are met.
- min_time, max_time – Condition for the time in seconds from the moment of the first write to the buffer.
- min_rows, max_rows – Condition for the number of rows in the buffer.
- min_bytes, max_bytes – Condition for the number of bytes in the buffer.
During the write operation, data is inserted into a random one of the 'num_layers' buffers. Or, if the data being inserted is large enough (greater than 'max_rows' or 'max_bytes'), it is written directly to the destination table, omitting the buffer.
The conditions for flushing the data are calculated separately for each of the 'num_layers' buffers. For example, if num_layers = 16 and max_bytes = 100000000, the maximum RAM consumption is 1.6 GB.
Example:
``` sql
CREATE TABLE merge.hits_buffer AS merge.hits ENGINE = Buffer(merge, hits, 16, 10, 100, 10000, 1000000, 10000000, 100000000)
```
Creating a 'merge.hits_buffer' table with the same structure as 'merge.hits' and using the Buffer engine. When writing to this table, data is buffered in RAM and later written to the 'merge.hits' table. 16 buffers are created. The data in each of them is flushed if either 100 seconds have passed, or one million rows have been written, or 100 MB of data have been written; or if simultaneously 10 seconds have passed and 10,000 rows and 10 MB of data have been written. For example, if just one row has been written, after 100 seconds it will be flushed, no matter what. But if many rows have been written, the data will be flushed sooner.
When the server is stopped, or when the table is removed with DROP TABLE or DETACH TABLE, buffer data is also flushed to the destination table.
You can set empty strings in single quotation marks for the database and table name. This indicates the absence of a destination table. In this case, when the data flush conditions are reached, the buffer is simply cleared. This may be useful for keeping a window of data in memory.
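A minimal sketch of that case (the table and column names are hypothetical) could look like this:
``` sql
-- Hypothetical in-memory window: with empty database and table names there is no
-- destination table, so the buffer is simply cleared whenever a flush condition is met.
CREATE TABLE events_window (EventTime DateTime, UserID UInt64)
ENGINE = Buffer('', '', 1, 10, 100, 10000, 1000000, 10000000, 100000000)
```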
When reading from a Buffer table, data is processed both from the buffer and from the destination table (if there is one).
Note that the Buffer table does not support an index. In other words, data in the buffer is fully scanned, which might be slow for large buffers. (For data in a subordinate table, the index that it supports will be used.)
If the set of columns in the Buffer table doesn't match the set of columns in a subordinate table, a subset of columns that exist in both tables is inserted.
If the types don't match for one of the columns in the Buffer table and a subordinate table, an error message is written to the server log and the buffer is cleared.
The same thing happens if the subordinate table doesn't exist when the buffer is flushed.
If you need to run ALTER for a subordinate table and the Buffer table, we recommend first deleting the Buffer table, running ALTER for the subordinate table, then creating the Buffer table again.
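A sketch of that sequence, reusing the merge.hits example above (the added column is only an illustration, not part of the original example):
``` sql
-- Hypothetical ALTER workflow for the merge.hits / merge.hits_buffer pair.
DROP TABLE merge.hits_buffer;
ALTER TABLE merge.hits ADD COLUMN Referer String;
CREATE TABLE merge.hits_buffer AS merge.hits ENGINE = Buffer(merge, hits, 16, 10, 100, 10000, 1000000, 10000000, 100000000)
```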
If the server is restarted abnormally, the data in the buffer is lost.
PREWHERE, FINAL and SAMPLE do not work correctly for Buffer tables. These conditions are passed to the destination table, but are not used for processing data in the buffer. Because of this, we recommend only using the Buffer table for writing, while reading from the destination table.
When adding data to a Buffer, one of the buffers is locked. This causes delays if a read operation is simultaneously being performed from the table.
Data that is inserted into a Buffer table may end up in the subordinate table in a different order and in different blocks. Because of this, a Buffer table is difficult to use for writing to a CollapsingMergeTree correctly. To avoid problems, you can set 'num_layers' to 1.
If the destination table is replicated, some expected characteristics of replicated tables are lost when writing to a Buffer table. The random changes to the order of rows and sizes of data parts cause data deduplication to stop working, which means it is not possible to have a reliable 'exactly once' write to replicated tables.
Due to these disadvantages, we can only recommend using a Buffer table in rare cases.
A Buffer table is used when too many INSERTs are received from a large number of servers over a unit of time, and the data can't be batched before insertion, which means the INSERTs can't run fast enough.
Note that it doesn't make sense to insert data one row at a time, even for Buffer tables. This will only produce a speed of a few thousand rows per second, while inserting larger blocks of data can produce over a million rows per second (see the section "Performance").
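As a rough sketch of such a batched write (the source file events.tsv is hypothetical), a single INSERT carrying a large TabSeparated block looks like this:
```bash
# Hypothetical batch load: one INSERT with many rows is far faster than one INSERT per row.
clickhouse-client --query="INSERT INTO merge.hits_buffer FORMAT TabSeparated" < events.tsv
```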
[Original article](https://clickhouse.yandex/docs/en/operations/table_engines/buffer/) <!--hide-->


@@ -1,36 +1,34 @@
<a name="external-data"></a>
# External Data for Query Processing
ClickHouse allows sending a server the data that is needed for processing a query, together with a SELECT query. This data is put in a temporary table (see the section "Temporary tables") and can be used in the query (for example, in IN operators).
For example, if you have a text file with important user identifiers, you can upload it to the server along with a query that uses filtration by this list.
If you need to run more than one query with a large volume of external data, don't use this feature. It is better to upload the data to the DB ahead of time.
External data can be uploaded using the command-line client (in non-interactive mode), or using the HTTP interface.
In the command-line client, you can specify a parameters section in the format
```bash
--external --file=... [--name=...] [--format=...] [--types=...|--structure=...]
```
You may have multiple sections like this, one for each table being transmitted.
**--external** – Marks the beginning of a clause.
**--file** – Path to the file with the table dump, or -, which refers to stdin.
Only a single table can be retrieved from stdin.
The following parameters are optional: **--name** – Name of the table. If omitted, _data is used.
**--format** – Data format in the file. If omitted, TabSeparated is used.
One of the following parameters is required: **--types** – A list of comma-separated column types. For example: `UInt64,String`. The columns will be named _1, _2, ...
**--structure** – The table structure in the format `UserID UInt64`, `URL String`. Defines the column names and types.
The files specified in 'file' will be parsed by the format specified in 'format', using the data types specified in 'types' or 'structure'. The table will be uploaded to the server and accessible there as a temporary table with the name in 'name'.
Examples:
```bash
echo -ne "1\n2\n3\n" | clickhouse-client --query="SELECT count() FROM test.visits WHERE TraficSourceID IN _data" --external --file=- --types=Int8
@@ -43,9 +41,9 @@ cat /etc/passwd | sed 's/:/\t/g' | clickhouse-client --query="SELECT shell, coun
/bin/sync 1
```
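Several external tables can also be passed in a single invocation by repeating the --external group. The sketch below uses hypothetical files and temporary table names, reusing the test.visits columns from the example above:
```bash
# Hypothetical files; each --external group defines one temporary table on the server.
echo -ne "1\n2\n3\n" > user_ids.tsv
echo -ne "7\n42\n" > source_ids.tsv
clickhouse-client --query="SELECT count() FROM test.visits WHERE UserID IN _users AND TraficSourceID IN _sources" \
    --external --file=user_ids.tsv --name=_users --types=UInt64 \
    --external --file=source_ids.tsv --name=_sources --types=Int8
```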
When using the HTTP interface, external data is passed in the multipart/form-data format. Each table is transmitted as a separate file. The table name is taken from the file name. The 'query_string' is passed the parameters 'name_format', 'name_types', and 'name_structure', where 'name' is the name of the table that these parameters correspond to. The meaning of the parameters is the same as when using the command-line client.
Example:
```bash
cat /etc/passwd | sed 's/:/\t/g' > passwd.tsv
@@ -58,7 +56,7 @@ curl -F 'passwd=@passwd.tsv;' 'http://localhost:8123/?query=SELECT+shell,+count(
/bin/sync 1
```
For distributed query processing, the temporary tables are sent to all the remote servers.
[Original article](https://clickhouse.yandex/docs/en/operations/table_engines/external_data/) <!--hide-->


@@ -1,8 +1,7 @@
# Log
Log differs from TinyLog in that a small file of "marks" resides with the column files. These marks are written on every data block and contain offsets that indicate where to start reading the file in order to skip the specified number of rows. This makes it possible to read table data in multiple threads.
For concurrent data access, the read operations can be performed simultaneously, while write operations block reads and each other.
The Log engine does not support indexes. Similarly, if writing to a table failed, the table is broken, and reading from it returns an error. The Log engine is appropriate for temporary data, write-once tables, and for testing or demonstration purposes.
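For illustration only (the table name and data are hypothetical), creating and using a Log table looks like this:
``` sql
-- Hypothetical table: each column gets its own file plus a small file of marks,
-- so reads can be parallelized across threads.
CREATE TABLE events_log (EventDate Date, Message String) ENGINE = Log;
INSERT INTO events_log VALUES ('2018-12-05', 'first event');
SELECT count() FROM events_log;
```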
[Original article](https://clickhouse.yandex/docs/en/operations/table_engines/log/) <!--hide-->


@@ -1,13 +1,9 @@
# Memory
The Memory engine stores data in RAM, in uncompressed form. Data is stored in exactly the same form as it is received when read. In other words, reading from this table is completely free.
Concurrent data access is synchronized. Locks are short: read and write operations don't block each other.
Indexes are not supported. Reading is parallelized.
Maximal productivity (over 10 GB/sec) is reached on simple queries, because there is no reading from the disk, decompressing, or deserializing data. (We should note that in many cases, the productivity of the MergeTree engine is almost as high.)
When restarting a server, data disappears from the table and the table becomes empty.
Normally, using this table engine is not justified. However, it can be used for tests, and for tasks where maximum speed is required on a relatively small number of rows (up to approximately 100,000,000).
The Memory engine is used by the system for temporary tables with external query data (see the section "External data for processing a query"), and for implementing GLOBAL IN (see the section "IN operators").
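A minimal sketch (the table name and data are hypothetical) of creating and querying a Memory table:
``` sql
-- Hypothetical table: data lives uncompressed in RAM and is lost when the server restarts.
CREATE TABLE lookup_cache (id UInt64, value String) ENGINE = Memory;
INSERT INTO lookup_cache VALUES (1, 'one'), (2, 'two');
SELECT value FROM lookup_cache WHERE id = 2;
```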
[Original article](https://clickhouse.yandex/docs/en/operations/table_engines/memory/) <!--hide-->


@@ -1,21 +1,15 @@
# TinyLog
The simplest table engine, which stores data on a disk.
Each column is stored in a separate compressed file.
When writing, data is appended to the end of files.
Concurrent data access is not restricted in any way:
- If you are simultaneously reading from a table and writing to it in a different query, the read operation will complete with an error.
- If you are writing to a table in multiple queries simultaneously, the data will be broken.
The typical way to use this table is write-once: first just write the data one time, then read it as many times as needed.
Queries are executed in a single stream. In other words, this engine is intended for relatively small tables (recommended up to 1,000,000 rows).
It makes sense to use this table engine if you have many small tables, since it is simpler than the Log engine (fewer files need to be opened).
Having a large number of small tables guarantees poor performance, but if you already ended up with many small tables while working with another DBMS, you may find it easier to switch to using TinyLog tables.
**Indexes are not supported.**
In Yandex.Metrica, TinyLog tables are used for intermediary data that is processed in small batches.
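As an illustration only (the table name and data are hypothetical), a TinyLog table is created and used like this:
``` sql
-- Hypothetical write-once table: every column is one compressed file, no marks, no index.
CREATE TABLE batch_chunk (id UInt32, payload String) ENGINE = TinyLog;
INSERT INTO batch_chunk VALUES (1, 'a'), (2, 'b');
SELECT * FROM batch_chunk;
```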
[Original article](https://clickhouse.yandex/docs/en/operations/table_engines/tinylog/) <!--hide-->