ClickHouse/docs/zh/interfaces/formats.md
Ivan Blinkov 790704d081
Restore some old manual anchors in docs (#9803)
* Simplify 404 page

* add es array_functions.md

* restore some old manual anchors

* update sitemaps

* trigger checks

* restore more old manual anchors

* refactor test.md + temporary disable failure again

* fix mistype
2020-03-22 12:14:59 +03:00

52 KiB
Raw Blame History

输入输出格式

ClickHouse 可以接受多种数据格式,可以在 (INSERT) 以及 (SELECT) 请求中使用。

下列表格列出了支持的数据格式以及在 (INSERT) 以及 (SELECT) 请求中使用它们的方式。

格式 INSERT SELECT
TabSeparated
TabSeparatedRaw
TabSeparatedWithNames
TabSeparatedWithNamesAndTypes
Template
TemplateIgnoreSpaces
CSV
CSVWithNames
CustomSeparated
Values
Vertical
VerticalRaw
JSON
JSONCompact
JSONEachRow
TSKV
Pretty
PrettyCompact
PrettyCompactMonoBlock
PrettyNoEscapes
PrettySpace
Protobuf
Avro
AvroConfluent
Parquet
ORC
RowBinary
RowBinaryWithNamesAndTypes
Native
Null
XML
CapnProto

TabSeparated

在 TabSeparated 格式中,数据按行写入。每行包含由制表符分隔的值。除了行中的最后一个值(后面紧跟换行符)之外,每个值都跟随一个制表符。 在任何地方都可以使用严格的 Unix 命令行。最后一行还必须在最后包含换行符。值以文本格式编写,不包含引号,并且要转义特殊字符。

这种格式也可以用 TSV 来表示。

TabSeparated 格式非常方便用于自定义程序或脚本处理数据。HTTP 客户端接口默认会用这种格式,命令行客户端批量模式下也会用这种格式。这种格式允许在不同数据库之间传输数据。例如,从 MYSQL 中导出数据然后导入到 ClickHouse 中,反之亦然。

TabSeparated 格式支持输出数据总值(当使用 WITH TOTALS 以及极值(当 extremes 设置是1。这种情况下总值和极值输出在主数据的后面。主要的数据,总值,极值会以一个空行隔开,例如:

SELECT EventDate, count() AS c FROM test.hits GROUP BY EventDate WITH TOTALS ORDER BY EventDate FORMAT TabSeparated``
2014-03-17      1406958
2014-03-18      1383658
2014-03-19      1405797
2014-03-20      1353623
2014-03-21      1245779
2014-03-22      1031592
2014-03-23      1046491

0000-00-00      8873898

2014-03-17      1031592
2014-03-23      1406958

数据解析方式

整数以十进制形式写入。数字在开头可以包含额外的 + 字符(解析时忽略,格式化时不记录)。非负数不能包含负号。 读取时,允许将空字符串解析为零,或者(对于带符号的类型)将仅包含负号的字符串解析为零。 不符合相应数据类型的数字可能会被解析为不同的数字,而不会显示错误消息。

浮点数以十进制形式写入。点号用作小数点分隔符。支持指数等符号inf+ inf-infnan。 浮点数的输入可以以小数点开始或结束。 格式化的时候,浮点数的精确度可能会丢失。 解析的时候,没有严格需要去读取与机器可以表示的最接近的数值。

日期会以 YYYY-MM-DD 格式写入和解析,但会以任何字符作为分隔符。 带时间的日期会以 YYYY-MM-DD hh:mm:ss 格式写入和解析,但会以任何字符作为分隔符。 这一切都发生在客户端或服务器启动时的系统时区(取决于哪一种格式的数据)。对于具有时间的日期,夏时制时间未指定。 因此,如果转储在夏令时中有时间,则转储不会明确地匹配数据,解析将选择两者之一。 在读取操作期间,不正确的日期和具有时间的日期可以使用自然溢出或空日期和时间进行分析,而不会出现错误消息。

有个例外情况Unix 时间戳格式10个十进制数字也支持使用时间解析日期。结果不是时区相关的。格式 YYYY-MM-DD hh:mm:ss和 NNNNNNNNNN 会自动区分。

字符串以反斜线转义的特殊字符输出。 以下转义序列用于输出:\b\f\r\n\t\0\'\\。 解析还支持\a\v\xHH(十六进制转义字符)和任何\c字符,其中c是任何字符(这些序列被转换为c)。 因此,读取数据支持可以将换行符写为\n\的格式,或者换行。例如,字符串 Hello world 在单词之间换行而不是空格可以解析为以下任何形式:

Hello\nworld

Hello\
world

第二种形式是支持的,因为 MySQL 读取 tab-separated 格式数据集的时候也会使用它。

在 TabSeparated 格式中传递数据时需要转义的最小字符集为Tab换行符LF和反斜杠。

只有一小组符号会被转义。你可以轻易地找到一个字符串值,但这不会正常在你的终端显示。

数组写在方括号内的逗号分隔值列表中。 通常情况下,数组中的数字项目会被拼凑,但日期,带时间的日期以及字符串将使用与上面相同的转义规则用单引号引起来。

NULL 将输出为 \N

TabSeparatedRaw

TabSeparated 格式不一样的是,行数据是不会被转义的。 该格式仅适用于输出查询结果,但不适用于解析输入(将数据插入到表中)。

这种格式也可以使用名称 TSVRaw 来表示。

TabSeparatedWithNames

TabSeparated 格式不一样的是,第一行会显示列的名称。 在解析过程中,第一行完全被忽略。您不能使用列名来确定其位置或检查其正确性。 (未来可能会加入解析头行的功能)

这种格式也可以使用名称 TSVWithNames 来表示。

TabSeparatedWithNamesAndTypes

TabSeparated 格式不一样的是,第一行会显示列的名称,第二行会显示列的类型。 在解析过程中,第一行和第二行完全被忽略。

这种格式也可以使用名称 TSVWithNamesAndTypes 来表示。

Template

This format allows to specify a custom format string with placeholders for values with specified escaping rule.

It uses settings format_schema, format_schema_rows, format_schema_rows_between_delimiter and some settings of other formats (e.g. output_format_json_quote_64bit_integers when using JSON escaping, see further)

Format string format_schema_rows specifies rows format with the following syntax:

delimiter_1${column_1:serializeAs_1}delimiter_2${column_2:serializeAs_2} ... delimiter_N,

where `delimiter_i` is a delimiter between values (`$` symbol can be escaped as `$$`),
`column_i` is a name of a column whose values are to be selected or inserted (if empty, then column will be skipped),
`serializeAs_i` is an escaping rule for the column values. The following escaping rules are supported:

- `CSV`, `JSON`, `XML` (similarly to the formats of the same names)
- `Escaped` (similarly to `TSV`)
- `Quoted` (similarly to `Values`)
- `Raw` (without escaping, similarly to `TSVRaw`)
- `None` (no escaping rule, see further)

If escaping rule is omitted, then`None` will be used. `XML` and `Raw` are suitable only for output.

So, for the following format string:

  `Search phrase: ${SearchPhrase:Quoted}, count: ${c:Escaped}, ad price: $$${price:JSON};`

the values of `SearchPhrase`, `c` and `price` columns, which are escaped as `Quoted`, `Escaped` and `JSON` will be printed (for select) or will be expected (for insert) between `Search phrase: `, `, count: `, `, ad price: $` and `;` delimiters respectively. For example:

`Search phrase: 'bathroom interior design', count: 2166, ad price: $3;`

The format_schema_rows_between_delimiter setting specifies delimiter between rows, which is printed (or expected) after every row except the last one (\n by default)

Format string format_schema has the same syntax as format_schema_rows and allows to specify a prefix, a suffix and a way to print some additional information. It contains the following placeholders instead of column names:

  • data is the rows with data in format_schema_rows format, separated by format_schema_rows_between_delimiter. This placeholder must be the first placeholder in the format string.
  • totals is the row with total values in format_schema_rows format (when using WITH TOTALS)
  • min is the row with minimum values in format_schema_rows format (when extremes is set to 1)
  • max is the row with maximum values in format_schema_rows format (when extremes is set to 1)
  • rows is the total number of output rows
  • rows_before_limit is the minimal number of rows there would have been without LIMIT. Output only if the query contains LIMIT. If the query contains GROUP BY, rows_before_limit_at_least is the exact number of rows there would have been without a LIMIT.
  • time is the request execution time in seconds
  • rows_read is the number of rows have been read
  • bytes_read is the number of bytes (uncompressed) have been read

The placeholders data, totals, min and max must not have escaping rule specified (or None must be specified explicitly). The remaining placeholders may have any escaping rule specified. If the format_schema setting is an empty string, ${data} is used as default value. For insert queries format allows to skip some columns or some fields if prefix or suffix (see example).

Select example:

SELECT SearchPhrase, count() AS c FROM test.hits GROUP BY SearchPhrase ORDER BY c DESC LIMIT 5
FORMAT Template
SETTINGS format_schema = '<!DOCTYPE HTML>
<html> <head> <title>Search phrases</title> </head>
 <body>
  <table border="1"> <caption>Search phrases</caption>
    <tr> <th>Search phrase</th> <th>Count</th> </tr>
    ${data}
  </table>
  <table border="1"> <caption>Max</caption>
    ${max}
  </table>
  <b>Processed ${rows_read:XML} rows in ${time:XML} sec</b>
 </body>
</html>',
format_schema_rows = '<tr> <td>${SearchPhrase:XML}</td> <td>${с:XML}</td> </tr>',
format_schema_rows_between_delimiter = '\n    '
<!DOCTYPE HTML>
<html> <head> <title>Search phrases</title> </head>
 <body>
  <table border="1"> <caption>Search phrases</caption>
    <tr> <th>Search phrase</th> <th>Count</th> </tr>
    <tr> <td></td> <td>8267016</td> </tr>
    <tr> <td>bathroom interior design</td> <td>2166</td> </tr>
    <tr> <td>yandex</td> <td>1655</td> </tr>
    <tr> <td>spring 2014 fashion</td> <td>1549</td> </tr>
    <tr> <td>freeform photos</td> <td>1480</td> </tr>
  </table>
  <table border="1"> <caption>Max</caption>
    <tr> <td></td> <td>8873898</td> </tr>
  </table>
  <b>Processed 3095973 rows in 0.1569913 sec</b>
 </body>
</html>

Insert example:

Some header
Page views: 5, User id: 4324182021466249494, Useless field: hello, Duration: 146, Sign: -1
Page views: 6, User id: 4324182021466249494, Useless field: world, Duration: 185, Sign: 1
Total rows: 2
INSERT INTO UserActivity FORMAT Template SETTINGS
format_schema = 'Some header\n${data}\nTotal rows: ${:CSV}\n',
format_schema_rows = 'Page views: ${PageViews:CSV}, User id: ${UserID:CSV}, Useless field: ${:CSV}, Duration: ${Duration:CSV}, Sign: ${Sign:CSV}'

PageViews, UserID, Duration and Sign inside placeholders are names of columns in the table. Values after Useless field in rows and after \nTotal rows: in suffix will be ignored. All delimiters in the input data must be strictly equal to delimiters in specified format strings.

TemplateIgnoreSpaces

This format is suitable only for input. Similar to Template, but skips whitespace characters between delimiters and values in the input stream. However, if format strings contain whitespace characters, these characters will be expected in the input stream. Also allows to specify empty placeholders (${} or ${:None}) to split some delimiter into separate parts to ignore spaces between them. Such placeholders are used only for skipping whitespace characters. Its possible to read JSON using this format, if values of columns have the same order in all rows. For example, the following request can be used for inserting data from output example of format JSON:

INSERT INTO table_name FORMAT TemplateIgnoreSpaces SETTINGS
format_schema = '{${}"meta"${}:${:JSON},${}"data"${}:${}[${data}]${},${}"totals"${}:${:JSON},${}"extremes"${}:${:JSON},${}"rows"${}:${:JSON},${}"rows_before_limit_at_least"${}:${:JSON}${}}',
format_schema_rows = '{${}"SearchPhrase"${}:${}${phrase:JSON}${},${}"c"${}:${}${cnt:JSON}${}}',
format_schema_rows_between_delimiter = ','

TSKV

TabSeparated 格式类似,但它输出的是 name=value 的格式。名称会和 TabSeparated 格式一样被转义,= 字符也会被转义。

SearchPhrase=   count()=8267016
SearchPhrase=bathroom interior design    count()=2166
SearchPhrase=yandex     count()=1655
SearchPhrase=2014 spring fashion    count()=1549
SearchPhrase=freeform photos       count()=1480
SearchPhrase=angelina jolie    count()=1245
SearchPhrase=omsk       count()=1112
SearchPhrase=photos of dog breeds    count()=1091
SearchPhrase=curtain designs        count()=1064
SearchPhrase=baku       count()=1000

NULL 输出为 \N

SELECT * FROM t_null FORMAT TSKV
x=1 y=\N

当有大量的小列时,这种格式是低效的,通常没有理由使用它。它被用于 Yandex 公司的一些部门。

数据的输出和解析都支持这种格式。对于解析,任何顺序都支持不同列的值。可以省略某些值,用 - 表示, 它们被视为等于它们的默认值。在这种情况下,零和空行被用作默认值。作为默认值,不支持表中指定的复杂值。

对于不带等号或值,可以用附加字段 tskv 来表示,这种在解析上是被允许的。这样的话该字段被忽略。

CSV

按逗号分隔的数据格式(RFC)。

格式化的时候,行是用双引号括起来的。字符串中的双引号会以两个双引号输出,除此之外没有其他规则来做字符转义了。日期和时间也会以双引号包括。数字的输出不带引号。值由一个单独的字符隔开,这个字符默认是 ,。行使用 Unix 换行符LF分隔。 数组序列化成 CSV 规则如下:首先将数组序列化为 TabSeparated 格式的字符串,然后将结果字符串用双引号包括输出到 CSV。CSV 格式的元组被序列化为单独的列(即它们在元组中的嵌套关系会丢失)。

clickhouse-client --format_csv_delimiter="|" --query="INSERT INTO test.csv FORMAT CSV" < data.csv

*默认情况下间隔符是 , ,在 format_csv_delimiter 中可以了解更多间隔符配置。

解析的时候,可以使用或不使用引号来解析所有值。支持双引号和单引号。行也可以不用引号排列。 在这种情况下它们被解析为逗号或换行符CR 或 LF。在解析不带引号的行时若违反 RFC 规则,会忽略前导和尾随的空格和制表符。 对于换行,全部支持 UnixLFWindowsCR LF和 Mac OS ClassicCR LF

NULL 将输出为 \N

CSV 格式是和 TabSeparated 一样的方式输出总数和极值。

CSVWithNames

会输出带头部行,和 TabSeparatedWithNames 一样。

CustomSeparated

Similar to Template, but it prints or reads all columns and uses escaping rule from setting format_custom_escaping_rule and delimiters from settings format_custom_field_delimiter, format_custom_row_before_delimiter, format_custom_row_after_delimiter, format_custom_row_between_delimiter, format_custom_result_before_delimiter and format_custom_result_after_delimiter, not from format strings. There is also CustomSeparatedIgnoreSpaces format, which is similar to TemplateIgnoreSpaces.

JSON

以 JSON 格式输出数据。除了数据表之外,它还输出列名称和类型以及一些附加信息:输出行的总数以及在没有 LIMIT 时可以输出的行数。 例:

SELECT SearchPhrase, count() AS c FROM test.hits GROUP BY SearchPhrase WITH TOTALS ORDER BY c DESC LIMIT 5 FORMAT JSON
{
        "meta":
        [
                {
                        "name": "SearchPhrase",
                        "type": "String"
                },
                {
                        "name": "c",
                        "type": "UInt64"
                }
        ],

        "data":
        [
                {
                        "SearchPhrase": "",
                        "c": "8267016"
                },
                {
                        "SearchPhrase": "bathroom interior design",
                        "c": "2166"
                },
                {
                        "SearchPhrase": "yandex",
                        "c": "1655"
                },
                {
                        "SearchPhrase": "spring 2014 fashion",
                        "c": "1549"
                },
                {
                        "SearchPhrase": "freeform photos",
                        "c": "1480"
                }
        ],

        "totals":
        {
                "SearchPhrase": "",
                "c": "8873898"
        },

        "extremes":
        {
                "min":
                {
                        "SearchPhrase": "",
                        "c": "1480"
                },
                "max":
                {
                        "SearchPhrase": "",
                        "c": "8267016"
                }
        },

        "rows": 5,

        "rows_before_limit_at_least": 141137
}

JSON 与 JavaScript 兼容。为了确保这一点,一些字符被另外转义:斜线/被转义为\/; 替代的换行符 U+2028U+2029 会打断一些浏览器解析,它们会被转义为 \uXXXX。 ASCII 控制字符被转义:退格,换页,换行,回车和水平制表符被替换为\b\f\n\r\t 作为使用\uXXXX序列的00-1F范围内的剩余字节。 无效的 UTF-8 序列更改为替换字符 ,因此输出文本将包含有效的 UTF-8 序列。 为了与 JavaScript 兼容默认情况下Int64 和 UInt64 整数用双引号引起来。要除去引号,可以将配置参数 output_format_json_quote_64bit_integers 设置为0。

rows 结果输出的行数。

rows_before_limit_at_least 去掉 LIMIT 过滤后的最小行总数。 只会在查询包含 LIMIT 条件时输出。 若查询包含 GROUP BYrows_before_limit_at_least 就是去掉 LIMIT 后过滤后的准确行数。

totals 总值 (当使用 TOTALS 条件时)。

extremes 极值 (当 extremes 设置为 1时

该格式仅适用于输出查询结果,但不适用于解析输入(将数据插入到表中)。

ClickHouse 支持 NULL, 在 JSON 格式中以 null 输出来表示.

参考 JSONEachRow 格式。

JSONCompact

与 JSON 格式不同的是它以数组的方式输出结果,而不是以结构体。

示例:

{
        "meta":
        [
                {
                        "name": "SearchPhrase",
                        "type": "String"
                },
                {
                        "name": "c",
                        "type": "UInt64"
                }
        ],

        "data":
        [
                ["", "8267016"],
                ["bathroom interior design", "2166"],
                ["yandex", "1655"],
                ["fashion trends spring 2014", "1549"],
                ["freeform photo", "1480"]
        ],

        "totals": ["","8873898"],

        "extremes":
        {
                "min": ["","1480"],
                "max": ["","8267016"]
        },

        "rows": 5,

        "rows_before_limit_at_least": 141137
}

这种格式仅仅适用于输出结果集,而不适用于解析(将数据插入到表中)。 参考 JSONEachRow 格式。

JSONEachRow

将数据结果每一行以 JSON 结构体输出(换行分割 JSON 结构体)。

{"SearchPhrase":"","count()":"8267016"}
{"SearchPhrase": "bathroom interior design","count()": "2166"}
{"SearchPhrase":"yandex","count()":"1655"}
{"SearchPhrase":"2014 spring fashion","count()":"1549"}
{"SearchPhrase":"freeform photo","count()":"1480"}
{"SearchPhrase":"angelina jolie","count()":"1245"}
{"SearchPhrase":"omsk","count()":"1112"}
{"SearchPhrase":"photos of dog breeds","count()":"1091"}
{"SearchPhrase":"curtain designs","count()":"1064"}
{"SearchPhrase":"baku","count()":"1000"}

与 JSON 格式不同的是没有替换无效的UTF-8序列。任何一组字节都可以在行中输出。这是必要的因为这样数据可以被格式化而不会丢失任何信息。值的转义方式与JSON相同。

对于解析,任何顺序都支持不同列的值。可以省略某些值 - 它们被视为等于它们的默认值。在这种情况下,零和空行被用作默认值。 作为默认值,不支持表中指定的复杂值。元素之间的空白字符被忽略。如果在对象之后放置逗号,它将被忽略。对象不一定必须用新行分隔。

Usage of Nested Structures

If you have a table with the Nested data type columns, you can insert JSON data having the same structure. Enable this functionality with the input_format_import_nested_json setting.

For example, consider the following table:

CREATE TABLE json_each_row_nested (n Nested (s String, i Int32) ) ENGINE = Memory

As you can find in the Nested data type description, ClickHouse treats each component of the nested structure as a separate column, n.s and n.i for our table. So you can insert the data the following way:

INSERT INTO json_each_row_nested FORMAT JSONEachRow {"n.s": ["abc", "def"], "n.i": [1, 23]}

To insert data as hierarchical JSON object set input_format_import_nested_json=1.

{
    "n": {
        "s": ["abc", "def"],
        "i": [1, 23]
    }
}

Without this setting ClickHouse throws the exception.

SELECT name, value FROM system.settings WHERE name = 'input_format_import_nested_json'
┌─name────────────────────────────┬─value─┐
│ input_format_import_nested_json │ 0     │
└─────────────────────────────────┴───────┘
INSERT INTO json_each_row_nested FORMAT JSONEachRow {"n": {"s": ["abc", "def"], "i": [1, 23]}}
Code: 117. DB::Exception: Unknown field found while parsing JSONEachRow format: n: (at row 1)
SET input_format_import_nested_json=1
INSERT INTO json_each_row_nested FORMAT JSONEachRow {"n": {"s": ["abc", "def"], "i": [1, 23]}}
SELECT * FROM json_each_row_nested
┌─n.s───────────┬─n.i────┐
│ ['abc','def'] │ [1,23] │
└───────────────┴────────┘

Native

最高性能的格式。 据通过二进制格式的块进行写入和读取。对于每个块,该块中的行数,列数,列名称和类型以及列的部分将被相继记录。 换句话说,这种格式是 «列式»的 - 它不会将列转换为行。 这是用于在服务器之间进行交互的本地界面中使用的格式,用于使用命令行客户端和 C++ 客户端。

您可以使用此格式快速生成只能由 ClickHouse DBMS 读取的格式。但自己处理这种格式是没有意义的。

Null

没有输出。但是,查询已处理完毕,并且在使用命令行客户端时,数据将传输到客户端。这仅用于测试,包括生产力测试。 显然,这种格式只适用于输出,不适用于解析。

Pretty

将数据以表格形式输出,也可以使用 ANSI 转义字符在终端中设置颜色。 它会绘制一个完整的表格,每行数据在终端中占用两行。 每一个结果块都会以单独的表格输出。这是很有必要的,以便结果块不用缓冲结果输出(缓冲在可以预见结果集宽度的时候是很有必要的)。

NULL 输出为 ᴺᵁᴸᴸ

SELECT * FROM t_null
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
└───┴──────┘

为避免将太多数据传输到终端只打印前10,000行。 如果行数大于或等于10,000则会显示消息«Showed first 10 000»。 该格式仅适用于输出查询结果,但不适用于解析输入(将数据插入到表中)。

Pretty格式支持输出总值当使用 WITH TOTALS 时)和极值(当 extremes 设置为1时。 在这些情况下,总数值和极值在主数据之后以单独的表格形式输出。 示例(以 PrettyCompact 格式显示):

SELECT EventDate, count() AS c FROM test.hits GROUP BY EventDate WITH TOTALS ORDER BY EventDate FORMAT PrettyCompact
┌──EventDate─┬───────c─┐
│ 2014-03-17 │ 1406958 │
│ 2014-03-18 │ 1383658 │
│ 2014-03-19 │ 1405797 │
│ 2014-03-20 │ 1353623 │
│ 2014-03-21 │ 1245779 │
│ 2014-03-22 │ 1031592 │
│ 2014-03-23 │ 1046491 │
└────────────┴─────────┘

Totals:
┌──EventDate─┬───────c─┐
│ 0000-00-00 │ 8873898 │
└────────────┴─────────┘

Extremes:
┌──EventDate─┬───────c─┐
│ 2014-03-17 │ 1031592 │
│ 2014-03-23 │ 1406958 │
└────────────┴─────────┘

PrettyCompact

Pretty 格式不一样的是,PrettyCompact 去掉了行之间的表格分割线,这样使得结果更加紧凑。这种格式会在交互命令行客户端下默认使用。

PrettyCompactMonoBlock

PrettyCompact 格式不一样的是,它支持 10,000 行数据缓冲,然后输出在一个表格中,不会按照块来区分

PrettyNoEscapes

Pretty 格式不一样的是,它不使用 ANSI 字符转义, 这在浏览器显示数据以及在使用 watch 命令行工具是有必要的。

示例:

watch -n1 "clickhouse-client --query='SELECT event, value FROM system.events FORMAT PrettyCompactNoEscapes'"

您可以使用 HTTP 接口来获取数据,显示在浏览器中。

PrettyCompactNoEscapes

用法类似上述。

PrettySpaceNoEscapes

用法类似上述。

PrettySpace

PrettyCompact(#prettycompact) 格式不一样的是,它使用空格来代替网格来显示数据。

RowBinary

以二进制格式逐行格式化和解析数据。行和值连续列出,没有分隔符。 这种格式比 Native 格式效率低,因为它是基于行的。

整数使用固定长度的小端表示法。 例如UInt64 使用8个字节。 DateTime 被表示为 UInt32 类型的Unix 时间戳值。 Date 被表示为 UInt16 对象,它的值为 1970-01-01以来的天数。 字符串表示为 varint 长度(无符号 LEB128),后跟字符串的字节数。 FixedString 被简单地表示为一个字节序列。

数组表示为 varint 长度(无符号 LEB128),后跟有序的数组元素。

对于 NULL 的支持, 一个为 1 或 0 的字节会加在每个 Nullable 值前面。如果为 1, 那么该值就是 NULL。 如果为 0则不为 NULL

RowBinaryWithNamesAndTypes

Similar to RowBinary, but with added header:

  • LEB128-encoded number of columns (N)
  • N Strings specifying column names
  • N Strings specifying column types

Values

在括号中打印每一行。行由逗号分隔。最后一行之后没有逗号。括号内的值也用逗号分隔。数字以十进制格式输出,不含引号。 数组以方括号输出。带有时间的字符串,日期和时间用引号包围输出。转义字符的解析规则与 TabSeparated 格式类似。 在格式化过程中,不插入额外的空格,但在解析过程中,空格是被允许并跳过的(除了数组值之外的空格,这是不允许的)。NULLNULL

以 Values 格式传递数据时需要转义的最小字符集是:单引号和反斜线。

这是 INSERT INTO t VALUES ... 中可以使用的格式,但您也可以将其用于查询结果。

Vertical

使用指定的列名在单独的行上打印每个值。如果每行都包含大量列,则此格式便于打印一行或几行。

NULL 输出为 ᴺᵁᴸᴸ

示例:

SELECT * FROM t_null FORMAT Vertical
Row 1:
──────
x: 1
y: ᴺᵁᴸᴸ

该格式仅适用于输出查询结果,但不适用于解析输入(将数据插入到表中)。

VerticalRaw

Vertical 格式不同点在于,行是不会被转义的。 这种格式仅仅适用于输出,但不适用于解析输入(将数据插入到表中)。

示例:

:) SHOW CREATE TABLE geonames FORMAT VerticalRaw;
Row 1:
──────
statement: CREATE TABLE default.geonames ( geonameid UInt32, date Date DEFAULT CAST('2017-12-08' AS Date)) ENGINE = MergeTree(date, geonameid, 8192)

:) SELECT 'string with \'quotes\' and \t with some special \n characters' AS test FORMAT VerticalRaw;
Row 1:
──────
test: string with 'quotes' and   with some special
 characters

和 Vertical 格式相比:

:) SELECT 'string with \'quotes\' and \t with some special \n characters' AS test FORMAT Vertical;
Row 1:
──────
test: string with \'quotes\' and \t with some special \n characters

XML

该格式仅适用于输出查询结果,但不适用于解析输入,示例:

<?xml version='1.0' encoding='UTF-8' ?>
<result>
        <meta>
                <columns>
                        <column>
                                <name>SearchPhrase</name>
                                <type>String</type>
                        </column>
                        <column>
                                <name>count()</name>
                                <type>UInt64</type>
                        </column>
                </columns>
        </meta>
        <data>
                <row>
                        <SearchPhrase></SearchPhrase>
                        <field>8267016</field>
                </row>
                <row>
                        <SearchPhrase>bathroom interior design</SearchPhrase>
                        <field>2166</field>
                </row>
                <row>
                        <SearchPhrase>yandex</SearchPhrase>
                        <field>1655</field>
                </row>
                <row>
                        <SearchPhrase>2014 spring fashion</SearchPhrase>
                        <field>1549</field>
                </row>
                <row>
                        <SearchPhrase>freeform photos</SearchPhrase>
                        <field>1480</field>
                </row>
                <row>
                        <SearchPhrase>angelina jolie</SearchPhrase>
                        <field>1245</field>
                </row>
                <row>
                        <SearchPhrase>omsk</SearchPhrase>
                        <field>1112</field>
                </row>
                <row>
                        <SearchPhrase>photos of dog breeds</SearchPhrase>
                        <field>1091</field>
                </row>
                <row>
                        <SearchPhrase>curtain designs</SearchPhrase>
                        <field>1064</field>
                </row>
                <row>
                        <SearchPhrase>baku</SearchPhrase>
                        <field>1000</field>
                </row>
        </data>
        <rows>10</rows>
        <rows_before_limit_at_least>141137</rows_before_limit_at_least>
</result>

如果列名称没有可接受的格式,则仅使用 field 作为元素名称。 通常XML 结构遵循 JSON 结构。 就像JSON一样将无效的 UTF-8 字符都作替换,以便输出文本将包含有效的 UTF-8 字符序列。

在字符串值中,字符 < 被转义为 <

数组输出为 <array> <elem> Hello </ elem> <elem> World </ elem> ... </ array>,元组输出为 <tuple> <elem> Hello </ elem> <elem> World </ ELEM> ... </tuple>

CapnProto

Capn Proto 是一种二进制消息格式,类似 Protocol Buffers 和 Thriftis但与 JSON 或 MessagePack 格式不一样。

Capn Proto 消息格式是严格类型的,而不是自我描述,这意味着它们不需要外部的描述。这种格式可以实时地应用,并针对每个查询进行缓存。

SELECT SearchPhrase, count() AS c FROM test.hits
       GROUP BY SearchPhrase FORMAT CapnProto SETTINGS schema = 'schema:Message'

其中 schema.capnp 描述如下:

struct Message {
  SearchPhrase @0 :Text;
  c @1 :Uint64;
}

格式文件存储的目录可以在服务配置中的 format_schema_path 指定。

Capn Proto 反序列化是很高效的,通常不会增加系统的负载。

Protobuf

Protobuf - is a Protocol Buffers format.

This format requires an external format schema. The schema is cached between queries. ClickHouse supports both proto2 and proto3 syntaxes. Repeated/optional/required fields are supported.

Usage examples:

SELECT * FROM test.table FORMAT Protobuf SETTINGS format_schema = 'schemafile:MessageType'
cat protobuf_messages.bin | clickhouse-client --query "INSERT INTO test.table FORMAT Protobuf SETTINGS format_schema='schemafile:MessageType'"

where the file schemafile.proto looks like this:

syntax = "proto3";

message MessageType {
  string name = 1;
  string surname = 2;
  uint32 birthDate = 3;
  repeated string phoneNumbers = 4;
};

To find the correspondence between table columns and fields of Protocol Buffers message type ClickHouse compares their names. This comparison is case-insensitive and the characters _ (underscore) and . (dot) are considered as equal. If types of a column and a field of Protocol Buffers message are different the necessary conversion is applied.

Nested messages are supported. For example, for the field z in the following message type

message MessageType {
  message XType {
    message YType {
      int32 z;
    };
    repeated YType y;
  };
  XType x;
};

ClickHouse tries to find a column named x.y.z (or x_y_z or X.y_Z and so on). Nested messages are suitable to input or output a nested data structures.

Default values defined in a protobuf schema like this

syntax = "proto2";

message MessageType {
  optional int32 result_per_page = 3 [default = 10];
}

are not applied; the table defaults are used instead of them.

ClickHouse inputs and outputs protobuf messages in the length-delimited format. It means before every message should be written its length as a varint. See also how to read/write length-delimited protobuf messages in popular languages.

Avro

Apache Avro is a row-oriented data serialization framework developed within Apaches Hadoop project.

ClickHouse Avro format supports reading and writing Avro data files.

Data Types Matching

The table below shows supported data types and how they match ClickHouse data types in INSERT and SELECT queries.

Avro data type INSERT ClickHouse data type Avro data type SELECT
boolean, int, long, float, double Int(8|16|32), UInt(8|16|32) int
boolean, int, long, float, double Int64, UInt64 long
boolean, int, long, float, double Float32 float
boolean, int, long, float, double Float64 double
bytes, string, fixed, enum String bytes
bytes, string, fixed FixedString(N) fixed(N)
enum Enum(8|16) enum
array(T) Array(T) array(T)
union(null, T), union(T, null) Nullable(T) union(null, T)
null Nullable(Nothing) null
int (date) * Date int (date) *
long (timestamp-millis) * DateTime64(3) long (timestamp-millis) *
long (timestamp-micros) * DateTime64(6) long (timestamp-micros) *

* Avro logical types

Unsupported Avro data types: record (non-root), map

Unsupported Avro logical data types: uuid, time-millis, time-micros, duration

Inserting Data

To insert data from an Avro file into ClickHouse table:

$ cat file.avro | clickhouse-client --query="INSERT INTO {some_table} FORMAT Avro"

The root schema of input Avro file must be of record type.

To find the correspondence between table columns and fields of Avro schema ClickHouse compares their names. This comparison is case-sensitive. Unused fields are skipped.

Data types of a ClickHouse table columns can differ from the corresponding fields of the Avro data inserted. When inserting data, ClickHouse interprets data types according to the table above and then casts the data to corresponding column type.

Selecting Data

To select data from ClickHouse table into an Avro file:

$ clickhouse-client --query="SELECT * FROM {some_table} FORMAT Avro" > file.avro

Column names must:

  • start with [A-Za-z_]
  • subsequently contain only [A-Za-z0-9_]

Output Avro file compression and sync interval can be configured with output_format_avro_codec and output_format_avro_sync_interval respectively.

AvroConfluent

AvroConfluent supports decoding single-object Avro messages commonly used with Kafka and Confluent Schema Registry.

Each Avro message embeds a schema id that can be resolved to the actual schema with help of the Schema Registry.

Schemas are cached once resolved.

Schema Registry URL is configured with format_avro_schema_registry_url

Data Types Matching

Same as Avro

Usage

To quickly verify schema resolution you can use kafkacat with clickhouse-local:

$ kafkacat -b kafka-broker  -C -t topic1 -o beginning -f '%s' -c 3 | clickhouse-local   --input-format AvroConfluent --format_avro_schema_registry_url 'http://schema-registry' -S "field1 Int64, field2 String"  -q 'select *  from table'
1 a
2 b
3 c

To use AvroConfluent with Kafka:

CREATE TABLE topic1_stream
(
    field1 String,
    field2 String
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'kafka-broker',
kafka_topic_list = 'topic1',
kafka_group_name = 'group1',
kafka_format = 'AvroConfluent';

SET format_avro_schema_registry_url = 'http://schema-registry';

SELECT * FROM topic1_stream;

!!! note "Warning" Setting format_avro_schema_registry_url needs to be configured in users.xml to maintain its value after a restart.

Parquet

Apache Parquet is a columnar storage format widespread in the Hadoop ecosystem. ClickHouse supports read and write operations for this format.

Data Types Matching

The table below shows supported data types and how they match ClickHouse data types in INSERT and SELECT queries.

Parquet data type (INSERT) ClickHouse data type Parquet data type (SELECT)
UINT8, BOOL UInt8 UINT8
INT8 Int8 INT8
UINT16 UInt16 UINT16
INT16 Int16 INT16
UINT32 UInt32 UINT32
INT32 Int32 INT32
UINT64 UInt64 UINT64
INT64 Int64 INT64
FLOAT, HALF_FLOAT Float32 FLOAT
DOUBLE Float64 DOUBLE
DATE32 Date UINT16
DATE64, TIMESTAMP DateTime UINT32
STRING, BINARY String STRING
FixedString STRING
DECIMAL Decimal DECIMAL

ClickHouse supports configurable precision of Decimal type. The INSERT query treats the Parquet DECIMAL type as the ClickHouse Decimal128 type.

Unsupported Parquet data types: DATE32, TIME32, FIXED_SIZE_BINARY, JSON, UUID, ENUM.

Data types of a ClickHouse table columns can differ from the corresponding fields of the Parquet data inserted. When inserting data, ClickHouse interprets data types according to the table above and then cast the data to that data type which is set for the ClickHouse table column.

Inserting and Selecting Data

You can insert Parquet data from a file into ClickHouse table by the following command:

$ cat {filename} | clickhouse-client --query="INSERT INTO {some_table} FORMAT Parquet"

You can select data from a ClickHouse table and save them into some file in the Parquet format by the following command:

$ clickhouse-client --query="SELECT * FROM {some_table} FORMAT Parquet" > {some_file.pq}

To exchange data with Hadoop, you can use HDFS table engine.

ORC

Apache ORC is a columnar storage format widespread in the Hadoop ecosystem. You can only insert data in this format to ClickHouse.

Data Types Matching

The table below shows supported data types and how they match ClickHouse data types in INSERT queries.

ORC data type (INSERT) ClickHouse data type
UINT8, BOOL UInt8
INT8 Int8
UINT16 UInt16
INT16 Int16
UINT32 UInt32
INT32 Int32
UINT64 UInt64
INT64 Int64
FLOAT, HALF_FLOAT Float32
DOUBLE Float64
DATE32 Date
DATE64, TIMESTAMP DateTime
STRING, BINARY String
DECIMAL Decimal

ClickHouse supports configurable precision of the Decimal type. The INSERT query treats the ORC DECIMAL type as the ClickHouse Decimal128 type.

Unsupported ORC data types: DATE32, TIME32, FIXED_SIZE_BINARY, JSON, UUID, ENUM.

The data types of ClickHouse table columns dont have to match the corresponding ORC data fields. When inserting data, ClickHouse interprets data types according to the table above and then casts the data to the data type set for the ClickHouse table column.

Inserting Data

You can insert ORC data from a file into ClickHouse table by the following command:

$ cat filename.orc | clickhouse-client --query="INSERT INTO some_table FORMAT ORC"

To exchange data with Hadoop, you can use HDFS table engine.

Format Schema

The file name containing the format schema is set by the setting format_schema. Its required to set this setting when it is used one of the formats Cap'n Proto and Protobuf. The format schema is a combination of a file name and the name of a message type in this file, delimited by colon, e.g. schemafile.proto:MessageType. If the file has the standard extension for the format (for example, .proto for Protobuf), it can be omitted and in this case the format schema looks like schemafile:MessageType.

If you input or output data via the client in the interactive mode, the file name specified in the format schema can contain an absolute path or a path relative to the current directory on the client. If you use the client in the batch mode, the path to the schema must be relative due to security reasons.

If you input or output data via the HTTP interface the file name specified in the format schema should be located in the directory specified in format_schema_path in the server configuration.

Original article

Skipping Errors

Some formats such as CSV, TabSeparated, TSKV, JSONEachRow, Template, CustomSeparated and Protobuf can skip broken row if parsing error occurred and continue parsing from the beginning of next row. See input_format_allow_errors_num and input_format_allow_errors_ratio settings. Limitations:

  • In case of parsing error JSONEachRow skips all data until the new line (or EOF), so rows must be delimited by \n to count errors correctly.
  • Template and CustomSeparated use delimiter after the last column and delimiter between rows to find the beginning of next row, so skipping errors works only if at least one of them is not empty.

来源文章