ClickHouse/schema-inference.md at 74f2864f207dd45be30a1a87618a81dc1f3dcdd5

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-11-23 16:12:01 +00:00

avogar 7e5ba62017 Allow to read Bool values into String in JSON input formats

2024-01-05 20:33:30 +00:00

132 KiB

Raw Blame History

slug	sidebar_position	sidebar_label	title
/en/interfaces/schema-inference	21	Schema inference	Automatic schema inference from input data

ClickHouse can automatically determine the structure of input data in almost all supported Input formats. This document will describe when schema inference is used, how it works with different input formats and which settings can control it.

Usage

Schema inference is used when ClickHouse needs to read the data in a specific data format and the structure is unknown.

Table functions file, s3, url, hdfs.

These table functions have the optional argument structure with the structure of input data. If this argument is not specified or set to auto, the structure will be inferred from the data.

Example:

Let's say we have a file hobbies.jsonl in JSONEachRow format in the user_files directory with this content:

{"id" :  1, "age" :  25, "name" :  "Josh", "hobbies" :  ["football", "cooking", "music"]}
{"id" :  2, "age" :  19, "name" :  "Alan", "hobbies" :  ["tennis", "art"]}
{"id" :  3, "age" :  32, "name" :  "Lana", "hobbies" :  ["fitness", "reading", "shopping"]}
{"id" :  4, "age" :  47, "name" :  "Brayan", "hobbies" :  ["movies", "skydiving"]}

ClickHouse can read this data without you specifying its structure:

SELECT * FROM file('hobbies.jsonl')

┌─id─┬─age─┬─name───┬─hobbies──────────────────────────┐
│  1 │  25 │ Josh   │ ['football','cooking','music']   │
│  2 │  19 │ Alan   │ ['tennis','art']                 │
│  3 │  32 │ Lana   │ ['fitness','reading','shopping'] │
│  4 │  47 │ Brayan │ ['movies','skydiving']           │
└────┴─────┴────────┴──────────────────────────────────┘

Note: the format JSONEachRow was automatically determined by the file extension .jsonl.

You can see an automatically determined structure using the DESCRIBE query:

DESCRIBE file('hobbies.jsonl')

┌─name────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id      │ Nullable(Int64)         │              │                    │         │                  │                │
│ age     │ Nullable(Int64)         │              │                    │         │                  │                │
│ name    │ Nullable(String)        │              │                    │         │                  │                │
│ hobbies │ Array(Nullable(String)) │              │                    │         │                  │                │
└─────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Table engines File, S3, URL, HDFS

If the list of columns is not specified in CREATE TABLE query, the structure of the table will be inferred automatically from the data.

Example:

Let's use the file hobbies.jsonl. We can create a table with engine File with the data from this file:

CREATE TABLE hobbies ENGINE=File(JSONEachRow, 'hobbies.jsonl')

Ok.

SELECT * FROM hobbies

┌─id─┬─age─┬─name───┬─hobbies──────────────────────────┐
│  1 │  25 │ Josh   │ ['football','cooking','music']   │
│  2 │  19 │ Alan   │ ['tennis','art']                 │
│  3 │  32 │ Lana   │ ['fitness','reading','shopping'] │
│  4 │  47 │ Brayan │ ['movies','skydiving']           │
└────┴─────┴────────┴──────────────────────────────────┘

DESCRIBE TABLE hobbies

┌─name────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id      │ Nullable(Int64)         │              │                    │         │                  │                │
│ age     │ Nullable(Int64)         │              │                    │         │                  │                │
│ name    │ Nullable(String)        │              │                    │         │                  │                │
│ hobbies │ Array(Nullable(String)) │              │                    │         │                  │                │
└─────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

clickhouse-local

clickhouse-local has an optional parameter -S/--structure with the structure of input data. If this parameter is not specified or set to auto, the structure will be inferred from the data.

Example:

Let's use the file hobbies.jsonl. We can query the data from this file using clickhouse-local:

clickhouse-local --file='hobbies.jsonl' --table='hobbies' --query='DESCRIBE TABLE hobbies'

id	Nullable(Int64)
age	Nullable(Int64)
name	Nullable(String)
hobbies	Array(Nullable(String))

clickhouse-local --file='hobbies.jsonl' --table='hobbies' --query='SELECT * FROM hobbies'

1	25	Josh	['football','cooking','music']
2	19	Alan	['tennis','art']
3	32	Lana	['fitness','reading','shopping']
4	47	Brayan	['movies','skydiving']

Using structure from insertion table

When table functions file/s3/url/hdfs are used to insert data into a table, there is an option to use the structure from the insertion table instead of extracting it from the data. It can improve insertion performance because schema inference can take some time. Also, it will be helpful when the table has an optimized schema, so no conversions between types will be performed.

There is a special setting use_structure_from_insertion_table_in_table_functions that controls this behaviour. It has 3 possible values:

0 - table function will extract the structure from the data.
1 - table function will use the structure from the insertion table.
2 - ClickHouse will automatically determine if it's possible to use the structure from the insertion table or use schema inference. Default value.

Example 1:

Let's create table hobbies1 with the next structure:

CREATE TABLE hobbies1
(
    `id` UInt64,
    `age` LowCardinality(UInt8),
    `name` String,
    `hobbies` Array(String)
)
ENGINE = MergeTree
ORDER BY id;

And insert data from the file hobbies.jsonl:

INSERT INTO hobbies1 SELECT * FROM file(hobbies.jsonl)

In this case, all columns from the file are inserted into the table without changes, so ClickHouse will use the structure from the insertion table instead of schema inference.

Example 2:

Let's create table hobbies2 with the next structure:

CREATE TABLE hobbies2
(
  `id` UInt64,
  `age` LowCardinality(UInt8),
  `hobbies` Array(String)
)
  ENGINE = MergeTree
ORDER BY id;

And insert data from the file hobbies.jsonl:

INSERT INTO hobbies2 SELECT id, age, hobbies FROM file(hobbies.jsonl)

In this case, all columns in the SELECT query are present in the table, so ClickHouse will use the structure from the insertion table. Note that it will work only for input formats that support reading a subset of columns like JSONEachRow, TSKV, Parquet, etc. (so it won't work for example for TSV format).

Example 3:

Let's create table hobbies3 with the next structure:

CREATE TABLE hobbies3
(
  `identifier` UInt64,
  `age` LowCardinality(UInt8),
  `hobbies` Array(String)
)
  ENGINE = MergeTree
ORDER BY identifier;

And insert data from the file hobbies.jsonl:

INSERT INTO hobbies3 SELECT id, age, hobbies FROM file(hobbies.jsonl)

In this case, column id is used in the SELECT query, but the table doesn't have this column (it has a column with the name identifier), so ClickHouse cannot use the structure from the insertion table, and schema inference will be used.

Example 4:

Let's create table hobbies4 with the next structure:

CREATE TABLE hobbies4
(
  `id` UInt64,
  `any_hobby` Nullable(String)
)
  ENGINE = MergeTree
ORDER BY id;

And insert data from the file hobbies.jsonl:

INSERT INTO hobbies4 SELECT id, empty(hobbies) ? NULL : hobbies[1] FROM file(hobbies.jsonl)

In this case, there are some operations performed on the column hobbies in the SELECT query to insert it into the table, so ClickHouse cannot use the structure from the insertion table, and schema inference will be used.

Schema inference cache

For most input formats schema inference reads some data to determine its structure and this process can take some time. To prevent inferring the same schema every time ClickHouse read the data from the same file, the inferred schema is cached and when accessing the same file again, ClickHouse will use the schema from the cache.

There are special settings that control this cache:

schema_inference_cache_max_elements_for_{file/s3/hdfs/url/azure} - the maximum number of cached schemas for the corresponding table function. The default value is 4096. These settings should be set in the server config.
schema_inference_use_cache_for_{file,s3,hdfs,url,azure} - allows turning on/off using cache for schema inference. These settings can be used in queries.

The schema of the file can be changed by modifying the data or by changing format settings. For this reason, the schema inference cache identifies the schema by file source, format name, used format settings, and the last modification time of the file.

Note: some files accessed by url in url table function may not contain information about the last modification time; for this case, there is a special setting schema_inference_cache_require_modification_time_for_url. Disabling this setting allows the use of the schema from cache without the last modification time for such files.

There is also a system table schema_inference_cache with all current schemas in cache and system query SYSTEM DROP SCHEMA CACHE [FOR File/S3/URL/HDFS] that allows cleaning the schema cache for all sources, or for a specific source.

Examples:

Let's try to infer the structure of a sample dataset from s3 github-2022.ndjson.gz and see how the schema inference cache works:

DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022.ndjson.gz')
SETTINGS allow_experimental_object_type = 1

┌─name───────┬─type─────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ type       │ Nullable(String)         │              │                    │         │                  │                │
│ actor      │ Object(Nullable('json')) │              │                    │         │                  │                │
│ repo       │ Object(Nullable('json')) │              │                    │         │                  │                │
│ created_at │ Nullable(String)         │              │                    │         │                  │                │
│ payload    │ Object(Nullable('json')) │              │                    │         │                  │                │
└────────────┴──────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

5 rows in set. Elapsed: 0.601 sec.

DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022.ndjson.gz')
SETTINGS allow_experimental_object_type = 1

┌─name───────┬─type─────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ type       │ Nullable(String)         │              │                    │         │                  │                │
│ actor      │ Object(Nullable('json')) │              │                    │         │                  │                │
│ repo       │ Object(Nullable('json')) │              │                    │         │                  │                │
│ created_at │ Nullable(String)         │              │                    │         │                  │                │
│ payload    │ Object(Nullable('json')) │              │                    │         │                  │                │
└────────────┴──────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

5 rows in set. Elapsed: 0.059 sec.

As you can see, the second query succeeded almost instantly.

Let's try to change some settings that can affect inferred schema:

DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022.ndjson.gz')
SETTINGS input_format_json_read_objects_as_strings = 1

┌─name───────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ type       │ Nullable(String) │              │                    │         │                  │                │
│ actor      │ Nullable(String) │              │                    │         │                  │                │
│ repo       │ Nullable(String) │              │                    │         │                  │                │
│ created_at │ Nullable(String) │              │                    │         │                  │                │
│ payload    │ Nullable(String) │              │                    │         │                  │                │
└────────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

5 rows in set. Elapsed: 0.611 sec

As you can see, the schema from the cache was not used for the same file, because the setting that can affect inferred schema was changed.

Let's check the content of system.schema_inference_cache table:

SELECT schema, format, source FROM system.schema_inference_cache WHERE storage='S3'

┌─schema──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─format─┬─source───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ type Nullable(String), actor Object(Nullable('json')), repo Object(Nullable('json')), created_at Nullable(String), payload Object(Nullable('json')) │ NDJSON │ datasets-documentation.s3.eu-west-3.amazonaws.com443/datasets-documentation/github/github-2022.ndjson.gz │
│ type Nullable(String), actor Nullable(String), repo Nullable(String), created_at Nullable(String), payload Nullable(String)                         │ NDJSON │ datasets-documentation.s3.eu-west-3.amazonaws.com443/datasets-documentation/github/github-2022.ndjson.gz │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

As you can see, there are two different schemas for the same file.

We can clear the schema cache using a system query:

SYSTEM DROP SCHEMA CACHE FOR S3

Ok.

SELECT count() FROM system.schema_inference_cache WHERE storage='S3'

┌─count()─┐
│       0 │
└─────────┘

Text formats

For text formats, ClickHouse reads the data row by row, extracts column values according to the format, and then uses some recursive parsers and heuristics to determine the type for each value. The maximum number of rows and bytes read from the data in schema inference is controlled by the settings input_format_max_rows_to_read_for_schema_inference (25000 by default) and input_format_max_bytes_to_read_for_schema_inference (32Mb by default). By default, all inferred types are Nullable, but you can change this by setting schema_inference_make_columns_nullable (see examples in the settings section).

JSON formats

In JSON formats ClickHouse parses values according to the JSON specification and then tries to find the most appropriate data type for them.

Let's see how it works, what types can be inferred and what specific settings can be used in JSON formats.

Examples

Here and further, the format table function will be used in examples.

Integers, Floats, Bools, Strings:

DESC format(JSONEachRow, '{"int" : 42, "float" : 42.42, "string" : "Hello, World!"}');

┌─name───┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ int    │ Nullable(Int64)   │              │                    │         │                  │                │
│ float  │ Nullable(Float64) │              │                    │         │                  │                │
│ bool   │ Nullable(Bool)    │              │                    │         │                  │                │
│ string │ Nullable(String)  │              │                    │         │                  │                │
└────────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Dates, DateTimes:

DESC format(JSONEachRow, '{"date" : "2022-01-01", "datetime" : "2022-01-01 00:00:00"}')

┌─name─────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date     │ Nullable(Date)          │              │                    │         │                  │                │
│ datetime │ Nullable(DateTime64(9)) │              │                    │         │                  │                │
└──────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Arrays:

DESC format(JSONEachRow, '{"arr" : [1, 2, 3], "nested_arrays" : [[1, 2, 3], [4, 5, 6], []]}')

┌─name──────────┬─type──────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ arr           │ Array(Nullable(Int64))        │              │                    │         │                  │                │
│ nested_arrays │ Array(Array(Nullable(Int64))) │              │                    │         │                  │                │
└───────────────┴───────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If an array contains null, ClickHouse will use types from the other array elements:

DESC format(JSONEachRow, '{"arr" : [null, 42, null]}')

┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ arr  │ Array(Nullable(Int64)) │              │                    │         │                  │                │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Named tuples:

When setting input_format_json_try_infer_named_tuples_from_objects is enabled, during schema inference ClickHouse will try to infer named Tuple from JSON objects. The resulting named Tuple will contain all elements from all corresponding JSON objects from sample data.

SET input_format_json_try_infer_named_tuples_from_objects = 1;
DESC format(JSONEachRow, '{"obj" : {"a" : 42, "b" : "Hello"}}, {"obj" : {"a" : 43, "c" : [1, 2, 3]}}, {"obj" : {"d" : {"e" : 42}}}')

┌─name─┬─type───────────────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ obj  │ Tuple(a Nullable(Int64), b Nullable(String), c Array(Nullable(Int64)), d Tuple(e Nullable(Int64))) │              │                    │         │                  │                │
└──────┴────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Unnamed Tuples:

In JSON formats we treat Arrays with elements of different types as Unnamed Tuples.

DESC format(JSONEachRow, '{"tuple" : [1, "Hello, World!", [1, 2, 3]]}')

┌─name──┬─type─────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ tuple │ Tuple(Nullable(Int64), Nullable(String), Array(Nullable(Int64))) │              │                    │         │                  │                │
└───────┴──────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If some values are null or empty, we use types of corresponding values from the other rows:

DESC format(JSONEachRow, $$
                              {"tuple" : [1, null, null]}
                              {"tuple" : [null, "Hello, World!", []]}
                              {"tuple" : [null, null, [1, 2, 3]]}
                         $$)

┌─name──┬─type─────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ tuple │ Tuple(Nullable(Int64), Nullable(String), Array(Nullable(Int64))) │              │                    │         │                  │                │
└───────┴──────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Maps:

In JSON we can read objects with values of the same type as Map type. Note: it will work only when settings input_format_json_read_objects_as_strings and input_format_json_try_infer_named_tuples_from_objects are disabled.

SET input_format_json_read_objects_as_strings = 0, input_format_json_try_infer_named_tuples_from_objects = 0;
DESC format(JSONEachRow, '{"map" : {"key1" : 42, "key2" : 24, "key3" : 4}}')

┌─name─┬─type─────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ map  │ Map(String, Nullable(Int64)) │              │                    │         │                  │                │
└──────┴──────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

JSON Object type (if setting allow_experimental_object_type is enabled):

SET allow_experimental_object_type = 1
DESC format(JSONEachRow, $$
                            {"obj" : {"key1" : 42}}
                            {"obj" : {"key2" : "Hello, World!"}}
                            {"obj" : {"key1" : 24, "key3" : {"a" : 42, "b" : null}}}
                         $$)

┌─name─┬─type─────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ obj  │ Object(Nullable('json')) │              │                    │         │                  │                │
└──────┴──────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Nested complex types:

DESC format(JSONEachRow, '{"value" : [[[42, 24], []], {"key1" : 42, "key2" : 24}]}')

┌─name──┬─type─────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Tuple(Array(Array(Nullable(String))), Tuple(key1 Nullable(Int64), key2 Nullable(Int64))) │              │                    │         │                  │                │
└───────┴──────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If ClickHouse cannot determine the type for some key, because the data contains only nulls/empty objects/empty arrays, type String will be used if setting input_format_json_infer_incomplete_types_as_strings is enabled or an exception will be thrown otherwise:

DESC format(JSONEachRow, '{"arr" : [null, null]}') SETTINGS input_format_json_infer_incomplete_types_as_strings = 1;

┌─name─┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ arr  │ Array(Nullable(String)) │              │                    │         │                  │                │
└──────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

DESC format(JSONEachRow, '{"arr" : [null, null]}') SETTINGS input_format_json_infer_incomplete_types_as_strings = 0;

Code: 652. DB::Exception: Received from localhost:9000. DB::Exception:
Cannot determine type for column 'arr' by first 1 rows of data,
most likely this column contains only Nulls or empty Arrays/Maps.
...

JSON settings

input_format_json_try_infer_numbers_from_strings

Enabling this setting allows inferring numbers from string values.

This setting is disabled by default.

Example:

SET input_format_json_try_infer_numbers_from_strings = 1;
DESC format(JSONEachRow, $$
                              {"value" : "42"}
                              {"value" : "424242424242"}
                         $$)

┌─name──┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Nullable(Int64) │              │                    │         │                  │                │
└───────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

input_format_json_try_infer_named_tuples_from_objects

Enabling this setting allows inferring named Tuples from JSON objects. The resulting named Tuple will contain all elements from all corresponding JSON objects from sample data. It can be useful when JSON data is not sparse so the sample of data will contain all possible object keys.

This setting is enabled by default.

Example

SET input_format_json_try_infer_named_tuples_from_objects = 1;
DESC format(JSONEachRow, '{"obj" : {"a" : 42, "b" : "Hello"}}, {"obj" : {"a" : 43, "c" : [1, 2, 3]}}, {"obj" : {"d" : {"e" : 42}}}')

Result:

┌─name─┬─type───────────────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ obj  │ Tuple(a Nullable(Int64), b Nullable(String), c Array(Nullable(Int64)), d Tuple(e Nullable(Int64))) │              │                    │         │                  │                │
└──────┴────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

SET input_format_json_try_infer_named_tuples_from_objects = 1;
DESC format(JSONEachRow, '{"array" : [{"a" : 42, "b" : "Hello"}, {}, {"c" : [1,2,3]}, {"d" : "2020-01-01"}]}')

Result:

┌─name──┬─type────────────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ array │ Array(Tuple(a Nullable(Int64), b Nullable(String), c Array(Nullable(Int64)), d Nullable(Date))) │              │                    │         │                  │                │
└───────┴─────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

input_format_json_read_objects_as_strings

Enabling this setting allows reading nested JSON objects as strings. This setting can be used to read nested JSON objects without using JSON object type.

This setting is enabled by default.

Note: enabling this setting will take effect only if setting input_format_json_try_infer_named_tuples_from_objects is disabled.

SET input_format_json_read_objects_as_strings = 1, input_format_json_try_infer_named_tuples_from_objects = 0;
DESC format(JSONEachRow, $$
                             {"obj" : {"key1" : 42, "key2" : [1,2,3,4]}}
                             {"obj" : {"key3" : {"nested_key" : 1}}}
                         $$)

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ obj  │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

input_format_json_read_numbers_as_strings

Enabling this setting allows reading numeric values as strings.

This setting is enabled by default.

Example

SET input_format_json_read_numbers_as_strings = 1;
DESC format(JSONEachRow, $$
                                {"value" : 1055}
                                {"value" : "unknown"}
                         $$)

┌─name──┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Nullable(String) │              │                    │         │                  │                │
└───────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

input_format_json_read_bools_as_numbers

Enabling this setting allows reading Bool values as numbers.

This setting is enabled by default.

Example:

SET input_format_json_read_bools_as_numbers = 1;
DESC format(JSONEachRow, $$
                                {"value" : true}
                                {"value" : 42}
                         $$)

┌─name──┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Nullable(Int64) │              │                    │         │                  │                │
└───────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

input_format_json_read_bools_as_strings

Enabling this setting allows reading Bool values as strings.

This setting is enabled by default.

Example:

SET input_format_json_read_bools_as_strings = 1;
DESC format(JSONEachRow, $$
                                {"value" : true}
                                {"value" : "Hello, World"}
                         $$)

┌─name──┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Nullable(String) │              │                    │         │                  │                │
└───────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

input_format_json_read_arrays_as_strings

Enabling this setting allows reading JSON array values as strings.

This setting is enabled by default.

Example

SET input_format_json_read_arrays_as_strings = 1;
SELECT arr, toTypeName(arr), JSONExtractArrayRaw(arr)[3] from format(JSONEachRow, 'arr String', '{"arr" : [1, "Hello", [1,2,3]]}');

┌─arr───────────────────┬─toTypeName(arr)─┬─arrayElement(JSONExtractArrayRaw(arr), 3)─┐
│ [1, "Hello", [1,2,3]] │ String          │ [1,2,3]                                   │
└───────────────────────┴─────────────────┴───────────────────────────────────────────┘

input_format_json_infer_incomplete_types_as_strings

Enabling this setting allows to use String type for JSON keys that contain only Null/{}/[] in data sample during schema inference. In JSON formats any value can be read as String if all corresponding settings are enabled (they are all enabled by default), and we can avoid errors like Cannot determine type for column 'column_name' by first 25000 rows of data, most likely this column contains only Nulls or empty Arrays/Maps during schema inference by using String type for keys with unknown types.

Example:

SET input_format_json_infer_incomplete_types_as_strings = 1, input_format_json_try_infer_named_tuples_from_objects = 1;
DESCRIBE format(JSONEachRow, '{"obj" : {"a" : [1,2,3], "b" : "hello", "c" : null, "d" : {}, "e" : []}}');
SELECT * FROM format(JSONEachRow, '{"obj" : {"a" : [1,2,3], "b" : "hello", "c" : null, "d" : {}, "e" : []}}');

Result:

┌─name─┬─type───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ obj  │ Tuple(a Array(Nullable(Int64)), b Nullable(String), c Nullable(String), d Nullable(String), e Array(Nullable(String))) │              │                    │         │                  │                │
└──────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

┌─obj────────────────────────────┐
│ ([1,2,3],'hello',NULL,'{}',[]) │
└────────────────────────────────┘

CSV

In CSV format ClickHouse extracts column values from the row according to delimiters. ClickHouse expects all types except numbers and strings to be enclosed in double quotes. If the value is in double quotes, ClickHouse tries to parse the data inside quotes using the recursive parser and then tries to find the most appropriate data type for it. If the value is not in double quotes, ClickHouse tries to parse it as a number, and if the value is not a number, ClickHouse treats it as a string.

If you don't want ClickHouse to try to determine complex types using some parsers and heuristics, you can disable setting input_format_csv_use_best_effort_in_schema_inference and ClickHouse will treat all columns as Strings.

If setting input_format_csv_detect_header is enabled, ClickHouse will try to detect the header with column names (and maybe types) while inferring schema. This setting is enabled by default.

Examples:

Integers, Floats, Bools, Strings:

DESC format(CSV, '42,42.42,true,"Hello,World!"')

┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Int64)   │              │                    │         │                  │                │
│ c2   │ Nullable(Float64) │              │                    │         │                  │                │
│ c3   │ Nullable(Bool)    │              │                    │         │                  │                │
│ c4   │ Nullable(String)  │              │                    │         │                  │                │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Strings without quotes:

DESC format(CSV, 'Hello world!,World hello!')

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(String) │              │                    │         │                  │                │
│ c2   │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Dates, DateTimes:

DESC format(CSV, '"2020-01-01","2020-01-01 00:00:00"')

┌─name─┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Date)          │              │                    │         │                  │                │
│ c2   │ Nullable(DateTime64(9)) │              │                    │         │                  │                │
└──────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Arrays:

DESC format(CSV, '"[1,2,3]","[[1, 2], [], [3, 4]]"')

┌─name─┬─type──────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Nullable(Int64))        │              │                    │         │                  │                │
│ c2   │ Array(Array(Nullable(Int64))) │              │                    │         │                  │                │
└──────┴───────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

DESC format(CSV, $$"['Hello', 'world']","[['Abc', 'Def'], []]"$$)

┌─name─┬─type───────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Nullable(String))        │              │                    │         │                  │                │
│ c2   │ Array(Array(Nullable(String))) │              │                    │         │                  │                │
└──────┴────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If an array contains null, ClickHouse will use types from the other array elements:

DESC format(CSV, '"[NULL, 42, NULL]"')

┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Nullable(Int64)) │              │                    │         │                  │                │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Maps:

DESC format(CSV, $$"{'key1' : 42, 'key2' : 24}"$$)

┌─name─┬─type─────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Map(String, Nullable(Int64)) │              │                    │         │                  │                │
└──────┴──────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Nested Arrays and Maps:

DESC format(CSV, $$"[{'key1' : [[42, 42], []], 'key2' : [[null], [42]]}]"$$)

┌─name─┬─type──────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Map(String, Array(Array(Nullable(Int64))))) │              │                    │         │                  │                │
└──────┴───────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If ClickHouse cannot determine the type inside quotes, because the data contains only nulls, ClickHouse will treat it as String:

DESC format(CSV, '"[NULL, NULL]"')

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Example with disabled setting input_format_csv_use_best_effort_in_schema_inference:

SET input_format_csv_use_best_effort_in_schema_inference = 0
DESC format(CSV, '"[1,2,3]",42.42,Hello World!')

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(String) │              │                    │         │                  │                │
│ c2   │ Nullable(String) │              │                    │         │                  │                │
│ c3   │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Examples of header auto-detection (when input_format_csv_detect_header is enabled):

Only names:

SELECT * FROM format(CSV,
$$"number","string","array"
42,"Hello","[1, 2, 3]"
43,"World","[4, 5, 6]"
$$)

┌─number─┬─string─┬─array───┐
│     42 │ Hello  │ [1,2,3] │
│     43 │ World  │ [4,5,6] │
└────────┴────────┴─────────┘

Names and types:

DESC format(CSV,
$$"number","string","array"
"UInt32","String","Array(UInt16)"
42,"Hello","[1, 2, 3]"
43,"World","[4, 5, 6]"
$$)

┌─name───┬─type──────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ UInt32        │              │                    │         │                  │                │
│ string │ String        │              │                    │         │                  │                │
│ array  │ Array(UInt16) │              │                    │         │                  │                │
└────────┴───────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Note that the header can be detected only if there is at least one column with a non-String type. If all columns have String type, the header is not detected:

SELECT * FROM format(CSV,
$$"first_column","second_column"
"Hello","World"
"World","Hello"
$$)

┌─c1───────────┬─c2────────────┐
│ first_column │ second_column │
│ Hello        │ World         │
│ World        │ Hello         │
└──────────────┴───────────────┘

CSV settings

input_format_csv_try_infer_numbers_from_strings

Enabling this setting allows inferring numbers from string values.

This setting is disabled by default.

Example:

SET input_format_json_try_infer_numbers_from_strings = 1;
DESC format(CSV, '"42","42.42"');

┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Int64)   │              │                    │         │                  │                │
│ c2   │ Nullable(Float64) │              │                    │         │                  │                │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

TSV/TSKV

In TSV/TSKV formats ClickHouse extracts column value from the row according to tabular delimiters and then parses extracted value using the recursive parser to determine the most appropriate type. If the type cannot be determined, ClickHouse treats this value as String.

If you don't want ClickHouse to try to determine complex types using some parsers and heuristics, you can disable setting input_format_tsv_use_best_effort_in_schema_inference and ClickHouse will treat all columns as Strings.

If setting input_format_tsv_detect_header is enabled, ClickHouse will try to detect the header with column names (and maybe types) while inferring schema. This setting is enabled by default.

Examples:

Integers, Floats, Bools, Strings:

DESC format(TSV, '42	42.42	true	Hello,World!')

┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Int64)   │              │                    │         │                  │                │
│ c2   │ Nullable(Float64) │              │                    │         │                  │                │
│ c3   │ Nullable(Bool)    │              │                    │         │                  │                │
│ c4   │ Nullable(String)  │              │                    │         │                  │                │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

DESC format(TSKV, 'int=42	float=42.42	bool=true	string=Hello,World!\n')

┌─name───┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ int    │ Nullable(Int64)   │              │                    │         │                  │                │
│ float  │ Nullable(Float64) │              │                    │         │                  │                │
│ bool   │ Nullable(Bool)    │              │                    │         │                  │                │
│ string │ Nullable(String)  │              │                    │         │                  │                │
└────────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Dates, DateTimes:

DESC format(TSV, '2020-01-01	2020-01-01 00:00:00')

┌─name─┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Date)          │              │                    │         │                  │                │
│ c2   │ Nullable(DateTime64(9)) │              │                    │         │                  │                │
└──────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Arrays:

DESC format(TSV, '[1,2,3]	[[1, 2], [], [3, 4]]')

┌─name─┬─type──────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Nullable(Int64))        │              │                    │         │                  │                │
│ c2   │ Array(Array(Nullable(Int64))) │              │                    │         │                  │                │
└──────┴───────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

DESC format(TSV, '[''Hello'', ''world'']	[[''Abc'', ''Def''], []]')

┌─name─┬─type───────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Nullable(String))        │              │                    │         │                  │                │
│ c2   │ Array(Array(Nullable(String))) │              │                    │         │                  │                │
└──────┴────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If an array contains null, ClickHouse will use types from the other array elements:

DESC format(TSV, '[NULL, 42, NULL]')

┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Nullable(Int64)) │              │                    │         │                  │                │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Tuples:

DESC format(TSV, $$(42, 'Hello, world!')$$)

┌─name─┬─type─────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Tuple(Nullable(Int64), Nullable(String)) │              │                    │         │                  │                │
└──────┴──────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Maps:

DESC format(TSV, $${'key1' : 42, 'key2' : 24}$$)

┌─name─┬─type─────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Map(String, Nullable(Int64)) │              │                    │         │                  │                │
└──────┴──────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Nested Arrays, Tuples and Maps:

DESC format(TSV, $$[{'key1' : [(42, 'Hello'), (24, NULL)], 'key2' : [(NULL, ','), (42, 'world!')]}]$$)

┌─name─┬─type────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Map(String, Array(Tuple(Nullable(Int64), Nullable(String))))) │              │                    │         │                  │                │
└──────┴─────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If ClickHouse cannot determine the type, because the data contains only nulls, ClickHouse will treat it as String:

DESC format(TSV, '[NULL, NULL]')

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Example with disabled setting input_format_tsv_use_best_effort_in_schema_inference:

SET input_format_tsv_use_best_effort_in_schema_inference = 0
DESC format(TSV, '[1,2,3]	42.42	Hello World!')

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(String) │              │                    │         │                  │                │
│ c2   │ Nullable(String) │              │                    │         │                  │                │
│ c3   │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Examples of header auto-detection (when input_format_tsv_detect_header is enabled):

Only names:

SELECT * FROM format(TSV,
$$number	string	array
42	Hello	[1, 2, 3]
43	World	[4, 5, 6]
$$);

┌─number─┬─string─┬─array───┐
│     42 │ Hello  │ [1,2,3] │
│     43 │ World  │ [4,5,6] │
└────────┴────────┴─────────┘

Names and types:

DESC format(TSV,
$$number	string	array
UInt32	String	Array(UInt16)
42	Hello	[1, 2, 3]
43	World	[4, 5, 6]
$$)

┌─name───┬─type──────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ UInt32        │              │                    │         │                  │                │
│ string │ String        │              │                    │         │                  │                │
│ array  │ Array(UInt16) │              │                    │         │                  │                │
└────────┴───────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Note that the header can be detected only if there is at least one column with a non-String type. If all columns have String type, the header is not detected:

SELECT * FROM format(TSV,
$$first_column	second_column
Hello	World
World	Hello
$$)

┌─c1───────────┬─c2────────────┐
│ first_column │ second_column │
│ Hello        │ World         │
│ World        │ Hello         │
└──────────────┴───────────────┘

Values

In Values format ClickHouse extracts column value from the row and then parses it using the recursive parser similar to how literals are parsed.

Examples:

Integers, Floats, Bools, Strings:

DESC format(Values, $$(42, 42.42, true, 'Hello,World!')$$)

┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Int64)   │              │                    │         │                  │                │
│ c2   │ Nullable(Float64) │              │                    │         │                  │                │
│ c3   │ Nullable(Bool)    │              │                    │         │                  │                │
│ c4   │ Nullable(String)  │              │                    │         │                  │                │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Dates, DateTimes:

DESC format(Values, $$('2020-01-01', '2020-01-01 00:00:00')$$)

┌─name─┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Date)          │              │                    │         │                  │                │
│ c2   │ Nullable(DateTime64(9)) │              │                    │         │                  │                │
└──────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Arrays:

DESC format(Values, '([1,2,3], [[1, 2], [], [3, 4]])')

┌─name─┬─type──────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Nullable(Int64))        │              │                    │         │                  │                │
│ c2   │ Array(Array(Nullable(Int64))) │              │                    │         │                  │                │
└──────┴───────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If an array contains null, ClickHouse will use types from the other array elements:

DESC format(Values, '([NULL, 42, NULL])')

┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Nullable(Int64)) │              │                    │         │                  │                │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Tuples:

DESC format(Values, $$((42, 'Hello, world!'))$$)

┌─name─┬─type─────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Tuple(Nullable(Int64), Nullable(String)) │              │                    │         │                  │                │
└──────┴──────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Maps:

DESC format(Values, $$({'key1' : 42, 'key2' : 24})$$)

┌─name─┬─type─────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Map(String, Nullable(Int64)) │              │                    │         │                  │                │
└──────┴──────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Nested Arrays, Tuples and Maps:

DESC format(Values, $$([{'key1' : [(42, 'Hello'), (24, NULL)], 'key2' : [(NULL, ','), (42, 'world!')]}])$$)

┌─name─┬─type────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Array(Map(String, Array(Tuple(Nullable(Int64), Nullable(String))))) │              │                    │         │                  │                │
└──────┴─────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

If ClickHouse cannot determine the type, because the data contains only nulls, an exception will be thrown:

DESC format(Values, '([NULL, NULL])')

Code: 652. DB::Exception: Received from localhost:9000. DB::Exception:
Cannot determine type for column 'c1' by first 1 rows of data,
most likely this column contains only Nulls or empty Arrays/Maps.
...

Example with disabled setting input_format_tsv_use_best_effort_in_schema_inference:

SET input_format_tsv_use_best_effort_in_schema_inference = 0
DESC format(TSV, '[1,2,3]	42.42	Hello World!')

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(String) │              │                    │         │                  │                │
│ c2   │ Nullable(String) │              │                    │         │                  │                │
│ c3   │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

CustomSeparated

In CustomSeparated format ClickHouse first extracts all column values from the row according to specified delimiters and then tries to infer the data type for each value according to escaping rule.

If setting input_format_custom_detect_header is enabled, ClickHouse will try to detect the header with column names (and maybe types) while inferring schema. This setting is enabled by default.

Example

SET format_custom_row_before_delimiter = '<row_before_delimiter>',
       format_custom_row_after_delimiter = '<row_after_delimiter>\n',
       format_custom_row_between_delimiter = '<row_between_delimiter>\n',
       format_custom_result_before_delimiter = '<result_before_delimiter>\n',
       format_custom_result_after_delimiter = '<result_after_delimiter>\n',
       format_custom_field_delimiter = '<field_delimiter>',
       format_custom_escaping_rule = 'Quoted'

DESC format(CustomSeparated, $$<result_before_delimiter>
<row_before_delimiter>42.42<field_delimiter>'Some string 1'<field_delimiter>[1, NULL, 3]<row_after_delimiter>
<row_between_delimiter>
<row_before_delimiter>NULL<field_delimiter>'Some string 3'<field_delimiter>[1, 2, NULL]<row_after_delimiter>
<result_after_delimiter>
$$)

┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Float64)      │              │                    │         │                  │                │
│ c2   │ Nullable(String)       │              │                    │         │                  │                │
│ c3   │ Array(Nullable(Int64)) │              │                    │         │                  │                │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Example of header auto-detection (when input_format_custom_detect_header is enabled):

SET format_custom_row_before_delimiter = '<row_before_delimiter>',
       format_custom_row_after_delimiter = '<row_after_delimiter>\n',
       format_custom_row_between_delimiter = '<row_between_delimiter>\n',
       format_custom_result_before_delimiter = '<result_before_delimiter>\n',
       format_custom_result_after_delimiter = '<result_after_delimiter>\n',
       format_custom_field_delimiter = '<field_delimiter>',
       format_custom_escaping_rule = 'Quoted'

DESC format(CustomSeparated, $$<result_before_delimiter>
<row_before_delimiter>'number'<field_delimiter>'string'<field_delimiter>'array'<row_after_delimiter>
<row_between_delimiter>
<row_before_delimiter>42.42<field_delimiter>'Some string 1'<field_delimiter>[1, NULL, 3]<row_after_delimiter>
<row_between_delimiter>
<row_before_delimiter>NULL<field_delimiter>'Some string 3'<field_delimiter>[1, 2, NULL]<row_after_delimiter>
<result_after_delimiter>
$$)

┌─number─┬─string────────┬─array──────┐
│  42.42 │ Some string 1 │ [1,NULL,3] │
│   ᴺᵁᴸᴸ │ Some string 3 │ [1,2,NULL] │
└────────┴───────────────┴────────────┘

Template

In Template format ClickHouse first extracts all column values from the row according to the specified template and then tries to infer the data type for each value according to its escaping rule.

Example

Let's say we have a file resultset with the next content:

<result_before_delimiter>
${data}<result_after_delimiter>

And a file row_format with the next content:

<row_before_delimiter>${column_1:CSV}<field_delimiter_1>${column_2:Quoted}<field_delimiter_2>${column_3:JSON}<row_after_delimiter>

Then we can make the next queries:

SET format_template_rows_between_delimiter = '<row_between_delimiter>\n',
       format_template_row = 'row_format',
       format_template_resultset = 'resultset_format'

DESC format(Template, $$<result_before_delimiter>
<row_before_delimiter>42.42<field_delimiter_1>'Some string 1'<field_delimiter_2>[1, null, 2]<row_after_delimiter>
<row_between_delimiter>
<row_before_delimiter>\N<field_delimiter_1>'Some string 3'<field_delimiter_2>[1, 2, null]<row_after_delimiter>
<result_after_delimiter>
$$)

┌─name─────┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ column_1 │ Nullable(Float64)      │              │                    │         │                  │                │
│ column_2 │ Nullable(String)       │              │                    │         │                  │                │
│ column_3 │ Array(Nullable(Int64)) │              │                    │         │                  │                │
└──────────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Regexp

Similar to Template, in Regexp format ClickHouse first extracts all column values from the row according to specified regular expression and then tries to infer data type for each value according to the specified escaping rule.

Example

SET format_regexp = '^Line: value_1=(.+?), value_2=(.+?), value_3=(.+?)',
       format_regexp_escaping_rule = 'CSV'
       
DESC format(Regexp, $$Line: value_1=42, value_2="Some string 1", value_3="[1, NULL, 3]"
Line: value_1=2, value_2="Some string 2", value_3="[4, 5, NULL]"$$)

┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1   │ Nullable(Int64)        │              │                    │         │                  │                │
│ c2   │ Nullable(String)       │              │                    │         │                  │                │
│ c3   │ Array(Nullable(Int64)) │              │                    │         │                  │                │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Settings for text formats

input_format_max_rows_to_read_for_schema_inference/input_format_max_bytes_to_read_for_schema_inference

These settings control the amount of data to be read while schema inference. The more rows/bytes are read, the more time is spent on schema inference, but the greater the chance to correctly determine the types (especially when the data contains a lot of nulls).

Default values:

25000 for input_format_max_rows_to_read_for_schema_inference.
33554432 (32 Mb) for input_format_max_bytes_to_read_for_schema_inference.

column_names_for_schema_inference

The list of column names to use in schema inference for formats without explicit column names. Specified names will be used instead of default c1,c2,c3,.... The format: column1,column2,column3,....

Example

DESC format(TSV, 'Hello, World!	42	[1, 2, 3]') settings column_names_for_schema_inference = 'str,int,arr'

┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ str  │ Nullable(String)       │              │                    │         │                  │                │
│ int  │ Nullable(Int64)        │              │                    │         │                  │                │
│ arr  │ Array(Nullable(Int64)) │              │                    │         │                  │                │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

schema_inference_hints

The list of column names and types to use in schema inference instead of automatically determined types. The format: 'column_name1 column_type1, column_name2 column_type2, ...'. This setting can be used to specify the types of columns that could not be determined automatically or for optimizing the schema.

Example

DESC format(JSONEachRow, '{"id" : 1, "age" : 25, "name" : "Josh", "status" : null, "hobbies" : ["football", "cooking"]}') SETTINGS schema_inference_hints = 'age LowCardinality(UInt8), status Nullable(String)', allow_suspicious_low_cardinality_types=1

┌─name────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id      │ Nullable(Int64)         │              │                    │         │                  │                │
│ age     │ LowCardinality(UInt8)   │              │                    │         │                  │                │
│ name    │ Nullable(String)        │              │                    │         │                  │                │
│ status  │ Nullable(String)        │              │                    │         │                  │                │
│ hobbies │ Array(Nullable(String)) │              │                    │         │                  │                │
└─────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

schema_inference_make_columns_nullable

Controls making inferred types Nullable in schema inference for formats without information about nullability. If the setting is enabled, all inferred type will be Nullable, if disabled, the inferred type will be Nullable only if input_format_null_as_default is disabled and the column contains NULL in a sample that is parsed during schema inference.

Enabled by default.

Examples

SET schema_inference_make_columns_nullable = 1
DESC format(JSONEachRow, $$
                                {"id" :  1, "age" :  25, "name" : "Josh", "status" : null, "hobbies" : ["football", "cooking"]}
                                {"id" :  2, "age" :  19, "name" :  "Alan", "status" : "married", "hobbies" :  ["tennis", "art"]}
                         $$)

┌─name────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id      │ Nullable(Int64)         │              │                    │         │                  │                │
│ age     │ Nullable(Int64)         │              │                    │         │                  │                │
│ name    │ Nullable(String)        │              │                    │         │                  │                │
│ status  │ Nullable(String)        │              │                    │         │                  │                │
│ hobbies │ Array(Nullable(String)) │              │                    │         │                  │                │
└─────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

SET schema_inference_make_columns_nullable = 0;
SET input_format_null_as_default = 0;    
DESC format(JSONEachRow, $$
                                {"id" :  1, "age" :  25, "name" : "Josh", "status" : null, "hobbies" : ["football", "cooking"]}
                                {"id" :  2, "age" :  19, "name" :  "Alan", "status" : "married", "hobbies" :  ["tennis", "art"]}
                         $$)


┌─name────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id      │ Int64            │              │                    │         │                  │                │
│ age     │ Int64            │              │                    │         │                  │                │
│ name    │ String           │              │                    │         │                  │                │
│ status  │ Nullable(String) │              │                    │         │                  │                │
│ hobbies │ Array(String)    │              │                    │         │                  │                │
└─────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

SET schema_inference_make_columns_nullable = 0;
SET input_format_null_as_default = 1;    
DESC format(JSONEachRow, $$
                                {"id" :  1, "age" :  25, "name" : "Josh", "status" : null, "hobbies" : ["football", "cooking"]}
                                {"id" :  2, "age" :  19, "name" :  "Alan", "status" : "married", "hobbies" :  ["tennis", "art"]}
                         $$)


┌─name────┬─type──────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id      │ Int64         │              │                    │         │                  │                │
│ age     │ Int64         │              │                    │         │                  │                │
│ name    │ String        │              │                    │         │                  │                │
│ status  │ String        │              │                    │         │                  │                │
│ hobbies │ Array(String) │              │                    │         │                  │                │
└─────────┴───────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

input_format_try_infer_integers

If enabled, ClickHouse will try to infer integers instead of floats in schema inference for text formats. If all numbers in the column from sample data are integers, the result type will be Int64, if at least one number is float, the result type will be Float64. If the sample data contains only integers and at least one integer is positive and overflows Int64, ClickHouse will infer UInt64.

Enabled by default.

Examples

SET input_format_try_infer_integers = 0
DESC format(JSONEachRow, $$
                                {"number" : 1}
                                {"number" : 2}
                         $$)

┌─name───┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ Nullable(Float64) │              │                    │         │                  │                │
└────────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

SET input_format_try_infer_integers = 1
DESC format(JSONEachRow, $$
                                {"number" : 1}
                                {"number" : 2}
                         $$)

┌─name───┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ Nullable(Int64) │              │                    │         │                  │                │
└────────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

DESC format(JSONEachRow, $$
                                {"number" : 1}
                                {"number" : 18446744073709551615}
                         $$)

┌─name───┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ Nullable(UInt64) │              │                    │         │                  │                │
└────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

DESC format(JSONEachRow, $$
                                {"number" : 1}
                                {"number" : 2.2}
                         $$)

┌─name───┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ Nullable(Float64) │              │                    │         │                  │                │
└────────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

input_format_try_infer_datetimes

If enabled, ClickHouse will try to infer type DateTime64 from string fields in schema inference for text formats. If all fields from a column in sample data were successfully parsed as datetimes, the result type will be DateTime64(9), if at least one field was not parsed as datetime, the result type will be String.

Enabled by default.

Examples

SET input_format_try_infer_datetimes = 0
DESC format(JSONEachRow, $$
                                {"datetime" : "2021-01-01 00:00:00.000"}
                                {"datetime" : "2022-01-01 00:00:00.000"}
                         $$)

┌─name─────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ datetime │ Nullable(String) │              │                    │         │                  │                │
└──────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

SET input_format_try_infer_datetimes = 1
DESC format(JSONEachRow, $$
                                {"datetime" : "2021-01-01 00:00:00.000"}
                                {"datetime" : "2022-01-01 00:00:00.000"}
                         $$)

┌─name─────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ datetime │ Nullable(DateTime64(9)) │              │                    │         │                  │                │
└──────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

DESC format(JSONEachRow, $$
                                {"datetime" : "2021-01-01 00:00:00.000"}
                                {"datetime" : "unknown"}
                         $$)

┌─name─────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ datetime │ Nullable(String) │              │                    │         │                  │                │
└──────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Note: Parsing datetimes during schema inference respect setting date_time_input_format

input_format_try_infer_dates

If enabled, ClickHouse will try to infer type Date from string fields in schema inference for text formats. If all fields from a column in sample data were successfully parsed as dates, the result type will be Date, if at least one field was not parsed as date, the result type will be String.

Enabled by default.

Examples

SET input_format_try_infer_datetimes = 0, input_format_try_infer_dates = 0
DESC format(JSONEachRow, $$
                                {"date" : "2021-01-01"}
                                {"date" : "2022-01-01"}
                         $$)

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

SET input_format_try_infer_dates = 1
DESC format(JSONEachRow, $$
                                {"date" : "2021-01-01"}
                                {"date" : "2022-01-01"}
                         $$)

┌─name─┬─type───────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(Date) │              │                    │         │                  │                │
└──────┴────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

DESC format(JSONEachRow, $$
                                {"date" : "2021-01-01"}
                                {"date" : "unknown"}
                         $$)

┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(String) │              │                    │         │                  │                │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Self describing formats

Self-describing formats contain information about the structure of the data in the data itself, it can be some header with a description, a binary type tree, or some kind of table. To automatically infer a schema from files in such formats, ClickHouse reads a part of the data containing information about the types and converts it into a schema of the ClickHouse table.

Formats with -WithNamesAndTypes suffix

ClickHouse supports some text formats with the suffix -WithNamesAndTypes. This suffix means that the data contains two additional rows with column names and types before the actual data. While schema inference for such formats, ClickHouse reads the first two rows and extracts column names and types.

Example

DESC format(TSVWithNamesAndTypes,
$$num	str	arr
UInt8	String	Array(UInt8)
42	Hello, World!	[1,2,3]
$$)

┌─name─┬─type─────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ num  │ UInt8        │              │                    │         │                  │                │
│ str  │ String       │              │                    │         │                  │                │
│ arr  │ Array(UInt8) │              │                    │         │                  │                │
└──────┴──────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

JSON formats with metadata

Some JSON input formats (JSON, JSONCompact, JSONColumnsWithMetadata) contain metadata with column names and types. In schema inference for such formats, ClickHouse reads this metadata.

Example

DESC format(JSON, $$
{
	"meta":
	[
		{
			"name": "num",
			"type": "UInt8"
		},
		{
			"name": "str",
			"type": "String"
		},
		{
			"name": "arr",
			"type": "Array(UInt8)"
		}
	],

	"data":
	[
		{
			"num": 42,
			"str": "Hello, World",
			"arr": [1,2,3]
		}
	],

	"rows": 1,

	"statistics":
	{
		"elapsed": 0.005723915,
		"rows_read": 1,
		"bytes_read": 1
	}
}
$$)

┌─name─┬─type─────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ num  │ UInt8        │              │                    │         │                  │                │
│ str  │ String       │              │                    │         │                  │                │
│ arr  │ Array(UInt8) │              │                    │         │                  │                │
└──────┴──────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Avro

In Avro format ClickHouse reads its schema from the data and converts it to ClickHouse schema using the following type matches:

Avro data type	ClickHouse data type
`boolean`	Bool
`int`	Int32
`int (date)` *	Date32
`long`	Int64
`float`	Float32
`double`	Float64
`bytes`, `string`	String
`fixed`	FixedString(N)
`enum`	Enum
`array(T)`	Array(T)
`union(null, T)`, `union(T, null)`	Nullable(T)
`null`	Nullable(Nothing)
`string (uuid)` *	UUID
`binary (decimal)` *	Decimal(P, S)

* Avro logical types

Other Avro types are not supported.

Parquet

In Parquet format ClickHouse reads its schema from the data and converts it to ClickHouse schema using the following type matches:

Parquet data type	ClickHouse data type
`BOOL`	Bool
`UINT8`	UInt8
`INT8`	Int8
`UINT16`	UInt16
`INT16`	Int16
`UINT32`	UInt32
`INT32`	Int32
`UINT64`	UInt64
`INT64`	Int64
`FLOAT`	Float32
`DOUBLE`	Float64
`DATE`	Date32
`TIME (ms)`	DateTime
`TIMESTAMP`, `TIME (us, ns)`	DateTime64
`STRING`, `BINARY`	String
`DECIMAL`	Decimal
`LIST`	Array
`STRUCT`	Tuple
`MAP`	Map

Other Parquet types are not supported. By default, all inferred types are inside Nullable, but it can be changed using the setting schema_inference_make_columns_nullable.

Arrow

In Arrow format ClickHouse reads its schema from the data and converts it to ClickHouse schema using the following type matches:

Arrow data type	ClickHouse data type
`BOOL`	Bool
`UINT8`	UInt8
`INT8`	Int8
`UINT16`	UInt16
`INT16`	Int16
`UINT32`	UInt32
`INT32`	Int32
`UINT64`	UInt64
`INT64`	Int64
`FLOAT`, `HALF_FLOAT`	Float32
`DOUBLE`	Float64
`DATE32`	Date32
`DATE64`	DateTime
`TIMESTAMP`, `TIME32`, `TIME64`	DateTime64
`STRING`, `BINARY`	String
`DECIMAL128`, `DECIMAL256`	Decimal
`LIST`	Array
`STRUCT`	Tuple
`MAP`	Map

Other Arrow types are not supported. By default, all inferred types are inside Nullable, but it can be changed using the setting schema_inference_make_columns_nullable.

ORC

In ORC format ClickHouse reads its schema from the data and converts it to ClickHouse schema using the following type matches:

ORC data type	ClickHouse data type
`Boolean`	Bool
`Tinyint`	Int8
`Smallint`	Int16
`Int`	Int32
`Bigint`	Int64
`Float`	Float32
`Double`	Float64
`Date`	Date32
`Timestamp`	DateTime64
`String`, `Char`, `Varchar`,`BINARY`	String
`Decimal`	Decimal
`List`	Array
`Struct`	Tuple
`Map`	Map

Other ORC types are not supported. By default, all inferred types are inside Nullable, but it can be changed using the setting schema_inference_make_columns_nullable.

Native

Native format is used inside ClickHouse and contains the schema in the data. In schema inference, ClickHouse reads the schema from the data without any transformations.

Formats with external schema

Such formats require a schema describing the data in a separate file in a specific schema language. To automatically infer a schema from files in such formats, ClickHouse reads external schema from a separate file and transforms it to a ClickHouse table schema.

Protobuf

In schema inference for Protobuf format ClickHouse uses the following type matches:

Protobuf data type	ClickHouse data type
`bool`	UInt8
`float`	Float32
`double`	Float64
`int32`, `sint32`, `sfixed32`	Int32
`int64`, `sint64`, `sfixed64`	Int64
`uint32`, `fixed32`	UInt32
`uint64`, `fixed64`	UInt64
`string`, `bytes`	String
`enum`	Enum
`repeated T`	Array(T)
`message`, `group`	Tuple

CapnProto

In schema inference for CapnProto format ClickHouse uses the following type matches:

CapnProto data type	ClickHouse data type
`Bool`	UInt8
`Int8`	Int8
`UInt8`	UInt8
`Int16`	Int16
`UInt16`	UInt16
`Int32`	Int32
`UInt32`	UInt32
`Int64`	Int64
`UInt64`	UInt64
`Float32`	Float32
`Float64`	Float64
`Text`, `Data`	String
`enum`	Enum
`List`	Array
`struct`	Tuple
`union(T, Void)`, `union(Void, T)`	Nullable(T)

Strong-typed binary formats

In such formats, each serialized value contains information about its type (and possibly about its name), but there is no information about the whole table. In schema inference for such formats, ClickHouse reads data row by row (up to input_format_max_rows_to_read_for_schema_inference rows or input_format_max_bytes_to_read_for_schema_inference bytes) and extracts the type (and possibly name) for each value from the data and then converts these types to ClickHouse types.

MsgPack

In MsgPack format there is no delimiter between rows, to use schema inference for this format you should specify the number of columns in the table using the setting input_format_msgpack_number_of_columns. ClickHouse uses the following type matches:

MessagePack data type (`INSERT`)	ClickHouse data type
`int N`, `uint N`, `negative fixint`, `positive fixint`	Int64
`bool`	UInt8
`fixstr`, `str 8`, `str 16`, `str 32`, `bin 8`, `bin 16`, `bin 32`	String
`float 32`	Float32
`float 64`	Float64
`uint 16`	Date
`uint 32`	DateTime
`uint 64`	DateTime64
`fixarray`, `array 16`, `array 32`	Array
`fixmap`, `map 16`, `map 32`	Map

By default, all inferred types are inside Nullable, but it can be changed using the setting schema_inference_make_columns_nullable.

BSONEachRow

In BSONEachRow each row of data is presented as a BSON document. In schema inference ClickHouse reads BSON documents one by one and extracts values, names, and types from the data and then transforms these types to ClickHouse types using the following type matches:

BSON Type	ClickHouse type
`\x08` boolean	Bool
`\x10` int32	Int32
`\x12` int64	Int64
`\x01` double	Float64
`\x09` datetime	DateTime64
`\x05` binary with`\x00` binary subtype, `\x02` string, `\x0E` symbol, `\x0D` JavaScript code	String
`\x07` ObjectId,	FixedString(12)
`\x05` binary with `\x04` uuid subtype, size = 16	UUID
`\x04` array	Array/Tuple (if nested types are different)
`\x03` document	Named Tuple/Map (with String keys)

By default, all inferred types are inside Nullable, but it can be changed using the setting schema_inference_make_columns_nullable.

Formats with constant schema

Data in such formats always have the same schema.

LineAsString

In this format, ClickHouse reads the whole line from the data into a single column with String data type. The inferred type for this format is always String and the column name is line.

Example

DESC format(LineAsString, 'Hello\nworld!')

┌─name─┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ line │ String │              │                    │         │                  │                │
└──────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

JSONAsString

In this format, ClickHouse reads the whole JSON object from the data into a single column with String data type. The inferred type for this format is always String and the column name is json.

Example

DESC format(JSONAsString, '{"x" : 42, "y" : "Hello, World!"}')

┌─name─┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ json │ String │              │                    │         │                  │                │
└──────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

JSONAsObject

In this format, ClickHouse reads the whole JSON object from the data into a single column with Object('json') data type. Inferred type for this format is always String and the column name is json.

Note: This format works only if allow_experimental_object_type is enabled.

Example

DESC format(JSONAsString, '{"x" : 42, "y" : "Hello, World!"}') SETTINGS allow_experimental_object_type=1

┌─name─┬─type───────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ json │ Object('json') │              │                    │         │                  │                │
└──────┴────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Schema inference modes

Schema inference from the set of data files can work in 2 different modes: default and union. The mode is controlled by the setting schema_inference_mode.

Default mode

In default mode, ClickHouse assumes that all files have the same schema and tries to infer the schema by reading files one by one until it succeeds.

Example:

Let's say we have 3 files data1.jsonl, data2.jsonl and data3.jsonl with the next content:

data1.jsonl:

{"field1" :  1, "field2" :  null}
{"field1" :  2, "field2" :  null}
{"field1" :  3, "field2" :  null}

data2.jsonl:

{"field1" :  4, "field2" :  "Data4"}
{"field1" :  5, "field2" :  "Data5"}
{"field1" :  6, "field2" :  "Data5"}

data3.jsonl:

{"field1" :  7, "field2" :  "Data7", "field3" :  [1, 2, 3]}
{"field1" :  8, "field2" :  "Data8", "field3" :  [4, 5, 6]}
{"field1" :  9, "field2" :  "Data9", "field3" :  [7, 8, 9]}

Let's try to use schema inference on these 3 files:

:) DESCRIBE file('data{1,2,3}.jsonl') SETTINGS schema_inference_mode='default'

Result:

┌─name───┬─type─────────────┐
│ field1 │ Nullable(Int64)  │
│ field2 │ Nullable(String) │
└────────┴──────────────────┘

As we can see, we don't have field3 from file data3.jsonl. It happens because ClickHouse first tried to infer schema from file data1.jsonl, failed because of only nulls for field field2, and then tried to infer schema from data2.jsonl and succeeded, so data from file data3.jsonl wasn't read.

Union mode

In union mode, ClickHouse assumes that files can have different schemas, so it infer schemas of all files and then union them to the common schema.

Let's say we have 3 files data1.jsonl, data2.jsonl and data3.jsonl with the next content:

data1.jsonl:

{"field1" :  1}
{"field1" :  2}
{"field1" :  3}

data2.jsonl:

{"field2" :  "Data4"}
{"field2" :  "Data5"}
{"field2" :  "Data5"}

data3.jsonl:

{"field3" :  [1, 2, 3]}
{"field3" :  [4, 5, 6]}
{"field3" :  [7, 8, 9]}

Let's try to use schema inference on these 3 files:

:) DESCRIBE file('data{1,2,3}.jsonl') SETTINGS schema_inference_mode='union'

Result:

┌─name───┬─type───────────────────┐
│ field1 │ Nullable(Int64)        │
│ field2 │ Nullable(String)       │
│ field3 │ Array(Nullable(Int64)) │
└────────┴────────────────────────┘

As we can see, we have all fields from all files.

Note:

As some of the files may not contain some columns from the resulting schema, union mode is supported only for formats that support reading subset of columns (like JSONEachRow, Parquet, TSVWithNames, etc) and won't work for other formats (like CSV, TSV, JSONCompactEachRow, etc).
If ClickHouse cannot infer the schema from one of the files, the exception will be thrown.
If you have a lot of files, reading schema from all of them can take a lot of time.

132 KiB Raw Blame History

Usage

Table functions file, s3, url, hdfs.

Table engines File, S3, URL, HDFS

clickhouse-local

Using structure from insertion table

Schema inference cache

Text formats

JSON formats

JSON settings

input_format_json_try_infer_numbers_from_strings

input_format_json_try_infer_named_tuples_from_objects

input_format_json_read_objects_as_strings

input_format_json_read_numbers_as_strings

input_format_json_read_bools_as_numbers

input_format_json_read_bools_as_strings

input_format_json_read_arrays_as_strings

input_format_json_infer_incomplete_types_as_strings

CSV

CSV settings

input_format_csv_try_infer_numbers_from_strings

TSV/TSKV

Values

CustomSeparated

Template

Regexp

Settings for text formats

input_format_max_rows_to_read_for_schema_inference/input_format_max_bytes_to_read_for_schema_inference

column_names_for_schema_inference

schema_inference_hints

schema_inference_make_columns_nullable

input_format_try_infer_integers

input_format_try_infer_datetimes

input_format_try_infer_dates

Self describing formats

Formats with -WithNamesAndTypes suffix

JSON formats with metadata

Avro

Parquet

Arrow

ORC

Native

Formats with external schema

Protobuf

CapnProto

Strong-typed binary formats

MsgPack

BSONEachRow

Formats with constant schema

LineAsString

JSONAsString

JSONAsObject

Schema inference modes

Default mode

Union mode

132 KiB

Raw Blame History