ClickHouse/docs/en/interfaces/schema-inference.md

1604 lines
110 KiB
Markdown
Raw Normal View History

---
slug: /en/interfaces/schema-inference
sidebar_position: 21
sidebar_label: Schema inference
title: Automatic schema inference from input data
---
ClickHouse can automatically determine the structure of input data in almost all supported [Input formats](formats.md).
This document will describe when schema inference is used, how it works with different input formats and which settings
can control it.
# Table of Contents
- [Usage](#usage)
- [Using structure from insertion table](#using-structure-from-insertion-table)
- [Schema inference cache](#schema-inference-cache)
- [Text formats](#text-formats)
- [JSON formats](#json-formats)
- [JSON settings](#json-settings)
- [CSV](#csv)
- [TSV/TSKV](#tsv-tskv)
- [Values](#values)
- [CustomSeparated](#custom-separated)
- [Template](#template)
- [Regexp](#regexp)
- [Settings for text formats](#settings-for-text-formats)
- [Self describing formats](#self-describing-formats)
- [Formats with -WithNamesAndTypes suffix](#formats-with-names-and-types)
- [JSON formats with metadata](#json-with-metadata)
- [Avro](#avro)
- [Parquet](#parquet)
- [Arrow](#arrow)
- [ORC](#orc)
- [Native](#native)
- [Formats with external schema](#formats-with-external-schema)
- [Protobuf](#protobuf)
- [CapnProto](#capnproto)
- [Strong-typed binary formats](#strong-typed-binary-formats)
- [MsgPack](#msgpack)
- [BSONEachRow](#bsoneachrow)
- [Formats with constant schema](#formats-with-constant-schema)
- [LineAsString](#line-as-string)
- [JSONAsString](#json-as-string)
- [JSONAsObject](#json-as-object)
# Usage {#usage}
Schema inference is used when ClickHouse needs to read the data in a specific data format and the structure is unknown.
## Table functions [file](../sql-reference/table-functions/file.md), [s3](../sql-reference/table-functions/s3.md), [url](../sql-reference/table-functions/url.md), [hdfs](../sql-reference/table-functions/hdfs.md).
2022-12-30 03:34:25 +00:00
These table functions have the optional argument `structure` with the structure of input data. If this argument is not specified or set to `auto`, the structure will be inferred from the data.
**Example:**
2022-12-30 03:34:25 +00:00
Let's say we have a file `hobbies.jsonl` in JSONEachRow format in the `user_files` directory with this content:
```json
{"id" : 1, "age" : 25, "name" : "Josh", "hobbies" : ["football", "cooking", "music"]}
{"id" : 2, "age" : 19, "name" : "Alan", "hobbies" : ["tennis", "art"]}
2022-12-30 03:34:25 +00:00
{"id" : 3, "age" : 32, "name" : "Lana", "hobbies" : ["fitness", "reading", "shopping"]}
{"id" : 4, "age" : 47, "name" : "Brayan", "hobbies" : ["movies", "skydiving"]}
```
2022-12-30 03:34:25 +00:00
ClickHouse can read this data without you specifying its structure:
```sql
2022-12-30 14:53:35 +00:00
SELECT * FROM file('hobbies.jsonl')
2022-12-30 03:34:25 +00:00
```
```response
┌─id─┬─age─┬─name───┬─hobbies──────────────────────────┐
│ 1 │ 25 │ Josh │ ['football','cooking','music'] │
│ 2 │ 19 │ Alan │ ['tennis','art'] │
2022-12-30 14:53:35 +00:00
│ 3 │ 32 │ Lana │ ['fitness','reading','shopping'] │
│ 4 │ 47 │ Brayan │ ['movies','skydiving'] │
└────┴─────┴────────┴──────────────────────────────────┘
```
2022-12-30 14:21:12 +00:00
Note: the format `JSONEachRow` was automatically determined by the file extension `.jsonl`.
2022-12-30 03:34:25 +00:00
You can see an automatically determined structure using the `DESCRIBE` query:
```sql
2022-12-30 03:34:25 +00:00
DESCRIBE file('hobbies.jsonl')
```
```response
┌─name────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id │ Nullable(Int64) │ │ │ │ │ │
│ age │ Nullable(Int64) │ │ │ │ │ │
│ name │ Nullable(String) │ │ │ │ │ │
│ hobbies │ Array(Nullable(String)) │ │ │ │ │ │
└─────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## Table engines [File](../engines/table-engines/special/file.md), [S3](../engines/table-engines/integrations/s3.md), [URL](../engines/table-engines/special/url.md), [HDFS](../engines/table-engines/integrations/hdfs.md)
If the list of columns is not specified in `CREATE TABLE` query, the structure of the table will be inferred automatically from the data.
**Example:**
2022-12-30 14:21:12 +00:00
Let's use the file `hobbies.jsonl`. We can create a table with engine `File` with the data from this file:
```sql
2022-12-30 14:53:35 +00:00
CREATE TABLE hobbies ENGINE=File(JSONEachRow, 'hobbies.jsonl')
```
```response
Ok.
2022-12-30 14:53:35 +00:00
```
```sql
SELECT * FROM hobbies
```
```response
┌─id─┬─age─┬─name───┬─hobbies──────────────────────────┐
│ 1 │ 25 │ Josh │ ['football','cooking','music'] │
│ 2 │ 19 │ Alan │ ['tennis','art'] │
2022-12-30 14:53:35 +00:00
│ 3 │ 32 │ Lana │ ['fitness','reading','shopping'] │
│ 4 │ 47 │ Brayan │ ['movies','skydiving'] │
└────┴─────┴────────┴──────────────────────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
DESCRIBE TABLE hobbies
```
```response
┌─name────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id │ Nullable(Int64) │ │ │ │ │ │
│ age │ Nullable(Int64) │ │ │ │ │ │
│ name │ Nullable(String) │ │ │ │ │ │
│ hobbies │ Array(Nullable(String)) │ │ │ │ │ │
└─────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## clickhouse-local
2022-12-30 03:34:25 +00:00
`clickhouse-local` has an optional parameter `-S/--structure` with the structure of input data. If this parameter is not specified or set to `auto`, the structure will be inferred from the data.
**Example:**
2022-12-30 14:21:12 +00:00
Let's use the file `hobbies.jsonl`. We can query the data from this file using `clickhouse-local`:
```shell
2022-12-30 14:53:35 +00:00
clickhouse-local --file='hobbies.jsonl' --table='hobbies' --query='DESCRIBE TABLE hobbies'
```
```response
id Nullable(Int64)
age Nullable(Int64)
name Nullable(String)
hobbies Array(Nullable(String))
2022-12-30 14:53:35 +00:00
```
```shell
clickhouse-local --file='hobbies.jsonl' --table='hobbies' --query='SELECT * FROM hobbies'
```
```response
1 25 Josh ['football','cooking','music']
2 19 Alan ['tennis','art']
2022-12-30 14:53:35 +00:00
3 32 Lana ['fitness','reading','shopping']
4 47 Brayan ['movies','skydiving']
```
# Using structure from insertion table {#using-structure-from-insertion-table}
2022-12-29 21:21:37 +00:00
When table functions `file/s3/url/hdfs` are used to insert data into a table,
there is an option to use the structure from the insertion table instead of extracting it from the data.
2022-12-30 03:34:25 +00:00
It can improve insertion performance because schema inference can take some time. Also, it will be helpful when the table has an optimized schema, so
no conversions between types will be performed.
2022-12-30 03:34:25 +00:00
There is a special setting [use_structure_from_insertion_table_in_table_functions](/docs/en/operations/settings/settings.md/#use_structure_from_insertion_table_in_table_functions)
that controls this behaviour. It has 3 possible values:
- 0 - table function will extract the structure from the data.
2022-12-30 03:34:25 +00:00
- 1 - table function will use the structure from the insertion table.
- 2 - ClickHouse will automatically determine if it's possible to use the structure from the insertion table or use schema inference. Default value.
**Example 1:**
Let's create table `hobbies1` with the next structure:
```sql
CREATE TABLE hobbies1
(
`id` UInt64,
`age` LowCardinality(UInt8),
`name` String,
`hobbies` Array(String)
)
ENGINE = MergeTree
ORDER BY id;
```
And insert data from the file `hobbies.jsonl`:
```sql
INSERT INTO hobbies1 SELECT * FROM file(hobbies.jsonl)
```
2022-12-30 03:34:25 +00:00
In this case, all columns from the file are inserted into the table without changes, so ClickHouse will use the structure from the insertion table instead of schema inference.
**Example 2:**
Let's create table `hobbies2` with the next structure:
```sql
CREATE TABLE hobbies2
(
`id` UInt64,
`age` LowCardinality(UInt8),
`hobbies` Array(String)
)
ENGINE = MergeTree
ORDER BY id;
```
And insert data from the file `hobbies.jsonl`:
```sql
INSERT INTO hobbies2 SELECT id, age, hobbies FROM file(hobbies.jsonl)
```
2022-12-30 03:34:25 +00:00
In this case, all columns in the `SELECT` query are present in the table, so ClickHouse will use the structure from the insertion table.
Note that it will work only for input formats that support reading a subset of columns like JSONEachRow, TSKV, Parquet, etc. (so it won't work for example for TSV format).
**Example 3:**
Let's create table `hobbies3` with the next structure:
```sql
CREATE TABLE hobbies3
(
2022-12-29 21:21:37 +00:00
`identifier` UInt64,
`age` LowCardinality(UInt8),
`hobbies` Array(String)
)
ENGINE = MergeTree
2022-12-29 21:21:37 +00:00
ORDER BY identifier;
```
And insert data from the file `hobbies.jsonl`:
```sql
INSERT INTO hobbies3 SELECT id, age, hobbies FROM file(hobbies.jsonl)
```
2022-12-30 03:34:25 +00:00
In this case, column `id` is used in the `SELECT` query, but the table doesn't have this column (it has a column with the name `identifier`),
2022-12-29 21:21:37 +00:00
so ClickHouse cannot use the structure from the insertion table, and schema inference will be used.
**Example 4:**
Let's create table `hobbies4` with the next structure:
```sql
CREATE TABLE hobbies4
(
`id` UInt64,
`any_hobby` Nullable(String)
)
ENGINE = MergeTree
ORDER BY id;
```
And insert data from the file `hobbies.jsonl`:
```sql
2022-12-29 21:21:37 +00:00
INSERT INTO hobbies4 SELECT id, empty(hobbies) ? NULL : hobbies[1] FROM file(hobbies.jsonl)
```
2022-12-30 03:34:25 +00:00
In this case, there are some operations performed on the column `hobbies` in the `SELECT` query to insert it into the table, so ClickHouse cannot use the structure from the insertion table, and schema inference will be used.
# Schema inference cache {#schema-inference-cache}
For most input formats schema inference reads some data to determine its structure and this process can take some time.
To prevent inferring the same schema every time ClickHouse read the data from the same file, the inferred schema is cached and when accessing the same file again, ClickHouse will use the schema from the cache.
2022-12-30 03:34:25 +00:00
There are special settings that control this cache:
- `schema_inference_cache_max_elements_for_{file/s3/hdfs/url}` - the maximum number of cached schemas for the corresponding table function. The default value is `4096`. These settings should be set in the server config.
- `use_cache_for_{file,s3,hdfs,url}_schema_inference` - allows turning on/off using cache for schema inference. These settings can be used in queries.
The schema of the file can be changed by modifying the data or by changing format settings.
2022-12-30 14:21:12 +00:00
For this reason, the schema inference cache identifies the schema by file source, format name, used format settings, and the last modification time of the file.
2022-12-30 14:21:12 +00:00
Note: some files accessed by url in `url` table function may not contain information about the last modification time; for this case, there is a special setting
`schema_inference_cache_require_modification_time_for_url`. Disabling this setting allows the use of the schema from cache without the last modification time for such files.
There is also a system table [schema_inference_cache](../operations/system-tables/schema_inference_cache.md) with all current schemas in cache and system query `SYSTEM DROP SCHEMA CACHE [FOR File/S3/URL/HDFS]`
2022-12-30 03:34:25 +00:00
that allows cleaning the schema cache for all sources, or for a specific source.
**Examples:**
2022-12-30 03:34:25 +00:00
Let's try to infer the structure of a sample dataset from s3 `github-2022.ndjson.gz` and see how the schema inference cache works:
```sql
2022-12-30 14:53:35 +00:00
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022.ndjson.gz')
SETTINGS allow_experimental_object_type = 1
2022-12-30 14:53:35 +00:00
```
```response
┌─name───────┬─type─────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ type │ Nullable(String) │ │ │ │ │ │
│ actor │ Object(Nullable('json')) │ │ │ │ │ │
│ repo │ Object(Nullable('json')) │ │ │ │ │ │
│ created_at │ Nullable(String) │ │ │ │ │ │
│ payload │ Object(Nullable('json')) │ │ │ │ │ │
└────────────┴──────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
5 rows in set. Elapsed: 0.601 sec.
2022-12-30 14:53:35 +00:00
```
```sql
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022.ndjson.gz')
SETTINGS allow_experimental_object_type = 1
2022-12-30 14:53:35 +00:00
```
```response
┌─name───────┬─type─────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ type │ Nullable(String) │ │ │ │ │ │
│ actor │ Object(Nullable('json')) │ │ │ │ │ │
│ repo │ Object(Nullable('json')) │ │ │ │ │ │
│ created_at │ Nullable(String) │ │ │ │ │ │
│ payload │ Object(Nullable('json')) │ │ │ │ │ │
└────────────┴──────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
5 rows in set. Elapsed: 0.059 sec.
```
As you can see, the second query succeeded almost instantly.
Let's try to change some settings that can affect inferred schema:
```sql
2022-12-30 14:53:35 +00:00
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022.ndjson.gz')
SETTINGS input_format_json_read_objects_as_strings = 1
┌─name───────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ type │ Nullable(String) │ │ │ │ │ │
│ actor │ Nullable(String) │ │ │ │ │ │
│ repo │ Nullable(String) │ │ │ │ │ │
│ created_at │ Nullable(String) │ │ │ │ │ │
│ payload │ Nullable(String) │ │ │ │ │ │
└────────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
5 rows in set. Elapsed: 0.611 sec
```
As you can see, the schema from the cache was not used for the same file, because the setting that can affect inferred schema was changed.
Let's check the content of `system.schema_inference_cache` table:
```sql
2022-12-30 03:34:25 +00:00
SELECT schema, format, source FROM system.schema_inference_cache WHERE storage='S3'
```
```response
2022-12-29 21:21:37 +00:00
┌─schema──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─format─┬─source───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ type Nullable(String), actor Object(Nullable('json')), repo Object(Nullable('json')), created_at Nullable(String), payload Object(Nullable('json')) │ NDJSON │ datasets-documentation.s3.eu-west-3.amazonaws.com443/datasets-documentation/github/github-2022.ndjson.gz │
│ type Nullable(String), actor Nullable(String), repo Nullable(String), created_at Nullable(String), payload Nullable(String) │ NDJSON │ datasets-documentation.s3.eu-west-3.amazonaws.com443/datasets-documentation/github/github-2022.ndjson.gz │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```
2022-12-29 21:21:37 +00:00
As you can see, there are two different schemas for the same file.
2022-12-30 03:34:25 +00:00
We can clear the schema cache using a system query:
```sql
2022-12-30 03:34:25 +00:00
SYSTEM DROP SCHEMA CACHE FOR S3
```
```response
Ok.
2022-12-30 03:34:25 +00:00
```
```sql
SELECT count() FROM system.schema_inference_cache WHERE storage='S3'
```
```response
┌─count()─┐
│ 0 │
└─────────┘
```
# Text formats {#text-formats}
2022-12-30 03:34:25 +00:00
For text formats, ClickHouse reads the data row by row, extracts column values according to the format,
2022-12-29 21:21:37 +00:00
and then uses some recursive parsers and heuristics to determine the type for each value. The maximum number of rows read from the data in schema inference
2022-12-29 20:42:03 +00:00
is controlled by the setting `input_format_max_rows_to_read_for_schema_inference` with default value 25000.
2022-12-30 03:34:25 +00:00
By default, all inferred types are [Nullable](../sql-reference/data-types/nullable.md), but you can change this by setting `schema_inference_make_columns_nullable` (see examples in the [settings](#settings-for-text-formats) section).
## JSON formats {#json-formats}
2022-12-30 03:34:25 +00:00
In JSON formats ClickHouse parses values according to the JSON specification and then tries to find the most appropriate data type for them.
Let's see how it works, what types can be inferred and what specific settings can be used in JSON formats.
**Examples**
2022-12-30 03:34:25 +00:00
Here and further, the [format](../sql-reference/table-functions/format.md) table function will be used in examples.
Integers, Floats, Bools, Strings:
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, '{"int" : 42, "float" : 42.42, "string" : "Hello, World!"}');
```
```response
┌─name───┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ int │ Nullable(Int64) │ │ │ │ │ │
│ float │ Nullable(Float64) │ │ │ │ │ │
│ bool │ Nullable(Bool) │ │ │ │ │ │
│ string │ Nullable(String) │ │ │ │ │ │
└────────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Dates, DateTimes:
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, '{"date" : "2022-01-01", "datetime" : "2022-01-01 00:00:00"}')
```
```response
┌─name─────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(Date) │ │ │ │ │ │
│ datetime │ Nullable(DateTime64(9)) │ │ │ │ │ │
└──────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Arrays:
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, '{"arr" : [1, 2, 3], "nested_arrays" : [[1, 2, 3], [4, 5, 6], []]}')
```
```response
┌─name──────────┬─type──────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ arr │ Array(Nullable(Int64)) │ │ │ │ │ │
│ nested_arrays │ Array(Array(Nullable(Int64))) │ │ │ │ │ │
└───────────────┴───────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
If an array contains `null`, ClickHouse will use types from the other array elements:
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, '{"arr" : [null, 42, null]}')
```
```response
┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ arr │ Array(Nullable(Int64)) │ │ │ │ │ │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Tuples:
In JSON formats we treat Arrays with elements of different types as Tuples.
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, '{"tuple" : [1, "Hello, World!", [1, 2, 3]]}')
```
```response
┌─name──┬─type─────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ tuple │ Tuple(Nullable(Int64), Nullable(String), Array(Nullable(Int64))) │ │ │ │ │ │
└───────┴──────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
If some values are `null` or empty, we use types of corresponding values from the other rows:
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, $$
{"tuple" : [1, null, null]}
{"tuple" : [null, "Hello, World!", []]}
{"tuple" : [null, null, [1, 2, 3]]}
$$)
2022-12-30 03:34:25 +00:00
```
```response
┌─name──┬─type─────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ tuple │ Tuple(Nullable(Int64), Nullable(String), Array(Nullable(Int64))) │ │ │ │ │ │
└───────┴──────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Maps:
In JSON we can read objects with values of the same type as Map type.
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, '{"map" : {"key1" : 42, "key2" : 24, "key3" : 4}}')
```
```response
┌─name─┬─type─────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ map │ Map(String, Nullable(Int64)) │ │ │ │ │ │
└──────┴──────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
JSON Object type (if setting `allow_experimental_object_type` is enabled):
```sql
2022-12-30 03:34:25 +00:00
SET allow_experimental_object_type = 1
DESC format(JSONEachRow, $$
{"obj" : {"key1" : 42}}
{"obj" : {"key2" : "Hello, World!"}}
{"obj" : {"key1" : 24, "key3" : {"a" : 42, "b" : null}}}
$$)
```
```response
┌─name─┬─type─────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ obj │ Object(Nullable('json')) │ │ │ │ │ │
└──────┴──────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Nested complex types:
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, '{"value" : [[[42, 24], []], {"key1" : 42, "key2" : 24}]}')
```
```response
┌─name──┬─type───────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Tuple(Array(Array(Nullable(Int64))), Map(String, Nullable(Int64))) │ │ │ │ │ │
└───────┴────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
2022-12-29 21:21:37 +00:00
If ClickHouse cannot determine the type, because the data contains only nulls, an exception will be thrown:
```sql
2022-12-30 03:34:25 +00:00
DESC format(JSONEachRow, '{"arr" : [null, null]}')
```
```response
Code: 652. DB::Exception: Received from localhost:9000. DB::Exception:
Cannot determine type for column 'arr' by first 1 rows of data,
most likely this column contains only Nulls or empty Arrays/Maps.
...
```
### JSON settings {#json-settings}
#### input_format_json_read_objects_as_strings
Enabling this setting allows reading nested JSON objects as strings.
This setting can be used to read nested JSON objects without using JSON object type.
2022-12-29 21:21:37 +00:00
This setting is enabled by default.
```sql
2022-12-30 03:34:25 +00:00
SET input_format_json_read_objects_as_strings = 1;
DESC format(JSONEachRow, $$
{"obj" : {"key1" : 42, "key2" : [1,2,3,4]}}
{"obj" : {"key3" : {"nested_key" : 1}}}
$$)
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ obj │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
#### input_format_json_try_infer_numbers_from_strings
2022-12-30 03:34:25 +00:00
Enabling this setting allows inferring numbers from string values.
This setting is enabled by default.
**Example:**
```sql
2022-12-30 03:34:25 +00:00
SET input_format_json_try_infer_numbers_from_strings = 1;
DESC format(JSONEachRow, $$
{"value" : "42"}
{"value" : "424242424242"}
$$)
```
```reponse
┌─name──┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Nullable(Int64) │ │ │ │ │ │
└───────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
#### input_format_json_read_numbers_as_strings
Enabling this setting allows reading numeric values as strings.
This setting is disabled by default.
**Example**
```sql
2022-12-30 14:53:35 +00:00
SET input_format_json_read_numbers_as_strings = 1;
DESC format(JSONEachRow, $$
{"value" : 1055}
{"value" : "unknown"}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name──┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Nullable(String) │ │ │ │ │ │
└───────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
#### input_format_json_read_bools_as_numbers
Enabling this setting allows reading Bool values as numbers.
This setting is enabled by default.
**Example:**
```sql
2022-12-30 14:53:35 +00:00
SET input_format_json_read_bools_as_numbers = 1;
DESC format(JSONEachRow, $$
{"value" : true}
{"value" : 42}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name──┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ value │ Nullable(Int64) │ │ │ │ │ │
└───────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## CSV {#csv}
2022-12-30 14:21:12 +00:00
In CSV format ClickHouse extracts column values from the row according to delimiters. ClickHouse expects all types except numbers and strings to be enclosed in double quotes. If the value is in double quotes, ClickHouse tries to parse
the data inside quotes using the recursive parser and then tries to find the most appropriate data type for it. If the value is not in double quotes, ClickHouse tries to parse it as a number,
2022-12-29 21:21:37 +00:00
and if the value is not a number, ClickHouse treats it as a string.
If you don't want ClickHouse to try to determine complex types using some parsers and heuristics, you can disable setting `input_format_csv_use_best_effort_in_schema_inference`
and ClickHouse will treat all columns as Strings.
**Examples:**
Integers, Floats, Bools, Strings:
```sql
2022-12-30 14:53:35 +00:00
DESC format(CSV, '42,42.42,true,"Hello,World!"')
```
```response
┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(Int64) │ │ │ │ │ │
│ c2 │ Nullable(Float64) │ │ │ │ │ │
│ c3 │ Nullable(Bool) │ │ │ │ │ │
│ c4 │ Nullable(String) │ │ │ │ │ │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Strings without quotes:
```sql
2022-12-30 14:53:35 +00:00
DESC format(CSV, 'Hello world!,World hello!')
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Dates, DateTimes:
```sql
2022-12-30 14:53:35 +00:00
DESC format(CSV, '"2020-01-01","2020-01-01 00:00:00"')
```
```response
┌─name─┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(Date) │ │ │ │ │ │
│ c2 │ Nullable(DateTime64(9)) │ │ │ │ │ │
└──────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Arrays:
```sql
2022-12-30 14:53:35 +00:00
DESC format(CSV, '"[1,2,3]","[[1, 2], [], [3, 4]]"')
```
```response
┌─name─┬─type──────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Nullable(Int64)) │ │ │ │ │ │
│ c2 │ Array(Array(Nullable(Int64))) │ │ │ │ │ │
└──────┴───────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
DESC format(CSV, $$"['Hello', 'world']","[['Abc', 'Def'], []]"$$)
```
```response
┌─name─┬─type───────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Nullable(String)) │ │ │ │ │ │
│ c2 │ Array(Array(Nullable(String))) │ │ │ │ │ │
└──────┴────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
If an array contains null, ClickHouse will use types from the other array elements:
```sql
2022-12-30 14:53:35 +00:00
DESC format(CSV, '"[NULL, 42, NULL]"')
```
```response
┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Nullable(Int64)) │ │ │ │ │ │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Maps:
```sql
2022-12-30 14:53:35 +00:00
DESC format(CSV, $$"{'key1' : 42, 'key2' : 24}"$$)
```
```response
┌─name─┬─type─────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Map(String, Nullable(Int64)) │ │ │ │ │ │
└──────┴──────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Nested Arrays and Maps:
```sql
2022-12-30 14:53:35 +00:00
DESC format(CSV, $$"[{'key1' : [[42, 42], []], 'key2' : [[null], [42]]}]"$$)
```
```response
┌─name─┬─type──────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Map(String, Array(Array(Nullable(Int64))))) │ │ │ │ │ │
└──────┴───────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
If ClickHouse cannot determine the type inside quotes, because the data contains only nulls, ClickHouse will treat it as String:
```sql
2022-12-30 14:53:35 +00:00
DESC format(CSV, '"[NULL, NULL]"')
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Example with disabled setting `input_format_csv_use_best_effort_in_schema_inference`:
```sql
2022-12-30 14:53:35 +00:00
SET input_format_csv_use_best_effort_in_schema_inference = 0
DESC format(CSV, '"[1,2,3]",42.42,Hello World!')
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## TSV/TSKV {#tsv-tskv}
In TSV/TSKV formats ClickHouse extracts column value from the row according to tabular delimiters and then parses extracted value using
2022-12-30 14:21:12 +00:00
the recursive parser to determine the most appropriate type. If the type cannot be determined, ClickHouse treats this value as String.
If you don't want ClickHouse to try to determine complex types using some parsers and heuristics, you can disable setting `input_format_tsv_use_best_effort_in_schema_inference`
and ClickHouse will treat all columns as Strings.
**Examples:**
Integers, Floats, Bools, Strings:
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, '42 42.42 true Hello,World!')
```
```response
┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(Int64) │ │ │ │ │ │
│ c2 │ Nullable(Float64) │ │ │ │ │ │
│ c3 │ Nullable(Bool) │ │ │ │ │ │
│ c4 │ Nullable(String) │ │ │ │ │ │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
DESC format(TSKV, 'int=42 float=42.42 bool=true string=Hello,World!\n')
```
```response
┌─name───┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ int │ Nullable(Int64) │ │ │ │ │ │
│ float │ Nullable(Float64) │ │ │ │ │ │
│ bool │ Nullable(Bool) │ │ │ │ │ │
│ string │ Nullable(String) │ │ │ │ │ │
└────────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Dates, DateTimes:
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, '2020-01-01 2020-01-01 00:00:00')
```
```response
┌─name─┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(Date) │ │ │ │ │ │
│ c2 │ Nullable(DateTime64(9)) │ │ │ │ │ │
└──────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Arrays:
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, '[1,2,3] [[1, 2], [], [3, 4]]')
```
```response
┌─name─┬─type──────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Nullable(Int64)) │ │ │ │ │ │
│ c2 │ Array(Array(Nullable(Int64))) │ │ │ │ │ │
└──────┴───────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
DESC format(TSV, '[''Hello'', ''world''] [[''Abc'', ''Def''], []]')
```
```response
┌─name─┬─type───────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Nullable(String)) │ │ │ │ │ │
│ c2 │ Array(Array(Nullable(String))) │ │ │ │ │ │
└──────┴────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
If an array contains null, ClickHouse will use types from the other array elements:
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, '[NULL, 42, NULL]')
```
```response
┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Nullable(Int64)) │ │ │ │ │ │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Tuples:
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, $$(42, 'Hello, world!')$$)
```
```response
┌─name─┬─type─────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Tuple(Nullable(Int64), Nullable(String)) │ │ │ │ │ │
└──────┴──────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Maps:
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, $${'key1' : 42, 'key2' : 24}$$)
```
```response
┌─name─┬─type─────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Map(String, Nullable(Int64)) │ │ │ │ │ │
└──────┴──────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Nested Arrays, Tuples and Maps:
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, $$[{'key1' : [(42, 'Hello'), (24, NULL)], 'key2' : [(NULL, ','), (42, 'world!')]}]$$)
```
```response
┌─name─┬─type────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Map(String, Array(Tuple(Nullable(Int64), Nullable(String))))) │ │ │ │ │ │
└──────┴─────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
If ClickHouse cannot determine the type, because the data contains only nulls, ClickHouse will treat it as String:
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, '[NULL, NULL]')
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Example with disabled setting `input_format_tsv_use_best_effort_in_schema_inference`:
```sql
2022-12-30 14:53:35 +00:00
SET input_format_tsv_use_best_effort_in_schema_inference = 0
DESC format(TSV, '[1,2,3] 42.42 Hello World!')
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## Values {#values}
In Values format ClickHouse extracts column value from the row and then parses it using
2022-12-30 14:21:12 +00:00
the recursive parser similar to how literals are parsed.
**Examples:**
Integers, Floats, Bools, Strings:
```sql
2022-12-30 14:53:35 +00:00
DESC format(Values, $$(42, 42.42, true, 'Hello,World!')$$)
```
```response
┌─name─┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(Int64) │ │ │ │ │ │
│ c2 │ Nullable(Float64) │ │ │ │ │ │
│ c3 │ Nullable(Bool) │ │ │ │ │ │
│ c4 │ Nullable(String) │ │ │ │ │ │
└──────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Dates, DateTimes:
```sql
2022-12-30 14:53:35 +00:00
DESC format(Values, $$('2020-01-01', '2020-01-01 00:00:00')$$)
```
```response
┌─name─┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(Date) │ │ │ │ │ │
│ c2 │ Nullable(DateTime64(9)) │ │ │ │ │ │
└──────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Arrays:
```sql
2022-12-30 14:53:35 +00:00
DESC format(Values, '([1,2,3], [[1, 2], [], [3, 4]])')
```
```response
┌─name─┬─type──────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Nullable(Int64)) │ │ │ │ │ │
│ c2 │ Array(Array(Nullable(Int64))) │ │ │ │ │ │
└──────┴───────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
If an array contains null, ClickHouse will use types from the other array elements:
```sql
2022-12-30 14:53:35 +00:00
DESC format(Values, '([NULL, 42, NULL])')
```
```response
┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Nullable(Int64)) │ │ │ │ │ │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Tuples:
```sql
2022-12-30 14:53:35 +00:00
DESC format(Values, $$((42, 'Hello, world!'))$$)
```
```response
┌─name─┬─type─────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Tuple(Nullable(Int64), Nullable(String)) │ │ │ │ │ │
└──────┴──────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Maps:
```sql
2022-12-30 14:53:35 +00:00
DESC format(Values, $$({'key1' : 42, 'key2' : 24})$$)
```
```response
┌─name─┬─type─────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Map(String, Nullable(Int64)) │ │ │ │ │ │
└──────┴──────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
Nested Arrays, Tuples and Maps:
```sql
2022-12-30 14:53:35 +00:00
DESC format(Values, $$([{'key1' : [(42, 'Hello'), (24, NULL)], 'key2' : [(NULL, ','), (42, 'world!')]}])$$)
```
```response
┌─name─┬─type────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Array(Map(String, Array(Tuple(Nullable(Int64), Nullable(String))))) │ │ │ │ │ │
└──────┴─────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
2022-12-29 21:21:37 +00:00
If ClickHouse cannot determine the type, because the data contains only nulls, an exception will be thrown:
```sql
2022-12-30 14:53:35 +00:00
DESC format(Values, '([NULL, NULL])')
```
```response
Code: 652. DB::Exception: Received from localhost:9000. DB::Exception:
Cannot determine type for column 'c1' by first 1 rows of data,
most likely this column contains only Nulls or empty Arrays/Maps.
...
```
Example with disabled setting `input_format_tsv_use_best_effort_in_schema_inference`:
```sql
2022-12-30 14:53:35 +00:00
SET input_format_tsv_use_best_effort_in_schema_inference = 0
DESC format(TSV, '[1,2,3] 42.42 Hello World!')
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(String) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## CustomSeparated {#custom-separated}
In CustomSeparated format ClickHouse first extracts all column values from the row according to specified delimiters and then tries to infer
2022-12-30 14:21:12 +00:00
the data type for each value according to escaping rule.
**Example**
```sql
2022-12-30 14:53:35 +00:00
SET format_custom_row_before_delimiter = '<row_before_delimiter>',
format_custom_row_after_delimiter = '<row_after_delimiter>\n',
format_custom_row_between_delimiter = '<row_between_delimiter>\n',
format_custom_result_before_delimiter = '<result_before_delimiter>\n',
format_custom_result_after_delimiter = '<result_after_delimiter>\n',
format_custom_field_delimiter = '<field_delimiter>',
format_custom_escaping_rule = 'Quoted'
2022-12-30 14:53:35 +00:00
DESC format(CustomSeparated, $$<result_before_delimiter>
<row_before_delimiter>42.42<field_delimiter>'Some string 1'<field_delimiter>[1, NULL, 3]<row_after_delimiter>
<row_between_delimiter>
<row_before_delimiter>NULL<field_delimiter>'Some string 3'<field_delimiter>[1, 2, NULL]<row_after_delimiter>
<result_after_delimiter>
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(Float64) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Array(Nullable(Int64)) │ │ │ │ │ │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## Template {#template}
2022-12-30 14:21:12 +00:00
In Template format ClickHouse first extracts all column values from the row according to the specified template and then tries to infer the
data type for each value according to its escaping rule.
**Example**
Let's say we have a file `resultset` with the next content:
```
<result_before_delimiter>
${data}<result_after_delimiter>
```
And a file `row_format` with the next content:
```
<row_before_delimiter>${column_1:CSV}<field_delimiter_1>${column_2:Quoted}<field_delimiter_2>${column_3:JSON}<row_after_delimiter>
```
Then we can make the next queries:
```sql
2022-12-30 14:53:35 +00:00
SET format_template_rows_between_delimiter = '<row_between_delimiter>\n',
format_template_row = 'row_format',
format_template_resultset = 'resultset_format'
2022-12-30 14:53:35 +00:00
DESC format(Template, $$<result_before_delimiter>
<row_before_delimiter>42.42<field_delimiter_1>'Some string 1'<field_delimiter_2>[1, null, 2]<row_after_delimiter>
<row_between_delimiter>
<row_before_delimiter>\N<field_delimiter_1>'Some string 3'<field_delimiter_2>[1, 2, null]<row_after_delimiter>
<result_after_delimiter>
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─────┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ column_1 │ Nullable(Float64) │ │ │ │ │ │
│ column_2 │ Nullable(String) │ │ │ │ │ │
│ column_3 │ Array(Nullable(Int64)) │ │ │ │ │ │
└──────────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## Regexp {#regexp}
Similar to Template, in Regexp format ClickHouse first extracts all column values from the row according to specified regular expression and then tries to infer
2022-12-30 14:21:12 +00:00
data type for each value according to the specified escaping rule.
**Example**
```sql
2022-12-30 14:53:35 +00:00
SET format_regexp = '^Line: value_1=(.+?), value_2=(.+?), value_3=(.+?)',
format_regexp_escaping_rule = 'CSV'
2022-12-30 14:53:35 +00:00
DESC format(Regexp, $$Line: value_1=42, value_2="Some string 1", value_3="[1, NULL, 3]"
Line: value_1=2, value_2="Some string 2", value_3="[4, 5, NULL]"$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ c1 │ Nullable(Int64) │ │ │ │ │ │
│ c2 │ Nullable(String) │ │ │ │ │ │
│ c3 │ Array(Nullable(Int64)) │ │ │ │ │ │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## Settings for text formats {settings-for-text-formats}
### input_format_max_rows_to_read_for_schema_inference
This setting controls the maximum number of rows to be read while schema inference.
The more rows are read, the more time is spent on schema inference, but the greater the chance to
correctly determine the types (especially when the data contains a lot of nulls).
Default value: `25000`.
### column_names_for_schema_inference
The list of column names to use in schema inference for formats without explicit column names. Specified names will be used instead of default `c1,c2,c3,...`. The format: `column1,column2,column3,...`.
**Example**
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSV, 'Hello, World! 42 [1, 2, 3]') settings column_names_for_schema_inference = 'str,int,arr'
```
```response
┌─name─┬─type───────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ str │ Nullable(String) │ │ │ │ │ │
│ int │ Nullable(Int64) │ │ │ │ │ │
│ arr │ Array(Nullable(Int64)) │ │ │ │ │ │
└──────┴────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
### schema_inference_hints
The list of column names and types to use in schema inference instead of automatically determined types. The format: 'column_name1 column_type1, column_name2 column_type2, ...'.
2022-12-29 21:21:37 +00:00
This setting can be used to specify the types of columns that could not be determined automatically or for optimizing the schema.
**Example**
```sql
2022-12-30 14:53:35 +00:00
DESC format(JSONEachRow, '{"id" : 1, "age" : 25, "name" : "Josh", "status" : null, "hobbies" : ["football", "cooking"]}'
SETTINGS schema_inference_hints = 'age LowCardinality(UInt8), status Nullable(String)'
2022-12-30 14:53:35 +00:00
```
```response
┌─name────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id │ Nullable(Int64) │ │ │ │ │ │
│ age │ LowCardinality(UInt8) │ │ │ │ │ │
│ name │ Nullable(String) │ │ │ │ │ │
│ status │ Nullable(String) │ │ │ │ │ │
│ hobbies │ Array(Nullable(String)) │ │ │ │ │ │
└─────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
### schema_inference_make_columns_nullable
Controls making inferred types `Nullable` in schema inference for formats without information about nullability.
2022-12-30 14:21:12 +00:00
If the setting is enabled, the inferred type will be `Nullable` only if the column contains `NULL` in a sample that is parsed during schema inference.
2022-12-29 21:21:37 +00:00
Enabled by default.
**Examples**
```sql
2022-12-30 14:53:35 +00:00
SET schema_inference_make_columns_nullable = 0
DESC format(JSONEachRow, $$
{"id" : 1, "age" : 25, "name" : "Josh", "status" : null, "hobbies" : ["football", "cooking"]}
{"id" : 2, "age" : 19, "name" : "Alan", "status" : "married", "hobbies" : ["tennis", "art"]}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id │ Nullable(Int64) │ │ │ │ │ │
│ age │ Nullable(Int64) │ │ │ │ │ │
│ name │ Nullable(String) │ │ │ │ │ │
│ status │ Nullable(String) │ │ │ │ │ │
│ hobbies │ Array(Nullable(String)) │ │ │ │ │ │
└─────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
SET schema_inference_make_columns_nullable = 0
DESC format(JSONEachRow, $$
{"id" : 1, "age" : 25, "name" : "Josh", "status" : null, "hobbies" : ["football", "cooking"]}
{"id" : 2, "age" : 19, "name" : "Alan", "status" : "married", "hobbies" : ["tennis", "art"]}
$$)
┌─name────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id │ Int64 │ │ │ │ │ │
│ age │ Int64 │ │ │ │ │ │
│ name │ String │ │ │ │ │ │
│ status │ Nullable(String) │ │ │ │ │ │
│ hobbies │ Array(String) │ │ │ │ │ │
└─────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
### input_format_try_infer_integers
If enabled, ClickHouse will try to infer integers instead of floats in schema inference for text formats.
If all numbers in the column from sample data are integers, the result type will be `Int64`, if at least one number is float, the result type will be `Float64`.
If the sample data contains only integers and at least one integer is positive and overflows `Int64`, ClickHouse will infer `UInt64`.
Enabled by default.
**Examples**
```sql
2022-12-30 14:53:35 +00:00
SET input_format_try_infer_integers = 0
DESC format(JSONEachRow, $$
{"number" : 1}
{"number" : 2}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name───┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ Nullable(Float64) │ │ │ │ │ │
└────────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
SET input_format_try_infer_integers = 1
DESC format(JSONEachRow, $$
{"number" : 1}
{"number" : 2}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name───┬─type────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ Nullable(Int64) │ │ │ │ │ │
└────────┴─────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
DESC format(JSONEachRow, $$
{"number" : 1}
{"number" : 18446744073709551615}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name───┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ Nullable(UInt64) │ │ │ │ │ │
└────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
DESC format(JSONEachRow, $$
{"number" : 1}
{"number" : 2.2}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name───┬─type──────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ number │ Nullable(Float64) │ │ │ │ │ │
└────────┴───────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
### input_format_try_infer_datetimes
If enabled, ClickHouse will try to infer type `DateTime64` from string fields in schema inference for text formats.
If all fields from a column in sample data were successfully parsed as datetimes, the result type will be `DateTime64(9)`,
if at least one field was not parsed as datetime, the result type will be `String`.
Enabled by default.
**Examples**
```sql
2022-12-30 14:53:35 +00:00
SET input_format_try_infer_datetimes = 0
DESC format(JSONEachRow, $$
{"datetime" : "2021-01-01 00:00:00.000"}
{"datetime" : "2022-01-01 00:00:00.000"}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ datetime │ Nullable(String) │ │ │ │ │ │
└──────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
SET input_format_try_infer_datetimes = 1
DESC format(JSONEachRow, $$
{"datetime" : "2021-01-01 00:00:00.000"}
{"datetime" : "2022-01-01 00:00:00.000"}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ datetime │ Nullable(DateTime64(9)) │ │ │ │ │ │
└──────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
DESC format(JSONEachRow, $$
{"datetime" : "2021-01-01 00:00:00.000"}
{"datetime" : "unknown"}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ datetime │ Nullable(String) │ │ │ │ │ │
└──────────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
2022-12-30 14:21:12 +00:00
Note: Parsing datetimes during schema inference respect setting [date_time_input_format](/docs/en/operations/settings/settings.md#date_time_input_format)
### input_format_try_infer_dates
If enabled, ClickHouse will try to infer type `Date` from string fields in schema inference for text formats.
If all fields from a column in sample data were successfully parsed as dates, the result type will be `Date`,
if at least one field was not parsed as date, the result type will be `String`.
Enabled by default.
**Examples**
```sql
2022-12-30 14:53:35 +00:00
SET input_format_try_infer_datetimes = 0, input_format_try_infer_dates = 0
DESC format(JSONEachRow, $$
{"date" : "2021-01-01"}
{"date" : "2022-01-01"}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
SET input_format_try_infer_dates = 1
DESC format(JSONEachRow, $$
{"date" : "2021-01-01"}
{"date" : "2022-01-01"}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─┬─type───────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(Date) │ │ │ │ │ │
└──────┴────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
2022-12-30 14:53:35 +00:00
```
```sql
DESC format(JSONEachRow, $$
{"date" : "2021-01-01"}
{"date" : "unknown"}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─┬─type─────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ date │ Nullable(String) │ │ │ │ │ │
└──────┴──────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
# Self describing formats {#self-describing-formats}
Self-describing formats contain information about the structure of the data in the data itself,
it can be some header with a description, a binary type tree, or some kind of table.
To automatically infer a schema from files in such formats, ClickHouse reads a part of the data containing
2022-12-30 14:21:12 +00:00
information about the types and converts it into a schema of the ClickHouse table.
## Formats with -WithNamesAndTypes suffix {#formats-with-names-and-types}
2022-12-30 14:21:12 +00:00
ClickHouse supports some text formats with the suffix -WithNamesAndTypes. This suffix means that the data contains two additional rows with column names and types before the actual data.
While schema inference for such formats, ClickHouse reads the first two rows and extracts column names and types.
**Example**
```sql
2022-12-30 14:53:35 +00:00
DESC format(TSVWithNamesAndTypes,
$$num str arr
UInt8 String Array(UInt8)
42 Hello, World! [1,2,3]
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─┬─type─────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ num │ UInt8 │ │ │ │ │ │
│ str │ String │ │ │ │ │ │
│ arr │ Array(UInt8) │ │ │ │ │ │
└──────┴──────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## JSON formats with metadata {#json-with-metadata}
2022-12-29 21:21:37 +00:00
Some JSON input formats ([JSON](formats.md#json), [JSONCompact](formats.md#json-compact), [JSONColumnsWithMetadata](formats.md#jsoncolumnswithmetadata)) contain metadata with column names and types.
In schema inference for such formats, ClickHouse reads this metadata.
**Example**
```sql
2022-12-30 14:53:35 +00:00
DESC format(JSON, $$
{
"meta":
[
{
"name": "num",
"type": "UInt8"
},
{
"name": "str",
"type": "String"
},
{
"name": "arr",
"type": "Array(UInt8)"
}
],
"data":
[
{
"num": 42,
"str": "Hello, World",
"arr": [1,2,3]
}
],
"rows": 1,
"statistics":
{
"elapsed": 0.005723915,
"rows_read": 1,
"bytes_read": 1
}
}
$$)
2022-12-30 14:53:35 +00:00
```
```response
┌─name─┬─type─────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ num │ UInt8 │ │ │ │ │ │
│ str │ String │ │ │ │ │ │
│ arr │ Array(UInt8) │ │ │ │ │ │
└──────┴──────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## Avro {#avro}
2022-12-30 14:21:12 +00:00
In Avro format ClickHouse reads its schema from the data and converts it to ClickHouse schema using the following type matches:
| Avro data type | ClickHouse data type |
|------------------------------------|--------------------------------------------------------------------------------|
| `boolean` | [Bool](../sql-reference/data-types/boolean.md) |
| `int` | [Int32](../sql-reference/data-types/int-uint.md) |
| `long` | [Int64](../sql-reference/data-types/int-uint.md) |
| `float` | [Float32](../sql-reference/data-types/float.md) |
| `double` | [Float64](../sql-reference/data-types/float.md) |
| `bytes`, `string` | [String](../sql-reference/data-types/string.md) |
| `fixed` | [FixedString(N)](../sql-reference/data-types/fixedstring.md) |
2022-12-29 20:42:03 +00:00
| `enum` | [Enum](../sql-reference/data-types/enum.md) |
| `array(T)` | [Array(T)](../sql-reference/data-types/array.md) |
| `union(null, T)`, `union(T, null)` | [Nullable(T)](../sql-reference/data-types/date.md) |
| `null` | [Nullable(Nothing)](../sql-reference/data-types/special-data-types/nothing.md) |
Other Avro types are not supported.
## Parquet {#parquet}
2022-12-30 14:21:12 +00:00
In Parquet format ClickHouse reads its schema from the data and converts it to ClickHouse schema using the following type matches:
| Parquet data type | ClickHouse data type |
|------------------------------|---------------------------------------------------------|
| `BOOL` | [Bool](../sql-reference/data-types/boolean.md) |
| `UINT8` | [UInt8](../sql-reference/data-types/int-uint.md) |
| `INT8` | [Int8](../sql-reference/data-types/int-uint.md) |
| `UINT16` | [UInt16](../sql-reference/data-types/int-uint.md) |
| `INT16` | [Int16](../sql-reference/data-types/int-uint.md) |
| `UINT32` | [UInt32](../sql-reference/data-types/int-uint.md) |
| `INT32` | [Int32](../sql-reference/data-types/int-uint.md) |
| `UINT64` | [UInt64](../sql-reference/data-types/int-uint.md) |
| `INT64` | [Int64](../sql-reference/data-types/int-uint.md) |
| `FLOAT` | [Float32](../sql-reference/data-types/float.md) |
| `DOUBLE` | [Float64](../sql-reference/data-types/float.md) |
| `DATE` | [Date32](../sql-reference/data-types/date32.md) |
| `TIME (ms)` | [DateTime](../sql-reference/data-types/datetime.md) |
| `TIMESTAMP`, `TIME (us, ns)` | [DateTime64](../sql-reference/data-types/datetime64.md) |
| `STRING`, `BINARY` | [String](../sql-reference/data-types/string.md) |
| `DECIMAL` | [Decimal](../sql-reference/data-types/decimal.md) |
| `LIST` | [Array](../sql-reference/data-types/array.md) |
| `STRUCT` | [Tuple](../sql-reference/data-types/tuple.md) |
| `MAP` | [Map](../sql-reference/data-types/map.md) |
2022-12-30 14:21:12 +00:00
Other Parquet types are not supported. By default, all inferred types are inside `Nullable`, but it can be changed using the setting `schema_inference_make_columns_nullable`.
## Arrow {#arrow}
2022-12-30 14:21:12 +00:00
In Arrow format ClickHouse reads its schema from the data and converts it to ClickHouse schema using the following type matches:
| Arrow data type | ClickHouse data type |
|---------------------------------|---------------------------------------------------------|
| `BOOL` | [Bool](../sql-reference/data-types/boolean.md) |
| `UINT8` | [UInt8](../sql-reference/data-types/int-uint.md) |
| `INT8` | [Int8](../sql-reference/data-types/int-uint.md) |
| `UINT16` | [UInt16](../sql-reference/data-types/int-uint.md) |
| `INT16` | [Int16](../sql-reference/data-types/int-uint.md) |
| `UINT32` | [UInt32](../sql-reference/data-types/int-uint.md) |
| `INT32` | [Int32](../sql-reference/data-types/int-uint.md) |
| `UINT64` | [UInt64](../sql-reference/data-types/int-uint.md) |
| `INT64` | [Int64](../sql-reference/data-types/int-uint.md) |
| `FLOAT`, `HALF_FLOAT` | [Float32](../sql-reference/data-types/float.md) |
| `DOUBLE` | [Float64](../sql-reference/data-types/float.md) |
| `DATE32` | [Date32](../sql-reference/data-types/date32.md) |
| `DATE64` | [DateTime](../sql-reference/data-types/datetime.md) |
| `TIMESTAMP`, `TIME32`, `TIME64` | [DateTime64](../sql-reference/data-types/datetime64.md) |
| `STRING`, `BINARY` | [String](../sql-reference/data-types/string.md) |
| `DECIMAL128`, `DECIMAL256` | [Decimal](../sql-reference/data-types/decimal.md) |
| `LIST` | [Array](../sql-reference/data-types/array.md) |
| `STRUCT` | [Tuple](../sql-reference/data-types/tuple.md) |
| `MAP` | [Map](../sql-reference/data-types/map.md) |
2022-12-30 14:21:12 +00:00
Other Arrow types are not supported. By default, all inferred types are inside `Nullable`, but it can be changed using the setting `schema_inference_make_columns_nullable`.
## ORC {#orc}
2022-12-30 14:21:12 +00:00
In ORC format ClickHouse reads its schema from the data and converts it to ClickHouse schema using the following type matches:
| ORC data type | ClickHouse data type |
|--------------------------------------|---------------------------------------------------------|
| `Boolean` | [Bool](../sql-reference/data-types/boolean.md) |
| `Tinyint` | [Int8](../sql-reference/data-types/int-uint.md) |
| `Smallint` | [Int16](../sql-reference/data-types/int-uint.md) |
| `Int` | [Int32](../sql-reference/data-types/int-uint.md) |
| `Bigint` | [Int64](../sql-reference/data-types/int-uint.md) |
| `Float` | [Float32](../sql-reference/data-types/float.md) |
| `Double` | [Float64](../sql-reference/data-types/float.md) |
| `Date` | [Date32](../sql-reference/data-types/date32.md) |
| `Timestamp` | [DateTime64](../sql-reference/data-types/datetime64.md) |
| `String`, `Char`, `Varchar`,`BINARY` | [String](../sql-reference/data-types/string.md) |
| `Decimal` | [Decimal](../sql-reference/data-types/decimal.md) |
| `List` | [Array](../sql-reference/data-types/array.md) |
| `Struct` | [Tuple](../sql-reference/data-types/tuple.md) |
| `Map` | [Map](../sql-reference/data-types/map.md) |
2022-12-30 14:21:12 +00:00
Other ORC types are not supported. By default, all inferred types are inside `Nullable`, but it can be changed using the setting `schema_inference_make_columns_nullable`.
## Native {#native}
Native format is used inside ClickHouse and contains the schema in the data.
2022-12-30 14:21:12 +00:00
In schema inference, ClickHouse reads the schema from the data without any transformations.
# Formats with external schema {#formats-with-external-schema}
Such formats require a schema describing the data in a separate file in a specific schema language.
To automatically infer a schema from files in such formats, ClickHouse reads external schema from a separate file and transforms it to a ClickHouse table schema.
# Protobuf {#protobuf}
2022-12-30 14:21:12 +00:00
In schema inference for Protobuf format ClickHouse uses the following type matches:
2022-12-29 20:42:03 +00:00
| Protobuf data type | ClickHouse data type |
|-------------------------------|---------------------------------------------------|
| `bool` | [UInt8](../sql-reference/data-types/int-uint.md) |
| `float` | [Float32](../sql-reference/data-types/float.md) |
| `double` | [Float64](../sql-reference/data-types/float.md) |
| `int32`, `sint32`, `sfixed32` | [Int32](../sql-reference/data-types/int-uint.md) |
| `int64`, `sint64`, `sfixed64` | [Int64](../sql-reference/data-types/int-uint.md) |
| `uint32`, `fixed32` | [UInt32](../sql-reference/data-types/int-uint.md) |
| `uint64`, `fixed64` | [UInt64](../sql-reference/data-types/int-uint.md) |
| `string`, `bytes` | [String](../sql-reference/data-types/string.md) |
| `enum` | [Enum](../sql-reference/data-types/enum.md) |
| `repeated T` | [Array(T)](../sql-reference/data-types/array.md) |
| `message`, `group` | [Tuple](../sql-reference/data-types/tuple.md) |
# CapnProto {#capnproto}
2022-12-30 14:21:12 +00:00
In schema inference for CapnProto format ClickHouse uses the following type matches:
2022-12-29 20:42:03 +00:00
| CapnProto data type | ClickHouse data type |
|------------------------------------|--------------------------------------------------------|
| `Bool` | [UInt8](../sql-reference/data-types/int-uint.md) |
| `Int8` | [Int8](../sql-reference/data-types/int-uint.md) |
| `UInt8` | [UInt8](../sql-reference/data-types/int-uint.md) |
| `Int16` | [Int16](../sql-reference/data-types/int-uint.md) |
| `UInt16` | [UInt16](../sql-reference/data-types/int-uint.md) |
| `Int32` | [Int32](../sql-reference/data-types/int-uint.md) |
| `UInt32` | [UInt32](../sql-reference/data-types/int-uint.md) |
| `Int64` | [Int64](../sql-reference/data-types/int-uint.md) |
| `UInt64` | [UInt64](../sql-reference/data-types/int-uint.md) |
| `Float32` | [Float32](../sql-reference/data-types/float.md) |
| `Float64` | [Float64](../sql-reference/data-types/float.md) |
| `Text`, `Data` | [String](../sql-reference/data-types/string.md) |
| `enum` | [Enum](../sql-reference/data-types/enum.md) |
| `List` | [Array](../sql-reference/data-types/array.md) |
| `struct` | [Tuple](../sql-reference/data-types/tuple.md) |
| `union(T, Void)`, `union(Void, T)` | [Nullable(T)](../sql-reference/data-types/nullable.md) |
# Strong-typed binary formats {#strong-typed-binary-formats}
2022-12-30 14:21:12 +00:00
In such formats, each serialized value contains information about its type (and possibly about its name), but there is no information about the whole table.
In schema inference for such formats, ClickHouse reads data row by row (up to `input_format_max_rows_to_read_for_schema_inference` rows) and extracts
2022-12-29 20:42:03 +00:00
the type (and possibly name) for each value from the data and then converts these types to ClickHouse types.
## MsgPack {msgpack}
2022-12-30 14:21:12 +00:00
In MsgPack format there is no delimiter between rows, to use schema inference for this format you should specify the number of columns in the table
using the setting `input_format_msgpack_number_of_columns`. ClickHouse uses the following type matches:
2022-12-29 20:42:03 +00:00
| MessagePack data type (`INSERT`) | ClickHouse data type |
|--------------------------------------------------------------------|-----------------------------------------------------------|
| `int N`, `uint N`, `negative fixint`, `positive fixint` | [Int64](../sql-reference/data-types/int-uint.md) |
| `bool` | [UInt8](../sql-reference/data-types/int-uint.md) |
| `fixstr`, `str 8`, `str 16`, `str 32`, `bin 8`, `bin 16`, `bin 32` | [String](../sql-reference/data-types/string.md) |
| `float 32` | [Float32](../sql-reference/data-types/float.md) |
| `float 64` | [Float64](../sql-reference/data-types/float.md) |
| `uint 16` | [Date](../sql-reference/data-types/date.md) |
| `uint 32` | [DateTime](../sql-reference/data-types/datetime.md) |
| `uint 64` | [DateTime64](../sql-reference/data-types/datetime.md) |
| `fixarray`, `array 16`, `array 32` | [Array](../sql-reference/data-types/array.md) |
| `fixmap`, `map 16`, `map 32` | [Map](../sql-reference/data-types/map.md) |
2022-12-30 14:21:12 +00:00
By default, all inferred types are inside `Nullable`, but it can be changed using the setting `schema_inference_make_columns_nullable`.
2022-12-29 20:42:03 +00:00
## BSONEachRow {#bsoneachrow}
2022-12-30 14:21:12 +00:00
In BSONEachRow each row of data is presented as a BSON document. In schema inference ClickHouse reads BSON documents one by one and extracts
values, names, and types from the data and then transforms these types to ClickHouse types using the following type matches:
2022-12-29 20:42:03 +00:00
| BSON Type | ClickHouse type |
|-----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| `\x08` boolean | [Bool](../sql-reference/data-types/boolean.md) |
| `\x10` int32 | [Int32](../sql-reference/data-types/int-uint.md) |
| `\x12` int64 | [Int64](../sql-reference/data-types/int-uint.md) |
| `\x01` double | [Float64](../sql-reference/data-types/float.md) |
| `\x09` datetime | [DateTime64](../sql-reference/data-types/datetime64.md) |
| `\x05` binary with`\x00` binary subtype, `\x02` string, `\x0E` symbol, `\x0D` JavaScript code | [String](../sql-reference/data-types/string.md) |
| `\x07` ObjectId, | [FixedString(12)](../sql-reference/data-types/fixedstring.md) |
| `\x05` binary with `\x04` uuid subtype, size = 16 | [UUID](../sql-reference/data-types/uuid.md) |
| `\x04` array | [Array](../sql-reference/data-types/array.md)/[Tuple](../sql-reference/data-types/tuple.md) (if nested types are different) |
| `\x03` document | [Named Tuple](../sql-reference/data-types/tuple.md)/[Map](../sql-reference/data-types/map.md) (with String keys) |
2022-12-30 14:21:12 +00:00
By default, all inferred types are inside `Nullable`, but it can be changed using the setting `schema_inference_make_columns_nullable`.
2022-12-29 20:42:03 +00:00
# Formats with constant schema {#formats-with-constant-schema}
2022-12-29 20:42:03 +00:00
Data in such formats always have the same schema.
## LineAsString {#line-as-string}
2022-12-30 14:21:12 +00:00
In this format, ClickHouse reads the whole line from the data into a single column with `String` data type. The inferred type for this format is always `String` and the column name is `line`.
2022-12-29 20:42:03 +00:00
**Example**
```sql
2022-12-30 14:53:35 +00:00
DESC format(LineAsString, 'Hello\nworld!')
```
```response
2022-12-29 20:42:03 +00:00
┌─name─┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ line │ String │ │ │ │ │ │
└──────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## JSONAsString {#json-as-string}
2022-12-30 14:21:12 +00:00
In this format, ClickHouse reads the whole JSON object from the data into a single column with `String` data type. The inferred type for this format is always `String` and the column name is `json`.
2022-12-29 20:42:03 +00:00
**Example**
```sql
2022-12-30 14:53:35 +00:00
DESC format(JSONAsString, '{"x" : 42, "y" : "Hello, World!"}')
```
```response
2022-12-29 20:42:03 +00:00
┌─name─┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ json │ String │ │ │ │ │ │
└──────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
## JSONAsObject {#json-as-object}
2022-12-30 14:21:12 +00:00
In this format, ClickHouse reads the whole JSON object from the data into a single column with `Object('json')` data type. Inferred type for this format is always `String` and the column name is `json`.
2022-12-29 20:42:03 +00:00
Note: This format works only if `allow_experimental_object_type` is enabled.
2022-12-29 21:31:52 +00:00
2022-12-29 20:42:03 +00:00
**Example**
```sql
2022-12-30 14:53:35 +00:00
DESC format(JSONAsString, '{"x" : 42, "y" : "Hello, World!"}') SETTINGS allow_experimental_object_type=1
```
```response
2022-12-29 20:42:03 +00:00
┌─name─┬─type───────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ json │ Object('json') │ │ │ │ │ │
└──────┴────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
```
2022-12-29 20:42:03 +00:00
[Original article](https://clickhouse.com/docs/en/interfaces/schema-inference) <!--hide-->