MergeTree
The `MergeTree` engine and other engines of this family (`*MergeTree`) are the most commonly used and most robust ClickHouse table engines.
Engines in the `MergeTree` family are designed for inserting a very large amount of data into a table. The data is quickly written to the table part by part, then rules are applied for merging the parts in the background. This method is much more efficient than continually rewriting the data in storage during insert.
Main features:
- Stores data sorted by primary key.
  This allows you to create a small sparse index that helps find data faster.
- Partitions can be used if the partitioning key is specified.
  ClickHouse supports certain operations with partitions that are more efficient than general operations on the same data with the same result. ClickHouse also automatically cuts off the partition data where the partitioning key is specified in the query.
- Data replication support.
  The family of `ReplicatedMergeTree` tables provides data replication. For more information, see Data replication.
- Data sampling support.
  If necessary, you can set the data sampling method in the table.
:::info
The Merge engine does not belong to the `*MergeTree` family.
:::
If you need to update rows frequently, we recommend using the `ReplacingMergeTree` table engine. Using `ALTER TABLE my_table UPDATE` to update rows triggers a mutation, which causes parts to be rewritten and consumes I/O and other resources. With `ReplacingMergeTree`, you can simply insert the updated rows and the old rows will be replaced according to the table sorting key.
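The insert-instead-of-update pattern might look like the following minimal sketch. The table `user_profiles`, its columns, and the choice of `updated_at` as the version column are hypothetical and only serve as an illustration.

```sql
-- Hypothetical table: names and columns are illustrative only.
CREATE TABLE user_profiles
(
    user_id UInt64,
    email String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)  -- among duplicates, keep the row with the largest updated_at
ORDER BY user_id;                        -- rows with the same user_id replace each other

-- "Update" a row by inserting a newer version of it.
INSERT INTO user_profiles VALUES (1, 'old@example.com', '2024-01-01 00:00:00');
INSERT INTO user_profiles VALUES (1, 'new@example.com', '2024-02-01 00:00:00');

-- Replacement happens during background merges, so both versions may be
-- visible for a while; FINAL collapses them at query time.
SELECT * FROM user_profiles FINAL WHERE user_id = 1;
```

Because replacement only happens when parts are merged, queries that must never see stale rows need `FINAL` or an equivalent aggregation, at some cost in query speed.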
Creating a Table
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
name1 [type1] [[NOT] NULL] [DEFAULT|MATERIALIZED|ALIAS|EPHEMERAL expr1] [COMMENT ...] [CODEC(codec1)] [STATISTIC(stat1)] [TTL expr1] [PRIMARY KEY] [SETTINGS (name = value, ...)],
name2 [type2] [[NOT] NULL] [DEFAULT|MATERIALIZED|ALIAS|EPHEMERAL expr2] [COMMENT ...] [CODEC(codec2)] [STATISTIC(stat2)] [TTL expr2] [PRIMARY KEY] [SETTINGS (name = value, ...)],
...
INDEX index_name1 expr1 TYPE type1(...) [GRANULARITY value1],
INDEX index_name2 expr2 TYPE type2(...) [GRANULARITY value2],
...
PROJECTION projection_name_1 (SELECT <COLUMN LIST EXPR> [GROUP BY] [ORDER BY]),
PROJECTION projection_name_2 (SELECT <COLUMN LIST EXPR> [GROUP BY] [ORDER BY])
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr
[DELETE|TO DISK 'xxx'|TO VOLUME 'xxx' [, ...] ]
[WHERE conditions]
[GROUP BY key_expr [SET v1 = aggr_func(v1) [, v2 = aggr_func(v2) ...]] ] ]
[SETTINGS name = value, ...]
For a description of parameters, see the CREATE query description.
Query Clauses
ENGINE
`ENGINE` - Name and parameters of the engine. `ENGINE = MergeTree()`. The `MergeTree` engine does not have parameters.
ORDER BY
`ORDER BY` - The sorting key.
A tuple of column names or arbitrary expressions. Example: `ORDER BY (CounterID, EventDate)`.
ClickHouse uses the sorting key as the primary key if the primary key is not defined explicitly by the `PRIMARY KEY` clause.
Use the `ORDER BY tuple()` syntax if you do not need sorting, or set `create_table_empty_primary_key_by_default` to `true` to use the `ORDER BY tuple()` syntax by default. See Selecting the Primary Key.
PARTITION BY
`PARTITION BY` - The partitioning key. Optional. In most cases, you don't need a partition key, and if you do need to partition, you generally do not need a partition key more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). Avoid overly granular partitioning. Don't partition your data by client identifiers or names (instead, make the client identifier or name the first column in the ORDER BY expression).
For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column of the type Date. The partition names here have the "YYYYMM" format.
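As a rough sketch of monthly partitioning in practice, the statements below create a partitioned table, list its parts per partition, and drop one month. The table `visits_by_month` and its columns are hypothetical.

```sql
-- Hypothetical table partitioned by month.
CREATE TABLE visits_by_month
(
    EventDate Date,
    CounterID UInt32,
    UserID UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);

INSERT INTO visits_by_month VALUES ('2024-01-15', 1, 100), ('2024-02-03', 1, 101);

-- List the active parts and the partition each one belongs to.
SELECT partition, name, rows
FROM system.parts
WHERE database = currentDatabase() AND table = 'visits_by_month' AND active;

-- Remove a whole month as a partition operation instead of deleting rows.
ALTER TABLE visits_by_month DROP PARTITION 202401;
```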
PRIMARY KEY
`PRIMARY KEY` - The primary key if it differs from the sorting key. Optional.
By default the primary key is the same as the sorting key (which is specified by the `ORDER BY` clause). Thus in most cases it is unnecessary to specify a separate `PRIMARY KEY` clause.
SAMPLE BY
`SAMPLE BY` - An expression for sampling. Optional.
If a sampling expression is used, the primary key must contain it. The result of a sampling expression must be an unsigned integer. Example: `SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID))`.
TTL
`TTL` - A list of rules that specify the storage duration of rows and define the logic of automatic movement of parts between disks and volumes. Optional.
The expression must evaluate to a `Date` or `DateTime` value. Example:
TTL date + INTERVAL 1 DAY
The rule type `DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'|GROUP BY` specifies the action taken on the part once the expression is satisfied (reaches the current time): removing expired rows, moving a part (if the expression is satisfied for all rows in the part) to the specified disk (`TO DISK 'xxx'`) or volume (`TO VOLUME 'xxx'`), or aggregating values in expired rows. The default rule type is removal (`DELETE`). A list of multiple rules can be specified, but there should be no more than one `DELETE` rule.
For more details, see TTL for columns and tables.
SETTINGS
Additional parameters that control the behavior of the `MergeTree` engine (optional):
index_granularity
index_granularity
— Maximum number of data rows between the marks of an index. Default value: 8192. See Data Storage.
index_granularity_bytes
`index_granularity_bytes` - Maximum size of data granules in bytes. Default value: 10 MB. To restrict the granule size only by the number of rows, set it to 0 (not recommended). See Data Storage.
min_index_granularity_bytes
`min_index_granularity_bytes` - Minimum allowed size of data granules in bytes. Default value: 1024 bytes. It provides a safeguard against accidentally creating tables with a very low `index_granularity_bytes`. See Data Storage.
enable_mixed_granularity_parts
enable_mixed_granularity_parts
— Enables or disables transitioning to control the granule size with the index_granularity_bytes
setting. Before version 19.11, there was only the index_granularity
setting for restricting granule size. The index_granularity_bytes
setting improves ClickHouse performance when selecting data from tables with big rows (tens and hundreds of megabytes). If you have tables with big rows, you can enable this setting for the tables to improve the efficiency of SELECT
queries.
use_minimalistic_part_header_in_zookeeper
use_minimalistic_part_header_in_zookeeper
— Storage method of the data parts headers in ZooKeeper. If use_minimalistic_part_header_in_zookeeper=1
, then ZooKeeper stores less data. For more information, see the setting description in “Server configuration parameters”.
min_merge_bytes_to_use_direct_io
min_merge_bytes_to_use_direct_io
— The minimum data volume for merge operation that is required for using direct I/O access to the storage disk. When merging data parts, ClickHouse calculates the total storage volume of all the data to be merged. If the volume exceeds min_merge_bytes_to_use_direct_io
bytes, ClickHouse reads and writes the data to the storage disk using the direct I/O interface (O_DIRECT
option). If min_merge_bytes_to_use_direct_io = 0
, then direct I/O is disabled. Default value: 10 * 1024 * 1024 * 1024
bytes.
merge_with_ttl_timeout
merge_with_ttl_timeout
— Minimum delay in seconds before repeating a merge with delete TTL. Default value: 14400
seconds (4 hours).
merge_with_recompression_ttl_timeout
merge_with_recompression_ttl_timeout
— Minimum delay in seconds before repeating a merge with recompression TTL. Default value: 14400
seconds (4 hours).
try_fetch_recompressed_part_timeout
try_fetch_recompressed_part_timeout
— Timeout (in seconds) before starting merge with recompression. During this time ClickHouse tries to fetch recompressed part from replica which assigned this merge with recompression. Default value: 7200
seconds (2 hours).
write_final_mark
write_final_mark
— Enables or disables writing the final index mark at the end of data part (after the last byte). Default value: 1. Don’t turn it off.
merge_max_block_size
merge_max_block_size
— Maximum number of rows in block for merge operations. Default value: 8192.
storage_policy
storage_policy
— Storage policy. See Using Multiple Block Devices for Data Storage.
min_bytes_for_wide_part
min_bytes_for_wide_part
, min_rows_for_wide_part
— Minimum number of bytes/rows in a data part that can be stored in Wide
format. You can set one, both or none of these settings. See Data Storage.
max_parts_in_total
max_parts_in_total
— Maximum number of parts in all partitions.
max_compress_block_size
max_compress_block_size
— Maximum size of blocks of uncompressed data before compressing for writing to a table. You can also specify this setting in the global settings (see max_compress_block_size setting). The value specified when table is created overrides the global value for this setting.
min_compress_block_size
min_compress_block_size
— Minimum size of blocks of uncompressed data required for compression when writing the next mark. You can also specify this setting in the global settings (see min_compress_block_size setting). The value specified when table is created overrides the global value for this setting.
max_partitions_to_read
`max_partitions_to_read` - Limits the maximum number of partitions that can be accessed in one query. You can also specify `max_partitions_to_read` in the global settings.
Example of Sections Setting
ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity=8192
In the example, we set partitioning by month.
We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each `CounterID` and `EventDate`. If you define a SAMPLE clause when selecting the data, ClickHouse will return an evenly pseudorandom data sample for a subset of users.
The `index_granularity` setting can be omitted because 8192 is the default value.
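Put together, a complete table using these clauses could be sketched as follows. The table name `hits_sampled`, the column list, and the 0.1 sampling ratio are assumptions made for illustration.

```sql
-- Hypothetical table that combines the clauses from the example above.
CREATE TABLE hits_sampled
(
    EventDate Date,
    CounterID UInt32,
    UserID UInt64,
    URL String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))  -- the sampling expression is part of the key
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192;

-- Read data for roughly one tenth of the users and scale the count back up.
SELECT CounterID, count() * 10 AS approx_hits
FROM hits_sampled
SAMPLE 0.1
WHERE EventDate >= '2024-01-01'
GROUP BY CounterID;
```

The multiplication by 10 is a manual correction for the sampling ratio; the result is an estimate, not an exact count.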
Deprecated Method for Creating a Table
:::note
Do not use this method in new projects. If possible, switch old projects to the method described above.
:::
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
...
) ENGINE [=] MergeTree(date-column [, sampling_expression], (primary, key), index_granularity)
MergeTree() Parameters
- `date-column` - The name of a column of the Date type. ClickHouse automatically creates partitions by month based on this column. The partition names are in the "YYYYMM" format.
- `sampling_expression` - An expression for sampling.
- `(primary, key)` - Primary key. Type: Tuple()
- `index_granularity` - The granularity of an index. The number of data rows between the "marks" of an index. The value 8192 is appropriate for most tasks.
Example
MergeTree(EventDate, intHash32(UserID), (CounterID, EventDate, intHash32(UserID)), 8192)
The `MergeTree` engine is configured in the same way as in the example above for the main engine configuration method.
Data Storage
A table consists of data parts sorted by primary key.
When data is inserted in a table, separate data parts are created and each of them is lexicographically sorted by primary key. For example, if the primary key is `(CounterID, Date)`, the data in the part is sorted by `CounterID`, and within each `CounterID`, it is ordered by `Date`.
Data belonging to different partitions is separated into different parts. In the background, ClickHouse merges data parts for more efficient storage. Parts belonging to different partitions are not merged. The merge mechanism does not guarantee that all rows with the same primary key will be in the same data part.
Data parts can be stored in `Wide` or `Compact` format. In `Wide` format each column is stored in a separate file in a filesystem; in `Compact` format all columns are stored in one file. `Compact` format can be used to increase performance of small and frequent inserts.
The storage format is controlled by the `min_bytes_for_wide_part` and `min_rows_for_wide_part` settings of the table engine. If the number of bytes or rows in a data part is less than the corresponding setting's value, the part is stored in `Compact` format. Otherwise it is stored in `Wide` format. If none of these settings is set, data parts are stored in `Wide` format.
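One way to observe which format a part ends up in is to set a threshold explicitly and inspect `system.parts`, as in this sketch. The table `parts_format_demo` and the 10 MiB threshold are hypothetical choices.

```sql
-- Hypothetical table with an explicit byte threshold for the Wide format.
CREATE TABLE parts_format_demo
(
    id UInt64,
    payload String
)
ENGINE = MergeTree
ORDER BY id
SETTINGS min_bytes_for_wide_part = 10485760;  -- parts smaller than 10 MiB are stored as Compact

-- A small insert creates one small part.
INSERT INTO parts_format_demo SELECT number, toString(number) FROM numbers(1000);

-- The part_type column reports the storage format of each part.
SELECT name, part_type, rows, bytes_on_disk
FROM system.parts
WHERE database = currentDatabase() AND table = 'parts_format_demo' AND active;
```

With a part this small, `part_type` is expected to show the compact format; after merges grow a part past the threshold, the merged part would be written in the wide format.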
Each data part is logically divided into granules. A granule is the smallest indivisible data set that ClickHouse reads when selecting data. ClickHouse does not split rows or values, so each granule always contains an integer number of rows. The first row of a granule is marked with the value of the primary key for the row. For each data part, ClickHouse creates an index file that stores the marks. For each column, whether it’s in the primary key or not, ClickHouse also stores the same marks. These marks let you find data directly in column files.
The granule size is restricted by the `index_granularity` and `index_granularity_bytes` settings of the table engine. The number of rows in a granule lies in the `[1, index_granularity]` range, depending on the size of the rows. The size of a granule can exceed `index_granularity_bytes` if the size of a single row is greater than the value of the setting. In this case, the size of the granule equals the size of the row.
Primary Keys and Indexes in Queries
Take the (CounterID, Date)
primary key as an example. In this case, the sorting and index can be illustrated as follows:
Whole data:     [---------------------------------------------]
CounterID:      [aaaaaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
Date:           [1111111222222233331233211111222222333211111112122222223111112223311122333]
Marks:           |      |      |      |      |      |      |      |      |      |      |
                a,1    a,2    a,3    b,3    e,2    e,3    g,1    h,2    i,1    i,3    l,3
Marks numbers:   0      1      2      3      4      5      6      7      8      9      10
If the data query specifies:
- `CounterID in ('a', 'h')`, the server reads the data in the ranges of marks `[0, 3)` and `[6, 8)`.
- `CounterID IN ('a', 'h') AND Date = 3`, the server reads the data in the ranges of marks `[1, 3)` and `[7, 8)`.
- `Date = 3`, the server reads the data in the range of marks `[1, 10]`.
The examples above show that it is always more effective to use an index than a full scan.
A sparse index allows extra data to be read. When reading a single range of the primary key, up to `index_granularity * 2` extra rows in each data block can be read.
Sparse indexes allow you to work with a very large number of table rows, because in most cases, such indexes fit in the computer’s RAM.
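To see how many parts and granules a query would actually read, `EXPLAIN` with the `indexes` option can be used. The sketch below assumes a hypothetical table `counters_demo`; the numbers reported depend entirely on the data in the table.

```sql
-- Hypothetical table sorted by (CounterID, EventDate).
CREATE TABLE counters_demo
(
    CounterID UInt32,
    EventDate Date,
    Value UInt64
)
ENGINE = MergeTree
ORDER BY (CounterID, EventDate);

-- Show which indexes were applied and how many parts and granules
-- remain to be read after index analysis.
EXPLAIN indexes = 1
SELECT count()
FROM counters_demo
WHERE CounterID IN (34, 42) AND EventDate = '2024-01-03';
```

Comparing the selected granule count with the total gives a rough idea of how much extra data the sparse index lets through for a given condition.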
ClickHouse does not require a unique primary key. You can insert multiple rows with the same primary key.
You can use `Nullable`-typed expressions in the `PRIMARY KEY` and `ORDER BY` clauses but it is strongly discouraged. To allow this feature, turn on the allow_nullable_key setting. The NULLS_LAST principle applies for `NULL` values in the `ORDER BY` clause.
Selecting the Primary Key
The number of columns in the primary key is not explicitly limited. Depending on the data structure, you can include more or fewer columns in the primary key. This may:
- Improve the performance of an index.
  If the primary key is `(a, b)`, then adding another column `c` will improve the performance if the following conditions are met:
  - There are queries with a condition on column `c`.
  - Long data ranges (several times longer than the `index_granularity`) with identical values for `(a, b)` are common. In other words, when adding another column allows you to skip quite long data ranges.
- Improve data compression.
  ClickHouse sorts data by primary key, so the higher the consistency, the better the compression.
- Provide additional logic when merging data parts in the CollapsingMergeTree and SummingMergeTree engines.
  In this case it makes sense to specify the sorting key that is different from the primary key.
A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during `SELECT` queries.
You can create a table without a primary key using the `ORDER BY tuple()` syntax. In this case, ClickHouse stores data in the order of insertion. If you want to preserve the data order when inserting data with `INSERT ... SELECT` queries, set max_insert_threads = 1.
To select data in the initial order, use single-threaded `SELECT` queries.
Choosing a Primary Key that Differs from the Sorting Key
It is possible to specify a primary key (an expression with values that are written in the index file for each mark) that is different from the sorting key (an expression for sorting the rows in data parts). In this case the primary key expression tuple must be a prefix of the sorting key expression tuple.
This feature is helpful when using the SummingMergeTree and
AggregatingMergeTree table engines. In a common case when using these engines, the table has two types of columns: dimensions and measures. Typical queries aggregate values of measure columns with arbitrary GROUP BY
and filtering by dimensions. Because SummingMergeTree and AggregatingMergeTree aggregate rows with the same value of the sorting key, it is natural to add all dimensions to it. As a result, the key expression consists of a long list of columns and this list must be frequently updated with newly added dimensions.
In this case it makes sense to leave only a few columns in the primary key that will provide efficient range scans and add the remaining dimension columns to the sorting key tuple.
ALTER of the sorting key is a lightweight operation because when a new column is simultaneously added to the table and to the sorting key, existing data parts do not need to be changed. Since the old sorting key is a prefix of the new sorting key and there is no data in the newly added column, the data is sorted by both the old and new sorting keys at the moment of table modification.
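The following sketch shows a short primary key combined with a longer sorting key, and then extends the sorting key when a new dimension appears. The table `daily_totals` and its columns are hypothetical.

```sql
-- Hypothetical SummingMergeTree table: the primary key is a prefix of the sorting key.
CREATE TABLE daily_totals
(
    EventDate Date,
    CounterID UInt32,
    Region LowCardinality(String),
    Hits UInt64
)
ENGINE = SummingMergeTree
PRIMARY KEY (CounterID, EventDate)           -- kept short for efficient range scans
ORDER BY (CounterID, EventDate, Region);     -- all dimensions, so rows are summed per dimension set

-- Add a new dimension and append it to the sorting key in the same ALTER.
-- Existing parts are not rewritten because the old key is a prefix of the new one.
ALTER TABLE daily_totals
    ADD COLUMN Browser LowCardinality(String),
    MODIFY ORDER BY (CounterID, EventDate, Region, Browser);
```

The primary key stays unchanged by the `ALTER`, so the index file format of existing parts is unaffected.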
Use of Indexes and Partitions in Queries
For `SELECT` queries, ClickHouse analyzes whether an index can be used. An index can be used if the `WHERE/PREWHERE` clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has `IN` or `LIKE` with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions.
Thus, it is possible to quickly run queries on one or many ranges of the primary key. In this example, queries will be fast when run for a specific tracking tag, for a specific tag and date range, for a specific tag and date, for multiple tags with a date range, and so on.
Let’s look at the engine configured as follows:
ENGINE MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate)
SETTINGS index_granularity=8192
In this case, in queries:
SELECT count() FROM table
WHERE EventDate = toDate(now())
AND CounterID = 34
SELECT count() FROM table
WHERE EventDate = toDate(now())
AND (CounterID = 34 OR CounterID = 42)
SELECT count() FROM table
WHERE ((EventDate >= toDate('2014-01-01')
AND EventDate <= toDate('2014-01-31')) OR EventDate = toDate('2014-05-01'))
AND CounterID IN (101500, 731962, 160656)
AND (CounterID = 101500 OR EventDate != toDate('2014-05-01'))
ClickHouse will use the primary key index to trim improper data and the monthly partitioning key to trim partitions that are in improper date ranges.
The queries above show that the index is used even for complex expressions. Reading from the table is organized so that using the index can’t be slower than a full scan.
In the example below, the index can’t be used.
SELECT count() FROM table WHERE CounterID = 34 OR URL LIKE '%upyachka%'
To check whether ClickHouse can use the index when running a query, use the settings force_index_by_date and force_primary_key.
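A minimal sketch of such a check is shown below. The table `hits_idx_check` is hypothetical; the settings are applied per query through the `SETTINGS` clause.

```sql
-- Hypothetical table matching the engine configuration discussed above.
CREATE TABLE hits_idx_check
(
    EventDate Date,
    CounterID UInt32,
    URL String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);

-- The condition is on the primary key prefix, so the query is allowed to run.
SELECT count()
FROM hits_idx_check
WHERE CounterID = 34
SETTINGS force_primary_key = 1;

-- The condition cannot use the primary key, so with force_primary_key = 1
-- the query is expected to be rejected with an exception instead of
-- silently running a full scan.
SELECT count()
FROM hits_idx_check
WHERE URL LIKE '%upyachka%'
SETTINGS force_primary_key = 1;
```

`force_index_by_date` can be used in the same way to require that the partitioning key by date restricts the read.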
The key for partitioning by month allows reading only those data blocks which contain dates from the proper range. In this case, the data block may contain data for many dates (up to an entire month). Within a block, data is sorted by primary key, which might not contain the date as the first column. Because of this, using a query with only a date condition that does not specify the primary key prefix will cause more data to be read than for a single date.
Use of Index for Partially-monotonic Primary Keys
Consider, for example, the days of the month. They form a monotonic sequence for one month, but not monotonic for more extended periods. This is a partially-monotonic sequence. If a user creates the table with partially-monotonic primary key, ClickHouse creates a sparse index as usual. When a user selects data from this kind of table, ClickHouse analyzes the query conditions. If the user wants to get data between two marks of the index and both these marks fall within one month, ClickHouse can use the index in this particular case because it can calculate the distance between the parameters of a query and index marks.
ClickHouse cannot use an index if the values of the primary key in the query parameter range do not represent a monotonic sequence. In this case, ClickHouse uses the full scan method.
ClickHouse uses this logic not only for days of the month sequences, but for any primary key that represents a partially-monotonic sequence.
Data Skipping Indexes
The index declaration is in the columns section of the CREATE
query.
INDEX index_name expr TYPE type(...) [GRANULARITY granularity_value]
For tables from the `*MergeTree` family, data skipping indices can be specified.
These indices aggregate some information about the specified expression on blocks, which consist of `granularity_value` granules (the size of the granule is specified using the `index_granularity` setting in the table engine). These aggregates are then used in `SELECT` queries to reduce the amount of data read from disk by skipping big blocks of data where the `WHERE` condition cannot be satisfied.
The `GRANULARITY` clause can be omitted; the default value of `granularity_value` is 1.
Example
CREATE TABLE table_name
(
u64 UInt64,
i32 Int32,
s String,
...
INDEX idx1 u64 TYPE bloom_filter GRANULARITY 3,
INDEX idx2 u64 * i32 TYPE minmax GRANULARITY 3,
INDEX idx3 u64 * length(s) TYPE set(1000) GRANULARITY 4
) ENGINE = MergeTree()
...
Indices from the example can be used by ClickHouse to reduce the amount of data to read from disk in the following queries:
SELECT count() FROM table WHERE u64 == 10;
SELECT count() FROM table WHERE u64 * i32 >= 1234;
SELECT count() FROM table WHERE u64 * length(s) == 1234;
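A skipping index can also be added to an existing table. In the sketch below, the table `skip_idx_demo` and the index name `idx_prod` are hypothetical; the index is built for existing parts only after it is materialized.

```sql
-- Hypothetical table without any skipping index yet.
CREATE TABLE skip_idx_demo
(
    u64 UInt64,
    i32 Int32,
    s String
)
ENGINE = MergeTree
ORDER BY u64;

-- Declare the index; by itself this affects only newly written parts.
ALTER TABLE skip_idx_demo ADD INDEX idx_prod u64 * i32 TYPE minmax GRANULARITY 3;

-- Build the index for parts that already exist.
ALTER TABLE skip_idx_demo MATERIALIZE INDEX idx_prod;

-- Check whether the index is applied to a query on the indexed expression.
EXPLAIN indexes = 1
SELECT count() FROM skip_idx_demo WHERE u64 * i32 >= 1234;
```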
Data skipping indexes can also be created on composite columns:
-- on columns of type Map:
INDEX map_key_index mapKeys(map_column) TYPE bloom_filter
INDEX map_value_index mapValues(map_column) TYPE bloom_filter
-- on columns of type Tuple:
INDEX tuple_1_index tuple_column.1 TYPE bloom_filter
INDEX tuple_2_index tuple_column.2 TYPE bloom_filter
-- on columns of type Nested:
INDEX nested_1_index col.nested_col1 TYPE bloom_filter
INDEX nested_2_index col.nested_col2 TYPE bloom_filter
Available Types of Indices
MinMax
Stores extremes of the specified expression (if the expression is tuple
, then it stores extremes for each element of tuple
), uses stored info for skipping blocks of data like the primary key.
Syntax: minmax
Set
Stores unique values of the specified expression (no more than max_rows
rows, max_rows=0
means “no limits”). Uses the values to check if the WHERE
expression is not satisfiable on a block of data.
Syntax: set(max_rows)
Bloom Filter
Stores a Bloom filter for the specified columns. An optional false_positive
parameter with possible values between 0 and 1 specifies the probability of receiving a false positive response from the filter. Default value: 0.025. Supported data types: Int*
, UInt*
, Float*
, Enum
, Date
, DateTime
, String
, FixedString
, Array
, LowCardinality
, Nullable
, UUID
and Map
. For the Map
data type, the client can specify if the index should be created for keys or values using mapKeys or mapValues function.
Syntax: bloom_filter([false_positive])
N-gram Bloom Filter
Stores a Bloom filter that contains all n-grams from a block of data. Only works with datatypes: String, FixedString and Map. Can be used for optimization of EQUALS
, LIKE
and IN
expressions.
Syntax: ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)
- `n` - ngram size.
- `size_of_bloom_filter_in_bytes` - Bloom filter size in bytes (you can use large values here, for example, 256 or 512, because it can be compressed well).
- `number_of_hash_functions` - The number of hash functions used in the Bloom filter.
- `random_seed` - The seed for Bloom filter hash functions.
Users can create UDFs to estimate the parameter set of `ngrambf_v1`. The query statements are as follows:
CREATE FUNCTION bfEstimateFunctions [ON CLUSTER cluster]
AS
(total_number_of_all_grams, size_of_bloom_filter_in_bits) -> round((size_of_bloom_filter_in_bits / total_number_of_all_grams) * log(2));
CREATE FUNCTION bfEstimateBmSize [ON CLUSTER cluster]
AS
(total_number_of_all_grams, probability_of_false_positives) -> ceil((total_number_of_all_grams * log(probability_of_false_positives)) / log(1 / pow(2, log(2))));
CREATE FUNCTION bfEstimateFalsePositive [ON CLUSTER cluster]
AS
(total_number_of_all_grams, number_of_hash_functions, size_of_bloom_filter_in_bytes) -> pow(1 - exp(-number_of_hash_functions/ (size_of_bloom_filter_in_bytes / total_number_of_all_grams)), number_of_hash_functions);
CREATE FUNCTION bfEstimateGramNumber [ON CLUSTER cluster]
AS
(number_of_hash_functions, probability_of_false_positives, size_of_bloom_filter_in_bytes) -> ceil(size_of_bloom_filter_in_bytes / (-number_of_hash_functions / log(1 - exp(log(probability_of_false_positives) / number_of_hash_functions))));
To use these functions, you need to specify at least two parameters. For example, if there are 4300 ngrams in the granule and the false positive rate should be less than 0.0001, the other parameters can be estimated by executing the following queries:
--- estimate number of bits in the filter
SELECT bfEstimateBmSize(4300, 0.0001) / 8 as size_of_bloom_filter_in_bytes;
┌─size_of_bloom_filter_in_bytes─┐
│ 10304 │
└───────────────────────────────┘
--- estimate number of hash functions
SELECT bfEstimateFunctions(4300, bfEstimateBmSize(4300, 0.0001)) as number_of_hash_functions
┌─number_of_hash_functions─┐
│ 13 │
└──────────────────────────┘
You can also use these functions to estimate parameters under other conditions. The functions refer to the content here.
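The estimates above could then be plugged into an index definition, as in this sketch. The table `ngram_demo`, the ngram size of 3, and the seed of 0 are assumptions; only the filter size (10304 bytes) and the number of hash functions (13) come from the estimates.

```sql
-- Hypothetical table using the estimated Bloom filter parameters.
CREATE TABLE ngram_demo
(
    id UInt64,
    s String,
    -- ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)
    INDEX idx_s_ngram s TYPE ngrambf_v1(3, 10304, 13, 0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY id;

-- A substring search that the ngram index can help with, because the
-- constant 'test' is at least as long as the ngram size.
SELECT count() FROM ngram_demo WHERE s LIKE '%test%';
```

The estimate of 4300 ngrams per granule is itself an assumption about the data, so the parameters are worth rechecking against real column contents.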
Token Bloom Filter
The same as ngrambf_v1
, but stores tokens instead of ngrams. Tokens are sequences separated by non-alphanumeric characters.
Syntax: tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)
Special-purpose
- Experimental indexes to support approximate nearest neighbor (ANN) search. See here for details.
- An experimental inverted index to support full-text search. See here for details.
Functions Support
Conditions in the `WHERE` clause contain calls of functions that operate on columns. If the column is part of an index, ClickHouse tries to use this index when evaluating the functions. ClickHouse supports different subsets of functions for using indexes.
Indexes of type `set` can be utilized by all functions. The other index types are supported as follows:
Function (operator) / Index | primary key | minmax | ngrambf_v1 | tokenbf_v1 | bloom_filter | inverted |
---|---|---|---|---|---|---|
equals (=, ==) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
notEquals (!=, <>) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
like | ✔ | ✔ | ✔ | ✔ | ✗ | ✔ |
notLike | ✔ | ✔ | ✔ | ✔ | ✗ | ✔ |
match | ✗ | ✗ | ✔ | ✔ | ✗ | ✔ |
startsWith | ✔ | ✔ | ✔ | ✔ | ✗ | ✔ |
endsWith | ✗ | ✗ | ✔ | ✔ | ✗ | ✔ |
multiSearchAny | ✗ | ✗ | ✔ | ✗ | ✗ | ✔ |
in | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
notIn | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
less (<) | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
greater (>) | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
lessOrEquals (<=) | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
greaterOrEquals (>=) | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
empty | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
notEmpty | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
has | ✗ | ✗ | ✔ | ✔ | ✔ | ✔ |
hasAny | ✗ | ✗ | ✔ | ✔ | ✔ | ✗ |
hasAll | ✗ | ✗ | ✗ | ✗ | ✔ | ✗ |
hasToken | ✗ | ✗ | ✗ | ✔ | ✗ | ✔ |
hasTokenOrNull | ✗ | ✗ | ✗ | ✔ | ✗ | ✔ |
hasTokenCaseInsensitive (*) | ✗ | ✗ | ✗ | ✔ | ✗ | ✗ |
hasTokenCaseInsensitiveOrNull (*) | ✗ | ✗ | ✗ | ✔ | ✗ | ✗ |
Functions with a constant argument that is less than ngram size can’t be used by ngrambf_v1
for query optimization.
(*) For hasTokenCaseInsensitive
and hasTokenCaseInsensitiveOrNull
to be effective, the tokenbf_v1
index must be created on lowercased data, for example INDEX idx (lower(str_col)) TYPE tokenbf_v1(512, 3, 0)
.
:::note
Bloom filters can have false positive matches, so the `ngrambf_v1`, `tokenbf_v1`, and `bloom_filter` indexes cannot be used for optimizing queries where the result of a function is expected to be false.
For example:
- Can be optimized:
  - `s LIKE '%test%'`
  - `NOT s NOT LIKE '%test%'`
  - `s = 1`
  - `NOT s != 1`
  - `startsWith(s, 'test')`
- Cannot be optimized:
  - `NOT s LIKE '%test%'`
  - `s NOT LIKE '%test%'`
  - `NOT s = 1`
  - `s != 1`
  - `NOT startsWith(s, 'test')`
:::
Projections
Projections are like materialized views but are defined at the part level. They provide consistency guarantees along with automatic usage in queries.
:::note
When you are implementing projections you should also consider the force_optimize_projection setting.
:::
Projections are not supported in `SELECT` statements with the FINAL modifier.
Projection Query
A projection query is what defines a projection. It implicitly selects data from the parent table.
Syntax
SELECT <column list expr> [GROUP BY] <group keys expr> [ORDER BY] <expr>
Projections can be modified or dropped with the ALTER statement.
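As an illustration, the sketch below defines one projection at table creation and adds another with `ALTER`. The table `visits_proj`, its columns, and the projection names are hypothetical.

```sql
-- Hypothetical table with an aggregating projection declared up front.
CREATE TABLE visits_proj
(
    user_id UInt64,
    url String,
    duration UInt32,
    PROJECTION duration_by_user
    (
        SELECT user_id, sum(duration) GROUP BY user_id
    )
)
ENGINE = MergeTree
ORDER BY url;

-- Add a second projection that stores the rows in a different order.
ALTER TABLE visits_proj ADD PROJECTION by_user_order (SELECT * ORDER BY user_id);

-- Build the new projection for parts that already exist.
ALTER TABLE visits_proj MATERIALIZE PROJECTION by_user_order;

-- A query of this shape is a candidate for being answered from duration_by_user.
SELECT user_id, sum(duration) FROM visits_proj GROUP BY user_id;
```

Whether a projection is actually chosen for a query depends on the query analysis and on the optimizer settings in effect.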
Projection Storage
Projections are stored inside the part directory. A projection is similar to an index but contains a subdirectory that stores an anonymous `MergeTree` table's part. The table is induced by the definition query of the projection. If there is a `GROUP BY` clause, the underlying storage engine becomes AggregatingMergeTree, and all aggregate functions are converted to `AggregateFunction`. If there is an `ORDER BY` clause, the `MergeTree` table uses it as its primary key expression. During the merge process the projection part is merged via its storage's merge routine. The checksum of the parent table's part is combined with the projection's part. Other maintenance jobs are similar to skip indices.
Query Analysis
- Check if the projection can be used to answer the given query, that is, it generates the same answer as querying the base table.
- Select the best feasible match, which contains the least granules to read.
- The query pipeline which uses projections will be different from the one that uses the original parts. If the projection is absent in some parts, we can add the pipeline to "project" it on the fly.
Concurrent Data Access
For concurrent table access, we use multi-versioning. In other words, when a table is simultaneously read and updated, data is read from a set of parts that is current at the time of the query. There are no lengthy locks. Inserts do not get in the way of read operations.
Reading from a table is automatically parallelized.
TTL for Columns and Tables
Determines the lifetime of values.
The `TTL` clause can be set for the whole table and for each individual column. Table-level `TTL` can also specify the logic of automatically moving data between disks and volumes, or recompressing parts where all the data has expired.
Expressions must evaluate to the Date or DateTime data type.
Syntax
Setting time-to-live for a column:
TTL time_column
TTL time_column + interval
To define interval
, use time interval operators, for example:
TTL date_time + INTERVAL 1 MONTH
TTL date_time + INTERVAL 15 HOUR
Column TTL
When the values in the column expire, ClickHouse replaces them with the default values for the column data type. If all the column values in the data part expire, ClickHouse deletes this column from the data part in the filesystem.
The TTL clause can’t be used for key columns.
Examples
Creating a table with TTL:
CREATE TABLE tab
(
d DateTime,
a Int TTL d + INTERVAL 1 MONTH,
b Int TTL d + INTERVAL 1 MONTH,
c String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d;
Adding TTL to a column of an existing table
ALTER TABLE tab
MODIFY COLUMN
c String TTL d + INTERVAL 1 DAY;
Altering TTL of the column
ALTER TABLE tab
MODIFY COLUMN
c String TTL d + INTERVAL 1 MONTH;
Table TTL
A table can have one expression for removal of expired rows and multiple expressions for automatically moving parts between disks or volumes. When rows in the table expire, ClickHouse deletes all corresponding rows. For moving or recompressing parts, all rows of a part must satisfy the TTL expression criteria.
TTL expr
[DELETE|RECOMPRESS codec_name1|TO DISK 'xxx'|TO VOLUME 'xxx'][, DELETE|RECOMPRESS codec_name2|TO DISK 'aaa'|TO VOLUME 'bbb'] ...
[WHERE conditions]
[GROUP BY key_expr [SET v1 = aggr_func(v1) [, v2 = aggr_func(v2) ...]] ]
A TTL rule type may follow each TTL expression. It determines the action to perform once the expression is satisfied (reaches the current time):
- DELETE - delete expired rows (default action);
- RECOMPRESS codec_name - recompress data part with the codec_name;
- TO DISK 'aaa' - move part to the disk aaa;
- TO VOLUME 'bbb' - move part to the volume bbb;
- GROUP BY - aggregate expired rows.
The DELETE action can be used together with a WHERE clause to delete only some of the expired rows based on a filtering condition:
TTL time_column + INTERVAL 1 MONTH DELETE WHERE column = 'value'
The GROUP BY expression must be a prefix of the table primary key.
If a column is not part of the GROUP BY expression and is not set explicitly in the SET clause, in the result row it contains an arbitrary value from the grouped rows (as if the aggregate function any is applied to it).
Examples
Creating a table with TTL:
CREATE TABLE tab
(
d DateTime,
a Int
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 1 MONTH DELETE,
d + INTERVAL 1 WEEK TO VOLUME 'aaa',
d + INTERVAL 2 WEEK TO DISK 'bbb';
Altering TTL of the table:
ALTER TABLE tab
MODIFY TTL d + INTERVAL 1 DAY;
Creating a table, where the rows are expired after one month. The expired rows where dates are Mondays are deleted:
CREATE TABLE table_with_where
(
d DateTime,
a Int
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 1 MONTH DELETE WHERE toDayOfWeek(d) = 1;
Creating a table, where expired rows are recompressed:
CREATE TABLE table_for_recompression
(
d DateTime,
key UInt64,
value String
) ENGINE MergeTree()
ORDER BY tuple()
PARTITION BY key
TTL d + INTERVAL 1 MONTH RECOMPRESS CODEC(ZSTD(17)), d + INTERVAL 1 YEAR RECOMPRESS CODEC(LZ4HC(10))
SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0;
Creating a table, where expired rows are aggregated. In result rows x contains the maximum value across the grouped rows, y contains the minimum value, and d contains an arbitrary value from the grouped rows.
CREATE TABLE table_for_aggregation
(
d DateTime,
k1 Int,
k2 Int,
x Int,
y Int
)
ENGINE = MergeTree
ORDER BY (k1, k2)
TTL d + INTERVAL 1 MONTH GROUP BY k1, k2 SET x = max(x), y = min(y);
Removing Expired Data
Data with an expired TTL is removed when ClickHouse merges data parts.
When ClickHouse detects that data is expired, it performs an off-schedule merge. To control the frequency of such merges, you can set merge_with_ttl_timeout. If the value is too low, ClickHouse will perform many off-schedule merges that may consume a lot of resources.
If you perform a SELECT query between merges, you may get expired data. To avoid this, use the OPTIMIZE query before SELECT.
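A minimal sketch of the two controls mentioned above, reusing the table tab from the earlier examples. The timeout value is an arbitrary example, not a recommendation.

```sql
-- Force an unscheduled merge so that expired rows are removed before reading.
-- FINAL merges even when the data is already in a single part, which can be costly.
OPTIMIZE TABLE tab FINAL;

SELECT count() FROM tab;

-- Limit how often TTL merges are repeated for this table
-- (value in seconds; 3600 is only an example).
ALTER TABLE tab MODIFY SETTING merge_with_ttl_timeout = 3600;
```

Running OPTIMIZE before every SELECT is expensive on large tables, so this pattern suits occasional checks rather than routine queries.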
See Also
- ttl_only_drop_parts setting
Disk types
In addition to local block devices, ClickHouse supports these storage types:
- s3 for S3 and MinIO
- gcs for GCS
- azure_blob_storage for Azure Blob Storage
- hdfs for HDFS
- web for read-only from web
- cache for local caching
- s3_plain for backups to S3
Using Multiple Block Devices for Data Storage
Introduction
MergeTree family table engines can store data on multiple block devices. For example, this can be useful when the data of a certain table is implicitly split into “hot” and “cold”. The most recent data is regularly requested but requires only a small amount of space. In contrast, the fat-tailed historical data is requested rarely. If several disks are available, the “hot” data may be located on fast disks (for example, NVMe SSDs or in memory), while the “cold” data may be kept on relatively slow ones (for example, HDD).
A data part is the minimum movable unit for MergeTree-engine tables. The data belonging to one part is stored on one disk. Data parts can be moved between disks in the background (according to user settings) as well as by means of ALTER queries.
Terms
- Disk — Block device mounted to the filesystem.
- Default disk — Disk that stores the path specified in the path server setting.
- Volume — Ordered set of equal disks (similar to JBOD).
- Storage policy — Set of volumes and the rules for moving data between them.
The names given to the described entities can be found in the system tables system.storage_policies and system.disks. To apply one of the configured storage policies to a table, use the storage_policy setting of MergeTree-engine family tables.
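For instance, the configured disks and policies can be listed from the two system tables named above. This is a small sketch; the selected columns are the ones exposed in recent ClickHouse versions and may differ in older releases.

```sql
-- Disks known to the server, with their free and reserved space in bytes.
SELECT name, path, free_space, total_space, keep_free_space
FROM system.disks;

-- One row per volume of each storage policy, with the disks it contains.
SELECT policy_name, volume_name, disks, max_data_part_size, move_factor
FROM system.storage_policies;
```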
Configuration
Disks, volumes and storage policies should be declared inside the <storage_configuration> tag in a file in the config.d directory.
:::tip
Disks can also be declared in the SETTINGS
section of a query. This is useful
for ad-hoc analysis to temporarily attach a disk that is, for example, hosted at a URL.
See dynamic storage for more details.
:::
Configuration structure:
<storage_configuration>
<disks>
<disk_name_1> <!-- disk name -->
<path>/mnt/fast_ssd/clickhouse/</path>
</disk_name_1>
<disk_name_2>
<path>/mnt/hdd1/clickhouse/</path>
<keep_free_space_bytes>10485760</keep_free_space_bytes>
</disk_name_2>
<disk_name_3>
<path>/mnt/hdd2/clickhouse/</path>
<keep_free_space_bytes>10485760</keep_free_space_bytes>
</disk_name_3>
...
</disks>
...
</storage_configuration>
Tags:
- <disk_name_N> - Disk name. Names must be different for all disks.
- path - path under which a server will store data (data and shadow folders); it should be terminated with ‘/’.
- keep_free_space_bytes - the amount of free disk space to be reserved.
The order of the disk definition is not important.
Storage policies configuration markup:
<storage_configuration>
...
<policies>
<policy_name_1>
<volumes>
<volume_name_1>
<disk>disk_name_from_disks_configuration</disk>
<max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
<load_balancing>round_robin</load_balancing>
</volume_name_1>
<volume_name_2>
<!-- configuration -->
</volume_name_2>
<!-- more volumes -->
</volumes>
<move_factor>0.2</move_factor>
</policy_name_1>
<policy_name_2>
<!-- configuration -->
</policy_name_2>
<!-- more policies -->
</policies>
...
</storage_configuration>
Tags:
- policy_name_N - Policy name. Policy names must be unique.
- volume_name_N - Volume name. Volume names must be unique.
- disk - a disk within a volume.
- max_data_part_size_bytes - the maximum size of a part that can be stored on any of the volume’s disks. If the size of a merged part is estimated to be bigger than max_data_part_size_bytes, then this part will be written to the next volume. Basically this feature allows keeping new/small parts on a hot (SSD) volume and moving them to a cold (HDD) volume when they reach a large size. Do not use this setting if your policy has only one volume.
- move_factor - when the amount of available space gets lower than this factor, data automatically starts to move to the next volume, if any (by default, 0.1). ClickHouse sorts existing parts by size from largest to smallest (in descending order) and selects parts with a total size that is sufficient to meet the move_factor condition. If the total size of all parts is insufficient, all parts will be moved.
- perform_ttl_move_on_insert - Disables TTL move on data part INSERT. By default (if enabled), if we insert a data part that has already expired by the TTL move rule, it immediately goes to the volume/disk declared in the move rule. This can significantly slow down inserts if the destination volume/disk is slow (e.g. S3). If disabled, an already expired data part is written to the default volume and then immediately moved to the TTL volume.
- load_balancing - Policy for disk balancing, round_robin or least_used.
- least_used_ttl_ms - Timeout (in milliseconds) for updating the available space on all disks (0 - always update, -1 - never update, default is 60000). Note: if the disk is used by ClickHouse only and is not subject to online filesystem resize/shrink, you can use -1; in all other cases this is not recommended, since it will eventually lead to incorrect space distribution.
- prefer_not_to_merge - You should not use this setting. Disables merging of data parts on this volume (this is harmful and leads to performance degradation). When this setting is enabled (don't do it), merging data on this volume is not allowed (which is bad). This allows (but you don't need it) controlling (if you want to control something, you're making a mistake) how ClickHouse works with slow disks (but ClickHouse knows better, so please don't use this setting).
- volume_priority - Defines the priority (order) in which volumes are filled. A lower value means a higher priority. The parameter values should be natural numbers and collectively cover the range from 1 to N (lowest priority given) without skipping any numbers.
  - If all volumes are tagged, they are prioritized in the given order.
  - If only some volumes are tagged, those without the tag have the lowest priority, and they are prioritized in the order they are defined in the config.
  - If no volumes are tagged, their priority corresponds to the order in which they are declared in the configuration.
  - Two volumes cannot have the same priority value.
Configuration examples:
<storage_configuration>
...
<policies>
<hdd_in_order> <!-- policy name -->
<volumes>
<single> <!-- volume name -->
<disk>disk1</disk>
<disk>disk2</disk>
</single>
</volumes>
</hdd_in_order>
<moving_from_ssd_to_hdd>
<volumes>
<hot>
<disk>fast_ssd</disk>
<max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
</hot>
<cold>
<disk>disk1</disk>
</cold>
</volumes>
<move_factor>0.2</move_factor>
</moving_from_ssd_to_hdd>
<small_jbod_with_external_no_merges>
<volumes>
<main>
<disk>jbod1</disk>
</main>
<external>
<disk>external</disk>
</external>
</volumes>
</small_jbod_with_external_no_merges>
</policies>
...
</storage_configuration>
In the given example, the hdd_in_order policy implements the round-robin approach. This policy defines only one volume (single), and the data parts are stored on all its disks in circular order. Such a policy can be quite useful if several similar disks are mounted to the system but RAID is not configured. Keep in mind that each individual disk drive is not reliable, and you might want to compensate for that with a replication factor of 3 or more.
If there are different kinds of disks available in the system, the moving_from_ssd_to_hdd policy can be used instead. The volume hot consists of an SSD disk (fast_ssd), and the maximum size of a part that can be stored on this volume is 1 GiB (1073741824 bytes). All parts larger than 1 GiB are stored directly on the cold volume, which contains an HDD disk, disk1.
Also, once the disk fast_ssd is more than 80% full, data is transferred to disk1 by a background process.
The order of volume enumeration within a storage policy is important if at least one of the listed volumes has no explicit volume_priority parameter.
Once a volume is overfilled, data is moved to the next one. The order of disk enumeration is important as well, because data is stored on the disks in turn.
When creating a table, one can apply one of the configured storage policies to it:
CREATE TABLE table_with_non_default_policy (
EventDate Date,
OrderID UInt64,
BannerID UInt64,
SearchPhrase String
) ENGINE = MergeTree
ORDER BY (OrderID, BannerID)
PARTITION BY toYYYYMM(EventDate)
SETTINGS storage_policy = 'moving_from_ssd_to_hdd'
The default storage policy implies using only one volume, which consists of only one disk given in <path>.
You can change the storage policy after table creation with an ALTER TABLE ... MODIFY SETTING query; the new policy should include all old disks and volumes with the same names.
The number of threads performing background moves of data parts can be changed by background_move_pool_size setting.
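As a sketch of changing the policy on an existing table: the policy name moving_from_ssd_to_hdd_v2 below is hypothetical and would have to be declared in the server configuration first, including every disk and volume of the current policy under the same names.

```sql
-- 'moving_from_ssd_to_hdd_v2' is a hypothetical policy name used for illustration.
ALTER TABLE table_with_non_default_policy
    MODIFY SETTING storage_policy = 'moving_from_ssd_to_hdd_v2';

-- Check which policy the table uses now.
SELECT name, storage_policy
FROM system.tables
WHERE name = 'table_with_non_default_policy';
```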
Dynamic Storage
This example query shows how to attach a table stored at a URL and configure the remote storage within the query. The web storage is not configured in the ClickHouse configuration files; all the settings are in the CREATE/ATTACH query.
:::note
The example uses type=web, but any disk type can be configured as dynamic, even a local disk. Local disks require the path argument to be inside the server config parameter custom_local_disks_base_directory, which has no default, so set that as well when using a local disk.
:::
Example dynamic web storage
:::tip
A demo dataset is hosted in GitHub. To prepare your own tables for web storage see the tool clickhouse-static-files-uploader
:::
In this ATTACH TABLE query the UUID provided matches the directory name of the data, and the endpoint is the URL for the raw GitHub content.
ATTACH TABLE uk_price_paid UUID 'cf712b4f-2ca8-435c-ac23-c4393efe52f7'
(
price UInt32,
date Date,
postcode1 LowCardinality(String),
postcode2 LowCardinality(String),
type Enum8('other' = 0, 'terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4),
is_new UInt8,
duration Enum8('unknown' = 0, 'freehold' = 1, 'leasehold' = 2),
addr1 String,
addr2 String,
street LowCardinality(String),
locality LowCardinality(String),
town LowCardinality(String),
district LowCardinality(String),
county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)
SETTINGS disk = disk(
type=web,
endpoint='https://raw.githubusercontent.com/ClickHouse/web-tables-demo/main/web/'
);
Nested Dynamic Storage
This example query builds on the above dynamic disk configuration and shows how to use a local disk to cache data from a table stored at a URL. Neither the cache disk nor the web storage is configured in the ClickHouse configuration files; both are configured in the CREATE/ATTACH query settings.
In the settings below, notice that the disk of type=web is nested within the disk of type=cache.
ATTACH TABLE uk_price_paid UUID 'cf712b4f-2ca8-435c-ac23-c4393efe52f7'
(
price UInt32,
date Date,
postcode1 LowCardinality(String),
postcode2 LowCardinality(String),
type Enum8('other' = 0, 'terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4),
is_new UInt8,
duration Enum8('unknown' = 0, 'freehold' = 1, 'leasehold' = 2),
addr1 String,
addr2 String,
street LowCardinality(String),
locality LowCardinality(String),
town LowCardinality(String),
district LowCardinality(String),
county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)
SETTINGS disk = disk(
type=cache,
max_size='1Gi',
path='/var/lib/clickhouse/custom_disk_cache/',
disk=disk(
type=web,
endpoint='https://raw.githubusercontent.com/ClickHouse/web-tables-demo/main/web/'
)
);
Details
In the case of MergeTree tables, data reaches disk in different ways:
- As a result of an insert (INSERT query).
- During background merges and mutations.
- When downloading from another replica.
- As a result of partition freezing ALTER TABLE … FREEZE PARTITION.
In all these cases except for mutations and partition freezing, a part is stored on a volume and a disk according to the given storage policy:
- The first volume (in the order of definition) that has enough disk space for storing a part (unreserved_space > current_part_size) and allows for storing parts of a given size (max_data_part_size_bytes > current_part_size) is chosen.
- Within this volume, the disk chosen is the one that follows the disk used for storing the previous chunk of data and that has more free space than the part size (unreserved_space - keep_free_space_bytes > current_part_size).
Under the hood, mutations and partition freezing make use of hard links. Hard links between different disks are not supported, therefore in such cases the resulting parts are stored on the same disks as the initial ones.
In the background, parts are moved between volumes on the basis of the amount of free space (move_factor parameter) according to the order the volumes are declared in the configuration file.
Data is never transferred from the last volume, nor into the first one. One may use the system tables system.part_log (field type = MOVE_PART) and system.parts (fields path and disk) to monitor background moves. Detailed information can also be found in server logs.
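A hedged sketch of such monitoring queries follows. It assumes that part_log is enabled in the server configuration and uses the column names found in recent ClickHouse versions, where the move event appears as event_type = 'MovePart' and system.parts exposes the disk as disk_name. If your version differs, check the columns with DESCRIBE TABLE first.

```sql
-- Recent part moves (assumes part_log is enabled on the server).
SELECT event_time, table, part_name, path_on_disk
FROM system.part_log
WHERE event_type = 'MovePart'
ORDER BY event_time DESC
LIMIT 10;

-- Current location of the active parts of one table.
SELECT name, disk_name, path
FROM system.parts
WHERE table = 'table_with_non_default_policy' AND active;
```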
User can force moving a part or a partition from one volume to another using the query ALTER TABLE … MOVE PART|PARTITION … TO VOLUME|DISK …, all the restrictions for background operations are taken into account. The query initiates a move on its own and does not wait for background operations to be completed. User will get an error message if not enough free space is available or if any of the required conditions are not met.
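For example, with the table and the moving_from_ssd_to_hdd policy from the earlier configuration, a manual move could look like the sketch below. The partition value and the part name are hypothetical; real values can be taken from system.parts.

```sql
-- Move one partition to the 'cold' volume of the table's storage policy.
-- 202401 is an example partition value for PARTITION BY toYYYYMM(EventDate).
ALTER TABLE table_with_non_default_policy
    MOVE PARTITION 202401 TO VOLUME 'cold';

-- Move a single part to a specific disk of the policy.
-- The part name below is made up for illustration.
ALTER TABLE table_with_non_default_policy
    MOVE PART '202401_1_1_0' TO DISK 'disk1';
```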
Moving data does not interfere with data replication. Therefore, different storage policies can be specified for the same table on different replicas.
After the completion of background merges and mutations, old parts are removed only after a certain amount of time (old_parts_lifetime).
During this time, they are not moved to other volumes or disks. Therefore, until the parts are finally removed, they are still taken into account for evaluation of the occupied disk space.
User can assign new big parts to different disks of a JBOD volume in a balanced way using the min_bytes_to_rebalance_partition_over_jbod setting.
Using S3 for Data Storage
:::note
Google Cloud Storage (GCS) is also supported using the type s3. See GCS backed MergeTree.
:::
MergeTree family table engines can store data to S3 using a disk with type s3.
Configuration markup:
<storage_configuration>
...
<disks>
<s3>
<type>s3</type>
<support_batch_delete>true</support_batch_delete>
<endpoint>https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/root-path/</endpoint>
<access_key_id>your_access_key_id</access_key_id>
<secret_access_key>your_secret_access_key</secret_access_key>
<region></region>
<header>Authorization: Bearer SOME-TOKEN</header>
<server_side_encryption_customer_key_base64>your_base64_encoded_customer_key</server_side_encryption_customer_key_base64>
<server_side_encryption_kms_key_id>your_kms_key_id</server_side_encryption_kms_key_id>
<server_side_encryption_kms_encryption_context>your_kms_encryption_context</server_side_encryption_kms_encryption_context>
<server_side_encryption_kms_bucket_key_enabled>true</server_side_encryption_kms_bucket_key_enabled>
<proxy>
<uri>http://proxy1</uri>
<uri>http://proxy2</uri>
</proxy>
<connect_timeout_ms>10000</connect_timeout_ms>
<request_timeout_ms>5000</request_timeout_ms>
<retry_attempts>10</retry_attempts>
<single_read_retries>4</single_read_retries>
<min_bytes_for_seek>1000</min_bytes_for_seek>
<metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
<skip_access_check>false</skip_access_check>
</s3>
<s3_cache>
<type>cache</type>
<disk>s3</disk>
<path>/var/lib/clickhouse/disks/s3_cache/</path>
<max_size>10Gi</max_size>
</s3_cache>
</disks>
...
</storage_configuration>
:::note cache configuration
ClickHouse versions 22.3 through 22.7 use a different cache configuration, see using local cache if you are using one of those versions.
:::
Configuring the S3 disk
Required parameters:
- endpoint - S3 endpoint URL in path or virtual hosted styles. The endpoint URL should contain a bucket and root path to store data.
- access_key_id - S3 access key id.
- secret_access_key - S3 secret access key.
Optional parameters:
- region - S3 region name.
- support_batch_delete - This controls the check to see if batch deletes are supported. Set this to false when using Google Cloud Storage (GCS), as GCS does not support batch deletes and preventing the checks will prevent error messages in the logs.
- use_environment_credentials - Reads AWS credentials from the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN if they exist. Default value is false.
- use_insecure_imds_request - If set to true, the S3 client will use an insecure IMDS request while obtaining credentials from Amazon EC2 metadata. Default value is false.
- expiration_window_seconds - Grace period for checking if expiration-based credentials have expired. Optional, default value is 120.
- proxy - Proxy configuration for the S3 endpoint. Each uri element inside the proxy block should contain a proxy URL.
- connect_timeout_ms - Socket connect timeout in milliseconds. Default value is 10 seconds.
- request_timeout_ms - Request timeout in milliseconds. Default value is 5 seconds.
- retry_attempts - Number of retry attempts in case of a failed request. Default value is 10.
- single_read_retries - Number of retry attempts in case of a connection drop during read. Default value is 4.
- min_bytes_for_seek - Minimal number of bytes to use a seek operation instead of a sequential read. Default value is 1 Mb.
- metadata_path - Path on the local FS to store metadata files for S3. Default value is /var/lib/clickhouse/disks/<disk_name>/.
- skip_access_check - If true, disk access checks will not be performed on disk start-up. Default value is false.
- header - Adds the specified HTTP header to a request to the given endpoint. Optional, can be specified multiple times.
- server_side_encryption_customer_key_base64 - If specified, required headers for accessing S3 objects with SSE-C encryption will be set.
- server_side_encryption_kms_key_id - If specified, required headers for accessing S3 objects with SSE-KMS encryption will be set. If an empty string is specified, the AWS managed S3 key will be used. Optional.
- server_side_encryption_kms_encryption_context - If specified alongside server_side_encryption_kms_key_id, the given encryption context header for SSE-KMS will be set. Optional.
- server_side_encryption_kms_bucket_key_enabled - If specified alongside server_side_encryption_kms_key_id, the header to enable S3 bucket keys for SSE-KMS will be set. Optional, can be true or false, defaults to nothing (matches the bucket-level setting).
- s3_max_put_rps - Maximum PUT requests per second rate before throttling. Default value is 0 (unlimited).
- s3_max_put_burst - Max number of requests that can be issued simultaneously before hitting the request per second limit. By default (0 value) equals s3_max_put_rps.
- s3_max_get_rps - Maximum GET requests per second rate before throttling. Default value is 0 (unlimited).
- s3_max_get_burst - Max number of requests that can be issued simultaneously before hitting the request per second limit. By default (0 value) equals s3_max_get_rps.
- read_resource - Resource name to be used for scheduling of read requests to this disk. Default value is empty string (IO scheduling is not enabled for this disk).
- write_resource - Resource name to be used for scheduling of write requests to this disk. Default value is empty string (IO scheduling is not enabled for this disk).
- key_template - Defines the format with which the object keys are generated. By default, ClickHouse takes the root path from the endpoint option and adds a randomly generated suffix. That suffix is a dir with 3 random symbols and a file name with 29 random symbols. With this option you have full control over how the object keys are generated. Some usage scenarios require having random symbols in the prefix or in the middle of the object key. For example: [a-z]{3}-prefix-random/constant-part/random-middle-[a-z]{3}/random-suffix-[a-z]{29}. The value is parsed with re2. Only a subset of the syntax is supported. Check if your preferred format is supported before using this option. The disk isn't initialized if ClickHouse is unable to generate a key from the value of key_template. It requires the feature flag storage_metadata_write_full_object_key to be enabled. It forbids declaring the root path in the endpoint option. It requires the option key_compatibility_prefix to be defined.
- key_compatibility_prefix - This option is required when the key_template option is in use. In order to be able to read the object keys which were stored in metadata files with a metadata version lower than VERSION_FULL_OBJECT_KEY, the previous root path from the endpoint option should be set here.
Configuring the cache
This is the cache configuration from above:
<s3_cache>
<type>cache</type>
<disk>s3</disk>
<path>/var/lib/clickhouse/disks/s3_cache/</path>
<max_size>10Gi</max_size>
</s3_cache>
These parameters define the cache layer:
- type - If a disk is of type cache it caches mark and index files in memory.
- disk - The name of the disk that will be cached.
Cache parameters:
- path - The path where metadata for the cache is stored.
- max_size - The size (amount of disk space) that the cache can grow to.
:::tip
There are several other cache parameters that you can use to tune your storage, see using local cache for the details.
:::
S3 disk can be configured as main or cold storage:
<storage_configuration>
...
<disks>
<s3>
<type>s3</type>
<endpoint>https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/root-path/</endpoint>
<access_key_id>your_access_key_id</access_key_id>
<secret_access_key>your_secret_access_key</secret_access_key>
</s3>
</disks>
<policies>
<s3_main>
<volumes>
<main>
<disk>s3</disk>
</main>
</volumes>
</s3_main>
<s3_cold>
<volumes>
<main>
<disk>default</disk>
</main>
<external>
<disk>s3</disk>
</external>
</volumes>
<move_factor>0.2</move_factor>
</s3_cold>
</policies>
...
</storage_configuration>
With the cold option, data can be moved to S3 if the free space on the local disk becomes smaller than move_factor * disk_size, or by a TTL move rule.
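A minimal sketch of the TTL-driven variant, assuming the s3_cold policy from the configuration above is present on the server. The table name events_tiered, its columns, and the three-month interval are hypothetical.

```sql
-- Hypothetical table that uses the s3_cold policy defined above.
CREATE TABLE events_tiered
(
    d DateTime,
    payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
-- Parts whose rows are all older than 3 months become eligible
-- to move to the 'external' volume, which holds the S3 disk.
TTL d + INTERVAL 3 MONTH TO VOLUME 'external'
SETTINGS storage_policy = 's3_cold';
```

With this setup a part can leave the local disk for either reason: its rows pass the TTL, or free space on the main volume drops below the move_factor threshold.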
Using Azure Blob Storage for Data Storage
MergeTree family table engines can store data to Azure Blob Storage using a disk with type azure_blob_storage.
As of February 2022, this feature is still a fresh addition, so expect that some Azure Blob Storage functionalities might be unimplemented.
Configuration markup:
<storage_configuration>
...
<disks>
<blob_storage_disk>
<type>azure_blob_storage</type>
<storage_account_url>http://account.blob.core.windows.net</storage_account_url>
<container_name>container</container_name>
<account_name>account</account_name>
<account_key>pass123</account_key>
<metadata_path>/var/lib/clickhouse/disks/blob_storage_disk/</metadata_path>
<cache_path>/var/lib/clickhouse/disks/blob_storage_disk/cache/</cache_path>
<skip_access_check>false</skip_access_check>
</blob_storage_disk>
</disks>
...
</storage_configuration>
Connection parameters:
- endpoint - AzureBlobStorage endpoint URL with container & prefix. Optionally can contain account_name if the authentication method used needs it (http://account.blob.core.windows.net:{port}/[account_name]{container_name}/{data_prefix}), or these parameters can be provided separately using storage_account_url, account_name & container. For specifying prefix, endpoint should be used.
- endpoint_contains_account_name - This flag is used to specify if endpoint contains account_name, as it is only needed for certain authentication methods. (Default: true)
- storage_account_url - Required if endpoint is not specified. Azure Blob Storage account URL, like http://account.blob.core.windows.net or http://azurite1:10000/devstoreaccount1.
- container_name - Target container name, defaults to default-container.
- container_already_exists - If set to false, a new container container_name is created in the storage account; if set to true, the disk connects to the container directly; and if left unset, the disk connects to the account, checks if the container container_name exists, and creates it if it doesn't exist yet.
Authentication parameters (the disk will try all available methods and Managed Identity Credential):
- connection_string - For authentication using a connection string.
- account_name and account_key - For authentication using Shared Key.
Limit parameters (mainly for internal usage):
- s3_max_single_part_upload_size - Limits the size of a single block upload to Blob Storage.
- min_bytes_for_seek - Limits the size of a seekable region.
- max_single_read_retries - Limits the number of attempts to read a chunk of data from Blob Storage.
- max_single_download_retries - Limits the number of attempts to download a readable buffer from Blob Storage.
- thread_pool_size - Limits the number of threads with which IDiskRemote is instantiated.
- s3_max_inflight_parts_for_one_file - Limits the number of put requests that can be run concurrently for one object.
Other parameters:
- metadata_path - Path on the local FS to store metadata files for Blob Storage. Default value is /var/lib/clickhouse/disks/<disk_name>/.
- skip_access_check - If true, disk access checks will not be performed on disk start-up. Default value is false.
- read_resource - Resource name to be used for scheduling of read requests to this disk. Default value is empty string (IO scheduling is not enabled for this disk).
- write_resource - Resource name to be used for scheduling of write requests to this disk. Default value is empty string (IO scheduling is not enabled for this disk).
Examples of working configurations can be found in integration tests directory (see e.g. test_merge_tree_azure_blob_storage or test_azure_blob_storage_zero_copy_replication).
:::note Zero-copy replication is not ready for production
Zero-copy replication is disabled by default in ClickHouse version 22.8 and higher. This feature is not recommended for production use.
:::
HDFS storage
In this sample configuration:
- the disk is of type hdfs
- the data is hosted at hdfs://hdfs1:9000/clickhouse/
<clickhouse>
<storage_configuration>
<disks>
<hdfs>
<type>hdfs</type>
<endpoint>hdfs://hdfs1:9000/clickhouse/</endpoint>
<skip_access_check>true</skip_access_check>
</hdfs>
<hdd>
<type>local</type>
<path>/</path>
</hdd>
</disks>
<policies>
<hdfs>
<volumes>
<main>
<disk>hdfs</disk>
</main>
<external>
<disk>hdd</disk>
</external>
</volumes>
</hdfs>
</policies>
</storage_configuration>
</clickhouse>
Web storage (read-only)
Web storage can be used for read-only purposes. An example use is for hosting sample data, or for migrating data.
:::tip
Storage can also be configured temporarily within a query, if a web dataset is not expected to be used routinely, see dynamic storage and skip editing the configuration file.
:::
In this sample configuration:
- the disk is of type web
- the data is hosted at http://nginx:80/test1/
- a cache on local storage is used
<clickhouse>
<storage_configuration>
<disks>
<web>
<type>web</type>
<endpoint>http://nginx:80/test1/</endpoint>
</web>
<cached_web>
<type>cache</type>
<disk>web</disk>
<path>cached_web_cache/</path>
<max_size>100000000</max_size>
</cached_web>
</disks>
<policies>
<web>
<volumes>
<main>
<disk>web</disk>
</main>
</volumes>
</web>
<cached_web>
<volumes>
<main>
<disk>cached_web</disk>
</main>
</volumes>
</cached_web>
</policies>
</storage_configuration>
</clickhouse>
Virtual Columns
- _part - Name of a part.
- _part_index - Sequential index of the part in the query result.
- _partition_id - Name of a partition.
- _part_uuid - Unique part identifier (if the MergeTree setting assign_part_uuids is enabled).
- _partition_value - Values (a tuple) of a partition by expression.
- _sample_factor - Sample factor (from the query).
- _block_number - Block number of the row; it is persisted on merges when allow_experimental_block_number_column is set to true.
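Virtual columns are not returned by SELECT * and have to be named explicitly. As a small sketch, the query below counts rows per part and partition for the table tab used in the earlier examples.

```sql
-- Rows per partition and part; _partition_id and _part are virtual columns.
SELECT _partition_id, _part, count() AS rows
FROM tab
GROUP BY _partition_id, _part
ORDER BY _partition_id, _part;
```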
Column Statistics (Experimental)
The statistic declaration goes in the columns section of the CREATE query for tables from the *MergeTree* family, once statistics are enabled with set allow_experimental_statistic = 1.
CREATE TABLE tab
(
a Int64 STATISTIC(tdigest),
b Float64
)
ENGINE = MergeTree
ORDER BY a
We can also manipulate statistics with ALTER statements.
ALTER TABLE tab ADD STATISTIC b TYPE tdigest;
ALTER TABLE tab DROP STATISTIC a TYPE tdigest;
These lightweight statistics aggregate information about distribution of values in columns.
They can be used for query optimization once enabled with set allow_statistic_optimize = 1.
Available Types of Column Statistics
- tdigest
  Stores distribution of values from numeric columns in TDigest sketch.
Column-level Settings
Certain MergeTree settings can be overridden at column level:
- max_compress_block_size - Maximum size of blocks of uncompressed data before compressing for writing to a table.
- min_compress_block_size - Minimum size of blocks of uncompressed data required for compression when writing the next mark.
Example:
CREATE TABLE tab
(
id Int64,
document String SETTINGS (min_compress_block_size = 16777216, max_compress_block_size = 16777216)
)
ENGINE = MergeTree
ORDER BY id
Column-level settings can be modified or removed using ALTER MODIFY COLUMN, for example:
- Remove SETTINGS from column declaration:
ALTER TABLE tab MODIFY COLUMN document REMOVE SETTINGS;
- Modify a setting:
ALTER TABLE tab MODIFY COLUMN document MODIFY SETTING min_compress_block_size = 8192;
- Reset one or more settings, which also removes the setting declaration from the column expression of the table's CREATE query:
ALTER TABLE tab MODIFY COLUMN document RESET SETTING min_compress_block_size;