14 KiB
title | image | date | author | tags | ||
---|---|---|---|---|---|---|
What's New in ClickHouse 21.12 | https://blog-images.clickhouse.com/en/2021/clickhouse-v21-12/featured.jpg | 2021-12-16 | Alexey Milovidov, Christoph Wurm |
|
We're continuing our monthly release cadence. The 21.12 Christmas release includes 2460 new commits from 125 contributors, including 42 new contributors:
Alex Cao, Amr Alaa, Andrey Torsunov, Constantine Peresypkin, Dmitriy Dorofeev, Egor O'Sten, Elykov Alexandr, Evgeny, Frank Chen, LB, Natasha Murashkina, Peignon Melvyn, Rich Raposa, Roman Chyrva, Roman, SuperDJY, Thom O'Connor, Timur Magomedov, Tom Risse, Tomáš Hromada, cfcz48, cgp, cms, cmsxbc, congbaoyangrou, dongyifeng, frank chen, freedomDR, jus1096, khamadiev, laurieliyang, leosunli, liyang830, loneylee, michael1589, msaf1980, p0ny, qieqieplus, spume, sunlisheng, yandd, zhanghuajie.
If you are wondering, this list is generated by the following command:
clickhouse-local --query "
SELECT arrayStringConcat(groupArray(s), ', ')
FROM file('contributors-21.12.txt', LineAsString, 's String')
WHERE s NOT IN (
SELECT *
FROM file('contributors-21.11.txt', LineAsString, 's String'))
FORMAT TSVRaw"
And to list the contributors, you can always run the
SELECT * FROM system.contributors
query on your production server.
Let's highlight some of the new capabilities in 21.12:
ClickHouse Keeper is Feature Complete
In 21.12 clickhouse-keeper
started to support "four letter commands" for status and monitoring. This feature is contributed by JackyWoo and reviewed by Alexander Sapin (the author of ClickHouse Keeper).
It was the only missing feature to implement. In this release, clickhouse-keeper is still considered in pre-production stage, but many companies already started to evaluate and use it as a replacement of ZooKeeper. You can also start using clickhouse-keeper in your testing environments and we will appreciate your feedback.
ClickHouse Keeper development started in Sep 2020, more than a year ago. It was a long road, and most of the efforts were to ensure correctness and stability in unusual and exceptional scenarios. It is covered by Jepsen tests (including ZooKeeper tests and new introduced tests), continuous randomized stress testing with ClickHouse functional and integration tests. It is started to be tested in Yandex Cloud and among our best friends. If you're pretending to be our best friend, you can also do it.
How does this help you?
ClickHouse Keeper is a drop-in replacement for ZooKeeper. It implements the ZooKeeper wire protocol and data model, but does it better.
In contrast to ZooKeeper, there are no issues with zxid overflow or packet sizes. It has better memory usage and it does not require JVM tuning (because it does not use the JVM). Logs and snapshots are compressed (by about 10x typically) and checksummed. It can run as a separate process or directly inside clickhouse-server. You can use it with ClickHouse or with your Kafkas and Hadoops as well.
Partitions For INSERT INTO File, URL And HDFS Storages
When using the table engines File
, URL
, and HDFS
ClickHouse now supports partitions. When creating a table you can specify the partition key using the PARTITION BY
clause e.g. CREATE TABLE hits_files (...) ENGINE = File(TabSeparated) PARTITION BY toYYYYMM(EventDate)
.
Similarly, when exporting data from ClickHouse using the file
, url
, and hdfs
table functions you can now specify that the data is to be partitioned into multiple files using a PARTITION BY
clause. For example, INSERT INTO TABLE FUNCTION file('path/hits_{_partition_id}', 'TSV', 'columns...') PARTITION BY toYYYYMM(EventDate) VALUES ...
will create as many files as there are unique months in the dataset.
The s3
table function has already supported partitioned writes since ClickHouse 21.10.
How does this help you?
If data is split into multiple files, SELECT
queries will be automatically parallelized. For example:
SELECT user_id, count() FROM s3(
'https://s3.us-east-2.amazonaws.com/.../*.csv.zstd',
'...', '...',
CSV,
'user_id UInt64, ...')
You can even parallelize data processing across a distributed compute cluster if you use the s3Cluster
table function:
SELECT user_id, count() FROM s3Cluster(
my_cluster,
'https://s3.us-east-2.amazonaws.com/.../*.csv.zstd',
'...',
'...', CSV,
'user_id UInt64, ...')
It can also be used for integration with external data processing tools that consume data from s3
.
FROM INFILE in clickhouse-client now supports glob patterns and parallel reading
Just write:
INSERT INTO my_table FROM INFILE '*.csv.gz' FORMAT CSV
Glob patterns support *
, ?
and {n..m}
with {1..10}
or (aligned) {01..10}
forms.
This query will be automatically parallelized and it will also automatically detect the compression format from the file extension and decompress transparently.
This improvement is done by Arthur Filatenkov.
How does this help you?
Now you don't have to recall how to write a parallel for loop in your command line shell. clickhouse-client will do everything for you, it works intuitively and fast.
Support for INTERVAL operator inside WITH FILL modifier for ORDER BY clause
What's the... WITH FILL
modifier in the ORDER BY
clause? Take a look at the example:
:) SELECT EventDate, count() FROM test.hits WHERE CounterID = 2841673 GROUP BY EventDate ORDER BY EventDate
┌──EventDate─┬─count()─┐
│ 2014-03-17 │ 3 │
│ 2014-03-19 │ 6 │
│ 2014-03-21 │ 7 │
│ 2014-03-22 │ 6 │
└────────────┴─────────┘
We have the report with Mar 17th, 19th, 21st, and 22nd. But Mar 18th and 20th are missing, because there is no data for these dates. And this is how it works in all SQL databases.
But ClickHouse also has a quite unique and neat WITH FILL
modifier for the ORDER BY
clause.
You just write:
SELECT EventDate, count() FROM test.hits WHERE CounterID = 2841673 GROUP BY EventDate
ORDER BY EventDate WITH FILL STEP 1
┌──EventDate─┬─count()─┐
│ 2014-03-17 │ 3 │
│ 2014-03-18 │ 0 │
│ 2014-03-19 │ 6 │
│ 2014-03-20 │ 0 │
│ 2014-03-21 │ 7 │
│ 2014-03-22 │ 6 │
└────────────┴─────────┘
And missing data is automatically filled.
You can also add FROM
and TO
:
ORDER BY EventDate WITH FILL FROM '2014-03-01'::Date TO '2014-03-31'::Date STEP 1;
And it will automatically fill missing rows in the report.
The STEP
can be an arbitrary number. But what can you do if you want to fill missing dates for a report by months? You cannot just write STEP 30
or STEP 31
because different months contain different number of days...
Since ClickHouse version 21.12 you can do it like this:
ORDER BY EventDate WITH FILL STEP INTERVAL 1 MONTH
INTERVAL
is a standard SQL operator, you can use SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER and YEAR.
This is implemented by Anton Popov who is the author of the "WITH FILL" feature.
How does this help you?
It allows you to avoid a postprocessing step for your reports.
Add Support For "Identifier" Table and Database Query Parameters
ClickHouse has support for parameterized queries. For example:
SELECT uniq(user_id) FROM table WHERE website = {name:String}
It allows to safely substitute parameters without the risk of SQL injections:
curl https://clickhouse-server:8443/?param_name=upyachka -d 'SELECT uniq(user_id) FROM table WHERE website = {name:String}'
You can even create customized API handlers for clickhouse-server based on prepared queries.
In version 21.12 we introduce support for using parameters for tables and databases in your queries. This is implemented with the Identifier
table parameter:
SELECT uniq(user_id) FROM {tbl:Identifier}
Identifier parameters also work for CREATE, DROP and all DDL queries. This is implemented by Nikolai Degterinskiy.
How does this help you?
Let ClickHouse do the heavy lifting and keep your scripts safe and secure.
Bool Data Type
This feature is experimental in version 21.12. It is implemented by Kevin Wan (MaxWk) on top of initial work by hczhcz and reviewed by Pavel Kruglov.
ClickHouse now natively supports a Bool
data type. It allows to represent values as "true"/"false" during data import and export in text formats. It can also be adjusted to anything else using the settings bool_true_representation
and bool_false_representation
(for example, "yes" and "no").
How does this help you?
Native boolean data types exist today in other databases that are often integrated with ClickHouse, such as PostgreSQL. The Bool
data type in ClickHouse will make it more compatible with existing code and ease migration from other databases.
Also it simplifies data ingestion from various text sources.
Query Optimizations With Table Constraints
This feature is contributed by Nikita Vasilev. Nikita is one of the most notable ClickHouse contributors. He started in 2019 by introducing data skipping indices into ClickHouse, then continued in 2020 with SSD-optimized key-value dictionaries and now contributed the new advancements in the query optimizer. This feature is reviewed by Anton Popov.
So, what optimizations? ClickHouse already allows to specify constraints for tables:
CREATE TABLE
(
URL String,
Domain String,
CONSTRAINT validate CHECK isValidUTF8(URL) AND length(URL) BETWEEN 10 AND 10000,
CONSTRAINT my_constraint CHECK Domain = domainWithoutWWW(URL)
) ...
Constraints are checked on INSERT
. In this example we validate the URL and check that the Domain
column actually contains the domain of the URL.
Since version 21.12 constraints can also automatically optimize your queries! For example, if you write:
SELECT count() FROM hits WHERE domainWithoutWWW(URL) = 'ghe.clickhouse.tech'
The query can be automatically rewritten to:
SELECT count() FROM hits WHERE Domain = 'ghe.clickhouse.tech'
Because the Domain
column is smaller and more compressable it will be faster to read and does not require calculation of the domain from the URL.
The only thing you need to do is to enable the optimize_using_constraints
and optimize_substitute_columns
settings.
As a bonus, we introduced a new type of constraint: ASSUME
.
CONSTRAINT my_constraint ASSUME Domain = domainWithoutWWW(URL)
This type of constraint will not check anything on INSERT
but still use the assumption to optimize the queries.
It can also do logical inference, simplify the conditions and remove the conditions that are proved to be satisfied by constraints.
It is controlled by the convert_query_to_cnf
setting. You can also enable optimize_append_index
. With this setting ClickHouse will derive more conditions on the table primary key.
The idea is so powerful that we cannot resist adding one more feature: indices for hypothesis.
INDEX my_index (a < b) TYPE hypothesis GRANULARITY 1
The expression is checked and the result (true/false) is written as an index for query optimization.
How does this help you?
Especially in large ClickHouse deployments with many complex tables it can be hard for users to always be up to date on the best way to query a given dataset. Constraints can help optimize queries without having to change the query structure itself. They can also make it easier to make changes to tables.
For example, let's say you have a table containing web requests and it includes a URL column that contains the full URL of each request. Many times, users will want to know the top level domain (.com, .co.uk, etc.), something ClickHouse provides the topLevelDomain
function to calculate. If you discover that many people are using this function you might decide to create a new materialized column that pre-calculates the top level domain for each record.
Rather than tell all your users to change their queries you can use a table constraint to tell ClickHouse that each time a user tries to call the topLevelDomain
function the request should be rewritten to use the new materialized column.
Read Large Remote Files In Chunks
ClickHouse combines a fast query engine and efficient data storage. It also allows to integrate external data sources for data import and export or even to process external datasets on the fly without the need for data import or preprocessing.
When reading large files in Parquet
, ORC
, and Arrow
format using the s3
, url
, and hdfs
table functions, ClickHouse will now automatically choose whether to read the entire file at once or read parts of it incrementally. This is now enabled by default and the setting remote_read_min_bytes_for_seek
controls when to switch from reading it all to reading in chunks. The default is 1MiB.
Parquet
, ORC
, and Arrow
are column-oriented formats (quite similar to the ClickHouse Native format) and now we can read only requested columns even if they are being read from a remote HTTP server with the url
table function (range requests will be performed to skip unneeded data).
This feature is implemented by Kseniia Sumarokova.
How does this help our ClickHouse Users?
In previous versions, when reading files in Arrow-based formats from remote locations with the s3
, url
, and hdfs
table functions, ClickHouse would always read the entire file into memory. This works well when the files are small but will cause excessive memory usage or not work at all when the files are large. With this change, ClickHouse will read large files in chunks to keep memory usage in check and is now able to read even very large files.
... And Many More
Read the full changelog for the 21.12 "Christmas" release for the full list of gifts from the ClickHouse Team.