We're continuing our monthly release cadence. The 21.12 Christmas release includes 2460 new commits from 125 contributors, including 42 new contributors:
If you are wondering, this list is generated by the following command:
```
clickhouse-local --query "
SELECT arrayStringConcat(groupArray(s), ', ')
FROM file('contributors-21.12.txt', LineAsString, 's String')
WHERE s NOT IN (
SELECT *
FROM file('contributors-21.11.txt', LineAsString, 's String'))
FORMAT TSVRaw"
```
To list the contributors, you can always run the following query on your production server:
```
SELECT * FROM system.contributors
```
Let's highlight some of the new capabilities in 21.12:
## ClickHouse Keeper is Feature Complete
In 21.12, `clickhouse-keeper` added support for ["four letter commands"](https://zookeeper.apache.org/doc/r3.4.8/zookeeperAdmin.html#sc_zkCommands) for status and monitoring. This feature was contributed by **JackyWoo** and reviewed by **Alexander Sapin** (the author of ClickHouse Keeper).
It was the only missing feature left to implement. In this release, clickhouse-keeper is still considered to be in the pre-production stage, but many companies have already started to evaluate and use it as a replacement for ZooKeeper. You can also start using clickhouse-keeper in your testing environments, and we will appreciate your feedback.
ClickHouse Keeper development started in Sep 2020, more than a year ago. It was a long road, and most of the effort went into ensuring correctness and stability in unusual and exceptional scenarios. It is covered by [Jepsen](https://jepsen.io/) tests (including ZooKeeper tests and [newly introduced tests](https://github.com/ClickHouse/ClickHouse/tree/master/tests/jepsen.clickhouse-keeper)) and by continuous randomized stress testing with ClickHouse functional and integration tests. Testing has started in Yandex Cloud and among our best friends. If you aspire to be our best friend, you can do it too.
ClickHouse Keeper is a drop-in replacement for ZooKeeper. It implements the ZooKeeper wire protocol and data model, but does it better.
In contrast to ZooKeeper, there are no issues with zxid overflow or packet sizes. It has better memory usage and does not require JVM tuning (because it does not use the JVM). Logs and snapshots are compressed (typically about 10x) and checksummed. It can run as a separate process or directly inside clickhouse-server. You can use it with ClickHouse or with your Kafkas and Hadoops as well.
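Because Keeper speaks the ZooKeeper protocol, a ClickHouse server pointed at it can be inspected the same way as one backed by ZooKeeper. The query below is a minimal sketch, assuming the server's coordination settings already point at a running `clickhouse-keeper`. The path `/` is simply the root node, chosen for illustration.

```sql
-- List the children of the root node of the coordination tree.
-- system.zookeeper requires a condition on `path`.
SELECT name, numChildren
FROM system.zookeeper
WHERE path = '/'
```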
When using the table engines `File`, `URL`, and `HDFS`, ClickHouse now supports partitions. When creating a table, you can specify the partition key using the `PARTITION BY` clause, e.g. `CREATE TABLE hits_files (...) ENGINE = File(TabSeparated) PARTITION BY toYYYYMM(EventDate)`.
Similarly, when exporting data from ClickHouse using the `file`, `url`, and `hdfs` table functions you can now specify that the data is to be partitioned into multiple files using a `PARTITION BY` clause. For example, `INSERT INTO TABLE FUNCTION file('path/hits_{_partition_id}', 'TSV', 'columns...') PARTITION BY toYYYYMM(EventDate) VALUES ...` will create as many files as there are unique months in the dataset.
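As a fuller sketch of the export case, the statement below feeds the partitioned insert from a `SELECT` instead of `VALUES`. It assumes the `test.hits` table used elsewhere in this post, with `EventDate` and `CounterID` columns of the types shown. The file name pattern is illustrative.

```sql
-- Sketch: write one TSV file per month of data.
-- {_partition_id} is replaced by the value of the PARTITION BY expression.
INSERT INTO TABLE FUNCTION
    file('hits_{_partition_id}.tsv', 'TSV', 'EventDate Date, CounterID UInt32')
PARTITION BY toYYYYMM(EventDate)
SELECT EventDate, CounterID
FROM test.hits
```

With `toYYYYMM` as the partition key, the resulting files would be expected to carry names along the lines of `hits_201403.tsv`.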
Now you don't have to recall how to write a parallel for loop in your command line shell. clickhouse-client will do everything for you, and it works intuitively and fast.
```
:) SELECT EventDate, count() FROM test.hits WHERE CounterID = 2841673 GROUP BY EventDate ORDER BY EventDate

┌──EventDate─┬─count()─┐
│ 2014-03-17 │       3 │
│ 2014-03-19 │       6 │
│ 2014-03-21 │       7 │
│ 2014-03-22 │       6 │
└────────────┴─────────┘
```
The report has rows for Mar 17th, 19th, 21st, and 22nd. But Mar 18th and 20th are missing, because there is no data for these dates.
And this is how it works in all SQL databases.
But ClickHouse also has a quite unique and neat `WITH FILL` modifier for the `ORDER BY` clause.
You just write:
```
SELECT EventDate, count() FROM test.hits WHERE CounterID = 2841673 GROUP BY EventDate
ORDER BY EventDate WITH FILL STEP 1

┌──EventDate─┬─count()─┐
│ 2014-03-17 │       3 │
│ 2014-03-18 │       0 │
│ 2014-03-19 │       6 │
│ 2014-03-20 │       0 │
│ 2014-03-21 │       7 │
│ 2014-03-22 │       6 │
└────────────┴─────────┘
```
And missing data is automatically filled.
You can also add `FROM` and `TO`:
```
ORDER BY date WITH FILL FROM '2014-03-01'::Date TO '2014-03-31'::Date STEP 1;
```
And it will automatically fill missing rows in the report.
The STEP can be an arbitrary number. But what should you do if you want to fill missing dates for a report by months? You cannot just write STEP 30 or STEP 31, because months contain different numbers of days...
Since ClickHouse version 21.12 you can do it like this:
```
ORDER BY EventDate WITH FILL STEP INTERVAL 1 MONTH
```
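To try the month step without any real data, here is a self-contained sketch that builds a tiny synthetic dataset with the `numbers` table function. The column names `month` and `value` are made up for this example.

```sql
-- Synthetic data: three rows, two months apart (January, March, May 2021).
SELECT
    addMonths(toDate('2021-01-01'), number * 2) AS month,
    number + 1 AS value
FROM numbers(3)
-- Fill the gaps month by month, regardless of how many days each month has.
ORDER BY month WITH FILL STEP INTERVAL 1 MONTH
```

The rows for February and April do not exist in the source data, so they should appear in the result as filled rows with a default `value`.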
## Add Support For "Identifier" Table and Database Query Parameters
ClickHouse has support for parameterized queries.
```
SELECT uniq(user_id) FROM table WHERE website = {name:String}
```
It lets you safely substitute parameters without the risk of SQL injection:
```
curl 'https://clickhouse-server:8443/?param_name=upyachka' -d 'SELECT uniq(user_id) FROM table WHERE website = {name:String}'
```
You can even create customized API handlers for clickhouse-server based on prepared queries.
Since version 21.12, parameters can also be used for table and database names in your queries. This is implemented with the `Identifier` parameter type.
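A minimal sketch of what such a query can look like is shown below. The parameter names `db` and `tbl` are hypothetical. Their values are supplied by the client in the same way as other parameters, for example as `param_db` and `param_tbl` in the HTTP URL.

```sql
-- The database and table names are substituted as identifiers,
-- while `name` is still substituted as a String value.
SELECT uniq(user_id)
FROM {db:Identifier}.{tbl:Identifier}
WHERE website = {name:String}
```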
Let ClickHouse do the heavy lifting and keep your scripts safe and secure.
## Bool Data Type
This feature is experimental in version 21.12.
ClickHouse now natively supports a `Bool` data type. It lets you represent values as "true"/"false" during data import and export in text formats. The representation can also be changed to anything else using the settings `bool_true_representation` and `bool_false_representation` (for example, "yes" and "no").
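The following sketch shows how the type and the two settings might be used together. The table `feature_flags` and its columns are hypothetical, and since the feature is experimental in 21.12, the exact behavior may vary.

```sql
-- Hypothetical table used for illustration only.
CREATE TABLE feature_flags
(
    name String,
    enabled Bool
)
ENGINE = Memory;

INSERT INTO feature_flags VALUES ('dark_mode', true), ('beta_ui', false);

-- Optionally change how Bool values are rendered in text formats.
SET bool_true_representation = 'yes';
SET bool_false_representation = 'no';

SELECT name, enabled FROM feature_flags FORMAT TSV;
```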
**How does this help you?**
Native boolean data types exist today in other databases that are often integrated with ClickHouse, such as PostgreSQL. The `Bool` data type in ClickHouse will make it more compatible with existing code and ease migration from other databases.
It also simplifies data ingestion from various text sources.
This feature was [contributed](https://github.com/ClickHouse/ClickHouse/pull/18787) by **Nikita Vasilev**. Nikita is one of the most notable ClickHouse contributors. He started in 2019 by introducing data skipping indices into ClickHouse, continued in 2020 with SSD-optimized key-value dictionaries, and has now contributed new advancements in the query optimizer. This feature was reviewed by **Anton Popov**.
A constraint declared with `ASSUME` will not check anything on INSERT, but ClickHouse will still use the assumption to optimize queries.
It can also do logical inference, simplify the conditions, and remove the conditions that are proved to be satisfied by constraints.
It is controlled by the `convert_query_to_cnf` setting. You can also enable the `optimize_append_index` setting. With this setting, ClickHouse will derive more conditions on the table primary key.
The idea is so powerful that we cannot resist adding one more feature: *indices for hypothesis*.
```
INDEX my_index (a < b) TYPE hypothesis GRANULARITY 1
```
The expression is checked and the result (true/false) is written as an index for query optimization.
Especially in large ClickHouse deployments with many complex tables, it can be hard for users to always be up to date on the best way to query a given dataset. Constraints can help optimize queries without having to change the query structure itself. They can also make it easier to make changes to tables.
For example, let's say you have a table containing web requests, and it includes a URL column that contains the full URL of each request. Many times, users will want to know the top level domain (.com, .co.uk, etc.), which ClickHouse can calculate with the `topLevelDomain` function. If you discover that many people are using this function, you might decide to create a new materialized column that pre-calculates the top level domain for each record.
Rather than tell all your users to change their queries, you can use a table constraint to tell ClickHouse that each time a user tries to call the `topLevelDomain` function, the request should be rewritten to use the new materialized column.
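The scenario above could be sketched roughly as follows. The table, column, and constraint names are hypothetical. The two `SET` statements use setting names from the ClickHouse constraints documentation and are an assumption for this release, so check them against your server version.

```sql
-- Hypothetical table; column and constraint names are illustrative.
CREATE TABLE requests
(
    URL String,
    tld String MATERIALIZED topLevelDomain(URL),
    -- Not checked on INSERT; only used as a hint for the optimizer.
    CONSTRAINT tld_is_materialized ASSUME topLevelDomain(URL) = tld
)
ENGINE = MergeTree
ORDER BY tld;

-- Assumed settings that enable constraint-based rewrites.
SET optimize_using_constraints = 1;
SET optimize_substitute_columns = 1;

-- Users keep writing the original expression; the optimizer may
-- substitute the materialized column `tld` for topLevelDomain(URL).
SELECT topLevelDomain(URL) AS domain, count()
FROM requests
GROUP BY domain
ORDER BY count() DESC
LIMIT 10;
```

Running the last query under `EXPLAIN SYNTAX` is one way to check whether the substitution actually took place.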
ClickHouse combines a fast query engine and efficient data storage. It also lets you integrate external data sources for data import and export, or even process external datasets on the fly without the need for data import or preprocessing.
When reading large files in `Parquet`, `ORC`, and `Arrow` format using the `s3`, `url`, and `hdfs` table functions, ClickHouse will now automatically choose whether to read the entire file at once or read parts of it incrementally. This is now enabled by default, and the setting `remote_read_min_bytes_for_seek` controls when to switch from reading it all to reading in chunks. The default is 1 MiB.
`Parquet`, `ORC`, and `Arrow` are column-oriented formats (quite similar to the ClickHouse Native format), and now we can read only the requested columns even if they are being read from a remote HTTP server with the `url` table function (range requests will be performed to skip unneeded data).
In previous versions, when reading files in Arrow-based formats from remote locations with the `s3`, `url`, and `hdfs` table functions, ClickHouse would always read the entire file into memory. This works well when the files are small but will cause excessive memory usage or not work at all when the files are large. With this change, ClickHouse will read large files in chunks to keep memory usage in check and is now able to read even very large files.
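As an illustration, a query like the one below only needs a single column from a remote Parquet file. The URL and the column name `fare` are hypothetical placeholders. The structure argument lists only the column the query uses.

```sql
-- Hypothetical URL and column name, for illustration only.
SELECT avg(fare)
FROM url(
    'https://example.com/datasets/trips.parquet',
    'Parquet',
    'fare Float64'
)
```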
Read the [full changelog](https://github.com/ClickHouse/ClickHouse/blob/master/CHANGELOG.md) for the 21.12 "Christmas" release for the full list of gifts from the [ClickHouse Team](https://clickhouse.com/careers/).