In this query we are importing data with the `url` table function. Data is posted on an HTTP server in `.native.xz` file. The most annoying part of this query is that we have to specify the data structure and the format of this file.
In the new ClickHouse release 22.1 it becomes much easier:
```
SELECT * FROM url('https://datasets.clickhouse.com/github_events_v2.native.xz')
```
Cannot be more easy! How is that possible?
Firstly, we detect data format automatically from file extension. Here it is `.native.xz`, so we know that the data is compressed by `xz` (LZMA2) compression and is represented in `Native` format. The `Native` format already contains all information about the types and names of the columns, and we just read and use it.
It works for every format that contains information about the data types: `Native`, `Avro`, `Parquet`, `ORC`, `Arrow` as well as `CSVWithNamesAndTypes`, `TSVWithNamesAndTypes`.
And it works for every table function that reads files: `s3`, `file`, `hdfs`, `url`, `s3Cluster`, `hdfsCluster`.
A lot of magic happens under the hood. It does not require reading the whole file in memory. For example, Parquet format has metadata at the end of file. So, we will read the header first to find where the metadata is located, then do a range request to read the metadata about columns and their types, then continue to read the requested columns. And if the file is small, it will be read with a single request.
Data structure can be also automatically inferred from `JSONEachRow`, `CSV`, `TSV`, `CSVWithNames`, `TSVWithNames`, `MsgPack`, `Values` and `Regexp` formats.
For `CSV`, either Float64 or String is inferred. For `JSONEachRow` the inference of array types is supported, including multidimensional arrays. Arrays of non-uniform types are mapped to Tuples. And objects are mapped to the `Map` data type.
If a format does not have column names (like `CSV` without a header), the names `c1`, `c2`, ... are used.
File format is detected from file extension: `csv`, `tsv`, `native`, `parquet`, `pb`, `ndjson`, `orc`... For example, `.ndjson` file is recognized as `JSONEachRow` format and `.csv` is recognized as header-less `CSV` format in ClickHouse, and if you want `CSVWithNames` you can specify the format explicitly.
We support "schema on demand" queries. For example, the autodetected data types for `TSV` format are Strings, but you can refine the types in your query with the `::` operator:
```
SELECT c1 AS domain, uniq(c2::UInt64), count() AS cnt
FROM file('hits.tsv')
GROUP BY domain ORDER BY cnt DESC LIMIT 10
```
As a bonus, `LineAsString` and `RawBLOB` formats also get type inference. Try this query to see how I prefer to read my favorite website:
```
SELECT extractTextFromHTML(*)
FROM url('https://news.ycombinator.com/', LineAsString);
```
Schema autodetection also works while creating `Merge`, `Distributed` and `ReplicatedMegreTree` tables. When you create the first replica, you will have to specify the table structure. But when creating all the subsequent replicas, you only write `CREATE TABLE hits
ENGINE = ReplicatedMegreTree(...)` without listing the columns - the definition will be copied from another replica.
This feature is implemented by **Pavel Kruglov** with the inspiration of initial work by **Igor Baliuk** and with additions by **ZhongYuanKai**.
For distributed queries, we show both total memory usage and max memory usage per host.
This feature has made possible by implementation of distributed metrics forwarding by **Dmitry Novik**. I have added this small visualization to clickhouse-client, and now it is possible to add similar info in every client using native ClickHouse protocol.
## Parallel Query Processing On Replicas
ClickHouse is a distributed MPP DBMS. It can scale up to use all CPU cores on one server and scale out to use computation resources of multiple shards in a cluster.
But each shard usually contains more than one replica. And by default ClickHouse is using the resources of only one replica on every shard. E.g. if you have a cluster of 6 servers with 3 shards and two replicas on each, a query will use just three servers instead of all six.
There was an option to enable `max_parallel_replicas`, but that option required specifying a "sampling key", it was inconvenient to use and did not scale well.
Now we have a setting to enable the new parallel processing algorithm: `allow_experimental_parallel_reading_from_replicas`. If it is enabled, replicas will *dynamically* select and distribute the work across them.
It works perfectly even if replicas have lower or higher amounts of computation resources. And it gives a complete result even if some replicas are stale.
This feature is implemented by **Nikita Mikhaylov**
## Service Discovery
When adding or removing nodes in a cluster, now you don't have to edit config on every server. Just use automatic cluster and servers will register itself:
```
<allow_experimental_cluster_discovery>1
</allow_experimental_cluster_discovery>
<remote_servers>
<auto_cluster>
<discovery>
<path>/clickhouse/discovery/auto_cluster</path>
<shard>1</shard>
</discovery>
</auto_cluster>
</remote_servers>
```
No need to edit config on adding new replicas!
This feature is implemented by **Vladimir Cherkasov**.
## Sparse Encoding For Columns
If a column contains mostly zeros, we can encode it in sparse format
It is a gift from the Yandex Cloud team. They have a tool to collect a report about ClickHouse instances to provide all the needed information for support. They decided to contribute this tool to open-source!
You can find the tool here: [utils/clickhouse-diagnostics](https://github.com/ClickHouse/ClickHouse/tree/master/
utils/clickhouse-diagnostics)
Developed by **Alexander Burmak**.
## Integrations
Plenty of new integrations were added in 22.1:
Integration with **Hive** as foreign table engine for SELECT queries, contributed by **Taiyang Li** and reviewed by **Ksenia Sumarokova**.
Integration with **Azure Blob Storage** similar to S3, contributed by **Jakub Kuklis** and reviewed by **Ksenia Sumarokova**.
Support for **hdfsCluster** table function similar to **s3Cluster**, contributed by **Zhichang Yu** and reviewed by **Nikita Mikhailov**.
## Statistical Functions
I hope you have always dreamed of calculating the Cramer's V and Theil's U coefficients in ClickHouse. Because now we have these functions for you and you have to deal with it.
```
:) SELECT cramersV(URL, URLDomain) FROM test.hits
0.98
:) SELECT cramersV(URLDomain, ResolutionWidth) FROM test.hits
0.27
```
It can calculate some sort of dependency between categorical (discrete) values. You can imagine it like this: there is a correlation function `corr` but it is only applicable for linear dependencies; there is a rank correlation function `rankCorr` but it is only applicable for ordered values. And now there are a few functions to calculate *something* for discrete values.
Read the [full changelog](https://github.com/ClickHouse/ClickHouse/blob/master/CHANGELOG.md) for the 22.1 release and follow [the roadmap](https://github.com/ClickHouse/ClickHouse/issues/32513).