In this query we are importing data with the `url` table function. Data is posted on an HTTP server in a `.native.xz` file. The most annoying part of this query is that we have to specify the data structure and the format of this file.
Firstly, we detect the data format automatically from the file extension. Here it is `.native.xz`, so we know that the data is compressed by `xz` (LZMA2) compression and is represented in `Native` format. The `Native` format already contains all information about the types and names of the columns, and we just read and use it.
It works for every format that contains information about the data types: `Native`, `Avro`, `Parquet`, `ORC`, `Arrow` as well as `CSVWithNamesAndTypes`, `TSVWithNamesAndTypes`.
And it works for every table function that reads files: `s3`, `file`, `hdfs`, `url`, `s3Cluster`, `hdfsCluster`.
A lot of magic happens under the hood. It does not require reading the whole file in memory. For example, Parquet format has metadata at the end of file. So, we read the header first to find where the metadata is located, then do a range request to read the metadata about columns and their types, then continue to read the requested columns. And if the file is small, it will be read with a single request.
Data structure can be also automatically inferred from `JSONEachRow`, `CSV`, `TSV`, `CSVWithNames`, `TSVWithNames`, `MsgPack`, `Values` and `Regexp` formats.
For `CSV`, either Float64 or String is inferred. For `JSONEachRow` the inference of array types is supported, including multidimensional arrays. Arrays of non-uniform types are mapped to Tuples. And objects are mapped to the `Map` data type.
If a format does not have column names (like `CSV` without a header), the names `c1`, `c2`, ... are used.
File format is detected from the file extension: `csv`, `tsv`, `native`, `parquet`, `pb`, `ndjson`, `orc`... For example, `.ndjson` file is recognized as `JSONEachRow` format and `.csv` is recognized as header-less `CSV` format in ClickHouse, and if you want `CSVWithNames` you can specify the format explicitly.
We support "schema on demand" queries. For example, the autodetected data types for `TSV` format are Strings, but you can refine the types in your query with the `::` operator:
```
SELECT c1 AS domain, uniq(c2::UInt64), count() AS cnt
FROM file('hits.tsv')
GROUP BY domain ORDER BY cnt DESC LIMIT 10
```
As a bonus, `LineAsString` and `RawBLOB` formats also get type inference. Try this query to see how I prefer to read my favorite website:
```
SELECT extractTextFromHTML(*)
FROM url('https://news.ycombinator.com/', LineAsString);
Schema autodetection also works while creating `Merge`, `Distributed` and `ReplicatedMegreTree` tables. When you create the first replica, you have to specify the table structure. But when creating all the subsequent replicas, you only need `CREATE TABLE hits
This feature was made possible by implementation of distributed metrics forwarding by **Dmitry Novik**. I have added this small visualization to clickhouse-client, and now it is possible to add similar info in every client using native ClickHouse protocol.
ClickHouse is a distributed MPP DBMS. It can scale up to use all CPU cores on one server and scale out to use computation resources of multiple shards in a cluster.
But each shard usually contains more than one replica. And by default ClickHouse is using the resources of only one replica on every shard. E.g. if you have a cluster of 6 servers with 3 shards and two replicas on each, a query will use just three servers instead of all six.
There was an option to enable `max_parallel_replicas`, but that option required specifying a "sampling key", it was inconvenient to use and did not scale well.
Now we have a setting to enable the new parallel processing algorithm: `allow_experimental_parallel_reading_from_replicas`. If it is enabled, replicas will *dynamically* select and distribute the work across them.
It works perfectly even if replicas have lower or higher amounts of computation resources. And it gives a complete result even if some replicas are stale.
When adding or removing nodes in a cluster, now you don't have to edit the config on every server. Just use automatic cluster and servers will register itself:
It is a gift from the Yandex Cloud team. They have a tool to collect a report about ClickHouse instances to provide all the needed information for support. They decided to contribute this tool to open-source!
I hope you have always dreamed of calculating the Cramer's V and Theil's U coefficients in ClickHouse, because now we have these functions for you and you have to deal with it.
:) SELECT cramersV(URLDomain, ResolutionWidth) FROM test.hits
0.27
```
It can calculate some sort of dependency between categorical (discrete) values. You can imagine it like this: there is a correlation function `corr` but it is only applicable for linear dependencies; there is a rank correlation function `rankCorr` but it is only applicable for ordered values. And now there are a few functions to calculate *something* for discrete values.
Read the [full changelog](https://github.com/ClickHouse/ClickHouse/blob/master/CHANGELOG.md) for the 22.1 release and follow [the roadmap](https://github.com/ClickHouse/ClickHouse/issues/32513).