Merge branch 'master' into fix-unsorted-array

This commit is contained in:
Yakov Olkhovskiy 2023-02-01 22:36:52 -05:00 committed by GitHub
commit f5811a3d5a
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
306 changed files with 6522 additions and 2321 deletions

View File

@ -15,7 +15,8 @@ jobs:
- name: Deploy packages and assets
run: |
GITHUB_TAG="${GITHUB_REF#refs/tags/}"
curl '${{ secrets.PACKAGES_RELEASE_URL }}/release/'"${GITHUB_TAG}"'?binary=binary_darwin&binary=binary_darwin_aarch64&sync=true' -d ''
curl --silent --data '' \
'${{ secrets.PACKAGES_RELEASE_URL }}/release/'"${GITHUB_TAG}"'?binary=binary_darwin&binary=binary_darwin_aarch64&sync=true'
############################################################################################
##################################### Docker images #######################################
############################################################################################

3
.gitmodules vendored
View File

@ -330,3 +330,6 @@
[submodule "contrib/crc32-vpmsum"]
path = contrib/crc32-vpmsum
url = https://github.com/antonblanchard/crc32-vpmsum.git
[submodule "contrib/liburing"]
path = contrib/liburing
url = https://github.com/axboe/liburing

View File

@ -13,43 +13,45 @@
* The `PREALLOCATE` option for `HASHED`/`SPARSE_HASHED` dictionaries becomes a no-op. [#45388](https://github.com/ClickHouse/ClickHouse/pull/45388) ([Azat Khuzhin](https://github.com/azat)). It does not give significant advantages anymore.
* Disallow `Gorilla` codec on columns of non-Float32 or non-Float64 type. [#45252](https://github.com/ClickHouse/ClickHouse/pull/45252) ([Robert Schulze](https://github.com/rschu1ze)). It was pointless and led to inconsistencies.
* Parallel quorum inserts might work incorrectly with `*MergeTree` tables created with the deprecated syntax. Therefore, parallel quorum inserts support is completely disabled for such tables. It does not affect tables created with a new syntax. [#45430](https://github.com/ClickHouse/ClickHouse/pull/45430) ([Alexander Tokmakov](https://github.com/tavplubix)).
* Use `GetObjectAttributes` request instead of `HeadObject` request to get the size of an object in AWS S3. This change fixes handling endpoints without explicit region after updating the AWS SDK, for example. [#45288](https://github.com/ClickHouse/ClickHouse/pull/45288) ([Vitaly Baranov](https://github.com/vitlibar)). AWS S3 and Minio are tested, but keep in mind that various S3-compatible services (GCS, R2, B2) may have subtle incompatibilities. This change also may require you to adjust the ACL to allow the `GetObjectAttributes` request.
* Use the `GetObjectAttributes` request instead of the `HeadObject` request to get the size of an object in AWS S3. This change fixes handling endpoints without explicit regions after updating the AWS SDK, for example. [#45288](https://github.com/ClickHouse/ClickHouse/pull/45288) ([Vitaly Baranov](https://github.com/vitlibar)). AWS S3 and Minio are tested, but keep in mind that various S3-compatible services (GCS, R2, B2) may have subtle incompatibilities. This change also may require you to adjust the ACL to allow the `GetObjectAttributes` request.
* Forbid paths in timezone names. For example, a timezone name like `/usr/share/zoneinfo/Asia/Aden` is not allowed; the IANA timezone database name like `Asia/Aden` should be used. [#44225](https://github.com/ClickHouse/ClickHouse/pull/44225) ([Kruglov Pavel](https://github.com/Avogar)).
* Queries combining equijoin and constant expressions (e.g., `JOIN ON t1.x = t2.x AND 1 = 1`) are forbidden due to incorrect results. [#44016](https://github.com/ClickHouse/ClickHouse/pull/44016) ([Vladimir C](https://github.com/vdimir)).
#### New Feature
* Dictionary source for extracting keys by traversing regular expressions tree. It can be used for User-Agent parsing. [#40878](https://github.com/ClickHouse/ClickHouse/pull/40878) ([Vage Ogannisian](https://github.com/nooblose)). [#43858](https://github.com/ClickHouse/ClickHouse/pull/43858) ([Han Fei](https://github.com/hanfei1991)).
* Added parametrized view functionality, now it's possible to specify query parameters for View table engine. resolves [#40907](https://github.com/ClickHouse/ClickHouse/issues/40907). [#41687](https://github.com/ClickHouse/ClickHouse/pull/41687) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Added parametrized view functionality, now it's possible to specify query parameters for the View table engine. resolves [#40907](https://github.com/ClickHouse/ClickHouse/issues/40907). [#41687](https://github.com/ClickHouse/ClickHouse/pull/41687) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Add `quantileInterpolatedWeighted`/`quantilesInterpolatedWeighted` functions. [#38252](https://github.com/ClickHouse/ClickHouse/pull/38252) ([Bharat Nallan](https://github.com/bharatnc)).
* Array join support for the `Map` type, like the function "explode" in Spark. [#43239](https://github.com/ClickHouse/ClickHouse/pull/43239) ([李扬](https://github.com/taiyang-li)).
* Support SQL standard binary and hex string literals. [#43785](https://github.com/ClickHouse/ClickHouse/pull/43785) ([Mo Xuan](https://github.com/mo-avatar)).
* Allow to format `DateTime` in Joda-Time style. Refer to [the Joda-Time docs](https://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html). [#43818](https://github.com/ClickHouse/ClickHouse/pull/43818) ([李扬](https://github.com/taiyang-li)).
* Allow formatting `DateTime` in Joda-Time style. Refer to [the Joda-Time docs](https://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html). [#43818](https://github.com/ClickHouse/ClickHouse/pull/43818) ([李扬](https://github.com/taiyang-li)).
* Implemented a fractional second formatter (`%f`) for `formatDateTime`. [#44060](https://github.com/ClickHouse/ClickHouse/pull/44060) ([ltrk2](https://github.com/ltrk2)). [#44497](https://github.com/ClickHouse/ClickHouse/pull/44497) ([Alexander Gololobov](https://github.com/davenger)).
* Added `age` function to calculate difference between two dates or dates with time values expressed as number of full units. Closes [#41115](https://github.com/ClickHouse/ClickHouse/issues/41115). [#44421](https://github.com/ClickHouse/ClickHouse/pull/44421) ([Robert Schulze](https://github.com/rschu1ze)).
* Added `age` function to calculate the difference between two dates or dates with time values expressed as the number of full units. Closes [#41115](https://github.com/ClickHouse/ClickHouse/issues/41115). [#44421](https://github.com/ClickHouse/ClickHouse/pull/44421) ([Robert Schulze](https://github.com/rschu1ze)).
* Add `Null` source for dictionaries. Closes [#44240](https://github.com/ClickHouse/ClickHouse/issues/44240). [#44502](https://github.com/ClickHouse/ClickHouse/pull/44502) ([mayamika](https://github.com/mayamika)).
* Allow configuring the S3 storage class with the `s3_storage_class` configuration option. Such as `<s3_storage_class>STANDARD/INTELLIGENT_TIERING</s3_storage_class>` Closes [#44443](https://github.com/ClickHouse/ClickHouse/issues/44443). [#44707](https://github.com/ClickHouse/ClickHouse/pull/44707) ([chen](https://github.com/xiedeyantu)).
* Insert default values in case of missing elements in JSON object while parsing named tuple. Add setting `input_format_json_defaults_for_missing_elements_in_named_tuple` that controls this behaviour. Closes [#45142](https://github.com/ClickHouse/ClickHouse/issues/45142)#issuecomment-1380153217. [#45231](https://github.com/ClickHouse/ClickHouse/pull/45231) ([Kruglov Pavel](https://github.com/Avogar)).
* Record server startup time in ProfileEvents (`ServerStartupMilliseconds`). Resolves [#43188](https://github.com/ClickHouse/ClickHouse/issues/43188). [#45250](https://github.com/ClickHouse/ClickHouse/pull/45250) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Refactor and Improve streaming engines Kafka/RabbitMQ/NATS and add support for all formats, also refactor formats a bit: - Fix producing messages in row-based formats with suffixes/prefixes. Now every message is formatted completely with all delimiters and can be parsed back using input format. - Support block-based formats like Native, Parquet, ORC, etc. Every block is formatted as a separated message. The number of rows in one message depends on the block size, so you can control it via setting `max_block_size`. - Add new engine settings `kafka_max_rows_per_message/rabbitmq_max_rows_per_message/nats_max_rows_per_message`. They control the number of rows formatted in one message in row-based formats. Default value: 1. - Fix high memory consumption in NATS table engine. - Support arbitrary binary data in NATS producer (previously it worked only with strings contained \0 at the end) - Add missing Kafka/RabbitMQ/NATS engine settings in documentation. - Refactor producing and consuming in Kafka/RabbitMQ/NATS, separate it from WriteBuffers/ReadBuffers semantic. - Refactor output formats: remove callbacks on each row used in Kafka/RabbitMQ/NATS (now we don't use callbacks there), allow to use IRowOutputFormat directly, clarify row end and row between delimiters, make it possible to reset output format to start formatting again - Add proper implementation in formatRow function (bonus after formats refactoring). [#42777](https://github.com/ClickHouse/ClickHouse/pull/42777) ([Kruglov Pavel](https://github.com/Avogar)).
* Refactor and Improve streaming engines Kafka/RabbitMQ/NATS and add support for all formats, also refactor formats a bit: - Fix producing messages in row-based formats with suffixes/prefixes. Now every message is formatted completely with all delimiters and can be parsed back using input format. - Support block-based formats like Native, Parquet, ORC, etc. Every block is formatted as a separate message. The number of rows in one message depends on the block size, so you can control it via the setting `max_block_size`. - Add new engine settings `kafka_max_rows_per_message/rabbitmq_max_rows_per_message/nats_max_rows_per_message`. They control the number of rows formatted in one message in row-based formats. Default value: 1. - Fix high memory consumption in the NATS table engine. - Support arbitrary binary data in NATS producer (previously it worked only with strings contained \0 at the end) - Add missing Kafka/RabbitMQ/NATS engine settings in the documentation. - Refactor producing and consuming in Kafka/RabbitMQ/NATS, separate it from WriteBuffers/ReadBuffers semantic. - Refactor output formats: remove callbacks on each row used in Kafka/RabbitMQ/NATS (now we don't use callbacks there), allow to use IRowOutputFormat directly, clarify row end and row between delimiters, make it possible to reset output format to start formatting again - Add proper implementation in formatRow function (bonus after formats refactoring). [#42777](https://github.com/ClickHouse/ClickHouse/pull/42777) ([Kruglov Pavel](https://github.com/Avogar)).
* Support reading/writing `Nested` tables as `List` of `Struct` in `CapnProto` format. Read/write `Decimal32/64` as `Int32/64`. Closes [#43319](https://github.com/ClickHouse/ClickHouse/issues/43319). [#43379](https://github.com/ClickHouse/ClickHouse/pull/43379) ([Kruglov Pavel](https://github.com/Avogar)).
* Added a `message_format_string` column to `system.text_log`. The column contains a pattern that was used to format the message. [#44543](https://github.com/ClickHouse/ClickHouse/pull/44543) ([Alexander Tokmakov](https://github.com/tavplubix)). This allows various analytics over ClickHouse own logs.
* Try to autodetect header with column names (and maybe types) for CSV/TSV/CustomSeparated input formats.
Add settings input_format_tsv/csv/custom_detect_header that enables this behaviour (enabled by default). Closes [#44640](https://github.com/ClickHouse/ClickHouse/issues/44640). [#44953](https://github.com/ClickHouse/ClickHouse/pull/44953) ([Kruglov Pavel](https://github.com/Avogar)).
* Added a `message_format_string` column to `system.text_log`. The column contains a pattern that was used to format the message. [#44543](https://github.com/ClickHouse/ClickHouse/pull/44543) ([Alexander Tokmakov](https://github.com/tavplubix)). This allows various analytics over the ClickHouse logs.
* Try to autodetect headers with column names (and maybe types) for CSV/TSV/CustomSeparated input formats.
Add settings input_format_tsv/csv/custom_detect_header that enable this behaviour (enabled by default). Closes [#44640](https://github.com/ClickHouse/ClickHouse/issues/44640). [#44953](https://github.com/ClickHouse/ClickHouse/pull/44953) ([Kruglov Pavel](https://github.com/Avogar)).
#### Experimental Feature
* Add an experimental inverted index as a new secondary index type for efficient text search. [#38667](https://github.com/ClickHouse/ClickHouse/pull/38667) ([larryluogit](https://github.com/larryluogit)).
* Add experimental query result cache. [#43797](https://github.com/ClickHouse/ClickHouse/pull/43797) ([Robert Schulze](https://github.com/rschu1ze)).
* Added extendable and configurable scheduling subsystem for IO requests (not yet integrated with IO code itself). [#41840](https://github.com/ClickHouse/ClickHouse/pull/41840) ([Sergei Trifonov](https://github.com/serxa)). This feature does nothing at all, enjoy.
* Added `SYSTEM DROP DATABASE REPLICA` that removes metadata of dead replica of `Replicated` database. Resolves [#41794](https://github.com/ClickHouse/ClickHouse/issues/41794). [#42807](https://github.com/ClickHouse/ClickHouse/pull/42807) ([Alexander Tokmakov](https://github.com/tavplubix)).
* Added `SYSTEM DROP DATABASE REPLICA` that removes metadata of a dead replica of a `Replicated` database. Resolves [#41794](https://github.com/ClickHouse/ClickHouse/issues/41794). [#42807](https://github.com/ClickHouse/ClickHouse/pull/42807) ([Alexander Tokmakov](https://github.com/tavplubix)).
#### Performance Improvement
* Do not load inactive parts at startup of `MergeTree` tables. [#42181](https://github.com/ClickHouse/ClickHouse/pull/42181) ([Anton Popov](https://github.com/CurtizJ)).
* Improved latency of reading from storage `S3` and table function `s3` with large number of small files. Now settings `remote_filesystem_read_method` and `remote_filesystem_read_prefetch` take effect while reading from storage `S3`. [#43726](https://github.com/ClickHouse/ClickHouse/pull/43726) ([Anton Popov](https://github.com/CurtizJ)).
* Improved latency of reading from storage `S3` and table function `s3` with large numbers of small files. Now settings `remote_filesystem_read_method` and `remote_filesystem_read_prefetch` take effect while reading from storage `S3`. [#43726](https://github.com/ClickHouse/ClickHouse/pull/43726) ([Anton Popov](https://github.com/CurtizJ)).
* Optimization for reading struct fields in Parquet/ORC files. Only the required fields are loaded. [#44484](https://github.com/ClickHouse/ClickHouse/pull/44484) ([lgbo](https://github.com/lgbo-ustc)).
* Two-level aggregation algorithm was mistakenly disabled for queries over HTTP interface. It was enabled back, and it leads to a major performance improvement. [#45450](https://github.com/ClickHouse/ClickHouse/pull/45450) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Two-level aggregation algorithm was mistakenly disabled for queries over the HTTP interface. It was enabled back, and it leads to a major performance improvement. [#45450](https://github.com/ClickHouse/ClickHouse/pull/45450) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Added mmap support for StorageFile, which should improve the performance of clickhouse-local. [#43927](https://github.com/ClickHouse/ClickHouse/pull/43927) ([pufit](https://github.com/pufit)).
* Added sharding support in HashedDictionary to allow parallel load (almost linear scaling based on number of shards). [#40003](https://github.com/ClickHouse/ClickHouse/pull/40003) ([Azat Khuzhin](https://github.com/azat)).
* Speed up query parsing. [#42284](https://github.com/ClickHouse/ClickHouse/pull/42284) ([Raúl Marín](https://github.com/Algunenano)).
* Always replace OR chain `expr = x1 OR ... OR expr = xN` to `expr IN (x1, ..., xN)` in case if `expr` is a `LowCardinality` column. Setting `optimize_min_equality_disjunction_chain_length` is ignored in this case. [#42889](https://github.com/ClickHouse/ClickHouse/pull/42889) ([Guo Wangyang](https://github.com/guowangy)).
* Always replace OR chain `expr = x1 OR ... OR expr = xN` to `expr IN (x1, ..., xN)` in the case where `expr` is a `LowCardinality` column. Setting `optimize_min_equality_disjunction_chain_length` is ignored in this case. [#42889](https://github.com/ClickHouse/ClickHouse/pull/42889) ([Guo Wangyang](https://github.com/guowangy)).
* Slightly improve performance by optimizing the code around ThreadStatus. [#43586](https://github.com/ClickHouse/ClickHouse/pull/43586) ([Zhiguo Zhou](https://github.com/ZhiguoZh)).
* Optimize the column-wise ternary logic evaluation by achieving auto-vectorization. In the performance test of this [microbenchmark](https://github.com/ZhiguoZh/ClickHouse/blob/20221123-ternary-logic-opt-example/src/Functions/examples/associative_applier_perf.cpp), we've observed a peak **performance gain** of **21x** on the ICX device (Intel Xeon Platinum 8380 CPU). [#43669](https://github.com/ClickHouse/ClickHouse/pull/43669) ([Zhiguo Zhou](https://github.com/ZhiguoZh)).
* Avoid acquiring read locks in the `system.tables` table if possible. [#43840](https://github.com/ClickHouse/ClickHouse/pull/43840) ([Raúl Marín](https://github.com/Algunenano)).
@ -59,10 +61,10 @@ Add settings input_format_tsv/csv/custom_detect_header that enables this behavio
* Add fast path for: - `col like '%%'`; - `col like '%'`; - `col not like '%'`; - `col not like '%'`; - `match(col, '.*')`. [#45244](https://github.com/ClickHouse/ClickHouse/pull/45244) ([李扬](https://github.com/taiyang-li)).
* Slightly improve happy path optimisation in filtering (WHERE clause). [#45289](https://github.com/ClickHouse/ClickHouse/pull/45289) ([Nikita Taranov](https://github.com/nickitat)).
* Provide monotonicity info for `toUnixTimestamp64*` to enable more algebraic optimizations for index analysis. [#44116](https://github.com/ClickHouse/ClickHouse/pull/44116) ([Nikita Taranov](https://github.com/nickitat)).
* Allow to configure temporary data for query processing (spilling to disk) to cooperate with filesystem cache (taking up the space from the cache disk) [#43972](https://github.com/ClickHouse/ClickHouse/pull/43972) ([Vladimir C](https://github.com/vdimir)). This mainly improves [ClickHouse Cloud](https://clickhouse.cloud/), but can be used for self-managed setups as well, if you know what to do.
* Allow the configuration of temporary data for query processing (spilling to disk) to cooperate with the filesystem cache (taking up the space from the cache disk) [#43972](https://github.com/ClickHouse/ClickHouse/pull/43972) ([Vladimir C](https://github.com/vdimir)). This mainly improves [ClickHouse Cloud](https://clickhouse.cloud/), but can be used for self-managed setups as well, if you know what to do.
* Make `system.replicas` table do parallel fetches of replicas statuses. Closes [#43918](https://github.com/ClickHouse/ClickHouse/issues/43918). [#43998](https://github.com/ClickHouse/ClickHouse/pull/43998) ([Nikolay Degterinsky](https://github.com/evillique)).
* Optimize memory consumption during backup to S3: files to S3 now will be copied directly without using `WriteBufferFromS3` (which could use a lot of memory). [#45188](https://github.com/ClickHouse/ClickHouse/pull/45188) ([Vitaly Baranov](https://github.com/vitlibar)).
* Add a cache for async block ids. This will reduce the requests of zookeeper when we enable async inserts deduplication. [#45106](https://github.com/ClickHouse/ClickHouse/pull/45106) ([Han Fei](https://github.com/hanfei1991)).
* Add a cache for async block ids. This will reduce the number of requests of ZooKeeper when we enable async inserts deduplication. [#45106](https://github.com/ClickHouse/ClickHouse/pull/45106) ([Han Fei](https://github.com/hanfei1991)).
#### Improvement
@ -71,25 +73,25 @@ Add settings input_format_tsv/csv/custom_detect_header that enables this behavio
* Added fields `supports_parallel_parsing` and `supports_parallel_formatting` to table `system.formats` for better introspection. [#45499](https://github.com/ClickHouse/ClickHouse/pull/45499) ([Anton Popov](https://github.com/CurtizJ)).
* Improve reading CSV field in CustomSeparated/Template format. Closes [#42352](https://github.com/ClickHouse/ClickHouse/issues/42352) Closes [#39620](https://github.com/ClickHouse/ClickHouse/issues/39620). [#43332](https://github.com/ClickHouse/ClickHouse/pull/43332) ([Kruglov Pavel](https://github.com/Avogar)).
* Unify query elapsed time measurements. [#43455](https://github.com/ClickHouse/ClickHouse/pull/43455) ([Raúl Marín](https://github.com/Algunenano)).
* Improve automatic usage of structure from insertion table in table functions file/hdfs/s3 when virtual columns present in select query, it fixes possible error `Block structure mismatch` or `number of columns mismatch`. [#43695](https://github.com/ClickHouse/ClickHouse/pull/43695) ([Kruglov Pavel](https://github.com/Avogar)).
* Add support for signed arguments in function `range`. Fixes [#43333](https://github.com/ClickHouse/ClickHouse/issues/43333). [#43733](https://github.com/ClickHouse/ClickHouse/pull/43733) ([sanyu](https://github.com/wineternity)).
* Improve automatic usage of structure from insertion table in table functions file/hdfs/s3 when virtual columns are present in a select query, it fixes the possible error `Block structure mismatch` or `number of columns mismatch`. [#43695](https://github.com/ClickHouse/ClickHouse/pull/43695) ([Kruglov Pavel](https://github.com/Avogar)).
* Add support for signed arguments in the function `range`. Fixes [#43333](https://github.com/ClickHouse/ClickHouse/issues/43333). [#43733](https://github.com/ClickHouse/ClickHouse/pull/43733) ([sanyu](https://github.com/wineternity)).
* Remove redundant sorting, for example, sorting related ORDER BY clauses in subqueries. Implemented on top of query plan. It does similar optimization as `optimize_duplicate_order_by_and_distinct` regarding `ORDER BY` clauses, but more generic, since it's applied to any redundant sorting steps (not only caused by ORDER BY clause) and applied to subqueries of any depth. Related to [#42648](https://github.com/ClickHouse/ClickHouse/issues/42648). [#43905](https://github.com/ClickHouse/ClickHouse/pull/43905) ([Igor Nikonov](https://github.com/devcrafter)).
* Add ability to disable deduplication of files for BACKUP (for backups wiithout deduplication ATTACH can be used instead of full RESTORE), example `BACKUP foo TO S3(...) SETTINGS deduplicate_files=0` (default `deduplicate_files=1`). [#43947](https://github.com/ClickHouse/ClickHouse/pull/43947) ([Azat Khuzhin](https://github.com/azat)).
* Add the ability to disable deduplication of files for BACKUP (for backups without deduplication ATTACH can be used instead of full RESTORE). For example `BACKUP foo TO S3(...) SETTINGS deduplicate_files=0` (default `deduplicate_files=1`). [#43947](https://github.com/ClickHouse/ClickHouse/pull/43947) ([Azat Khuzhin](https://github.com/azat)).
* Refactor and improve schema inference for text formats. Add new setting `schema_inference_make_columns_nullable` that controls making result types `Nullable` (enabled by default);. [#44019](https://github.com/ClickHouse/ClickHouse/pull/44019) ([Kruglov Pavel](https://github.com/Avogar)).
* Better support for `PROXYv1` protocol. [#44135](https://github.com/ClickHouse/ClickHouse/pull/44135) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Add information about the latest part check by cleanup threads into `system.parts` table. [#44244](https://github.com/ClickHouse/ClickHouse/pull/44244) ([Dmitry Novik](https://github.com/novikd)).
* Disable table functions in readonly mode for inserts. [#44290](https://github.com/ClickHouse/ClickHouse/pull/44290) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Add a setting `simultaneous_parts_removal_limit` to allow to limit the number of parts being processed by one iteration of CleanupThread. [#44461](https://github.com/ClickHouse/ClickHouse/pull/44461) ([Dmitry Novik](https://github.com/novikd)).
* If user only need virtual columns, we don't need to initialize ReadBufferFromS3. May be helpful to [#44246](https://github.com/ClickHouse/ClickHouse/issues/44246). [#44493](https://github.com/ClickHouse/ClickHouse/pull/44493) ([chen](https://github.com/xiedeyantu)).
* Add a setting `simultaneous_parts_removal_limit` to allow limiting the number of parts being processed by one iteration of CleanupThread. [#44461](https://github.com/ClickHouse/ClickHouse/pull/44461) ([Dmitry Novik](https://github.com/novikd)).
* Do not initialize ReadBufferFromS3 when only virtual columns are needed in a query. This may be helpful to [#44246](https://github.com/ClickHouse/ClickHouse/issues/44246). [#44493](https://github.com/ClickHouse/ClickHouse/pull/44493) ([chen](https://github.com/xiedeyantu)).
* Prevent duplicate column names hints. Closes [#44130](https://github.com/ClickHouse/ClickHouse/issues/44130). [#44519](https://github.com/ClickHouse/ClickHouse/pull/44519) ([Joanna Hulboj](https://github.com/jh0x)).
* Allow macro substitution in endpoint of disks. Resolve [#40951](https://github.com/ClickHouse/ClickHouse/issues/40951). [#44533](https://github.com/ClickHouse/ClickHouse/pull/44533) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Improve schema inference when `input_format_json_read_object_as_string` is enabled. [#44546](https://github.com/ClickHouse/ClickHouse/pull/44546) ([Kruglov Pavel](https://github.com/Avogar)).
* Add a user-level setting `database_replicated_allow_replicated_engine_arguments` which allow to ban creation of `ReplicatedMergeTree` tables with arguments in `DatabaseReplicated`. [#44566](https://github.com/ClickHouse/ClickHouse/pull/44566) ([alesapin](https://github.com/alesapin)).
* Add a user-level setting `database_replicated_allow_replicated_engine_arguments` which allows banning the creation of `ReplicatedMergeTree` tables with arguments in `DatabaseReplicated`. [#44566](https://github.com/ClickHouse/ClickHouse/pull/44566) ([alesapin](https://github.com/alesapin)).
* Prevent users from mistakenly specifying zero (invalid) value for `index_granularity`. This closes [#44536](https://github.com/ClickHouse/ClickHouse/issues/44536). [#44578](https://github.com/ClickHouse/ClickHouse/pull/44578) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Added possibility to set path to service keytab file in `keytab` parameter in `kerberos` section of config.xml. [#44594](https://github.com/ClickHouse/ClickHouse/pull/44594) ([Roman Vasin](https://github.com/rvasin)).
* Use already written part of the query for fuzzy search (pass to the `skim` library, which is written in Rust and linked statically to ClickHouse). [#44600](https://github.com/ClickHouse/ClickHouse/pull/44600) ([Azat Khuzhin](https://github.com/azat)).
* Enable `input_format_json_read_objects_as_strings` by default to be able to read nested JSON objects while JSON Object type is experimental. [#44657](https://github.com/ClickHouse/ClickHouse/pull/44657) ([Kruglov Pavel](https://github.com/Avogar)).
* Improvement for deduplication of async inserts: when users do duplicate async inserts, we should dedup inside the memory before we query keeper. [#44682](https://github.com/ClickHouse/ClickHouse/pull/44682) ([Han Fei](https://github.com/hanfei1991)).
* Improvement for deduplication of async inserts: when users do duplicate async inserts, we should deduplicate inside the memory before we query Keeper. [#44682](https://github.com/ClickHouse/ClickHouse/pull/44682) ([Han Fei](https://github.com/hanfei1991)).
* Input/ouptut `Avro` format will parse bool type as ClickHouse bool type. [#44684](https://github.com/ClickHouse/ClickHouse/pull/44684) ([Kruglov Pavel](https://github.com/Avogar)).
* Support Bool type in Arrow/Parquet/ORC. Closes [#43970](https://github.com/ClickHouse/ClickHouse/issues/43970). [#44698](https://github.com/ClickHouse/ClickHouse/pull/44698) ([Kruglov Pavel](https://github.com/Avogar)).
* Don't greedily parse beyond the quotes when reading UUIDs - it may lead to mistakenly successful parsing of incorrect data. [#44686](https://github.com/ClickHouse/ClickHouse/pull/44686) ([Raúl Marín](https://github.com/Algunenano)).
@ -98,7 +100,7 @@ Add settings input_format_tsv/csv/custom_detect_header that enables this behavio
* Fix `output_format_pretty_row_numbers` does not preserve the counter across the blocks. Closes [#44815](https://github.com/ClickHouse/ClickHouse/issues/44815). [#44832](https://github.com/ClickHouse/ClickHouse/pull/44832) ([flynn](https://github.com/ucasfl)).
* Don't report errors in `system.errors` due to parts being merged concurrently with the background cleanup process. [#44874](https://github.com/ClickHouse/ClickHouse/pull/44874) ([Raúl Marín](https://github.com/Algunenano)).
* Optimize and fix metrics for Distributed async INSERT. [#44922](https://github.com/ClickHouse/ClickHouse/pull/44922) ([Azat Khuzhin](https://github.com/azat)).
* Added settings to disallow concurrent backups and restores resolves [#43891](https://github.com/ClickHouse/ClickHouse/issues/43891) Implementation: * Added server level settings to disallow concurrent backups and restores, which are read and set when BackupWorker is created in Context. * Settings are set to true by default. * Before starting backup or restores, added a check to see if any other backups/restores are running. For internal request it checks if its from the self node using backup_uuid. [#45072](https://github.com/ClickHouse/ClickHouse/pull/45072) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Added settings to disallow concurrent backups and restores resolves [#43891](https://github.com/ClickHouse/ClickHouse/issues/43891) Implementation: * Added server-level settings to disallow concurrent backups and restores, which are read and set when BackupWorker is created in Context. * Settings are set to true by default. * Before starting backup or restores, added a check to see if any other backups/restores are running. For internal requests, it checks if it is from the self node using backup_uuid. [#45072](https://github.com/ClickHouse/ClickHouse/pull/45072) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Add `<storage_policy>` config parameter for system logs. [#45320](https://github.com/ClickHouse/ClickHouse/pull/45320) ([Stig Bakken](https://github.com/stigsb)).
#### Build/Testing/Packaging Improvement
@ -114,22 +116,21 @@ Add settings input_format_tsv/csv/custom_detect_header that enables this behavio
#### Bug Fix
* Replace domain IP types (IPv4, IPv6) with native. [#43221](https://github.com/ClickHouse/ClickHouse/pull/43221) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)). It automatically fixes some missing implementations in the code.
* Fix backup process if mutations get killed during the backup process. [#45351](https://github.com/ClickHouse/ClickHouse/pull/45351) ([Vitaly Baranov](https://github.com/vitlibar)).
* Fix the backup process if mutations get killed during the backup process. [#45351](https://github.com/ClickHouse/ClickHouse/pull/45351) ([Vitaly Baranov](https://github.com/vitlibar)).
* Fix the `Invalid number of rows in Chunk` exception message. [#41404](https://github.com/ClickHouse/ClickHouse/issues/41404). [#42126](https://github.com/ClickHouse/ClickHouse/pull/42126) ([Alexander Gololobov](https://github.com/davenger)).
* Fix possible use of uninitialized value after executing expressions after sorting. Closes [#43386](https://github.com/ClickHouse/ClickHouse/issues/43386) [#43635](https://github.com/ClickHouse/ClickHouse/pull/43635) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix possible use of an uninitialized value after executing expressions after sorting. Closes [#43386](https://github.com/ClickHouse/ClickHouse/issues/43386) [#43635](https://github.com/ClickHouse/ClickHouse/pull/43635) ([Kruglov Pavel](https://github.com/Avogar)).
* Better handling of NULL in aggregate combinators, fix possible segfault/logical error while using an obscure optimization `optimize_rewrite_sum_if_to_count_if`. Closes [#43758](https://github.com/ClickHouse/ClickHouse/issues/43758). [#43813](https://github.com/ClickHouse/ClickHouse/pull/43813) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix CREATE USER/ROLE query settings constraints. [#43993](https://github.com/ClickHouse/ClickHouse/pull/43993) ([Nikolay Degterinsky](https://github.com/evillique)).
* Fix wrong behavior of `JOIN ON t1.x = t2.x AND 1 = 1`, forbid such queries. [#44016](https://github.com/ClickHouse/ClickHouse/pull/44016) ([Vladimir C](https://github.com/vdimir)).
* Fixed bug with non-parsable default value for `EPHEMERAL` column in table metadata. [#44026](https://github.com/ClickHouse/ClickHouse/pull/44026) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Fix parsing of bad version from compatibility setting. [#44224](https://github.com/ClickHouse/ClickHouse/pull/44224) ([Kruglov Pavel](https://github.com/Avogar)).
* Bring interval subtraction from datetime in line with addition. [#44241](https://github.com/ClickHouse/ClickHouse/pull/44241) ([ltrk2](https://github.com/ltrk2)).
* Remove limits on maximum size of the result for view. [#44261](https://github.com/ClickHouse/ClickHouse/pull/44261) ([lizhuoyu5](https://github.com/lzydmxy)).
* Remove limits on the maximum size of the result for view. [#44261](https://github.com/ClickHouse/ClickHouse/pull/44261) ([lizhuoyu5](https://github.com/lzydmxy)).
* Fix possible logical error in cache if `do_not_evict_index_and_mrk_files=1`. Closes [#42142](https://github.com/ClickHouse/ClickHouse/issues/42142). [#44268](https://github.com/ClickHouse/ClickHouse/pull/44268) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Fix possible too early cache write interruption in write-through cache (caching could be stopped due to false assumption when it shouldn't have). [#44289](https://github.com/ClickHouse/ClickHouse/pull/44289) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Fix possible crash in case function `IN` with constant arguments was used as a constant argument together with `LowCardinality`. Fixes [#44221](https://github.com/ClickHouse/ClickHouse/issues/44221). [#44346](https://github.com/ClickHouse/ClickHouse/pull/44346) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix possible crash in the case function `IN` with constant arguments was used as a constant argument together with `LowCardinality`. Fixes [#44221](https://github.com/ClickHouse/ClickHouse/issues/44221). [#44346](https://github.com/ClickHouse/ClickHouse/pull/44346) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix support for complex parameters (like arrays) of parametric aggregate functions. This closes [#30975](https://github.com/ClickHouse/ClickHouse/issues/30975). The aggregate function `sumMapFiltered` was unusable in distributed queries before this change. [#44358](https://github.com/ClickHouse/ClickHouse/pull/44358) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix reading ObjectId in BSON schema inference. [#44382](https://github.com/ClickHouse/ClickHouse/pull/44382) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix race which can lead to premature temp parts removal before merge finished in ReplicatedMergeTree. This issue could lead to errors like `No such file or directory: xxx`. Fixes [#43983](https://github.com/ClickHouse/ClickHouse/issues/43983). [#44383](https://github.com/ClickHouse/ClickHouse/pull/44383) ([alesapin](https://github.com/alesapin)).
* Fix race which can lead to premature temp parts removal before merge finishes in ReplicatedMergeTree. This issue could lead to errors like `No such file or directory: xxx`. Fixes [#43983](https://github.com/ClickHouse/ClickHouse/issues/43983). [#44383](https://github.com/ClickHouse/ClickHouse/pull/44383) ([alesapin](https://github.com/alesapin)).
* Some invalid `SYSTEM ... ON CLUSTER` queries worked in an unexpected way if a cluster name was not specified. It's fixed, now invalid queries throw `SYNTAX_ERROR` as they should. Fixes [#44264](https://github.com/ClickHouse/ClickHouse/issues/44264). [#44387](https://github.com/ClickHouse/ClickHouse/pull/44387) ([Alexander Tokmakov](https://github.com/tavplubix)).
* Fix reading Map type in ORC format. [#44400](https://github.com/ClickHouse/ClickHouse/pull/44400) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix reading columns that are not presented in input data in Parquet/ORC formats. Previously it could lead to error `INCORRECT_NUMBER_OF_COLUMNS`. Closes [#44333](https://github.com/ClickHouse/ClickHouse/issues/44333). [#44405](https://github.com/ClickHouse/ClickHouse/pull/44405) ([Kruglov Pavel](https://github.com/Avogar)).
@ -137,49 +138,49 @@ Add settings input_format_tsv/csv/custom_detect_header that enables this behavio
* Placing profile settings after profile settings constraints in the configuration file made constraints ineffective. [#44411](https://github.com/ClickHouse/ClickHouse/pull/44411) ([Konstantin Bogdanov](https://github.com/thevar1able)).
* Fix `SYNTAX_ERROR` while running `EXPLAIN AST INSERT` queries with data. Closes [#44207](https://github.com/ClickHouse/ClickHouse/issues/44207). [#44413](https://github.com/ClickHouse/ClickHouse/pull/44413) ([save-my-heart](https://github.com/save-my-heart)).
* Fix reading bool value with CRLF in CSV format. Closes [#44401](https://github.com/ClickHouse/ClickHouse/issues/44401). [#44442](https://github.com/ClickHouse/ClickHouse/pull/44442) ([Kruglov Pavel](https://github.com/Avogar)).
* Don't execute and/or/if/multiIf on LowCardinality dictionary, so the result type cannot be LowCardinality. It could lead to error `Illegal column ColumnLowCardinality` in some cases. Fixes [#43603](https://github.com/ClickHouse/ClickHouse/issues/43603). [#44469](https://github.com/ClickHouse/ClickHouse/pull/44469) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix mutations with setting `max_streams_for_merge_tree_reading`. [#44472](https://github.com/ClickHouse/ClickHouse/pull/44472) ([Anton Popov](https://github.com/CurtizJ)).
* Don't execute and/or/if/multiIf on a LowCardinality dictionary, so the result type cannot be LowCardinality. It could lead to the error `Illegal column ColumnLowCardinality` in some cases. Fixes [#43603](https://github.com/ClickHouse/ClickHouse/issues/43603). [#44469](https://github.com/ClickHouse/ClickHouse/pull/44469) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix mutations with the setting `max_streams_for_merge_tree_reading`. [#44472](https://github.com/ClickHouse/ClickHouse/pull/44472) ([Anton Popov](https://github.com/CurtizJ)).
* Fix potential null pointer dereference with GROUPING SETS in ASTSelectQuery::formatImpl ([#43049](https://github.com/ClickHouse/ClickHouse/issues/43049)). [#44479](https://github.com/ClickHouse/ClickHouse/pull/44479) ([Robert Schulze](https://github.com/rschu1ze)).
* Validate types in table function arguments, CAST function arguments, JSONAsObject schema inference according to settings. [#44501](https://github.com/ClickHouse/ClickHouse/pull/44501) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix IN function with LowCardinality and const column, close [#44503](https://github.com/ClickHouse/ClickHouse/issues/44503). [#44506](https://github.com/ClickHouse/ClickHouse/pull/44506) ([Duc Canh Le](https://github.com/canhld94)).
* Fixed a bug in normalization of a `DEFAULT` expression in `CREATE TABLE` statement. The second argument of function `in` (or the right argument of operator `IN`) might be replaced with the result of its evaluation during CREATE query execution. Fixes [#44496](https://github.com/ClickHouse/ClickHouse/issues/44496). [#44547](https://github.com/ClickHouse/ClickHouse/pull/44547) ([Alexander Tokmakov](https://github.com/tavplubix)).
* Fixed a bug in the normalization of a `DEFAULT` expression in `CREATE TABLE` statement. The second argument of the function `in` (or the right argument of operator `IN`) might be replaced with the result of its evaluation during CREATE query execution. Fixes [#44496](https://github.com/ClickHouse/ClickHouse/issues/44496). [#44547](https://github.com/ClickHouse/ClickHouse/pull/44547) ([Alexander Tokmakov](https://github.com/tavplubix)).
* Projections do not work in presence of WITH ROLLUP, WITH CUBE and WITH TOTALS. In previous versions, a query produced an exception instead of skipping the usage of projections. This closes [#44614](https://github.com/ClickHouse/ClickHouse/issues/44614). This closes [#42772](https://github.com/ClickHouse/ClickHouse/issues/42772). [#44615](https://github.com/ClickHouse/ClickHouse/pull/44615) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Async blocks were not cleaned because the function `get all blocks sorted by time` didn't get async blocks. [#44651](https://github.com/ClickHouse/ClickHouse/pull/44651) ([Han Fei](https://github.com/hanfei1991)).
* Fix `LOGICAL_ERROR` `The top step of the right pipeline should be ExpressionStep` for JOIN with subquery, UNION, and TOTALS. Fixes [#43687](https://github.com/ClickHouse/ClickHouse/issues/43687). [#44673](https://github.com/ClickHouse/ClickHouse/pull/44673) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Avoid `std::out_of_range` exception in the Executable table engine. [#44681](https://github.com/ClickHouse/ClickHouse/pull/44681) ([Kruglov Pavel](https://github.com/Avogar)).
* Do not apply `optimize_syntax_fuse_functions` to quantiles on AST, close [#44712](https://github.com/ClickHouse/ClickHouse/issues/44712). [#44713](https://github.com/ClickHouse/ClickHouse/pull/44713) ([Vladimir C](https://github.com/vdimir)).
* Fix bug with wrong type in Merge table and PREWHERE, close [#43324](https://github.com/ClickHouse/ClickHouse/issues/43324). [#44716](https://github.com/ClickHouse/ClickHouse/pull/44716) ([Vladimir C](https://github.com/vdimir)).
* Fix possible crash during shutdown (while destroying TraceCollector). Fixes [#44757](https://github.com/ClickHouse/ClickHouse/issues/44757). [#44758](https://github.com/ClickHouse/ClickHouse/pull/44758) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix a possible crash in distributed query processing. The crash could happen if a query with totals or extremes returned an empty result and there are mismatched types in the Distrubuted and the local tables. Fixes [#44738](https://github.com/ClickHouse/ClickHouse/issues/44738). [#44760](https://github.com/ClickHouse/ClickHouse/pull/44760) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix a possible crash during shutdown (while destroying TraceCollector). Fixes [#44757](https://github.com/ClickHouse/ClickHouse/issues/44757). [#44758](https://github.com/ClickHouse/ClickHouse/pull/44758) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix a possible crash in distributed query processing. The crash could happen if a query with totals or extremes returned an empty result and there are mismatched types in the Distributed and the local tables. Fixes [#44738](https://github.com/ClickHouse/ClickHouse/issues/44738). [#44760](https://github.com/ClickHouse/ClickHouse/pull/44760) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix fsync for fetches (`min_compressed_bytes_to_fsync_after_fetch`)/small files (ttl.txt, columns.txt) in mutations (`min_rows_to_fsync_after_merge`/`min_compressed_bytes_to_fsync_after_merge`). [#44781](https://github.com/ClickHouse/ClickHouse/pull/44781) ([Azat Khuzhin](https://github.com/azat)).
* A rare race condition was possible when querying the `system.parts` or `system.parts_columns` tables in the presence of parts being moved between disks. Introduced in [#41145](https://github.com/ClickHouse/ClickHouse/issues/41145). [#44809](https://github.com/ClickHouse/ClickHouse/pull/44809) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix the error `Context has expired` which could appear with enabled projections optimization. Can be reproduced for queries with specific functions, like `dictHas/dictGet` which use context in runtime. Fixes [#44844](https://github.com/ClickHouse/ClickHouse/issues/44844). [#44850](https://github.com/ClickHouse/ClickHouse/pull/44850) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* A fix for `Cannot read all data` error which could happen while reading `LowCardinality` dictionary from remote fs. Fixes [#44709](https://github.com/ClickHouse/ClickHouse/issues/44709). [#44875](https://github.com/ClickHouse/ClickHouse/pull/44875) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Ignore cases when hardware monitor sensors cannot be read instead of showing a full exception message in logs. [#44895](https://github.com/ClickHouse/ClickHouse/pull/44895) ([Raúl Marín](https://github.com/Algunenano)).
* Use `max_delay_to_insert` value in case calculated time to delay INSERT exceeds the setting value. Related to [#44902](https://github.com/ClickHouse/ClickHouse/issues/44902). [#44916](https://github.com/ClickHouse/ClickHouse/pull/44916) ([Igor Nikonov](https://github.com/devcrafter)).
* Use `max_delay_to_insert` value in case the calculated time to delay INSERT exceeds the setting value. Related to [#44902](https://github.com/ClickHouse/ClickHouse/issues/44902). [#44916](https://github.com/ClickHouse/ClickHouse/pull/44916) ([Igor Nikonov](https://github.com/devcrafter)).
* Fix error `Different order of columns in UNION subquery` for queries with `UNION`. Fixes [#44866](https://github.com/ClickHouse/ClickHouse/issues/44866). [#44920](https://github.com/ClickHouse/ClickHouse/pull/44920) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Delay for INSERT can be calculated incorrectly, which can lead to always using `max_delay_to_insert` setting as delay instead of a correct value. Using simple formula `max_delay_to_insert * (parts_over_threshold/max_allowed_parts_over_threshold)` i.e. delay grows proportionally to parts over threshold. Closes [#44902](https://github.com/ClickHouse/ClickHouse/issues/44902). [#44954](https://github.com/ClickHouse/ClickHouse/pull/44954) ([Igor Nikonov](https://github.com/devcrafter)).
* fix alter table ttl error when wide part has light weight delete mask. [#44959](https://github.com/ClickHouse/ClickHouse/pull/44959) ([Mingliang Pan](https://github.com/liangliangpan)).
* Fix alter table TTL error when a wide part has the lightweight delete mask. [#44959](https://github.com/ClickHouse/ClickHouse/pull/44959) ([Mingliang Pan](https://github.com/liangliangpan)).
* Follow-up fix for Replace domain IP types (IPv4, IPv6) with native [#43221](https://github.com/ClickHouse/ClickHouse/issues/43221). [#45024](https://github.com/ClickHouse/ClickHouse/pull/45024) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Follow-up fix for Replace domain IP types (IPv4, IPv6) with native https://github.com/ClickHouse/ClickHouse/pull/43221. [#45043](https://github.com/ClickHouse/ClickHouse/pull/45043) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* A buffer overflow was possible in the parser. Found by fuzzer. [#45047](https://github.com/ClickHouse/ClickHouse/pull/45047) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix possible cannot-read-all-data error in storage FileLog. Closes [#45051](https://github.com/ClickHouse/ClickHouse/issues/45051), [#38257](https://github.com/ClickHouse/ClickHouse/issues/38257). [#45057](https://github.com/ClickHouse/ClickHouse/pull/45057) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Memory efficient aggregation (setting `distributed_aggregation_memory_efficient`) is disabled when grouping sets are present in the query. [#45058](https://github.com/ClickHouse/ClickHouse/pull/45058) ([Nikita Taranov](https://github.com/nickitat)).
* Fix `RANGE_HASHED` dictionary to count range columns as part of primary key during updates when `update_field` is specified. Closes [#44588](https://github.com/ClickHouse/ClickHouse/issues/44588). [#45061](https://github.com/ClickHouse/ClickHouse/pull/45061) ([Maksim Kita](https://github.com/kitaisreal)).
* Fix error `Cannot capture column` for `LowCardinality` captured argument of nested labmda. Fixes [#45028](https://github.com/ClickHouse/ClickHouse/issues/45028). [#45065](https://github.com/ClickHouse/ClickHouse/pull/45065) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix the wrong query result of `additional_table_filters` (additional filter was not applied) in case if minmax/count projection is used. [#45133](https://github.com/ClickHouse/ClickHouse/pull/45133) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix `RANGE_HASHED` dictionary to count range columns as part of the primary key during updates when `update_field` is specified. Closes [#44588](https://github.com/ClickHouse/ClickHouse/issues/44588). [#45061](https://github.com/ClickHouse/ClickHouse/pull/45061) ([Maksim Kita](https://github.com/kitaisreal)).
* Fix error `Cannot capture column` for `LowCardinality` captured argument of nested lambda. Fixes [#45028](https://github.com/ClickHouse/ClickHouse/issues/45028). [#45065](https://github.com/ClickHouse/ClickHouse/pull/45065) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix the wrong query result of `additional_table_filters` (additional filter was not applied) in case the minmax/count projection is used. [#45133](https://github.com/ClickHouse/ClickHouse/pull/45133) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fixed bug in `histogram` function accepting negative values. [#45147](https://github.com/ClickHouse/ClickHouse/pull/45147) ([simpleton](https://github.com/rgzntrade)).
* Fix wrong column nullability in StoreageJoin, close [#44940](https://github.com/ClickHouse/ClickHouse/issues/44940). [#45184](https://github.com/ClickHouse/ClickHouse/pull/45184) ([Vladimir C](https://github.com/vdimir)).
* Fix `background_fetches_pool_size` settings reload (increase at runtime). [#45189](https://github.com/ClickHouse/ClickHouse/pull/45189) ([Raúl Marín](https://github.com/Algunenano)).
* Correctly process `SELECT` queries on KV engines (e.g. KeeperMap, EmbeddedRocksDB) using `IN` on the key with subquery producing different type. [#45215](https://github.com/ClickHouse/ClickHouse/pull/45215) ([Antonio Andelic](https://github.com/antonio2368)).
* Fix logical error in SEMI JOIN & join_use_nulls in some cases, close [#45163](https://github.com/ClickHouse/ClickHouse/issues/45163), close [#45209](https://github.com/ClickHouse/ClickHouse/issues/45209). [#45230](https://github.com/ClickHouse/ClickHouse/pull/45230) ([Vladimir C](https://github.com/vdimir)).
* Fix heap-use-after-free in reading from s3. [#45253](https://github.com/ClickHouse/ClickHouse/pull/45253) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix bug when the Avro Union type is ['null', Nested type], closes [#45275](https://github.com/ClickHouse/ClickHouse/issues/45275). Fix bug that incorrectly infer `bytes` type to `Float`. [#45276](https://github.com/ClickHouse/ClickHouse/pull/45276) ([flynn](https://github.com/ucasfl)).
* Throw a correct exception when explicit PREWHERE cannot be used with table using storage engine `Merge`. [#45319](https://github.com/ClickHouse/ClickHouse/pull/45319) ([Antonio Andelic](https://github.com/antonio2368)).
* Under WSL1 Ubuntu self-extracting clickhouse fails to decompress due to inconsistency - /proc/self/maps reporting 32bit file's inode, while stat reporting 64bit inode. [#45339](https://github.com/ClickHouse/ClickHouse/pull/45339) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Fix bug when the Avro Union type is ['null', Nested type], closes [#45275](https://github.com/ClickHouse/ClickHouse/issues/45275). Fix bug that incorrectly infers `bytes` type to `Float`. [#45276](https://github.com/ClickHouse/ClickHouse/pull/45276) ([flynn](https://github.com/ucasfl)).
* Throw a correct exception when explicit PREWHERE cannot be used with a table using the storage engine `Merge`. [#45319](https://github.com/ClickHouse/ClickHouse/pull/45319) ([Antonio Andelic](https://github.com/antonio2368)).
* Under WSL1 Ubuntu self-extracting ClickHouse fails to decompress due to inconsistency - /proc/self/maps reporting 32bit file's inode, while stat reporting 64bit inode. [#45339](https://github.com/ClickHouse/ClickHouse/pull/45339) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Fix race in Distributed table startup (that could lead to processing file of async INSERT multiple times). [#45360](https://github.com/ClickHouse/ClickHouse/pull/45360) ([Azat Khuzhin](https://github.com/azat)).
* Fix possible crash while reading from storage `S3` and table function `s3` in case when `ListObject` request has failed. [#45371](https://github.com/ClickHouse/ClickHouse/pull/45371) ([Anton Popov](https://github.com/CurtizJ)).
* Fix `SELECT ... FROM system.dictionaries` exception when there is a dictionary with a bad structure (e.g. incorrect type in xml config). [#45399](https://github.com/ClickHouse/ClickHouse/pull/45399) ([Aleksei Filatov](https://github.com/aalexfvk)).
* Fix a possible crash while reading from storage `S3` and table function `s3` in the case when `ListObject` request has failed. [#45371](https://github.com/ClickHouse/ClickHouse/pull/45371) ([Anton Popov](https://github.com/CurtizJ)).
* Fix `SELECT ... FROM system.dictionaries` exception when there is a dictionary with a bad structure (e.g. incorrect type in XML config). [#45399](https://github.com/ClickHouse/ClickHouse/pull/45399) ([Aleksei Filatov](https://github.com/aalexfvk)).
* Fix s3Cluster schema inference when structure from insertion table is used in `INSERT INTO ... SELECT * FROM s3Cluster` queries. [#45422](https://github.com/ClickHouse/ClickHouse/pull/45422) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix bug in JSON/BSONEachRow parsing with HTTP that could lead to using default values for some columns instead of values from data. [#45424](https://github.com/ClickHouse/ClickHouse/pull/45424) ([Kruglov Pavel](https://github.com/Avogar)).
* Fixed bug (Code: 632. DB::Exception: Unexpected data ... after parsed IPv6 value ...) with typed parsing of IP types from text source. [#45425](https://github.com/ClickHouse/ClickHouse/pull/45425) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
@ -187,7 +188,7 @@ Add settings input_format_tsv/csv/custom_detect_header that enables this behavio
* Fix possible (likely distributed) query hung. [#45448](https://github.com/ClickHouse/ClickHouse/pull/45448) ([Azat Khuzhin](https://github.com/azat)).
* Fix possible deadlock with `allow_asynchronous_read_from_io_pool_for_merge_tree` enabled in case of exception from `ThreadPool::schedule`. [#45481](https://github.com/ClickHouse/ClickHouse/pull/45481) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix possible in-use table after DETACH. [#45493](https://github.com/ClickHouse/ClickHouse/pull/45493) ([Azat Khuzhin](https://github.com/azat)).
* Fix rare abort in case when query is canceled and parallel parsing was used during its execution. [#45498](https://github.com/ClickHouse/ClickHouse/pull/45498) ([Anton Popov](https://github.com/CurtizJ)).
* Fix rare abort in the case when a query is canceled and parallel parsing was used during its execution. [#45498](https://github.com/ClickHouse/ClickHouse/pull/45498) ([Anton Popov](https://github.com/CurtizJ)).
* Fix a race between Distributed table creation and INSERT into it (could lead to CANNOT_LINK during INSERT into the table). [#45502](https://github.com/ClickHouse/ClickHouse/pull/45502) ([Azat Khuzhin](https://github.com/azat)).
* Add proper default (SLRU) to cache policy getter. Closes [#45514](https://github.com/ClickHouse/ClickHouse/issues/45514). [#45524](https://github.com/ClickHouse/ClickHouse/pull/45524) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Disallow array join in mutations closes [#42637](https://github.com/ClickHouse/ClickHouse/issues/42637) [#44447](https://github.com/ClickHouse/ClickHouse/pull/44447) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).

View File

@ -9,7 +9,7 @@ ClickHouse® is an open-source column-oriented database management system that a
* [Tutorial](https://clickhouse.com/docs/en/getting_started/tutorial/) shows how to set up and query a small ClickHouse cluster.
* [Documentation](https://clickhouse.com/docs/en/) provides more in-depth information.
* [YouTube channel](https://www.youtube.com/c/ClickHouseDB) has a lot of content about ClickHouse in video format.
* [Slack](https://join.slack.com/t/clickhousedb/shared_invite/zt-rxm3rdrk-lIUmhLC3V8WTaL0TGxsOmg) and [Telegram](https://telegram.me/clickhouse_en) allow chatting with ClickHouse users in real-time.
* [Slack](https://clickhousedb.slack.com/) and [Telegram](https://telegram.me/clickhouse_en) allow chatting with ClickHouse users in real-time.
* [Blog](https://clickhouse.com/blog/) contains various ClickHouse-related articles, as well as announcements and reports about events.
* [Code Browser (Woboq)](https://clickhouse.com/codebrowser/ClickHouse/index.html) with syntax highlight and navigation.
* [Code Browser (github.dev)](https://github.dev/ClickHouse/ClickHouse) with syntax highlight, powered by github.dev.

View File

@ -140,6 +140,7 @@ add_contrib (simdjson-cmake simdjson)
add_contrib (rapidjson-cmake rapidjson)
add_contrib (fastops-cmake fastops)
add_contrib (libuv-cmake libuv)
add_contrib (liburing-cmake liburing)
add_contrib (amqpcpp-cmake AMQP-CPP) # requires: libuv
add_contrib (cassandra-cmake cassandra) # requires: libuv

2
contrib/NuRaft vendored

@ -1 +1 @@
Subproject commit 545b8c810a956b2efdc116e86be219af7e83d68a
Subproject commit b56784be1aec568fb72aff47f281097c017623cb

View File

@ -460,8 +460,15 @@ set(ICUI18N_SOURCES
file(GENERATE OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/empty.cpp" CONTENT " ")
enable_language(ASM)
if (ARCH_S390X)
set(ICUDATA_SOURCE_FILE "${ICUDATA_SOURCE_DIR}/icudt70b_dat.S" )
else()
set(ICUDATA_SOURCE_FILE "${ICUDATA_SOURCE_DIR}/icudt70l_dat.S" )
endif()
set(ICUDATA_SOURCES
"${ICUDATA_SOURCE_DIR}/icudt70l_dat.S"
"${ICUDATA_SOURCE_FILE}"
"${CMAKE_CURRENT_BINARY_DIR}/empty.cpp" # Without this cmake can incorrectly detects library type (OBJECT) instead of SHARED/STATIC
)

2
contrib/icudata vendored

@ -1 +1 @@
Subproject commit 72d9a4a7febc904e2b0a534ccb25ae40fac5f1e5
Subproject commit c8e717892a557b4d2852317c7d628aacc0a0e5ab

2
contrib/krb5 vendored

@ -1 +1 @@
Subproject commit b89e20367b074bd02dd118a6534099b21e88b3c3
Subproject commit f8262a1b548eb29d97e059260042036255d07f8d

View File

@ -15,6 +15,10 @@ if(NOT AWK_PROGRAM)
message(FATAL_ERROR "You need the awk program to build ClickHouse with krb5 enabled.")
endif()
if (NOT (ENABLE_OPENSSL OR ENABLE_OPENSSL_DYNAMIC))
add_compile_definitions(USE_BORINGSSL=1)
endif ()
set(KRB5_SOURCE_DIR "${ClickHouse_SOURCE_DIR}/contrib/krb5/src")
set(KRB5_ET_BIN_DIR "${CMAKE_CURRENT_BINARY_DIR}/include_private")
@ -578,12 +582,6 @@ if(CMAKE_SYSTEM_NAME MATCHES "Darwin")
list(APPEND ALL_SRCS "${CMAKE_CURRENT_BINARY_DIR}/include_private/kcmrpc.c")
endif()
if (ENABLE_OPENSSL OR ENABLE_OPENSSL_DYNAMIC)
list(REMOVE_ITEM ALL_SRCS "${KRB5_SOURCE_DIR}/lib/crypto/openssl/enc_provider/aes.c")
list(APPEND ALL_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/aes.c")
endif ()
target_sources(_krb5 PRIVATE
${ALL_SRCS}
)

View File

@ -1,302 +0,0 @@
/* -*- mode: c; c-basic-offset: 4; indent-tabs-mode: nil -*- */
/* lib/crypto/openssl/enc_provider/aes.c */
/*
* Copyright (C) 2003, 2007, 2008, 2009 by the Massachusetts Institute of Technology.
* All rights reserved.
*
* Export of this software from the United States of America may
* require a specific license from the United States Government.
* It is the responsibility of any person or organization contemplating
* export to obtain such a license before exporting.
*
* WITHIN THAT CONSTRAINT, permission to use, copy, modify, and
* distribute this software and its documentation for any purpose and
* without fee is hereby granted, provided that the above copyright
* notice appear in all copies and that both that copyright notice and
* this permission notice appear in supporting documentation, and that
* the name of M.I.T. not be used in advertising or publicity pertaining
* to distribution of the software without specific, written prior
* permission. Furthermore if you modify this software you must label
* your software as modified software and not distribute it in such a
* fashion that it might be confused with the original M.I.T. software.
* M.I.T. makes no representations about the suitability of
* this software for any purpose. It is provided "as is" without express
* or implied warranty.
*/
#include "crypto_int.h"
#include <openssl/evp.h>
#include <openssl/aes.h>
/* proto's */
static krb5_error_code
cbc_enc(krb5_key key, const krb5_data *ivec, krb5_crypto_iov *data,
size_t num_data);
static krb5_error_code
cbc_decr(krb5_key key, const krb5_data *ivec, krb5_crypto_iov *data,
size_t num_data);
static krb5_error_code
cts_encr(krb5_key key, const krb5_data *ivec, krb5_crypto_iov *data,
size_t num_data, size_t dlen);
static krb5_error_code
cts_decr(krb5_key key, const krb5_data *ivec, krb5_crypto_iov *data,
size_t num_data, size_t dlen);
#define BLOCK_SIZE 16
#define NUM_BITS 8
#define IV_CTS_BUF_SIZE 16 /* 16 - hardcoded in CRYPTO_cts128_en/decrypt */
static const EVP_CIPHER *
map_mode(unsigned int len)
{
if (len==16)
return EVP_aes_128_cbc();
if (len==32)
return EVP_aes_256_cbc();
else
return NULL;
}
/* Encrypt one block using CBC. */
static krb5_error_code
cbc_enc(krb5_key key, const krb5_data *ivec, krb5_crypto_iov *data,
size_t num_data)
{
int ret, olen = BLOCK_SIZE;
unsigned char iblock[BLOCK_SIZE], oblock[BLOCK_SIZE];
EVP_CIPHER_CTX *ctx;
struct iov_cursor cursor;
ctx = EVP_CIPHER_CTX_new();
if (ctx == NULL)
return ENOMEM;
ret = EVP_EncryptInit_ex(ctx, map_mode(key->keyblock.length),
NULL, key->keyblock.contents, (ivec) ? (unsigned char*)ivec->data : NULL);
if (ret == 0) {
EVP_CIPHER_CTX_free(ctx);
return KRB5_CRYPTO_INTERNAL;
}
k5_iov_cursor_init(&cursor, data, num_data, BLOCK_SIZE, FALSE);
k5_iov_cursor_get(&cursor, iblock);
EVP_CIPHER_CTX_set_padding(ctx,0);
ret = EVP_EncryptUpdate(ctx, oblock, &olen, iblock, BLOCK_SIZE);
if (ret == 1)
k5_iov_cursor_put(&cursor, oblock);
EVP_CIPHER_CTX_free(ctx);
zap(iblock, BLOCK_SIZE);
zap(oblock, BLOCK_SIZE);
return (ret == 1) ? 0 : KRB5_CRYPTO_INTERNAL;
}
/* Decrypt one block using CBC. */
static krb5_error_code
cbc_decr(krb5_key key, const krb5_data *ivec, krb5_crypto_iov *data,
size_t num_data)
{
int ret = 0, olen = BLOCK_SIZE;
unsigned char iblock[BLOCK_SIZE], oblock[BLOCK_SIZE];
EVP_CIPHER_CTX *ctx;
struct iov_cursor cursor;
ctx = EVP_CIPHER_CTX_new();
if (ctx == NULL)
return ENOMEM;
ret = EVP_DecryptInit_ex(ctx, map_mode(key->keyblock.length),
NULL, key->keyblock.contents, (ivec) ? (unsigned char*)ivec->data : NULL);
if (ret == 0) {
EVP_CIPHER_CTX_free(ctx);
return KRB5_CRYPTO_INTERNAL;
}
k5_iov_cursor_init(&cursor, data, num_data, BLOCK_SIZE, FALSE);
k5_iov_cursor_get(&cursor, iblock);
EVP_CIPHER_CTX_set_padding(ctx,0);
ret = EVP_DecryptUpdate(ctx, oblock, &olen, iblock, BLOCK_SIZE);
if (ret == 1)
k5_iov_cursor_put(&cursor, oblock);
EVP_CIPHER_CTX_free(ctx);
zap(iblock, BLOCK_SIZE);
zap(oblock, BLOCK_SIZE);
return (ret == 1) ? 0 : KRB5_CRYPTO_INTERNAL;
}
static krb5_error_code
cts_encr(krb5_key key, const krb5_data *ivec, krb5_crypto_iov *data,
size_t num_data, size_t dlen)
{
int ret = 0;
size_t size = 0;
unsigned char *oblock = NULL, *dbuf = NULL;
unsigned char iv_cts[IV_CTS_BUF_SIZE];
struct iov_cursor cursor;
AES_KEY enck;
memset(iv_cts,0,sizeof(iv_cts));
if (ivec && ivec->data){
if (ivec->length != sizeof(iv_cts))
return KRB5_CRYPTO_INTERNAL;
memcpy(iv_cts, ivec->data,ivec->length);
}
oblock = OPENSSL_malloc(dlen);
if (!oblock){
return ENOMEM;
}
dbuf = OPENSSL_malloc(dlen);
if (!dbuf){
OPENSSL_free(oblock);
return ENOMEM;
}
k5_iov_cursor_init(&cursor, data, num_data, dlen, FALSE);
k5_iov_cursor_get(&cursor, dbuf);
AES_set_encrypt_key(key->keyblock.contents,
NUM_BITS * key->keyblock.length, &enck);
size = CRYPTO_cts128_encrypt((unsigned char *)dbuf, oblock, dlen, &enck,
iv_cts, AES_cbc_encrypt);
if (size <= 0)
ret = KRB5_CRYPTO_INTERNAL;
else
k5_iov_cursor_put(&cursor, oblock);
if (!ret && ivec && ivec->data)
memcpy(ivec->data, iv_cts, sizeof(iv_cts));
zap(oblock, dlen);
zap(dbuf, dlen);
OPENSSL_free(oblock);
OPENSSL_free(dbuf);
return ret;
}
static krb5_error_code
cts_decr(krb5_key key, const krb5_data *ivec, krb5_crypto_iov *data,
size_t num_data, size_t dlen)
{
int ret = 0;
size_t size = 0;
unsigned char *oblock = NULL;
unsigned char *dbuf = NULL;
unsigned char iv_cts[IV_CTS_BUF_SIZE];
struct iov_cursor cursor;
AES_KEY deck;
memset(iv_cts,0,sizeof(iv_cts));
if (ivec && ivec->data){
if (ivec->length != sizeof(iv_cts))
return KRB5_CRYPTO_INTERNAL;
memcpy(iv_cts, ivec->data,ivec->length);
}
oblock = OPENSSL_malloc(dlen);
if (!oblock)
return ENOMEM;
dbuf = OPENSSL_malloc(dlen);
if (!dbuf){
OPENSSL_free(oblock);
return ENOMEM;
}
AES_set_decrypt_key(key->keyblock.contents,
NUM_BITS * key->keyblock.length, &deck);
k5_iov_cursor_init(&cursor, data, num_data, dlen, FALSE);
k5_iov_cursor_get(&cursor, dbuf);
size = CRYPTO_cts128_decrypt((unsigned char *)dbuf, oblock,
dlen, &deck,
iv_cts, AES_cbc_encrypt);
if (size <= 0)
ret = KRB5_CRYPTO_INTERNAL;
else
k5_iov_cursor_put(&cursor, oblock);
if (!ret && ivec && ivec->data)
memcpy(ivec->data, iv_cts, sizeof(iv_cts));
zap(oblock, dlen);
zap(dbuf, dlen);
OPENSSL_free(oblock);
OPENSSL_free(dbuf);
return ret;
}
krb5_error_code
krb5int_aes_encrypt(krb5_key key, const krb5_data *ivec,
krb5_crypto_iov *data, size_t num_data)
{
int ret = 0;
size_t input_length, nblocks;
input_length = iov_total_length(data, num_data, FALSE);
nblocks = (input_length + BLOCK_SIZE - 1) / BLOCK_SIZE;
if (nblocks == 1) {
if (input_length != BLOCK_SIZE)
return KRB5_BAD_MSIZE;
ret = cbc_enc(key, ivec, data, num_data);
} else if (nblocks > 1) {
ret = cts_encr(key, ivec, data, num_data, input_length);
}
return ret;
}
krb5_error_code
krb5int_aes_decrypt(krb5_key key, const krb5_data *ivec,
krb5_crypto_iov *data, size_t num_data)
{
int ret = 0;
size_t input_length, nblocks;
input_length = iov_total_length(data, num_data, FALSE);
nblocks = (input_length + BLOCK_SIZE - 1) / BLOCK_SIZE;
if (nblocks == 1) {
if (input_length != BLOCK_SIZE)
return KRB5_BAD_MSIZE;
ret = cbc_decr(key, ivec, data, num_data);
} else if (nblocks > 1) {
ret = cts_decr(key, ivec, data, num_data, input_length);
}
return ret;
}
static krb5_error_code
krb5int_aes_init_state (const krb5_keyblock *key, krb5_keyusage usage,
krb5_data *state)
{
state->length = 16;
state->data = (void *) malloc(16);
if (state->data == NULL)
return ENOMEM;
memset(state->data, 0, state->length);
return 0;
}
const struct krb5_enc_provider krb5int_enc_aes128 = {
16,
16, 16,
krb5int_aes_encrypt,
krb5int_aes_decrypt,
NULL,
krb5int_aes_init_state,
krb5int_default_free_state
};
const struct krb5_enc_provider krb5int_enc_aes256 = {
16,
32, 32,
krb5int_aes_encrypt,
krb5int_aes_decrypt,
NULL,
krb5int_aes_init_state,
krb5int_default_free_state
};

1
contrib/liburing vendored Submodule

@ -0,0 +1 @@
Subproject commit f5a48392c4ea33f222cbebeb2e2fc31620162949

View File

@ -0,0 +1,53 @@
set (ENABLE_LIBURING_DEFAULT ${ENABLE_LIBRARIES})
if (NOT OS_LINUX)
set (ENABLE_LIBURING_DEFAULT OFF)
endif ()
option (ENABLE_LIBURING "Enable liburing" ${ENABLE_LIBURING_DEFAULT})
if (NOT ENABLE_LIBURING)
message (STATUS "Not using liburing")
return ()
endif ()
set (LIBURING_INCLUDE_DIR "${ClickHouse_SOURCE_DIR}/contrib/liburing/src/include")
set (LIBURING_SOURCE_DIR "${ClickHouse_SOURCE_DIR}/contrib/liburing/src")
set (SRCS
"${LIBURING_SOURCE_DIR}/queue.c"
"${LIBURING_SOURCE_DIR}/register.c"
"${LIBURING_SOURCE_DIR}/setup.c"
"${LIBURING_SOURCE_DIR}/syscall.c"
"${LIBURING_SOURCE_DIR}/version.c"
)
add_compile_definitions (_GNU_SOURCE)
add_compile_definitions (LIBURING_INTERNAL)
set (LIBURING_COMPAT_INCLUDE_DIR "${ClickHouse_BINARY_DIR}/contrib/liburing/src/include-compat")
set (LIBURING_COMPAT_HEADER "${LIBURING_COMPAT_INCLUDE_DIR}/liburing/compat.h")
set (LIBURING_CONFIG_HAS_KERNEL_RWF_T FALSE)
set (LIBURING_CONFIG_HAS_KERNEL_TIMESPEC FALSE)
set (LIBURING_CONFIG_HAS_OPEN_HOW FALSE)
set (LIBURING_CONFIG_HAS_STATX FALSE)
set (LIBURING_CONFIG_HAS_GLIBC_STATX FALSE)
configure_file (compat.h.in ${LIBURING_COMPAT_HEADER})
set (LIBURING_GENERATED_INCLUDE_DIR "${ClickHouse_BINARY_DIR}/contrib/liburing/src/include")
set (LIBURING_VERSION_HEADER "${LIBURING_GENERATED_INCLUDE_DIR}/liburing/io_uring_version.h")
file (READ "${LIBURING_SOURCE_DIR}/../liburing.spec" LIBURING_SPEC)
string (REGEX MATCH "Version: ([0-9]+)\.([0-9]+)" _ ${LIBURING_SPEC})
set (LIBURING_VERSION_MAJOR ${CMAKE_MATCH_1})
set (LIBURING_VERSION_MINOR ${CMAKE_MATCH_2})
configure_file (io_uring_version.h.in ${LIBURING_VERSION_HEADER})
add_library (_liburing ${SRCS})
add_library (ch_contrib::liburing ALIAS _liburing)
target_include_directories (_liburing SYSTEM PUBLIC ${LIBURING_COMPAT_INCLUDE_DIR} ${LIBURING_GENERATED_INCLUDE_DIR} "${LIBURING_SOURCE_DIR}/include")

View File

@ -0,0 +1,50 @@
/* SPDX-License-Identifier: MIT */
#ifndef LIBURING_COMPAT_H
#define LIBURING_COMPAT_H
# cmakedefine LIBURING_CONFIG_HAS_KERNEL_RWF_T
# cmakedefine LIBURING_CONFIG_HAS_KERNEL_TIMESPEC
# cmakedefine LIBURING_CONFIG_HAS_OPEN_HOW
# cmakedefine LIBURING_CONFIG_HAS_GLIBC_STATX
# cmakedefine LIBURING_CONFIG_HAS_STATX
#if !defined(LIBURING_CONFIG_HAS_KERNEL_RWF_T)
typedef int __kernel_rwf_t;
#endif
#if !defined(LIBURING_CONFIG_HAS_KERNEL_TIMESPEC)
#include <stdint.h>
struct __kernel_timespec {
int64_t tv_sec;
long long tv_nsec;
};
/* <linux/time_types.h> is not available, so it can't be included */
#define UAPI_LINUX_IO_URING_H_SKIP_LINUX_TIME_TYPES_H 1
#else
#include <linux/time_types.h>
/* <linux/time_types.h> is included above and not needed again */
#define UAPI_LINUX_IO_URING_H_SKIP_LINUX_TIME_TYPES_H 1
#endif
#if !defined(LIBURING_CONFIG_HAS_OPEN_HOW)
#include <inttypes.h>
struct open_how {
uint64_t flags;
uint64_t mode;
uint64_t resolve;
};
#else
#include <linux/openat2.h>
#endif
#if !defined(LIBURING_CONFIG_HAS_GLIBC_STATX) && defined(LIBURING_CONFIG_HAS_STATX)
#include <sys/stat.h>
#endif
#endif

View File

@ -0,0 +1,8 @@
/* SPDX-License-Identifier: MIT */
#ifndef LIBURING_VERSION_H
#define LIBURING_VERSION_H
#define IO_URING_VERSION_MAJOR ${LIBURING_VERSION_MAJOR}
#define IO_URING_VERSION_MINOR ${LIBURING_VERSION_MINOR}
#endif

View File

@ -1,3 +1,9 @@
# Note: ClickHouse uses BoringSSL. The presence of OpenSSL is only due to IBM's port of ClickHouse to s390x. BoringSSL does not support
# s390x, also FIPS validation provided by the OS vendor (Red Hat, Ubuntu) requires (preferrably dynamic) linking with OS packages which
# ClickHouse generally avoids.
#
# Furthermore, the in-source OpenSSL dump in this directory is due to development purposes and non FIPS-compliant.
if(ENABLE_OPENSSL_DYNAMIC OR ENABLE_OPENSSL)
set(ENABLE_SSL 1 CACHE INTERNAL "")
set(OPENSSL_SOURCE_DIR ${ClickHouse_SOURCE_DIR}/contrib/openssl)

View File

@ -1,6 +1,10 @@
set (SOURCE_DIR "${CMAKE_SOURCE_DIR}/contrib/snappy")
set (SNAPPY_IS_BIG_ENDIAN 0)
if (ARCH_S390X)
set (SNAPPY_IS_BIG_ENDIAN 1)
else ()
set (SNAPPY_IS_BIG_ENDIAN 0)
endif()
set (HAVE_BYTESWAP_H 1)
set (HAVE_SYS_MMAN_H 1)

View File

@ -33,7 +33,7 @@ RUN arch=${TARGETARCH:-amd64} \
# lts / testing / prestable / etc
ARG REPO_CHANNEL="stable"
ARG REPOSITORY="https://packages.clickhouse.com/tgz/${REPO_CHANNEL}"
ARG VERSION="23.1.1.3077"
ARG VERSION="23.1.2.9"
ARG PACKAGES="clickhouse-client clickhouse-server clickhouse-common-static"
# user/group precreated explicitly with fixed uid/gid on purpose.

View File

@ -21,7 +21,7 @@ RUN sed -i "s|http://archive.ubuntu.com|${apt_archive}|g" /etc/apt/sources.list
ARG REPO_CHANNEL="stable"
ARG REPOSITORY="deb https://packages.clickhouse.com/deb ${REPO_CHANNEL} main"
ARG VERSION="23.1.1.3077"
ARG VERSION="23.1.2.9"
ARG PACKAGES="clickhouse-client clickhouse-server clickhouse-common-static"
# set non-empty deb_location_url url to create a docker image

View File

@ -139,6 +139,7 @@ function clone_submodules
contrib/morton-nd
contrib/xxHash
contrib/simdjson
contrib/liburing
)
git submodule sync
@ -161,6 +162,7 @@ function run_cmake
"-DENABLE_NURAFT=1"
"-DENABLE_SIMDJSON=1"
"-DENABLE_JEMALLOC=1"
"-DENABLE_LIBURING=1"
)
export CCACHE_DIR="$FASTTEST_WORKSPACE/ccache"

View File

@ -11,6 +11,18 @@ set -x
# core.COMM.PID-TID
sysctl kernel.core_pattern='core.%e.%p-%P'
OK="\tOK\t\\N\t"
FAIL="\tFAIL\t\\N\t"
function escaped()
{
# That's the simplest way I found to escape a string in bash. Yep, bash is the most convenient programming language.
clickhouse local -S 's String' --input-format=LineAsString -q "select * from table format CustomSeparated settings format_custom_row_after_delimiter='\\\\\\\\n'"
}
function head_escaped()
{
head -50 $1 | escaped
}
function install_packages()
{
@ -33,7 +45,9 @@ function configure()
ln -s /usr/share/clickhouse-test/ci/get_previous_release_tag.py /usr/bin/get_previous_release_tag
# avoid too slow startup
sudo cat /etc/clickhouse-server/config.d/keeper_port.xml | sed "s|<snapshot_distance>100000</snapshot_distance>|<snapshot_distance>10000</snapshot_distance>|" > /etc/clickhouse-server/config.d/keeper_port.xml.tmp
sudo cat /etc/clickhouse-server/config.d/keeper_port.xml \
| sed "s|<snapshot_distance>100000</snapshot_distance>|<snapshot_distance>10000</snapshot_distance>|" \
> /etc/clickhouse-server/config.d/keeper_port.xml.tmp
sudo mv /etc/clickhouse-server/config.d/keeper_port.xml.tmp /etc/clickhouse-server/config.d/keeper_port.xml
sudo chown clickhouse /etc/clickhouse-server/config.d/keeper_port.xml
sudo chgrp clickhouse /etc/clickhouse-server/config.d/keeper_port.xml
@ -136,6 +150,7 @@ function stop()
clickhouse stop --max-tries "$max_tries" --do-not-kill && return
# We failed to stop the server with SIGTERM. Maybe it hang, let's collect stacktraces.
echo -e "Possible deadlock on shutdown (see gdb.log)$FAIL" >> /test_output/test_results.tsv
kill -TERM "$(pidof gdb)" ||:
sleep 5
echo "thread apply all backtrace (on stop)" >> /test_output/gdb.log
@ -151,10 +166,11 @@ function start()
if [ "$counter" -gt ${1:-120} ]
then
echo "Cannot start clickhouse-server"
echo -e "Cannot start clickhouse-server\tFAIL" >> /test_output/test_results.tsv
rg --text "<Error>.*Application" /var/log/clickhouse-server/clickhouse-server.log > /test_output/application_errors.txt ||:
echo -e "Cannot start clickhouse-server$FAIL$(head_escaped /test_output/application_errors.txt)" >> /test_output/test_results.tsv
cat /var/log/clickhouse-server/stdout.log
tail -n1000 /var/log/clickhouse-server/stderr.log
tail -n100000 /var/log/clickhouse-server/clickhouse-server.log | rg -F -v -e '<Warning> RaftInstance:' -e '<Information> RaftInstance' | tail -n1000
tail -n100 /var/log/clickhouse-server/stderr.log
tail -n100000 /var/log/clickhouse-server/clickhouse-server.log | rg -F -v -e '<Warning> RaftInstance:' -e '<Information> RaftInstance' | tail -n100
break
fi
# use root to match with current uid
@ -252,9 +268,92 @@ start
clickhouse-client --query "SHOW TABLES FROM datasets"
clickhouse-client --query "SHOW TABLES FROM test"
clickhouse-client --query "CREATE TABLE test.hits_s3 (WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, URLDomain String, RefererDomain String, Refresh UInt8, IsRobot UInt8, RefererCategories Array(UInt16), URLCategories Array(UInt16), URLRegions Array(UInt32), RefererRegions Array(UInt32), ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), UTCEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, GeneralInterests Array(UInt16), RemoteIP UInt32, RemoteIP6 FixedString(16), WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming Int32, DNSTiming Int32, ConnectTiming Int32, ResponseStartTiming Int32, ResponseEndTiming Int32, FetchTiming Int32, RedirectTiming Int32, DOMInteractiveTiming Int32, DOMContentLoadedTiming Int32, DOMCompleteTiming Int32, LoadEventStartTiming Int32, LoadEventEndTiming Int32, NSToDOMContentLoadedTiming Int32, FirstPaintTiming Int32, RedirectCount Int8, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String, ParamCurrency FixedString(3), ParamCurrencyID UInt16, GoalsReached Array(UInt32), OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64, URLHash UInt64, CLID UInt32, YCLID UInt64, ShareService String, ShareURL String, ShareTitle String, ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), IslandID FixedString(16), RequestNum UInt32, RequestTry UInt8) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192, storage_policy='s3_cache'"
clickhouse-client --query "CREATE TABLE test.hits (WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, URLDomain String, RefererDomain String, Refresh UInt8, IsRobot UInt8, RefererCategories Array(UInt16), URLCategories Array(UInt16), URLRegions Array(UInt32), RefererRegions Array(UInt32), ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), UTCEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, GeneralInterests Array(UInt16), RemoteIP UInt32, RemoteIP6 FixedString(16), WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming Int32, DNSTiming Int32, ConnectTiming Int32, ResponseStartTiming Int32, ResponseEndTiming Int32, FetchTiming Int32, RedirectTiming Int32, DOMInteractiveTiming Int32, DOMContentLoadedTiming Int32, DOMCompleteTiming Int32, LoadEventStartTiming Int32, LoadEventEndTiming Int32, NSToDOMContentLoadedTiming Int32, FirstPaintTiming Int32, RedirectCount Int8, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String, ParamCurrency FixedString(3), ParamCurrencyID UInt16, GoalsReached Array(UInt32), OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64, URLHash UInt64, CLID UInt32, YCLID UInt64, ShareService String, ShareURL String, ShareTitle String, ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), IslandID FixedString(16), RequestNum UInt32, RequestTry UInt8) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192, storage_policy='s3_cache'"
clickhouse-client --query "CREATE TABLE test.visits (CounterID UInt32, StartDate Date, Sign Int8, IsNew UInt8, VisitID UInt64, UserID UInt64, StartTime DateTime, Duration UInt32, UTCStartTime DateTime, PageViews Int32, Hits Int32, IsBounce UInt8, Referer String, StartURL String, RefererDomain String, StartURLDomain String, EndURL String, LinkURL String, IsDownload UInt8, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, PlaceID Int32, RefererCategories Array(UInt16), URLCategories Array(UInt16), URLRegions Array(UInt32), RefererRegions Array(UInt32), IsYandex UInt8, GoalReachesDepth Int32, GoalReachesURL Int32, GoalReachesAny Int32, SocialSourceNetworkID UInt8, SocialSourcePage String, MobilePhoneModel String, ClientEventTime DateTime, RegionID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RemoteIP UInt32, RemoteIP6 FixedString(16), IPNetworkID UInt32, SilverlightVersion3 UInt32, CodeVersion UInt32, ResolutionWidth UInt16, ResolutionHeight UInt16, UserAgentMajor UInt16, UserAgentMinor UInt16, WindowClientWidth UInt16, WindowClientHeight UInt16, SilverlightVersion2 UInt8, SilverlightVersion4 UInt16, FlashVersion3 UInt16, FlashVersion4 UInt16, ClientTimeZone Int16, OS UInt8, UserAgent UInt8, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, NetMajor UInt8, NetMinor UInt8, MobilePhone UInt8, SilverlightVersion1 UInt8, Age UInt8, Sex UInt8, Income UInt8, JavaEnable UInt8, CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, BrowserLanguage UInt16, BrowserCountry UInt16, Interests UInt16, Robotness UInt8, GeneralInterests Array(UInt16), Params Array(String), Goals Nested(ID UInt32, Serial UInt32, EventTime DateTime, Price Int64, OrderID String, CurrencyID UInt32), WatchIDs Array(UInt64), ParamSumPrice Int64, ParamCurrency FixedString(3), ParamCurrencyID UInt16, ClickLogID UInt64, ClickEventID Int32, ClickGoodEvent Int32, ClickEventTime DateTime, ClickPriorityID Int32, ClickPhraseID Int32, ClickPageID Int32, ClickPlaceID Int32, ClickTypeID Int32, ClickResourceID Int32, ClickCost UInt32, ClickClientIP UInt32, ClickDomainID UInt32, ClickURL String, ClickAttempt UInt8, ClickOrderID UInt32, ClickBannerID UInt32, ClickMarketCategoryID UInt32, ClickMarketPP UInt32, ClickMarketCategoryName String, ClickMarketPPName String, ClickAWAPSCampaignName String, ClickPageName String, ClickTargetType UInt16, ClickTargetPhraseID UInt64, ClickContextType UInt8, ClickSelectType Int8, ClickOptions String, ClickGroupBannerID Int32, OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, FirstVisit DateTime, PredLastVisit Date, LastVisit Date, TotalVisits UInt32, TraficSource Nested(ID Int8, SearchEngineID UInt16, AdvEngineID UInt8, PlaceID UInt16, SocialSourceNetworkID UInt8, Domain String, SearchPhrase String, SocialSourcePage String), Attendance FixedString(16), CLID UInt32, YCLID UInt64, NormalizedRefererHash UInt64, SearchPhraseHash UInt64, RefererDomainHash UInt64, NormalizedStartURLHash UInt64, StartURLDomainHash UInt64, NormalizedEndURLHash UInt64, TopLevelDomain UInt64, URLScheme UInt64, OpenstatServiceNameHash UInt64, OpenstatCampaignIDHash UInt64, OpenstatAdIDHash UInt64, OpenstatSourceIDHash UInt64, UTMSourceHash UInt64, UTMMediumHash UInt64, UTMCampaignHash UInt64, UTMContentHash UInt64, UTMTermHash UInt64, FromHash UInt64, WebVisorEnabled UInt8, WebVisorActivity UInt32, ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), Market Nested(Type UInt8, GoalID UInt32, OrderID String, OrderPrice Int64, PP UInt32, DirectPlaceID UInt32, DirectOrderID UInt32, DirectBannerID UInt32, GoodID String, GoodName String, GoodQuantity Int32, GoodPrice Int64), IslandID FixedString(16)) ENGINE = CollapsingMergeTree(Sign) PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192, storage_policy='s3_cache'"
clickhouse-client --query "CREATE TABLE test.hits_s3 (WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16,
EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RegionID UInt32,
UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, URLDomain String, RefererDomain String,
Refresh UInt8, IsRobot UInt8, RefererCategories Array(UInt16), URLCategories Array(UInt16), URLRegions Array(UInt32),
RefererRegions Array(UInt32), ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8,
FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2),
CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String,
IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8,
WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8,
SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32,
IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, HID UInt32, IsOldCounter UInt8, IsEvent UInt8,
IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), UTCEventTime DateTime, Age UInt8,
Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, GeneralInterests Array(UInt16), RemoteIP UInt32,
RemoteIP6 FixedString(16), WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2),
BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming Int32,
DNSTiming Int32, ConnectTiming Int32, ResponseStartTiming Int32, ResponseEndTiming Int32, FetchTiming Int32,
RedirectTiming Int32, DOMInteractiveTiming Int32, DOMContentLoadedTiming Int32, DOMCompleteTiming Int32,
LoadEventStartTiming Int32, LoadEventEndTiming Int32, NSToDOMContentLoadedTiming Int32, FirstPaintTiming Int32,
RedirectCount Int8, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String,
ParamCurrency FixedString(3), ParamCurrencyID UInt16, GoalsReached Array(UInt32), OpenstatServiceName String,
OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String,
UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64,
URLHash UInt64, CLID UInt32, YCLID UInt64, ShareService String, ShareURL String, ShareTitle String,
ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64),
IslandID FixedString(16), RequestNum UInt32, RequestTry UInt8) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192, storage_policy='s3_cache'"
clickhouse-client --query "CREATE TABLE test.hits (WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16,
EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, ClientIP6 FixedString(16), RegionID UInt32,
UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, URLDomain String,
RefererDomain String, Refresh UInt8, IsRobot UInt8, RefererCategories Array(UInt16), URLCategories Array(UInt16),
URLRegions Array(UInt32), RefererRegions Array(UInt32), ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8,
FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16,
UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8,
MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16,
SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16,
ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32,
SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8,
FUniqID UInt64, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8,
HitColor FixedString(1), UTCEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8,
GeneralInterests Array(UInt16), RemoteIP UInt32, RemoteIP6 FixedString(16), WindowName Int32, OpenerName Int32,
HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String,
HTTPError UInt16, SendTiming Int32, DNSTiming Int32, ConnectTiming Int32, ResponseStartTiming Int32, ResponseEndTiming Int32,
FetchTiming Int32, RedirectTiming Int32, DOMInteractiveTiming Int32, DOMContentLoadedTiming Int32, DOMCompleteTiming Int32,
LoadEventStartTiming Int32, LoadEventEndTiming Int32, NSToDOMContentLoadedTiming Int32, FirstPaintTiming Int32,
RedirectCount Int8, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String,
ParamCurrency FixedString(3), ParamCurrencyID UInt16, GoalsReached Array(UInt32), OpenstatServiceName String,
OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String,
UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64,
URLHash UInt64, CLID UInt32, YCLID UInt64, ShareService String, ShareURL String, ShareTitle String,
ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64),
IslandID FixedString(16), RequestNum UInt32, RequestTry UInt8) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192, storage_policy='s3_cache'"
clickhouse-client --query "CREATE TABLE test.visits (CounterID UInt32, StartDate Date, Sign Int8, IsNew UInt8,
VisitID UInt64, UserID UInt64, StartTime DateTime, Duration UInt32, UTCStartTime DateTime, PageViews Int32,
Hits Int32, IsBounce UInt8, Referer String, StartURL String, RefererDomain String, StartURLDomain String,
EndURL String, LinkURL String, IsDownload UInt8, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String,
AdvEngineID UInt8, PlaceID Int32, RefererCategories Array(UInt16), URLCategories Array(UInt16), URLRegions Array(UInt32),
RefererRegions Array(UInt32), IsYandex UInt8, GoalReachesDepth Int32, GoalReachesURL Int32, GoalReachesAny Int32,
SocialSourceNetworkID UInt8, SocialSourcePage String, MobilePhoneModel String, ClientEventTime DateTime, RegionID UInt32,
ClientIP UInt32, ClientIP6 FixedString(16), RemoteIP UInt32, RemoteIP6 FixedString(16), IPNetworkID UInt32,
SilverlightVersion3 UInt32, CodeVersion UInt32, ResolutionWidth UInt16, ResolutionHeight UInt16, UserAgentMajor UInt16,
UserAgentMinor UInt16, WindowClientWidth UInt16, WindowClientHeight UInt16, SilverlightVersion2 UInt8, SilverlightVersion4 UInt16,
FlashVersion3 UInt16, FlashVersion4 UInt16, ClientTimeZone Int16, OS UInt8, UserAgent UInt8, ResolutionDepth UInt8,
FlashMajor UInt8, FlashMinor UInt8, NetMajor UInt8, NetMinor UInt8, MobilePhone UInt8, SilverlightVersion1 UInt8,
Age UInt8, Sex UInt8, Income UInt8, JavaEnable UInt8, CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8,
BrowserLanguage UInt16, BrowserCountry UInt16, Interests UInt16, Robotness UInt8, GeneralInterests Array(UInt16),
Params Array(String), Goals Nested(ID UInt32, Serial UInt32, EventTime DateTime, Price Int64, OrderID String, CurrencyID UInt32),
WatchIDs Array(UInt64), ParamSumPrice Int64, ParamCurrency FixedString(3), ParamCurrencyID UInt16, ClickLogID UInt64,
ClickEventID Int32, ClickGoodEvent Int32, ClickEventTime DateTime, ClickPriorityID Int32, ClickPhraseID Int32, ClickPageID Int32,
ClickPlaceID Int32, ClickTypeID Int32, ClickResourceID Int32, ClickCost UInt32, ClickClientIP UInt32, ClickDomainID UInt32,
ClickURL String, ClickAttempt UInt8, ClickOrderID UInt32, ClickBannerID UInt32, ClickMarketCategoryID UInt32, ClickMarketPP UInt32,
ClickMarketCategoryName String, ClickMarketPPName String, ClickAWAPSCampaignName String, ClickPageName String, ClickTargetType UInt16,
ClickTargetPhraseID UInt64, ClickContextType UInt8, ClickSelectType Int8, ClickOptions String, ClickGroupBannerID Int32,
OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String,
UTMMedium String, UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, FirstVisit DateTime,
PredLastVisit Date, LastVisit Date, TotalVisits UInt32, TraficSource Nested(ID Int8, SearchEngineID UInt16, AdvEngineID UInt8,
PlaceID UInt16, SocialSourceNetworkID UInt8, Domain String, SearchPhrase String, SocialSourcePage String), Attendance FixedString(16),
CLID UInt32, YCLID UInt64, NormalizedRefererHash UInt64, SearchPhraseHash UInt64, RefererDomainHash UInt64, NormalizedStartURLHash UInt64,
StartURLDomainHash UInt64, NormalizedEndURLHash UInt64, TopLevelDomain UInt64, URLScheme UInt64, OpenstatServiceNameHash UInt64,
OpenstatCampaignIDHash UInt64, OpenstatAdIDHash UInt64, OpenstatSourceIDHash UInt64, UTMSourceHash UInt64, UTMMediumHash UInt64,
UTMCampaignHash UInt64, UTMContentHash UInt64, UTMTermHash UInt64, FromHash UInt64, WebVisorEnabled UInt8, WebVisorActivity UInt32,
ParsedParams Nested(Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64),
Market Nested(Type UInt8, GoalID UInt32, OrderID String, OrderPrice Int64, PP UInt32, DirectPlaceID UInt32, DirectOrderID UInt32,
DirectBannerID UInt32, GoodID String, GoodName String, GoodQuantity Int32, GoodPrice Int64), IslandID FixedString(16))
ENGINE = CollapsingMergeTree(Sign) PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)
SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192, storage_policy='s3_cache'"
clickhouse-client --query "INSERT INTO test.hits_s3 SELECT * FROM datasets.hits_v1 SETTINGS enable_filesystem_cache_on_write_operations=0"
clickhouse-client --query "INSERT INTO test.hits SELECT * FROM datasets.hits_v1 SETTINGS enable_filesystem_cache_on_write_operations=0"
@ -275,7 +374,9 @@ export ZOOKEEPER_FAULT_INJECTION=1
configure
# But we still need default disk because some tables loaded only into it
sudo cat /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml | sed "s|<main><disk>s3</disk></main>|<main><disk>s3</disk></main><default><disk>default</disk></default>|" > /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml.tmp
sudo cat /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml \
| sed "s|<main><disk>s3</disk></main>|<main><disk>s3</disk></main><default><disk>default</disk></default>|" \
> /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml.tmp
mv /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml.tmp /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml
sudo chown clickhouse /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml
sudo chgrp clickhouse /etc/clickhouse-server/config.d/s3_storage_policy_by_default.xml
@ -283,8 +384,12 @@ sudo chgrp clickhouse /etc/clickhouse-server/config.d/s3_storage_policy_by_defau
start
./stress --hung-check --drop-databases --output-folder test_output --skip-func-tests "$SKIP_TESTS_OPTION" --global-time-limit 1200 \
&& echo -e 'Test script exit code\tOK' >> /test_output/test_results.tsv \
|| echo -e 'Test script failed\tFAIL' >> /test_output/test_results.tsv
&& echo -e "Test script exit code$OK" >> /test_output/test_results.tsv \
|| echo -e "Test script failed$FAIL script exit code: $?" >> /test_output/test_results.tsv
# NOTE Hung check is implemented in docker/tests/stress/stress
rg -Fa "No queries hung" /test_output/test_results.tsv | grep -Fa "OK" \
|| echo -e "Hung check failed, possible deadlock found (see hung_check.log)$FAIL$(head_escaped /test_output/hung_check.log)"
stop
mv /var/log/clickhouse-server/clickhouse-server.log /var/log/clickhouse-server/clickhouse-server.stress.log
@ -295,9 +400,10 @@ unset "${!THREAD_@}"
start
clickhouse-client --query "SELECT 'Server successfully started', 'OK'" >> /test_output/test_results.tsv \
|| (echo -e 'Server failed to start (see application_errors.txt and clickhouse-server.clean.log)\tFAIL' >> /test_output/test_results.tsv \
&& rg --text "<Error>.*Application" /var/log/clickhouse-server/clickhouse-server.log > /test_output/application_errors.txt)
clickhouse-client --query "SELECT 'Server successfully started', 'OK', NULL, ''" >> /test_output/test_results.tsv \
|| (rg --text "<Error>.*Application" /var/log/clickhouse-server/clickhouse-server.log > /test_output/application_errors.txt \
&& echo -e "Server failed to start (see application_errors.txt and clickhouse-server.clean.log)$FAIL$(head_escaped /test_output/application_errors.txt)" \
>> /test_output/test_results.tsv)
stop
@ -310,49 +416,49 @@ stop
rg -Fa "==================" /var/log/clickhouse-server/stderr.log | rg -v "in query:" >> /test_output/tmp
rg -Fa "WARNING" /var/log/clickhouse-server/stderr.log >> /test_output/tmp
rg -Fav -e "ASan doesn't fully support makecontext/swapcontext functions" -e "DB::Exception" /test_output/tmp > /dev/null \
&& echo -e 'Sanitizer assert (in stderr.log)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'No sanitizer asserts\tOK' >> /test_output/test_results.tsv
&& echo -e "Sanitizer assert (in stderr.log)$FAIL$(head_escaped /test_output/tmp)" >> /test_output/test_results.tsv \
|| echo -e "No sanitizer asserts$OK" >> /test_output/test_results.tsv
rm -f /test_output/tmp
# OOM
rg -Fa " <Fatal> Application: Child process was terminated by signal 9" /var/log/clickhouse-server/clickhouse-server*.log > /dev/null \
&& echo -e 'OOM killer (or signal 9) in clickhouse-server.log\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'No OOM messages in clickhouse-server.log\tOK' >> /test_output/test_results.tsv
&& echo -e "Signal 9 in clickhouse-server.log$FAIL" >> /test_output/test_results.tsv \
|| echo -e "No OOM messages in clickhouse-server.log$OK" >> /test_output/test_results.tsv
# Logical errors
rg -Fa "Code: 49, e.displayText() = DB::Exception:" /var/log/clickhouse-server/clickhouse-server*.log > /test_output/logical_errors.txt \
&& echo -e 'Logical error thrown (see clickhouse-server.log or logical_errors.txt)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'No logical errors\tOK' >> /test_output/test_results.tsv
rg -Fa "Code: 49. DB::Exception: " /var/log/clickhouse-server/clickhouse-server*.log > /test_output/logical_errors.txt \
&& echo -e "Logical error thrown (see clickhouse-server.log or logical_errors.txt)$FAIL$(head_escaped /test_output/logical_errors.txt)" >> /test_output/test_results.tsv \
|| echo -e "No logical errors$OK" >> /test_output/test_results.tsv
# Remove file logical_errors.txt if it's empty
[ -s /test_output/logical_errors.txt ] || rm /test_output/logical_errors.txt
# No such key errors
rg --text "Code: 499.*The specified key does not exist" /var/log/clickhouse-server/clickhouse-server*.log > /test_output/no_such_key_errors.txt \
&& echo -e 'S3_ERROR No such key thrown (see clickhouse-server.log or no_such_key_errors.txt)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'No lost s3 keys\tOK' >> /test_output/test_results.tsv
&& echo -e "S3_ERROR No such key thrown (see clickhouse-server.log or no_such_key_errors.txt)$FAIL$(head_escaped /test_output/no_such_key_errors.txt)" >> /test_output/test_results.tsv \
|| echo -e "No lost s3 keys$OK" >> /test_output/test_results.tsv
# Remove file no_such_key_errors.txt if it's empty
[ -s /test_output/no_such_key_errors.txt ] || rm /test_output/no_such_key_errors.txt
# Crash
rg -Fa "########################################" /var/log/clickhouse-server/clickhouse-server*.log > /dev/null \
&& echo -e 'Killed by signal (in clickhouse-server.log)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'Not crashed\tOK' >> /test_output/test_results.tsv
&& echo -e "Killed by signal (in clickhouse-server.log)$FAIL" >> /test_output/test_results.tsv \
|| echo -e "Not crashed$OK" >> /test_output/test_results.tsv
# It also checks for crash without stacktrace (printed by watchdog)
rg -Fa " <Fatal> " /var/log/clickhouse-server/clickhouse-server*.log > /test_output/fatal_messages.txt \
&& echo -e 'Fatal message in clickhouse-server.log (see fatal_messages.txt)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'No fatal messages in clickhouse-server.log\tOK' >> /test_output/test_results.tsv
&& echo -e "Fatal message in clickhouse-server.log (see fatal_messages.txt)$FAIL$(head_escaped /test_output/fatal_messages.txt)" >> /test_output/test_results.tsv \
|| echo -e "No fatal messages in clickhouse-server.log$OK" >> /test_output/test_results.tsv
# Remove file fatal_messages.txt if it's empty
[ -s /test_output/fatal_messages.txt ] || rm /test_output/fatal_messages.txt
rg -Fa "########################################" /test_output/* > /dev/null \
&& echo -e 'Killed by signal (output files)\tFAIL' >> /test_output/test_results.tsv
&& echo -e "Killed by signal (output files)$FAIL" >> /test_output/test_results.tsv
rg -Fa " received signal " /test_output/gdb.log > /dev/null \
&& echo -e 'Found signal in gdb.log\tFAIL' >> /test_output/test_results.tsv
&& echo -e "Found signal in gdb.log$FAIL$(rg -A50 -Fa " received signal " /test_output/gdb.log | escaped)" >> /test_output/test_results.tsv
if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
echo -e "Backward compatibility check\n"
@ -367,8 +473,8 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
echo "Download clickhouse-server from the previous release"
mkdir previous_release_package_folder
echo $previous_release_tag | download_release_packages && echo -e 'Download script exit code\tOK' >> /test_output/test_results.tsv \
|| echo -e 'Download script failed\tFAIL' >> /test_output/test_results.tsv
echo $previous_release_tag | download_release_packages && echo -e "Download script exit code$OK" >> /test_output/test_results.tsv \
|| echo -e "Download script failed$FAIL" >> /test_output/test_results.tsv
mv /var/log/clickhouse-server/clickhouse-server.log /var/log/clickhouse-server/clickhouse-server.clean.log
for table in query_log trace_log
@ -381,13 +487,13 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
# Check if we cloned previous release repository successfully
if ! [ "$(ls -A previous_release_repository/tests/queries)" ]
then
echo -e "Backward compatibility check: Failed to clone previous release tests\tFAIL" >> /test_output/test_results.tsv
echo -e "Backward compatibility check: Failed to clone previous release tests$FAIL" >> /test_output/test_results.tsv
elif ! [ "$(ls -A previous_release_package_folder/clickhouse-common-static_*.deb && ls -A previous_release_package_folder/clickhouse-server_*.deb)" ]
then
echo -e "Backward compatibility check: Failed to download previous release packages\tFAIL" >> /test_output/test_results.tsv
echo -e "Backward compatibility check: Failed to download previous release packages$FAIL" >> /test_output/test_results.tsv
else
echo -e "Successfully cloned previous release tests\tOK" >> /test_output/test_results.tsv
echo -e "Successfully downloaded previous release packages\tOK" >> /test_output/test_results.tsv
echo -e "Successfully cloned previous release tests$OK" >> /test_output/test_results.tsv
echo -e "Successfully downloaded previous release packages$OK" >> /test_output/test_results.tsv
# Uninstall current packages
dpkg --remove clickhouse-client
@ -446,9 +552,10 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
mkdir tmp_stress_output
./stress --test-cmd="/usr/bin/clickhouse-test --queries=\"previous_release_repository/tests/queries\"" --backward-compatibility-check --output-folder tmp_stress_output --global-time-limit=1200 \
&& echo -e 'Backward compatibility check: Test script exit code\tOK' >> /test_output/test_results.tsv \
|| echo -e 'Backward compatibility check: Test script failed\tFAIL' >> /test_output/test_results.tsv
./stress --test-cmd="/usr/bin/clickhouse-test --queries=\"previous_release_repository/tests/queries\"" \
--backward-compatibility-check --output-folder tmp_stress_output --global-time-limit=1200 \
&& echo -e "Backward compatibility check: Test script exit code$OK" >> /test_output/test_results.tsv \
|| echo -e "Backward compatibility check: Test script failed$FAIL" >> /test_output/test_results.tsv
rm -rf tmp_stress_output
# We experienced deadlocks in this command in very rare cases. Let's debug it:
@ -470,9 +577,9 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
export ZOOKEEPER_FAULT_INJECTION=0
configure
start 500
clickhouse-client --query "SELECT 'Backward compatibility check: Server successfully started', 'OK'" >> /test_output/test_results.tsv \
|| (echo -e 'Backward compatibility check: Server failed to start\tFAIL' >> /test_output/test_results.tsv \
&& rg --text "<Error>.*Application" /var/log/clickhouse-server/clickhouse-server.log >> /test_output/bc_check_application_errors.txt)
clickhouse-client --query "SELECT 'Backward compatibility check: Server successfully started', 'OK', NULL, ''" >> /test_output/test_results.tsv \
|| (rg --text "<Error>.*Application" /var/log/clickhouse-server/clickhouse-server.log >> /test_output/bc_check_application_errors.txt \
&& echo -e "Backward compatibility check: Server failed to start$FAIL$(head_escaped /test_output/bc_check_application_errors.txt)" >> /test_output/test_results.tsv)
clickhouse-client --query="SELECT 'Server version: ', version()"
@ -488,8 +595,6 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
# FIXME Not sure if it's expected, but some tests from BC check may not be finished yet when we restarting server.
# Let's just ignore all errors from queries ("} <Error> TCPHandler: Code:", "} <Error> executeQuery: Code:")
# FIXME https://github.com/ClickHouse/ClickHouse/issues/39197 ("Missing columns: 'v3' while processing query: 'v3, k, v1, v2, p'")
# NOTE Incompatibility was introduced in https://github.com/ClickHouse/ClickHouse/pull/39263, it's expected
# ("This engine is deprecated and is not supported in transactions", "[Queue = DB::MergeMutateRuntimeQueue]: Code: 235. DB::Exception: Part")
# FIXME https://github.com/ClickHouse/ClickHouse/issues/39174 - bad mutation does not indicate backward incompatibility
echo "Check for Error messages in server log:"
rg -Fav -e "Code: 236. DB::Exception: Cancelled merging parts" \
@ -519,7 +624,6 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
-e "} <Error> TCPHandler: Code:" \
-e "} <Error> executeQuery: Code:" \
-e "Missing columns: 'v3' while processing query: 'v3, k, v1, v2, p'" \
-e "This engine is deprecated and is not supported in transactions" \
-e "[Queue = DB::MergeMutateRuntimeQueue]: Code: 235. DB::Exception: Part" \
-e "The set of parts restored in place of" \
-e "(ReplicatedMergeTreeAttachThread): Initialization failed. Error" \
@ -528,9 +632,11 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
-e "MutateFromLogEntryTask" \
-e "No connection to ZooKeeper, cannot get shared table ID" \
-e "Session expired" \
-e "TOO_MANY_PARTS" \
/var/log/clickhouse-server/clickhouse-server.backward.dirty.log | rg -Fa "<Error>" > /test_output/bc_check_error_messages.txt \
&& echo -e 'Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'Backward compatibility check: No Error messages in clickhouse-server.log\tOK' >> /test_output/test_results.tsv
&& echo -e "Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)$FAIL$(head_escaped /test_output/bc_check_error_messages.txt)" \
>> /test_output/test_results.tsv \
|| echo -e "Backward compatibility check: No Error messages in clickhouse-server.log$OK" >> /test_output/test_results.tsv
# Remove file bc_check_error_messages.txt if it's empty
[ -s /test_output/bc_check_error_messages.txt ] || rm /test_output/bc_check_error_messages.txt
@ -539,34 +645,36 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
rg -Fa "==================" /var/log/clickhouse-server/stderr.log >> /test_output/tmp
rg -Fa "WARNING" /var/log/clickhouse-server/stderr.log >> /test_output/tmp
rg -Fav -e "ASan doesn't fully support makecontext/swapcontext functions" -e "DB::Exception" /test_output/tmp > /dev/null \
&& echo -e 'Backward compatibility check: Sanitizer assert (in stderr.log)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'Backward compatibility check: No sanitizer asserts\tOK' >> /test_output/test_results.tsv
&& echo -e "Backward compatibility check: Sanitizer assert (in stderr.log)$FAIL$(head_escaped /test_output/tmp)" >> /test_output/test_results.tsv \
|| echo -e "Backward compatibility check: No sanitizer asserts$OK" >> /test_output/test_results.tsv
rm -f /test_output/tmp
# OOM
rg -Fa " <Fatal> Application: Child process was terminated by signal 9" /var/log/clickhouse-server/clickhouse-server.backward.*.log > /dev/null \
&& echo -e 'Backward compatibility check: OOM killer (or signal 9) in clickhouse-server.log\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'Backward compatibility check: No OOM messages in clickhouse-server.log\tOK' >> /test_output/test_results.tsv
&& echo -e "Backward compatibility check: Signal 9 in clickhouse-server.log$FAIL" >> /test_output/test_results.tsv \
|| echo -e "Backward compatibility check: No OOM messages in clickhouse-server.log$OK" >> /test_output/test_results.tsv
# Logical errors
echo "Check for Logical errors in server log:"
rg -Fa -A20 "Code: 49, e.displayText() = DB::Exception:" /var/log/clickhouse-server/clickhouse-server.backward.*.log > /test_output/bc_check_logical_errors.txt \
&& echo -e 'Backward compatibility check: Logical error thrown (see clickhouse-server.log or bc_check_logical_errors.txt)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'Backward compatibility check: No logical errors\tOK' >> /test_output/test_results.tsv
rg -Fa -A20 "Code: 49. DB::Exception:" /var/log/clickhouse-server/clickhouse-server.backward.*.log > /test_output/bc_check_logical_errors.txt \
&& echo -e "Backward compatibility check: Logical error thrown (see clickhouse-server.log or bc_check_logical_errors.txt)$FAIL$(head_escaped /test_output/bc_check_logical_errors.txt)" \
>> /test_output/test_results.tsv \
|| echo -e "Backward compatibility check: No logical errors$OK" >> /test_output/test_results.tsv
# Remove file bc_check_logical_errors.txt if it's empty
[ -s /test_output/bc_check_logical_errors.txt ] || rm /test_output/bc_check_logical_errors.txt
# Crash
rg -Fa "########################################" /var/log/clickhouse-server/clickhouse-server.backward.*.log > /dev/null \
&& echo -e 'Backward compatibility check: Killed by signal (in clickhouse-server.log)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'Backward compatibility check: Not crashed\tOK' >> /test_output/test_results.tsv
&& echo -e "Backward compatibility check: Killed by signal (in clickhouse-server.log)$FAIL" >> /test_output/test_results.tsv \
|| echo -e "Backward compatibility check: Not crashed$OK" >> /test_output/test_results.tsv
# It also checks for crash without stacktrace (printed by watchdog)
echo "Check for Fatal message in server log:"
rg -Fa " <Fatal> " /var/log/clickhouse-server/clickhouse-server.backward.*.log > /test_output/bc_check_fatal_messages.txt \
&& echo -e 'Backward compatibility check: Fatal message in clickhouse-server.log (see bc_check_fatal_messages.txt)\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'Backward compatibility check: No fatal messages in clickhouse-server.log\tOK' >> /test_output/test_results.tsv
&& echo -e "Backward compatibility check: Fatal message in clickhouse-server.log (see bc_check_fatal_messages.txt)$FAIL$(head_escaped /test_output/bc_check_fatal_messages.txt)" \
>> /test_output/test_results.tsv \
|| echo -e "Backward compatibility check: No fatal messages in clickhouse-server.log$OK" >> /test_output/test_results.tsv
# Remove file bc_check_fatal_messages.txt if it's empty
[ -s /test_output/bc_check_fatal_messages.txt ] || rm /test_output/bc_check_fatal_messages.txt
@ -574,7 +682,8 @@ if [ "$DISABLE_BC_CHECK" -ne "1" ]; then
tar -chf /test_output/coordination.backward.tar /var/lib/clickhouse/coordination ||:
for table in query_log trace_log
do
clickhouse-local --path /var/lib/clickhouse/ --only-system-tables -q "select * from system.$table format TSVWithNamesAndTypes" | zstd --threads=0 > /test_output/$table.backward.tsv.zst ||:
clickhouse-local --path /var/lib/clickhouse/ --only-system-tables -q "select * from system.$table format TSVWithNamesAndTypes" \
| zstd --threads=0 > /test_output/$table.backward.tsv.zst ||:
done
fi
fi
@ -583,13 +692,28 @@ dmesg -T > /test_output/dmesg.log
# OOM in dmesg -- those are real
grep -q -F -e 'Out of memory: Killed process' -e 'oom_reaper: reaped process' -e 'oom-kill:constraint=CONSTRAINT_NONE' /test_output/dmesg.log \
&& echo -e 'OOM in dmesg\tFAIL' >> /test_output/test_results.tsv \
|| echo -e 'No OOM in dmesg\tOK' >> /test_output/test_results.tsv
&& echo -e "OOM in dmesg$FAIL$(head_escaped /test_output/dmesg.log)" >> /test_output/test_results.tsv \
|| echo -e "No OOM in dmesg$OK" >> /test_output/test_results.tsv
mv /var/log/clickhouse-server/stderr.log /test_output/
# Write check result into check_status.tsv
clickhouse-local --structure "test String, res String" -q "SELECT 'failure', test FROM table WHERE res != 'OK' order by (lower(test) like '%hung%'), rowNumberInAllBlocks() LIMIT 1" < /test_output/test_results.tsv > /test_output/check_status.tsv
# Try to choose most specific error for the whole check status
clickhouse-local --structure "test String, res String" -q "SELECT 'failure', test FROM table WHERE res != 'OK' order by
(test like 'Backward compatibility check%'), -- BC check goes last
(test like '%Sanitizer%') DESC,
(test like '%Killed by signal%') DESC,
(test like '%gdb.log%') DESC,
(test ilike '%possible deadlock%') DESC,
(test like '%start%') DESC,
(test like '%dmesg%') DESC,
(test like '%OOM%') DESC,
(test like '%Signal 9%') DESC,
(test like '%Fatal message%') DESC,
(test like '%Error message%') DESC,
(test like '%previous release%') DESC,
rowNumberInAllBlocks()
LIMIT 1" < /test_output/test_results.tsv > /test_output/check_status.tsv
[ -s /test_output/check_status.tsv ] || echo -e "success\tNo errors found" > /test_output/check_status.tsv
# Core dumps

View File

@ -1,7 +1,7 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from multiprocessing import cpu_count
from subprocess import Popen, call, check_output, STDOUT
from subprocess import Popen, call, check_output, STDOUT, PIPE
import os
import argparse
import logging
@ -299,14 +299,19 @@ if __name__ == "__main__":
"00001_select_1",
]
)
res = call(cmd, shell=True, stderr=STDOUT)
hung_check_status = "No queries hung\tOK\n"
hung_check_log = os.path.join(args.output_folder, "hung_check.log")
tee = Popen(['/usr/bin/tee', hung_check_log], stdin=PIPE)
res = call(cmd, shell=True, stdout=tee.stdin, stderr=STDOUT)
tee.stdin.close()
if res != 0 and have_long_running_queries:
logging.info("Hung check failed with exit code {}".format(res))
hung_check_status = "Hung check failed\tFAIL\n"
with open(
os.path.join(args.output_folder, "test_results.tsv"), "w+"
) as results:
results.write(hung_check_status)
else:
hung_check_status = "No queries hung\tOK\t\\N\t\n"
with open(
os.path.join(args.output_folder, "test_results.tsv"), "w+"
) as results:
results.write(hung_check_status)
os.remove(hung_check_log)
logging.info("Stress test finished")

View File

@ -0,0 +1,22 @@
---
sidebar_position: 1
sidebar_label: 2023
---
# 2023 Changelog
### ClickHouse release v22.11.5.15-stable (d763e5a9239) FIXME as compared to v22.11.4.3-stable (7f4cf554f69)
#### Bug Fix (user-visible misbehavior in official stable or prestable release)
* Backported in [#44999](https://github.com/ClickHouse/ClickHouse/issues/44999): Another fix for `Cannot read all data` error which could happen while reading `LowCardinality` dictionary from remote fs. Fixes [#44709](https://github.com/ClickHouse/ClickHouse/issues/44709). [#44875](https://github.com/ClickHouse/ClickHouse/pull/44875) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Backported in [#45552](https://github.com/ClickHouse/ClickHouse/issues/45552): Fix `SELECT ... FROM system.dictionaries` exception when there is a dictionary with a bad structure (e.g. incorrect type in xml config). [#45399](https://github.com/ClickHouse/ClickHouse/pull/45399) ([Aleksei Filatov](https://github.com/aalexfvk)).
#### NOT FOR CHANGELOG / INSIGNIFICANT
* Automatically merge green backport PRs and green approved PRs [#41110](https://github.com/ClickHouse/ClickHouse/pull/41110) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Improve release scripts [#45074](https://github.com/ClickHouse/ClickHouse/pull/45074) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Fix wrong approved_at, simplify conditions [#45302](https://github.com/ClickHouse/ClickHouse/pull/45302) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Get rid of artifactory in favor of r2 + ch-repos-manager [#45421](https://github.com/ClickHouse/ClickHouse/pull/45421) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Trim refs/tags/ from GITHUB_TAG in release workflow [#45636](https://github.com/ClickHouse/ClickHouse/pull/45636) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).

View File

@ -0,0 +1,24 @@
---
sidebar_position: 1
sidebar_label: 2023
---
# 2023 Changelog
### ClickHouse release v22.8.13.20-lts (e4817946d18) FIXME as compared to v22.8.12.45-lts (86b0ecd5d51)
#### Bug Fix (user-visible misbehavior in official stable or prestable release)
* Backported in [#45565](https://github.com/ClickHouse/ClickHouse/issues/45565): Fix positional arguments exception Positional argument out of bounds. Closes [#40634](https://github.com/ClickHouse/ClickHouse/issues/40634). [#41189](https://github.com/ClickHouse/ClickHouse/pull/41189) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Backported in [#44997](https://github.com/ClickHouse/ClickHouse/issues/44997): Another fix for `Cannot read all data` error which could happen while reading `LowCardinality` dictionary from remote fs. Fixes [#44709](https://github.com/ClickHouse/ClickHouse/issues/44709). [#44875](https://github.com/ClickHouse/ClickHouse/pull/44875) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Backported in [#45550](https://github.com/ClickHouse/ClickHouse/issues/45550): Fix `SELECT ... FROM system.dictionaries` exception when there is a dictionary with a bad structure (e.g. incorrect type in xml config). [#45399](https://github.com/ClickHouse/ClickHouse/pull/45399) ([Aleksei Filatov](https://github.com/aalexfvk)).
#### NOT FOR CHANGELOG / INSIGNIFICANT
* Automatically merge green backport PRs and green approved PRs [#41110](https://github.com/ClickHouse/ClickHouse/pull/41110) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Improve release scripts [#45074](https://github.com/ClickHouse/ClickHouse/pull/45074) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Fix wrong approved_at, simplify conditions [#45302](https://github.com/ClickHouse/ClickHouse/pull/45302) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Get rid of artifactory in favor of r2 + ch-repos-manager [#45421](https://github.com/ClickHouse/ClickHouse/pull/45421) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Trim refs/tags/ from GITHUB_TAG in release workflow [#45636](https://github.com/ClickHouse/ClickHouse/pull/45636) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Merge pull request [#38262](https://github.com/ClickHouse/ClickHouse/issues/38262) from PolyProgrammist/fix-ordinary-system-un… [#45650](https://github.com/ClickHouse/ClickHouse/pull/45650) ([alesapin](https://github.com/alesapin)).

View File

@ -0,0 +1,23 @@
---
sidebar_position: 1
sidebar_label: 2023
---
# 2023 Changelog
### ClickHouse release v23.1.2.9-stable (8dfb1700858) FIXME as compared to v23.1.1.3077-stable (dcaac477025)
#### Performance Improvement
* Backported in [#45705](https://github.com/ClickHouse/ClickHouse/issues/45705): Fixed performance of short `SELECT` queries that read from tables with large number of`Array`/`Map`/`Nested` columns. [#45630](https://github.com/ClickHouse/ClickHouse/pull/45630) ([Anton Popov](https://github.com/CurtizJ)).
#### Bug Fix
* Backported in [#45673](https://github.com/ClickHouse/ClickHouse/issues/45673): Fix wiping sensitive info in logs. [#45603](https://github.com/ClickHouse/ClickHouse/pull/45603) ([Vitaly Baranov](https://github.com/vitlibar)).
#### Bug Fix (user-visible misbehavior in official stable or prestable release)
* Backported in [#45730](https://github.com/ClickHouse/ClickHouse/issues/45730): Fix key description when encountering duplicate primary keys. This can happen in projections. See [#45590](https://github.com/ClickHouse/ClickHouse/issues/45590) for details. [#45686](https://github.com/ClickHouse/ClickHouse/pull/45686) ([Amos Bird](https://github.com/amosbird)).
#### NOT FOR CHANGELOG / INSIGNIFICANT
* Trim refs/tags/ from GITHUB_TAG in release workflow [#45636](https://github.com/ClickHouse/ClickHouse/pull/45636) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).

View File

@ -49,20 +49,20 @@ where `N` specifies the tokenizer:
Being a type of skipping index, inverted indexes can be dropped or added to a column after table creation:
``` sql
ALTER TABLE tbl DROP INDEX inv_idx;
ALTER TABLE tbl ADD INDEX inv_idx(s) TYPE inverted(2) GRANULARITY 1;
ALTER TABLE tab DROP INDEX inv_idx;
ALTER TABLE tab ADD INDEX inv_idx(s) TYPE inverted(2) GRANULARITY 1;
```
To use the index, no special functions or syntax are required. Typical string search predicates automatically leverage the index. As
examples, consider:
```sql
SELECT * from tab WHERE s == 'Hello World;
SELECT * from tab WHERE s IN (Hello, World);
SELECT * from tab WHERE s LIKE %Hello%;
SELECT * from tab WHERE multiSearchAny(s, Hello, World);
SELECT * from tab WHERE hasToken(s, Hello);
SELECT * from tab WHERE multiSearchAll(s, [Hello, World]);
INSERT INTO tab(key, str) values (1, 'Hello World');
SELECT * from tab WHERE str == 'Hello World';
SELECT * from tab WHERE str IN ('Hello', 'World');
SELECT * from tab WHERE str LIKE '%Hello%';
SELECT * from tab WHERE multiSearchAny(str, ['Hello', 'World']);
SELECT * from tab WHERE hasToken(str, 'Hello');
```
The inverted index also works on columns of type `Array(String)`, `Array(FixedString)`, `Map(String)` and `Map(String)`.

View File

@ -19,7 +19,7 @@ ENGINE = GenerateRandom([random_seed] [,max_string_length] [,max_array_length])
```
The `max_array_length` and `max_string_length` parameters specify maximum length of all
array columns and strings correspondingly in generated data.
array or map columns and strings correspondingly in generated data.
Generate table engine supports only `SELECT` queries.

View File

@ -56,6 +56,19 @@ sudo apt-get clean
sudo apt-get autoclean
```
### You Can't Get Packages With Yum Because Of Wrong Signature
Possible issue: the cache is wrong, maybe it's broken after updated GPG key in 2022-09.
The solution is to clean out the cache and lib directory for yum:
```
sudo find /var/lib/yum/repos/ /var/cache/yum/ -name 'clickhouse-*' -type d -exec rm -rf {} +
sudo rm -f /etc/yum.repos.d/clickhouse.repo
```
After that follow the [install guide](../getting-started/install.md#from-rpm-packages)
## Connecting to the Server {#troubleshooting-accepts-no-connections}
Possible issues:

View File

@ -79,7 +79,7 @@ The BACKUP and RESTORE statements take a list of DATABASE and TABLE names, a des
- ASYNC: backup or restore asynchronously
- PARTITIONS: a list of partitions to restore
- SETTINGS:
- [`compression_method`](en/sql-reference/statements/create/table/#column-compression-codecs) and compression_level
- [`compression_method`](/docs/en/sql-reference/statements/create/table.md/#column-compression-codecs) and compression_level
- `password` for the file on disk
- `base_backup`: the destination of the previous backup of this source. For example, `Disk('backups', '1.zip')`

View File

@ -0,0 +1,112 @@
---
slug: /en/operations/query-cache
sidebar_position: 65
sidebar_label: Query Cache [experimental]
---
# Query Cache [experimental]
The query cache allows to compute `SELECT` queries just once and to serve further executions of the same query directly from the cache.
Depending on the type of the queries, this can dramatically reduce latency and resource consumption of the ClickHouse server.
## Background, Design and Limitations
Query caches can generally be viewed as transactionally consistent or inconsistent.
- In transactionally consistent caches, the database invalidates (discards) cached query results if the result of the `SELECT` query changes
or potentially changes. In ClickHouse, operations which change the data include inserts/updates/deletes in/of/from tables or collapsing
merges. Transactionally consistent caching is especially suitable for OLTP databases, for example
[MySQL](https://dev.mysql.com/doc/refman/5.6/en/query-cache.html) (which removed query cache after v8.0) and
[Oracle](https://docs.oracle.com/database/121/TGDBA/tune_result_cache.htm).
- In transactionally inconsistent caches, slight inaccuracies in query results are accepted under the assumption that all cache entries are
assigned a validity period after which they expire (e.g. 1 minute) and that the underlying data changes only little during this period.
This approach is overall more suitable for OLAP databases. As an example where transactionally inconsistent caching is sufficient,
consider an hourly sales report in a reporting tool which is simultaneously accessed by multiple users. Sales data changes typically
slowly enough that the database only needs to compute the report once (represented by the first `SELECT` query). Further queries can be
served directly from the query cache. In this example, a reasonable validity period could be 30 min.
Transactionally inconsistent caching is traditionally provided by client tools or proxy packages interacting with the database. As a result,
the same caching logic and configuration is often duplicated. With ClickHouse's query cache, the caching logic moves to the server side.
This reduces maintenance effort and avoids redundancy.
:::warning
The query cache is an experimental feature that should not be used in production. There are known cases (e.g. in distributed query
processing) where wrong results are returned.
:::
## Configuration Settings and Usage
As long as the result cache is experimental it must be activated using the following configuration setting:
```sql
SET allow_experimental_query_cache = true;
```
Afterwards, setting [use_query_cache](settings/settings.md#use-query-cache) can be used to control whether a specific query or all queries
of the current session should utilize the query cache. For example, the first execution of query
```sql
SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS use_query_cache = true;
```
will store the query result in the query cache. Subsequent executions of the same query (also with parameter `use_query_cache = true`) will
read the computed result from the cache and return it immediately.
The way the cache is utilized can be configured in more detail using settings [enable_writes_to_query_cache](settings/settings.md#enable-writes-to-query-cache)
and [enable_reads_from_query_cache](settings/settings.md#enable-reads-from-query-cache) (both `true` by default). The former setting
controls whether query results are stored in the cache, whereas the latter setting determines if the database should try to retrieve query
results from the cache. For example, the following query will use the cache only passively, i.e. attempt to read from it but not store its
result in it:
```sql
SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS use_query_cache = true, enable_writes_to_query_cache = false;
```
For maximum control, it is generally recommended to provide settings "use_query_cache", "enable_writes_to_query_cache" and
"enable_reads_from_query_cache" only with specific queries. It is also possible to enable caching at user or profile level (e.g. via `SET
use_query_cache = true`) but one should keep in mind that all `SELECT` queries including monitoring or debugging queries to system tables
may return cached results then.
The query cache can be cleared using statement `SYSTEM DROP QUERY CACHE`. The content of the query cache is displayed in system table
`system.query_cache`. The number of query cache hits and misses are shown as events "QueryCacheHits" and "QueryCacheMisses" in system table
`system.events`. Both counters are only updated for `SELECT` queries which run with setting "use_query_cache = true". Other queries do not
affect the cache miss counter.
The query cache exists once per ClickHouse server process. However, cache results are by default not shared between users. This can be
changed (see below) but doing so is not recommended for security reasons.
Query results are referenced in the query cache by the [Abstract Syntax Tree (AST)](https://en.wikipedia.org/wiki/Abstract_syntax_tree) of
their query. This means that caching is agnostic to upper/lowercase, for example `SELECT 1` and `select 1` are treated as the same query. To
make the matching more natural, all query-level settings related to the query cache are removed from the AST.
If the query was aborted due to an exception or user cancellation, no entry is written into the query cache.
The size of the query cache, the maximum number of cache entries and the maximum size of cache entries (in bytes and in records) can
be configured using different [server configuration options](server-configuration-parameters/settings.md#server_configuration_parameters_query-cache).
To define how long a query must run at least such that its result can be cached, you can use setting
[query_cache_min_query_duration](settings/settings.md#query-cache-min-query-duration). For example, the result of query
``` sql
SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS use_query_cache = true, query_cache_min_query_duration = 5000;
```
is only cached if the query runs longer than 5 seconds. It is also possible to specify how often a query needs to run until its result is
cached - for that use setting [query_cache_min_query_runs](settings/settings.md#query-cache-min-query-runs).
Entries in the query cache become stale after a certain time period (time-to-live). By default, this period is 60 seconds but a different
value can be specified at session, profile or query level using setting [query_cache_ttl](settings/settings.md#query-cache-ttl).
Also, results of queries with non-deterministic functions such as `rand()` and `now()` are not cached. This can be overruled using
setting [query_cache_store_results_of_queries_with_nondeterministic_functions](settings/settings.md#query-cache-store-results-of-queries-with-nondeterministic-functions).
Finally, entries in the query cache are not shared between users due to security reasons. For example, user A must not be able to bypass a
row policy on a table by running the same query as another user B for whom no such policy exists. However, if necessary, cache entries can
be marked accessible by other users (i.e. shared) by supplying setting
[query_cache_share_between_users](settings/settings.md#query-cache-share-between-users).

View File

@ -6,14 +6,14 @@ sidebar_label: Query Result Cache [experimental]
# Query Result Cache [experimental]
The query result cache allows to compute SELECT queries just once and to serve further executions of the same query directly from the cache.
Depending on the type of the queries, this can dramatically reduce latency and resource consumption of the ClickHouse server.
The query result cache allows to compute `SELECT` queries just once and to serve further executions of the same query directly from the
cache. Depending on the type of the queries, this can dramatically reduce latency and resource consumption of the ClickHouse server.
## Background, Design and Limitations
Query result caches can generally be viewed as transactionally consistent or inconsistent.
- In transactionally consistent caches, the database invalidates (discards) cached query results if the result of the SELECT query changes
- In transactionally consistent caches, the database invalidates (discards) cached query results if the result of the `SELECT` query changes
or potentially changes. In ClickHouse, operations which change the data include inserts/updates/deletes in/of/from tables or collapsing
merges. Transactionally consistent caching is especially suitable for OLTP databases, for example
[MySQL](https://dev.mysql.com/doc/refman/5.6/en/query-cache.html) (which removed query result cache after v8.0) and
@ -22,7 +22,7 @@ Query result caches can generally be viewed as transactionally consistent or inc
assigned a validity period after which they expire (e.g. 1 minute) and that the underlying data changes only little during this period.
This approach is overall more suitable for OLAP databases. As an example where transactionally inconsistent caching is sufficient,
consider an hourly sales report in a reporting tool which is simultaneously accessed by multiple users. Sales data changes typically
slowly enough that the database only needs to compute the report once (represented by the first SELECT query). Further queries can be
slowly enough that the database only needs to compute the report once (represented by the first `SELECT` query). Further queries can be
served directly from the query result cache. In this example, a reasonable validity period could be 30 min.
Transactionally inconsistent caching is traditionally provided by client tools or proxy packages interacting with the database. As a result,
@ -36,32 +36,45 @@ processing) where wrong results are returned.
## Configuration Settings and Usage
Parameter [enable_experimental_query_result_cache](settings/settings.md#enable-experimental-query-result-cache) controls whether query
results are inserted into / retrieved from the cache for the current query or session. For example, the first execution of query
As long as the result cache is experimental it must be activated using the following configuration setting:
``` sql
SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS enable_experimental_query_result_cache = true;
```sql
SET allow_experimental_query_result_cache = true;
```
stores the query result into the query result cache. Subsequent executions of the same query (also with parameter
`enable_experimental_query_result_cache = true`) will read the computed result directly from the cache.
Afterwards, setting [use_query_result_cache](settings/settings.md#use-query-result-cache) can be used to control whether a specific query or
all queries of the current session should utilize the query result cache. For example, the first execution of query
Sometimes, it is desirable to use the query result cache only passively, i.e. to allow reading from it but not writing into it (if the cache
result is not stored yet). Parameter [enable_experimental_query_result_cache_passive_usage](settings/settings.md#enable-experimental-query-result-cache-passive-usage)
instead of 'enable_experimental_query_result_cache' can be used for that.
```sql
SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS use_query_result_cache = true;
```
For maximum control, it is generally recommended to provide settings "enable_experimental_query_result_cache" or
"enable_experimental_query_result_cache_passive_usage" only with specific queries. It is also possible to enable caching at user or profile
level but one should keep in mind that all SELECT queries may return a cached results, including monitoring or debugging queries to system
tables.
will store the query result in the query result cache. Subsequent executions of the same query (also with parameter `use_query_result_cache
= true`) will read the computed result from the cache and return it immediately.
The way the cache is utilized can be configured in more detail using settings [enable_writes_to_query_result_cache](settings/settings.md#enable-writes-to-query-result-cache)
and [enable_reads_from_query_result_cache](settings/settings.md#enable-reads-from-query-result-cache) (both `true` by default). The first
settings controls whether query results are stored in the cache, whereas the second parameter determines if the database should try to
retrieve query results from the cache. For example, the following query will use the cache only passively, i.e. attempt to read from it but
not store its result in it:
```sql
SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS use_query_result_cache = true, enable_writes_to_query_result_cache = false;
```
For maximum control, it is generally recommended to provide settings "use_query_result_cache", "enable_writes_to_query_result_cache" and
"enable_reads_from_query_result_cache" only with specific queries. It is also possible to enable caching at user or profile level (e.g. via
`SET use_query_result_cache = true`) but one should keep in mind that all `SELECT` queries including monitoring or debugging queries to
system tables may return cached results then.
The query result cache can be cleared using statement `SYSTEM DROP QUERY RESULT CACHE`. The content of the query result cache is displayed
in system table `SYSTEM.QUERY_RESULT_CACHE`. The number of query result cache hits and misses are shown as events "QueryResultCacheHits" and
"QueryResultCacheMisses" in system table `SYSTEM.EVENTS`. Both counters are only updated for SELECT queries which run with settings
"enable_experimental_query_result_cache = true" or "enable_experimental_query_result_cache_passive_usage = true". Other queries do not
affect the cache miss counter.
in system table `SYSTEM.QUERY_RESULT_CACHE`. The number of query result cache hits and misses are shown as events "QueryCacheHits" and
"QueryCacheMisses" in system table `SYSTEM.EVENTS`. Both counters are only updated for `SELECT` queries which run with setting
"use_query_result_cache = true". Other queries do not affect the cache miss counter.
The query result cache exists once per ClickHouse server process. However, cache results are by default not shared between users. This can
be changed (see below) but doing so is not recommended for security reasons.
@ -81,7 +94,7 @@ To define how long a query must run at least such that its result can be cached,
``` sql
SELECT some_expensive_calculation(column_1, column_2)
FROM table
SETTINGS enable_experimental_query_result_cache = true, query_result_cache_min_query_duration = 5000;
SETTINGS use_query_result_cache = true, query_result_cache_min_query_duration = 5000;
```
is only cached if the query runs longer than 5 seconds. It is also possible to specify how often a query needs to run until its result is
@ -96,4 +109,4 @@ setting [query_result_cache_store_results_of_queries_with_nondeterministic_funct
Finally, entries in the query cache are not shared between users due to security reasons. For example, user A must not be able to bypass a
row policy on a table by running the same query as another user B for whom no such policy exists. However, if necessary, cache entries can
be marked accessible by other users (i.e. shared) by supplying setting
[query_result_cache_share_between_users]{settings/settings.md#query-result-cache-share-between-users}.
[query_result_cache_share_between_users](settings/settings.md#query-result-cache-share-between-users).

View File

@ -1270,30 +1270,30 @@ If the table does not exist, ClickHouse will create it. If the structure of the
</query_log>
```
## query_result_cache {#server_configuration_parameters_query-result-cache}
## query_cache {#server_configuration_parameters_query-cache}
[Query result cache](../query-result-cache.md) configuration.
[Query cache](../query-cache.md) configuration.
The following settings are available:
- `size`: The maximum cache size in bytes. 0 means the query result cache is disabled. Default value: `1073741824` (1 GiB).
- `max_entries`: The maximum number of SELECT query results stored in the cache. Default value: `1024`.
- `max_entry_size`: The maximum size in bytes SELECT query results may have to be saved in the cache. Default value: `1048576` (1 MiB).
- `max_entry_records`: The maximum number of records SELECT query results may have to be saved in the cache. Default value: `30000000` (30 mil).
- `size`: The maximum cache size in bytes. 0 means the query cache is disabled. Default value: `1073741824` (1 GiB).
- `max_entries`: The maximum number of `SELECT` query results stored in the cache. Default value: `1024`.
- `max_entry_size`: The maximum size in bytes `SELECT` query results may have to be saved in the cache. Default value: `1048576` (1 MiB).
- `max_entry_records`: The maximum number of records `SELECT` query results may have to be saved in the cache. Default value: `30000000` (30 mil).
:::warning
Data for the query result cache is allocated in DRAM. If memory is scarce, make sure to set a small value for `size` or disable the query result cache altogether.
Data for the query cache is allocated in DRAM. If memory is scarce, make sure to set a small value for `size` or disable the query cache altogether.
:::
**Example**
```xml
<query_result_cache>
<query_cache>
<size>1073741824</size>
<max_entries>1024</max_entries>
<max_entry_size>1048576</max_entry_size>
<max_entry_records>30000000</max_entry_records>
</query_result_cache>
</query_cache>
```
## query_thread_log {#server_configuration_parameters-query_thread_log}

View File

@ -1301,9 +1301,43 @@ Possible values:
Default value: `3`.
## enable_experimental_query_result_cache {#enable-experimental-query-result-cache}
## use_query_cache {#use-query-cache}
If turned on, results of SELECT queries are stored in and (if available) retrieved from the [query result cache](../query-result-cache.md).
If turned on, `SELECT` queries may utilize the [query cache](../query-cache.md). Parameters [enable_reads_from_query_cache](#enable-readsfrom-query-cache)
and [enable_writes_to_query_cache](#enable-writes-to-query-cache) control in more detail how the cache is used.
Possible values:
- 0 - Yes
- 1 - No
Default value: `0`.
## enable_reads_from_query_cache {#enable-reads-from-query-cache}
If turned on, results of `SELECT` queries are retrieved from the [query cache](../query-cache.md).
Possible values:
- 0 - Disabled
- 1 - Enabled
Default value: `1`.
## enable_writes_to_query_cache {#enable-writes-to-query-cache}
If turned on, results of `SELECT` queries are stored in the [query cache](../query-cache.md).
Possible values:
- 0 - Disabled
- 1 - Enabled
Default value: `1`.
## query_cache_store_results_of_queries_with_nondeterministic_functions {#query--store-results-of-queries-with-nondeterministic-functions}
If turned on, then results of `SELECT` queries with non-deterministic functions (e.g. `rand()`, `now()`) can be cached in the [query cache](../query-cache.md).
Possible values:
@ -1312,31 +1346,9 @@ Possible values:
Default value: `0`.
## enable_experimental_query_result_cache_passive_usage {#enable-experimental-query-result-cache-passive-usage}
## query_cache_min_query_runs {#query-cache-min-query-runs}
If turned on, results of SELECT queries are (if available) retrieved from the [query result cache](../query-result-cache.md).
Possible values:
- 0 - Disabled
- 1 - Enabled
Default value: `0`.
## query_result_cache_store_results_of_queries_with_nondeterministic_functions {#query-result-cache-store-results-of-queries-with-nondeterministic-functions}
If turned on, then results of SELECT queries with non-deterministic functions (e.g. `rand()`, `now()`) can be cached in the [query result cache](../query-result-cache.md).
Possible values:
- 0 - Disabled
- 1 - Enabled
Default value: `0`.
## query_result_cache_min_query_runs {#query-result-cache-min-query-runs}
Minimum number of times a SELECT query must run before its result is stored in the [query result cache](../query-result-cache.md).
Minimum number of times a `SELECT` query must run before its result is stored in the [query cache](../query-cache.md).
Possible values:
@ -1344,9 +1356,9 @@ Possible values:
Default value: `0`
## query_result_cache_min_query_duration {#query-result-cache-min-query-duration}
## query_cache_min_query_duration {#query-cache-min-query-duration}
Minimum duration in milliseconds a query needs to run for its result to be stored in the [query result cache](../query-result-cache.md).
Minimum duration in milliseconds a query needs to run for its result to be stored in the [query cache](../query-cache.md).
Possible values:
@ -1354,9 +1366,9 @@ Possible values:
Default value: `0`
## query_result_cache_ttl {#query-result-cache-ttl}
## query_cache_ttl {#query-cache-ttl}
After this time in seconds entries in the [query result cache](../query-result-cache.md) become stale.
After this time in seconds entries in the [query cache](../query-cache.md) become stale.
Possible values:
@ -1364,9 +1376,9 @@ Possible values:
Default value: `60`
## query_result_cache_share_between_users {#query-result-cache-share-between-users}
## query_cache_share_between_users {#query-cache-share-between-users}
If turned on, the result of SELECT queries cached in the [query result cache](../query-result-cache.md) can be read by other users.
If turned on, the result of `SELECT` queries cached in the [query cache](../query-cache.md) can be read by other users.
It is not recommended to enable this setting due to security reasons.
Possible values:
@ -1632,6 +1644,49 @@ SELECT * FROM test_table
└───┘
```
## insert_keeper_max_retries
The setting sets the maximum number of retries for ClickHouse Keeper (or ZooKeeper) requests during insert into replicated MergeTree. Only Keeper requests which failed due to network error, Keeper session timeout, or request timeout are considered for retries.
Possible values:
- Positive integer.
- 0 — Retries are disabled
Default value: 0
Keeper request retries are done after some timeout. The timeout is controlled by the following settings: `insert_keeper_retry_initial_backoff_ms`, `insert_keeper_retry_max_backoff_ms`.
The first retry is done after `insert_keeper_retry_initial_backoff_ms` timeout. The consequent timeouts will be calculated as follows:
```
timeout = min(insert_keeper_retry_max_backoff_ms, latest_timeout * 2)
```
For example, if `insert_keeper_retry_initial_backoff_ms=100`, `insert_keeper_retry_max_backoff_ms=10000` and `insert_keeper_max_retries=8` then timeouts will be `100, 200, 400, 800, 1600, 3200, 6400, 10000`.
Apart from fault tolerance, the retries aim to provide a better user experience - they allow to avoid returning an error during INSERT execution if Keeper is restarted, for example, due to an upgrade.
## insert_keeper_retry_initial_backoff_ms {#insert_keeper_retry_initial_backoff_ms}
Initial timeout(in milliseconds) to retry a failed Keeper request during INSERT query execution
Possible values:
- Positive integer.
- 0 — No timeout
Default value: 100
## insert_keeper_retry_max_backoff_ms {#insert_keeper_retry_max_backoff_ms}
Maximum timeout (in milliseconds) to retry a failed Keeper request during INSERT query execution
Possible values:
- Positive integer.
- 0 — Maximum timeout is not limited
Default value: 10000
## max_network_bytes {#settings-max-network-bytes}
Limits the data volume (in bytes) that is received or transmitted over the network when executing a query. This setting applies to every individual query.
@ -3634,6 +3689,30 @@ Default value: `0`.
- [optimize_move_to_prewhere](#optimize_move_to_prewhere) setting
## optimize_using_constraints
Use [constraints](../../sql-reference/statements/create/table#constraints) for query optimization. The default is `false`.
Possible values:
- true, false
## optimize_append_index
Use [constraints](../../sql-reference/statements/create/table#constraints) in order to append index condition. The default is `false`.
Possible values:
- true, false
## optimize_substitute_columns
Use [constraints](../../sql-reference/statements/create/table#constraints) for column substitution. The default is `false`.
Possible values:
- true, false
## describe_include_subcolumns {#describe_include_subcolumns}
Enables describing subcolumns for a [DESCRIBE](../../sql-reference/statements/describe-table.md) query. For example, members of a [Tuple](../../sql-reference/data-types/tuple.md) or subcolumns of a [Map](../../sql-reference/data-types/map.md/#map-subcolumns), [Nullable](../../sql-reference/data-types/nullable.md/#finding-null) or an [Array](../../sql-reference/data-types/array.md/#array-size) data type.

View File

@ -19,11 +19,11 @@ Example:
``` sql
SELECT maxMap(a, b)
FROM values('a Array(Int32), b Array(Int64)', ([1, 2], [2, 2]), ([2, 3], [1, 1]))
FROM values('a Array(Char), b Array(Int64)', (['x', 'y'], [2, 2]), (['y', 'z'], [3, 1]))
```
``` text
┌─maxMap(a, b)──────┐
([1,2,3],[2,2,1])
└───────────────────┘
┌─maxMap(a, b)───────────
[['x','y','z'],[2,3,1]]
└────────────────────────
```

View File

@ -28,15 +28,16 @@ Returns an array of the values with maximum approximate sum of weights.
Query:
``` sql
SELECT topKWeighted(10)(number, number) FROM numbers(1000)
SELECT topKWeighted(2)(k, w) FROM
VALUES('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
```
Result:
``` text
┌─topKWeighted(10)(number, number)──────────┐
│ [999,998,997,996,995,994,993,992,991,990]
└───────────────────────────────────────────
┌─topKWeighted(2)(k, w)──┐
│ ['z','x']
└────────────────────────┘
```
**See Also**

View File

@ -660,25 +660,30 @@ This type of storage is for use with composite [keys](../../../sql-reference/dic
This type of storage is for mapping network prefixes (IP addresses) to metadata such as ASN.
Example: The table contains network prefixes and their corresponding AS number and country code:
**Example**
``` text
+-----------|-----|------+
| prefix | asn | cca2 |
+=================+=======+========+
| 202.79.32.0/20 | 17501 | NP |
+-----------|-----|------+
| 2620:0:870::/48 | 3856 | US |
+-----------|-----|------+
| 2a02:6b8:1::/48 | 13238 | RU |
+-----------|-----|------+
| 2001:db8::/32 | 65536 | ZZ |
+-----------|-----|------+
Suppose we have a table in ClickHouse that contains our IP prefixes and mappings:
```sql
CREATE TABLE my_ip_addresses (
prefix String,
asn UInt32,
cca2 String
)
ENGINE = MergeTree
PRIMARY KEY prefix;
```
When using this type of layout, the structure must have a composite key.
```sql
INSERT INTO my_ip_addresses VALUES
('202.79.32.0/20', 17501, 'NP'),
('2620:0:870::/48', 3856, 'US'),
('2a02:6b8:1::/48', 13238, 'RU'),
('2001:db8::/32', 65536, 'ZZ')
;
```
Example:
Let's define an `ip_trie` dictionary for this table. The `ip_trie` layout requires a composite key:
``` xml
<structure>
@ -712,26 +717,29 @@ Example:
or
``` sql
CREATE DICTIONARY somedict (
CREATE DICTIONARY my_ip_trie_dictionary (
prefix String,
asn UInt32,
cca2 String DEFAULT '??'
)
PRIMARY KEY prefix
SOURCE(CLICKHOUSE(TABLE 'my_ip_addresses'))
LAYOUT(IP_TRIE)
LIFETIME(3600);
```
The key must have only one String type attribute that contains an allowed IP prefix. Other types are not supported yet.
The key must have only one `String` type attribute that contains an allowed IP prefix. Other types are not supported yet.
For queries, you must use the same functions (`dictGetT` with a tuple) as for dictionaries with composite keys:
For queries, you must use the same functions (`dictGetT` with a tuple) as for dictionaries with composite keys. The syntax is:
``` sql
dictGetT('dict_name', 'attr_name', tuple(ip))
```
The function takes either `UInt32` for IPv4, or `FixedString(16)` for IPv6:
The function takes either `UInt32` for IPv4, or `FixedString(16)` for IPv6. For example:
``` sql
dictGetString('prefix', 'asn', tuple(IPv6StringToNum('2001:db8::1')))
select dictGet('my_ip_trie_dictionary', 'asn', tuple(IPv6StringToNum('2001:db8::1')))
```
Other types are not supported yet. The function returns the attribute for the prefix that corresponds to this IP address. If there are overlapping prefixes, the most specific one is returned.

View File

@ -1293,31 +1293,31 @@ Similar to formatDateTime, except that it formats datetime in Joda style instead
Using replacement fields, you can define a pattern for the resulting string.
| Placeholder | Description | Presentation | Examples |
| ----------- | ----------- | ------------- | -------- |
| G | era | text | AD |
| C | century of era (>=0) | number | 20 |
| Y | year of era (>=0) | year | 1996 |
| x | weekyear(not supported yet) | year | 1996 |
| w | week of weekyear(not supported yet) | number | 27 |
| e | day of week | number | 2 |
| E | day of week | text | Tuesday; Tue |
| y | year | year | 1996 |
| D | day of year | number | 189 |
| M | month of year | month | July; Jul; 07 |
| d | day of month | number | 10 |
| a | halfday of day | text | PM |
| K | hour of halfday (0~11) | number | 0 |
| h | clockhour of halfday (1~12) | number | 12 |
| H | hour of day (0~23) | number | 0 |
| k | clockhour of day (1~24) | number | 24 |
| m | minute of hour | number | 30 |
| s | second of minute | number | 55 |
| S | fraction of second(not supported yet) | number | 978 |
| z | time zone(short name not supported yet) | text | Pacific Standard Time; PST |
| Z | time zone offset/id(not supported yet) | zone | -0800; -08:00; America/Los_Angeles |
| ' | escape for text | delimiter| |
| '' | single quote | literal | ' |
| Placeholder | Description | Presentation | Examples |
| ----------- | ---------------------------------------- | ------------- | ---------------------------------- |
| G | era | text | AD |
| C | century of era (>=0) | number | 20 |
| Y | year of era (>=0) | year | 1996 |
| x | weekyear (not supported yet) | year | 1996 |
| w | week of weekyear (not supported yet) | number | 27 |
| e | day of week | number | 2 |
| E | day of week | text | Tuesday; Tue |
| y | year | year | 1996 |
| D | day of year | number | 189 |
| M | month of year | month | July; Jul; 07 |
| d | day of month | number | 10 |
| a | halfday of day | text | PM |
| K | hour of halfday (0~11) | number | 0 |
| h | clockhour of halfday (1~12) | number | 12 |
| H | hour of day (0~23) | number | 0 |
| k | clockhour of day (1~24) | number | 24 |
| m | minute of hour | number | 30 |
| s | second of minute | number | 55 |
| S | fraction of second (not supported yet) | number | 978 |
| z | time zone (short name not supported yet) | text | Pacific Standard Time; PST |
| Z | time zone offset/id (not supported yet) | zone | -0800; -08:00; America/Los_Angeles |
| ' | escape for text | delimiter | |
| '' | single quote | literal | ' |
**Example**

View File

@ -304,7 +304,7 @@ Result:
└──────────────┘
```
## s2RectUinion
## s2RectUnion
Returns the smallest rectangle containing the union of this rectangle and the given rectangle. In the S2 system, a rectangle is represented by a type of S2Region called a `S2LatLngRect` that represents a rectangle in latitude-longitude space.

View File

@ -45,37 +45,38 @@ SELECT halfMD5(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')
Calculates the MD4 from a string and returns the resulting set of bytes as FixedString(16).
## MD5
## MD5 {#hash_functions-md5}
Calculates the MD5 from a string and returns the resulting set of bytes as FixedString(16).
If you do not need MD5 in particular, but you need a decent cryptographic 128-bit hash, use the sipHash128 function instead.
If you want to get the same result as output by the md5sum utility, use lower(hex(MD5(s))).
## sipHash64
## sipHash64 (#hash_functions-siphash64)
Produces a 64-bit [SipHash](https://131002.net/siphash/) hash value.
Produces a 64-bit [SipHash](https://en.wikipedia.org/wiki/SipHash) hash value.
```sql
sipHash64(par1,...)
```
This is a cryptographic hash function. It works at least three times faster than the [MD5](#hash_functions-md5) function.
This is a cryptographic hash function. It works at least three times faster than the [MD5](#hash_functions-md5) hash function.
Function [interprets](/docs/en/sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the hash value for each of them. Then combines hashes by the following algorithm:
The function [interprets](/docs/en/sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the hash value for each of them. It then combines the hashes by the following algorithm:
1. After hashing all the input parameters, the function gets the array of hashes.
2. Function takes the first and the second elements and calculates a hash for the array of them.
3. Then the function takes the hash value, calculated at the previous step, and the third element of the initial hash array, and calculates a hash for the array of them.
4. The previous step is repeated for all the remaining elements of the initial hash array.
1. The first and the second hash value are concatenated to an array which is hashed.
2. The previously calculated hash value and the hash of the third input paramter are hashed in a similar way.
3. This calculation is repeated for all remaining hash values of the original input.
**Arguments**
The function takes a variable number of input parameters. Arguments can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md). For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data).
The function takes a variable number of input parameters of any of the [supported data types](/docs/en/sql-reference/data-types/index.md).
**Returned Value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
Note that the calculated hash values may be equal for the same input values of different argument types. This affects for example integer types of different size, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data.
**Example**
```sql
@ -84,13 +85,45 @@ SELECT sipHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00
```response
┌──────────────SipHash─┬─type───┐
│ 13726873534472839665 │ UInt64 │
│ 11400366955626497465 │ UInt64 │
└──────────────────────┴────────┘
```
## sipHash64Keyed
Same as [sipHash64](#hash_functions-siphash64) but additionally takes an explicit key argument instead of using a fixed key.
**Syntax**
```sql
sipHash64Keyed((k0, k1), par1,...)
```
**Arguments**
Same as [sipHash64](#hash_functions-siphash64), but the first argument is a tuple of two UInt64 values representing the key.
**Returned value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Example**
Query:
```sql
SELECT sipHash64Keyed((506097522914230528, 1084818905618843912), array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS SipHash, toTypeName(SipHash) AS type;
```
```response
┌─────────────SipHash─┬─type───┐
│ 8017656310194184311 │ UInt64 │
└─────────────────────┴────────┘
```
## sipHash128
Produces a 128-bit [SipHash](https://131002.net/siphash/) hash value. Differs from [sipHash64](#hash_functions-siphash64) in that the final xor-folding state is done up to 128 bits.
Like [sipHash64](#hash_functions-siphash64) but produces a 128-bit hash value, i.e. the final xor-folding state is done up to 128 bits.
**Syntax**
@ -100,13 +133,11 @@ sipHash128(par1,...)
**Arguments**
The function takes a variable number of input parameters. Arguments can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md). For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data).
Same as for [sipHash64](#hash_functions-siphash64).
**Returned value**
A 128-bit `SipHash` hash value.
Type: [FixedString(16)](/docs/en/sql-reference/data-types/fixedstring.md).
A 128-bit `SipHash` hash value of type [FixedString(16)](/docs/en/sql-reference/data-types/fixedstring.md).
**Example**
@ -124,6 +155,40 @@ Result:
└──────────────────────────────────┘
```
## sipHash128Keyed
Same as [sipHash128](#hash_functions-siphash128) but additionally takes an explicit key argument instead of using a fixed key.
**Syntax**
```sql
sipHash128Keyed((k0, k1), par1,...)
```
**Arguments**
Same as [sipHash128](#hash_functions-siphash128), but the first argument is a tuple of two UInt64 values representing the key.
**Returned value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Example**
Query:
```sql
SELECT hex(sipHash128Keyed((506097522914230528, 1084818905618843912),'foo', '\x01', 3));
```
Result:
```response
┌─hex(sipHash128Keyed((506097522914230528, 1084818905618843912), 'foo', '', 3))─┐
│ B8467F65C8B4CFD9A5F8BD733917D9BF │
└───────────────────────────────────────────────────────────────────────────────┘
```
## cityHash64
Produces a 64-bit [CityHash](https://github.com/google/cityhash) hash value.

View File

@ -95,6 +95,32 @@ Result:
└───────────────────────────────┘
```
If argument `needle` is empty the following rules apply:
- if no `start_pos` was specified: return `1`
- if `start_pos = 0`: return `1`
- if `start_pos >= 1` and `start_pos <= length(haystack) + 1`: return `start_pos`
- otherwise: return `0`
The same rules also apply to functions `positionCaseInsensitive`, `positionUTF8` and `positionCaseInsensitiveUTF8`
``` sql
SELECT
position('abc', ''),
position('abc', '', 0),
position('abc', '', 1),
position('abc', '', 2),
position('abc', '', 3),
position('abc', '', 4),
position('abc', '', 5)
```
``` text
┌─position('abc', '')─┬─position('abc', '', 0)─┬─position('abc', '', 1)─┬─position('abc', '', 2)─┬─position('abc', '', 3)─┬─position('abc', '', 4)─┬─position('abc', '', 5)─┐
│ 1 │ 1 │ 1 │ 2 │ 3 │ 4 │ 0 │
└─────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┘
```
**Examples for POSITION(needle IN haystack) syntax**
Query:

View File

@ -6,6 +6,10 @@ sidebar_label: TTL
# Manipulations with Table TTL
:::note
If you are looking for details on using TTL for managing old data, check out the [Manage Data with TTL](/docs/en/guides/developer/ttl.md) user guide. The docs below demonstrate how to alter or remove an existing TTL rule.
:::
## MODIFY TTL
You can change [table TTL](../../../engines/table-engines/mergetree-family/mergetree.md#mergetree-table-ttl) with a request of the following form:

View File

@ -110,7 +110,7 @@ LIFETIME(MIN 0 MAX 1000)
### Create a dictionary from a file available by HTTP(S)
```sql
statement: CREATE DICTIONARY default.taxi_zone_dictionary
CREATE DICTIONARY default.taxi_zone_dictionary
(
`LocationID` UInt16 DEFAULT 0,
`Borough` String,

View File

@ -3,6 +3,7 @@ slug: /en/sql-reference/statements/create/table
sidebar_position: 36
sidebar_label: TABLE
title: "CREATE TABLE"
keywords: [compression, codec, schema, DDL]
---
Creates a new table. This query can have various syntax forms depending on a use case.

View File

@ -7,7 +7,7 @@ sidebar_label: DELETE
# DELETE Statement
``` sql
DELETE FROM [db.]table [WHERE expr]
DELETE FROM [db.]table [ON CLUSTER cluster] [WHERE expr]
```
`DELETE FROM` removes rows from table `[db.]table` that match expression `expr`. The deleted rows are marked as deleted immediately and will be automatically filtered out of all subsequent queries. Cleanup of data happens asynchronously in background. This feature is only available for MergeTree table engine family.

View File

@ -510,3 +510,15 @@ Result:
**See Also**
- [system.settings](../../operations/system-tables/settings.md) table
## SHOW ENGINES
``` sql
SHOW ENGINES [INTO OUTFILE filename] [FORMAT format]
```
Outputs the content of the [system.table_engines](../../operations/system-tables/table_engines.md) table, that contains description of table engines supported by server and their feature support information.
**See Also**
- [system.table_engines](../../operations/system-tables/table_engines.md) table

View File

@ -103,9 +103,9 @@ Its size can be configured using the server-level setting [uncompressed_cache_si
Reset the compiled expression cache.
The compiled expression cache is enabled/disabled with the query/user/profile-level setting [compile_expressions](../../operations/settings/settings.md#compile-expressions).
## DROP QUERY RESULT CACHE
## DROP QUERY CACHE
Resets the [query result cache](../../operations/query-result-cache.md).
Resets the [query cache](../../operations/query-cache.md).
## FLUSH LOGS
@ -362,3 +362,15 @@ Allows to drop filesystem cache.
```sql
SYSTEM DROP FILESYSTEM CACHE
```
### SYNC FILE CACHE
:::note
It's too heavy and has potential for misuse.
:::
Will do sync syscall.
```sql
SYSTEM SYNC FILE CACHE
```

View File

@ -8,7 +8,7 @@ sidebar_label: generateRandom
Generates random data with given schema.
Allows to populate test tables with data.
Supports all data types that can be stored in table except `LowCardinality` and `AggregateFunction`.
Not all types are supported.
``` sql
generateRandom('name TypeName[, name TypeName]...', [, 'random_seed'[, 'max_string_length'[, 'max_array_length']]])
@ -18,7 +18,7 @@ generateRandom('name TypeName[, name TypeName]...', [, 'random_seed'[, 'max_stri
- `name` — Name of corresponding column.
- `TypeName` — Type of corresponding column.
- `max_array_length` — Maximum array length for all generated arrays. Defaults to `10`.
- `max_array_length` — Maximum elements for all generated arrays or maps. Defaults to `10`.
- `max_string_length` — Maximum string length for all generated strings. Defaults to `10`.
- `random_seed` — Specify random seed manually to produce stable results. If NULL — seed is randomly generated.

View File

@ -1,3 +1,2 @@
build
__pycache__
*.pyc

View File

@ -127,6 +127,69 @@ void Client::showWarnings()
}
}
void Client::parseConnectionsCredentials()
{
/// It is not possible to correctly handle multiple --host --port options.
if (hosts_and_ports.size() >= 2)
return;
String host;
std::optional<UInt16> port;
if (hosts_and_ports.empty())
{
host = config().getString("host", "localhost");
if (config().has("port"))
port = config().getInt("port");
}
else
{
host = hosts_and_ports.front().host;
port = hosts_and_ports.front().port;
}
Strings keys;
config().keys("connections_credentials", keys);
for (const auto & connection : keys)
{
const String & prefix = "connections_credentials." + connection;
const String & connection_name = config().getString(prefix + ".name", "");
if (connection_name != host)
continue;
String connection_hostname;
if (config().has(prefix + ".hostname"))
connection_hostname = config().getString(prefix + ".hostname");
else
connection_hostname = connection_name;
/// Set "host" unconditionally (since it is used as a "name"), while
/// other options only if they are not set yet (config.xml/cli
/// options).
config().setString("host", connection_hostname);
if (!hosts_and_ports.empty())
hosts_and_ports.front().host = connection_hostname;
if (config().has(prefix + ".port") && !port.has_value())
config().setInt("port", config().getInt(prefix + ".port"));
if (config().has(prefix + ".secure") && !config().has("secure"))
config().setBool("secure", config().getBool(prefix + ".secure"));
if (config().has(prefix + ".user") && !config().has("user"))
config().setString("user", config().getString(prefix + ".user"));
if (config().has(prefix + ".password") && !config().has("password"))
config().setString("password", config().getString(prefix + ".password"));
if (config().has(prefix + ".database") && !config().has("database"))
config().setString("database", config().getString(prefix + ".database"));
if (config().has(prefix + ".history_file") && !config().has("history_file"))
{
String history_file = config().getString(prefix + ".history_file");
if (history_file.starts_with("~") && !home_path.empty())
history_file = home_path + "/" + history_file.substr(1);
config().setString("history_file", history_file);
}
}
}
/// Make query to get all server warnings
std::vector<String> Client::loadWarningMessages()
{
@ -216,6 +279,8 @@ void Client::initialize(Poco::Util::Application & self)
if (env_password)
config().setString("password", env_password);
parseConnectionsCredentials();
// global_context->setApplicationType(Context::ApplicationType::CLIENT);
global_context->setQueryParameters(query_parameters);

View File

@ -47,6 +47,7 @@ protected:
private:
void printChangedSettings() const;
void showWarnings();
void parseConnectionsCredentials();
std::vector<String> loadWarningMessages();
};
}

View File

@ -57,4 +57,28 @@
The same can be done on user-level configuration, just create & adjust: ~/.clickhouse-client/config.xml
-->
<!-- Analog of .netrc -->
<![CDATA[
<connections_credentials>
<connection>
<!-- Name of the connection, host option for the client.
"host" is not the same as "hostname" since you may want to have different settings for one host,
and in this case you can add "prod" and "prod_readonly".
Default: "hostname" will be used. -->
<name>default</name>
<!-- Host that will be used for connection. -->
<hostname>127.0.0.1</hostname>
<port>9000</port>
<secure>1</secure>
<user>default</user>
<password></password>
<database></database>
<!-- '~' is expanded to HOME, like in any shell -->
<history_file></history_file>
</connection>
</connections_credentials>
]]>
</config>

View File

@ -6,7 +6,6 @@
#include <Common/ZooKeeper/ZooKeeper.h>
#include <Common/ZooKeeper/KeeperException.h>
#include <Common/setThreadName.h>
#include <IO/ConnectionTimeoutsContext.h>
#include <Interpreters/InterpreterInsertQuery.h>
#include <Interpreters/InterpreterSelectWithUnionQuery.h>
#include <Parsers/ASTFunction.h>

View File

@ -61,13 +61,18 @@ void IdentifierQuoteHandler::handleRequest(HTTPServerRequest & request, HTTPServ
return;
}
bool use_connection_pooling = params.getParsed<bool>("use_connection_pooling", true);
try
{
std::string connection_string = params.get("connection_string");
auto connection = ODBCPooledConnectionFactory::instance().get(
validateODBCConnectionString(connection_string),
getContext()->getSettingsRef().odbc_bridge_connection_pool_size);
nanodbc::ConnectionHolderPtr connection;
if (use_connection_pooling)
connection = ODBCPooledConnectionFactory::instance().get(
validateODBCConnectionString(connection_string), getContext()->getSettingsRef().odbc_bridge_connection_pool_size);
else
connection = std::make_shared<nanodbc::ConnectionHolder>(validateODBCConnectionString(connection_string));
auto identifier = getIdentifierQuote(std::move(connection));

View File

@ -102,7 +102,9 @@ void ODBCHandler::handleRequest(HTTPServerRequest & request, HTTPServerResponse
std::string format = params.get("format", "RowBinary");
std::string connection_string = params.get("connection_string");
bool use_connection_pooling = params.getParsed<bool>("use_connection_pooling", true);
LOG_TRACE(log, "Connection string: '{}'", connection_string);
LOG_TRACE(log, "Use pooling: {}", use_connection_pooling);
UInt64 max_block_size = DEFAULT_BLOCK_SIZE;
if (params.has("max_block_size"))
@ -134,7 +136,7 @@ void ODBCHandler::handleRequest(HTTPServerRequest & request, HTTPServerResponse
try
{
nanodbc::ConnectionHolderPtr connection_handler;
if (getContext()->getSettingsRef().odbc_bridge_use_connection_pooling)
if (use_connection_pooling)
connection_handler = ODBCPooledConnectionFactory::instance().get(
validateODBCConnectionString(connection_string), getContext()->getSettingsRef().odbc_bridge_connection_pool_size);
else

View File

@ -70,13 +70,19 @@ void SchemaAllowedHandler::handleRequest(HTTPServerRequest & request, HTTPServer
return;
}
bool use_connection_pooling = params.getParsed<bool>("use_connection_pooling", true);
try
{
std::string connection_string = params.get("connection_string");
auto connection = ODBCPooledConnectionFactory::instance().get(
validateODBCConnectionString(connection_string),
getContext()->getSettingsRef().odbc_bridge_connection_pool_size);
nanodbc::ConnectionHolderPtr connection;
if (use_connection_pooling)
connection = ODBCPooledConnectionFactory::instance().get(
validateODBCConnectionString(connection_string), getContext()->getSettingsRef().odbc_bridge_connection_pool_size);
else
connection = std::make_shared<nanodbc::ConnectionHolder>(validateODBCConnectionString(connection_string));
bool result = isSchemaAllowed(std::move(connection));

View File

@ -1517,13 +1517,13 @@ try
global_context->setMMappedFileCache(mmap_cache_size);
/// A cache for query results.
size_t query_result_cache_size = config().getUInt64("query_result_cache.size", 1_GiB);
if (query_result_cache_size)
global_context->setQueryResultCache(
query_result_cache_size,
config().getUInt64("query_result_cache.max_entries", 1024),
config().getUInt64("query_result_cache.max_entry_size", 1_MiB),
config().getUInt64("query_result_cache.max_entry_records", 30'000'000));
size_t query_cache_size = config().getUInt64("query_cache.size", 1_GiB);
if (query_cache_size)
global_context->setQueryCache(
query_cache_size,
config().getUInt64("query_cache.max_entries", 1024),
config().getUInt64("query_cache.max_entry_size", 1_MiB),
config().getUInt64("query_cache.max_entry_records", 30'000'000));
#if USE_EMBEDDED_COMPILER
/// 128 MB

View File

@ -1466,13 +1466,13 @@
</rocksdb>
-->
<!-- Configuration for the query result cache -->
<!-- <query_result_cache> -->
<!-- Configuration for the query cache -->
<!-- <query_cache> -->
<!-- <size>1073741824</size> -->
<!-- <max_entries>1024</max_entries> -->
<!-- <max_entry_size>1048576</max_entry_size> -->
<!-- <max_entry_records>30000000</max_entry_records> -->
<!-- </query_result_cache> -->
<!-- </query_cache> -->
<!-- Uncomment if enable merge tree metadata cache -->
<!--merge_tree_metadata_cache>

View File

@ -142,7 +142,7 @@ enum class AccessType
M(SYSTEM_DROP_MARK_CACHE, "SYSTEM DROP MARK, DROP MARK CACHE, DROP MARKS", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_UNCOMPRESSED_CACHE, "SYSTEM DROP UNCOMPRESSED, DROP UNCOMPRESSED CACHE, DROP UNCOMPRESSED", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_MMAP_CACHE, "SYSTEM DROP MMAP, DROP MMAP CACHE, DROP MMAP", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_QUERY_RESULT_CACHE, "SYSTEM DROP QUERY RESULT, DROP QUERY RESULT CACHE, DROP QUERY RESULT", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_QUERY_CACHE, "SYSTEM DROP QUERY, DROP QUERY CACHE, DROP QUERY", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_COMPILED_EXPRESSION_CACHE, "SYSTEM DROP COMPILED EXPRESSION, DROP COMPILED EXPRESSION CACHE, DROP COMPILED EXPRESSIONS", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_FILESYSTEM_CACHE, "SYSTEM DROP FILESYSTEM CACHE, DROP FILESYSTEM CACHE", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_SCHEMA_CACHE, "SYSTEM DROP SCHEMA CACHE, DROP SCHEMA CACHE", GLOBAL, SYSTEM_DROP_CACHE) \
@ -171,6 +171,7 @@ enum class AccessType
M(SYSTEM_WAIT_LOADING_PARTS, "WAIT LOADING PARTS", TABLE, SYSTEM) \
M(SYSTEM_SYNC_DATABASE_REPLICA, "SYNC DATABASE REPLICA", DATABASE, SYSTEM) \
M(SYSTEM_SYNC_TRANSACTION_LOG, "SYNC TRANSACTION LOG", GLOBAL, SYSTEM) \
M(SYSTEM_SYNC_FILE_CACHE, "SYNC FILE CACHE", GLOBAL, SYSTEM) \
M(SYSTEM_FLUSH_DISTRIBUTED, "FLUSH DISTRIBUTED", TABLE, SYSTEM_FLUSH) \
M(SYSTEM_FLUSH_LOGS, "FLUSH LOGS", GLOBAL, SYSTEM_FLUSH) \
M(SYSTEM_FLUSH, "", GROUP, SYSTEM) \

View File

@ -1,14 +1,19 @@
#pragma once
#include <array>
#include <string_view>
#include <DataTypes/DataTypeString.h>
#include <AggregateFunctions/IAggregateFunction.h>
#include <base/range.h>
#include <IO/ReadHelpers.h>
#include <IO/WriteHelpers.h>
#include <Columns/ColumnString.h>
#include <Common/PODArray.h>
#include <Common/logger_useful.h>
#include <IO/ReadBufferFromString.h>
#include <Common/HashTable/HashMap.h>
#include <Columns/IColumn.h>
namespace DB
{
@ -105,24 +110,12 @@ private:
bool specified_min_max_x;
template <class T>
String getBar(const T value) const
size_t updateFrame(ColumnString::Chars & frame, const T value) const
{
if (isNaN(value) || value > 8 || value < 1)
return " ";
// ▁▂▃▄▅▆▇█
switch (static_cast<UInt8>(value))
{
case 1: return "";
case 2: return "";
case 3: return "";
case 4: return "";
case 5: return "";
case 6: return "";
case 7: return "";
case 8: return "";
}
return " ";
static constexpr std::array<std::string_view, 9> bars{" ", "", "", "", "", "", "", "", ""};
const auto & bar = (isNaN(value) || value > 8 || value < 1) ? bars[0] : bars[static_cast<UInt8>(value)];
frame.insert(bar.begin(), bar.end());
return bar.size();
}
/**
@ -136,11 +129,19 @@ private:
* the actual y value of the first position + the actual second position y*0.1, and the remaining y*0.9 is reserved for the next bucket.
* The next bucket will use the last y*0.9 + the actual third position y*0.2, and the remaining y*0.8 will be reserved for the next bucket. And so on.
*/
String render(const AggregateFunctionSparkbarData<X, Y> & data) const
void render(ColumnString & to_column, const AggregateFunctionSparkbarData<X, Y> & data) const
{
String value;
size_t sz = 0;
auto & values = to_column.getChars();
auto & offsets = to_column.getOffsets();
auto update_column = [&] ()
{
values.push_back('\0');
offsets.push_back(offsets.empty() ? sz + 1 : offsets.back() + sz + 1);
};
if (data.points.empty() || !width)
return value;
return update_column();
size_t diff_x;
X min_x_local;
@ -167,13 +168,13 @@ private:
{
auto it = data.points.find(static_cast<X>(min_x_local + i));
bool found = it != data.points.end();
value += getBar(found ? std::round(((it->getMapped() - min_y) / diff_y) * 7) + 1 : 0.0);
sz += updateFrame(values, found ? std::round(((it->getMapped() - min_y) / diff_y) * 7) + 1 : 0.0);
}
}
else
{
for (size_t i = 0; i <= diff_x; ++i)
value += getBar(data.points.has(min_x_local + static_cast<X>(i)) ? 1 : 0);
sz += updateFrame(values, data.points.has(min_x_local + static_cast<X>(i)) ? 1 : 0);
}
}
else
@ -236,25 +237,25 @@ private:
}
if (!min_y || !max_y) // No value is set
return {};
return update_column();
Float64 diff_y = max_y.value() - min_y.value();
auto get_bars = [&] (const std::optional<Float64> & point_y)
auto update_frame = [&] (const std::optional<Float64> & point_y)
{
value += getBar(point_y ? std::round(((point_y.value() - min_y.value()) / diff_y) * 7) + 1 : 0);
sz += updateFrame(values, point_y ? std::round(((point_y.value() - min_y.value()) / diff_y) * 7) + 1 : 0);
};
auto get_bars_for_constant = [&] (const std::optional<Float64> & point_y)
auto update_frame_for_constant = [&] (const std::optional<Float64> & point_y)
{
value += getBar(point_y ? 1 : 0);
sz += updateFrame(values, point_y ? 1 : 0);
};
if (diff_y != 0.0)
std::for_each(new_points.begin(), new_points.end(), get_bars);
std::for_each(new_points.begin(), new_points.end(), update_frame);
else
std::for_each(new_points.begin(), new_points.end(), get_bars_for_constant);
std::for_each(new_points.begin(), new_points.end(), update_frame_for_constant);
}
return value;
update_column();
}
@ -314,8 +315,7 @@ public:
{
auto & to_column = assert_cast<ColumnString &>(to);
const auto & data = this->data(place);
const String & value = render(data);
to_column.insertData(value.data(), value.size());
render(to_column, data);
}
};

View File

@ -313,7 +313,7 @@ QueryTreeNodePtr QueryTreeBuilder::buildSelectExpression(const ASTPtr & select_q
if (select_limit_by_limit)
current_query_tree->getLimitByLimit() = buildExpression(select_limit_by_limit, current_context);
auto select_limit_by_offset = select_query_typed.limitOffset();
auto select_limit_by_offset = select_query_typed.limitByOffset();
if (select_limit_by_offset)
current_query_tree->getLimitByOffset() = buildExpression(select_limit_by_offset, current_context);

View File

@ -271,6 +271,18 @@ size_t BackupImpl::getNumFiles() const
return num_files;
}
size_t BackupImpl::getNumProcessedFiles() const
{
std::lock_guard lock{mutex};
return num_processed_files;
}
UInt64 BackupImpl::getProcessedFilesSize() const
{
std::lock_guard lock{mutex};
return processed_files_size;
}
UInt64 BackupImpl::getUncompressedSize() const
{
std::lock_guard lock{mutex};
@ -355,6 +367,7 @@ void BackupImpl::writeBackupMetadata()
out->finalize();
increaseUncompressedSize(str.size());
increaseProcessedSize(str.size());
}
@ -380,6 +393,7 @@ void BackupImpl::readBackupMetadata()
String str;
readStringUntilEOF(str, *in);
increaseUncompressedSize(str.size());
increaseProcessedSize(str.size());
Poco::XML::DOMParser dom_parser;
Poco::AutoPtr<Poco::XML::Document> config = dom_parser.parseMemory(str.data(), str.size());
const Poco::XML::Node * config_root = getRootNode(config);
@ -598,6 +612,8 @@ BackupEntryPtr BackupImpl::readFile(const SizeAndChecksum & size_and_checksum) c
if (open_mode != OpenMode::READ)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup is not opened for reading");
increaseProcessedSize(size_and_checksum.first);
if (!size_and_checksum.first)
{
/// Entry's data is empty.
@ -762,6 +778,11 @@ void BackupImpl::writeFile(const String & file_name, BackupEntryPtr entry)
.base_checksum = 0,
};
{
std::lock_guard lock{mutex};
increaseProcessedSize(info);
}
/// Empty file, nothing to backup
if (info.size == 0 && deduplicate_files)
{
@ -972,6 +993,17 @@ void BackupImpl::increaseUncompressedSize(const FileInfo & info)
increaseUncompressedSize(info.size - info.base_size);
}
void BackupImpl::increaseProcessedSize(UInt64 file_size) const
{
processed_files_size += file_size;
++num_processed_files;
}
void BackupImpl::increaseProcessedSize(const FileInfo & info)
{
increaseProcessedSize(info.size);
}
void BackupImpl::setCompressedSize()
{
if (use_archives)
@ -1011,7 +1043,7 @@ std::shared_ptr<IArchiveWriter> BackupImpl::getArchiveWriter(const String & suff
String archive_name_with_suffix = getArchiveNameWithSuffix(suffix);
auto new_archive_writer = createArchiveWriter(archive_params.archive_name, writer->writeFile(archive_name_with_suffix));
new_archive_writer->setPassword(archive_params.password);
new_archive_writer->setCompression(archive_params.compression_method, archive_params.compression_level);
size_t pos = suffix.empty() ? 0 : 1;
archive_writers[pos] = {suffix, new_archive_writer};

View File

@ -59,6 +59,8 @@ public:
time_t getTimestamp() const override { return timestamp; }
UUID getUUID() const override { return *uuid; }
size_t getNumFiles() const override;
size_t getNumProcessedFiles() const override;
UInt64 getProcessedFilesSize() const override;
UInt64 getUncompressedSize() const override;
UInt64 getCompressedSize() const override;
Strings listFiles(const String & directory, bool recursive) const override;
@ -101,10 +103,16 @@ private:
std::shared_ptr<IArchiveReader> getArchiveReader(const String & suffix) const;
std::shared_ptr<IArchiveWriter> getArchiveWriter(const String & suffix);
/// Increases `uncompressed_size` by a specific value and `num_files` by 1.
/// Increases `uncompressed_size` by a specific value,
/// also increases `num_files` by 1.
void increaseUncompressedSize(UInt64 file_size);
void increaseUncompressedSize(const FileInfo & info);
/// Increases `num_processed_files` by a specific value,
/// also increases `num_processed_files` by 1.
void increaseProcessedSize(UInt64 file_size) const;
void increaseProcessedSize(const FileInfo & info);
/// Calculates and sets `compressed_size`.
void setCompressedSize();
@ -121,6 +129,8 @@ private:
std::optional<UUID> uuid;
time_t timestamp = 0;
size_t num_files = 0;
mutable size_t num_processed_files = 0;
mutable UInt64 processed_files_size = 0;
UInt64 uncompressed_size = 0;
UInt64 compressed_size = 0;
int version;

View File

@ -338,16 +338,20 @@ void BackupsWorker::doBackup(
}
size_t num_files = 0;
size_t num_processed_files = 0;
UInt64 uncompressed_size = 0;
UInt64 compressed_size = 0;
UInt64 processed_files_size = 0;
/// Finalize backup (write its metadata).
if (!backup_settings.internal)
{
backup->finalizeWriting();
num_files = backup->getNumFiles();
num_processed_files = backup->getNumProcessedFiles();
uncompressed_size = backup->getUncompressedSize();
compressed_size = backup->getCompressedSize();
processed_files_size = backup->getProcessedFilesSize();
}
/// Close the backup.
@ -355,7 +359,7 @@ void BackupsWorker::doBackup(
LOG_INFO(log, "{} {} was created successfully", (backup_settings.internal ? "Internal backup" : "Backup"), backup_name_for_logging);
setStatus(backup_id, BackupStatus::BACKUP_CREATED);
setNumFilesAndSize(backup_id, num_files, uncompressed_size, compressed_size);
setNumFilesAndSize(backup_id, num_files, num_processed_files, processed_files_size, uncompressed_size, compressed_size);
}
catch (...)
{
@ -496,8 +500,6 @@ void BackupsWorker::doRestore(
backup_open_params.password = restore_settings.password;
BackupPtr backup = BackupFactory::instance().createBackup(backup_open_params);
setNumFilesAndSize(restore_id, backup->getNumFiles(), backup->getUncompressedSize(), backup->getCompressedSize());
String current_database = context->getCurrentDatabase();
/// Checks access rights if this is ON CLUSTER query.
@ -578,6 +580,13 @@ void BackupsWorker::doRestore(
LOG_INFO(log, "Restored from {} {} successfully", (restore_settings.internal ? "internal backup" : "backup"), backup_name_for_logging);
setStatus(restore_id, BackupStatus::RESTORED);
setNumFilesAndSize(
restore_id,
backup->getNumFiles(),
backup->getNumProcessedFiles(),
backup->getProcessedFilesSize(),
backup->getUncompressedSize(),
backup->getCompressedSize());
}
catch (...)
{
@ -658,7 +667,7 @@ void BackupsWorker::setStatus(const String & id, BackupStatus status, bool throw
}
void BackupsWorker::setNumFilesAndSize(const String & id, size_t num_files, UInt64 uncompressed_size, UInt64 compressed_size)
void BackupsWorker::setNumFilesAndSize(const String & id, size_t num_files, size_t num_processed_files, UInt64 processed_files_size, UInt64 uncompressed_size, UInt64 compressed_size)
{
std::lock_guard lock{infos_mutex};
auto it = infos.find(id);
@ -667,6 +676,8 @@ void BackupsWorker::setNumFilesAndSize(const String & id, size_t num_files, UInt
auto & info = it->second;
info.num_files = num_files;
info.num_processed_files = num_processed_files;
info.processed_files_size = processed_files_size;
info.uncompressed_size = uncompressed_size;
info.compressed_size = compressed_size;
}

View File

@ -56,6 +56,14 @@ public:
/// Number of files in the backup (including backup's metadata; only unique files are counted).
size_t num_files = 0;
/// Number of processed files during backup or restore process
/// For restore it includes files from base backups
size_t num_processed_files = 0;
/// Size of processed files during backup or restore
/// For restore in includes sizes from base backups
UInt64 processed_files_size = 0;
/// Size of all files in the backup (including backup's metadata; only unique files are counted).
UInt64 uncompressed_size = 0;
@ -102,7 +110,7 @@ private:
void addInfo(const OperationID & id, const String & name, bool internal, BackupStatus status);
void setStatus(const OperationID & id, BackupStatus status, bool throw_if_error = true);
void setStatusSafe(const String & id, BackupStatus status) { setStatus(id, status, false); }
void setNumFilesAndSize(const OperationID & id, size_t num_files, UInt64 uncompressed_size, UInt64 compressed_size);
void setNumFilesAndSize(const OperationID & id, size_t num_files, size_t num_processed_files, UInt64 processed_files_size, UInt64 uncompressed_size, UInt64 compressed_size);
std::vector<Info> getAllActiveBackupInfos() const;
std::vector<Info> getAllActiveRestoreInfos() const;
bool hasConcurrentBackups(const BackupSettings & backup_settings) const;

View File

@ -40,6 +40,12 @@ public:
/// Returns the number of unique files in the backup.
virtual size_t getNumFiles() const = 0;
/// Returns the number of files were processed for backup or restore
virtual size_t getNumProcessedFiles() const = 0;
// Returns the total size of processed files for backup or restore
virtual UInt64 getProcessedFilesSize() const = 0;
/// Returns the total size of unique files in the backup.
virtual UInt64 getUncompressedSize() const = 0;

View File

@ -6,7 +6,6 @@
#include <Poco/Net/HTTPRequest.h>
#include <Common/ShellCommand.h>
#include <Common/logger_useful.h>
#include <IO/ConnectionTimeoutsContext.h>
namespace DB

View File

@ -1,5 +1,7 @@
#include "LibraryBridgeHelper.h"
#include <IO/ConnectionTimeoutsContext.h>
namespace DB
{

View File

@ -63,10 +63,12 @@ public:
XDBCBridgeHelper(
ContextPtr context_,
Poco::Timespan http_timeout_,
const std::string & connection_string_)
const std::string & connection_string_,
bool use_connection_pooling_)
: IXDBCBridgeHelper(context_->getGlobalContext())
, log(&Poco::Logger::get(BridgeHelperMixin::getName() + "BridgeHelper"))
, connection_string(connection_string_)
, use_connection_pooling(use_connection_pooling_)
, http_timeout(http_timeout_)
, config(context_->getGlobalContext()->getConfigRef())
{
@ -132,6 +134,7 @@ protected:
uri.setHost(bridge_host);
uri.setPort(bridge_port);
uri.setScheme("http");
uri.addQueryParameter("use_connection_pooling", toString(use_connection_pooling));
return uri;
}
@ -146,6 +149,7 @@ private:
Poco::Logger * log;
std::string connection_string;
bool use_connection_pooling;
Poco::Timespan http_timeout;
std::string bridge_host;
size_t bridge_port;
@ -189,6 +193,7 @@ protected:
uri.setPath(SCHEMA_ALLOWED_HANDLER);
uri.addQueryParameter("version", std::to_string(XDBC_BRIDGE_PROTOCOL_VERSION));
uri.addQueryParameter("connection_string", getConnectionString());
uri.addQueryParameter("use_connection_pooling", toString(use_connection_pooling));
ReadWriteBufferFromHTTP buf(uri, Poco::Net::HTTPRequest::HTTP_POST, {}, ConnectionTimeouts::getHTTPTimeouts(getContext()), credentials);
@ -210,6 +215,7 @@ protected:
uri.setPath(IDENTIFIER_QUOTE_HANDLER);
uri.addQueryParameter("version", std::to_string(XDBC_BRIDGE_PROTOCOL_VERSION));
uri.addQueryParameter("connection_string", getConnectionString());
uri.addQueryParameter("use_connection_pooling", toString(use_connection_pooling));
ReadWriteBufferFromHTTP buf(uri, Poco::Net::HTTPRequest::HTTP_POST, {}, ConnectionTimeouts::getHTTPTimeouts(getContext()), credentials);

View File

@ -513,6 +513,11 @@ if (TARGET ch_contrib::msgpack)
target_link_libraries (clickhouse_common_io PUBLIC ch_contrib::msgpack)
endif()
if (TARGET ch_contrib::liburing)
target_link_libraries (clickhouse_common_io PUBLIC ch_contrib::liburing)
target_include_directories (clickhouse_common_io SYSTEM BEFORE PUBLIC ${LIBURING_COMPAT_INCLUDE_DIR} ${LIBURING_INCLUDE_DIR})
endif()
target_link_libraries (clickhouse_common_io PUBLIC ch_contrib::fast_float)
if (USE_ORC)

View File

@ -106,6 +106,8 @@
M(KeeperAliveConnections, "Number of alive connections") \
M(KeeperOutstandingRequets, "Number of outstanding requests") \
M(ThreadsInOvercommitTracker, "Number of waiting threads inside of OvercommitTracker") \
M(IOUringPendingEvents, "Number of io_uring SQEs waiting to be submitted") \
M(IOUringInFlightEvents, "Number of io_uring SQEs in flight") \
namespace CurrentMetrics
{

View File

@ -646,6 +646,8 @@
M(675, CANNOT_PARSE_IPV4) \
M(676, CANNOT_PARSE_IPV6) \
M(677, THREAD_WAS_CANCELED) \
M(678, IO_URING_INIT_FAILED) \
M(679, IO_URING_SUBMIT_ERROR) \
\
M(999, KEEPER_EXCEPTION) \
M(1000, POCO_EXCEPTION) \

View File

@ -68,17 +68,17 @@ inline std::string_view toDescription(OvercommitResult result)
switch (result)
{
case OvercommitResult::NONE:
return "Memory overcommit isn't used. OvercommitTracker isn't set";
return "";
case OvercommitResult::DISABLED:
return "Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero";
return "Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero.";
case OvercommitResult::MEMORY_FREED:
throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "OvercommitResult::MEMORY_FREED shouldn't be asked for description");
case OvercommitResult::SELECTED:
return "Query was selected to stop by OvercommitTracker";
return "Query was selected to stop by OvercommitTracker.";
case OvercommitResult::TIMEOUTED:
return "Waiting timeout for memory to be freed is reached";
return "Waiting timeout for memory to be freed is reached.";
case OvercommitResult::NOT_ENOUGH_FREED:
return "Memory overcommit has freed not enough memory";
return "Memory overcommit has freed not enough memory.";
}
}
@ -263,13 +263,14 @@ void MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceeded, MemoryT
throw DB::Exception(
DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
"Memory limit{}{} exceeded: "
"would use {} (attempt to allocate chunk of {} bytes), maximum: {}. "
"OvercommitTracker decision: {}.",
"would use {} (attempt to allocate chunk of {} bytes), maximum: {}."
"{}{}",
description ? " " : "",
description ? description : "",
formatReadableSizeWithBinarySuffix(will_be),
size,
formatReadableSizeWithBinarySuffix(current_hard_limit),
overcommit_result == OvercommitResult::NONE ? "" : " OvercommitTracker decision: ",
toDescription(overcommit_result));
}
else

View File

@ -52,11 +52,13 @@ template <typename T> T getConfigValueOrDefault(
return config.getInt64(path);
else if constexpr (std::is_same_v<T, Float64>)
return config.getDouble(path);
else if constexpr (std::is_same_v<T, bool>)
return config.getBool(path);
else
throw Exception(
ErrorCodes::NOT_IMPLEMENTED,
"Unsupported type in getConfigValueOrDefault(). "
"Supported types are String, UInt64, Int64, Float64");
"Supported types are String, UInt64, Int64, Float64, bool");
}
catch (const Poco::SyntaxException &)
{
@ -85,11 +87,13 @@ template<typename T> void setConfigValue(
config.setInt64(path, value);
else if constexpr (std::is_same_v<T, Float64>)
config.setDouble(path, value);
else if constexpr (std::is_same_v<T, bool>)
config.setBool(path, value);
else
throw Exception(
ErrorCodes::NOT_IMPLEMENTED,
"Unsupported type in setConfigValue(). "
"Supported types are String, UInt64, Int64, Float64");
"Supported types are String, UInt64, Int64, Float64, bool");
}
template <typename T> void copyConfigValue(
@ -206,6 +210,8 @@ template Int64 getConfigValue<Int64>(const Poco::Util::AbstractConfiguration & c
const std::string & path);
template Float64 getConfigValue<Float64>(const Poco::Util::AbstractConfiguration & config,
const std::string & path);
template bool getConfigValue<bool>(const Poco::Util::AbstractConfiguration & config,
const std::string & path);
template String getConfigValueOrDefault<String>(const Poco::Util::AbstractConfiguration & config,
const std::string & path, const String * default_value);
@ -215,6 +221,8 @@ template Int64 getConfigValueOrDefault<Int64>(const Poco::Util::AbstractConfigur
const std::string & path, const Int64 * default_value);
template Float64 getConfigValueOrDefault<Float64>(const Poco::Util::AbstractConfiguration & config,
const std::string & path, const Float64 * default_value);
template bool getConfigValueOrDefault<bool>(const Poco::Util::AbstractConfiguration & config,
const std::string & path, const bool * default_value);
template void setConfigValue<String>(Poco::Util::AbstractConfiguration & config,
const std::string & path, const String & value, bool update);
@ -224,6 +232,8 @@ template void setConfigValue<Int64>(Poco::Util::AbstractConfiguration & config,
const std::string & path, const Int64 & value, bool update);
template void setConfigValue<Float64>(Poco::Util::AbstractConfiguration & config,
const std::string & path, const Float64 & value, bool update);
template void setConfigValue<bool>(Poco::Util::AbstractConfiguration & config,
const std::string & path, const bool & value, bool update);
template void copyConfigValue<String>(const Poco::Util::AbstractConfiguration & from_config, const std::string & from_path,
Poco::Util::AbstractConfiguration & to_config, const std::string & to_path);
@ -233,6 +243,8 @@ template void copyConfigValue<Int64>(const Poco::Util::AbstractConfiguration & f
Poco::Util::AbstractConfiguration & to_config, const std::string & to_path);
template void copyConfigValue<Float64>(const Poco::Util::AbstractConfiguration & from_config, const std::string & from_path,
Poco::Util::AbstractConfiguration & to_config, const std::string & to_path);
template void copyConfigValue<bool>(const Poco::Util::AbstractConfiguration & from_config, const std::string & from_path,
Poco::Util::AbstractConfiguration & to_config, const std::string & to_path);
}
}

View File

@ -436,11 +436,13 @@ template String NamedCollection::get<String>(const NamedCollection::Key & key) c
template UInt64 NamedCollection::get<UInt64>(const NamedCollection::Key & key) const;
template Int64 NamedCollection::get<Int64>(const NamedCollection::Key & key) const;
template Float64 NamedCollection::get<Float64>(const NamedCollection::Key & key) const;
template bool NamedCollection::get<bool>(const NamedCollection::Key & key) const;
template String NamedCollection::getOrDefault<String>(const NamedCollection::Key & key, const String & default_value) const;
template UInt64 NamedCollection::getOrDefault<UInt64>(const NamedCollection::Key & key, const UInt64 & default_value) const;
template Int64 NamedCollection::getOrDefault<Int64>(const NamedCollection::Key & key, const Int64 & default_value) const;
template Float64 NamedCollection::getOrDefault<Float64>(const NamedCollection::Key & key, const Float64 & default_value) const;
template bool NamedCollection::getOrDefault<bool>(const NamedCollection::Key & key, const bool & default_value) const;
template void NamedCollection::set<String, true>(const NamedCollection::Key & key, const String & value);
template void NamedCollection::set<String, false>(const NamedCollection::Key & key, const String & value);
@ -450,6 +452,7 @@ template void NamedCollection::set<Int64, true>(const NamedCollection::Key & key
template void NamedCollection::set<Int64, false>(const NamedCollection::Key & key, const Int64 & value);
template void NamedCollection::set<Float64, true>(const NamedCollection::Key & key, const Float64 & value);
template void NamedCollection::set<Float64, false>(const NamedCollection::Key & key, const Float64 & value);
template void NamedCollection::set<bool, false>(const NamedCollection::Key & key, const bool & value);
template void NamedCollection::setOrUpdate<String, true>(const NamedCollection::Key & key, const String & value);
template void NamedCollection::setOrUpdate<String, false>(const NamedCollection::Key & key, const String & value);
@ -459,6 +462,7 @@ template void NamedCollection::setOrUpdate<Int64, true>(const NamedCollection::K
template void NamedCollection::setOrUpdate<Int64, false>(const NamedCollection::Key & key, const Int64 & value);
template void NamedCollection::setOrUpdate<Float64, true>(const NamedCollection::Key & key, const Float64 & value);
template void NamedCollection::setOrUpdate<Float64, false>(const NamedCollection::Key & key, const Float64 & value);
template void NamedCollection::setOrUpdate<bool, false>(const NamedCollection::Key & key, const bool & value);
template void NamedCollection::remove<true>(const Key & key);
template void NamedCollection::remove<false>(const Key & key);

View File

@ -53,8 +53,8 @@
M(TableFunctionExecute, "Number of table function calls.") \
M(MarkCacheHits, "Number of times an entry has been found in the mark cache, so we didn't have to load a mark file.") \
M(MarkCacheMisses, "Number of times an entry has not been found in the mark cache, so we had to load a mark file in memory, which is a costly operation, adding to query latency.") \
M(QueryResultCacheHits, "Number of times a query result has been found in the query result cache (and query computation was avoided).") \
M(QueryResultCacheMisses, "Number of times a query result has not been found in the query result cache (and required query computation).") \
M(QueryCacheHits, "Number of times a query result has been found in the query cache (and query computation was avoided).") \
M(QueryCacheMisses, "Number of times a query result has not been found in the query cache (and required query computation).") \
M(CreatedReadBufferOrdinary, "Number of times ordinary read buffer was created for reading data (while choosing among other read methods).") \
M(CreatedReadBufferDirectIO, "Number of times a read buffer with O_DIRECT was created for reading data (while choosing among other read methods).") \
M(CreatedReadBufferDirectIOFailed, "Number of times a read buffer with O_DIRECT was attempted to be created for reading data (while choosing among other read methods), but the OS did not allow it (due to lack of filesystem support or other reasons) and we fallen back to the ordinary reading method.") \
@ -326,7 +326,6 @@ The server successfully detected this situation and will download merged part fr
M(S3ListObjects, "Number of S3 API ListObjects calls.") \
M(S3HeadObject, "Number of S3 API HeadObject calls.") \
M(S3GetObjectAttributes, "Number of S3 API GetObjectAttributes calls.") \
M(S3GetObjectMetadata, "Number of S3 API GetObject calls for getting metadata.") \
M(S3CreateMultipartUpload, "Number of S3 API CreateMultipartUpload calls.") \
M(S3UploadPartCopy, "Number of S3 API UploadPartCopy calls.") \
M(S3UploadPart, "Number of S3 API UploadPart calls.") \
@ -340,7 +339,6 @@ The server successfully detected this situation and will download merged part fr
M(DiskS3ListObjects, "Number of DiskS3 API ListObjects calls.") \
M(DiskS3HeadObject, "Number of DiskS3 API HeadObject calls.") \
M(DiskS3GetObjectAttributes, "Number of DiskS3 API GetObjectAttributes calls.") \
M(DiskS3GetObjectMetadata, "Number of DiskS3 API GetObject calls for getting metadata.") \
M(DiskS3CreateMultipartUpload, "Number of DiskS3 API CreateMultipartUpload calls.") \
M(DiskS3UploadPartCopy, "Number of DiskS3 API UploadPartCopy calls.") \
M(DiskS3UploadPart, "Number of DiskS3 API UploadPart calls.") \
@ -474,6 +472,10 @@ The server successfully detected this situation and will download merged part fr
M(OverflowAny, "Number of times approximate GROUP BY was in effect: when aggregation was performed only on top of first 'max_rows_to_group_by' unique keys and other keys were ignored due to 'group_by_overflow_mode' = 'any'.") \
\
M(ServerStartupMilliseconds, "Time elapsed from starting server to listening to sockets in milliseconds")\
M(IOUringSQEsSubmitted, "Total number of io_uring SQEs submitted") \
M(IOUringSQEsResubmits, "Total number of io_uring SQE resubmits performed") \
M(IOUringCQEsCompleted, "Total number of successfully completed io_uring CQEs") \
M(IOUringCQEsFailed, "Total number of completed io_uring CQEs with failures") \
namespace ProfileEvents
{

View File

@ -78,13 +78,13 @@ private:
public:
/// Arguments - seed.
SipHash(UInt64 k0 = 0, UInt64 k1 = 0) /// NOLINT
SipHash(UInt64 key0 = 0, UInt64 key1 = 0) /// NOLINT
{
/// Initialize the state with some random bytes and seed.
v0 = 0x736f6d6570736575ULL ^ k0;
v1 = 0x646f72616e646f6dULL ^ k1;
v2 = 0x6c7967656e657261ULL ^ k0;
v3 = 0x7465646279746573ULL ^ k1;
v0 = 0x736f6d6570736575ULL ^ key0;
v1 = 0x646f72616e646f6dULL ^ key1;
v2 = 0x6c7967656e657261ULL ^ key0;
v3 = 0x7465646279746573ULL ^ key1;
cnt = 0;
current_word = 0;
@ -216,20 +216,30 @@ inline void sipHash128(const char * data, const size_t size, char * out)
hash.get128(out);
}
inline UInt128 sipHash128(const char * data, const size_t size)
inline UInt128 sipHash128Keyed(UInt64 key0, UInt64 key1, const char * data, const size_t size)
{
SipHash hash;
SipHash hash(key0, key1);
hash.update(data, size);
return hash.get128();
}
inline UInt64 sipHash64(const char * data, const size_t size)
inline UInt128 sipHash128(const char * data, const size_t size)
{
SipHash hash;
return sipHash128Keyed(0, 0, data, size);
}
inline UInt64 sipHash64Keyed(UInt64 key0, UInt64 key1, const char * data, const size_t size)
{
SipHash hash(key0, key1);
hash.update(data, size);
return hash.get64();
}
inline UInt64 sipHash64(const char * data, const size_t size)
{
return sipHash64Keyed(0, 0, data, size);
}
template <typename T>
UInt64 sipHash64(const T & x)
{

View File

@ -312,39 +312,32 @@ bool exists(const std::string & path)
bool canRead(const std::string & path)
{
struct stat st;
if (stat(path.c_str(), &st) == 0)
{
if (st.st_uid == geteuid())
return (st.st_mode & S_IRUSR) != 0;
else if (st.st_gid == getegid())
return (st.st_mode & S_IRGRP) != 0;
else
return (st.st_mode & S_IROTH) != 0 || geteuid() == 0;
}
int err = faccessat(AT_FDCWD, path.c_str(), R_OK, AT_EACCESS);
if (err == 0)
return true;
if (errno == EACCES)
return false;
DB::throwFromErrnoWithPath("Cannot check read access to file: " + path, path, DB::ErrorCodes::PATH_ACCESS_DENIED);
}
bool canWrite(const std::string & path)
{
struct stat st;
if (stat(path.c_str(), &st) == 0)
{
if (st.st_uid == geteuid())
return (st.st_mode & S_IWUSR) != 0;
else if (st.st_gid == getegid())
return (st.st_mode & S_IWGRP) != 0;
else
return (st.st_mode & S_IWOTH) != 0 || geteuid() == 0;
}
int err = faccessat(AT_FDCWD, path.c_str(), W_OK, AT_EACCESS);
if (err == 0)
return true;
if (errno == EACCES)
return false;
DB::throwFromErrnoWithPath("Cannot check write access to file: " + path, path, DB::ErrorCodes::PATH_ACCESS_DENIED);
}
bool canExecute(const std::string & path)
{
if (exists(path))
return faccessat(AT_FDCWD, path.c_str(), X_OK, AT_EACCESS) == 0;
DB::throwFromErrnoWithPath("Cannot check execute access to file: " + path, path, DB::ErrorCodes::PATH_ACCESS_DENIED);
int err = faccessat(AT_FDCWD, path.c_str(), X_OK, AT_EACCESS);
if (err == 0)
return true;
if (errno == EACCES)
return false;
DB::throwFromErrnoWithPath("Cannot check write access to file: " + path, path, DB::ErrorCodes::PATH_ACCESS_DENIED);
}
time_t getModificationTime(const std::string & path)

View File

@ -194,6 +194,9 @@ inline bool parseIPv6(T * &src, EOFfunction eof, unsigned char * dst, int32_t fi
if (groups <= 1 && zptr == nullptr) /// IPv4 block can't be the first
return clear_dst();
if (group_start) /// first octet of IPv4 should be already parsed as an IPv6 group
return clear_dst();
++src;
if (eof())
return clear_dst();

54
src/Common/shuffle.h Normal file
View File

@ -0,0 +1,54 @@
#pragma once
#include <iterator>
#include <random>
#include <utility>
/* Reorders the elements in the given range [first, last) such that each
* possible permutation of those elements has equal probability of appearance.
*
* for i [0, n-2):
* j random from [i, n)
* swap arr[i] arr[j]
*/
template <typename Iter, typename Rng>
void shuffle(Iter first, Iter last, Rng && rng)
{
using diff_t = typename std::iterator_traits<Iter>::difference_type;
using distr_t = std::uniform_int_distribution<diff_t>;
using param_t = typename distr_t::param_type;
distr_t d;
diff_t n = last - first;
for (ssize_t i = 0; i < n - 1; ++i)
{
using std::swap;
auto j = d(rng, param_t(i, n - 1));
swap(first[i], first[j]);
}
}
/* Partially shuffle elements in range [first, last) in such a way that
* [first, first + limit) is a random subset of the original range.
* [first + limit, last) shall contain the elements not in [first, first + limit)
* in undefined order.
*
* for i [0, limit):
* j random from [i, n)
* swap arr[i] arr[j]
*/
template <typename Iter, typename Rng>
void partial_shuffle(Iter first, Iter last, size_t limit, Rng && rng)
{
using diff_t = typename std::iterator_traits<Iter>::difference_type;
using distr_t = std::uniform_int_distribution<diff_t>;
using param_t = typename distr_t::param_type;
distr_t d;
diff_t n = last - first;
for (size_t i = 0; i < limit; ++i)
{
using std::swap;
auto j = d(rng, param_t(i, n - 1));
swap(first[i], first[j]);
}
}

View File

@ -28,6 +28,7 @@ protected:
bool isCompression() const override { return false; }
bool isGenericCompression() const override { return false; }
bool isDeltaCompression() const override { return true; }
private:
const UInt8 delta_bytes_size;

View File

@ -133,6 +133,7 @@ protected:
bool isCompression() const override { return true; }
bool isGenericCompression() const override { return false; }
bool isDeltaCompression() const override { return true; }
private:
UInt8 data_bytes_size;

View File

@ -85,8 +85,9 @@ ASTPtr CompressionCodecFactory::validateCodecAndGetPreprocessedAST(
bool with_compression_codec = false;
bool with_none_codec = false;
bool with_floating_point_timeseries_codec = false;
std::optional<size_t> generic_compression_codec_pos;
std::optional<size_t> first_generic_compression_codec_pos;
std::optional<size_t> first_delta_codec_pos;
std::optional<size_t> last_floating_point_time_series_codec_pos;
std::set<size_t> encryption_codecs_pos;
bool can_substitute_codec_arguments = true;
@ -163,10 +164,15 @@ ASTPtr CompressionCodecFactory::validateCodecAndGetPreprocessedAST(
with_compression_codec |= result_codec->isCompression();
with_none_codec |= result_codec->isNone();
with_floating_point_timeseries_codec |= result_codec->isFloatingPointTimeSeriesCodec();
if (!generic_compression_codec_pos && result_codec->isGenericCompression())
generic_compression_codec_pos = i;
if (result_codec->isGenericCompression() && !first_generic_compression_codec_pos.has_value())
first_generic_compression_codec_pos = i;
if (result_codec->isDeltaCompression() && !first_delta_codec_pos.has_value())
first_delta_codec_pos = i;
if (result_codec->isFloatingPointTimeSeriesCodec())
last_floating_point_time_series_codec_pos = i;
if (result_codec->isEncryption())
encryption_codecs_pos.insert(i);
@ -178,44 +184,55 @@ ASTPtr CompressionCodecFactory::validateCodecAndGetPreprocessedAST(
{
if (codecs_descriptions->children.size() > 1 && with_none_codec)
throw Exception(ErrorCodes::BAD_ARGUMENTS,
"It does not make sense to have codec NONE along with other compression codecs: {}. "
"(Note: you can enable setting 'allow_suspicious_codecs' to skip this check).",
codec_description);
"It does not make sense to have codec NONE along with other compression codecs: {}. "
"(Note: you can enable setting 'allow_suspicious_codecs' to skip this check).",
codec_description);
/// Allow to explicitly specify single NONE codec if user don't want any compression.
/// But applying other transformations solely without compression (e.g. Delta) does not make sense.
/// It's okay to apply encryption codecs solely without anything else.
if (!with_compression_codec && !with_none_codec && encryption_codecs_pos.size() != codecs_descriptions->children.size())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Compression codec {} does not compress anything. "
throw Exception(ErrorCodes::BAD_ARGUMENTS,
"Compression codec {} does not compress anything. "
"You may want to add generic compression algorithm after other transformations, like: {}, LZ4. "
"(Note: you can enable setting 'allow_suspicious_codecs' to skip this check).",
codec_description, codec_description);
/// It does not make sense to apply any non-encryption codecs
/// after encryption one.
/// It does not make sense to apply any non-encryption codecs after encryption one.
if (!encryption_codecs_pos.empty() &&
*encryption_codecs_pos.begin() != codecs_descriptions->children.size() - encryption_codecs_pos.size())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "The combination of compression codecs {} is meaningless, "
"because it does not make sense to apply any non-post-processing codecs after "
"post-processing ones. (Note: you can enable setting 'allow_suspicious_codecs' "
"to skip this check).", codec_description);
throw Exception(ErrorCodes::BAD_ARGUMENTS,
"The combination of compression codecs {} is meaningless, "
"because it does not make sense to apply any non-post-processing codecs after "
"post-processing ones. (Note: you can enable setting 'allow_suspicious_codecs' "
"to skip this check).", codec_description);
/// Floating-point time series codecs are not supposed to compress non-floating-point data
if (with_floating_point_timeseries_codec &&
column_type && !innerDataTypeIsFloat(column_type))
if (last_floating_point_time_series_codec_pos.has_value()
&& column_type && !innerDataTypeIsFloat(column_type))
throw Exception(ErrorCodes::BAD_ARGUMENTS,
"The combination of compression codecs {} is meaningless,"
" because it does not make sense to apply a floating-point time series codec to non-floating-point columns"
" (Note: you can enable setting 'allow_suspicious_codecs' to skip this check).", codec_description);
/// Floating-point time series codecs usually do implicit delta compression (or something equivalent), and makes no sense to run
/// delta compression manually.
if (first_delta_codec_pos.has_value() && last_floating_point_time_series_codec_pos.has_value()
&& (*first_delta_codec_pos < *last_floating_point_time_series_codec_pos))
throw Exception(ErrorCodes::BAD_ARGUMENTS,
"The combination of compression codecs {} is meaningless,"
" because floating point time series codecs do delta compression implicitly by themselves."
" (Note: you can enable setting 'allow_suspicious_codecs' to skip this check).", codec_description);
/// It does not make sense to apply any transformations after generic compression algorithm
/// So, generic compression can be only one and only at the end.
if (generic_compression_codec_pos &&
*generic_compression_codec_pos != codecs_descriptions->children.size() - 1 - encryption_codecs_pos.size())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "The combination of compression codecs {} is meaningless, "
"because it does not make sense to apply any transformations after generic "
"compression algorithm. (Note: you can enable setting 'allow_suspicious_codecs' "
"to skip this check).", codec_description);
if (first_generic_compression_codec_pos &&
*first_generic_compression_codec_pos != codecs_descriptions->children.size() - 1 - encryption_codecs_pos.size())
throw Exception(ErrorCodes::BAD_ARGUMENTS,
"The combination of compression codecs {} is meaningless, "
"because it does not make sense to apply any transformations after generic "
"compression algorithm. (Note: you can enable setting 'allow_suspicious_codecs' "
"to skip this check).", codec_description);
}

View File

@ -102,6 +102,9 @@ public:
/// If it is a specialized codec for floating-point time series. Applying it to non-floating point data is suspicious.
virtual bool isFloatingPointTimeSeriesCodec() const { return false; }
/// If the codec's purpose is to calculate deltas between consecutive values.
virtual bool isDeltaCompression() const { return false; }
/// It is a codec available only for evaluation purposes and not meant to be used in production.
/// It will not be allowed to use unless the user will turn off the safety switch.
virtual bool isExperimental() const { return false; }

View File

@ -1,15 +1,18 @@
#include <Coordination/Changelog.h>
#include <IO/WriteHelpers.h>
#include <IO/ReadHelpers.h>
#include <IO/ReadBufferFromFile.h>
#include <IO/ZstdDeflatingAppendableWriteBuffer.h>
#include <filesystem>
#include <boost/algorithm/string/split.hpp>
#include <Coordination/Changelog.h>
#include <IO/ReadBufferFromFile.h>
#include <IO/ReadHelpers.h>
#include <IO/WriteHelpers.h>
#include <IO/ZstdDeflatingAppendableWriteBuffer.h>
#include <boost/algorithm/string/join.hpp>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/trim.hpp>
#include <Common/filesystemHelpers.h>
#include <Common/Exception.h>
#include <Common/SipHash.h>
#include <Common/logger_useful.h>
#include <IO/WriteBufferFromFile.h>
#include <base/errnoToString.h>
namespace DB
@ -19,7 +22,6 @@ namespace ErrorCodes
{
extern const int CHECKSUM_DOESNT_MATCH;
extern const int CORRUPTED_DATA;
extern const int UNSUPPORTED_METHOD;
extern const int UNKNOWN_FORMAT_VERSION;
extern const int LOGICAL_ERROR;
}
@ -29,27 +31,35 @@ namespace
constexpr auto DEFAULT_PREFIX = "changelog";
std::string formatChangelogPath(const std::string & prefix, const ChangelogFileDescription & name)
std::string formatChangelogPath(
const std::string & prefix, const std::string & name_prefix, uint64_t from_index, uint64_t to_index, const std::string & extension)
{
std::filesystem::path path(prefix);
path /= std::filesystem::path(name.prefix + "_" + std::to_string(name.from_log_index) + "_" + std::to_string(name.to_log_index) + "." + name.extension);
path /= std::filesystem::path(fmt::format("{}_{}_{}.{}", name_prefix, from_index, to_index, extension));
return path;
}
ChangelogFileDescription getChangelogFileDescription(const std::filesystem::path & path)
ChangelogFileDescriptionPtr getChangelogFileDescription(const std::filesystem::path & path)
{
std::string filename = path.stem();
// we can have .bin.zstd so we cannot use std::filesystem stem and extension
std::string filename_with_extension = path.filename();
std::string_view filename_with_extension_view = filename_with_extension;
auto first_dot = filename_with_extension.find('.');
if (first_dot == std::string::npos)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Invalid changelog file {}", path.generic_string());
Strings filename_parts;
boost::split(filename_parts, filename, boost::is_any_of("_"));
boost::split(filename_parts, filename_with_extension_view.substr(0, first_dot), boost::is_any_of("_"));
if (filename_parts.size() < 3)
throw Exception(ErrorCodes::CORRUPTED_DATA, "Invalid changelog {}", path.generic_string());
ChangelogFileDescription result;
result.prefix = filename_parts[0];
result.from_log_index = parse<uint64_t>(filename_parts[1]);
result.to_log_index = parse<uint64_t>(filename_parts[2]);
result.extension = path.extension();
result.path = path.generic_string();
auto result = std::make_shared<ChangelogFileDescription>();
result->prefix = filename_parts[0];
result->from_log_index = parse<uint64_t>(filename_parts[1]);
result->to_log_index = parse<uint64_t>(filename_parts[2]);
result->extension = std::string(filename_with_extension.substr(first_dot + 1));
result->path = path.generic_string();
return result;
}
@ -68,81 +78,298 @@ Checksum computeRecordChecksum(const ChangelogRecord & record)
}
/// Appendable log writer
/// New file on disk will be created when:
/// - we have already "rotation_interval" amount of logs in a single file
/// - maximum log file size is reached
/// At least 1 log record should be contained in each log
class ChangelogWriter
{
public:
ChangelogWriter(const std::string & filepath_, WriteMode mode, uint64_t start_index_)
: filepath(filepath_)
, file_buf(std::make_unique<WriteBufferFromFile>(filepath, DBMS_DEFAULT_BUFFER_SIZE, mode == WriteMode::Rewrite ? -1 : (O_APPEND | O_CREAT | O_WRONLY)))
, start_index(start_index_)
ChangelogWriter(
std::map<uint64_t, ChangelogFileDescriptionPtr> & existing_changelogs_,
const std::filesystem::path & changelogs_dir_,
LogFileSettings log_file_settings_)
: existing_changelogs(existing_changelogs_)
, log_file_settings(log_file_settings_)
, changelogs_dir(changelogs_dir_)
, log(&Poco::Logger::get("Changelog"))
{
auto compression_method = chooseCompressionMethod(filepath_, "");
if (compression_method != CompressionMethod::Zstd && compression_method != CompressionMethod::None)
}
void setFile(ChangelogFileDescriptionPtr file_description, WriteMode mode)
{
try
{
throw Exception(ErrorCodes::UNSUPPORTED_METHOD, "Unsupported coordination log serialization format {}",
toContentEncodingName(compression_method));
if (mode == WriteMode::Append && file_description->expectedEntriesCountInLog() != log_file_settings.rotate_interval)
LOG_TRACE(
log,
"Looks like rotate_logs_interval was changed, current {}, expected entries in last log {}",
log_file_settings.rotate_interval,
file_description->expectedEntriesCountInLog());
// we have a file we need to finalize first
if (tryGetFileBuffer() && prealloc_done)
{
finalizeCurrentFile();
assert(current_file_description);
// if we wrote at least 1 log in the log file we can rename the file to reflect correctly the
// contained logs
// file can be deleted from disk earlier by compaction
if (!current_file_description->deleted && last_index_written
&& *last_index_written != current_file_description->to_log_index)
{
auto new_path = formatChangelogPath(
changelogs_dir,
current_file_description->prefix,
current_file_description->from_log_index,
*last_index_written,
current_file_description->extension);
std::filesystem::rename(current_file_description->path, new_path);
current_file_description->path = std::move(new_path);
}
}
file_buf = std::make_unique<WriteBufferFromFile>(
file_description->path, DBMS_DEFAULT_BUFFER_SIZE, mode == WriteMode::Rewrite ? -1 : (O_APPEND | O_CREAT | O_WRONLY));
last_index_written.reset();
current_file_description = std::move(file_description);
if (log_file_settings.compress_logs)
compressed_buffer = std::make_unique<ZstdDeflatingAppendableWriteBuffer>(std::move(file_buf), /* compression level = */ 3, /* append_to_existing_file_ = */ mode == WriteMode::Append);
prealloc_done = false;
}
else if (compression_method == CompressionMethod::Zstd)
catch (...)
{
compressed_buffer = std::make_unique<ZstdDeflatingAppendableWriteBuffer>(
std::move(file_buf), /* compression level = */ 3, /* append_to_existing_file_ = */ mode == WriteMode::Append);
tryLogCurrentException(log);
throw;
}
}
bool isFileSet() const { return tryGetFileBuffer() != nullptr; }
bool appendRecord(ChangelogRecord && record)
{
const auto * file_buffer = tryGetFileBuffer();
assert(file_buffer && current_file_description);
assert(record.header.index - getStartIndex() <= current_file_description->expectedEntriesCountInLog());
// check if log file reached the limit for amount of records it can contain
if (record.header.index - getStartIndex() == current_file_description->expectedEntriesCountInLog())
{
rotate(record.header.index);
}
else
{
/// no compression, only file buffer
// writing at least 1 log is requirement - we don't want empty log files
// we use count() that can be unreliable for more complex WriteBuffers, so we should be careful if we change the type of it in the future
const bool log_too_big = record.header.index != getStartIndex() && log_file_settings.max_size != 0
&& initial_file_size + file_buffer->count() > log_file_settings.max_size;
if (log_too_big)
{
LOG_TRACE(log, "Log file reached maximum allowed size ({} bytes), creating new log file", log_file_settings.max_size);
rotate(record.header.index);
}
}
}
if (!prealloc_done) [[unlikely]]
{
tryPreallocateForFile();
void appendRecord(ChangelogRecord && record)
{
writeIntBinary(computeRecordChecksum(record), getBuffer());
if (!prealloc_done)
return false;
}
writeIntBinary(record.header.version, getBuffer());
writeIntBinary(record.header.index, getBuffer());
writeIntBinary(record.header.term, getBuffer());
writeIntBinary(record.header.value_type, getBuffer());
writeIntBinary(record.header.blob_size, getBuffer());
auto & write_buffer = getBuffer();
writeIntBinary(computeRecordChecksum(record), write_buffer);
writeIntBinary(record.header.version, write_buffer);
writeIntBinary(record.header.index, write_buffer);
writeIntBinary(record.header.term, write_buffer);
writeIntBinary(record.header.value_type, write_buffer);
writeIntBinary(record.header.blob_size, write_buffer);
if (record.header.blob_size != 0)
getBuffer().write(reinterpret_cast<char *>(record.blob->data_begin()), record.blob->size());
}
write_buffer.write(reinterpret_cast<char *>(record.blob->data_begin()), record.blob->size());
void flush(bool force_fsync)
{
if (compressed_buffer)
{
/// Flush compressed data to WriteBufferFromFile working_buffer
/// Flush compressed data to file buffer
compressed_buffer->next();
}
WriteBuffer * working_buf = compressed_buffer ? compressed_buffer->getNestedBuffer() : file_buf.get();
last_index_written = record.header.index;
/// Flush working buffer to file system
working_buf->next();
return true;
}
void flush()
{
auto * file_buffer = tryGetFileBuffer();
/// Fsync file system if needed
if (force_fsync)
working_buf->sync();
if (file_buffer && log_file_settings.force_sync)
file_buffer->sync();
}
uint64_t getStartIndex() const
{
return start_index;
assert(current_file_description);
return current_file_description->from_log_index;
}
void rotate(uint64_t new_start_log_index)
{
/// Start new one
auto new_description = std::make_shared<ChangelogFileDescription>();
new_description->prefix = DEFAULT_PREFIX;
new_description->from_log_index = new_start_log_index;
new_description->to_log_index = new_start_log_index + log_file_settings.rotate_interval - 1;
new_description->extension = "bin";
if (log_file_settings.compress_logs)
new_description->extension += "." + toContentEncodingName(CompressionMethod::Zstd);
new_description->path = formatChangelogPath(
changelogs_dir,
new_description->prefix,
new_start_log_index,
new_start_log_index + log_file_settings.rotate_interval - 1,
new_description->extension);
LOG_TRACE(log, "Starting new changelog {}", new_description->path);
auto [it, inserted] = existing_changelogs.insert(std::make_pair(new_start_log_index, std::move(new_description)));
setFile(it->second, WriteMode::Rewrite);
}
void finalize()
{
if (isFileSet() && prealloc_done)
finalizeCurrentFile();
}
private:
void finalizeCurrentFile()
{
const auto * file_buffer = tryGetFileBuffer();
assert(file_buffer && prealloc_done);
assert(current_file_description);
// compact can delete the file and we don't need to do anything
if (current_file_description->deleted)
{
LOG_WARNING(log, "Log {} is already deleted", file_buffer->getFileName());
return;
}
if (log_file_settings.compress_logs)
compressed_buffer->finalize();
flush();
if (log_file_settings.max_size != 0)
ftruncate(file_buffer->getFD(), initial_file_size + file_buffer->count());
if (log_file_settings.compress_logs)
compressed_buffer.reset();
else
file_buf.reset();
}
WriteBuffer & getBuffer()
{
if (compressed_buffer)
return *compressed_buffer;
return *file_buf;
if (file_buf)
return *file_buf;
throw Exception(ErrorCodes::LOGICAL_ERROR, "Log writer wasn't initialized for any file");
}
std::string filepath;
WriteBufferFromFile & getFileBuffer()
{
auto * file_buffer = tryGetFileBuffer();
if (!file_buffer)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Log writer wasn't initialized for any file");
return *file_buffer;
}
const WriteBufferFromFile * tryGetFileBuffer() const
{
return const_cast<ChangelogWriter *>(this)->tryGetFileBuffer();
}
WriteBufferFromFile * tryGetFileBuffer()
{
if (compressed_buffer)
return dynamic_cast<WriteBufferFromFile *>(compressed_buffer->getNestedBuffer());
if (file_buf)
return file_buf.get();
return nullptr;
}
void tryPreallocateForFile()
{
if (log_file_settings.max_size == 0)
{
initial_file_size = 0;
prealloc_done = true;
return;
}
const auto & file_buffer = getFileBuffer();
#ifdef OS_LINUX
{
int res = -1;
do
{
res = fallocate(file_buffer.getFD(), FALLOC_FL_KEEP_SIZE, 0, log_file_settings.max_size + log_file_settings.overallocate_size);
} while (res < 0 && errno == EINTR);
if (res != 0)
{
if (errno == ENOSPC)
{
LOG_FATAL(log, "Failed to allocate enough space on disk for logs");
return;
}
LOG_WARNING(log, "Could not preallocate space on disk using fallocate. Error: {}, errno: {}", errnoToString(), errno);
}
}
#endif
initial_file_size = getSizeFromFileDescriptor(file_buffer.getFD());
prealloc_done = true;
}
std::map<uint64_t, ChangelogFileDescriptionPtr> & existing_changelogs;
ChangelogFileDescriptionPtr current_file_description{nullptr};
std::unique_ptr<WriteBufferFromFile> file_buf;
std::optional<uint64_t> last_index_written;
size_t initial_file_size{0};
std::unique_ptr<ZstdDeflatingAppendableWriteBuffer> compressed_buffer;
uint64_t start_index;
bool prealloc_done{false};
LogFileSettings log_file_settings;
const std::filesystem::path changelogs_dir;
Poco::Logger * const log;
};
struct ChangelogReadResult
@ -170,8 +397,7 @@ struct ChangelogReadResult
class ChangelogReader
{
public:
explicit ChangelogReader(const std::string & filepath_)
: filepath(filepath_)
explicit ChangelogReader(const std::string & filepath_) : filepath(filepath_)
{
auto compression_method = chooseCompressionMethod(filepath, "");
auto read_buffer_from_file = std::make_unique<ReadBufferFromFile>(filepath);
@ -200,7 +426,8 @@ public:
readIntBinary(record.header.blob_size, *read_buf);
if (record.header.version > CURRENT_CHANGELOG_VERSION)
throw Exception(ErrorCodes::UNKNOWN_FORMAT_VERSION, "Unsupported changelog version {} on path {}", record.header.version, filepath);
throw Exception(
ErrorCodes::UNKNOWN_FORMAT_VERSION, "Unsupported changelog version {} on path {}", record.header.version, filepath);
/// Read data
if (record.header.blob_size != 0)
@ -217,14 +444,18 @@ public:
Checksum checksum = computeRecordChecksum(record);
if (checksum != record_checksum)
{
throw Exception(ErrorCodes::CHECKSUM_DOESNT_MATCH,
"Checksums doesn't match for log {} (version {}), index {}, blob_size {}",
filepath, record.header.version, record.header.index, record.header.blob_size);
throw Exception(
ErrorCodes::CHECKSUM_DOESNT_MATCH,
"Checksums doesn't match for log {} (version {}), index {}, blob_size {}",
filepath,
record.header.version,
record.header.index,
record.header.blob_size);
}
/// Check for duplicated changelog ids
if (logs.contains(record.header.index))
std::erase_if(logs, [&record] (const auto & item) { return item.first >= record.header.index; });
std::erase_if(logs, [&record](const auto & item) { return item.first >= record.header.index; });
result.total_entries_read_from_log += 1;
@ -271,16 +502,12 @@ private:
Changelog::Changelog(
const std::string & changelogs_dir_,
uint64_t rotate_interval_,
bool force_sync_,
Poco::Logger * log_,
bool compress_logs_)
LogFileSettings log_file_settings)
: changelogs_dir(changelogs_dir_)
, changelogs_detached_dir(changelogs_dir / "detached")
, rotate_interval(rotate_interval_)
, force_sync(force_sync_)
, rotate_interval(log_file_settings.rotate_interval)
, log(log_)
, compress_logs(compress_logs_)
, write_operations(std::numeric_limits<size_t>::max())
, append_completion_queue(std::numeric_limits<size_t>::max())
{
@ -295,7 +522,7 @@ Changelog::Changelog(
continue;
auto file_description = getChangelogFileDescription(p.path());
existing_changelogs[file_description.from_log_index] = file_description;
existing_changelogs[file_description->from_log_index] = std::move(file_description);
}
if (existing_changelogs.empty())
@ -306,6 +533,9 @@ Changelog::Changelog(
write_thread = ThreadFromGlobalPool([this] { writeThread(); });
append_completion_thread = ThreadFromGlobalPool([this] { appendCompletionThread(); });
current_writer = std::make_unique<ChangelogWriter>(
existing_changelogs, changelogs_dir, log_file_settings);
}
void Changelog::readChangelogAndInitWriter(uint64_t last_commited_log_index, uint64_t logs_to_keep)
@ -326,9 +556,9 @@ void Changelog::readChangelogAndInitWriter(uint64_t last_commited_log_index, uin
start_to_read_from = 1;
/// Got through changelog files in order of start_index
for (const auto & [changelog_start_index, changelog_description] : existing_changelogs)
for (const auto & [changelog_start_index, changelog_description_ptr] : existing_changelogs)
{
const auto & changelog_description = *changelog_description_ptr;
/// [from_log_index.>=.......start_to_read_from.....<=.to_log_index]
if (changelog_description.to_log_index >= start_to_read_from)
{
@ -337,26 +567,42 @@ void Changelog::readChangelogAndInitWriter(uint64_t last_commited_log_index, uin
/// Our first log starts from the more fresh log_id than we required to read and this changelog is not empty log.
/// So we are missing something in our logs, but it's not dataloss, we will receive snapshot and required
/// entries from leader.
if (changelog_description.from_log_index > last_commited_log_index && (changelog_description.from_log_index - last_commited_log_index) > 1)
if (changelog_description.from_log_index > last_commited_log_index
&& (changelog_description.from_log_index - last_commited_log_index) > 1)
{
LOG_ERROR(log, "Some records were lost, last committed log index {}, smallest available log index on disk {}. Hopefully will receive missing records from leader.", last_commited_log_index, changelog_description.from_log_index);
LOG_ERROR(
log,
"Some records were lost, last committed log index {}, smallest available log index on disk {}. Hopefully will "
"receive missing records from leader.",
last_commited_log_index,
changelog_description.from_log_index);
/// Nothing to do with our more fresh log, leader will overwrite them, so remove everything and just start from last_commited_index
removeAllLogs();
min_log_id = last_commited_log_index;
max_log_id = last_commited_log_index == 0 ? 0 : last_commited_log_index - 1;
rotate(max_log_id + 1, writer_lock);
current_writer->rotate(max_log_id + 1);
initialized = true;
return;
}
else if (changelog_description.from_log_index > start_to_read_from)
{
/// We don't have required amount of reserved logs, but nothing was lost.
LOG_WARNING(log, "Don't have required amount of reserved log records. Need to read from {}, smallest available log index on disk {}.", start_to_read_from, changelog_description.from_log_index);
LOG_WARNING(
log,
"Don't have required amount of reserved log records. Need to read from {}, smallest available log index on disk "
"{}.",
start_to_read_from,
changelog_description.from_log_index);
}
}
else if ((changelog_description.from_log_index - last_log_read_result->last_read_index) > 1)
{
LOG_ERROR(log, "Some records were lost, last found log index {}, while the next log index on disk is {}. Hopefully will receive missing records from leader.", last_log_read_result->last_read_index, changelog_description.from_log_index);
LOG_ERROR(
log,
"Some records were lost, last found log index {}, while the next log index on disk is {}. Hopefully will receive "
"missing records from leader.",
last_log_read_result->last_read_index,
changelog_description.from_log_index);
removeAllLogsAfter(last_log_read_result->log_start_index);
break;
}
@ -378,10 +624,10 @@ void Changelog::readChangelogAndInitWriter(uint64_t last_commited_log_index, uin
max_log_id = last_log_read_result->last_read_index;
/// How many entries we have in the last changelog
uint64_t expected_entries_in_log = changelog_description.expectedEntriesCountInLog();
uint64_t log_count = changelog_description.expectedEntriesCountInLog();
/// Unfinished log
if (last_log_read_result->error || last_log_read_result->total_entries_read_from_log < expected_entries_in_log)
if (last_log_read_result->error || last_log_read_result->total_entries_read_from_log < log_count)
{
last_log_is_not_complete = true;
break;
@ -400,7 +646,11 @@ void Changelog::readChangelogAndInitWriter(uint64_t last_commited_log_index, uin
}
else if (last_commited_log_index != 0 && max_log_id < last_commited_log_index - 1) /// If we have more fresh snapshot than our logs
{
LOG_WARNING(log, "Our most fresh log_id {} is smaller than stored data in snapshot {}. It can indicate data loss. Removing outdated logs.", max_log_id, last_commited_log_index - 1);
LOG_WARNING(
log,
"Our most fresh log_id {} is smaller than stored data in snapshot {}. It can indicate data loss. Removing outdated logs.",
max_log_id,
last_commited_log_index - 1);
removeAllLogs();
min_log_id = last_commited_log_index;
@ -419,14 +669,14 @@ void Changelog::readChangelogAndInitWriter(uint64_t last_commited_log_index, uin
assert(existing_changelogs.find(last_log_read_result->log_start_index)->first == existing_changelogs.rbegin()->first);
/// Continue to write into incomplete existing log if it doesn't finished with error
auto description = existing_changelogs[last_log_read_result->log_start_index];
const auto & description = existing_changelogs[last_log_read_result->log_start_index];
if (last_log_read_result->last_read_index == 0 || last_log_read_result->error) /// If it's broken log then remove it
{
LOG_INFO(log, "Removing chagelog {} because it's empty or read finished with error", description.path);
std::filesystem::remove(description.path);
LOG_INFO(log, "Removing chagelog {} because it's empty or read finished with error", description->path);
std::filesystem::remove(description->path);
existing_changelogs.erase(last_log_read_result->log_start_index);
std::erase_if(logs, [last_log_read_result] (const auto & item) { return item.first >= last_log_read_result->log_start_index; });
std::erase_if(logs, [last_log_read_result](const auto & item) { return item.first >= last_log_read_result->log_start_index; });
}
else
{
@ -435,20 +685,17 @@ void Changelog::readChangelogAndInitWriter(uint64_t last_commited_log_index, uin
}
/// Start new log if we don't initialize writer from previous log. All logs can be "complete".
if (!current_writer)
rotate(max_log_id + 1, writer_lock);
if (!current_writer->isFileSet())
current_writer->rotate(max_log_id + 1);
initialized = true;
}
void Changelog::initWriter(const ChangelogFileDescription & description)
void Changelog::initWriter(ChangelogFileDescriptionPtr description)
{
if (description.expectedEntriesCountInLog() != rotate_interval)
LOG_TRACE(log, "Looks like rotate_logs_interval was changed, current {}, expected entries in last log {}", rotate_interval, description.expectedEntriesCountInLog());
LOG_TRACE(log, "Continue to write into {}", description.path);
current_writer = std::make_unique<ChangelogWriter>(description.path, WriteMode::Append, description.from_log_index);
LOG_TRACE(log, "Continue to write into {}", description->path);
current_writer->setFile(std::move(description), WriteMode::Append);
}
namespace
@ -481,8 +728,8 @@ void Changelog::removeExistingLogs(ChangelogIter begin, ChangelogIter end)
std::filesystem::create_directories(timestamp_folder);
}
LOG_WARNING(log, "Removing changelog {}", itr->second.path);
const std::filesystem::path path = itr->second.path;
LOG_WARNING(log, "Removing changelog {}", itr->second->path);
const std::filesystem::path & path = itr->second->path;
const auto new_path = timestamp_folder / path.filename();
std::filesystem::rename(path, new_path);
itr = existing_changelogs.erase(itr);
@ -501,7 +748,7 @@ void Changelog::removeAllLogsAfter(uint64_t remove_after_log_start_index)
LOG_WARNING(log, "Removing changelogs that go after broken changelog entry");
removeExistingLogs(start_to_remove_from_itr, existing_changelogs.end());
std::erase_if(logs, [start_to_remove_from_log_id] (const auto & item) { return item.first >= start_to_remove_from_log_id; });
std::erase_if(logs, [start_to_remove_from_log_id](const auto & item) { return item.first >= start_to_remove_from_log_id; });
}
void Changelog::removeAllLogs()
@ -511,29 +758,6 @@ void Changelog::removeAllLogs()
logs.clear();
}
void Changelog::rotate(uint64_t new_start_log_index, std::lock_guard<std::mutex> &)
{
/// Flush previous log
if (current_writer)
current_writer->flush(force_sync);
/// Start new one
ChangelogFileDescription new_description;
new_description.prefix = DEFAULT_PREFIX;
new_description.from_log_index = new_start_log_index;
new_description.to_log_index = new_start_log_index + rotate_interval - 1;
new_description.extension = "bin";
if (compress_logs)
new_description.extension += "." + toContentEncodingName(CompressionMethod::Zstd);
new_description.path = formatChangelogPath(changelogs_dir, new_description);
LOG_TRACE(log, "Starting new changelog {}", new_description.path);
existing_changelogs[new_start_log_index] = new_description;
current_writer = std::make_unique<ChangelogWriter>(new_description.path, WriteMode::Rewrite, new_start_log_index);
}
ChangelogRecord Changelog::buildRecord(uint64_t index, const LogEntryPtr & log_entry)
{
ChangelogRecord record;
@ -553,12 +777,15 @@ ChangelogRecord Changelog::buildRecord(uint64_t index, const LogEntryPtr & log_e
}
void Changelog::appendCompletionThread()
{
uint64_t flushed_index = 0;
while (append_completion_queue.pop(flushed_index))
bool append_ok = false;
while (append_completion_queue.pop(append_ok))
{
if (!append_ok)
current_writer->finalize();
// we shouldn't start the raft_server before sending it here
if (auto raft_server_locked = raft_server.lock())
raft_server_locked->notify_log_append_completion(true);
raft_server_locked->notify_log_append_completion(append_ok);
else
LOG_WARNING(log, "Raft server is not set in LogStore.");
}
@ -567,36 +794,40 @@ void Changelog::appendCompletionThread()
void Changelog::writeThread()
{
WriteOperation write_operation;
bool batch_append_ok = true;
while (write_operations.pop(write_operation))
{
assert(initialized);
if (auto * append_log = std::get_if<AppendLog>(&write_operation))
{
if (!batch_append_ok)
continue;
std::lock_guard writer_lock(writer_mutex);
assert(current_writer);
const auto & current_changelog_description = existing_changelogs[current_writer->getStartIndex()];
const bool log_is_complete = append_log->index - current_writer->getStartIndex() == current_changelog_description.expectedEntriesCountInLog();
if (log_is_complete)
rotate(append_log->index, writer_lock);
current_writer->appendRecord(buildRecord(append_log->index, append_log->log_entry));
batch_append_ok = current_writer->appendRecord(buildRecord(append_log->index, append_log->log_entry));
}
else
{
const auto & flush = std::get<Flush>(write_operation);
if (batch_append_ok)
{
std::lock_guard writer_lock(writer_mutex);
if (current_writer)
current_writer->flush(force_sync);
}
{
std::lock_guard writer_lock(writer_mutex);
current_writer->flush();
}
{
std::lock_guard lock{durable_idx_mutex};
last_durable_idx = flush.index;
}
}
else
{
std::lock_guard lock{durable_idx_mutex};
last_durable_idx = flush.index;
*flush.failed = true;
}
durable_idx_cv.notify_all();
@ -605,8 +836,10 @@ void Changelog::writeThread()
// NuRaft will in some places wait for flush to be done while having the same global lock leading to deadlock
// -> future write operations are blocked by flush that cannot be completed because it cannot take NuRaft lock
// -> NuRaft won't leave lock until its flush is done
if (!append_completion_queue.push(flush.index))
if (!append_completion_queue.push(batch_append_ok))
LOG_WARNING(log, "Changelog is shut down");
batch_append_ok = true;
}
}
}
@ -642,21 +875,20 @@ void Changelog::writeAt(uint64_t index, const LogEntryPtr & log_entry)
{
auto index_changelog = existing_changelogs.lower_bound(index);
ChangelogFileDescription description;
ChangelogFileDescriptionPtr description{nullptr};
if (index_changelog->first == index) /// exactly this file starts from index
description = index_changelog->second;
else
description = std::prev(index_changelog)->second;
/// Initialize writer from this log file
current_writer = std::make_unique<ChangelogWriter>(description.path, WriteMode::Append, index_changelog->first);
current_writer->setFile(std::move(description), WriteMode::Append);
/// Remove all subsequent files if overwritten something in previous one
auto to_remove_itr = existing_changelogs.upper_bound(index);
for (auto itr = to_remove_itr; itr != existing_changelogs.end();)
{
std::filesystem::remove(itr->second.path);
std::filesystem::remove(itr->second->path);
itr = existing_changelogs.erase(itr);
}
}
@ -664,7 +896,7 @@ void Changelog::writeAt(uint64_t index, const LogEntryPtr & log_entry)
/// Remove redundant logs from memory
/// Everything >= index must be removed
std::erase_if(logs, [index] (const auto & item) { return item.first >= index; });
std::erase_if(logs, [index](const auto & item) { return item.first >= index; });
/// Now we can actually override entry at index
appendEntry(index, log_entry);
@ -690,28 +922,34 @@ void Changelog::compact(uint64_t up_to_log_index)
bool need_rotate = false;
for (auto itr = existing_changelogs.begin(); itr != existing_changelogs.end();)
{
auto & changelog_description = *itr->second;
/// Remove all completely outdated changelog files
if (remove_all_logs || itr->second.to_log_index <= up_to_log_index)
if (remove_all_logs || changelog_description.to_log_index <= up_to_log_index)
{
if (current_writer && itr->second.from_log_index == current_writer->getStartIndex())
if (current_writer && changelog_description.from_log_index == current_writer->getStartIndex())
{
LOG_INFO(log, "Trying to remove log {} which is current active log for write. Possibly this node recovers from snapshot", itr->second.path);
LOG_INFO(
log,
"Trying to remove log {} which is current active log for write. Possibly this node recovers from snapshot",
changelog_description.path);
need_rotate = true;
current_writer.reset();
}
LOG_INFO(log, "Removing changelog {} because of compaction", itr->second.path);
LOG_INFO(log, "Removing changelog {} because of compaction", changelog_description.path);
/// If failed to push to queue for background removing, then we will remove it now
if (!log_files_to_delete_queue.tryPush(itr->second.path, 1))
if (!log_files_to_delete_queue.tryPush(changelog_description.path, 1))
{
std::error_code ec;
std::filesystem::remove(itr->second.path, ec);
std::filesystem::remove(changelog_description.path, ec);
if (ec)
LOG_WARNING(log, "Failed to remove changelog {} in compaction, error message: {}", itr->second.path, ec.message());
LOG_WARNING(log, "Failed to remove changelog {} in compaction, error message: {}", changelog_description.path, ec.message());
else
LOG_INFO(log, "Removed changelog {} because of compaction", itr->second.path);
LOG_INFO(log, "Removed changelog {} because of compaction", changelog_description.path);
}
changelog_description.deleted = true;
itr = existing_changelogs.erase(itr);
}
else /// Files are ordered, so all subsequent should exist
@ -719,10 +957,10 @@ void Changelog::compact(uint64_t up_to_log_index)
}
/// Compaction from the past is possible, so don't make our min_log_id smaller.
min_log_id = std::max(min_log_id, up_to_log_index + 1);
std::erase_if(logs, [up_to_log_index] (const auto & item) { return item.first <= up_to_log_index; });
std::erase_if(logs, [up_to_log_index](const auto & item) { return item.first <= up_to_log_index; });
if (need_rotate)
rotate(up_to_log_index + 1, lock);
current_writer->rotate(up_to_log_index + 1);
LOG_INFO(log, "Compaction up to {} finished new min index {}, new max index {}", up_to_log_index, min_log_id, max_log_id);
}
@ -824,24 +1062,35 @@ void Changelog::applyEntriesFromBuffer(uint64_t index, nuraft::buffer & buffer)
}
}
void Changelog::flush()
bool Changelog::flush()
{
if (flushAsync())
if (auto failed_ptr = flushAsync())
{
std::unique_lock lock{durable_idx_mutex};
durable_idx_cv.wait(lock, [&] { return last_durable_idx == max_log_id; });
durable_idx_cv.wait(lock, [&] { return *failed_ptr || last_durable_idx == max_log_id; });
return !*failed_ptr;
}
// if we are shutting down let's return true to avoid abort inside NuRaft
// this can only happen when the config change is appended so no data loss should happen
return true;
}
bool Changelog::flushAsync()
std::shared_ptr<bool> Changelog::flushAsync()
{
if (!initialized)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Changelog must be initialized before flushing records");
bool pushed = write_operations.push(Flush{max_log_id});
auto failed = std::make_shared<bool>(false);
bool pushed = write_operations.push(Flush{max_log_id, failed});
if (!pushed)
{
LOG_WARNING(log, "Changelog is shut down");
return pushed;
return nullptr;
}
return failed;
}
void Changelog::shutdown()
@ -863,6 +1112,12 @@ void Changelog::shutdown()
if (append_completion_thread.joinable())
append_completion_thread.join();
if (current_writer)
{
current_writer->finalize();
current_writer.reset();
}
}
Changelog::~Changelog()

View File

@ -1,14 +1,14 @@
#pragma once
#include <optional>
#include <city.h>
#include <Disks/IDisk.h>
#include <IO/CompressionMethod.h>
#include <IO/HashingWriteBuffer.h>
#include <IO/WriteBufferFromFile.h>
#include <base/defines.h>
#include <libnuraft/nuraft.hxx>
#include <libnuraft/raft_server.hxx>
#include <city.h>
#include <optional>
#include <base/defines.h>
#include <IO/WriteBufferFromFile.h>
#include <IO/HashingWriteBuffer.h>
#include <IO/CompressionMethod.h>
#include <Disks/IDisk.h>
#include <Common/ConcurrentBoundedQueue.h>
namespace DB
@ -60,24 +60,37 @@ struct ChangelogFileDescription
std::string path;
bool deleted = false;
/// How many entries should be stored in this log
uint64_t expectedEntriesCountInLog() const
{
return to_log_index - from_log_index + 1;
}
uint64_t expectedEntriesCountInLog() const { return to_log_index - from_log_index + 1; }
};
using ChangelogFileDescriptionPtr = std::shared_ptr<ChangelogFileDescription>;
class ChangelogWriter;
struct LogFileSettings
{
bool force_sync = true;
bool compress_logs = true;
uint64_t rotate_interval = 100000;
uint64_t max_size = 0;
uint64_t overallocate_size = 0;
};
/// Simplest changelog with files rotation.
/// No compression, no metadata, just entries with headers one by one.
/// Able to read broken files/entries and discard them. Not thread safe.
class Changelog
{
public:
Changelog(const std::string & changelogs_dir_, uint64_t rotate_interval_,
bool force_sync_, Poco::Logger * log_, bool compress_logs_ = true);
Changelog(
const std::string & changelogs_dir_,
Poco::Logger * log_,
LogFileSettings log_file_settings);
Changelog(Changelog &&) = delete;
/// Read changelog from files on changelogs_dir_ skipping all entries before from_log_index
/// Truncate broken entries, remove files after broken entries.
@ -92,15 +105,9 @@ public:
/// Remove log files with to_log_index <= up_to_log_index.
void compact(uint64_t up_to_log_index);
uint64_t getNextEntryIndex() const
{
return max_log_id + 1;
}
uint64_t getNextEntryIndex() const { return max_log_id + 1; }
uint64_t getStartIndex() const
{
return min_log_id;
}
uint64_t getStartIndex() const { return min_log_id; }
/// Last entry in log, or fake entry with term 0 if log is empty
LogEntryPtr getLastEntry() const;
@ -121,16 +128,13 @@ public:
void applyEntriesFromBuffer(uint64_t index, nuraft::buffer & buffer);
/// Fsync latest log to disk and flush buffer
void flush();
bool flush();
bool flushAsync();
std::shared_ptr<bool> flushAsync();
void shutdown();
uint64_t size() const
{
return logs.size();
}
uint64_t size() const { return logs.size(); }
uint64_t lastDurableIndex() const
{
@ -147,11 +151,8 @@ private:
/// Pack log_entry into changelog record
static ChangelogRecord buildRecord(uint64_t index, const LogEntryPtr & log_entry);
/// Starts new file [new_start_log_index, new_start_log_index + rotate_interval]
void rotate(uint64_t new_start_log_index, std::lock_guard<std::mutex> & writer_lock);
/// Currently existing changelogs
std::map<uint64_t, ChangelogFileDescription> existing_changelogs;
std::map<uint64_t, ChangelogFileDescriptionPtr> existing_changelogs;
using ChangelogIter = decltype(existing_changelogs)::iterator;
void removeExistingLogs(ChangelogIter begin, ChangelogIter end);
@ -162,7 +163,7 @@ private:
/// Remove all logs from disk
void removeAllLogs();
/// Init writer for existing log with some entries already written
void initWriter(const ChangelogFileDescription & description);
void initWriter(ChangelogFileDescriptionPtr description);
/// Clean useless log files in a background thread
void cleanLogThread();
@ -170,9 +171,7 @@ private:
const std::filesystem::path changelogs_dir;
const std::filesystem::path changelogs_detached_dir;
const uint64_t rotate_interval;
const bool force_sync;
Poco::Logger * log;
bool compress_logs;
std::mutex writer_mutex;
/// Current writer for changelog file
@ -197,6 +196,7 @@ private:
struct Flush
{
uint64_t index;
std::shared_ptr<bool> failed;
};
using WriteOperation = std::variant<AppendLog, Flush>;
@ -213,7 +213,7 @@ private:
void appendCompletionThread();
ThreadFromGlobalPool append_completion_thread;
ConcurrentBoundedQueue<uint64_t> append_completion_queue;
ConcurrentBoundedQueue<bool> append_completion_queue;
// last_durable_index needs to be exposed through const getter so we make mutex mutable
mutable std::mutex durable_idx_mutex;

View File

@ -44,7 +44,9 @@ struct Settings;
M(Bool, force_sync, true, "Call fsync on each change in RAFT changelog", 0) \
M(Bool, compress_logs, true, "Write compressed coordination logs in ZSTD format", 0) \
M(Bool, compress_snapshots_with_zstd_format, true, "Write compressed snapshots in ZSTD format (instead of custom LZ4)", 0) \
M(UInt64, configuration_change_tries_count, 20, "How many times we will try to apply configuration change (add/remove server) to the cluster", 0)
M(UInt64, configuration_change_tries_count, 20, "How many times we will try to apply configuration change (add/remove server) to the cluster", 0) \
M(UInt64, max_log_file_size, 50 * 1024 * 1024, "Max size of the Raft log file. If possible, each created log file will preallocate this amount of bytes on disk. Set to 0 to disable the limit", 0) \
M(UInt64, log_file_overallocate_size, 50 * 1024 * 1024, "If max_log_file_size is not set to 0, this value will be added to it for preallocating bytes on disk. If a log record is larger than this value, it could lead to uncaught out-of-space issues so a larger value is preferred", 0)
DECLARE_SETTINGS_TRAITS(CoordinationSettingsTraits, LIST_OF_COORDINATION_SETTINGS)

View File

@ -4,11 +4,12 @@
namespace DB
{
KeeperLogStore::KeeperLogStore(const std::string & changelogs_path, uint64_t rotate_interval_, bool force_sync_, bool compress_logs_)
KeeperLogStore::KeeperLogStore(
const std::string & changelogs_path, LogFileSettings log_file_settings)
: log(&Poco::Logger::get("KeeperLogStore"))
, changelog(changelogs_path, rotate_interval_, force_sync_, log, compress_logs_)
, changelog(changelogs_path, log, log_file_settings)
{
if (force_sync_)
if (log_file_settings.force_sync)
LOG_INFO(log, "force_sync enabled");
else
LOG_INFO(log, "force_sync disabled");
@ -90,8 +91,7 @@ bool KeeperLogStore::compact(uint64_t last_log_index)
bool KeeperLogStore::flush()
{
std::lock_guard lock(changelog_lock);
changelog.flush();
return true;
return changelog.flush();
}
void KeeperLogStore::apply_pack(uint64_t index, nuraft::buffer & pack)

View File

@ -14,7 +14,7 @@ namespace DB
class KeeperLogStore : public nuraft::log_store
{
public:
KeeperLogStore(const std::string & changelogs_path, uint64_t rotate_interval_, bool force_sync_, bool compress_logs_);
KeeperLogStore(const std::string & changelogs_path, LogFileSettings log_file_settings);
/// Read log storage from filesystem starting from last_commited_log_index
void init(uint64_t last_commited_log_index, uint64_t logs_to_keep);

View File

@ -6,7 +6,7 @@
#include <Common/Exception.h>
#include <Common/setThreadName.h>
#include <IO/S3Common.h>
#include <IO/S3/getObjectInfo.h>
#include <IO/WriteBufferFromS3.h>
#include <IO/ReadBufferFromS3.h>
#include <IO/ReadBufferFromFile.h>

View File

@ -214,7 +214,7 @@ KeeperStateManager::KeeperStateManager(
int server_id_, const std::string & host, int port, const std::string & logs_path, const std::string & state_file_path)
: my_server_id(server_id_)
, secure(false)
, log_store(nuraft::cs_new<KeeperLogStore>(logs_path, 5000, false, false))
, log_store(nuraft::cs_new<KeeperLogStore>(logs_path, LogFileSettings{.force_sync =false, .compress_logs = false, .rotate_interval = 5000}))
, server_state_path(state_file_path)
, logger(&Poco::Logger::get("KeeperStateManager"))
{
@ -238,9 +238,14 @@ KeeperStateManager::KeeperStateManager(
, configuration_wrapper(parseServersConfiguration(config, false))
, log_store(nuraft::cs_new<KeeperLogStore>(
log_storage_path,
coordination_settings->rotate_log_storage_interval,
coordination_settings->force_sync,
coordination_settings->compress_logs))
LogFileSettings
{
.force_sync = coordination_settings->force_sync,
.compress_logs = coordination_settings->compress_logs,
.rotate_interval = coordination_settings->rotate_log_storage_interval,
.max_size = coordination_settings->max_log_file_size,
.overallocate_size = coordination_settings->log_file_overallocate_size
}))
, server_state_path(state_file_path)
, logger(&Poco::Logger::get("KeeperStateManager"))
{

View File

@ -235,7 +235,8 @@ TEST_P(CoordinationTest, ChangelogTestSimple)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
auto entry = getLogEntry("hello world", 77);
changelog.append(entry);
@ -262,7 +263,7 @@ TEST_P(CoordinationTest, ChangelogTestFile)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
auto entry = getLogEntry("hello world", 77);
changelog.append(entry);
@ -291,7 +292,7 @@ TEST_P(CoordinationTest, ChangelogReadWrite)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 1000, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 1000});
changelog.init(1, 0);
for (size_t i = 0; i < 10; ++i)
@ -305,7 +306,7 @@ TEST_P(CoordinationTest, ChangelogReadWrite)
waitDurableLogs(changelog);
DB::KeeperLogStore changelog_reader("./logs", 1000, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 1000});
changelog_reader.init(1, 0);
EXPECT_EQ(changelog_reader.size(), 10);
EXPECT_EQ(changelog_reader.last_entry()->get_term(), changelog.last_entry()->get_term());
@ -325,7 +326,7 @@ TEST_P(CoordinationTest, ChangelogWriteAt)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 1000, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 1000});
changelog.init(1, 0);
for (size_t i = 0; i < 10; ++i)
{
@ -347,7 +348,7 @@ TEST_P(CoordinationTest, ChangelogWriteAt)
EXPECT_EQ(changelog.entry_at(7)->get_term(), 77);
EXPECT_EQ(changelog.next_slot(), 8);
DB::KeeperLogStore changelog_reader("./logs", 1000, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 1000});
changelog_reader.init(1, 0);
EXPECT_EQ(changelog_reader.size(), changelog.size());
@ -361,7 +362,7 @@ TEST_P(CoordinationTest, ChangelogTestAppendAfterRead)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
for (size_t i = 0; i < 7; ++i)
{
@ -377,7 +378,7 @@ TEST_P(CoordinationTest, ChangelogTestAppendAfterRead)
EXPECT_TRUE(fs::exists("./logs/changelog_1_5.bin" + params.extension));
EXPECT_TRUE(fs::exists("./logs/changelog_6_10.bin" + params.extension));
DB::KeeperLogStore changelog_reader("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog_reader.init(1, 0);
EXPECT_EQ(changelog_reader.size(), 7);
@ -405,6 +406,7 @@ TEST_P(CoordinationTest, ChangelogTestAppendAfterRead)
EXPECT_EQ(changelog_reader.size(), 11);
waitDurableLogs(changelog_reader);
EXPECT_TRUE(fs::exists("./logs/changelog_1_5.bin" + params.extension));
EXPECT_TRUE(fs::exists("./logs/changelog_6_10.bin" + params.extension));
EXPECT_TRUE(fs::exists("./logs/changelog_11_15.bin" + params.extension));
@ -438,7 +440,7 @@ TEST_P(CoordinationTest, ChangelogTestCompaction)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
for (size_t i = 0; i < 3; ++i)
@ -487,7 +489,7 @@ TEST_P(CoordinationTest, ChangelogTestCompaction)
EXPECT_EQ(changelog.next_slot(), 8);
EXPECT_EQ(changelog.last_entry()->get_term(), 60);
/// And we able to read it
DB::KeeperLogStore changelog_reader("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog_reader.init(7, 0);
EXPECT_EQ(changelog_reader.size(), 1);
@ -500,7 +502,7 @@ TEST_P(CoordinationTest, ChangelogTestBatchOperations)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(1, 0);
for (size_t i = 0; i < 10; ++i)
{
@ -515,7 +517,7 @@ TEST_P(CoordinationTest, ChangelogTestBatchOperations)
auto entries = changelog.pack(1, 5);
DB::KeeperLogStore apply_changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore apply_changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
apply_changelog.init(1, 0);
for (size_t i = 0; i < 10; ++i)
@ -547,7 +549,7 @@ TEST_P(CoordinationTest, ChangelogTestBatchOperationsEmpty)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(1, 0);
for (size_t i = 0; i < 10; ++i)
{
@ -563,7 +565,7 @@ TEST_P(CoordinationTest, ChangelogTestBatchOperationsEmpty)
auto entries = changelog.pack(5, 5);
ChangelogDirTest test1("./logs1");
DB::KeeperLogStore changelog_new("./logs1", 100, true, params.enable_compression);
DB::KeeperLogStore changelog_new("./logs1", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog_new.init(1, 0);
EXPECT_EQ(changelog_new.size(), 0);
@ -585,7 +587,7 @@ TEST_P(CoordinationTest, ChangelogTestBatchOperationsEmpty)
EXPECT_EQ(changelog_new.start_index(), 5);
EXPECT_EQ(changelog_new.next_slot(), 11);
DB::KeeperLogStore changelog_reader("./logs1", 100, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs1", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog_reader.init(5, 0);
}
@ -594,7 +596,7 @@ TEST_P(CoordinationTest, ChangelogTestWriteAtPreviousFile)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
for (size_t i = 0; i < 33; ++i)
@ -635,7 +637,7 @@ TEST_P(CoordinationTest, ChangelogTestWriteAtPreviousFile)
EXPECT_FALSE(fs::exists("./logs/changelog_26_30.bin" + params.extension));
EXPECT_FALSE(fs::exists("./logs/changelog_31_35.bin" + params.extension));
DB::KeeperLogStore changelog_read("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog_read("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog_read.init(1, 0);
EXPECT_EQ(changelog_read.size(), 7);
EXPECT_EQ(changelog_read.start_index(), 1);
@ -647,7 +649,7 @@ TEST_P(CoordinationTest, ChangelogTestWriteAtFileBorder)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
for (size_t i = 0; i < 33; ++i)
@ -688,7 +690,7 @@ TEST_P(CoordinationTest, ChangelogTestWriteAtFileBorder)
EXPECT_FALSE(fs::exists("./logs/changelog_26_30.bin" + params.extension));
EXPECT_FALSE(fs::exists("./logs/changelog_31_35.bin" + params.extension));
DB::KeeperLogStore changelog_read("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog_read("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog_read.init(1, 0);
EXPECT_EQ(changelog_read.size(), 11);
EXPECT_EQ(changelog_read.start_index(), 1);
@ -700,7 +702,7 @@ TEST_P(CoordinationTest, ChangelogTestWriteAtAllFiles)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
for (size_t i = 0; i < 33; ++i)
{
@ -745,7 +747,7 @@ TEST_P(CoordinationTest, ChangelogTestStartNewLogAfterRead)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
for (size_t i = 0; i < 35; ++i)
@ -766,7 +768,7 @@ TEST_P(CoordinationTest, ChangelogTestStartNewLogAfterRead)
EXPECT_TRUE(fs::exists("./logs/changelog_31_35.bin" + params.extension));
EXPECT_FALSE(fs::exists("./logs/changelog_36_40.bin" + params.extension));
DB::KeeperLogStore changelog_reader("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog_reader.init(1, 0);
auto entry = getLogEntry("36_hello_world", 360);
@ -811,7 +813,7 @@ TEST_P(CoordinationTest, ChangelogTestReadAfterBrokenTruncate)
auto params = GetParam();
ChangelogDirTest test(log_folder);
DB::KeeperLogStore changelog(log_folder, 5, true, params.enable_compression);
DB::KeeperLogStore changelog(log_folder, DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog.init(1, 0);
for (size_t i = 0; i < 35; ++i)
@ -834,7 +836,7 @@ TEST_P(CoordinationTest, ChangelogTestReadAfterBrokenTruncate)
DB::WriteBufferFromFile plain_buf("./logs/changelog_11_15.bin" + params.extension, DBMS_DEFAULT_BUFFER_SIZE, O_APPEND | O_CREAT | O_WRONLY);
plain_buf.truncate(0);
DB::KeeperLogStore changelog_reader("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog_reader.init(1, 0);
changelog_reader.end_of_append_batch(0, 0);
@ -867,7 +869,7 @@ TEST_P(CoordinationTest, ChangelogTestReadAfterBrokenTruncate)
assertBrokenLogRemoved(log_folder, "changelog_26_30.bin" + params.extension);
assertBrokenLogRemoved(log_folder, "changelog_31_35.bin" + params.extension);
DB::KeeperLogStore changelog_reader2("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog_reader2("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog_reader2.init(1, 0);
EXPECT_EQ(changelog_reader2.size(), 11);
EXPECT_EQ(changelog_reader2.last_entry()->get_term(), 7777);
@ -878,7 +880,7 @@ TEST_P(CoordinationTest, ChangelogTestReadAfterBrokenTruncate2)
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 20, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 20});
changelog.init(1, 0);
for (size_t i = 0; i < 35; ++i)
@ -893,9 +895,9 @@ TEST_P(CoordinationTest, ChangelogTestReadAfterBrokenTruncate2)
EXPECT_TRUE(fs::exists("./logs/changelog_21_40.bin" + params.extension));
DB::WriteBufferFromFile plain_buf("./logs/changelog_1_20.bin" + params.extension, DBMS_DEFAULT_BUFFER_SIZE, O_APPEND | O_CREAT | O_WRONLY);
plain_buf.truncate(140);
plain_buf.truncate(30);
DB::KeeperLogStore changelog_reader("./logs", 20, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 20});
changelog_reader.init(1, 0);
EXPECT_EQ(changelog_reader.size(), 0);
@ -910,7 +912,7 @@ TEST_P(CoordinationTest, ChangelogTestReadAfterBrokenTruncate2)
EXPECT_EQ(changelog_reader.size(), 1);
EXPECT_EQ(changelog_reader.last_entry()->get_term(), 7777);
DB::KeeperLogStore changelog_reader2("./logs", 1, true, params.enable_compression);
DB::KeeperLogStore changelog_reader2("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 1});
changelog_reader2.init(1, 0);
EXPECT_EQ(changelog_reader2.size(), 1);
EXPECT_EQ(changelog_reader2.last_entry()->get_term(), 7777);
@ -921,7 +923,7 @@ TEST_P(CoordinationTest, ChangelogTestLostFiles)
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 20, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 20});
changelog.init(1, 0);
for (size_t i = 0; i < 35; ++i)
@ -937,7 +939,7 @@ TEST_P(CoordinationTest, ChangelogTestLostFiles)
fs::remove("./logs/changelog_1_20.bin" + params.extension);
DB::KeeperLogStore changelog_reader("./logs", 20, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 20});
/// It should print error message, but still able to start
changelog_reader.init(5, 0);
assertBrokenLogRemoved("./logs", "changelog_21_40.bin" + params.extension);
@ -948,7 +950,7 @@ TEST_P(CoordinationTest, ChangelogTestLostFiles2)
auto params = GetParam();
ChangelogDirTest test("./logs");
DB::KeeperLogStore changelog("./logs", 10, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 10});
changelog.init(1, 0);
for (size_t i = 0; i < 35; ++i)
@ -968,7 +970,7 @@ TEST_P(CoordinationTest, ChangelogTestLostFiles2)
// we have a gap in our logs, we need to remove all the logs after the gap
fs::remove("./logs/changelog_21_30.bin" + params.extension);
DB::KeeperLogStore changelog_reader("./logs", 10, true, params.enable_compression);
DB::KeeperLogStore changelog_reader("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 10});
/// It should print error message, but still able to start
changelog_reader.init(5, 0);
EXPECT_TRUE(fs::exists("./logs/changelog_1_10.bin" + params.extension));
@ -1406,7 +1408,7 @@ void testLogAndStateMachine(Coordination::CoordinationSettingsPtr settings, uint
SnapshotsQueue snapshots_queue{1};
auto state_machine = std::make_shared<KeeperStateMachine>(queue, snapshots_queue, "./snapshots", settings, keeper_context, nullptr);
state_machine->init();
DB::KeeperLogStore changelog("./logs", settings->rotate_log_storage_interval, true, enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = enable_compression, .rotate_interval = settings->rotate_log_storage_interval});
changelog.init(state_machine->last_commit_index() + 1, settings->reserved_log_items);
for (size_t i = 1; i < total_logs + 1; ++i)
{
@ -1446,7 +1448,7 @@ void testLogAndStateMachine(Coordination::CoordinationSettingsPtr settings, uint
restore_machine->init();
EXPECT_EQ(restore_machine->last_commit_index(), total_logs - total_logs % settings->snapshot_distance);
DB::KeeperLogStore restore_changelog("./logs", settings->rotate_log_storage_interval, true, enable_compression);
DB::KeeperLogStore restore_changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = enable_compression, .rotate_interval = settings->rotate_log_storage_interval});
restore_changelog.init(restore_machine->last_commit_index() + 1, settings->reserved_log_items);
EXPECT_EQ(restore_changelog.size(), std::min(settings->reserved_log_items + total_logs % settings->snapshot_distance, total_logs));
@ -1583,7 +1585,7 @@ TEST_P(CoordinationTest, TestRotateIntervalChanges)
auto params = GetParam();
ChangelogDirTest snapshots("./logs");
{
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(0, 3);
for (size_t i = 1; i < 55; ++i)
@ -1601,7 +1603,7 @@ TEST_P(CoordinationTest, TestRotateIntervalChanges)
EXPECT_TRUE(fs::exists("./logs/changelog_1_100.bin" + params.extension));
DB::KeeperLogStore changelog_1("./logs", 10, true, params.enable_compression);
DB::KeeperLogStore changelog_1("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 10});
changelog_1.init(0, 50);
for (size_t i = 0; i < 55; ++i)
{
@ -1617,7 +1619,7 @@ TEST_P(CoordinationTest, TestRotateIntervalChanges)
EXPECT_TRUE(fs::exists("./logs/changelog_1_100.bin" + params.extension));
EXPECT_TRUE(fs::exists("./logs/changelog_101_110.bin" + params.extension));
DB::KeeperLogStore changelog_2("./logs", 7, true, params.enable_compression);
DB::KeeperLogStore changelog_2("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 7});
changelog_2.init(98, 55);
for (size_t i = 0; i < 17; ++i)
@ -1640,7 +1642,7 @@ TEST_P(CoordinationTest, TestRotateIntervalChanges)
EXPECT_TRUE(fs::exists("./logs/changelog_118_124.bin" + params.extension));
EXPECT_TRUE(fs::exists("./logs/changelog_125_131.bin" + params.extension));
DB::KeeperLogStore changelog_3("./logs", 5, true, params.enable_compression);
DB::KeeperLogStore changelog_3("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 5});
changelog_3.init(116, 3);
for (size_t i = 0; i < 17; ++i)
{
@ -1688,7 +1690,7 @@ TEST_P(CoordinationTest, TestCompressedLogsMultipleRewrite)
using namespace Coordination;
auto test_params = GetParam();
ChangelogDirTest snapshots("./logs");
DB::KeeperLogStore changelog("./logs", 100, true, test_params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = test_params.enable_compression, .rotate_interval = 100});
changelog.init(0, 3);
for (size_t i = 1; i < 55; ++i)
@ -1702,7 +1704,7 @@ TEST_P(CoordinationTest, TestCompressedLogsMultipleRewrite)
waitDurableLogs(changelog);
DB::KeeperLogStore changelog1("./logs", 100, true, test_params.enable_compression);
DB::KeeperLogStore changelog1("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = test_params.enable_compression, .rotate_interval = 100});
changelog1.init(0, 3);
for (size_t i = 55; i < 70; ++i)
{
@ -1713,7 +1715,7 @@ TEST_P(CoordinationTest, TestCompressedLogsMultipleRewrite)
changelog1.end_of_append_batch(0, 0);
}
DB::KeeperLogStore changelog2("./logs", 100, true, test_params.enable_compression);
DB::KeeperLogStore changelog2("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = test_params.enable_compression, .rotate_interval = 100});
changelog2.init(0, 3);
for (size_t i = 70; i < 80; ++i)
{
@ -1776,7 +1778,7 @@ TEST_P(CoordinationTest, ChangelogInsertThreeTimesSmooth)
ChangelogDirTest test("./logs");
{
LOG_INFO(log, "================First time=====================");
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(1, 0);
auto entry = getLogEntry("hello_world", 1000);
changelog.append(entry);
@ -1787,7 +1789,7 @@ TEST_P(CoordinationTest, ChangelogInsertThreeTimesSmooth)
{
LOG_INFO(log, "================Second time=====================");
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(1, 0);
auto entry = getLogEntry("hello_world", 1000);
changelog.append(entry);
@ -1798,7 +1800,7 @@ TEST_P(CoordinationTest, ChangelogInsertThreeTimesSmooth)
{
LOG_INFO(log, "================Third time=====================");
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(1, 0);
auto entry = getLogEntry("hello_world", 1000);
changelog.append(entry);
@ -1809,7 +1811,7 @@ TEST_P(CoordinationTest, ChangelogInsertThreeTimesSmooth)
{
LOG_INFO(log, "================Fourth time=====================");
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(1, 0);
auto entry = getLogEntry("hello_world", 1000);
changelog.append(entry);
@ -1827,7 +1829,7 @@ TEST_P(CoordinationTest, ChangelogInsertMultipleTimesSmooth)
for (size_t i = 0; i < 36; ++i)
{
LOG_INFO(log, "================First time=====================");
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(1, 0);
for (size_t j = 0; j < 7; ++j)
{
@ -1838,7 +1840,7 @@ TEST_P(CoordinationTest, ChangelogInsertMultipleTimesSmooth)
waitDurableLogs(changelog);
}
DB::KeeperLogStore changelog("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog.init(1, 0);
EXPECT_EQ(changelog.next_slot(), 36 * 7 + 1);
}
@ -1849,7 +1851,7 @@ TEST_P(CoordinationTest, ChangelogInsertThreeTimesHard)
ChangelogDirTest test("./logs");
{
LOG_INFO(log, "================First time=====================");
DB::KeeperLogStore changelog1("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog1("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog1.init(1, 0);
auto entry = getLogEntry("hello_world", 1000);
changelog1.append(entry);
@ -1860,7 +1862,7 @@ TEST_P(CoordinationTest, ChangelogInsertThreeTimesHard)
{
LOG_INFO(log, "================Second time=====================");
DB::KeeperLogStore changelog2("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog2("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog2.init(1, 0);
auto entry = getLogEntry("hello_world", 1000);
changelog2.append(entry);
@ -1871,7 +1873,7 @@ TEST_P(CoordinationTest, ChangelogInsertThreeTimesHard)
{
LOG_INFO(log, "================Third time=====================");
DB::KeeperLogStore changelog3("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog3("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog3.init(1, 0);
auto entry = getLogEntry("hello_world", 1000);
changelog3.append(entry);
@ -1882,7 +1884,7 @@ TEST_P(CoordinationTest, ChangelogInsertThreeTimesHard)
{
LOG_INFO(log, "================Fourth time=====================");
DB::KeeperLogStore changelog4("./logs", 100, true, params.enable_compression);
DB::KeeperLogStore changelog4("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100});
changelog4.init(1, 0);
auto entry = getLogEntry("hello_world", 1000);
changelog4.append(entry);
@ -1939,7 +1941,7 @@ TEST_P(CoordinationTest, TestLogGap)
using namespace Coordination;
auto test_params = GetParam();
ChangelogDirTest logs("./logs");
DB::KeeperLogStore changelog("./logs", 100, true, test_params.enable_compression);
DB::KeeperLogStore changelog("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = test_params.enable_compression, .rotate_interval = 100});
changelog.init(0, 3);
for (size_t i = 1; i < 55; ++i)
@ -1951,7 +1953,7 @@ TEST_P(CoordinationTest, TestLogGap)
changelog.end_of_append_batch(0, 0);
}
DB::KeeperLogStore changelog1("./logs", 100, true, test_params.enable_compression);
DB::KeeperLogStore changelog1("./logs", DB::LogFileSettings{.force_sync = true, .compress_logs = test_params.enable_compression, .rotate_interval = 100});
changelog1.init(61, 3);
/// Logs discarded
@ -2283,6 +2285,66 @@ TEST_P(CoordinationTest, TestSystemNodeModify)
assert_create("/keeper1/test", Error::ZOK);
}
TEST_P(CoordinationTest, ChangelogTestMaxLogSize)
{
auto params = GetParam();
ChangelogDirTest test("./logs");
uint64_t last_entry_index{0};
size_t i{0};
{
SCOPED_TRACE("Small rotation interval, big size limit");
DB::KeeperLogStore changelog(
"./logs",
DB::LogFileSettings{
.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 20, .max_size = 50 * 1024 * 1024});
changelog.init(1, 0);
for (; i < 100; ++i)
{
auto entry = getLogEntry(std::to_string(i) + "_hello_world", (i + 44) * 10);
last_entry_index = changelog.append(entry);
}
changelog.end_of_append_batch(0, 0);
waitDurableLogs(changelog);
ASSERT_EQ(changelog.entry_at(last_entry_index)->get_term(), (i - 1 + 44) * 10);
}
{
SCOPED_TRACE("Large rotation interval, small size limit");
DB::KeeperLogStore changelog(
"./logs",
DB::LogFileSettings{
.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100'000, .max_size = 4000});
changelog.init(1, 0);
ASSERT_EQ(changelog.entry_at(last_entry_index)->get_term(), (i - 1 + 44) * 10);
for (; i < 500; ++i)
{
auto entry = getLogEntry(std::to_string(i) + "_hello_world", (i + 44) * 10);
last_entry_index = changelog.append(entry);
}
changelog.end_of_append_batch(0, 0);
waitDurableLogs(changelog);
ASSERT_EQ(changelog.entry_at(last_entry_index)->get_term(), (i - 1 + 44) * 10);
}
{
SCOPED_TRACE("Final verify all logs");
DB::KeeperLogStore changelog(
"./logs",
DB::LogFileSettings{
.force_sync = true, .compress_logs = params.enable_compression, .rotate_interval = 100'000, .max_size = 4000});
changelog.init(1, 0);
ASSERT_EQ(changelog.entry_at(last_entry_index)->get_term(), (i - 1 + 44) * 10);
}
}
INSTANTIATE_TEST_SUITE_P(CoordinationTestSuite,
CoordinationTest,
::testing::ValuesIn(std::initializer_list<CompressionParam>{

View File

@ -95,7 +95,15 @@ void BackgroundSchedulePoolTaskInfo::execute()
executing = true;
}
function();
try
{
function();
}
catch (...)
{
tryLogCurrentException(__PRETTY_FUNCTION__);
chassert(false && "Tasks in BackgroundSchedulePool cannot throw");
}
UInt64 milliseconds = watch.elapsedMilliseconds();
/// If the task is executed longer than specified time, it will be logged.

View File

@ -540,6 +540,15 @@ class IColumn;
M(Bool, describe_extend_object_types, false, "Deduce concrete type of columns of type Object in DESCRIBE query", 0) \
M(Bool, describe_include_subcolumns, false, "If true, subcolumns of all table columns will be included into result of DESCRIBE query", 0) \
\
M(Bool, use_query_cache, false, "Enable the query cache", 0) \
M(Bool, enable_writes_to_query_cache, true, "Enable storing results of SELECT queries in the query cache", 0) \
M(Bool, enable_reads_from_query_cache, true, "Enable reading results of SELECT queries from the query cache", 0) \
M(Bool, query_cache_store_results_of_queries_with_nondeterministic_functions, false, "Store results of queries with non-deterministic functions (e.g. rand(), now()) in the query cache", 0) \
M(UInt64, query_cache_min_query_runs, 0, "Minimum number a SELECT query must run before its result is stored in the query cache", 0) \
M(Milliseconds, query_cache_min_query_duration, 0, "Minimum time in milliseconds for a query to run for its result to be stored in the query cache.", 0) \
M(Seconds, query_cache_ttl, 60, "After this time in seconds entries in the query cache become stale", 0) \
M(Bool, query_cache_share_between_users, false, "Allow other users to read entry in the query cache", 0) \
\
M(Bool, optimize_rewrite_sum_if_to_count_if, false, "Rewrite sumIf() and sum(if()) function countIf() function when logically equivalent", 0) \
M(UInt64, insert_shard_id, 0, "If non zero, when insert into a distributed table, the data will be inserted into the shard `insert_shard_id` synchronously. Possible values range from 1 to `shards_number` of corresponding distributed table", 0) \
\
@ -594,7 +603,7 @@ class IColumn;
M(ShortCircuitFunctionEvaluation, short_circuit_function_evaluation, ShortCircuitFunctionEvaluation::ENABLE, "Setting for short-circuit function evaluation configuration. Possible values: 'enable' - use short-circuit function evaluation for functions that are suitable for it, 'disable' - disable short-circuit function evaluation, 'force_enable' - use short-circuit function evaluation for all functions.", 0) \
\
M(LocalFSReadMethod, storage_file_read_method, LocalFSReadMethod::mmap, "Method of reading data from storage file, one of: read, pread, mmap.", 0) \
M(String, local_filesystem_read_method, "pread_threadpool", "Method of reading data from local filesystem, one of: read, pread, mmap, pread_threadpool.", 0) \
M(String, local_filesystem_read_method, "pread_threadpool", "Method of reading data from local filesystem, one of: read, pread, mmap, io_uring, pread_threadpool.", 0) \
M(String, remote_filesystem_read_method, "threadpool", "Method of reading data from remote filesystem, one of: read, threadpool.", 0) \
M(Bool, local_filesystem_read_prefetch, false, "Should use prefetching when reading data from local filesystem.", 0) \
M(Bool, remote_filesystem_read_prefetch, true, "Should use prefetching when reading data from remote filesystem.", 0) \
@ -660,6 +669,7 @@ class IColumn;
M(Bool, allow_experimental_nlp_functions, false, "Enable experimental functions for natural language processing.", 0) \
M(Bool, allow_experimental_hash_functions, false, "Enable experimental hash functions (hashid, etc)", 0) \
M(Bool, allow_experimental_object_type, false, "Allow Object and JSON data types", 0) \
M(Bool, allow_experimental_query_cache, false, "Enable experimental query cache", 0) \
M(String, insert_deduplication_token, "", "If not empty, used for duplicate detection instead of data digest", 0) \
M(String, ann_index_select_query_params, "", "Parameters passed to ANN indexes in SELECT queries, the format is 'param1=x, param2=y, ...'", 0) \
M(UInt64, max_limit_for_ann_queries, 1000000, "Maximum limit value for using ANN indexes is used to prevent memory overflow in search queries for indexes", 0) \
@ -675,13 +685,6 @@ class IColumn;
M(UInt64, grace_hash_join_max_buckets, 1024, "Limit on the number of grace hash join buckets", 0) \
M(Bool, optimize_distinct_in_order, true, "Enable DISTINCT optimization if some columns in DISTINCT form a prefix of sorting. For example, prefix of sorting key in merge tree or ORDER BY statement", 0) \
M(Bool, optimize_sorting_by_input_stream_properties, true, "Optimize sorting by sorting properties of input stream", 0) \
M(Bool, enable_experimental_query_result_cache, false, "Store and retrieve results of SELECT queries in/from the query result cache", 0) \
M(Bool, enable_experimental_query_result_cache_passive_usage, false, "Retrieve results of SELECT queries from the query result cache", 0) \
M(Bool, query_result_cache_store_results_of_queries_with_nondeterministic_functions, false, "Store results of queries with non-deterministic functions (e.g. rand(), now()) in the query result cache", 0) \
M(UInt64, query_result_cache_min_query_runs, 0, "Minimum number a SELECT query must run before its result is stored in the query result cache", 0) \
M(Milliseconds, query_result_cache_min_query_duration, 0, "Minimum time in milliseconds for a query to run for its result to be stored in the query result cache.", 0) \
M(Seconds, query_result_cache_ttl, 60, "After this time in seconds entries in the query result cache become stale", 0) \
M(Bool, query_result_cache_share_between_users, false, "Allow other users to read entry in the query result cache", 0) \
M(UInt64, insert_keeper_max_retries, 0, "Max retries for keeper operations during insert", 0) \
M(UInt64, insert_keeper_retry_initial_backoff_ms, 100, "Initial backoff timeout for keeper operations during insert", 0) \
M(UInt64, insert_keeper_retry_max_backoff_ms, 10000, "Max backoff timeout for keeper operations during insert", 0) \

View File

@ -233,7 +233,11 @@ void registerDictionarySourceXDBC(DictionarySourceFactory & factory)
bool /* check_config */) -> DictionarySourcePtr {
#if USE_ODBC
BridgeHelperPtr bridge = std::make_shared<XDBCBridgeHelper<ODBCBridgeMixin>>(
global_context, global_context->getSettings().http_receive_timeout, config.getString(config_prefix + ".odbc.connection_string"));
global_context,
global_context->getSettings().http_receive_timeout,
config.getString(config_prefix + ".odbc.connection_string"),
config.getBool(config_prefix + ".settings.odbc_bridge_use_connection_pooling",
global_context->getSettingsRef().odbc_bridge_use_connection_pooling));
std::string settings_config_prefix = config_prefix + ".odbc";

View File

@ -54,8 +54,11 @@ void DiskLocalCheckThread::run()
else
{
retry = 0;
if (!disk->broken)
LOG_ERROR(log, "Disk {} marked as broken", disk->getName());
else
LOG_INFO(log, "Disk {} is still broken", disk->getName());
disk->broken = true;
LOG_INFO(log, "Disk {} is broken", disk->getName());
task->scheduleAfter(check_period_ms);
}
}

View File

@ -0,0 +1,341 @@
#if defined(OS_LINUX)
#include "IOUringReader.h"
#include <base/errnoToString.h>
#include <Common/assert_cast.h>
#include <Common/Exception.h>
#include <Common/MemorySanitizer.h>
#include <Common/ProfileEvents.h>
#include <Common/CurrentMetrics.h>
#include <Common/Stopwatch.h>
#include <Common/setThreadName.h>
#include <Common/logger_useful.h>
#include <future>
namespace ProfileEvents
{
extern const Event ReadBufferFromFileDescriptorRead;
extern const Event ReadBufferFromFileDescriptorReadFailed;
extern const Event ReadBufferFromFileDescriptorReadBytes;
extern const Event IOUringSQEsSubmitted;
extern const Event IOUringSQEsResubmits;
extern const Event IOUringCQEsCompleted;
extern const Event IOUringCQEsFailed;
}
namespace CurrentMetrics
{
extern const Metric IOUringPendingEvents;
extern const Metric IOUringInFlightEvents;
extern const Metric Read;
}
namespace DB
{
namespace ErrorCodes
{
extern const int LOGICAL_ERROR;
extern const int CANNOT_READ_FROM_FILE_DESCRIPTOR;
extern const int IO_URING_INIT_FAILED;
extern const int IO_URING_SUBMIT_ERROR;
}
IOUringReader::IOUringReader(uint32_t entries_)
: log(&Poco::Logger::get("IOUringReader"))
{
struct io_uring_probe * probe = io_uring_get_probe();
if (!probe)
{
is_supported = false;
return;
}
is_supported = io_uring_opcode_supported(probe, IORING_OP_READ);
io_uring_free_probe(probe);
if (!is_supported)
return;
struct io_uring_params params =
{
.cq_entries = 0, // filled by the kernel, initializing to silence warning
.flags = 0,
};
int ret = io_uring_queue_init_params(entries_, &ring, &params);
if (ret < 0)
throwFromErrno("Failed initializing io_uring", ErrorCodes::IO_URING_INIT_FAILED, -ret);
cq_entries = params.cq_entries;
ring_completion_monitor = ThreadFromGlobalPool([this] { monitorRing(); });
}
std::future<IAsynchronousReader::Result> IOUringReader::submit(Request request)
{
assert(request.size);
// take lock here because we're modifying containers and submitting to the ring,
// the monitor thread can also do the same
std::unique_lock lock{mutex};
// use the requested read destination address as the request id, the assumption
// here is that we won't get asked to fill in the same address more than once in parallel
auto request_id = reinterpret_cast<UInt64>(request.buf);
std::promise<IAsynchronousReader::Result> promise;
auto enqueued_request = EnqueuedRequest
{
.promise = std::move(promise),
.request = request,
.resubmitting = false,
.bytes_read = 0
};
// if there's room in the completion queue submit the request to the ring immediately,
// otherwise push it to the back of the pending queue
if (in_flight_requests.size() < cq_entries)
{
int ret = submitToRing(enqueued_request);
if (ret > 0)
{
const auto [kv, success] = in_flight_requests.emplace(request_id, std::move(enqueued_request));
if (!success)
{
ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorReadFailed);
return makeFailedResult(Exception(
ErrorCodes::LOGICAL_ERROR, "Tried enqueuing read request for {} that is already submitted", request_id));
}
return (kv->second).promise.get_future();
}
else
{
ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorReadFailed);
return makeFailedResult(Exception(
ErrorCodes::IO_URING_SUBMIT_ERROR, "Failed submitting SQE: {}", ret < 0 ? errnoToString(-ret) : "no SQE submitted"));
}
}
else
{
CurrentMetrics::add(CurrentMetrics::IOUringPendingEvents);
pending_requests.push_back(std::move(enqueued_request));
return pending_requests.back().promise.get_future();
}
}
int IOUringReader::submitToRing(EnqueuedRequest & enqueued)
{
struct io_uring_sqe * sqe = io_uring_get_sqe(&ring);
if (!sqe)
return 0;
auto request = enqueued.request;
auto request_id = reinterpret_cast<UInt64>(request.buf);
int fd = assert_cast<const LocalFileDescriptor &>(*request.descriptor).fd;
io_uring_sqe_set_data64(sqe, request_id);
io_uring_prep_read(sqe, fd, request.buf, static_cast<unsigned>(request.size - enqueued.bytes_read), request.offset + enqueued.bytes_read);
int ret = 0;
do
{
ret = io_uring_submit(&ring);
} while (ret == -EINTR || ret == -EAGAIN);
if (ret > 0 && !enqueued.resubmitting)
{
ProfileEvents::increment(ProfileEvents::IOUringSQEsSubmitted);
CurrentMetrics::add(CurrentMetrics::IOUringInFlightEvents);
CurrentMetrics::add(CurrentMetrics::Read);
}
return ret;
}
void IOUringReader::failRequest(const EnqueuedIterator & requestIt, const Exception & ex)
{
ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorReadFailed);
(requestIt->second).promise.set_exception(std::make_exception_ptr(ex));
finalizeRequest(requestIt);
}
void IOUringReader::finalizeRequest(const EnqueuedIterator & requestIt)
{
in_flight_requests.erase(requestIt);
CurrentMetrics::sub(CurrentMetrics::IOUringInFlightEvents);
CurrentMetrics::sub(CurrentMetrics::Read);
// since we just finalized a request there's now room in the completion queue,
// see if there are any pending requests and submit one from the front of the queue
if (!pending_requests.empty())
{
auto pending_request = std::move(pending_requests.front());
pending_requests.pop_front();
int ret = submitToRing(pending_request);
if (ret > 0)
{
auto request_id = reinterpret_cast<UInt64>(pending_request.request.buf);
if (!in_flight_requests.contains(request_id))
in_flight_requests.emplace(request_id, std::move(pending_request));
else
failPromise(pending_request.promise, Exception(ErrorCodes::LOGICAL_ERROR,
"Tried enqueuing pending read request for {} that is already submitted", request_id));
}
else
{
ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorReadFailed);
failPromise(pending_request.promise, Exception(ErrorCodes::IO_URING_SUBMIT_ERROR,
"Failed submitting SQE for pending request: {}", ret < 0 ? errnoToString(-ret) : "no SQE submitted"));
}
CurrentMetrics::sub(CurrentMetrics::IOUringPendingEvents);
}
}
void IOUringReader::monitorRing()
{
setThreadName("IOUringMonitor");
while (!cancelled.load(std::memory_order_relaxed))
{
// we can't use wait_cqe_* variants with timeouts as they can
// submit timeout events in older kernels that do not support IORING_FEAT_EXT_ARG
// and it is not safe to mix submission and consumption event threads.
struct io_uring_cqe * cqe = nullptr;
int ret = io_uring_wait_cqe(&ring, &cqe);
if (ret == -EAGAIN || ret == -EINTR)
{
LOG_DEBUG(log, "Restarting waiting for CQEs due to: {}", errnoToString(-ret));
io_uring_cqe_seen(&ring, cqe);
continue;
}
if (ret < 0)
{
LOG_ERROR(log, "Failed waiting for io_uring CQEs: {}", errnoToString(-ret));
continue;
}
// user_data zero means a noop event sent from the destructor meant to interrupt the thread
if (cancelled.load(std::memory_order_relaxed) || (cqe && cqe->user_data == 0))
{
LOG_DEBUG(log, "Stopping IOUringMonitor thread");
io_uring_cqe_seen(&ring, cqe);
break;
}
if (!cqe)
{
LOG_ERROR(log, "Unexpectedly got a null CQE, continuing");
continue;
}
// it is safe to re-submit events once we take the lock here
std::unique_lock lock{mutex};
auto request_id = cqe->user_data;
const auto it = in_flight_requests.find(request_id);
if (it == in_flight_requests.end())
{
LOG_ERROR(log, "Got a completion event for a request {} that was not submitted", request_id);
io_uring_cqe_seen(&ring, cqe);
continue;
}
auto & enqueued = it->second;
if (cqe->res == -EAGAIN || cqe->res == -EINTR)
{
enqueued.resubmitting = true;
ProfileEvents::increment(ProfileEvents::IOUringSQEsResubmits);
ret = submitToRing(enqueued);
if (ret <= 0)
{
failRequest(it, Exception(ErrorCodes::IO_URING_SUBMIT_ERROR,
"Failed re-submitting SQE: {}", ret < 0 ? errnoToString(-ret) : "no SQE submitted"));
}
io_uring_cqe_seen(&ring, cqe);
continue;
}
if (cqe->res < 0)
{
auto req = enqueued.request;
int fd = assert_cast<const LocalFileDescriptor &>(*req.descriptor).fd;
failRequest(it, Exception(
ErrorCodes::CANNOT_READ_FROM_FILE_DESCRIPTOR,
"Failed reading {} bytes at offset {} to address {} from fd {}: {}",
req.size, req.offset, static_cast<void*>(req.buf), fd, errnoToString(-cqe->res)
));
ProfileEvents::increment(ProfileEvents::IOUringCQEsFailed);
io_uring_cqe_seen(&ring, cqe);
continue;
}
size_t bytes_read = cqe->res;
size_t total_bytes_read = enqueued.bytes_read + bytes_read;
if (bytes_read > 0)
{
__msan_unpoison(enqueued.request.buf + enqueued.bytes_read, bytes_read);
ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorRead);
ProfileEvents::increment(ProfileEvents::ReadBufferFromFileDescriptorReadBytes, bytes_read);
}
if (bytes_read > 0 && total_bytes_read < enqueued.request.size)
{
// potential short read, re-submit
enqueued.resubmitting = true;
enqueued.bytes_read += bytes_read;
ret = submitToRing(enqueued);
if (ret <= 0)
{
failRequest(it, Exception(ErrorCodes::IO_URING_SUBMIT_ERROR,
"Failed re-submitting SQE for short read: {}", ret < 0 ? errnoToString(-ret) : "no SQE submitted"));
}
}
else
{
enqueued.promise.set_value(Result{ .size = total_bytes_read, .offset = enqueued.request.ignore });
finalizeRequest(it);
}
ProfileEvents::increment(ProfileEvents::IOUringCQEsCompleted);
io_uring_cqe_seen(&ring, cqe);
}
}
IOUringReader::~IOUringReader()
{
cancelled.store(true, std::memory_order_relaxed);
// interrupt the monitor thread by sending a noop event
{
std::unique_lock lock{mutex};
struct io_uring_sqe * sqe = io_uring_get_sqe(&ring);
io_uring_prep_nop(sqe);
io_uring_sqe_set_data(sqe, nullptr);
io_uring_submit(&ring);
}
ring_completion_monitor.join();
io_uring_queue_exit(&ring);
}
}
#endif

View File

@ -0,0 +1,78 @@
#pragma once
#if defined(OS_LINUX)
#include <Common/ThreadPool.h>
#include <IO/AsynchronousReader.h>
#include <deque>
#include <unordered_map>
#include <liburing.h>
namespace DB
{
/** Perform reads using the io_uring Linux subsystem.
*
* The class sets up a single io_uring that clients submit read requests to, they are
* placed in a map using the read buffer address as the key and the original request
* with a promise as the value. A monitor thread continuously polls the completion queue,
* looks up the completed request and completes the matching promise.
*/
class IOUringReader final : public IAsynchronousReader
{
private:
bool is_supported;
std::mutex mutex;
struct io_uring ring;
uint32_t cq_entries;
std::atomic<bool> cancelled{false};
ThreadFromGlobalPool ring_completion_monitor;
struct EnqueuedRequest
{
std::promise<IAsynchronousReader::Result> promise;
Request request;
bool resubmitting; // resubmits can happen due to short reads or when io_uring returns -EAGAIN
size_t bytes_read; // keep track of bytes already read in case short reads happen
};
std::deque<EnqueuedRequest> pending_requests;
std::unordered_map<UInt64, EnqueuedRequest> in_flight_requests;
int submitToRing(EnqueuedRequest & enqueued);
using EnqueuedIterator = std::unordered_map<UInt64, EnqueuedRequest>::iterator;
void failRequest(const EnqueuedIterator & requestIt, const Exception & ex);
void finalizeRequest(const EnqueuedIterator & requestIt);
void monitorRing();
template<typename T> inline void failPromise(std::promise<T> & promise, const Exception & ex)
{
promise.set_exception(std::make_exception_ptr(ex));
}
inline std::future<Result> makeFailedResult(const Exception & ex)
{
auto promise = std::promise<Result>{};
failPromise(promise, ex);
return promise.get_future();
}
const Poco::Logger * log;
public:
IOUringReader(uint32_t entries_);
inline bool isSupported() { return is_supported; }
std::future<Result> submit(Request request) override;
void wait() override {}
virtual ~IOUringReader() override;
};
}
#endif

View File

@ -4,7 +4,6 @@
#include <base/sleep.h>
#include <Core/Types.h>
#include <IO/ReadWriteBufferFromHTTP.h>
#include <IO/ConnectionTimeoutsContext.h>
#include <IO/WriteBufferFromString.h>
#include <IO/Operators.h>
#include <thread>

View File

@ -3,6 +3,7 @@
#include <IO/ReadBufferFromFile.h>
#include <IO/MMapReadBufferFromFileWithCache.h>
#include <IO/AsynchronousReadBufferFromFile.h>
#include <Disks/IO/IOUringReader.h>
#include <Disks/IO/ThreadPoolReader.h>
#include <IO/SynchronousReader.h>
#include <Common/ProfileEvents.h>
@ -23,6 +24,7 @@ namespace DB
namespace ErrorCodes
{
extern const int LOGICAL_ERROR;
extern const int UNSUPPORTED_METHOD;
}
@ -80,6 +82,19 @@ std::unique_ptr<ReadBufferFromFileBase> createReadBufferFromFileBase(
{
res = std::make_unique<ReadBufferFromFilePReadWithDescriptorsCache>(filename, buffer_size, actual_flags, existing_memory, buffer_alignment, file_size);
}
else if (settings.local_fs_method == LocalFSReadMethod::io_uring)
{
#if defined(OS_LINUX)
static std::shared_ptr<IOUringReader> reader = std::make_shared<IOUringReader>(512);
if (!reader->isSupported())
throw Exception(ErrorCodes::UNSUPPORTED_METHOD, "io_uring is not supported by this system");
res = std::make_unique<AsynchronousReadBufferFromFileWithDescriptorsCache>(
*reader, settings.priority, filename, buffer_size, actual_flags, existing_memory, buffer_alignment, file_size);
#else
throw Exception(ErrorCodes::UNSUPPORTED_METHOD, "Read method io_uring is only supported in Linux");
#endif
}
else if (settings.local_fs_method == LocalFSReadMethod::pread_fake_async)
{
auto context = Context::getGlobalContextInstance();

View File

@ -294,7 +294,9 @@ struct RemoveRecursiveObjectStorageOperation final : public IDiskObjectStorageOp
void execute(MetadataTransactionPtr tx) override
{
removeMetadataRecursive(tx, path);
/// Similar to DiskLocal and https://en.cppreference.com/w/cpp/filesystem/remove
if (metadata_storage.exists(path))
removeMetadataRecursive(tx, path);
}
void undo() override

Some files were not shown because too many files have changed in this diff Show More