Merge branch 'ClickHouse:master' into gcddelta-codec

Александр Нам 2023-08-31 15:25:52 +03:00 committed by GitHub
commit 43a3650a9e
110 changed files with 1999 additions and 795 deletions

@@ -3,9 +3,6 @@ name: BackportPR
env:
# Force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# Export system tables to ClickHouse Cloud
CLICKHOUSE_CI_LOGS_HOST: ${{ secrets.CLICKHOUSE_CI_LOGS_HOST }}
CLICKHOUSE_CI_LOGS_PASSWORD: ${{ secrets.CLICKHOUSE_CI_LOGS_PASSWORD }}
on: # yamllint disable-line rule:truthy
push:

@@ -3,9 +3,6 @@ name: MasterCI
env:
# Force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# Export system tables to ClickHouse Cloud
CLICKHOUSE_CI_LOGS_HOST: ${{ secrets.CLICKHOUSE_CI_LOGS_HOST }}
CLICKHOUSE_CI_LOGS_PASSWORD: ${{ secrets.CLICKHOUSE_CI_LOGS_PASSWORD }}
on: # yamllint disable-line rule:truthy
push:

@@ -3,9 +3,6 @@ name: PullRequestCI
env:
# Force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# Export system tables to ClickHouse Cloud
CLICKHOUSE_CI_LOGS_HOST: ${{ secrets.CLICKHOUSE_CI_LOGS_HOST }}
CLICKHOUSE_CI_LOGS_PASSWORD: ${{ secrets.CLICKHOUSE_CI_LOGS_PASSWORD }}
on: # yamllint disable-line rule:truthy
pull_request:

@@ -3,9 +3,6 @@ name: ReleaseBranchCI
env:
# Force the stdout and stderr streams to be unbuffered
PYTHONUNBUFFERED: 1
# Export system tables to ClickHouse Cloud
CLICKHOUSE_CI_LOGS_HOST: ${{ secrets.CLICKHOUSE_CI_LOGS_HOST }}
CLICKHOUSE_CI_LOGS_PASSWORD: ${{ secrets.CLICKHOUSE_CI_LOGS_PASSWORD }}
on: # yamllint disable-line rule:truthy
push:

@@ -1,4 +1,5 @@
### Table of Contents
**[ClickHouse release v23.8 LTS, 2023-08-31](#238)**<br/>
**[ClickHouse release v23.7, 2023-07-27](#237)**<br/>
**[ClickHouse release v23.6, 2023-06-30](#236)**<br/>
**[ClickHouse release v23.5, 2023-06-08](#235)**<br/>
@@ -10,6 +11,228 @@
# 2023 Changelog
### <a id="238"></a> ClickHouse release 23.8 LTS, 2023-08-31
#### Backward Incompatible Change
* If a dynamic disk contains a name, it should be specified as `disk = disk(name = 'disk_name', ...)` in disk function arguments. In previous versions it could be specified as `disk = disk_<disk_name>(...)`, which is no longer supported. [#52820](https://github.com/ClickHouse/ClickHouse/pull/52820) ([Kseniia Sumarokova](https://github.com/kssenii)).
* `clickhouse-benchmark` will establish connections in parallel when invoked with `--concurrency` more than one. Previously it was unusable if you ran it with 1000 concurrent connections from Europe to the US. Correct calculation of QPS for connections with high latency. Backward incompatible change: the option for JSON output of `clickhouse-benchmark` is removed. If you've used this option, you can also extract data from the `system.query_log` in JSON format as a workaround. [#53293](https://github.com/ClickHouse/ClickHouse/pull/53293) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* The `microseconds` column is removed from the `system.text_log`, and the `milliseconds` column is removed from the `system.metric_log`, because they are redundant in the presence of the `event_time_microseconds` column. [#53601](https://github.com/ClickHouse/ClickHouse/pull/53601) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Deprecate the metadata cache feature. It is experimental and we have never used it. The feature is dangerous: [#51182](https://github.com/ClickHouse/ClickHouse/issues/51182). Remove the `system.merge_tree_metadata_cache` system table. The metadata cache is still available in this version but will be removed soon. This closes [#39197](https://github.com/ClickHouse/ClickHouse/issues/39197). [#51303](https://github.com/ClickHouse/ClickHouse/pull/51303) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Disable support for 3DES in TLS connections. [#52893](https://github.com/ClickHouse/ClickHouse/pull/52893) ([Kenji Noguchi](https://github.com/knoguchi)).
#### New Feature
* Direct import from zip/7z/tar archives. Example: `file('*.zip :: *.csv')`. [#50321](https://github.com/ClickHouse/ClickHouse/pull/50321) ([nikitakeba](https://github.com/nikitakeba)).
* Add column `ptr` to `system.trace_log` for `trace_type = 'MemorySample'`. This column contains the address of the allocation. Added function `flameGraph`, which can build a flamegraph of memory that was allocated and not released. Reworking of [#38391](https://github.com/ClickHouse/ClickHouse/issues/38391). [#45322](https://github.com/ClickHouse/ClickHouse/pull/45322) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Added table function `azureBlobStorageCluster`. The supported set of features is very similar to table function `s3Cluster`. [#50795](https://github.com/ClickHouse/ClickHouse/pull/50795) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Allow using `cluster`, `clusterAllReplicas`, `remote`, and `remoteSecure` without a table name (see issue [#50808](https://github.com/ClickHouse/ClickHouse/issues/50808)). [#50848](https://github.com/ClickHouse/ClickHouse/pull/50848) ([Yangkuan Liu](https://github.com/LiuYangkuan)).
* Added a system table to monitor Kafka consumers. [#50999](https://github.com/ClickHouse/ClickHouse/pull/50999) ([Ilya Golshtein](https://github.com/ilejn)).
* Added `max_sessions_for_user` setting. [#51724](https://github.com/ClickHouse/ClickHouse/pull/51724) ([Alexey Gerasimchuck](https://github.com/Demilivor)).
* New functions `toUTCTimestamp/fromUTCTimestamp` that behave the same as Spark's `to_utc_timestamp/from_utc_timestamp`. [#52117](https://github.com/ClickHouse/ClickHouse/pull/52117) ([KevinyhZou](https://github.com/KevinyhZou)).
* Add new functions `structureToCapnProtoSchema`/`structureToProtobufSchema` that convert ClickHouse table structure to CapnProto/Protobuf format schema. Allow to input/output data in CapnProto/Protobuf format without external format schema using autogenerated schema from table structure (controlled by settings `format_capn_proto_use_autogenerated_schema`/`format_protobuf_use_autogenerated_schema`). Allow to export autogenerated schema while input/output using setting `output_format_schema`. [#52278](https://github.com/ClickHouse/ClickHouse/pull/52278) ([Kruglov Pavel](https://github.com/Avogar)).
* A new field `query_cache_usage` in `system.query_log` now shows if and how the query cache was used. [#52384](https://github.com/ClickHouse/ClickHouse/pull/52384) ([Robert Schulze](https://github.com/rschu1ze)).
* Add new functions `startsWithUTF8` and `endsWithUTF8`. [#52555](https://github.com/ClickHouse/ClickHouse/pull/52555) ([李扬](https://github.com/taiyang-li)).
* Allow variable number of columns in TSV/CustomSeparated/JSONCompactEachRow, make schema inference work with variable number of columns. Add settings `input_format_tsv_allow_variable_number_of_columns`, `input_format_custom_allow_variable_number_of_columns`, `input_format_json_compact_allow_variable_number_of_columns`. [#52692](https://github.com/ClickHouse/ClickHouse/pull/52692) ([Kruglov Pavel](https://github.com/Avogar)).
* Added `SYSTEM STOP/START PULLING REPLICATION LOG` queries (for testing `ReplicatedMergeTree`). [#52881](https://github.com/ClickHouse/ClickHouse/pull/52881) ([Alexander Tokmakov](https://github.com/tavplubix)).
* Allow to execute constant non-deterministic functions in mutations on initiator. [#53129](https://github.com/ClickHouse/ClickHouse/pull/53129) ([Anton Popov](https://github.com/CurtizJ)).
* Add input format `One` that doesn't read any data and always returns single row with column `dummy` with type `UInt8` and value `0` like `system.one`. It can be used together with `_file/_path` virtual columns to list files in file/s3/url/hdfs/etc table functions without reading any data. [#53209](https://github.com/ClickHouse/ClickHouse/pull/53209) ([Kruglov Pavel](https://github.com/Avogar)).
* Add `tupleConcat` function. Closes [#52759](https://github.com/ClickHouse/ClickHouse/issues/52759). [#53239](https://github.com/ClickHouse/ClickHouse/pull/53239) ([Nikolay Degterinsky](https://github.com/evillique)).
* Support `TRUNCATE DATABASE` operation. [#53261](https://github.com/ClickHouse/ClickHouse/pull/53261) ([Bharat Nallan](https://github.com/bharatnc)).
* Add `max_threads_for_indexes` setting to limit number of threads used for primary key processing. [#53313](https://github.com/ClickHouse/ClickHouse/pull/53313) ([jorisgio](https://github.com/jorisgio)).
* Re-add SipHash keyed functions. [#53525](https://github.com/ClickHouse/ClickHouse/pull/53525) ([Salvatore Mesoraca](https://github.com/aiven-sal)).
* Added functions `arrayRotateLeft`, `arrayRotateRight`, `arrayShiftLeft`, `arrayShiftRight` ([#52755](https://github.com/ClickHouse/ClickHouse/issues/52755), [#52895](https://github.com/ClickHouse/ClickHouse/issues/52895)). [#53557](https://github.com/ClickHouse/ClickHouse/pull/53557) ([Mikhail Koviazin](https://github.com/mkmkme)).
* Add column `name` to `system.clusters` as an alias to cluster. [#53605](https://github.com/ClickHouse/ClickHouse/pull/53605) ([irenjj](https://github.com/irenjj)).
* The advanced dashboard now allows mass editing (save/load). [#53608](https://github.com/ClickHouse/ClickHouse/pull/53608) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* The advanced dashboard now has an option to maximize charts and move them around. [#53622](https://github.com/ClickHouse/ClickHouse/pull/53622) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add support for plural units. [#53641](https://github.com/ClickHouse/ClickHouse/pull/53641) ([irenjj](https://github.com/irenjj)).
* Added server setting `validate_tcp_client_information`, which determines whether client information is validated when a query packet is received. [#53907](https://github.com/ClickHouse/ClickHouse/pull/53907) ([Alexey Gerasimchuck](https://github.com/Demilivor)).
* Added support for adding and subtracting arrays: `[5,2] + [1,7]`. Division and multiplication were not implemented due to confusion between pointwise multiplication and the scalar product of arguments. Closes [#49939](https://github.com/ClickHouse/ClickHouse/issues/49939). [#52625](https://github.com/ClickHouse/ClickHouse/pull/52625) ([Yarik Briukhovetskyi](https://github.com/yariks5s)).
* Add support for string literals as table names. Closes [#52178](https://github.com/ClickHouse/ClickHouse/issues/52178). [#52635](https://github.com/ClickHouse/ClickHouse/pull/52635) ([hendrik-m](https://github.com/hendrik-m)).
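
The semantics of the new array operations listed above can be sketched outside of ClickHouse. The Python below is only an illustration: the helper names are hypothetical, and rejecting arrays of different lengths and filling shifted-in positions with `0` are assumptions made for the sketch, not statements about the server's exact behavior.

```python
from typing import List


def array_add(a: List[int], b: List[int]) -> List[int]:
    """Pointwise addition, the reading of `[5,2] + [1,7]` used in this sketch."""
    if len(a) != len(b):
        # Assumption: arrays of different sizes are rejected.
        raise ValueError("arrays must have the same length")
    return [x + y for x, y in zip(a, b)]


def array_rotate_left(arr: List[int], n: int) -> List[int]:
    """Move the first n elements to the end; no element is lost."""
    if not arr:
        return []
    k = n % len(arr)
    return arr[k:] + arr[:k]


def array_shift_left(arr: List[int], n: int, default: int = 0) -> List[int]:
    """Drop the first n elements and pad the tail with `default`."""
    n = max(0, min(n, len(arr)))
    return arr[n:] + [default] * n


print(array_add([5, 2], [1, 7]))
print(array_rotate_left([1, 2, 3, 4, 5, 6], 2))
print(array_shift_left([1, 2, 3, 4, 5, 6], 2))
```

The difference between the two families is that rotation keeps every element, while shifting discards elements and pads with a default value.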
#### Experimental Feature
* Add new table engine `S3Queue` for streaming data import from s3. Closes [#37012](https://github.com/ClickHouse/ClickHouse/issues/37012). [#49086](https://github.com/ClickHouse/ClickHouse/pull/49086) ([s-kat](https://github.com/s-kat)). It is not ready to use. Do not use it.
* Enable parallel reading from replicas over distributed table. Related to [#49708](https://github.com/ClickHouse/ClickHouse/issues/49708). [#53005](https://github.com/ClickHouse/ClickHouse/pull/53005) ([Igor Nikonov](https://github.com/devcrafter)).
* Add experimental support for HNSW as approximate neighbor search method. [#53447](https://github.com/ClickHouse/ClickHouse/pull/53447) ([Davit Vardanyan](https://github.com/davvard)). This is currently intended for those who continue working on the implementation. Do not use it.
#### Performance Improvement
* Parquet filter pushdown. I.e. when reading Parquet files, row groups (chunks of the file) are skipped based on the WHERE condition and the min/max values in each column. In particular, if the file is roughly sorted by some column, queries that filter by a short range of that column will be much faster. [#52951](https://github.com/ClickHouse/ClickHouse/pull/52951) ([Michael Kolupaev](https://github.com/al13n321)).
* Optimize reading small row groups by batching them together in Parquet. Closes [#53069](https://github.com/ClickHouse/ClickHouse/issues/53069). [#53281](https://github.com/ClickHouse/ClickHouse/pull/53281) ([Kruglov Pavel](https://github.com/Avogar)).
* Optimize count from files in most input formats. Closes [#44334](https://github.com/ClickHouse/ClickHouse/issues/44334). [#53637](https://github.com/ClickHouse/ClickHouse/pull/53637) ([Kruglov Pavel](https://github.com/Avogar)).
* Use filter by file/path before reading in `url`/`file`/`hdfs` table functions. [#53529](https://github.com/ClickHouse/ClickHouse/pull/53529) ([Kruglov Pavel](https://github.com/Avogar)).
* Enable JIT compilation for AArch64, PowerPC, SystemZ, RISC-V. [#38217](https://github.com/ClickHouse/ClickHouse/pull/38217) ([Maksim Kita](https://github.com/kitaisreal)).
* Add setting `rewrite_count_distinct_if_with_count_distinct_implementation` to rewrite `countDistinctIf` with `count_distinct_implementation`. Closes [#30642](https://github.com/ClickHouse/ClickHouse/issues/30642). [#46051](https://github.com/ClickHouse/ClickHouse/pull/46051) ([flynn](https://github.com/ucasfl)).
* Process all hash sets in parallel before the merge. [#50748](https://github.com/ClickHouse/ClickHouse/pull/50748) ([Jiebin Sun](https://github.com/jiebinn)).
* Optimize aggregation performance of nullable string key when using a large number of variable length keys. [#51399](https://github.com/ClickHouse/ClickHouse/pull/51399) ([LiuNeng](https://github.com/liuneng1994)).
* Add a pass in Analyzer for time filter optimization with preimage. The performance experiments of SSB on the ICX device (Intel Xeon Platinum 8380 CPU, 80 cores, 160 threads) show that this change could bring an improvement of 8.5% to the geomean QPS when the experimental analyzer is enabled. [#52091](https://github.com/ClickHouse/ClickHouse/pull/52091) ([Zhiguo Zhou](https://github.com/ZhiguoZh)).
* Optimize the merge if all hash sets are single-level in the `uniqExact` (COUNT DISTINCT) function. [#52973](https://github.com/ClickHouse/ClickHouse/pull/52973) ([Jiebin Sun](https://github.com/jiebinn)).
* `Join` table engine: do not clone hash join data structure with all columns. [#53046](https://github.com/ClickHouse/ClickHouse/pull/53046) ([Duc Canh Le](https://github.com/canhld94)).
* Implement native `ORC` input format without the "apache arrow" library to improve performance. [#53324](https://github.com/ClickHouse/ClickHouse/pull/53324) ([李扬](https://github.com/taiyang-li)).
* The dashboard will tell the server to compress the data, which is useful for large time frames over slow internet connections. For example, one chart with 86400 points can be 1.5 MB uncompressed and 60 KB compressed with `br`. [#53569](https://github.com/ClickHouse/ClickHouse/pull/53569) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Better utilization of thread pool for BACKUPs and RESTOREs. [#53649](https://github.com/ClickHouse/ClickHouse/pull/53649) ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)).
* Load filesystem cache metadata on startup in parallel. Configured by `load_metadata_threads` (default: 1) cache config setting. Related to [#52037](https://github.com/ClickHouse/ClickHouse/issues/52037). [#52943](https://github.com/ClickHouse/ClickHouse/pull/52943) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Improve `move_primary_key_columns_to_end_of_prewhere`. [#53337](https://github.com/ClickHouse/ClickHouse/pull/53337) ([Han Fei](https://github.com/hanfei1991)).
* This optimizes the interaction with ClickHouse Keeper. Previously the caller could register the same watch callback multiple times. In that case each entry was consuming memory and the same callback was called multiple times which didn't make much sense. In order to avoid this the caller could have some logic to not add the same watch multiple times. With this change this deduplication is done internally if the watch callback is passed via shared_ptr. [#53452](https://github.com/ClickHouse/ClickHouse/pull/53452) ([Alexander Gololobov](https://github.com/davenger)).
* Cache number of rows in files for count in file/s3/url/hdfs/azure functions. The cache can be enabled/disabled by setting `use_cache_for_count_from_files` (enabled by default). Continuation of https://github.com/ClickHouse/ClickHouse/pull/53637. [#53692](https://github.com/ClickHouse/ClickHouse/pull/53692) ([Kruglov Pavel](https://github.com/Avogar)).
* More careful thread management will improve the speed of the S3 table function over a large number of files by more than ~25%. [#53668](https://github.com/ClickHouse/ClickHouse/pull/53668) ([pufit](https://github.com/pufit)).
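
The Parquet filter pushdown item above skips row groups using per-column min/max values. A minimal sketch of that idea follows, assuming a hypothetical `RowGroup` type that carries the statistics for a single integer column; it is not the ClickHouse or Parquet reader API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical row group: min/max statistics plus the column values."""
    min_value: int
    max_value: int
    rows: List[int]


def read_range(groups: List[RowGroup], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values v with lo <= v <= hi and the number of groups scanned."""
    result: List[int] = []
    scanned = 0
    for group in groups:
        # The statistics prove that no row in this group can match,
        # so the group is skipped without reading its rows.
        if group.max_value < lo or group.min_value > hi:
            continue
        scanned += 1
        result.extend(v for v in group.rows if lo <= v <= hi)
    return result, scanned


# A file roughly sorted by the column: each group covers a narrow range.
groups = [
    RowGroup(0, 9, list(range(0, 10))),
    RowGroup(10, 19, list(range(10, 20))),
    RowGroup(20, 29, list(range(20, 30))),
]
values, scanned = read_range(groups, 12, 14)
print(values, scanned)
```

When the data is roughly sorted by the filtered column, most groups fall entirely outside a short range, which is why such queries benefit the most.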
#### Improvement
* Add `stderr_reaction` configuration/setting to control the reaction (none, log or throw) when external command stderr has data. This helps make debugging external command easier. [#43210](https://github.com/ClickHouse/ClickHouse/pull/43210) ([Amos Bird](https://github.com/amosbird)).
* Add `partition` column to `system.part_log` and the merge table. [#48990](https://github.com/ClickHouse/ClickHouse/pull/48990) ([Jianfei Hu](https://github.com/incfly)).
* The sizes of the (index) uncompressed/mark, mmap and query caches can now be configured dynamically at runtime (without server restart). [#51446](https://github.com/ClickHouse/ClickHouse/pull/51446) ([Robert Schulze](https://github.com/rschu1ze)).
* If a dictionary is created with a complex key, automatically choose the "complex key" layout variant. [#49587](https://github.com/ClickHouse/ClickHouse/pull/49587) ([xiebin](https://github.com/xbthink)).
* Add setting `use_concurrency_control` for better testing of the new concurrency control feature. [#49618](https://github.com/ClickHouse/ClickHouse/pull/49618) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Added suggestions for mistyped names for databases and tables. [#49801](https://github.com/ClickHouse/ClickHouse/pull/49801) ([Yarik Briukhovetskyi](https://github.com/yariks5s)).
* Reading small files from HDFS via Gluten took more time than querying them directly with Spark; this has been improved. [#50063](https://github.com/ClickHouse/ClickHouse/pull/50063) ([KevinyhZou](https://github.com/KevinyhZou)).
* Reduced the number of worthless error logs after session expiration. [#50171](https://github.com/ClickHouse/ClickHouse/pull/50171) ([helifu](https://github.com/helifu)).
* Introduce fallback ZooKeeper sessions which are time-bound. Fixed `index` column in `system.zookeeper_connection` for DNS addresses. [#50424](https://github.com/ClickHouse/ClickHouse/pull/50424) ([Anton Kozlov](https://github.com/tonickkozlov)).
* Add ability to log when `max_partitions_per_insert_block` is reached. [#50948](https://github.com/ClickHouse/ClickHouse/pull/50948) ([Sean Haynes](https://github.com/seandhaynes)).
* Added a bunch of custom commands to `clickhouse-keeper-client` (mostly to make ClickHouse debugging easier). [#51117](https://github.com/ClickHouse/ClickHouse/pull/51117) ([pufit](https://github.com/pufit)).
* Updated check for connection string in `azureBlobStorage` table function as connection string with "sas" does not always begin with the default endpoint and updated connection URL to include "sas" token after adding Azure's container to URL. [#51141](https://github.com/ClickHouse/ClickHouse/pull/51141) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Fix description for filtering sets in the `full_sorting_merge` JOIN algorithm. [#51329](https://github.com/ClickHouse/ClickHouse/pull/51329) ([Tanay Tummalapalli](https://github.com/ttanay)).
* Fixed memory consumption in `Aggregator` when `max_block_size` is huge. [#51566](https://github.com/ClickHouse/ClickHouse/pull/51566) ([Nikita Taranov](https://github.com/nickitat)).
* Add `SYSTEM SYNC FILESYSTEM CACHE` command. It will compare in-memory state of filesystem cache with what it has on disk and fix in-memory state if needed. This is only needed if you are making manual interventions in on-disk data, which is highly discouraged. [#51622](https://github.com/ClickHouse/ClickHouse/pull/51622) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Attempt to create a generic proxy resolver for CH while keeping backwards compatibility with existing S3 storage conf proxy resolver. [#51749](https://github.com/ClickHouse/ClickHouse/pull/51749) ([Arthur Passos](https://github.com/arthurpassos)).
* Support reading tuple subcolumns from file/s3/hdfs/url/azureBlobStorage table functions. [#51806](https://github.com/ClickHouse/ClickHouse/pull/51806) ([Kruglov Pavel](https://github.com/Avogar)).
* Function `arrayIntersect` now returns the values in the order corresponding to the first argument. Closes [#27622](https://github.com/ClickHouse/ClickHouse/issues/27622). [#51850](https://github.com/ClickHouse/ClickHouse/pull/51850) ([Yarik Briukhovetskyi](https://github.com/yariks5s)).
* Add new queries that allow creating/dropping access entities in a specified access storage or moving access entities from one access storage to another. [#51912](https://github.com/ClickHouse/ClickHouse/pull/51912) ([pufit](https://github.com/pufit)).
* Make `ALTER TABLE FREEZE` queries not replicated in the Replicated database engine. [#52064](https://github.com/ClickHouse/ClickHouse/pull/52064) ([Mike Kot](https://github.com/myrrc)).
* Added possibility to flush system tables on unexpected shutdown. [#52174](https://github.com/ClickHouse/ClickHouse/pull/52174) ([Alexey Gerasimchuck](https://github.com/Demilivor)).
* Fix the case when `s3` table function refused to work with pre-signed URLs. Closes [#50846](https://github.com/ClickHouse/ClickHouse/issues/50846). [#52310](https://github.com/ClickHouse/ClickHouse/pull/52310) ([chen](https://github.com/xiedeyantu)).
* Add column `name` as an alias to `event` and `metric` in the `system.events` and `system.metrics` tables. Closes [#51257](https://github.com/ClickHouse/ClickHouse/issues/51257). [#52315](https://github.com/ClickHouse/ClickHouse/pull/52315) ([chen](https://github.com/xiedeyantu)).
* Added support of syntax `CREATE UNIQUE INDEX` in parser as a no-op for better SQL compatibility. `UNIQUE` index is not supported. Set `create_index_ignore_unique = 1` to ignore UNIQUE keyword in queries. [#52320](https://github.com/ClickHouse/ClickHouse/pull/52320) ([Ilya Yatsishin](https://github.com/qoega)).
* Add support of predefined macro (`{database}` and `{table}`) in some Kafka engine settings: topic, consumer, client_id, etc. [#52386](https://github.com/ClickHouse/ClickHouse/pull/52386) ([Yury Bogomolov](https://github.com/ybogo)).
* Disable updating the filesystem cache during backup/restore. Filesystem cache must not be updated during backup/restore, it seems it just slows down the process without any profit (because the BACKUP command can read a lot of data and it's no use to put all the data to the filesystem cache and immediately evict it). [#52402](https://github.com/ClickHouse/ClickHouse/pull/52402) ([Vitaly Baranov](https://github.com/vitlibar)).
* The configuration of an S3 endpoint allows using it from the root, and a '/' is appended automatically if needed. [#47809](https://github.com/ClickHouse/ClickHouse/issues/47809). [#52600](https://github.com/ClickHouse/ClickHouse/pull/52600) ([xiaolei565](https://github.com/xiaolei565)).
* For `clickhouse-local`, allow positional options and populate global UDF settings (`user_scripts_path` and `user_defined_executable_functions_config`). [#52643](https://github.com/ClickHouse/ClickHouse/pull/52643) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* `system.asynchronous_metrics` now includes metrics "QueryCacheEntries" and "QueryCacheBytes" to inspect the query cache. [#52650](https://github.com/ClickHouse/ClickHouse/pull/52650) ([Robert Schulze](https://github.com/rschu1ze)).
* Added possibility to use `s3_storage_class` parameter in the `SETTINGS` clause of the `BACKUP` statement for backups to S3. [#52658](https://github.com/ClickHouse/ClickHouse/pull/52658) ([Roman Vasin](https://github.com/rvasin)).
* Add utility `print-backup-info.py` which parses a backup metadata file and prints information about the backup. [#52690](https://github.com/ClickHouse/ClickHouse/pull/52690) ([Vitaly Baranov](https://github.com/vitlibar)).
* Closes [#49510](https://github.com/ClickHouse/ClickHouse/issues/49510). Currently we have database and table names case-sensitive, but BI tools query `information_schema` sometimes in lowercase, sometimes in uppercase. For this reason we have `information_schema` database, containing lowercase tables, such as `information_schema.tables` and `INFORMATION_SCHEMA` database, containing uppercase tables, such as `INFORMATION_SCHEMA.TABLES`. But some tools are querying `INFORMATION_SCHEMA.tables` and `information_schema.TABLES`. The proposed solution is to duplicate both lowercase and uppercase tables in lowercase and uppercase `information_schema` database. [#52695](https://github.com/ClickHouse/ClickHouse/pull/52695) ([Yarik Briukhovetskyi](https://github.com/yariks5s)).
* The `CHECK TABLE` query has better performance and usability (sends progress updates, cancellable). [#52745](https://github.com/ClickHouse/ClickHouse/pull/52745) ([vdimir](https://github.com/vdimir)).
* Add support for `modulo`, `intDiv`, `intDivOrZero` for tuples by distributing them across tuple's elements. [#52758](https://github.com/ClickHouse/ClickHouse/pull/52758) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Search for default `yaml` and `yml` configs in clickhouse-client after `xml`. [#52767](https://github.com/ClickHouse/ClickHouse/pull/52767) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* When merging into a non-'clickhouse' rooted configuration, configs with a different root node name are simply bypassed without an exception. [#52770](https://github.com/ClickHouse/ClickHouse/pull/52770) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Now it's possible to specify min (`memory_profiler_sample_min_allocation_size`) and max (`memory_profiler_sample_max_allocation_size`) size for allocations to be tracked with sampling memory profiler. [#52779](https://github.com/ClickHouse/ClickHouse/pull/52779) ([alesapin](https://github.com/alesapin)).
* Add `precise_float_parsing` setting to switch float parsing methods (fast/precise). [#52791](https://github.com/ClickHouse/ClickHouse/pull/52791) ([Andrey Zvonov](https://github.com/zvonand)).
* Use the same default paths for `clickhouse-keeper` (symlink) as for `clickhouse-keeper` (executable). [#52861](https://github.com/ClickHouse/ClickHouse/pull/52861) ([Vitaly Baranov](https://github.com/vitlibar)).
* Improve error message for table function `remote`. Closes [#40220](https://github.com/ClickHouse/ClickHouse/issues/40220). [#52959](https://github.com/ClickHouse/ClickHouse/pull/52959) ([jiyoungyoooo](https://github.com/jiyoungyoooo)).
* Added the possibility to specify custom storage policy in the `SETTINGS` clause of `RESTORE` queries. [#52970](https://github.com/ClickHouse/ClickHouse/pull/52970) ([Victor Krasnov](https://github.com/sirvickr)).
* Add the ability to throttle the S3 requests on backup operations (`BACKUP` and `RESTORE` commands now honor `s3_max_[get/put]_[rps/burst]`). [#52974](https://github.com/ClickHouse/ClickHouse/pull/52974) ([Daniel Pozo Escalona](https://github.com/danipozo)).
* Add settings to ignore ON CLUSTER clause in queries for management of replicated user-defined functions or access control entities with replicated storage. [#52975](https://github.com/ClickHouse/ClickHouse/pull/52975) ([Aleksei Filatov](https://github.com/aalexfvk)).
* EXPLAIN actions for JOIN step. [#53006](https://github.com/ClickHouse/ClickHouse/pull/53006) ([Maksim Kita](https://github.com/kitaisreal)).
* Make `hasTokenOrNull` and `hasTokenCaseInsensitiveOrNull` return null for empty needles. [#53059](https://github.com/ClickHouse/ClickHouse/pull/53059) ([ltrk2](https://github.com/ltrk2)).
* Allow to restrict allowed paths for filesystem caches. Mainly useful for dynamic disks. If in server config `filesystem_caches_path` is specified, all filesystem caches' paths will be restricted to this directory. E.g. if the `path` in cache config is relative - it will be put in `filesystem_caches_path`; if `path` in cache config is absolute, it will be required to lie inside `filesystem_caches_path`. If `filesystem_caches_path` is not specified in config, then behaviour will be the same as in earlier versions. [#53124](https://github.com/ClickHouse/ClickHouse/pull/53124) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Added a bunch of custom commands (mostly to make ClickHouse debugging easier). [#53127](https://github.com/ClickHouse/ClickHouse/pull/53127) ([pufit](https://github.com/pufit)).
* Add diagnostic info about file name during schema inference - it helps when you process multiple files with globs. [#53135](https://github.com/ClickHouse/ClickHouse/pull/53135) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Client will load suggestions using the main connection if the second connection is not allowed to create a session. [#53177](https://github.com/ClickHouse/ClickHouse/pull/53177) ([Alexey Gerasimchuck](https://github.com/Demilivor)).
* Add EXCEPT clause to `SYSTEM STOP/START LISTEN QUERIES [ALL/DEFAULT/CUSTOM]` query, for example `SYSTEM STOP LISTEN QUERIES ALL EXCEPT TCP, HTTP`. [#53280](https://github.com/ClickHouse/ClickHouse/pull/53280) ([Nikolay Degterinsky](https://github.com/evillique)).
* Change the default of `max_concurrent_queries` from 100 to 1000. It's ok to have many concurrent queries if they are not heavy, and mostly waiting for the network. Note: don't confuse concurrent queries and QPS: for example, ClickHouse server can do tens of thousands of QPS with less than 100 concurrent queries. [#53285](https://github.com/ClickHouse/ClickHouse/pull/53285) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Limit number of concurrent background partition optimize merges. [#53405](https://github.com/ClickHouse/ClickHouse/pull/53405) ([Duc Canh Le](https://github.com/canhld94)).
* Added a setting `allow_moving_table_directory_to_trash` that allows to ignore `Directory for table data already exists` error when replicating/recovering a `Replicated` database. [#53425](https://github.com/ClickHouse/ClickHouse/pull/53425) ([Alexander Tokmakov](https://github.com/tavplubix)).
* If server settings `asynchronous_metrics_update_period_s` and `asynchronous_heavy_metrics_update_period_s` are misconfigured to 0, it will now fail gracefully instead of terminating the application. [#53428](https://github.com/ClickHouse/ClickHouse/pull/53428) ([Robert Schulze](https://github.com/rschu1ze)).
* The ClickHouse server now respects memory limits changed via cgroups when reloading its configuration. [#53455](https://github.com/ClickHouse/ClickHouse/pull/53455) ([Robert Schulze](https://github.com/rschu1ze)).
* Add ability to turn off flush of Distributed tables on `DETACH`, `DROP`, or server shutdown. [#53501](https://github.com/ClickHouse/ClickHouse/pull/53501) ([Azat Khuzhin](https://github.com/azat)).
* The `domainRFC` function now supports IPv6 in square brackets. [#53506](https://github.com/ClickHouse/ClickHouse/pull/53506) ([Chen768959](https://github.com/Chen768959)).
* Use longer timeout for S3 CopyObject requests, which are used in backups. [#53533](https://github.com/ClickHouse/ClickHouse/pull/53533) ([Michael Kolupaev](https://github.com/al13n321)).
* Added server setting `aggregate_function_group_array_max_element_size`. This setting is used to limit array size for `groupArray` function at serialization. The default value is `16777215`. [#53550](https://github.com/ClickHouse/ClickHouse/pull/53550) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* `SCHEMA()` was added as alias for `DATABASE()` to improve MySQL compatibility. [#53587](https://github.com/ClickHouse/ClickHouse/pull/53587) ([Daniël van Eeden](https://github.com/dveeden)).
* Add asynchronous metrics about tables in the system database. For example, `TotalBytesOfMergeTreeTablesSystem`. This closes [#53603](https://github.com/ClickHouse/ClickHouse/issues/53603). [#53604](https://github.com/ClickHouse/ClickHouse/pull/53604) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* SQL editor in the Play UI and Dashboard will not use Grammarly. [#53614](https://github.com/ClickHouse/ClickHouse/pull/53614) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* As expert-level settings, it is now possible to (1) configure the size_ratio (i.e. the relative size of the protected queue) of the [index] mark/uncompressed caches, (2) configure the cache policy of the index mark and index uncompressed caches. [#53657](https://github.com/ClickHouse/ClickHouse/pull/53657) ([Robert Schulze](https://github.com/rschu1ze)).
* Added client info validation to the query packet in TCPHandler. [#53673](https://github.com/ClickHouse/ClickHouse/pull/53673) ([Alexey Gerasimchuck](https://github.com/Demilivor)).
* Retry loading parts in case of network errors while interacting with Microsoft Azure. [#53750](https://github.com/ClickHouse/ClickHouse/pull/53750) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Stacktrace for exceptions, Materialized view exceptions are propagated. [#53766](https://github.com/ClickHouse/ClickHouse/pull/53766) ([Ilya Golshtein](https://github.com/ilejn)).
* If no hostname or port were specified, keeper client will try to search for a connection string in the ClickHouse's config.xml. [#53769](https://github.com/ClickHouse/ClickHouse/pull/53769) ([pufit](https://github.com/pufit)).
* Add profile event `PartsLockMicroseconds` which shows the amount of microseconds we hold the data parts lock in MergeTree table engine family. [#53797](https://github.com/ClickHouse/ClickHouse/pull/53797) ([alesapin](https://github.com/alesapin)).
* Make the reconnect limit in RAFT configurable for Keeper. This setting can help Keeper re-establish connections with peers more quickly if the current connection is broken. [#53817](https://github.com/ClickHouse/ClickHouse/pull/53817) ([Pengyuan Bian](https://github.com/bianpengyuan)).
* Ignore foreign keys in table definitions to improve compatibility with MySQL, so that users do not need to rewrite the foreign key part of their SQL, ref [#53380](https://github.com/ClickHouse/ClickHouse/issues/53380). [#53864](https://github.com/ClickHouse/ClickHouse/pull/53864) ([jsc0218](https://github.com/jsc0218)).
#### Build/Testing/Packaging Improvement
* Don't expose symbols from ClickHouse binary to dynamic linker. It might fix [#43933](https://github.com/ClickHouse/ClickHouse/issues/43933). [#47475](https://github.com/ClickHouse/ClickHouse/pull/47475) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add `clickhouse-keeper-client` symlink to the clickhouse-server package. [#51882](https://github.com/ClickHouse/ClickHouse/pull/51882) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Add https://github.com/elliotchance/sqltest to CI to report the SQL 2016 conformance. [#52293](https://github.com/ClickHouse/ClickHouse/pull/52293) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Upgrade PRQL to 0.9.3. [#53060](https://github.com/ClickHouse/ClickHouse/pull/53060) ([Maximilian Roos](https://github.com/max-sixty)).
* System tables from CI checks are exported to ClickHouse Cloud. [#53086](https://github.com/ClickHouse/ClickHouse/pull/53086) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* The compiler's profile data (`-ftime-trace`) is uploaded to ClickHouse Cloud. [#53100](https://github.com/ClickHouse/ClickHouse/pull/53100) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Speed up Debug and Tidy builds. [#53178](https://github.com/ClickHouse/ClickHouse/pull/53178) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Speed up the build by removing tons and tonnes of garbage. One of the frequently included headers was poisoned by boost. [#53180](https://github.com/ClickHouse/ClickHouse/pull/53180) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Remove even more garbage. [#53182](https://github.com/ClickHouse/ClickHouse/pull/53182) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* The function `arrayAUC` was using heavy C++ templates - ditched them. [#53183](https://github.com/ClickHouse/ClickHouse/pull/53183) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Some translation units were always rebuilt regardless of ccache. The culprit is found and fixed. [#53184](https://github.com/ClickHouse/ClickHouse/pull/53184) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* The compiler's profile data (`-ftime-trace`) is uploaded to ClickHouse Cloud; this is the second attempt after [#53100](https://github.com/ClickHouse/ClickHouse/issues/53100). [#53213](https://github.com/ClickHouse/ClickHouse/pull/53213) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Export logs from CI in stateful tests to ClickHouse Cloud. [#53351](https://github.com/ClickHouse/ClickHouse/pull/53351) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Export logs from CI in stress tests. [#53353](https://github.com/ClickHouse/ClickHouse/pull/53353) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Export logs from CI in fuzzer. [#53354](https://github.com/ClickHouse/ClickHouse/pull/53354) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Preserve environment parameters in `clickhouse start` command. Fixes [#51962](https://github.com/ClickHouse/ClickHouse/issues/51962). [#53418](https://github.com/ClickHouse/ClickHouse/pull/53418) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Follow up for [#53418](https://github.com/ClickHouse/ClickHouse/issues/53418). Small improvements for install_check.py, adding tests for proper ENV parameters passing to the main process on `init.d start`. [#53457](https://github.com/ClickHouse/ClickHouse/pull/53457) ([Mikhail f. Shiryaev](https://github.com/Felixoid)).
* Reorganize file management in CMake to prevent potential duplications. For instance, `indexHint.cpp` is duplicated in both `dbms_sources` and `clickhouse_functions_sources`. [#53621](https://github.com/ClickHouse/ClickHouse/pull/53621) ([Amos Bird](https://github.com/amosbird)).
* Upgrade snappy to 1.1.10. [#53672](https://github.com/ClickHouse/ClickHouse/pull/53672) ([李扬](https://github.com/taiyang-li)).
* Slightly improve cmake build by sanitizing some dependencies and removing some duplicates. Each commit includes a short description of the changes made. [#53759](https://github.com/ClickHouse/ClickHouse/pull/53759) ([Amos Bird](https://github.com/amosbird)).
#### Bug Fix (user-visible misbehavior in an official stable release)
* Do not reset (experimental) Annoy index during build-up with more than one mark [#51325](https://github.com/ClickHouse/ClickHouse/pull/51325) ([Tian Xinhui](https://github.com/xinhuitian)).
* Fix usage of temporary directories during RESTORE [#51493](https://github.com/ClickHouse/ClickHouse/pull/51493) ([Azat Khuzhin](https://github.com/azat)).
* Fix binary arithmetic for Nullable(IPv4) [#51642](https://github.com/ClickHouse/ClickHouse/pull/51642) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Support IPv4 and IPv6 data types as dictionary attributes [#51756](https://github.com/ClickHouse/ClickHouse/pull/51756) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* A fix for checksum of compress marks [#51777](https://github.com/ClickHouse/ClickHouse/pull/51777) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Fix a comma being mistakenly parsed as part of a datetime in CSV best-effort parsing [#51950](https://github.com/ClickHouse/ClickHouse/pull/51950) ([Kruglov Pavel](https://github.com/Avogar)).
* Don't throw exception when executable UDF has parameters [#51961](https://github.com/ClickHouse/ClickHouse/pull/51961) ([Nikita Taranov](https://github.com/nickitat)).
* Fix recalculation of skip indexes and projections in `ALTER DELETE` queries [#52530](https://github.com/ClickHouse/ClickHouse/pull/52530) ([Anton Popov](https://github.com/CurtizJ)).
* MaterializedMySQL: Fix the infinite loop in ReadBuffer::read [#52621](https://github.com/ClickHouse/ClickHouse/pull/52621) ([Val Doroshchuk](https://github.com/valbok)).
* Load suggestion only with `clickhouse` dialect [#52628](https://github.com/ClickHouse/ClickHouse/pull/52628) ([János Benjamin Antal](https://github.com/antaljanosbenjamin)).
* Init and destroy ares channel on demand. [#52634](https://github.com/ClickHouse/ClickHouse/pull/52634) ([Arthur Passos](https://github.com/arthurpassos)).
* Fix filtering by virtual columns with OR expression [#52653](https://github.com/ClickHouse/ClickHouse/pull/52653) ([Azat Khuzhin](https://github.com/azat)).
* Fix crash in function `tuple` with one sparse column argument [#52659](https://github.com/ClickHouse/ClickHouse/pull/52659) ([Anton Popov](https://github.com/CurtizJ)).
* Fix named collections on cluster [#52687](https://github.com/ClickHouse/ClickHouse/pull/52687) ([Al Korgun](https://github.com/alkorgun)).
* Fix reading of unnecessary column in case of multistage `PREWHERE` [#52689](https://github.com/ClickHouse/ClickHouse/pull/52689) ([Anton Popov](https://github.com/CurtizJ)).
* Fix unexpected sort result on multi columns with nulls first direction [#52761](https://github.com/ClickHouse/ClickHouse/pull/52761) ([copperybean](https://github.com/copperybean)).
* Fix data race in Keeper reconfiguration [#52804](https://github.com/ClickHouse/ClickHouse/pull/52804) ([Antonio Andelic](https://github.com/antonio2368)).
* Fix sorting of sparse columns with large limit [#52827](https://github.com/ClickHouse/ClickHouse/pull/52827) ([Anton Popov](https://github.com/CurtizJ)).
* clickhouse-keeper: fix implementation of server with poll. [#52833](https://github.com/ClickHouse/ClickHouse/pull/52833) ([Andy Fiddaman](https://github.com/citrus-it)).
* Make regexp analyzer recognize named capturing groups [#52840](https://github.com/ClickHouse/ClickHouse/pull/52840) ([Han Fei](https://github.com/hanfei1991)).
* Fix possible assert in `~PushingAsyncPipelineExecutor` in clickhouse-local [#52862](https://github.com/ClickHouse/ClickHouse/pull/52862) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix reading of empty `Nested(Array(LowCardinality(...)))` [#52949](https://github.com/ClickHouse/ClickHouse/pull/52949) ([Anton Popov](https://github.com/CurtizJ)).
* Added new tests for session_log and fixed the inconsistency between login and logout. [#52958](https://github.com/ClickHouse/ClickHouse/pull/52958) ([Alexey Gerasimchuck](https://github.com/Demilivor)).
* Fix password leak in show create mysql table [#52962](https://github.com/ClickHouse/ClickHouse/pull/52962) ([Duc Canh Le](https://github.com/canhld94)).
* Convert sparse column format to full in CreateSetAndFilterOnTheFlyStep [#53000](https://github.com/ClickHouse/ClickHouse/pull/53000) ([vdimir](https://github.com/vdimir)).
* Fix rare race condition with empty key prefix directory deletion in fs cache [#53055](https://github.com/ClickHouse/ClickHouse/pull/53055) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Fix ZstdDeflatingWriteBuffer truncating the output sometimes [#53064](https://github.com/ClickHouse/ClickHouse/pull/53064) ([Michael Kolupaev](https://github.com/al13n321)).
* Fix query_id in part_log with async flush queries [#53103](https://github.com/ClickHouse/ClickHouse/pull/53103) ([Raúl Marín](https://github.com/Algunenano)).
* Fix possible error from cache "Read unexpected size" [#53121](https://github.com/ClickHouse/ClickHouse/pull/53121) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Disable the new parquet encoder [#53130](https://github.com/ClickHouse/ClickHouse/pull/53130) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix "Not-ready Set" exception [#53162](https://github.com/ClickHouse/ClickHouse/pull/53162) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix character escaping in the PostgreSQL engine [#53250](https://github.com/ClickHouse/ClickHouse/pull/53250) ([Nikolay Degterinsky](https://github.com/evillique)).
* Experimental session_log table: Added new tests for session_log and fixed the inconsistency between login and logout. [#53255](https://github.com/ClickHouse/ClickHouse/pull/53255) ([Alexey Gerasimchuck](https://github.com/Demilivor)). Fixed inconsistency between login success and logout [#53302](https://github.com/ClickHouse/ClickHouse/pull/53302) ([Alexey Gerasimchuck](https://github.com/Demilivor)).
* Fix adding sub-second intervals to DateTime [#53309](https://github.com/ClickHouse/ClickHouse/pull/53309) ([Michael Kolupaev](https://github.com/al13n321)).
* Fix "Context has expired" error in dictionaries [#53342](https://github.com/ClickHouse/ClickHouse/pull/53342) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix incorrect normal projection AST format [#53347](https://github.com/ClickHouse/ClickHouse/pull/53347) ([Amos Bird](https://github.com/amosbird)).
* Forbid use_structure_from_insertion_table_in_table_functions when executing Scalar [#53348](https://github.com/ClickHouse/ClickHouse/pull/53348) ([flynn](https://github.com/ucasfl)).
* Fix loading lazy database during system.table select query [#53372](https://github.com/ClickHouse/ClickHouse/pull/53372) ([SmitaRKulkarni](https://github.com/SmitaRKulkarni)).
* Fixed system.data_skipping_indices for MaterializedMySQL [#53381](https://github.com/ClickHouse/ClickHouse/pull/53381) ([Filipp Ozinov](https://github.com/bakwc)).
* Fix processing single carriage return in TSV file segmentation engine [#53407](https://github.com/ClickHouse/ClickHouse/pull/53407) ([Kruglov Pavel](https://github.com/Avogar)).
* Fix `Context has expired` error properly [#53433](https://github.com/ClickHouse/ClickHouse/pull/53433) ([Michael Kolupaev](https://github.com/al13n321)).
* Fix `timeout_overflow_mode` when having subquery in the rhs of IN [#53439](https://github.com/ClickHouse/ClickHouse/pull/53439) ([Duc Canh Le](https://github.com/canhld94)).
* Fix an unexpected behavior in [#53152](https://github.com/ClickHouse/ClickHouse/issues/53152) [#53440](https://github.com/ClickHouse/ClickHouse/pull/53440) ([Zhiguo Zhou](https://github.com/ZhiguoZh)).
* Fix a parse error in the JSON_QUERY function when the path consists only of digits [#53470](https://github.com/ClickHouse/ClickHouse/pull/53470) ([KevinyhZou](https://github.com/KevinyhZou)).
* Fix wrong columns order for queries with parallel FINAL. [#53489](https://github.com/ClickHouse/ClickHouse/pull/53489) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fixed SELECTing from ReplacingMergeTree with do_not_merge_across_partitions_select_final [#53511](https://github.com/ClickHouse/ClickHouse/pull/53511) ([Vasily Nemkov](https://github.com/Enmk)).
* Flush async insert queue first on shutdown [#53547](https://github.com/ClickHouse/ClickHouse/pull/53547) ([joelynch](https://github.com/joelynch)).
* Fix crash in join on sparse columns [#53548](https://github.com/ClickHouse/ClickHouse/pull/53548) ([vdimir](https://github.com/vdimir)).
* Fix possible UB in Set skipping index for functions with incorrect args [#53559](https://github.com/ClickHouse/ClickHouse/pull/53559) ([Azat Khuzhin](https://github.com/azat)).
* Fix possible UB in inverted indexes (experimental feature) [#53560](https://github.com/ClickHouse/ClickHouse/pull/53560) ([Azat Khuzhin](https://github.com/azat)).
* Fix: interpolate expression takes source column instead of same name aliased from select expression. [#53572](https://github.com/ClickHouse/ClickHouse/pull/53572) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Fix number of dropped granules in EXPLAIN PLAN index=1 [#53616](https://github.com/ClickHouse/ClickHouse/pull/53616) ([wangxiaobo](https://github.com/wzb5212)).
* Correctly handle totals and extremes with `DelayedSource` [#53644](https://github.com/ClickHouse/ClickHouse/pull/53644) ([Antonio Andelic](https://github.com/antonio2368)).
* Fix the prepared set cache getting stuck in the mutation pipeline [#53645](https://github.com/ClickHouse/ClickHouse/pull/53645) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Fix bug on mutations with subcolumns of type JSON in predicates of UPDATE and DELETE queries. [#53677](https://github.com/ClickHouse/ClickHouse/pull/53677) ([VanDarkholme7](https://github.com/VanDarkholme7)).
* Fix filter pushdown for full_sorting_merge join [#53699](https://github.com/ClickHouse/ClickHouse/pull/53699) ([vdimir](https://github.com/vdimir)).
* Try to fix bug with `NULL::LowCardinality(Nullable(...)) NOT IN` [#53706](https://github.com/ClickHouse/ClickHouse/pull/53706) ([Andrey Zvonov](https://github.com/zvonand)).
* Fix: sorted distinct with sparse columns [#53711](https://github.com/ClickHouse/ClickHouse/pull/53711) ([Igor Nikonov](https://github.com/devcrafter)).
* `transform`: correctly handle default column with multiple rows [#53742](https://github.com/ClickHouse/ClickHouse/pull/53742) ([Salvatore Mesoraca](https://github.com/aiven-sal)).
* Fix fuzzer crash in parseDateTime [#53764](https://github.com/ClickHouse/ClickHouse/pull/53764) ([Robert Schulze](https://github.com/rschu1ze)).
* MaterializedPostgreSQL: fix uncaught exception in getCreateTableQueryImpl [#53832](https://github.com/ClickHouse/ClickHouse/pull/53832) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Fix possible segfault while using PostgreSQL engine [#53847](https://github.com/ClickHouse/ClickHouse/pull/53847) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Fix named_collection_admin alias [#54066](https://github.com/ClickHouse/ClickHouse/pull/54066) ([Kseniia Sumarokova](https://github.com/kssenii)).
### <a id="237"></a> ClickHouse release 23.7, 2023-07-27
#### Backward Incompatible Change


@ -5,61 +5,166 @@
# and their names will contain a hash of the table structure,
# which allows exporting tables from servers of different versions.
# Config file contains KEY=VALUE pairs with any necessary parameters like:
# CLICKHOUSE_CI_LOGS_HOST - remote host
# CLICKHOUSE_CI_LOGS_USER - user name
# CLICKHOUSE_CI_LOGS_PASSWORD - password for user
CLICKHOUSE_CI_LOGS_CREDENTIALS=${CLICKHOUSE_CI_LOGS_CREDENTIALS:-/tmp/export-logs-config.sh}
CLICKHOUSE_CI_LOGS_USER=${CLICKHOUSE_CI_LOGS_USER:-ci}
# Pre-configured destination cluster, where to export the data
CLUSTER=${CLUSTER:=system_logs_export}
CLICKHOUSE_CI_LOGS_CLUSTER=${CLICKHOUSE_CI_LOGS_CLUSTER:-system_logs_export}
EXTRA_COLUMNS=${EXTRA_COLUMNS:="pull_request_number UInt32, commit_sha String, check_start_time DateTime, check_name LowCardinality(String), instance_type LowCardinality(String), "}
EXTRA_COLUMNS_EXPRESSION=${EXTRA_COLUMNS_EXPRESSION:="0 AS pull_request_number, '' AS commit_sha, now() AS check_start_time, '' AS check_name, '' AS instance_type"}
EXTRA_ORDER_BY_COLUMNS=${EXTRA_ORDER_BY_COLUMNS:="check_name, "}
EXTRA_COLUMNS=${EXTRA_COLUMNS:-"pull_request_number UInt32, commit_sha String, check_start_time DateTime, check_name LowCardinality(String), instance_type LowCardinality(String), "}
EXTRA_COLUMNS_EXPRESSION=${EXTRA_COLUMNS_EXPRESSION:-"0 AS pull_request_number, '' AS commit_sha, now() AS check_start_time, '' AS check_name, '' AS instance_type"}
EXTRA_ORDER_BY_COLUMNS=${EXTRA_ORDER_BY_COLUMNS:-"check_name, "}
CONNECTION_PARAMETERS=${CONNECTION_PARAMETERS:=""}
function __set_connection_args
{
# It's impossible to use a generic $CONNECTION_ARGS string: it's unsafe from a word-splitting perspective.
# That's why we must stick to the generated array
CONNECTION_ARGS=(
--receive_timeout=10 --send_timeout=10 --secure
--user "${CLICKHOUSE_CI_LOGS_USER}" --host "${CLICKHOUSE_CI_LOGS_HOST}"
--password "${CLICKHOUSE_CI_LOGS_PASSWORD}"
)
}
# Create all configured system logs:
clickhouse-client --query "SYSTEM FLUSH LOGS"
function __shadow_credentials
{
# The function completely screws the output, it shouldn't be used in normal functions, only in ()
# The only way to substitute the env as a plain text is using perl 's/\Qsomething\E/another/
exec &> >(perl -pe '
s(\Q$ENV{CLICKHOUSE_CI_LOGS_HOST}\E)[CLICKHOUSE_CI_LOGS_HOST]g;
s(\Q$ENV{CLICKHOUSE_CI_LOGS_USER}\E)[CLICKHOUSE_CI_LOGS_USER]g;
s(\Q$ENV{CLICKHOUSE_CI_LOGS_PASSWORD}\E)[CLICKHOUSE_CI_LOGS_PASSWORD]g;
')
}
# It doesn't make sense to try creating tables if SYNC fails
echo "SYSTEM SYNC DATABASE REPLICA default" | clickhouse-client --receive_timeout 180 $CONNECTION_PARAMETERS || exit 0
function check_logs_credentials
(
# The function connects with given credentials, and if it's unable to execute the simplest query, returns exit code
# For each system log table:
clickhouse-client --query "SHOW TABLES FROM system LIKE '%\\_log'" | while read -r table
do
# Calculate hash of its structure:
hash=$(clickhouse-client --query "
SELECT sipHash64(groupArray((name, type)))
FROM (SELECT name, type FROM system.columns
WHERE database = 'system' AND table = '$table'
ORDER BY position)
")
# First check, if all necessary parameters are set
set +x
for parameter in CLICKHOUSE_CI_LOGS_HOST CLICKHOUSE_CI_LOGS_USER CLICKHOUSE_CI_LOGS_PASSWORD; do
export -p | grep -q "$parameter" || {
echo "Credentials parameter $parameter is unset"
return 1
}
done
# Create the destination table with adapted name and structure:
statement=$(clickhouse-client --format TSVRaw --query "SHOW CREATE TABLE system.${table}" | sed -r -e '
s/^\($/('"$EXTRA_COLUMNS"'/;
s/ORDER BY \(/ORDER BY ('"$EXTRA_ORDER_BY_COLUMNS"'/;
s/^CREATE TABLE system\.\w+_log$/CREATE TABLE IF NOT EXISTS '"$table"'_'"$hash"'/;
/^TTL /d
')
__shadow_credentials
__set_connection_args
local code
# Catch both success and error to not fail on `set -e`
clickhouse-client "${CONNECTION_ARGS[@]}" -q 'SELECT 1 FORMAT Null' && return 0 || code=$?
if [ "$code" != 0 ]; then
echo 'Failed to connect to CI Logs cluster'
return $code
fi
)
echo "Creating destination table ${table}_${hash}" >&2
function config_logs_export_cluster
(
# The function is launched in a separate shell instance to not expose the
# exported values from CLICKHOUSE_CI_LOGS_CREDENTIALS
set +x
if ! [ -r "${CLICKHOUSE_CI_LOGS_CREDENTIALS}" ]; then
echo "File $CLICKHOUSE_CI_LOGS_CREDENTIALS does not exist, do not setup"
return
fi
set -a
# shellcheck disable=SC1090
source "${CLICKHOUSE_CI_LOGS_CREDENTIALS}"
set +a
__shadow_credentials
echo "Checking if the credentials work"
check_logs_credentials || return 0
cluster_config="${1:-/etc/clickhouse-server/config.d/system_logs_export.yaml}"
mkdir -p "$(dirname "$cluster_config")"
echo "remote_servers:
${CLICKHOUSE_CI_LOGS_CLUSTER}:
shard:
replica:
secure: 1
user: '${CLICKHOUSE_CI_LOGS_USER}'
host: '${CLICKHOUSE_CI_LOGS_HOST}'
port: 9440
password: '${CLICKHOUSE_CI_LOGS_PASSWORD}'
" > "$cluster_config"
echo "Cluster ${CLICKHOUSE_CI_LOGS_CLUSTER} is configured in ${cluster_config}"
)
echo "$statement" | clickhouse-client --distributed_ddl_task_timeout=10 --receive_timeout=10 --send_timeout=10 $CONNECTION_PARAMETERS || continue
function setup_logs_replication
(
# The function is launched in a separate shell instance to not expose the
# exported values from CLICKHOUSE_CI_LOGS_CREDENTIALS
set +x
# disable output
if ! [ -r "${CLICKHOUSE_CI_LOGS_CREDENTIALS}" ]; then
echo "File $CLICKHOUSE_CI_LOGS_CREDENTIALS does not exist, do not setup"
return 0
fi
set -a
# shellcheck disable=SC1090
source "${CLICKHOUSE_CI_LOGS_CREDENTIALS}"
set +a
__shadow_credentials
echo "Checking if the credentials work"
check_logs_credentials || return 0
__set_connection_args
echo "Creating table system.${table}_sender" >&2
echo 'Create all configured system logs'
clickhouse-client --query "SYSTEM FLUSH LOGS"
# Create Distributed table and materialized view to watch on the original table:
clickhouse-client --query "
CREATE TABLE system.${table}_sender
ENGINE = Distributed(${CLUSTER}, default, ${table}_${hash})
SETTINGS flush_on_detach=0
EMPTY AS
SELECT ${EXTRA_COLUMNS_EXPRESSION}, *
FROM system.${table}
"
# It doesn't make sense to try creating tables if SYNC fails
echo "SYSTEM SYNC DATABASE REPLICA default" | clickhouse-client --receive_timeout 180 "${CONNECTION_ARGS[@]}" || return 0
echo "Creating materialized view system.${table}_watcher" >&2
# For each system log table:
echo 'Create %_log tables'
clickhouse-client --query "SHOW TABLES FROM system LIKE '%\\_log'" | while read -r table
do
# Calculate hash of its structure:
hash=$(clickhouse-client --query "
SELECT sipHash64(groupArray((name, type)))
FROM (SELECT name, type FROM system.columns
WHERE database = 'system' AND table = '$table'
ORDER BY position)
")
clickhouse-client --query "
CREATE MATERIALIZED VIEW system.${table}_watcher TO system.${table}_sender AS
SELECT ${EXTRA_COLUMNS_EXPRESSION}, *
FROM system.${table}
"
done
# Create the destination table with adapted name and structure:
statement=$(clickhouse-client --format TSVRaw --query "SHOW CREATE TABLE system.${table}" | sed -r -e '
s/^\($/('"$EXTRA_COLUMNS"'/;
s/ORDER BY \(/ORDER BY ('"$EXTRA_ORDER_BY_COLUMNS"'/;
s/^CREATE TABLE system\.\w+_log$/CREATE TABLE IF NOT EXISTS '"$table"'_'"$hash"'/;
/^TTL /d
')
echo -e "Creating remote destination table ${table}_${hash} with statement:\n${statement}" >&2
echo "$statement" | clickhouse-client --database_replicated_initial_query_timeout_sec=10 \
--distributed_ddl_task_timeout=10 --receive_timeout=10 --send_timeout=10 \
"${CONNECTION_ARGS[@]}" || continue
echo "Creating table system.${table}_sender" >&2
# Create Distributed table and materialized view to watch on the original table:
clickhouse-client --query "
CREATE TABLE system.${table}_sender
ENGINE = Distributed(${CLICKHOUSE_CI_LOGS_CLUSTER}, default, ${table}_${hash})
SETTINGS flush_on_detach=0
EMPTY AS
SELECT ${EXTRA_COLUMNS_EXPRESSION}, *
FROM system.${table}
" || continue
echo "Creating materialized view system.${table}_watcher" >&2
clickhouse-client --query "
CREATE MATERIALIZED VIEW system.${table}_watcher TO system.${table}_sender AS
SELECT ${EXTRA_COLUMNS_EXPRESSION}, *
FROM system.${table}
" || continue
done
)
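
`__shadow_credentials` above pipes everything the shell prints through perl and replaces each credential with its variable name; `\Q…\E` makes the value match literally rather than as a regular expression. A minimal sketch of the same idea follows. It swaps in bash's own quoted-pattern substitution instead of perl to stay dependency-free, and `mask_secret`, `DEMO_HOST`, and the sample value are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: replace every literal occurrence of a secret read from
# stdin with a placeholder. Quoting "$secret" inside ${line//...} makes bash
# treat it as plain text, not as a glob pattern (the role \Q...\E plays in perl).
mask_secret() {
    local secret=$1 placeholder=$2 line
    while IFS= read -r line; do
        printf '%s\n' "${line//"$secret"/$placeholder}"
    done
}

# Assumed sample value that contains a glob metacharacter.
DEMO_HOST='a.b*c'

# The literal value is masked.
masked=$(echo "connecting to a.b*c:9440" | mask_secret "$DEMO_HOST" DEMO_HOST)
# A string that would only match if '*' were treated as a wildcard is left alone.
untouched=$(echo "connecting to a.bXXc:9440" | mask_secret "$DEMO_HOST" DEMO_HOST)

echo "$masked"
echo "$untouched"
```

Matching the value literally matters because hosts and passwords routinely contain characters such as `.`, `*`, or `[` that a pattern matcher would otherwise interpret.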


@ -1,6 +1,8 @@
#!/bin/bash
# shellcheck disable=SC2086,SC2001,SC2046,SC2030,SC2031,SC2010,SC2015
# shellcheck disable=SC1091
source /setup_export_logs.sh
set -x
# core.COMM.PID-TID
@ -123,22 +125,7 @@ EOL
</clickhouse>
EOL
# Setup a cluster for logs export to ClickHouse Cloud
# Note: these variables are provided to the Docker run command by the Python script in tests/ci
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
echo "
remote_servers:
system_logs_export:
shard:
replica:
secure: 1
user: ci
host: '${CLICKHOUSE_CI_LOGS_HOST}'
port: 9440
password: '${CLICKHOUSE_CI_LOGS_PASSWORD}'
" > db/config.d/system_logs_export.yaml
fi
config_logs_export_cluster db/config.d/system_logs_export.yaml
}
function filter_exists_and_template
@ -242,20 +229,7 @@ quit
kill -0 $server_pid # This checks that it is our server that is started and not some other one
echo 'Server started and responded'
# Initialize export of system logs to ClickHouse Cloud
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
export EXTRA_COLUMNS_EXPRESSION="$PR_TO_TEST AS pull_request_number, '$SHA_TO_TEST' AS commit_sha, '$CHECK_START_TIME' AS check_start_time, '$CHECK_NAME' AS check_name, '$INSTANCE_TYPE' AS instance_type"
# TODO: Check if the password will appear in the logs.
export CONNECTION_PARAMETERS="--secure --user ci --host ${CLICKHOUSE_CI_LOGS_HOST} --password ${CLICKHOUSE_CI_LOGS_PASSWORD}"
/setup_export_logs.sh
# Unset variables after use
export CONNECTION_PARAMETERS=''
export CLICKHOUSE_CI_LOGS_HOST=''
export CLICKHOUSE_CI_LOGS_PASSWORD=''
fi
setup_logs_replication
# SC2012: Use find instead of ls to better handle non-alphanumeric filenames. They are all alphanumeric.
# SC2046: Quote this to prevent word splitting. Actually I need word splitting.
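
The removed block passed the connection options as a single `CONNECTION_PARAMETERS` string that was later expanded without quotes, which breaks as soon as a value contains whitespace; `__set_connection_args` in the sourced script builds a bash array instead. A small sketch of the difference, assuming a hypothetical password with a space in it (`count_args` is an illustrative helper, not part of the scripts):

```shell
#!/bin/bash
# Illustrative helper: print how many arguments it received.
count_args() { echo "$#"; }

# Assumed sample value; real credentials come from the CI environment.
password='two words'

# Old style: one flat string, expanded unquoted so that it splits into words.
args_string="--user ci --password $password"
# New style: one array element per argument.
args_array=(--user ci --password "$password")

# shellcheck disable=SC2086
string_count=$(count_args $args_string)
array_count=$(count_args "${args_array[@]}")

echo "string: $string_count arguments, array: $array_count arguments"
```

With the string, the password is split into two separate arguments; with the array, it arrives as one.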

View File

@ -6,7 +6,7 @@ RUN apt-get install -y wget openjdk-8-jre
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz && \
tar -xf hadoop-3.1.0.tar.gz && rm -rf hadoop-3.1.0.tar.gz
RUN wget https://dlcdn.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz && \
RUN wget https://apache.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz && \
tar -xf apache-hive-2.3.9-bin.tar.gz && rm -rf apache-hive-2.3.9-bin.tar.gz
RUN apt install -y vim

View File

@ -103,7 +103,7 @@ RUN python3 -m pip install --no-cache-dir \
urllib3
# Hudi supports only spark 3.3.*, not 3.4
RUN curl -fsSL -O https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz \
RUN curl -fsSL -O https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz \
&& tar xzvf spark-3.3.2-bin-hadoop3.tgz -C / \
&& rm spark-3.3.2-bin-hadoop3.tgz

View File

@ -1,5 +1,7 @@
#!/bin/bash
# shellcheck disable=SC1091
source /setup_export_logs.sh
set -e -x
# Choose random timezone for this test run
@ -20,21 +22,7 @@ ln -s /usr/share/clickhouse-test/clickhouse-test /usr/bin/clickhouse-test
azurite-blob --blobHost 0.0.0.0 --blobPort 10000 --debug /azurite_log &
./setup_minio.sh stateful
# Setup a cluster for logs export to ClickHouse Cloud
# Note: these variables are provided to the Docker run command by the Python script in tests/ci
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
echo "
remote_servers:
system_logs_export:
shard:
replica:
secure: 1
user: ci
host: '${CLICKHOUSE_CI_LOGS_HOST}'
password: '${CLICKHOUSE_CI_LOGS_PASSWORD}'
" > /etc/clickhouse-server/config.d/system_logs_export.yaml
fi
config_logs_export_cluster /etc/clickhouse-server/config.d/system_logs_export.yaml
function start()
{
@ -82,20 +70,7 @@ function start()
start
# Initialize export of system logs to ClickHouse Cloud
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
export EXTRA_COLUMNS_EXPRESSION="$PULL_REQUEST_NUMBER AS pull_request_number, '$COMMIT_SHA' AS commit_sha, '$CHECK_START_TIME' AS check_start_time, '$CHECK_NAME' AS check_name, '$INSTANCE_TYPE' AS instance_type"
# TODO: Check if the password will appear in the logs.
export CONNECTION_PARAMETERS="--secure --user ci --host ${CLICKHOUSE_CI_LOGS_HOST} --password ${CLICKHOUSE_CI_LOGS_PASSWORD}"
./setup_export_logs.sh
# Unset variables after use
export CONNECTION_PARAMETERS=''
export CLICKHOUSE_CI_LOGS_HOST=''
export CLICKHOUSE_CI_LOGS_PASSWORD=''
fi
setup_logs_replication
# shellcheck disable=SC2086 # No quotes because I want to split it into words.
/s3downloader --url-prefix "$S3_URL" --dataset-names $DATASETS

View File

@ -73,7 +73,7 @@ RUN arch=${TARGETARCH:-amd64} \
&& chmod +x ./mc ./minio
RUN wget --no-verbose 'https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz' \
RUN wget --no-verbose 'https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz' \
&& tar -xvf hadoop-3.3.1.tar.gz \
&& rm -rf hadoop-3.3.1.tar.gz

View File

@ -1,5 +1,8 @@
#!/bin/bash
# shellcheck disable=SC1091
source /setup_export_logs.sh
# fail on errors, verbose and export all env variables
set -e -x -a
@ -36,21 +39,7 @@ fi
./setup_minio.sh stateless
./setup_hdfs_minicluster.sh
# Setup a cluster for logs export to ClickHouse Cloud
# Note: these variables are provided to the Docker run command by the Python script in tests/ci
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
echo "
remote_servers:
system_logs_export:
shard:
replica:
secure: 1
user: ci
host: '${CLICKHOUSE_CI_LOGS_HOST}'
password: '${CLICKHOUSE_CI_LOGS_PASSWORD}'
" > /etc/clickhouse-server/config.d/system_logs_export.yaml
fi
config_logs_export_cluster /etc/clickhouse-server/config.d/system_logs_export.yaml
# For flaky check we also enable thread fuzzer
if [ "$NUM_TRIES" -gt "1" ]; then
@ -116,20 +105,7 @@ do
sleep 1
done
# Initialize export of system logs to ClickHouse Cloud
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
export EXTRA_COLUMNS_EXPRESSION="$PULL_REQUEST_NUMBER AS pull_request_number, '$COMMIT_SHA' AS commit_sha, '$CHECK_START_TIME' AS check_start_time, '$CHECK_NAME' AS check_name, '$INSTANCE_TYPE' AS instance_type"
# TODO: Check if the password will appear in the logs.
export CONNECTION_PARAMETERS="--secure --user ci --host ${CLICKHOUSE_CI_LOGS_HOST} --password ${CLICKHOUSE_CI_LOGS_PASSWORD}"
./setup_export_logs.sh
# Unset variables after use
export CONNECTION_PARAMETERS=''
export CLICKHOUSE_CI_LOGS_HOST=''
export CLICKHOUSE_CI_LOGS_PASSWORD=''
fi
setup_logs_replication
attach_gdb_to_clickhouse || true # FIXME: to not break old builds, clean on 2023-09-01

View File

@ -5,6 +5,8 @@
# Avoid overlaps with previous runs
dmesg --clear
# shellcheck disable=SC1091
source /setup_export_logs.sh
set -x
@ -51,38 +53,11 @@ configure
azurite-blob --blobHost 0.0.0.0 --blobPort 10000 --debug /azurite_log &
./setup_minio.sh stateless # to have a proper environment
# Setup a cluster for logs export to ClickHouse Cloud
# Note: these variables are provided to the Docker run command by the Python script in tests/ci
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
echo "
remote_servers:
system_logs_export:
shard:
replica:
secure: 1
user: ci
host: '${CLICKHOUSE_CI_LOGS_HOST}'
password: '${CLICKHOUSE_CI_LOGS_PASSWORD}'
" > /etc/clickhouse-server/config.d/system_logs_export.yaml
fi
config_logs_export_cluster /etc/clickhouse-server/config.d/system_logs_export.yaml
start
# Initialize export of system logs to ClickHouse Cloud
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
export EXTRA_COLUMNS_EXPRESSION="$PULL_REQUEST_NUMBER AS pull_request_number, '$COMMIT_SHA' AS commit_sha, '$CHECK_START_TIME' AS check_start_time, '$CHECK_NAME' AS check_name, '$INSTANCE_TYPE' AS instance_type"
# TODO: Check if the password will appear in the logs.
export CONNECTION_PARAMETERS="--secure --user ci --host ${CLICKHOUSE_CI_LOGS_HOST} --password ${CLICKHOUSE_CI_LOGS_PASSWORD}"
./setup_export_logs.sh
# Unset variables after use
export CONNECTION_PARAMETERS=''
export CLICKHOUSE_CI_LOGS_HOST=''
export CLICKHOUSE_CI_LOGS_PASSWORD=''
fi
setup_logs_replication
# shellcheck disable=SC2086 # No quotes because I want to split it into words.
/s3downloader --url-prefix "$S3_URL" --dataset-names $DATASETS

View File

@ -39,7 +39,7 @@ CREATE TABLE s3_queue_engine_table (name String, value UInt32)
CREATE TABLE s3queue_engine_table (name String, value UInt32)
ENGINE=S3Queue('https://clickhouse-public-datasets.s3.amazonaws.com/my-test-bucket-768/*', 'CSV', 'gzip')
SETTINGS
mode = 'ordred';
mode = 'ordered';
```
Using named collections:
@ -60,7 +60,7 @@ Using named collections:
CREATE TABLE s3queue_engine_table (name String, value UInt32)
ENGINE=S3Queue(s3queue_conf, format = 'CSV', compression_method = 'gzip')
SETTINGS
mode = 'ordred';
mode = 'ordered';
```
## Settings {#s3queue-settings}
@ -188,7 +188,7 @@ Example:
CREATE TABLE s3queue_engine_table (name String, value UInt32)
ENGINE=S3Queue('https://clickhouse-public-datasets.s3.amazonaws.com/my-test-bucket-768/*', 'CSV', 'gzip')
SETTINGS
mode = 'unordred',
mode = 'unordered',
keeper_path = '/clickhouse/s3queue/';
CREATE TABLE stats (name String, value UInt32)

View File

@ -10,6 +10,10 @@ ClickHouse supports the MySQL wire protocol. This allow tools that are MySQL-com
## Enabling the MySQL Interface On ClickHouse Cloud
:::note
The MySQL interface for ClickHouse Cloud is currently in private preview. Please contact support@clickhouse.com to enable this feature for your ClickHouse Cloud service.
:::
1. After creating your ClickHouse Cloud Service, on the credentials screen, select the MySQL tab
![Credentials screen - Prompt](./images/mysql1.png)

View File

@ -65,7 +65,7 @@ XML substitution example:
Substitutions can also be performed from ZooKeeper. To do this, specify the attribute `from_zk = "/path/to/node"`. The element value is replaced with the contents of the node at `/path/to/node` in ZooKeeper. You can also put an entire XML subtree on the ZooKeeper node and it will be fully inserted into the source element.
## Encrypting Configuration {#encryption}
## Encrypting and Hiding Configuration {#encryption}
You can use symmetric encryption to encrypt a configuration element, for example, a password field. To do so, first configure the [encryption codec](../sql-reference/statements/create/table.md#encryption-codecs), then add attribute `encrypted_by` with the name of the encryption codec as value to the element to encrypt.
@ -102,6 +102,21 @@ Example:
961F000000040000000000EEDDEF4F453CFE6457C4234BD7C09258BD651D85
```
Even with encryption applied, elements are still saved in plain text in the preprocessed file. If this is a problem, there are two alternatives: set the file permissions of the preprocessed file to 600, or use the `hide_in_preprocessed` attribute.
Example:
```xml
<clickhouse>
<interserver_http_credentials hide_in_preprocessed="true">
<user>admin</user>
<password>secret</password>
</interserver_http_credentials>
</clickhouse>
```
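The effect of `hide_in_preprocessed` can be illustrated with a minimal Python sketch. It is not the server's implementation: the function names are hypothetical, and the set of attribute values treated as true is an assumption (the server's own boolean parser may accept other spellings). The idea it shows is that flagged elements are removed from a copy of the tree that is written to disk, while the tree used by the running configuration is left intact.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: only these spellings count as true in this sketch.
TRUTHY = {"true", "1", "yes", "on"}


def hide_recursive(parent: ET.Element) -> None:
    """Remove every child flagged hide_in_preprocessed, recursing into the rest."""
    for child in list(parent):  # iterate over a copy because children are removed
        if child.get("hide_in_preprocessed", "").lower() in TRUTHY:
            parent.remove(child)
        else:
            hide_recursive(child)


def hide_elements(root: ET.Element) -> ET.Element:
    """Return a stripped deep copy; the tree passed in is not modified."""
    stripped = copy.deepcopy(root)
    hide_recursive(stripped)
    return stripped


config = ET.fromstring(
    "<clickhouse>"
    "<interserver_http_credentials hide_in_preprocessed='true'>"
    "<user>admin</user><password>secret</password>"
    "</interserver_http_credentials>"
    "<path>/var/lib/clickhouse/</path>"
    "</clickhouse>"
)
preprocessed = hide_elements(config)
print(ET.tostring(preprocessed, encoding="unicode"))
```

Working on a copy matters here: the in-memory configuration still needs the credentials, only the file saved to disk should omit them.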
## User Settings {#user-settings}
The `config.xml` file can specify a separate config with user settings, profiles, and quotas. The relative path to this config is set in the `users_config` element. By default, it is `users.xml`. If `users_config` is omitted, the user settings, profiles, and quotas are specified directly in `config.xml`.

View File

@ -4643,3 +4643,19 @@ SELECT toFloat64('1.7091'), toFloat64('1.5008753E7') SETTINGS precise_float_pars
│ 1.7091 │ 15008753 │
└─────────────────────┴──────────────────────────┘
```
## validate_tcp_client_information {#validate-tcp-client-information}
Determines whether validation of client information is enabled when a query packet is received from a client over a TCP connection.
If `true`, an exception will be thrown on invalid client information from the TCP client.
If `false`, the data will not be validated. The server will work with clients of all versions.
The default value is `false`.
**Example**
``` xml
<validate_tcp_client_information>true</validate_tcp_client_information>
```

View File

@ -30,7 +30,7 @@ curl https://clickhouse.com/ | sh
The binary you just downloaded can run all sorts of ClickHouse tools and utilities. If you want to run ClickHouse as a database server, check out the [Quick Start](../../quick-start.mdx).
:::
## Query data in a CSV file using SQL
## Query data in a file using SQL {#query_data_in_file}
A common use of `clickhouse-local` is to run ad-hoc queries on files, without having to insert the data into a table. `clickhouse-local` can stream the data from a file into a temporary table and execute your SQL.
@ -57,6 +57,19 @@ The `file` table function creates a table, and you can use `DESCRIBE` to see the
./clickhouse local -q "DESCRIBE file('reviews.tsv')"
```
:::tip
You can use globs in the file name (see [glob substitutions](/docs/en/sql-reference/table-functions/file.md/#globs-in-path)).
Examples:
```bash
./clickhouse local -q "SELECT * FROM 'reviews*.jsonl'"
./clickhouse local -q "SELECT * FROM 'review_?.csv'"
./clickhouse local -q "SELECT * FROM 'review_{1..3}.csv'"
```
:::
```response
marketplace Nullable(String)
customer_id Nullable(Int64)

View File

@ -85,7 +85,7 @@ $ cat /etc/clickhouse-server/users.d/alice.xml
The server tracks changes in the configuration files, as well as in the files and ZooKeeper nodes that were used when performing substitutions and overrides, and reloads the user and cluster settings on the fly. This means you can modify clusters, users, and their settings without restarting the server.
## Encryption {#encryption}
## Encryption and Hiding {#encryption}
You can use symmetric encryption to encrypt a configuration element, for example, a password field. To do so, first configure the [encryption codec](../sql-reference/statements/create/table.md#encryption-codecs), then add the attribute `encrypted_by`, with the name of the encryption codec as its value, to the element to encrypt.
@ -122,6 +122,21 @@ $ cat /etc/clickhouse-server/users.d/alice.xml
961F000000040000000000EEDDEF4F453CFE6457C4234BD7C09258BD651D85
```
Even with encryption applied, elements are still saved unencrypted in the preprocessed file. If this is a problem, there are two alternatives: set the permissions of the preprocessed file to 600, or use the `hide_in_preprocessed` attribute.
Example:
```xml
<clickhouse>
<interserver_http_credentials hide_in_preprocessed="true">
<user>admin</user>
<password>secret</password>
</interserver_http_credentials>
</clickhouse>
```
## Examples of Writing Configuration in YAML {#example}
Here you can see an example of a real configuration written in YAML: [config.yaml.example](https://github.com/ClickHouse/ClickHouse/blob/master/programs/server/config.yaml.example).

View File

@ -1923,3 +1923,19 @@ ClickHouse использует ZooKeeper для хранения метадан
- Positive integer.
Default value: `10000`.
## validate_tcp_client_information {#validate-tcp-client-information}
Determines whether validation of client information is enabled when a query is received from a client over a TCP connection.
If `true`, an exception will be thrown on invalid client information.
If `false`, the data will not be validated. The server will work with clients of all versions.
Default value: `false`.
**Example**
``` xml
<validate_tcp_client_information>true</validate_tcp_client_information>
```

View File

@ -110,3 +110,42 @@ Read 186 rows, 4.15 KiB in 0.035 sec., 5302 rows/sec., 118.34 KiB/sec.
├──────────┼──────────┤
...
```
## Query data in a file using SQL {#query_data_in_file}
A common use of `clickhouse-local` is to run ad-hoc queries on files, without having to insert the data into a table. `clickhouse-local` can stream the data from a file into a temporary table and execute your SQL.
If the file is on the same machine as `clickhouse-local`, you can simply specify the file to load. The following `reviews.tsv` file contains a sample of Amazon product reviews:
```bash
./clickhouse local -q "SELECT * FROM 'reviews.tsv'"
```
This command is a shortcut for:
```bash
./clickhouse local -q "SELECT * FROM file('reviews.tsv')"
```
ClickHouse knows from the file name extension that the file uses a tab-separated format. If you need to specify the format explicitly, simply add one of the [many ClickHouse input formats](../../interfaces/formats.md):
```bash
./clickhouse local -q "SELECT * FROM file('reviews.tsv', 'TabSeparated')"
```
The `file` table function creates a table, and you can use `DESCRIBE` to see the inferred schema:
```bash
./clickhouse local -q "DESCRIBE file('reviews.tsv')"
```
:::tip
You can use [glob substitutions](/docs/ru/sql-reference/table-functions/file.md/#globs-in-path) in the file name.
Examples:
```bash
./clickhouse local -q "SELECT * FROM 'reviews*.jsonl'"
./clickhouse local -q "SELECT * FROM 'review_?.csv'"
./clickhouse local -q "SELECT * FROM 'review_{1..3}.csv'"
```

View File

@ -419,6 +419,8 @@
<!-- Cache size in elements for compiled expressions.-->
<compiled_expression_cache_elements_size>10000</compiled_expression_cache_elements_size>
<validate_tcp_client_information>false</validate_tcp_client_information>
<!-- Path to data directory, with trailing slash. -->
<path>/var/lib/clickhouse/</path>

View File

@ -289,7 +289,7 @@ namespace
}
bool access_management = config.getBool(user_config + ".access_management", false);
bool named_collection_control = config.getBool(user_config + ".named_collection_control", false);
bool named_collection_control = config.getBool(user_config + ".named_collection_control", false) || config.getBool(user_config + ".named_collection_admin", false);
bool show_named_collections_secrets = config.getBool(user_config + ".show_named_collections_secrets", false);
if (grant_queries)

View File

@ -34,6 +34,7 @@ public:
static void parallelizeMergePrepare(const std::vector<UniqExactSet *> & data_vec, ThreadPool & thread_pool)
{
unsigned long single_level_set_num = 0;
unsigned long all_single_hash_size = 0;
for (auto ele : data_vec)
{
@ -41,7 +42,17 @@ public:
single_level_set_num ++;
}
if (single_level_set_num > 0 && single_level_set_num < data_vec.size())
if (single_level_set_num == data_vec.size())
{
for (auto ele : data_vec)
all_single_hash_size += ele->size();
}
/// If the hash tables are a mix of singleLevel and twoLevel, or are all singleLevel with an average size larger than 6000, they can be converted into
/// twoLevel hash tables in parallel and then merged. Please refer to the following PRs for more details.
/// https://github.com/ClickHouse/ClickHouse/pull/50748
/// https://github.com/ClickHouse/ClickHouse/pull/52973
if ((single_level_set_num > 0 && single_level_set_num < data_vec.size()) || ((all_single_hash_size/data_vec.size()) > 6000))
{
try
{
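The new condition can be read as a small decision function. The following Python sketch restates it under stated assumptions: `UniqSet` and `should_convert_in_parallel` are hypothetical names, and the empty-input guard is an addition of the sketch, not something present in the C++ code.

```python
from dataclasses import dataclass

# Average-size threshold taken from the C++ condition.
AVG_SIZE_THRESHOLD = 6000


@dataclass
class UniqSet:
    """Hypothetical stand-in for UniqExactSet: only what the decision needs."""
    size: int
    two_level: bool


def should_convert_in_parallel(sets: list[UniqSet]) -> bool:
    if not sets:  # guard added in this sketch to avoid dividing by zero
        return False
    single_level = sum(1 for s in sets if not s.two_level)
    # Case 1: some, but not all, of the sets are single-level.
    mixed = 0 < single_level < len(sets)
    # Case 2: all sets are single-level; the total size is only summed then.
    total_single_size = sum(s.size for s in sets) if single_level == len(sets) else 0
    # Integer division mirrors the unsigned division in the C++ condition.
    return mixed or (total_single_size // len(sets)) > AVG_SIZE_THRESHOLD
```

Because the average is an integer division compared with a strict `>`, an average of exactly 6000 does not trigger the parallel conversion.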

View File

@ -39,6 +39,7 @@ public:
std::optional<UUID> backup_uuid;
bool deduplicate_files = true;
bool allow_s3_native_copy = true;
bool use_same_s3_credentials_for_base_backup = false;
ReadSettings read_settings;
WriteSettings write_settings;
};

View File

@ -77,12 +77,14 @@ namespace
BackupImpl::BackupImpl(
const String & backup_name_for_logging_,
const BackupInfo & backup_info_,
const ArchiveParams & archive_params_,
const std::optional<BackupInfo> & base_backup_info_,
std::shared_ptr<IBackupReader> reader_,
const ContextPtr & context_)
: backup_name_for_logging(backup_name_for_logging_)
const ContextPtr & context_,
bool use_same_s3_credentials_for_base_backup_)
: backup_info(backup_info_)
, backup_name_for_logging(backup_info.toStringForLogging())
, use_archive(!archive_params_.archive_name.empty())
, archive_params(archive_params_)
, open_mode(OpenMode::READ)
@ -90,13 +92,14 @@ BackupImpl::BackupImpl(
, is_internal_backup(false)
, version(INITIAL_BACKUP_VERSION)
, base_backup_info(base_backup_info_)
, use_same_s3_credentials_for_base_backup(use_same_s3_credentials_for_base_backup_)
{
open(context_);
}
BackupImpl::BackupImpl(
const String & backup_name_for_logging_,
const BackupInfo & backup_info_,
const ArchiveParams & archive_params_,
const std::optional<BackupInfo> & base_backup_info_,
std::shared_ptr<IBackupWriter> writer_,
@ -104,8 +107,10 @@ BackupImpl::BackupImpl(
bool is_internal_backup_,
const std::shared_ptr<IBackupCoordination> & coordination_,
const std::optional<UUID> & backup_uuid_,
bool deduplicate_files_)
: backup_name_for_logging(backup_name_for_logging_)
bool deduplicate_files_,
bool use_same_s3_credentials_for_base_backup_)
: backup_info(backup_info_)
, backup_name_for_logging(backup_info.toStringForLogging())
, use_archive(!archive_params_.archive_name.empty())
, archive_params(archive_params_)
, open_mode(OpenMode::WRITE)
@ -116,6 +121,7 @@ BackupImpl::BackupImpl(
, version(CURRENT_BACKUP_VERSION)
, base_backup_info(base_backup_info_)
, deduplicate_files(deduplicate_files_)
, use_same_s3_credentials_for_base_backup(use_same_s3_credentials_for_base_backup_)
, log(&Poco::Logger::get("BackupImpl"))
{
open(context_);
@ -162,10 +168,16 @@ void BackupImpl::open(const ContextPtr & context)
if (base_backup_info)
{
if (use_same_s3_credentials_for_base_backup)
backup_info.copyS3CredentialsTo(*base_backup_info);
BackupFactory::CreateParams params;
params.backup_info = *base_backup_info;
params.open_mode = OpenMode::READ;
params.context = context;
/// use_same_s3_credentials_for_base_backup should be inherited for base backups
params.use_same_s3_credentials_for_base_backup = use_same_s3_credentials_for_base_backup;
base_backup = BackupFactory::instance().createBackup(params);
if (open_mode == OpenMode::WRITE)

View File

@ -35,14 +35,15 @@ public:
};
BackupImpl(
const String & backup_name_for_logging_,
const BackupInfo & backup_info_,
const ArchiveParams & archive_params_,
const std::optional<BackupInfo> & base_backup_info_,
std::shared_ptr<IBackupReader> reader_,
const ContextPtr & context_);
const ContextPtr & context_,
bool use_same_s3_credentials_for_base_backup_);
BackupImpl(
const String & backup_name_for_logging_,
const BackupInfo & backup_info_,
const ArchiveParams & archive_params_,
const std::optional<BackupInfo> & base_backup_info_,
std::shared_ptr<IBackupWriter> writer_,
@ -50,7 +51,8 @@ public:
bool is_internal_backup_,
const std::shared_ptr<IBackupCoordination> & coordination_,
const std::optional<UUID> & backup_uuid_,
bool deduplicate_files_);
bool deduplicate_files_,
bool use_same_s3_credentials_for_base_backup_);
~BackupImpl() override;
@ -109,6 +111,7 @@ private:
std::unique_ptr<SeekableReadBuffer> readFileImpl(const SizeAndChecksum & size_and_checksum, bool read_encrypted) const;
BackupInfo backup_info;
const String backup_name_for_logging;
const bool use_archive;
const ArchiveParams archive_params;
@ -145,6 +148,7 @@ private:
bool writing_finalized = false;
bool deduplicate_files = true;
bool use_same_s3_credentials_for_base_backup = false;
const Poco::Logger * log;
};

View File

@ -98,4 +98,23 @@ String BackupInfo::toStringForLogging() const
return toAST()->formatForLogging();
}
void BackupInfo::copyS3CredentialsTo(BackupInfo & dest) const
{
/// named_collection case: credentials cannot be copied, so it is rejected
if (!dest.id_arg.empty() || !id_arg.empty())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "use_same_s3_credentials_for_base_backup is not compatible with named_collections");
if (backup_engine_name != "S3")
throw Exception(ErrorCodes::BAD_ARGUMENTS, "use_same_s3_credentials_for_base_backup supported only for S3, got {}", toStringForLogging());
if (dest.backup_engine_name != "S3")
throw Exception(ErrorCodes::BAD_ARGUMENTS, "use_same_s3_credentials_for_base_backup supported only for S3, got {}", dest.toStringForLogging());
if (args.size() != 3)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "use_same_s3_credentials_for_base_backup requires access_key_id, secret_access_key, got {}", toStringForLogging());
auto & dest_args = dest.args;
dest_args.resize(3);
dest_args[1] = args[1];
dest_args[2] = args[2];
}
}
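As an illustration of what `copyS3CredentialsTo` does, here is a minimal Python sketch. The `BackupInfo` class below is a hypothetical, trimmed-down mirror of the C++ struct; the argument layout `[url, access_key_id, secret_access_key]` is inferred from the error message in the C++ code, and the empty-string padding value is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class BackupInfo:
    """Hypothetical mirror of the C++ struct, reduced to the fields used here."""
    backup_engine_name: str
    id_arg: str = ""  # non-empty when a named collection is used
    args: list[str] = field(default_factory=list)  # assumed: [url, key id, secret]

    def copy_s3_credentials_to(self, dest: "BackupInfo") -> None:
        if self.id_arg or dest.id_arg:
            raise ValueError("not compatible with named collections")
        if self.backup_engine_name != "S3" or dest.backup_engine_name != "S3":
            raise ValueError("supported only for S3")
        if len(self.args) != 3:
            raise ValueError("source must carry access_key_id and secret_access_key")
        # Pad or truncate the destination to three arguments,
        # then overwrite the two credential slots.
        dest.args = (dest.args + ["", "", ""])[:3]
        dest.args[1] = self.args[1]
        dest.args[2] = self.args[2]


src = BackupInfo("S3", args=["https://bucket.example/backups/2", "KEY", "SECRET"])
base = BackupInfo("S3", args=["https://bucket.example/backups/1"])
src.copy_s3_credentials_to(base)
print(base.args)
```

The destination keeps its own URL; only the two credential arguments are taken from the source.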

View File

@ -23,6 +23,8 @@ struct BackupInfo
static BackupInfo fromAST(const IAST & ast);
String toStringForLogging() const;
void copyS3CredentialsTo(BackupInfo & dest) const;
};
}

View File

@ -27,6 +27,7 @@ namespace ErrorCodes
M(Bool, decrypt_files_from_encrypted_disks) \
M(Bool, deduplicate_files) \
M(Bool, allow_s3_native_copy) \
M(Bool, use_same_s3_credentials_for_base_backup) \
M(Bool, read_from_filesystem_cache) \
M(UInt64, shard_num) \
M(UInt64, replica_num) \

View File

@ -44,6 +44,9 @@ struct BackupSettings
/// Whether native copy is allowed (optimization for cloud storages, that sometimes could have bugs)
bool allow_s3_native_copy = true;
/// Whether base backup to S3 should inherit credentials from the BACKUP query.
bool use_same_s3_credentials_for_base_backup = false;
/// Allow to use the filesystem cache in passive mode - benefit from the existing cache entries,
/// but don't put more entries into the cache.
bool read_from_filesystem_cache = true;

View File

@ -386,6 +386,7 @@ void BackupsWorker::doBackup(
backup_create_params.backup_uuid = backup_settings.backup_uuid;
backup_create_params.deduplicate_files = backup_settings.deduplicate_files;
backup_create_params.allow_s3_native_copy = backup_settings.allow_s3_native_copy;
backup_create_params.use_same_s3_credentials_for_base_backup = backup_settings.use_same_s3_credentials_for_base_backup;
backup_create_params.read_settings = getReadSettingsForBackup(context, backup_settings);
backup_create_params.write_settings = getWriteSettingsForBackup(context);
BackupMutablePtr backup = BackupFactory::instance().createBackup(backup_create_params);
@ -693,6 +694,7 @@ void BackupsWorker::doRestore(
backup_open_params.base_backup_info = restore_settings.base_backup_info;
backup_open_params.password = restore_settings.password;
backup_open_params.allow_s3_native_copy = restore_settings.allow_s3_native_copy;
backup_open_params.use_same_s3_credentials_for_base_backup = restore_settings.use_same_s3_credentials_for_base_backup;
backup_open_params.read_settings = getReadSettingsForRestore(context);
backup_open_params.write_settings = getWriteSettingsForRestore(context);
BackupPtr backup = BackupFactory::instance().createBackup(backup_open_params);

View File

@ -163,6 +163,7 @@ namespace
M(Bool, allow_unresolved_access_dependencies) \
M(RestoreUDFCreationMode, create_function) \
M(Bool, allow_s3_native_copy) \
M(Bool, use_same_s3_credentials_for_base_backup) \
M(Bool, internal) \
M(String, host_id) \
M(OptionalString, storage_policy) \

View File

@ -110,6 +110,9 @@ struct RestoreSettings
/// Whether native copy is allowed (optimization for cloud storages, that sometimes could have bugs)
bool allow_s3_native_copy = true;
/// Whether base backup from S3 should inherit credentials from the RESTORE query.
bool use_same_s3_credentials_for_base_backup = false;
/// Internal, should not be specified by user.
bool internal = false;

View File

@ -47,7 +47,6 @@ void registerBackupEngineS3(BackupFactory & factory)
auto creator_fn = []([[maybe_unused]] const BackupFactory::CreateParams & params) -> std::unique_ptr<IBackup>
{
#if USE_AWS_S3
String backup_name_for_logging = params.backup_info.toStringForLogging();
const String & id_arg = params.backup_info.id_arg;
const auto & args = params.backup_info.args;
@ -115,7 +114,13 @@ void registerBackupEngineS3(BackupFactory & factory)
params.write_settings,
params.context);
return std::make_unique<BackupImpl>(backup_name_for_logging, archive_params, params.base_backup_info, reader, params.context);
return std::make_unique<BackupImpl>(
params.backup_info,
archive_params,
params.base_backup_info,
reader,
params.context,
params.use_same_s3_credentials_for_base_backup);
}
else
{
@ -129,7 +134,7 @@ void registerBackupEngineS3(BackupFactory & factory)
params.context);
return std::make_unique<BackupImpl>(
backup_name_for_logging,
params.backup_info,
archive_params,
params.base_backup_info,
writer,
@ -137,7 +142,8 @@ void registerBackupEngineS3(BackupFactory & factory)
params.is_internal_backup,
params.backup_coordination,
params.backup_uuid,
params.deduplicate_files);
params.deduplicate_files,
params.use_same_s3_credentials_for_base_backup);
}
#else
throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "S3 support is disabled");

View File

@ -103,7 +103,6 @@ void registerBackupEnginesFileAndDisk(BackupFactory & factory)
{
auto creator_fn = [](const BackupFactory::CreateParams & params) -> std::unique_ptr<IBackup>
{
String backup_name_for_logging = params.backup_info.toStringForLogging();
const String & engine_name = params.backup_info.backup_engine_name;
if (!params.backup_info.id_arg.empty())
@ -172,7 +171,13 @@ void registerBackupEnginesFileAndDisk(BackupFactory & factory)
reader = std::make_shared<BackupReaderFile>(path, params.read_settings, params.write_settings);
else
reader = std::make_shared<BackupReaderDisk>(disk, path, params.read_settings, params.write_settings);
return std::make_unique<BackupImpl>(backup_name_for_logging, archive_params, params.base_backup_info, reader, params.context);
return std::make_unique<BackupImpl>(
params.backup_info,
archive_params,
params.base_backup_info,
reader,
params.context,
params.use_same_s3_credentials_for_base_backup);
}
else
{
@ -182,7 +187,7 @@ void registerBackupEnginesFileAndDisk(BackupFactory & factory)
else
writer = std::make_shared<BackupWriterDisk>(disk, path, params.read_settings, params.write_settings);
return std::make_unique<BackupImpl>(
backup_name_for_logging,
params.backup_info,
archive_params,
params.base_backup_info,
writer,
@ -190,7 +195,8 @@ void registerBackupEnginesFileAndDisk(BackupFactory & factory)
params.is_internal_backup,
params.backup_coordination,
params.backup_uuid,
params.deduplicate_files);
params.deduplicate_files,
params.use_same_s3_credentials_for_base_backup);
}
};

View File

@ -15,6 +15,7 @@
#include <Poco/DOM/Comment.h>
#include <Poco/XML/XMLWriter.h>
#include <Poco/Util/XMLConfiguration.h>
#include <Poco/NumberParser.h>
#include <Common/ZooKeeper/ZooKeeperNodeCache.h>
#include <Common/ZooKeeper/KeeperException.h>
#include <Common/StringUtils/StringUtils.h>
@ -254,6 +255,25 @@ void ConfigProcessor::decryptRecursive(Poco::XML::Node * config_root)
#endif
void ConfigProcessor::hideRecursive(Poco::XML::Node * config_root)
{
for (Node * node = config_root->firstChild(); node;)
{
Node * next_node = node->nextSibling();
if (node->nodeType() == Node::ELEMENT_NODE)
{
Element & element = dynamic_cast<Element &>(*node);
if (element.hasAttribute("hide_in_preprocessed") && Poco::NumberParser::parseBool(element.getAttribute("hide_in_preprocessed")))
{
config_root->removeChild(node);
} else
hideRecursive(node);
}
node = next_node;
}
}
void ConfigProcessor::mergeRecursive(XMLDocumentPtr config, Node * config_root, const Node * with_root)
{
const NodeListPtr with_nodes = with_root->childNodes();
@ -792,6 +812,24 @@ void ConfigProcessor::decryptEncryptedElements(LoadedConfig & loaded_config)
#endif
XMLDocumentPtr ConfigProcessor::hideElements(XMLDocumentPtr xml_tree)
{
/// Create a copy of the XML document, because hiding elements in the preprocessed_xml document
/// would also affect the configuration, which holds a pointer to that document.
XMLDocumentPtr xml_tree_copy = new Poco::XML::Document;
for (Node * node = xml_tree->firstChild(); node; node = node->nextSibling())
{
NodePtr new_node = xml_tree_copy->importNode(node, true);
xml_tree_copy->appendChild(new_node);
}
Node * new_config_root = getRootNode(xml_tree_copy.get());
hideRecursive(new_config_root);
return xml_tree_copy;
}
void ConfigProcessor::savePreprocessedConfig(LoadedConfig & loaded_config, std::string preprocessed_dir)
{
try
@ -840,7 +878,8 @@ void ConfigProcessor::savePreprocessedConfig(LoadedConfig & loaded_config, std::
writer.setNewLine("\n");
writer.setIndent(" ");
writer.setOptions(Poco::XML::XMLWriter::PRETTY_PRINT);
writer.writeNode(preprocessed_path, loaded_config.preprocessed_xml);
XMLDocumentPtr preprocessed_xml_without_hidden_elements = hideElements(loaded_config.preprocessed_xml);
writer.writeNode(preprocessed_path, preprocessed_xml_without_hidden_elements);
LOG_DEBUG(log, "Saved preprocessed configuration to '{}'.", preprocessed_path);
}
catch (Poco::Exception & e)

View File

@ -142,6 +142,9 @@ private:
void decryptEncryptedElements(LoadedConfig & loaded_config);
#endif
void hideRecursive(Poco::XML::Node * config_root);
XMLDocumentPtr hideElements(XMLDocumentPtr xml_tree);
void mergeRecursive(XMLDocumentPtr config, Poco::XML::Node * config_root, const Poco::XML::Node * with_root);
/// If config root node name is not 'clickhouse' and merging config's root node names doesn't match, bypasses merging and returns false.

View File

@ -318,3 +318,8 @@ inline void trim(std::string & str, char c = ' ')
trimRight(str, c);
trimLeft(str, c);
}
constexpr bool containsGlobs(const std::string & str)
{
return str.find_first_of("*?{") != std::string::npos;
}
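A minimal Python sketch of the same check, together with how `DatabaseFilesystem::checkTableFilePath` uses it in this commit to skip the existence test for glob paths. The Python names are hypothetical, and the sketch leaves out the `user_files_path` access check that the C++ function also performs.

```python
import os

# Characters that mark the start of a glob pattern, as in the C++ helper.
GLOB_CHARS = "*?{"


def contains_globs(path: str) -> bool:
    """True if the path contains any character that starts a glob pattern."""
    return any(ch in GLOB_CHARS for ch in path)


def check_table_file_path(path: str, throw_on_error: bool = True) -> bool:
    """Existence checks only apply to plain paths; glob paths are resolved later."""
    if not contains_globs(path):
        if not os.path.exists(path):
            if throw_on_error:
                raise FileNotFoundError(f"File does not exist: {path}")
            return False
        if not os.path.isfile(path):
            if throw_on_error:
                raise IsADirectoryError(f"File is directory, but expected a file: {path}")
            return False
    return True
```

A path such as `reviews*.jsonl` therefore passes this check even though no file has that literal name; matching the pattern against real files is left to the `file` table function.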

View File

@ -96,7 +96,8 @@ namespace DB
M(UInt64, total_memory_profiler_step, 0, "Whenever server memory usage becomes larger than every next step in number of bytes the memory profiler will collect the allocating stack trace. Zero means disabled memory profiler. Values lower than a few megabytes will slow down server.", 0) \
M(Double, total_memory_tracker_sample_probability, 0, "Collect random allocations and deallocations and write them into system.trace_log with 'MemorySample' trace_type. The probability is for every alloc/free regardless to the size of the allocation (can be changed with `memory_profiler_sample_min_allocation_size` and `memory_profiler_sample_max_allocation_size`). Note that sampling happens only when the amount of untracked memory exceeds 'max_untracked_memory'. You may want to set 'max_untracked_memory' to 0 for extra fine grained sampling.", 0) \
M(UInt64, total_memory_profiler_sample_min_allocation_size, 0, "Collect random allocations of size greater or equal than specified value with probability equal to `total_memory_profiler_sample_probability`. 0 means disabled. You may want to set 'max_untracked_memory' to 0 to make this threshold to work as expected.", 0) \
M(UInt64, total_memory_profiler_sample_max_allocation_size, 0, "Collect random allocations of size less or equal than specified value with probability equal to `total_memory_profiler_sample_probability`. 0 means disabled. You may want to set 'max_untracked_memory' to 0 to make this threshold to work as expected.", 0)
M(UInt64, total_memory_profiler_sample_max_allocation_size, 0, "Collect random allocations of size less or equal than specified value with probability equal to `total_memory_profiler_sample_probability`. 0 means disabled. You may want to set 'max_untracked_memory' to 0 to make this threshold to work as expected.", 0) \
M(Bool, validate_tcp_client_information, false, "Validate client_information in the query packet over the native TCP protocol.", 0)
DECLARE_SETTINGS_TRAITS(ServerSettingsTraits, SERVER_SETTINGS)

View File

@ -81,22 +81,24 @@ bool DatabaseFilesystem::checkTableFilePath(const std::string & table_path, Cont
throw Exception(ErrorCodes::PATH_ACCESS_DENIED, "File is not inside {}", user_files_path);
}
/// Check if the corresponding file exists.
if (!fs::exists(table_path))
if (!containsGlobs(table_path))
{
if (throw_on_error)
throw Exception(ErrorCodes::FILE_DOESNT_EXIST, "File does not exist: {}", table_path);
else
return false;
}
/// Check if the corresponding file exists.
if (!fs::exists(table_path))
{
if (throw_on_error)
throw Exception(ErrorCodes::FILE_DOESNT_EXIST, "File does not exist: {}", table_path);
else
return false;
}
if (!fs::is_regular_file(table_path))
{
if (throw_on_error)
throw Exception(ErrorCodes::FILE_DOESNT_EXIST,
"File is directory, but expected a file: {}", table_path);
else
return false;
if (!fs::is_regular_file(table_path))
{
if (throw_on_error)
throw Exception(ErrorCodes::FILE_DOESNT_EXIST, "File is directory, but expected a file: {}", table_path);
else
return false;
}
}
return true;
@ -141,19 +143,18 @@ StoragePtr DatabaseFilesystem::getTableImpl(const String & name, ContextPtr cont
if (!checkTableFilePath(table_path, context_, throw_on_error))
return {};
String format = FormatFactory::instance().getFormatFromFileName(table_path, throw_on_error);
auto format = FormatFactory::instance().getFormatFromFileName(table_path, throw_on_error);
if (format.empty())
return {};
/// If the file exists, create a new table using TableFunctionFile and return it.
auto args = makeASTFunction("file", std::make_shared<ASTLiteral>(table_path), std::make_shared<ASTLiteral>(format));
auto ast_function_ptr = makeASTFunction("file", std::make_shared<ASTLiteral>(table_path), std::make_shared<ASTLiteral>(format));
auto table_function = TableFunctionFactory::instance().get(args, context_);
auto table_function = TableFunctionFactory::instance().get(ast_function_ptr, context_);
if (!table_function)
return nullptr;
/// TableFunctionFile throws exceptions, if table cannot be created.
auto table_storage = table_function->execute(args, context_, name);
auto table_storage = table_function->execute(ast_function_ptr, context_, name);
if (table_storage)
addTable(name, table_storage);

View File

@ -231,6 +231,8 @@ public:
String getFileName() const override { return handle.getFileName(); }
size_t getFileSize() override { return handle.getFileInfo().uncompressed_size; }
Handle releaseHandle() &&
{
return std::move(handle);

View File

@ -312,6 +312,8 @@ public:
String getFileName() const override { return handle.getFileName(); }
size_t getFileSize() override { return handle.getFileInfo().uncompressed_size; }
/// Releases owned handle to pass it to an enumerator.
HandleHolder releaseHandle() &&
{

View File

@ -226,6 +226,8 @@ KeyMetadataPtr CacheMetadata::getKeyMetadata(
if (it == end())
{
if (key_not_found_policy == KeyNotFoundPolicy::THROW)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "No such key `{}` in cache", key);
else if (key_not_found_policy == KeyNotFoundPolicy::THROW_LOGICAL)
throw Exception(ErrorCodes::LOGICAL_ERROR, "No such key `{}` in cache", key);
else if (key_not_found_policy == KeyNotFoundPolicy::RETURN_NULL)
return nullptr;

View File

@ -213,6 +213,10 @@ String ClientInfo::getVersionStr() const
return std::format("{}.{}.{} ({})", client_version_major, client_version_minor, client_version_patch, client_tcp_protocol_version);
}
VersionNumber ClientInfo::getVersionNumber() const
{
return VersionNumber(client_version_major, client_version_minor, client_version_patch);
}
void ClientInfo::fillOSUserHostNameAndVersionInfo()
{

View File

@ -4,6 +4,7 @@
#include <Poco/Net/SocketAddress.h>
#include <base/types.h>
#include <Common/OpenTelemetryTraceContext.h>
#include <Common/VersionNumber.h>
#include <boost/algorithm/string/trim.hpp>
namespace DB
@ -137,6 +138,7 @@ public:
bool clientVersionEquals(const ClientInfo & other, bool compare_patch) const;
String getVersionStr() const;
VersionNumber getVersionNumber() const;
private:
void fillOSUserHostNameAndVersionInfo();

View File

@ -865,6 +865,10 @@ public:
if (!ParserKeyword("FROM").ignore(test_pos, test_expected))
return true;
// If there is a comma after 'from' then the first one was a name of a column
if (test_pos->type == TokenType::Comma)
return true;
/// If we parse a second FROM then the first one was a name of a column
if (ParserKeyword("FROM").ignore(test_pos, test_expected))
return true;

View File

@ -35,6 +35,7 @@
#include <Storages/MergeTree/MergeTreeDataPartUUID.h>
#include <Storages/StorageS3Cluster.h>
#include <Core/ExternalTable.h>
#include <Core/ServerSettings.h>
#include <Access/AccessControl.h>
#include <Access/Credentials.h>
#include <DataTypes/DataTypeLowCardinality.h>
@ -114,6 +115,20 @@ NameToNameMap convertToQueryParameters(const Settings & passed_params)
return query_parameters;
}
// This function corrects the wrong client_name from the old client.
// Old clients (versions up to and including 23.8.1, as checked below) sent a different ClientInfo.client_name in each message:
// "ClickHouse client" was sent with the hello message.
// "ClickHouse" or "ClickHouse " was sent with the query message.
void correctQueryClientInfo(const ClientInfo & session_client_info, ClientInfo & client_info)
{
if (client_info.getVersionNumber() <= VersionNumber(23, 8, 1) &&
session_client_info.client_name == "ClickHouse client" &&
(client_info.client_name == "ClickHouse" || client_info.client_name == "ClickHouse "))
{
client_info.client_name = "ClickHouse client";
}
}
void validateClientInfo(const ClientInfo & session_client_info, const ClientInfo & client_info)
{
// Secondary query may contain different client_info.
@ -1532,7 +1547,11 @@ void TCPHandler::receiveQuery()
if (client_tcp_protocol_version >= DBMS_MIN_REVISION_WITH_CLIENT_INFO)
{
client_info.read(*in, client_tcp_protocol_version);
validateClientInfo(session->getClientInfo(), client_info);
correctQueryClientInfo(session->getClientInfo(), client_info);
const auto & config_ref = Context::getGlobalContextInstance()->getServerSettings();
if (config_ref.validate_tcp_client_information)
validateClientInfo(session->getClientInfo(), client_info);
}
/// Per query settings are also passed via TCP.
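
The client name correction above can be sketched outside the server as follows. This is a minimal illustration only: `Version`, `ClientInfoSketch` and `correctQueryClientName` are hypothetical stand-ins for `VersionNumber`, `ClientInfo` and `correctQueryClientInfo`, and the sketch assumes the version comparison is lexicographic over (major, minor, patch).

```cpp
#include <string>
#include <tuple>

// Hypothetical stand-in for DB::VersionNumber: (major, minor, patch), compared lexicographically.
using Version = std::tuple<int, int, int>;

// Hypothetical stand-in holding only the two ClientInfo fields the correction inspects.
struct ClientInfoSketch
{
    std::string client_name;
    Version version;
};

// Rewrites the name sent with the query message when an old client sent the short form,
// provided the hello message of the same session carried the full name.
void correctQueryClientName(const ClientInfoSketch & session_info, ClientInfoSketch & query_info)
{
    const bool old_client = query_info.version <= Version{23, 8, 1};
    const bool hello_name_is_full = session_info.client_name == "ClickHouse client";
    const bool query_name_is_short
        = query_info.client_name == "ClickHouse" || query_info.client_name == "ClickHouse ";

    if (old_client && hello_name_is_full && query_name_is_short)
        query_info.client_name = "ClickHouse client";
}
```

With this shape, a query from a 23.8.1 client named "ClickHouse " is renamed before validation, while a newer client that sends a mismatching name is left untouched and is therefore still caught by the validation step when it is enabled.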

View File

@ -18,7 +18,9 @@ namespace CurrentMetrics
namespace DB
{
struct AsyncBlockIDsCache::Cache : public std::unordered_set<String>
template <typename TStorage>
struct AsyncBlockIDsCache<TStorage>::Cache : public std::unordered_set<String>
{
CurrentMetrics::Increment cache_size_increment;
explicit Cache(std::unordered_set<String> && set_)
@ -27,7 +29,8 @@ struct AsyncBlockIDsCache::Cache : public std::unordered_set<String>
{}
};
std::vector<String> AsyncBlockIDsCache::getChildren()
template <typename TStorage>
std::vector<String> AsyncBlockIDsCache<TStorage>::getChildren()
{
auto zookeeper = storage.getZooKeeper();
@ -50,7 +53,8 @@ std::vector<String> AsyncBlockIDsCache::getChildren()
return children;
}
void AsyncBlockIDsCache::update()
template <typename TStorage>
void AsyncBlockIDsCache<TStorage>::update()
try
{
std::vector<String> paths = getChildren();
@ -73,24 +77,27 @@ catch (...)
task->scheduleAfter(update_min_interval.count());
}
AsyncBlockIDsCache::AsyncBlockIDsCache(StorageReplicatedMergeTree & storage_)
template <typename TStorage>
AsyncBlockIDsCache<TStorage>::AsyncBlockIDsCache(TStorage & storage_)
: storage(storage_),
update_min_interval(storage.getSettings()->async_block_ids_cache_min_update_interval_ms),
path(storage.zookeeper_path + "/async_blocks"),
path(storage.getZooKeeperPath() + "/async_blocks"),
log_name(storage.getStorageID().getFullTableName() + " (AsyncBlockIDsCache)"),
log(&Poco::Logger::get(log_name))
{
task = storage.getContext()->getSchedulePool().createTask(log_name, [this]{ update(); });
}
void AsyncBlockIDsCache::start()
template <typename TStorage>
void AsyncBlockIDsCache<TStorage>::start()
{
if (storage.getSettings()->use_async_block_ids_cache)
task->activateAndSchedule();
}
/// The caller keeps the version from its last call. When the caller calls again, it waits until it gets a newer version.
Strings AsyncBlockIDsCache::detectConflicts(const Strings & paths, UInt64 & last_version)
template <typename TStorage>
Strings AsyncBlockIDsCache<TStorage>::detectConflicts(const Strings & paths, UInt64 & last_version)
{
if (!storage.getSettings()->use_async_block_ids_cache)
return {};
@ -128,4 +135,6 @@ Strings AsyncBlockIDsCache::detectConflicts(const Strings & paths, UInt64 & last
return conflicts;
}
template class AsyncBlockIDsCache<StorageReplicatedMergeTree>;
}

View File

@ -8,8 +8,7 @@
namespace DB
{
class StorageReplicatedMergeTree;
template <typename TStorage>
class AsyncBlockIDsCache
{
struct Cache;
@ -20,7 +19,7 @@ class AsyncBlockIDsCache
void update();
public:
explicit AsyncBlockIDsCache(StorageReplicatedMergeTree & storage_);
explicit AsyncBlockIDsCache(TStorage & storage_);
void start();
@ -30,7 +29,7 @@ public:
private:
StorageReplicatedMergeTree & storage;
TStorage & storage;
std::atomic<std::chrono::steady_clock::time_point> last_updatetime;
const std::chrono::milliseconds update_min_interval;
@ -48,6 +47,4 @@ private:
Poco::Logger * log;
};
using AsyncBlockIDsCachePtr = std::shared_ptr<AsyncBlockIDsCache>;
}

View File

@ -0,0 +1,150 @@
#include <Storages/MergeTree/InsertBlockInfo.h>
namespace DB
{
namespace ErrorCodes
{
extern const int LOGICAL_ERROR;
}
AsyncInsertBlockInfo::AsyncInsertBlockInfo(
Poco::Logger * log_,
std::vector<std::string> && block_id_,
BlockWithPartition && block_,
std::optional<BlockWithPartition> && unmerged_block_with_partition_)
: log(log_)
, block_id(std::move(block_id_))
, block_with_partition(std::move(block_))
, unmerged_block_with_partition(std::move(unmerged_block_with_partition_))
{
initBlockIDMap();
}
void AsyncInsertBlockInfo::initBlockIDMap()
{
block_id_to_offset_idx.clear();
for (size_t i = 0; i < block_id.size(); ++i)
{
block_id_to_offset_idx[block_id[i]].push_back(i);
}
}
/// This function checks whether the block contains duplicate inserts.
/// If so, we keep only one insert out of each group of duplicates.
bool AsyncInsertBlockInfo::filterSelfDuplicate()
{
std::vector<String> dup_block_ids;
for (const auto & [hash_id, offset_indexes] : block_id_to_offset_idx)
{
/// More than one insert has the same hash id; in this case, we should keep only one of them.
if (offset_indexes.size() > 1)
dup_block_ids.push_back(hash_id);
}
if (dup_block_ids.empty())
return false;
filterBlockDuplicate(dup_block_ids, true);
return true;
}
/// Remove the conflicting parts of the block so that it can be written again.
void AsyncInsertBlockInfo::filterBlockDuplicate(const std::vector<String> & block_paths, bool self_dedup)
{
auto * current_block_with_partition = unmerged_block_with_partition.has_value() ? &unmerged_block_with_partition.value() : &block_with_partition;
std::vector<size_t> offset_idx;
for (const auto & raw_path : block_paths)
{
std::filesystem::path p(raw_path);
String conflict_block_id = p.filename();
auto it = block_id_to_offset_idx.find(conflict_block_id);
if (it == block_id_to_offset_idx.end())
throw Exception(ErrorCodes::LOGICAL_ERROR, "Unknown conflict path {}", conflict_block_id);
/// If this filter is for self_dedup, the block paths were selected by `filterSelfDuplicate`, i.e. this is a self purge.
/// In this case we don't know whether ZooKeeper already has this insert, so we keep one copy to avoid losing it.
offset_idx.insert(std::end(offset_idx), std::begin(it->second) + self_dedup, std::end(it->second));
}
std::sort(offset_idx.begin(), offset_idx.end());
auto & offsets = current_block_with_partition->offsets;
size_t idx = 0, remove_count = 0;
auto it = offset_idx.begin();
std::vector<size_t> new_offsets;
std::vector<String> new_block_ids;
/// construct filter
size_t rows = current_block_with_partition->block.rows();
auto filter_col = ColumnUInt8::create(rows, 1u);
ColumnUInt8::Container & vec = filter_col->getData();
UInt8 * pos = vec.data();
for (auto & offset : offsets)
{
if (it != offset_idx.end() && *it == idx)
{
size_t start_pos = idx > 0 ? offsets[idx - 1] : 0;
size_t end_pos = offset;
remove_count += end_pos - start_pos;
while (start_pos < end_pos)
{
*(pos + start_pos) = 0;
start_pos++;
}
it++;
}
else
{
new_offsets.push_back(offset - remove_count);
new_block_ids.push_back(block_id[idx]);
}
idx++;
}
LOG_TRACE(log, "New block IDs: {}, new offsets: {}, size: {}", toString(new_block_ids), toString(new_offsets), new_offsets.size());
current_block_with_partition->offsets = std::move(new_offsets);
block_id = std::move(new_block_ids);
auto cols = current_block_with_partition->block.getColumns();
for (auto & col : cols)
{
col = col->filter(vec, rows - remove_count);
}
current_block_with_partition->block.setColumns(cols);
LOG_TRACE(log, "New block rows {}", current_block_with_partition->block.rows());
initBlockIDMap();
if (unmerged_block_with_partition.has_value())
block_with_partition.block = unmerged_block_with_partition->block;
}
std::vector<String> AsyncInsertBlockInfo::getHashesForBlocks(BlockWithPartition & block, String partition_id)
{
size_t start = 0;
auto cols = block.block.getColumns();
std::vector<String> block_id_vec;
for (size_t i = 0; i < block.offsets.size(); ++i)
{
size_t offset = block.offsets[i];
std::string_view token = block.tokens[i];
if (token.empty())
{
SipHash hash;
for (size_t j = start; j < offset; ++j)
{
for (const auto & col : cols)
col->updateHashWithValue(j, hash);
}
const auto hash_value = hash.get128();
block_id_vec.push_back(partition_id + "_" + DB::toString(hash_value.items[0]) + "_" + DB::toString(hash_value.items[1]));
}
else
block_id_vec.push_back(partition_id + "_" + std::string(token));
start = offset;
}
return block_id_vec;
}
}
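
The offset bookkeeping in `filterBlockDuplicate` can be illustrated in isolation. The sketch below is a simplified model, not the ClickHouse code: `FilterResult` and `dropInserts` are hypothetical names, the row count is assumed to equal the last offset, and the caller is assumed to pass sorted, unique insert indexes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch; the real code mutates the block in place.
struct FilterResult
{
    std::vector<uint8_t> row_filter;  // 1 = keep the row, 0 = drop it
    std::vector<size_t> new_offsets;  // cumulative end offsets of the surviving inserts
};

// offsets[i] is the exclusive end row of insert i (cumulative over the block).
// remove_idx holds the sorted, unique indexes of the inserts to drop.
FilterResult dropInserts(const std::vector<size_t> & offsets, const std::vector<size_t> & remove_idx)
{
    assert(std::is_sorted(remove_idx.begin(), remove_idx.end()));

    FilterResult result;
    const size_t rows = offsets.empty() ? 0 : offsets.back();
    result.row_filter.assign(rows, 1);

    size_t removed_rows = 0;
    auto it = remove_idx.begin();
    for (size_t idx = 0; idx < offsets.size(); ++idx)
    {
        if (it != remove_idx.end() && *it == idx)
        {
            // Zero out the rows [start, end) that belong to the dropped insert.
            const size_t start = idx > 0 ? offsets[idx - 1] : 0;
            const size_t end = offsets[idx];
            for (size_t row = start; row < end; ++row)
                result.row_filter[row] = 0;
            removed_rows += end - start;
            ++it;
        }
        else
        {
            // Surviving inserts shift left by the number of rows removed so far.
            result.new_offsets.push_back(offsets[idx] - removed_rows);
        }
    }
    return result;
}
```

For example, with offsets `{2, 5, 6}` and the middle insert dropped, the filter keeps rows 0, 1 and 5, and the surviving offsets become `{2, 3}`.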

View File

@ -0,0 +1,55 @@
#pragma once
#include <Storages/MergeTree/MergeTreeDataWriter.h>
namespace DB
{
struct SyncInsertBlockInfo
{
SyncInsertBlockInfo(
Poco::Logger * /*log_*/,
std::string && block_id_,
BlockWithPartition && /*block_*/,
std::optional<BlockWithPartition> && /*unmerged_block_with_partition_*/)
: block_id(std::move(block_id_))
{
}
explicit SyncInsertBlockInfo(std::string block_id_)
: block_id(std::move(block_id_))
{}
std::string block_id;
};
struct AsyncInsertBlockInfo
{
Poco::Logger * log;
std::vector<std::string> block_id;
BlockWithPartition block_with_partition;
/// Some merging algorithms can modify the block, which loses the information about the async insert offsets.
/// When preprocessing or filtering data for async insert deduplication, we want to use the initial, unmerged block.
std::optional<BlockWithPartition> unmerged_block_with_partition;
std::unordered_map<String, std::vector<size_t>> block_id_to_offset_idx;
AsyncInsertBlockInfo(
Poco::Logger * log_,
std::vector<std::string> && block_id_,
BlockWithPartition && block_,
std::optional<BlockWithPartition> && unmerged_block_with_partition_);
void initBlockIDMap();
/// This function checks whether the block contains duplicate inserts.
/// If so, we keep only one insert out of each group of duplicates.
bool filterSelfDuplicate();
/// Remove the conflicting parts of the block so that it can be written again.
void filterBlockDuplicate(const std::vector<String> & block_paths, bool self_dedup);
/// Compute one block ID per insert in the block: partition_id plus the insert's token if it has one, otherwise a hash of the insert's rows.
static std::vector<String> getHashesForBlocks(BlockWithPartition & block, String partition_id);
};
}
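
A rough sketch of how `getHashesForBlocks` derives one ID per insert is given below. It is an approximation under stated assumptions: rows are flattened to strings, `std::hash` stands in for `SipHash` (so the numeric part is neither 128-bit nor stable across platforms), and `blockIdsForInserts` is a hypothetical name.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// rows: one string per row (a hypothetical flattening of the block's columns).
// offsets: cumulative exclusive end row of each insert.
// tokens: one token per insert; an empty token means "hash the rows instead".
std::vector<std::string> blockIdsForInserts(
    const std::vector<std::string> & rows,
    const std::vector<size_t> & offsets,
    const std::vector<std::string> & tokens,
    const std::string & partition_id)
{
    std::vector<std::string> ids;
    size_t start = 0;
    for (size_t i = 0; i < offsets.size(); ++i)
    {
        const size_t end = offsets[i];
        if (tokens[i].empty())
        {
            // Stand-in for SipHash: fold the per-row hashes of rows [start, end) into one value.
            size_t h = 0;
            for (size_t row = start; row < end; ++row)
                h = h * 1099511628211ULL + std::hash<std::string>{}(rows[row]);
            ids.push_back(partition_id + "_" + std::to_string(h));
        }
        else
        {
            // A user-supplied token replaces the content hash.
            ids.push_back(partition_id + "_" + tokens[i]);
        }
        start = end;
    }
    return ids;
}
```

Two inserts with identical rows and no token get the same ID, which is what lets `filterSelfDuplicate` detect duplicates inside a single block.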

View File

@ -119,11 +119,7 @@ MergeTreeSequentialSource::MergeTreeSequentialSource(
addTotalRowsApprox(data_part->rows_count);
/// Add columns because we don't want to read empty blocks
injectRequiredColumns(
LoadedMergeTreeDataPartInfoForReader(data_part, alter_conversions),
storage_snapshot,
storage.supportsSubcolumns(),
columns_to_read);
injectRequiredColumns(LoadedMergeTreeDataPartInfoForReader(data_part, alter_conversions), storage_snapshot, /*with_subcolumns=*/ false, columns_to_read);
NamesAndTypesList columns_for_reader;
if (take_column_types_from_storage)
@ -131,8 +127,6 @@ MergeTreeSequentialSource::MergeTreeSequentialSource(
auto options = GetColumnsOptions(GetColumnsOptions::AllPhysical)
.withExtendedObjects()
.withSystemColumns();
if (storage.supportsSubcolumns())
options.withSubcolumns();
columns_for_reader = storage_snapshot->getColumnsByNames(options, columns_to_read);
}
else

View File

@ -1,6 +1,7 @@
#include <Storages/StorageReplicatedMergeTree.h>
#include <Storages/MergeTree/ReplicatedMergeTreeQuorumEntry.h>
#include <Storages/MergeTree/ReplicatedMergeTreeSink.h>
#include <Storages/MergeTree/InsertBlockInfo.h>
#include <Interpreters/PartLog.h>
#include <Common/FailPoint.h>
#include <Common/ProfileEventsScope.h>
@ -49,17 +50,11 @@ namespace ErrorCodes
template<bool async_insert>
struct ReplicatedMergeTreeSinkImpl<async_insert>::DelayedChunk
{
struct Partition
using BlockInfo = std::conditional_t<async_insert, AsyncInsertBlockInfo, SyncInsertBlockInfo>;
struct Partition : public BlockInfo
{
Poco::Logger * log;
MergeTreeDataWriter::TemporaryPart temp_part;
UInt64 elapsed_ns;
BlockIDsType block_id;
BlockWithPartition block_with_partition;
/// Some merging algorithms can modify the block, which loses the information about the async insert offsets.
/// When preprocessing or filtering data for async insert deduplication, we want to use the initial, unmerged block.
std::optional<BlockWithPartition> unmerged_block_with_partition;
std::unordered_map<String, std::vector<size_t>> block_id_to_offset_idx;
ProfileEvents::Counters part_counters;
Partition() = default;
@ -70,127 +65,11 @@ struct ReplicatedMergeTreeSinkImpl<async_insert>::DelayedChunk
BlockWithPartition && block_,
std::optional<BlockWithPartition> && unmerged_block_with_partition_,
ProfileEvents::Counters && part_counters_)
: log(log_),
: BlockInfo(log_, std::move(block_id_), std::move(block_), std::move(unmerged_block_with_partition_)),
temp_part(std::move(temp_part_)),
elapsed_ns(elapsed_ns_),
block_id(std::move(block_id_)),
block_with_partition(std::move(block_)),
unmerged_block_with_partition(std::move(unmerged_block_with_partition_)),
part_counters(std::move(part_counters_))
{
initBlockIDMap();
}
void initBlockIDMap()
{
if constexpr (async_insert)
{
block_id_to_offset_idx.clear();
for (size_t i = 0; i < block_id.size(); ++i)
{
block_id_to_offset_idx[block_id[i]].push_back(i);
}
}
}
/// This function checks whether the block contains duplicate inserts.
/// If so, we keep only one insert out of each group of duplicates.
bool filterSelfDuplicate()
{
if constexpr (async_insert)
{
std::vector<String> dup_block_ids;
for (const auto & [hash_id, offset_indexes] : block_id_to_offset_idx)
{
/// More than one insert has the same hash id; in this case, we should keep only one of them.
if (offset_indexes.size() > 1)
dup_block_ids.push_back(hash_id);
}
if (dup_block_ids.empty())
return false;
filterBlockDuplicate(dup_block_ids, true);
return true;
}
return false;
}
/// Remove the conflicting parts of the block so that it can be written again.
void filterBlockDuplicate(const std::vector<String> & block_paths, bool self_dedup)
{
if constexpr (async_insert)
{
auto * current_block_with_partition = unmerged_block_with_partition.has_value() ? &unmerged_block_with_partition.value() : &block_with_partition;
std::vector<size_t> offset_idx;
for (const auto & raw_path : block_paths)
{
std::filesystem::path p(raw_path);
String conflict_block_id = p.filename();
auto it = block_id_to_offset_idx.find(conflict_block_id);
if (it == block_id_to_offset_idx.end())
throw Exception(ErrorCodes::LOGICAL_ERROR, "Unknown conflict path {}", conflict_block_id);
/// If this filter is for self_dedup, the block paths were selected by `filterSelfDuplicate`, i.e. this is a self purge.
/// In this case we don't know whether ZooKeeper already has this insert, so we keep one copy to avoid losing it.
offset_idx.insert(std::end(offset_idx), std::begin(it->second) + self_dedup, std::end(it->second));
}
std::sort(offset_idx.begin(), offset_idx.end());
auto & offsets = current_block_with_partition->offsets;
size_t idx = 0, remove_count = 0;
auto it = offset_idx.begin();
std::vector<size_t> new_offsets;
std::vector<String> new_block_ids;
/// construct filter
size_t rows = current_block_with_partition->block.rows();
auto filter_col = ColumnUInt8::create(rows, 1u);
ColumnUInt8::Container & vec = filter_col->getData();
UInt8 * pos = vec.data();
for (auto & offset : offsets)
{
if (it != offset_idx.end() && *it == idx)
{
size_t start_pos = idx > 0 ? offsets[idx - 1] : 0;
size_t end_pos = offset;
remove_count += end_pos - start_pos;
while (start_pos < end_pos)
{
*(pos + start_pos) = 0;
start_pos++;
}
it++;
}
else
{
new_offsets.push_back(offset - remove_count);
new_block_ids.push_back(block_id[idx]);
}
idx++;
}
LOG_TRACE(log, "New block IDs: {}, new offsets: {}, size: {}", toString(new_block_ids), toString(new_offsets), new_offsets.size());
current_block_with_partition->offsets = std::move(new_offsets);
block_id = std::move(new_block_ids);
auto cols = current_block_with_partition->block.getColumns();
for (auto & col : cols)
{
col = col->filter(vec, rows - remove_count);
}
current_block_with_partition->block.setColumns(cols);
LOG_TRACE(log, "New block rows {}", current_block_with_partition->block.rows());
initBlockIDMap();
if (unmerged_block_with_partition.has_value())
block_with_partition.block = unmerged_block_with_partition->block;
}
else
{
throw Exception(ErrorCodes::LOGICAL_ERROR, "sync insert should not call rewriteBlock");
}
}
{}
};
DelayedChunk() = default;
@ -236,35 +115,6 @@ namespace
if (size > 50) size = 50;
return fmt::format("({})", fmt::join(vec.begin(), vec.begin() + size, ","));
}
std::vector<String> getHashesForBlocks(BlockWithPartition & block, String partition_id)
{
size_t start = 0;
auto cols = block.block.getColumns();
std::vector<String> block_id_vec;
for (size_t i = 0; i < block.offsets.size(); ++i)
{
size_t offset = block.offsets[i];
std::string_view token = block.tokens[i];
if (token.empty())
{
SipHash hash;
for (size_t j = start; j < offset; ++j)
{
for (const auto & col : cols)
col->updateHashWithValue(j, hash);
}
const auto hash_value = hash.get128();
block_id_vec.push_back(partition_id + "_" + DB::toString(hash_value.items[0]) + "_" + DB::toString(hash_value.items[1]));
}
else
block_id_vec.push_back(partition_id + "_" + std::string(token));
start = offset;
}
return block_id_vec;
}
}
template<bool async_insert>
@ -470,7 +320,7 @@ void ReplicatedMergeTreeSinkImpl<async_insert>::consume(Chunk chunk)
if constexpr (async_insert)
{
block_id = getHashesForBlocks(unmerged_block.has_value() ? *unmerged_block : current_block, temp_part.part->info.partition_id);
block_id = AsyncInsertBlockInfo::getHashesForBlocks(unmerged_block.has_value() ? *unmerged_block : current_block, temp_part.part->info.partition_id);
LOG_TRACE(log, "async insert part, part id {}, block id {}, offsets {}, size {}", temp_part.part->info.partition_id, toString(block_id), toString(current_block.offsets), current_block.offsets.size());
}
else
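
The change to `Partition` in this file replaces `if constexpr (async_insert)` branches inside each method with a base class chosen at compile time. A minimal sketch of that pattern, assuming C++17, is shown here; `SyncInfo`, `AsyncInfo` and `PartitionSketch` are hypothetical, reduced stand-ins for `SyncInsertBlockInfo`, `AsyncInsertBlockInfo` and `Partition`.

```cpp
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical reduced versions of the two block-info types.
struct SyncInfo
{
    std::string block_id;  // a sync insert carries a single block ID
};

struct AsyncInfo
{
    std::vector<std::string> block_id;  // an async insert carries one ID per batched insert
};

// The base class, and with it the type of block_id, is selected by the template flag.
template <bool async_insert>
struct PartitionSketch : public std::conditional_t<async_insert, AsyncInfo, SyncInfo>
{
    using BlockInfo = std::conditional_t<async_insert, AsyncInfo, SyncInfo>;
    unsigned long long elapsed_ns = 0;
};

static_assert(std::is_base_of_v<AsyncInfo, PartitionSketch<true>>);
static_assert(std::is_base_of_v<SyncInfo, PartitionSketch<false>>);
```

One consequence of this design is that async-only members simply do not exist on the sync variant, so calling them for a sync insert becomes a compile error rather than a runtime `LOGICAL_ERROR`.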

View File

@ -89,8 +89,6 @@ public:
bool supportsDynamicSubcolumns() const override { return true; }
bool supportsSubcolumns() const override { return true; }
bool mayBenefitFromIndexForIn(
const ASTPtr & left_in_operand, ContextPtr query_context, const StorageMetadataPtr & metadata_snapshot) const override
{
@ -112,12 +110,6 @@ public:
return storage.getPartitionIDFromQuery(ast, context);
}
StorageSnapshotPtr getStorageSnapshotForQuery(
const StorageMetadataPtr & metadata_snapshot, const ASTPtr & /*query*/, ContextPtr query_context) const override
{
return storage.getStorageSnapshot(metadata_snapshot, query_context);
}
bool materializeTTLRecalculateOnly() const
{
if (parts.empty())

View File

@ -384,33 +384,9 @@ std::unique_ptr<ReadBuffer> createReadBuffer(
bool use_table_fd,
int table_fd,
const String & compression_method,
ContextPtr context,
const String & path_to_archive = "")
ContextPtr context)
{
CompressionMethod method;
if (!path_to_archive.empty())
{
auto reader = createArchiveReader(path_to_archive);
if (current_path.find_first_of("*?{") != std::string::npos)
{
auto matcher = std::make_shared<re2::RE2>(makeRegexpPatternFromGlobs(current_path));
if (!matcher->ok())
throw Exception(ErrorCodes::CANNOT_COMPILE_REGEXP,
"Cannot compile regex from glob ({}): {}", current_path, matcher->error());
return reader->readFile([my_matcher = std::move(matcher)](const std::string & path)
{
return re2::RE2::FullMatch(path, *my_matcher);
}, /*throw_on_not_found=*/true);
}
else
{
return reader->readFile(current_path, /*throw_on_not_found=*/true);
}
}
if (use_table_fd)
method = chooseCompressionMethod("", compression_method);
else
@ -471,14 +447,12 @@ namespace
public:
ReadBufferFromFileIterator(
const std::vector<String> & paths_,
const std::vector<String> & paths_to_archive_,
const String & format_,
const String & compression_method_,
const std::optional<FormatSettings> & format_settings_,
ContextPtr context_)
: WithContext(context_)
, paths(paths_)
, paths_to_archive(paths_to_archive_)
, format(format_)
, compression_method(compression_method_)
, format_settings(format_settings_)
@ -487,15 +461,13 @@ namespace
std::unique_ptr<ReadBuffer> next() override
{
String path;
struct stat file_stat;
bool is_first = current_index == 0;
const auto & paths_ref = paths_to_archive.empty() ? paths : paths_to_archive;
do
{
if (current_index == paths_ref.size())
if (current_index == paths.size())
{
if (is_first)
throw Exception(
@ -505,19 +477,16 @@ namespace
return nullptr;
}
path = paths_ref[current_index++];
path = paths[current_index++];
file_stat = getFileStat(path, false, -1, "File");
} while (getContext()->getSettingsRef().engine_file_skip_empty_files && file_stat.st_size == 0);
if (paths_to_archive.empty())
return createReadBuffer(path, file_stat, false, -1, compression_method, getContext());
return createReadBuffer(paths[0], file_stat, false, -1, compression_method, getContext(), path);
return createReadBuffer(path, file_stat, false, -1, compression_method, getContext());
}
void setNumRowsToLastFile(size_t num_rows) override
{
if (!getContext()->getSettingsRef().use_cache_for_count_from_files || !paths_to_archive.empty())
if (!getContext()->getSettingsRef().use_cache_for_count_from_files)
return;
auto key = getKeyForSchemaCache(paths[current_index - 1], format, format_settings, getContext());
@ -526,12 +495,182 @@ namespace
private:
const std::vector<String> & paths;
const std::vector<String> & paths_to_archive;
size_t current_index = 0;
String format;
String compression_method;
const std::optional<FormatSettings> & format_settings;
};
struct ReadBufferFromArchiveIterator : public IReadBufferIterator, WithContext
{
public:
ReadBufferFromArchiveIterator(
const StorageFile::ArchiveInfo & archive_info_,
const String & format_,
const std::optional<FormatSettings> & format_settings_,
ContextPtr context_)
: WithContext(context_)
, archive_info(archive_info_)
, format(format_)
, format_settings(format_settings_)
{
}
std::unique_ptr<ReadBuffer> next() override
{
std::unique_ptr<ReadBuffer> read_buf;
struct stat file_stat;
while (true)
{
if (current_archive_index == archive_info.paths_to_archives.size())
{
if (is_first)
throw Exception(
ErrorCodes::CANNOT_EXTRACT_TABLE_STRUCTURE,
"Cannot extract table structure from {} format file, because all files are empty. You must specify table structure manually",
format);
return nullptr;
}
const auto & archive = archive_info.paths_to_archives[current_archive_index];
file_stat = getFileStat(archive, false, -1, "File");
if (file_stat.st_size == 0)
{
if (getContext()->getSettingsRef().engine_file_skip_empty_files)
{
++current_archive_index;
continue;
}
throw Exception(
ErrorCodes::CANNOT_EXTRACT_TABLE_STRUCTURE,
"Cannot extract table structure from {} format file, because the archive {} is empty. "
"You must specify table structure manually",
format,
archive);
}
auto archive_reader = createArchiveReader(archive);
auto try_get_columns_from_schema_cache = [&](const std::string & full_path) -> std::optional<ColumnsDescription>
{
auto context = getContext();
if (!getContext()->getSettingsRef().schema_inference_use_cache_for_file)
return std::nullopt;
auto & schema_cache = StorageFile::getSchemaCache(context);
auto get_last_mod_time = [&]() -> std::optional<time_t>
{
if (0 != stat(archive_reader->getPath().c_str(), &file_stat))
return std::nullopt;
return file_stat.st_mtime;
};
auto cache_key = getKeyForSchemaCache(full_path, format, format_settings, context);
auto columns = schema_cache.tryGetColumns(cache_key, get_last_mod_time);
if (columns)
return columns;
return std::nullopt;
};
if (archive_info.isSingleFileRead())
{
read_buf = archive_reader->readFile(archive_info.path_in_archive, false);
++current_archive_index;
if (!read_buf)
continue;
last_read_file_path = processed_files.emplace_back(fmt::format("{}::{}", archive_reader->getPath(), archive_info.path_in_archive));
columns_from_cache = try_get_columns_from_schema_cache(last_read_file_path);
if (columns_from_cache)
return nullptr;
}
else
{
auto file_enumerator = archive_reader->firstFile();
if (!file_enumerator)
{
if (getContext()->getSettingsRef().engine_file_skip_empty_files)
{
read_files_from_archive.clear();
++current_archive_index;
continue;
}
throw Exception(
ErrorCodes::CANNOT_EXTRACT_TABLE_STRUCTURE,
"Cannot extract table structure from {} format file, because the archive {} has no files. "
"You must specify table structure manually",
format,
archive);
}
const auto * filename = &file_enumerator->getFileName();
while (read_files_from_archive.contains(*filename) || !archive_info.filter(*filename))
{
if (!file_enumerator->nextFile())
{
archive_reader = nullptr;
break;
}
filename = &file_enumerator->getFileName();
}
if (!archive_reader)
{
read_files_from_archive.clear();
++current_archive_index;
continue;
}
last_read_file_path = processed_files.emplace_back(fmt::format("{}::{}", archive_reader->getPath(), *filename));
columns_from_cache = try_get_columns_from_schema_cache(last_read_file_path);
if (columns_from_cache)
return nullptr;
read_files_from_archive.insert(*filename);
read_buf = archive_reader->readFile(std::move(file_enumerator));
}
break;
}
is_first = false;
return read_buf;
}
void setNumRowsToLastFile(size_t num_rows) override
{
if (!getContext()->getSettingsRef().use_cache_for_count_from_files)
return;
auto key = getKeyForSchemaCache(last_read_file_path, format, format_settings, getContext());
StorageFile::getSchemaCache(getContext()).addNumRows(key, num_rows);
}
std::vector<std::string> processed_files;
std::optional<ColumnsDescription> columns_from_cache;
private:
const StorageFile::ArchiveInfo & archive_info;
size_t current_archive_index = 0;
std::unordered_set<std::string> read_files_from_archive;
bool is_first = true;
std::string last_read_file_path;
String format;
const std::optional<FormatSettings> & format_settings;
};
}
ColumnsDescription StorageFile::getTableStructureFromFileDescriptor(ContextPtr context)
@ -567,7 +706,7 @@ ColumnsDescription StorageFile::getTableStructureFromFile(
const String & compression_method,
const std::optional<FormatSettings> & format_settings,
ContextPtr context,
const std::vector<String> & paths_to_archive)
const std::optional<ArchiveInfo> & archive_info)
{
if (format == "Distributed")
{
@ -577,29 +716,102 @@ ColumnsDescription StorageFile::getTableStructureFromFile(
return ColumnsDescription(DistributedAsyncInsertSource(paths[0]).getOutputs().front().getHeader().getNamesAndTypesList());
}
if (paths.empty() && !FormatFactory::instance().checkIfFormatHasExternalSchemaReader(format))
if (((archive_info && archive_info->paths_to_archives.empty()) || (!archive_info && paths.empty()))
&& !FormatFactory::instance().checkIfFormatHasExternalSchemaReader(format))
throw Exception(
ErrorCodes::CANNOT_EXTRACT_TABLE_STRUCTURE,
"Cannot extract table structure from {} format file, because there are no files with provided path. "
"You must specify table structure manually", format);
std::optional<ColumnsDescription> columns_from_cache;
if (context->getSettingsRef().schema_inference_use_cache_for_file)
columns_from_cache = tryGetColumnsFromCache(paths, format, format_settings, context);
ColumnsDescription columns;
if (columns_from_cache)
if (archive_info)
{
columns = *columns_from_cache;
std::vector<std::string> paths_for_schema_cache;
std::optional<ColumnsDescription> columns_from_cache;
if (context->getSettingsRef().schema_inference_use_cache_for_file)
{
paths_for_schema_cache.reserve(archive_info->paths_to_archives.size());
struct stat file_stat{};
for (const auto & archive : archive_info->paths_to_archives)
{
const auto & full_path = paths_for_schema_cache.emplace_back(fmt::format("{}::{}", archive, archive_info->path_in_archive));
if (!columns_from_cache)
{
auto & schema_cache = getSchemaCache(context);
auto get_last_mod_time = [&]() -> std::optional<time_t>
{
if (0 != stat(archive.c_str(), &file_stat))
return std::nullopt;
return file_stat.st_mtime;
};
auto cache_key = getKeyForSchemaCache(full_path, format, format_settings, context);
columns_from_cache = schema_cache.tryGetColumns(cache_key, get_last_mod_time);
}
}
}
if (columns_from_cache)
{
columns = std::move(*columns_from_cache);
}
else
{
ReadBufferFromArchiveIterator read_buffer_iterator(*archive_info, format, format_settings, context);
try
{
columns = readSchemaFromFormat(
format,
format_settings,
read_buffer_iterator,
/*retry=*/archive_info->paths_to_archives.size() > 1 || !archive_info->isSingleFileRead(),
context);
}
catch (const DB::Exception & e)
{
/// maybe we found something in cache while iterating files
if (e.code() == ErrorCodes::CANNOT_EXTRACT_TABLE_STRUCTURE)
{
if (read_buffer_iterator.columns_from_cache)
columns = std::move(*read_buffer_iterator.columns_from_cache);
else
throw;
}
else
{
throw;
}
}
for (auto & file : read_buffer_iterator.processed_files)
paths_for_schema_cache.push_back(std::move(file));
}
if (context->getSettingsRef().schema_inference_use_cache_for_file)
addColumnsToCache(paths_for_schema_cache, columns, format, format_settings, context);
}
else
{
ReadBufferFromFileIterator read_buffer_iterator(paths, paths_to_archive, format, compression_method, format_settings, context);
columns = readSchemaFromFormat(format, format_settings, read_buffer_iterator, paths.size() > 1, context);
}
std::optional<ColumnsDescription> columns_from_cache;
if (context->getSettingsRef().schema_inference_use_cache_for_file)
columns_from_cache = tryGetColumnsFromCache(paths, format, format_settings, context);
if (context->getSettingsRef().schema_inference_use_cache_for_file)
addColumnsToCache(paths, columns, format, format_settings, context);
if (columns_from_cache)
{
columns = *columns_from_cache;
}
else
{
ReadBufferFromFileIterator read_buffer_iterator(paths, format, compression_method, format_settings, context);
columns = readSchemaFromFormat(format, format_settings, read_buffer_iterator, paths.size() > 1, context);
}
if (context->getSettingsRef().schema_inference_use_cache_for_file)
addColumnsToCache(archive_info ? archive_info->paths_to_archives : paths, columns, format, format_settings, context);
}
return columns;
}
@ -643,14 +855,9 @@ StorageFile::StorageFile(const std::string & table_path_, const std::string & us
: StorageFile(args)
{
if (!args.path_to_archive.empty())
{
paths_to_archive = getPathsList(args.path_to_archive, user_files_path, args.getContext(), total_bytes_to_read);
paths = {table_path_};
}
archive_info = getArchiveInfo(args.path_to_archive, table_path_, user_files_path, args.getContext(), total_bytes_to_read);
else
{
paths = getPathsList(table_path_, user_files_path, args.getContext(), total_bytes_to_read);
}
is_db_table = false;
is_path_with_globs = paths.size() > 1;
@ -706,7 +913,13 @@ void StorageFile::setStorageMetadata(CommonArguments args)
columns = getTableStructureFromFileDescriptor(args.getContext());
else
{
columns = getTableStructureFromFile(format_name, paths, compression_method, format_settings, args.getContext(), paths_to_archive);
columns = getTableStructureFromFile(
format_name,
paths,
compression_method,
format_settings,
args.getContext(),
archive_info);
if (!args.columns.empty() && args.columns != columns)
throw Exception(ErrorCodes::INCOMPATIBLE_COLUMNS, "Table structure and file structure are different");
}
@ -743,15 +956,14 @@ public:
public:
explicit FilesIterator(
const Strings & files_,
std::vector<std::string> archives_,
const IArchiveReader::NameFilter & name_filter_,
std::optional<StorageFile::ArchiveInfo> archive_info_,
ASTPtr query,
const NamesAndTypesList & virtual_columns,
ContextPtr context_)
: files(files_), archives(std::move(archives_)), name_filter(name_filter_)
: files(files_), archive_info(std::move(archive_info_))
{
ASTPtr filter_ast;
if (archives.empty() && !files.empty() && !files[0].empty())
if (!archive_info && !files.empty() && !files[0].empty())
filter_ast = VirtualColumnUtils::createPathAndFileFilterAst(query, virtual_columns, files[0], context_);
if (filter_ast)
@ -760,7 +972,7 @@ public:
String next()
{
const auto & fs = fromArchive() ? archives : files;
const auto & fs = isReadFromArchive() ? archive_info->paths_to_archives : files;
auto current_index = index.fetch_add(1, std::memory_order_relaxed);
if (current_index >= fs.size())
@ -769,35 +981,32 @@ public:
return fs[current_index];
}
bool fromArchive() const
bool isReadFromArchive() const
{
return !archives.empty();
return archive_info.has_value();
}
bool readSingleFileFromArchive() const
bool validFileInArchive(const std::string & path) const
{
return !name_filter;
return archive_info->filter(path);
}
bool passesFilter(const std::string & name) const
bool isSingleFileReadFromArchive() const
{
std::lock_guard lock(filter_mutex);
return name_filter(name);
return archive_info->isSingleFileRead();
}
const String & getFileName()
const String & getFileNameInArchive()
{
if (files.size() != 1)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Expected only 1 filename but got {}", files.size());
if (archive_info->path_in_archive.empty())
throw Exception(ErrorCodes::LOGICAL_ERROR, "Expected only 1 filename but it's empty");
return files[0];
return archive_info->path_in_archive;
}
private:
std::vector<std::string> files;
std::vector<std::string> archives;
mutable std::mutex filter_mutex;
IArchiveReader::NameFilter name_filter;
std::optional<StorageFile::ArchiveInfo> archive_info;
std::atomic<size_t> index = 0;
};
@ -901,6 +1110,32 @@ public:
return storage->getName();
}
bool tryGetCountFromCache(const struct stat & file_stat)
{
if (!context->getSettingsRef().use_cache_for_count_from_files)
return false;
auto num_rows_from_cache = tryGetNumRowsFromCache(current_path, file_stat.st_mtime);
if (!num_rows_from_cache)
return false;
/// We should not return a single chunk containing all the rows,
/// because this chunk may be materialized later
/// (which can cause memory problems even with default values in columns or when virtual columns are requested).
/// Instead, we use a special ConstChunkGenerator that generates chunks
/// of max_block_size rows until the total number of rows is reached.
auto const_chunk_generator = std::make_shared<ConstChunkGenerator>(block_for_format, *num_rows_from_cache, max_block_size);
QueryPipelineBuilder builder;
builder.init(Pipe(const_chunk_generator));
builder.addSimpleTransform([&](const Block & header)
{
return std::make_shared<ExtractColumnsTransform>(header, requested_columns);
});
pipeline = std::make_unique<QueryPipeline>(QueryPipelineBuilder::getPipeline(std::move(builder)));
reader = std::make_unique<PullingPipelineExecutor>(*pipeline);
return true;
}
Chunk generate() override
{
while (!finished_generate)
@ -910,21 +1145,27 @@ public:
{
if (!storage->use_table_fd)
{
if (files_iterator->fromArchive())
if (files_iterator->isReadFromArchive())
{
if (files_iterator->readSingleFileFromArchive())
struct stat file_stat;
if (files_iterator->isSingleFileReadFromArchive())
{
auto archive = files_iterator->next();
if (archive.empty())
return {};
struct stat file_stat = getFileStat(archive, storage->use_table_fd, storage->table_fd, storage->getName());
file_stat = getFileStat(archive, storage->use_table_fd, storage->table_fd, storage->getName());
if (context->getSettingsRef().engine_file_skip_empty_files && file_stat.st_size == 0)
continue;
archive_reader = createArchiveReader(archive);
current_path = files_iterator->getFileName();
read_buf = archive_reader->readFile(current_path, /*throw_on_not_found=*/false);
filename_override = files_iterator->getFileNameInArchive();
current_path = fmt::format("{}::{}", archive_reader->getPath(), *filename_override);
if (need_only_count && tryGetCountFromCache(file_stat))
continue;
read_buf = archive_reader->readFile(*filename_override, /*throw_on_not_found=*/false);
if (!read_buf)
continue;
}
@ -938,7 +1179,7 @@ public:
if (archive.empty())
return {};
struct stat file_stat = getFileStat(archive, storage->use_table_fd, storage->table_fd, storage->getName());
file_stat = getFileStat(archive, storage->use_table_fd, storage->table_fd, storage->getName());
if (context->getSettingsRef().engine_file_skip_empty_files && file_stat.st_size == 0)
continue;
@ -948,7 +1189,7 @@ public:
}
bool file_found = true;
while (!files_iterator->passesFilter(file_enumerator->getFileName()))
while (!files_iterator->validFileInArchive(file_enumerator->getFileName()))
{
if (!file_enumerator->nextFile())
{
@ -959,7 +1200,7 @@ public:
if (file_found)
{
current_path = file_enumerator->getFileName();
filename_override = file_enumerator->getFileName();
break;
}
@ -967,6 +1208,10 @@ public:
}
chassert(file_enumerator);
current_path = fmt::format("{}::{}", archive_reader->getPath(), *filename_override);
if (need_only_count && tryGetCountFromCache(file_stat))
continue;
read_buf = archive_reader->readFile(std::move(file_enumerator));
}
}
@ -994,35 +1239,23 @@ public:
if (context->getSettingsRef().engine_file_skip_empty_files && file_stat.st_size == 0)
continue;
if (need_only_count && context->getSettingsRef().use_cache_for_count_from_files)
{
auto num_rows_from_cache = tryGetNumRowsFromCache(current_path, file_stat.st_mtime);
if (num_rows_from_cache)
{
/// We should not return single chunk with all number of rows,
/// because there is a chance that this chunk will be materialized later
/// (it can cause memory problems even with default values in columns or when virtual columns are requested).
/// Instead, we use special ConstChunkGenerator that will generate chunks
/// with max_block_size rows until total number of rows is reached.
auto const_chunk_generator = std::make_shared<ConstChunkGenerator>(block_for_format, *num_rows_from_cache, max_block_size);
QueryPipelineBuilder builder;
builder.init(Pipe(const_chunk_generator));
builder.addSimpleTransform([&](const Block & header)
{
return std::make_shared<ExtractColumnsTransform>(header, requested_columns);
});
pipeline = std::make_unique<QueryPipeline>(QueryPipelineBuilder::getPipeline(std::move(builder)));
reader = std::make_unique<PullingPipelineExecutor>(*pipeline);
continue;
}
}
if (need_only_count && tryGetCountFromCache(file_stat))
continue;
read_buf = createReadBuffer(current_path, file_stat, storage->use_table_fd, storage->table_fd, storage->compression_method, context);
}
const Settings & settings = context->getSettingsRef();
chassert(!storage->paths.empty());
const auto max_parsing_threads = std::max<size_t>(settings.max_threads/ storage->paths.size(), 1UL);
size_t file_num = 0;
if (storage->archive_info)
file_num = storage->archive_info->paths_to_archives.size();
else
file_num = storage->paths.size();
chassert(file_num > 0);
const auto max_parsing_threads = std::max<size_t>(settings.max_threads / file_num, 1UL);
input_format = context->getInputFormat(storage->format_name, *read_buf, block_for_format, max_block_size, storage->format_settings, need_only_count ? 1 : max_parsing_threads);
input_format->setQueryInfo(query_info, context);
if (need_only_count)
@ -1063,7 +1296,8 @@ public:
progress(num_rows, chunk_size ? chunk_size : chunk.bytes());
/// Enrich with virtual columns.
VirtualColumnUtils::addRequestedPathAndFileVirtualsToChunk(chunk, requested_virtual_columns, current_path);
VirtualColumnUtils::addRequestedPathAndFileVirtualsToChunk(
chunk, requested_virtual_columns, current_path, filename_override.has_value() ? &filename_override.value() : nullptr);
return chunk;
}
@ -1081,8 +1315,18 @@ public:
pipeline.reset();
input_format.reset();
if (files_iterator->fromArchive() && !files_iterator->readSingleFileFromArchive())
file_enumerator = archive_reader->nextFile(std::move(read_buf));
if (files_iterator->isReadFromArchive() && !files_iterator->isSingleFileReadFromArchive())
{
if (file_enumerator)
{
if (!file_enumerator->nextFile())
file_enumerator = nullptr;
}
else
{
file_enumerator = archive_reader->nextFile(std::move(read_buf));
}
}
read_buf.reset();
}
@ -1114,6 +1358,7 @@ private:
StorageSnapshotPtr storage_snapshot;
FilesIteratorPtr files_iterator;
String current_path;
std::optional<String> filename_override;
Block sample_block;
std::unique_ptr<ReadBuffer> read_buf;
InputFormatPtr input_format;
@ -1155,44 +1400,35 @@ Pipe StorageFile::read(
}
else
{
const auto & p = paths_to_archive.empty() ? paths : paths_to_archive;
if (p.size() == 1 && !fs::exists(p[0]))
const std::vector<std::string> * p;
if (archive_info.has_value())
p = &archive_info->paths_to_archives;
else
p = &paths;
if (p->size() == 1 && !fs::exists(p->at(0)))
{
if (context->getSettingsRef().engine_file_empty_if_not_exists)
return Pipe(std::make_shared<NullSource>(storage_snapshot->getSampleBlockForColumns(column_names)));
else
throw Exception(ErrorCodes::FILE_DOESNT_EXIST, "File {} doesn't exist", p[0]);
throw Exception(ErrorCodes::FILE_DOESNT_EXIST, "File {} doesn't exist", p->at(0));
}
}
IArchiveReader::NameFilter filter;
if (!paths_to_archive.empty())
{
if (paths.size() != 1)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Multiple paths defined for reading from archive");
auto files_iterator
= std::make_shared<StorageFileSource::FilesIterator>(paths, archive_info, query_info.query, virtual_columns, context);
const auto & path = paths[0];
if (path.find_first_of("*?{") != std::string::npos)
{
auto matcher = std::make_shared<re2::RE2>(makeRegexpPatternFromGlobs(path));
if (!matcher->ok())
throw Exception(ErrorCodes::CANNOT_COMPILE_REGEXP,
"Cannot compile regex from glob ({}): {}", path, matcher->error());
filter = [matcher](const std::string & p)
{
return re2::RE2::FullMatch(p, *matcher);
};
}
}
auto files_iterator = std::make_shared<StorageFileSource::FilesIterator>(paths, paths_to_archive, std::move(filter), query_info.query, virtual_columns, context);
auto this_ptr = std::static_pointer_cast<StorageFile>(shared_from_this());
size_t num_streams = max_num_streams;
auto files_to_read = std::max(paths_to_archive.size(), paths.size());
size_t files_to_read = 0;
if (archive_info)
files_to_read = archive_info->paths_to_archives.size();
else
files_to_read = paths.size();
if (max_num_streams > files_to_read)
num_streams = files_to_read;
@ -1478,7 +1714,7 @@ SinkToStoragePtr StorageFile::write(
ContextPtr context,
bool /*async_insert*/)
{
if (!use_table_fd && !paths_to_archive.empty())
if (!use_table_fd && archive_info.has_value())
throw Exception(ErrorCodes::NOT_IMPLEMENTED, "Writing to archives is not supported");
if (format_name == "Distributed")
@ -1817,4 +2053,34 @@ void StorageFile::parseFileSource(String source, String & filename, String & pat
filename = filename_view;
}
StorageFile::ArchiveInfo StorageFile::getArchiveInfo(
const std::string & path_to_archive,
const std::string & file_in_archive,
const std::string & user_files_path,
ContextPtr context,
size_t & total_bytes_to_read
)
{
ArchiveInfo archive_info;
archive_info.path_in_archive = file_in_archive;
if (file_in_archive.find_first_of("*?{") != std::string::npos)
{
auto matcher = std::make_shared<re2::RE2>(makeRegexpPatternFromGlobs(file_in_archive));
if (!matcher->ok())
throw Exception(ErrorCodes::CANNOT_COMPILE_REGEXP,
"Cannot compile regex from glob ({}): {}", file_in_archive, matcher->error());
archive_info.filter = [matcher, matcher_mutex = std::make_shared<std::mutex>()](const std::string & p) mutable
{
std::lock_guard lock(*matcher_mutex);
return re2::RE2::FullMatch(p, *matcher);
};
}
archive_info.paths_to_archives = getPathsList(path_to_archive, user_files_path, context, total_bytes_to_read);
return archive_info;
}
}
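
Note on the `ArchiveInfo` change above: the glob handling that used to live in `StorageFile::read` now sits in `getArchiveInfo`, and "single file vs. glob" is decided by whether a filter was built. A minimal Python sketch of that logic follows. It is an illustration only, not ClickHouse code: the Python names `ArchiveInfo`, `get_archive_info` and `name_filter` are hypothetical mirrors of the C++ ones, and `fnmatch.translate` is a stand-in for `makeRegexpPatternFromGlobs` (it has no `{a,b}` alternation and its `*` also matches `/`, so the matching semantics differ).

```python
import fnmatch
import re
import threading
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ArchiveInfo:
    # Hypothetical Python mirror of StorageFile::ArchiveInfo.
    paths_to_archives: List[str]
    path_in_archive: str  # used when reading a single file from the archive
    name_filter: Optional[Callable[[str], bool]] = None  # set only for globs

    def is_single_file_read(self) -> bool:
        # No filter means the path inside the archive is a literal file name.
        return self.name_filter is None


def get_archive_info(paths_to_archives: List[str], file_in_archive: str) -> ArchiveInfo:
    info = ArchiveInfo(list(paths_to_archives), file_in_archive)
    # Same trigger characters as the C++ check: find_first_of("*?{").
    if any(ch in file_in_archive for ch in "*?{"):
        # Stand-in for makeRegexpPatternFromGlobs; the pattern ends with \Z,
        # so match() behaves like a full match.
        matcher = re.compile(fnmatch.translate(file_in_archive))
        lock = threading.Lock()

        def name_filter(name: str) -> bool:
            # The lock travels with the filter, as in the C++ lambda capture.
            with lock:
                return matcher.match(name) is not None

        info.name_filter = name_filter
    return info
```

Bundling the lock with the filter is what lets the iterator drop its own `filter_mutex` member: any copy of `ArchiveInfo` shares the same matcher and the same lock.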

View File

@ -3,6 +3,7 @@
#include <Storages/Cache/SchemaCache.h>
#include <Storages/IStorage.h>
#include <Common/FileRenamer.h>
#include <IO/Archives/IArchiveReader.h>
#include <atomic>
#include <shared_mutex>
@ -83,6 +84,18 @@ public:
bool supportsPartitionBy() const override { return true; }
struct ArchiveInfo
{
std::vector<std::string> paths_to_archives;
std::string path_in_archive; // used when reading a single file from archive
IArchiveReader::NameFilter filter = {}; // used when files inside archive are defined with a glob
bool isSingleFileRead() const
{
return !filter;
}
};
ColumnsDescription getTableStructureFromFileDescriptor(ContextPtr context);
static ColumnsDescription getTableStructureFromFile(
@ -91,12 +104,19 @@ public:
const String & compression_method,
const std::optional<FormatSettings> & format_settings,
ContextPtr context,
const std::vector<String> & paths_to_archive = {"auto"});
const std::optional<ArchiveInfo> & archive_info = std::nullopt);
static SchemaCache & getSchemaCache(const ContextPtr & context);
static void parseFileSource(String source, String & filename, String & path_to_archive);
static ArchiveInfo getArchiveInfo(
const std::string & path_to_archive,
const std::string & file_in_archive,
const std::string & user_files_path,
ContextPtr context,
size_t & total_bytes_to_read);
bool supportsTrivialCountOptimization() const override { return true; }
protected:
@ -128,7 +148,8 @@ private:
std::string base_path;
std::vector<std::string> paths;
std::vector<std::string> paths_to_archive;
std::optional<ArchiveInfo> archive_info;
bool is_db_table = true; /// Table is stored in real database, not user's file
bool use_table_fd = false; /// Use table_fd instead of path

View File

@ -385,7 +385,7 @@ private:
friend class ReplicatedMergeTreeSinkImpl;
friend class ReplicatedMergeTreePartCheckThread;
friend class ReplicatedMergeTreeCleanupThread;
friend class AsyncBlockIDsCache;
friend class AsyncBlockIDsCache<StorageReplicatedMergeTree>;
friend class ReplicatedMergeTreeAlterThread;
friend class ReplicatedMergeTreeRestartingThread;
friend class ReplicatedMergeTreeAttachThread;
@ -512,7 +512,7 @@ private:
/// A thread that removes old parts, log entries, and blocks.
ReplicatedMergeTreeCleanupThread cleanup_thread;
AsyncBlockIDsCache async_block_ids_cache;
AsyncBlockIDsCache<StorageReplicatedMergeTree> async_block_ids_cache;
/// A thread that checks the data of the parts, as well as the queue of the parts to be checked.
ReplicatedMergeTreePartCheckThread part_check_thread;

View File

@ -340,7 +340,8 @@ ColumnPtr getFilterByPathAndFileIndexes(const std::vector<String> & paths, const
return block.getByName("_idx").column;
}
void addRequestedPathAndFileVirtualsToChunk(Chunk & chunk, const NamesAndTypesList & requested_virtual_columns, const String & path)
void addRequestedPathAndFileVirtualsToChunk(
Chunk & chunk, const NamesAndTypesList & requested_virtual_columns, const String & path, const String * filename)
{
for (const auto & virtual_column : requested_virtual_columns)
{
@ -350,9 +351,16 @@ void addRequestedPathAndFileVirtualsToChunk(Chunk & chunk, const NamesAndTypesLi
}
else if (virtual_column.name == "_file")
{
size_t last_slash_pos = path.find_last_of('/');
auto file_name = path.substr(last_slash_pos + 1);
chunk.addColumn(virtual_column.type->createColumnConst(chunk.getNumRows(), file_name));
if (filename)
{
chunk.addColumn(virtual_column.type->createColumnConst(chunk.getNumRows(), *filename));
}
else
{
size_t last_slash_pos = path.find_last_of('/');
auto filename_from_path = path.substr(last_slash_pos + 1);
chunk.addColumn(virtual_column.type->createColumnConst(chunk.getNumRows(), filename_from_path));
}
}
}
}
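
Note on the `_file` change above: for archive reads `current_path` now has the form `archive::file`, so cutting at the last `/` would no longer give the file name, and the caller passes the name explicitly instead. A minimal Python sketch of the resolution rule, assuming a hypothetical helper name `resolve_file_virtual_column` (not part of ClickHouse):

```python
from typing import Optional


def resolve_file_virtual_column(path: str, filename: Optional[str] = None) -> str:
    """Return the value for the `_file` virtual column."""
    if filename is not None:
        # Explicit override, e.g. the name of a file inside an archive.
        return filename
    # Fallback: everything after the last '/'. rfind() returns -1 when there
    # is no slash, so the slice starts at 0 and the whole path is returned,
    # which parallels npos + 1 wrapping to 0 in the C++ version.
    return path[path.rfind("/") + 1:]
```

Without the override, a path such as `/data/a.zip::x.csv` would resolve to `a.zip::x.csv` rather than `x.csv`.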

View File

@ -67,7 +67,8 @@ void filterByPathOrFile(std::vector<T> & sources, const std::vector<String> & pa
sources = std::move(filtered_sources);
}
void addRequestedPathAndFileVirtualsToChunk(Chunk & chunk, const NamesAndTypesList & requested_virtual_columns, const String & path);
void addRequestedPathAndFileVirtualsToChunk(
Chunk & chunk, const NamesAndTypesList & requested_virtual_columns, const String & path, const String * filename = nullptr);
}
}

View File

@ -97,13 +97,14 @@ ColumnsDescription TableFunctionFile::getActualTableStructure(ContextPtr context
size_t total_bytes_to_read = 0;
Strings paths;
Strings paths_to_archives;
std::optional<StorageFile::ArchiveInfo> archive_info;
if (path_to_archive.empty())
paths = StorageFile::getPathsList(filename, context->getUserFilesPath(), context, total_bytes_to_read);
else
paths_to_archives = StorageFile::getPathsList(path_to_archive, context->getUserFilesPath(), context, total_bytes_to_read);
archive_info
= StorageFile::getArchiveInfo(path_to_archive, filename, context->getUserFilesPath(), context, total_bytes_to_read);
return StorageFile::getTableStructureFromFile(format, paths, compression_method, std::nullopt, context, paths_to_archives);
return StorageFile::getTableStructureFromFile(format, paths, compression_method, std::nullopt, context, archive_info);
}

View File

@ -4,14 +4,15 @@ import logging
import subprocess
import os
import sys
from pathlib import Path
from github import Github
from build_download_helper import get_build_name_for_check, read_build_urls
from clickhouse_helper import (
CiLogsCredentials,
ClickHouseHelper,
prepare_tests_results_for_clickhouse,
get_instance_type,
)
from commit_status_helper import (
RerunHelper,
@ -19,7 +20,7 @@ from commit_status_helper import (
get_commit,
post_commit_status,
)
from docker_pull_helper import get_image_with_version
from docker_pull_helper import DockerImage, get_image_with_version
from env_helper import (
REPORTS_PATH,
TEMP_PATH,
@ -29,25 +30,23 @@ from pr_info import PRInfo
from report import TestResult
from s3_helper import S3Helper
from stopwatch import Stopwatch
from tee_popen import TeePopen
from upload_result_helper import upload_results
IMAGE_NAME = "clickhouse/fuzzer"
def get_run_command(
check_start_time, check_name, pr_number, sha, download_url, workspace_path, image
):
instance_type = get_instance_type()
pr_info: PRInfo,
build_url: str,
workspace_path: str,
ci_logs_args: str,
image: DockerImage,
) -> str:
envs = [
"-e CLICKHOUSE_CI_LOGS_HOST",
"-e CLICKHOUSE_CI_LOGS_PASSWORD",
f"-e CHECK_START_TIME='{check_start_time}'",
f"-e CHECK_NAME='{check_name}'",
f"-e INSTANCE_TYPE='{instance_type}'",
f"-e PR_TO_TEST={pr_number}",
f"-e SHA_TO_TEST={sha}",
f"-e BINARY_URL_TO_DOWNLOAD='{download_url}'",
f"-e PR_TO_TEST={pr_info.number}",
f"-e SHA_TO_TEST={pr_info.sha}",
f"-e BINARY_URL_TO_DOWNLOAD='{build_url}'",
]
env_str = " ".join(envs)
@ -57,6 +56,7 @@ def get_run_command(
# For sysctl
"--privileged "
"--network=host "
f"{ci_logs_args}"
f"--volume={workspace_path}:/workspace "
f"{env_str} "
"--cap-add syslog --cap-add sys_admin --cap-add=SYS_PTRACE "
@ -107,14 +107,16 @@ def main():
workspace_path = os.path.join(temp_path, "workspace")
if not os.path.exists(workspace_path):
os.makedirs(workspace_path)
ci_logs_credentials = CiLogsCredentials(Path(temp_path) / "export-logs-config.sh")
ci_logs_args = ci_logs_credentials.get_docker_arguments(
pr_info, stopwatch.start_time_str, check_name
)
run_command = get_run_command(
stopwatch.start_time_str,
check_name,
pr_info.number,
pr_info.sha,
pr_info,
build_url,
workspace_path,
ci_logs_args,
docker_image,
)
logging.info("Going to run %s", run_command)
@ -122,35 +124,15 @@ def main():
run_log_path = os.path.join(temp_path, "run.log")
main_log_path = os.path.join(workspace_path, "main.log")
with open(run_log_path, "w", encoding="utf-8") as log:
with subprocess.Popen(
run_command, shell=True, stderr=log, stdout=log
) as process:
retcode = process.wait()
if retcode == 0:
logging.info("Run successfully")
else:
logging.info("Run failed")
with TeePopen(run_command, run_log_path) as process:
retcode = process.wait()
if retcode == 0:
logging.info("Run successfully")
else:
logging.info("Run failed")
subprocess.check_call(f"sudo chown -R ubuntu:ubuntu {temp_path}", shell=True)
# Cleanup run log from the credentials of CI logs database.
# Note: a malicious user can still print them by splitting the value into parts.
# But we will be warned when a malicious user modifies CI script.
# Although they can also print them from inside tests.
# Nevertheless, the credentials of the CI logs have limited scope
# and does not provide access to sensitive info.
ci_logs_host = os.getenv("CLICKHOUSE_CI_LOGS_HOST", "CLICKHOUSE_CI_LOGS_HOST")
ci_logs_password = os.getenv(
"CLICKHOUSE_CI_LOGS_PASSWORD", "CLICKHOUSE_CI_LOGS_PASSWORD"
)
if ci_logs_host not in ("CLICKHOUSE_CI_LOGS_HOST", ""):
subprocess.check_call(
f"sed -i -r -e 's!{ci_logs_host}!CLICKHOUSE_CI_LOGS_HOST!g; s!{ci_logs_password}!CLICKHOUSE_CI_LOGS_PASSWORD!g;' '{run_log_path}' '{main_log_path}'",
shell=True,
)
ci_logs_credentials.clean_ci_logs_from_credentials(Path(run_log_path))
check_name_lower = (
check_name.lower().replace("(", "").replace(")", "").replace(" ", "")

View File

@ -42,3 +42,5 @@ quit
# gdb will send SIGSTOP, spend some time loading debug info and then send SIGCONT, wait for it (up to send_timeout, 300s)
run_with_retry 60 clickhouse-client --query "SELECT 'Connected to clickhouse-server after attaching gdb'"
}
# vi: ft=bash

View File

@ -31,6 +31,7 @@ from version_helper import (
)
from clickhouse_helper import (
ClickHouseHelper,
CiLogsCredentials,
prepare_tests_results_for_clickhouse,
get_instance_type,
)
@ -375,8 +376,8 @@ def main():
# Upload profile data
ch_helper = ClickHouseHelper()
clickhouse_ci_logs_host = os.getenv("CLICKHOUSE_CI_LOGS_HOST", "")
if clickhouse_ci_logs_host:
ci_logs_credentials = CiLogsCredentials(Path("/dev/null"))
if ci_logs_credentials.host:
instance_type = get_instance_type()
query = f"""INSERT INTO build_time_trace
(
@ -420,9 +421,9 @@ FORMAT JSONCompactEachRow"""
auth = {
"X-ClickHouse-User": "ci",
"X-ClickHouse-Key": os.getenv("CLICKHOUSE_CI_LOGS_PASSWORD", ""),
"X-ClickHouse-Key": ci_logs_credentials.password,
}
url = f"https://{clickhouse_ci_logs_host}/"
url = f"https://{ci_logs_credentials.host}/"
profiles_dir = temp_path / "profiles_source"
os.makedirs(profiles_dir, exist_ok=True)
logging.info("Processing profile JSON files from {GIT_REPO_ROOT}/build_docker")

View File

@ -1,6 +1,7 @@
#!/usr/bin/env python3
from pathlib import Path
from typing import Dict, List, Optional
import fileinput
import json
import logging
import time
@ -235,3 +236,89 @@ def prepare_tests_results_for_clickhouse(
result.append(current_row)
return result
class CiLogsCredentials:
def __init__(self, config_path: Path):
self.config_path = config_path
try:
self._host = get_parameter_from_ssm("clickhouse_ci_logs_host") # type: str
self._password = get_parameter_from_ssm(
"clickhouse_ci_logs_password"
) # type: str
except:
logging.warning(
"Unable to retrieve host and/or password from SSM, all other "
"methods will be no-ops"
)
self._host = ""
self._password = ""
def create_ci_logs_credentials(self) -> None:
if not (self.host and self.password):
logging.info(
"Hostname or password for CI logs instance are unknown, "
"skipping creating of credentials file, removing existing"
)
self.config_path.unlink(missing_ok=True)
return
self.config_path.parent.mkdir(parents=True, exist_ok=True)
self.config_path.write_text(
f"CLICKHOUSE_CI_LOGS_HOST={self.host}\n"
"CLICKHOUSE_CI_LOGS_USER=ci\n"
f"CLICKHOUSE_CI_LOGS_PASSWORD={self.password}\n",
encoding="utf-8",
)
def get_docker_arguments(
self, pr_info: PRInfo, check_start_time: str, check_name: str
) -> str:
self.create_ci_logs_credentials()
if not self.config_path.exists():
logging.info("Do not use external logs pushing")
return ""
extra_columns = (
f"{pr_info.number} AS pull_request_number, '{pr_info.sha}' AS commit_sha, "
f"'{check_start_time}' AS check_start_time, '{check_name}' AS check_name, "
f"'{get_instance_type()}' AS instance_type"
)
return (
f'-e EXTRA_COLUMNS_EXPRESSION="{extra_columns}" '
f"-e CLICKHOUSE_CI_LOGS_CREDENTIALS=/tmp/export-logs-config.sh "
f"--volume={self.config_path.absolute()}:/tmp/export-logs-config.sh:ro "
)
def clean_ci_logs_from_credentials(self, log_path: Path) -> None:
if not (self.host or self.password):
logging.info(
"Hostname and password for CI logs instance are unknown, "
"skipping cleaning %s",
log_path,
)
return
def process_line(line: str) -> str:
if self.host and self.password:
return line.replace(self.host, "CLICKHOUSE_CI_LOGS_HOST").replace(
self.password, "CLICKHOUSE_CI_LOGS_PASSWORD"
)
if self.host:
return line.replace(self.host, "CLICKHOUSE_CI_LOGS_HOST")
# the remaining is self.password
return line.replace(self.password, "CLICKHOUSE_CI_LOGS_PASSWORD")
# errors="surrogateescape" requires Python 3.10.
# Ubuntu 22.04 ships Python 3.10, so we are safe
with fileinput.input(
log_path, inplace=True, errors="surrogateescape"
) as log_fd:
for line in log_fd:
print(process_line(line), end="")
@property
def host(self) -> str:
return self._host
@property
def password(self) -> str:
return self._password
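
Note on `clean_ci_logs_from_credentials` above: the three branches in `process_line` exist because either value may be empty. A minimal standalone sketch of the same idea, assuming a hypothetical free function `scrub_credentials` (the real method works on a file via `fileinput`; this sketch only shows the per-line replacement):

```python
def scrub_credentials(line: str, host: str, password: str) -> str:
    """Replace known secret values in a log line with fixed placeholders."""
    # Guard each replacement: str.replace("", x) inserts x between every
    # character, so an empty secret must be skipped, not replaced.
    if host:
        line = line.replace(host, "CLICKHOUSE_CI_LOGS_HOST")
    if password:
        line = line.replace(password, "CLICKHOUSE_CI_LOGS_PASSWORD")
    return line
```

As the comment removed from the check scripts in this commit pointed out, this only catches the values when they appear verbatim; output that splits a secret into parts is not scrubbed.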

View File

@ -15,9 +15,9 @@ from github import Github
from build_download_helper import download_all_deb_packages
from clickhouse_helper import (
CiLogsCredentials,
ClickHouseHelper,
prepare_tests_results_for_clickhouse,
get_instance_type,
)
from commit_status_helper import (
NotSet,
@ -28,7 +28,7 @@ from commit_status_helper import (
post_commit_status_to_file,
update_mergeable_check,
)
from docker_pull_helper import get_image_with_version
from docker_pull_helper import DockerImage, get_image_with_version
from download_release_packages import download_last_release
from env_helper import TEMP_PATH, REPO_COPY, REPORTS_PATH
from get_robot_token import get_best_robot_token
@ -74,19 +74,18 @@ def get_image_name(check_name):
def get_run_command(
pr_info,
check_start_time,
check_name,
builds_path,
repo_path,
result_path,
server_log_path,
kill_timeout,
additional_envs,
image,
flaky_check,
tests_to_run,
):
check_name: str,
builds_path: str,
repo_path: str,
result_path: str,
server_log_path: str,
kill_timeout: int,
additional_envs: List[str],
ci_logs_args: str,
image: DockerImage,
flaky_check: bool,
tests_to_run: List[str],
) -> str:
additional_options = ["--hung-check"]
additional_options.append("--print-time")
@ -104,39 +103,30 @@ def get_run_command(
]
if flaky_check:
envs += ["-e NUM_TRIES=100", "-e MAX_RUN_TIME=1800"]
envs.append("-e NUM_TRIES=100")
envs.append("-e MAX_RUN_TIME=1800")
envs += [f"-e {e}" for e in additional_envs]
instance_type = get_instance_type()
envs += [
"-e CLICKHOUSE_CI_LOGS_HOST",
"-e CLICKHOUSE_CI_LOGS_PASSWORD",
f"-e PULL_REQUEST_NUMBER='{pr_info.number}'",
f"-e COMMIT_SHA='{pr_info.sha}'",
f"-e CHECK_START_TIME='{check_start_time}'",
f"-e CHECK_NAME='{check_name}'",
f"-e INSTANCE_TYPE='{instance_type}'",
]
env_str = " ".join(envs)
volume_with_broken_test = (
f"--volume={repo_path}/tests/analyzer_tech_debt.txt:/analyzer_tech_debt.txt"
f"--volume={repo_path}/tests/analyzer_tech_debt.txt:/analyzer_tech_debt.txt "
if "analyzer" in check_name
else ""
)
return (
f"docker run --volume={builds_path}:/package_folder "
f"{ci_logs_args}"
f"--volume={repo_path}/tests:/usr/share/clickhouse-test "
f"{volume_with_broken_test} "
f"--volume={result_path}:/test_output --volume={server_log_path}:/var/log/clickhouse-server "
f"{volume_with_broken_test}"
f"--volume={result_path}:/test_output "
f"--volume={server_log_path}:/var/log/clickhouse-server "
f"--cap-add=SYS_PTRACE {env_str} {additional_options_str} {image}"
)
def get_tests_to_run(pr_info):
def get_tests_to_run(pr_info: PRInfo) -> List[str]:
result = set()
if pr_info.changed_files is None:
@ -346,9 +336,12 @@ def main():
if validate_bugfix_check:
additional_envs.append("GLOBAL_TAGS=no-random-settings")
ci_logs_credentials = CiLogsCredentials(Path(temp_path) / "export-logs-config.sh")
ci_logs_args = ci_logs_credentials.get_docker_arguments(
pr_info, stopwatch.start_time_str, check_name
)
run_command = get_run_command(
pr_info,
stopwatch.start_time_str,
check_name,
packages_path,
repo_path,
@ -356,6 +349,7 @@ def main():
server_log_path,
kill_timeout,
additional_envs,
ci_logs_args,
docker_image,
flaky_check,
tests_to_run,
@ -374,6 +368,7 @@ def main():
except subprocess.CalledProcessError:
logging.warning("Failed to change files owner in %s, ignoring it", temp_path)
ci_logs_credentials.clean_ci_logs_from_credentials(Path(run_log_path))
s3_helper = S3Helper()
state, description, test_results, additional_logs = process_results(
@ -383,23 +378,6 @@ def main():
ch_helper = ClickHouseHelper()
# Cleanup run log from the credentials of CI logs database.
# Note: a malicious user can still print them by splitting the value into parts.
# But we will be warned when a malicious user modifies CI script.
# Although they can also print them from inside tests.
# Nevertheless, the credentials of the CI logs have limited scope
# and does not provide access to sensitive info.
ci_logs_host = os.getenv("CLICKHOUSE_CI_LOGS_HOST", "CLICKHOUSE_CI_LOGS_HOST")
ci_logs_password = os.getenv(
"CLICKHOUSE_CI_LOGS_PASSWORD", "CLICKHOUSE_CI_LOGS_PASSWORD"
)
if ci_logs_host not in ("CLICKHOUSE_CI_LOGS_HOST", ""):
subprocess.check_call(
f"sed -i -r -e 's!{ci_logs_host}!CLICKHOUSE_CI_LOGS_HOST!g; s!{ci_logs_password}!CLICKHOUSE_CI_LOGS_PASSWORD!g;' '{run_log_path}'",
shell=True,
)
report_url = upload_results(
s3_helper,
pr_info.number,

View File

@ -12,12 +12,12 @@ from github import Github
from build_download_helper import download_all_deb_packages
from clickhouse_helper import (
CiLogsCredentials,
ClickHouseHelper,
prepare_tests_results_for_clickhouse,
get_instance_type,
)
from commit_status_helper import RerunHelper, get_commit, post_commit_status
from docker_pull_helper import get_image_with_version
from docker_pull_helper import DockerImage, get_image_with_version
from env_helper import TEMP_PATH, REPO_COPY, REPORTS_PATH
from get_robot_token import get_best_robot_token
from pr_info import PRInfo
@ -29,40 +29,24 @@ from upload_result_helper import upload_results
def get_run_command(
pr_info,
check_start_time,
check_name,
build_path,
result_folder,
repo_tests_path,
server_log_folder,
image,
):
instance_type = get_instance_type()
envs = [
# a static link, don't use S3_URL or S3_DOWNLOAD
"-e S3_URL='https://s3.amazonaws.com/clickhouse-datasets'",
"-e CLICKHOUSE_CI_LOGS_HOST",
"-e CLICKHOUSE_CI_LOGS_PASSWORD",
f"-e PULL_REQUEST_NUMBER='{pr_info.number}'",
f"-e COMMIT_SHA='{pr_info.sha}'",
f"-e CHECK_START_TIME='{check_start_time}'",
f"-e CHECK_NAME='{check_name}'",
f"-e INSTANCE_TYPE='{instance_type}'",
]
env_str = " ".join(envs)
build_path: str,
result_path: str,
repo_tests_path: str,
server_log_path: str,
ci_logs_args: str,
image: DockerImage,
) -> str:
cmd = (
"docker run --cap-add=SYS_PTRACE "
f"{env_str} "
# For dmesg and sysctl
"--privileged "
# a static link, don't use S3_URL or S3_DOWNLOAD
"-e S3_URL='https://s3.amazonaws.com/clickhouse-datasets' "
f"{ci_logs_args}"
f"--volume={build_path}:/package_folder "
f"--volume={result_folder}:/test_output "
f"--volume={result_path}:/test_output "
f"--volume={repo_tests_path}:/usr/share/clickhouse-test "
f"--volume={server_log_folder}:/var/log/clickhouse-server {image} "
f"--volume={server_log_path}:/var/log/clickhouse-server {image} "
)
return cmd
@ -170,15 +154,17 @@ def run_stress_test(docker_image_name):
os.makedirs(result_path)
run_log_path = os.path.join(temp_path, "run.log")
ci_logs_credentials = CiLogsCredentials(Path(temp_path) / "export-logs-config.sh")
ci_logs_args = ci_logs_credentials.get_docker_arguments(
pr_info, stopwatch.start_time_str, check_name
)
run_command = get_run_command(
pr_info,
stopwatch.start_time_str,
check_name,
packages_path,
result_path,
repo_tests_path,
server_log_path,
ci_logs_args,
docker_image,
)
logging.info("Going to run stress test: %s", run_command)
@ -191,6 +177,7 @@ def run_stress_test(docker_image_name):
logging.info("Run failed")
subprocess.check_call(f"sudo chown -R ubuntu:ubuntu {temp_path}", shell=True)
ci_logs_credentials.clean_ci_logs_from_credentials(Path(run_log_path))
s3_helper = S3Helper()
state, description, test_results, additional_logs = process_results(
@ -198,23 +185,6 @@ def run_stress_test(docker_image_name):
)
ch_helper = ClickHouseHelper()
# Cleanup run log from the credentials of CI logs database.
# Note: a malicious user can still print them by splitting the value into parts.
# But we will be warned when a malicious user modifies CI script.
# Although they can also print them from inside tests.
# Nevertheless, the credentials of the CI logs have limited scope
# and does not provide access to sensitive info.
ci_logs_host = os.getenv("CLICKHOUSE_CI_LOGS_HOST", "CLICKHOUSE_CI_LOGS_HOST")
ci_logs_password = os.getenv(
"CLICKHOUSE_CI_LOGS_PASSWORD", "CLICKHOUSE_CI_LOGS_PASSWORD"
)
if ci_logs_host not in ("CLICKHOUSE_CI_LOGS_HOST", ""):
subprocess.check_call(
f"sed -i -r -e 's!{ci_logs_host}!CLICKHOUSE_CI_LOGS_HOST!g; s!{ci_logs_password}!CLICKHOUSE_CI_LOGS_PASSWORD!g;' '{run_log_path}'",
shell=True,
)
report_url = upload_results(
s3_helper,
pr_info.number,
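
Note on the hunk above: the removed block interpolated the secret values into a `sed -r` expression that uses `!` as its delimiter, so a value containing `!` or a regex metacharacter could break or alter the substitution. The replacement call, `clean_ci_logs_from_credentials`, lives in `clickhouse_helper`, which is not shown here. The following is only a minimal sketch of how such masking can be done with literal string replacement; `mask_secrets` and its arguments are hypothetical names, not the helper's actual code.

```python
from pathlib import Path
from typing import Dict


def mask_secrets(log_path: Path, secrets: Dict[str, str]) -> None:
    """Replace each secret value found in the log with its placeholder.

    `secrets` maps a secret value to the placeholder written in its place.
    Hypothetical helper for illustration only.
    """
    text = log_path.read_text(encoding="utf-8", errors="replace")
    for value, placeholder in secrets.items():
        # Skip unset secrets: replacing "" would insert the placeholder
        # between every pair of characters.
        if value:
            # str.replace matches literally, so no escaping is needed.
            text = text.replace(value, placeholder)
    log_path.write_text(text, encoding="utf-8")
```

As the removed comment already points out, masking of this kind cannot stop a script from printing a secret in pieces; it only guards against accidental leaks.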

View File

@ -275,3 +275,5 @@ function collect_core_dumps()
mv $core.zst /test_output/
done
}
# vi: ft=bash

View File

@ -34,3 +34,5 @@ function run_with_retry()
function fn_exists() {
declare -F "$1" > /dev/null;
}
# vi: ft=bash

View File

@ -0,0 +1,3 @@
<clickhouse>
<validate_tcp_client_information>true</validate_tcp_client_information>
</clickhouse>

View File

@ -61,6 +61,7 @@ ln -sf $SRC_PATH/config.d/disable_s3_env_credentials.xml $DEST_SERVER_PATH/confi
ln -sf $SRC_PATH/config.d/enable_wait_for_shutdown_replicated_tables.xml $DEST_SERVER_PATH/config.d/
ln -sf $SRC_PATH/config.d/backups.xml $DEST_SERVER_PATH/config.d/
ln -sf $SRC_PATH/config.d/filesystem_caches_path.xml $DEST_SERVER_PATH/config.d/
ln -sf $SRC_PATH/config.d/validate_tcp_client_information.xml $DEST_SERVER_PATH/config.d/
# Not supported with fasttest.
if [ "${DEST_SERVER_PATH}" = "/etc/clickhouse-server" ]

View File

@ -61,4 +61,6 @@
</protocols>
<!--tcp_port>9010</tcp_port-->
<validate_tcp_client_information>true</validate_tcp_client_information>
</clickhouse>

View File

@ -84,7 +84,7 @@ def test_connections():
assert execute_query_https(server.ip_address, 8444, "SELECT 1") == "1\n"
data = "PROXY TCP4 255.255.255.255 255.255.255.255 65535 65535\r\n\0\021ClickHouse client\024\r\253\251\003\0\007default\0\004\001\0\001\0\0\t0.0.0.0:0\001\tmilovidov\021milovidov-desktop\vClickHouse \024\r\253\251\003\0\001\0\0\0\002\001\025SELECT 'Hello, world'\002\0\247\203\254l\325\\z|\265\254F\275\333\206\342\024\202\024\0\0\0\n\0\0\0\240\01\0\02\377\377\377\377\0\0\0"
data = "PROXY TCP4 255.255.255.255 255.255.255.255 65535 65535\r\n\0\021ClickHouse client\024\r\253\251\003\0\007default\0\004\001\0\001\0\0\t0.0.0.0:0\001\tmilovidov\021milovidov-desktop\21ClickHouse client\024\r\253\251\003\0\001\0\0\0\002\001\025SELECT 'Hello, world'\002\0\247\203\254l\325\\z|\265\254F\275\333\206\342\024\202\024\0\0\0\n\0\0\0\240\01\0\02\377\377\377\377\0\0\0"
assert (
netcat(server.ip_address, 9100, bytearray(data, "latin-1")).find(
bytearray("Hello, world", "latin-1")
@ -92,7 +92,7 @@ def test_connections():
>= 0
)
data_user_allowed = "PROXY TCP4 123.123.123.123 255.255.255.255 65535 65535\r\n\0\021ClickHouse client\024\r\253\251\003\0\007user123\0\004\001\0\001\0\0\t0.0.0.0:0\001\tmilovidov\021milovidov-desktop\vClickHouse \024\r\253\251\003\0\001\0\0\0\002\001\025SELECT 'Hello, world'\002\0\247\203\254l\325\\z|\265\254F\275\333\206\342\024\202\024\0\0\0\n\0\0\0\240\01\0\02\377\377\377\377\0\0\0"
data_user_allowed = "PROXY TCP4 123.123.123.123 255.255.255.255 65535 65535\r\n\0\021ClickHouse client\024\r\253\251\003\0\007user123\0\004\001\0\001\0\0\t0.0.0.0:0\001\tmilovidov\021milovidov-desktop\21ClickHouse client\024\r\253\251\003\0\001\0\0\0\002\001\025SELECT 'Hello, world'\002\0\247\203\254l\325\\z|\265\254F\275\333\206\342\024\202\024\0\0\0\n\0\0\0\240\01\0\02\377\377\377\377\0\0\0"
assert (
netcat(server.ip_address, 9100, bytearray(data_user_allowed, "latin-1")).find(
bytearray("Hello, world", "latin-1")
@ -100,7 +100,7 @@ def test_connections():
>= 0
)
data_user_restricted = "PROXY TCP4 127.0.0.1 255.255.255.255 65535 65535\r\n\0\021ClickHouse client\024\r\253\251\003\0\007user123\0\004\001\0\001\0\0\t0.0.0.0:0\001\tmilovidov\021milovidov-desktop\vClickHouse \024\r\253\251\003\0\001\0\0\0\002\001\025SELECT 'Hello, world'\002\0\247\203\254l\325\\z|\265\254F\275\333\206\342\024\202\024\0\0\0\n\0\0\0\240\01\0\02\377\377\377\377\0\0\0"
data_user_restricted = "PROXY TCP4 127.0.0.1 255.255.255.255 65535 65535\r\n\0\021ClickHouse client\024\r\253\251\003\0\007user123\0\004\001\0\001\0\0\t0.0.0.0:0\001\tmilovidov\021milovidov-desktop\21ClickHouse client\024\r\253\251\003\0\001\0\0\0\002\001\025SELECT 'Hello, world'\002\0\247\203\254l\325\\z|\265\254F\275\333\206\342\024\202\024\0\0\0\n\0\0\0\240\01\0\02\377\377\377\377\0\0\0"
assert (
netcat(
server.ip_address, 9100, bytearray(data_user_restricted, "latin-1")

View File

@ -0,0 +1,12 @@
<clickhouse>
<max_thread_pool_free_size hide_in_preprocessed="1">2000</max_thread_pool_free_size>
<max_table_size_to_drop hide_in_preprocessed="true">60000000000</max_table_size_to_drop>
<max_partition_size_to_drop hide_in_preprocessed="false">40000000000</max_partition_size_to_drop>
<named_collections hide_in_preprocessed="true">
<name>
<key_1>value</key_1>
<key_2>value_2</key_2>
<url>https://connection.url/</url>
</name>
</named_collections>
</clickhouse>

View File

@ -0,0 +1,7 @@
<clickhouse>
<users>
<default>
<named_collection_control>1</named_collection_control>
</default>
</users>
</clickhouse>

View File

@ -0,0 +1,57 @@
import pytest
import os
from helpers.cluster import ClickHouseCluster
cluster = ClickHouseCluster(__file__)
node = cluster.add_instance(
"node", main_configs=["configs/config.xml"], user_configs=["configs/users.xml"]
)
@pytest.fixture(scope="module")
def started_cluster():
try:
cluster.start()
yield cluster
finally:
cluster.shutdown()
def test_hide_in_preprocessed(started_cluster):
assert (
node.query(
"select value from system.server_settings where name ='max_thread_pool_free_size'"
)
== "2000\n"
)
assert (
node.query(
"select value from system.server_settings where name ='max_table_size_to_drop'"
)
== "60000000000\n"
)
assert (
node.query(
"select value from system.server_settings where name ='max_partition_size_to_drop'"
)
== "40000000000\n"
)
assert "key_1" in node.query("select collection from system.named_collections")
out = node.exec_in_container(
["cat", "/var/lib/clickhouse/preprocessed_configs/config.xml"]
)
assert (
'<max_thread_pool_free_size hide_in_preprocessed="1">2000</max_thread_pool_free_size>'
not in out
)
assert (
'<max_table_size_to_drop hide_in_preprocessed="true">60000000000</max_table_size_to_drop>'
not in out
)
assert (
'<max_partition_size_to_drop hide_in_preprocessed="false">40000000000</max_partition_size_to_drop>'
in out
)
assert '<named_collections hide_in_preprocessed="true">' not in out
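
Note on the test above: it expects elements marked `hide_in_preprocessed="1"` or `"true"` to be absent from the preprocessed config while their values still reach the server, and elements marked `"false"` to remain. The filtering step can be sketched roughly as follows. This is an illustration with Python's standard XML library, not ClickHouse's implementation; `strip_hidden` is a hypothetical name, and the set of truthy values is an assumption taken from the values the test uses.

```python
import xml.etree.ElementTree as ET

# Assumed truthy spellings, taken from the attribute values in the test.
TRUTHY = {"1", "true"}


def strip_hidden(root: ET.Element) -> None:
    """Remove, in place, every descendant marked hide_in_preprocessed."""
    # Snapshot the tree first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("hide_in_preprocessed", "").lower() in TRUTHY:
                parent.remove(child)


config = ET.fromstring(
    "<clickhouse>"
    '<max_thread_pool_free_size hide_in_preprocessed="1">2000</max_thread_pool_free_size>'
    '<max_partition_size_to_drop hide_in_preprocessed="false">40000000000</max_partition_size_to_drop>'
    '<named_collections hide_in_preprocessed="true"><name><key_1>value</key_1></name></named_collections>'
    "</clickhouse>"
)
strip_hidden(config)
remaining = [child.tag for child in config]
print(remaining)  # ['max_partition_size_to_drop']
```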

View File

@ -96,7 +96,10 @@ def threaded_run_test(sessions):
thread.start()
if len(sessions) > MAX_SESSIONS_FOR_USER:
assert_logs_contain_with_retry(instance, "overflown session count")
# High retry amount to avoid flakiness in ASAN (+Analyzer) tests
assert_logs_contain_with_retry(
instance, "overflown session count", retry_count=60
)
instance.query(f"KILL QUERY WHERE user='{TEST_USER}' SYNC")

View File

@ -3,7 +3,7 @@
<default>
<password></password>
<profile>default</profile>
<named_collection_control>1</named_collection_control>
<named_collection_admin>1</named_collection_admin>
</default>
</users>
</clickhouse>

View File

@ -1,4 +1,6 @@
<test>
<query>SELECT COUNT(DISTINCT Title) FROM test.hits SETTINGS max_threads = 24</query>
<query>SELECT COUNT(DISTINCT Title) FROM test.hits SETTINGS max_threads = 56</query>
<query>SELECT COUNT(DISTINCT Title) FROM test.hits SETTINGS max_threads = 64</query>
<query>SELECT COUNT(DISTINCT Referer) FROM test.hits SETTINGS max_threads = 22</query>
</test>

View File

@ -4,3 +4,5 @@ function test_variant {
perl -E "say \$_ for map {chomp; (qq{$1})} qx{$CLICKHOUSE_CLIENT -q 'SELECT name FROM system.functions ORDER BY name;'}" | $CLICKHOUSE_CLIENT --calculate_text_stack_trace=0 -n --ignore-error >/dev/null 2>&1
$CLICKHOUSE_CLIENT -q "SELECT 'Still alive'"
}
# vi: ft=bash

View File

@ -1 +1,3 @@
Hello, world
Hello, world
Hello, world

View File

@ -6,4 +6,8 @@ CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CURDIR"/../shell_config.sh
# Old clickhouse-client (with version 23.8-) sends "ClickHouse client" and then "ClickHouse" or "ClickHouse ".
# For backward compatibility purposes, the server accepts both variants.
printf "PROXY TCP4 255.255.255.255 255.255.255.255 65535 65535\r\n\0\21ClickHouse client\24\r\253\251\3\0\7default\0\4\1\0\1\0\0\t0.0.0.0:0\1\tmilovidov\21milovidov-desktop\nClickHouse\24\r\253\251\3\0\1\0\0\0\2\1\25SELECT 'Hello, world'\2\0\247\203\254l\325\\z|\265\254F\275\333\206\342\24\202\24\0\0\0\n\0\0\0\240\1\0\2\377\377\377\377\0\0\0" | nc "${CLICKHOUSE_HOST}" "${CLICKHOUSE_PORT_TCP_WITH_PROXY}" | head -c150 | grep --text -o -F 'Hello, world'
printf "PROXY TCP4 255.255.255.255 255.255.255.255 65535 65535\r\n\0\21ClickHouse client\24\r\253\251\3\0\7default\0\4\1\0\1\0\0\t0.0.0.0:0\1\tmilovidov\21milovidov-desktop\vClickHouse \24\r\253\251\3\0\1\0\0\0\2\1\25SELECT 'Hello, world'\2\0\247\203\254l\325\\z|\265\254F\275\333\206\342\24\202\24\0\0\0\n\0\0\0\240\1\0\2\377\377\377\377\0\0\0" | nc "${CLICKHOUSE_HOST}" "${CLICKHOUSE_PORT_TCP_WITH_PROXY}" | head -c150 | grep --text -o -F 'Hello, world'
printf "PROXY TCP4 255.255.255.255 255.255.255.255 65535 65535\r\n\0\21ClickHouse client\24\r\253\251\3\0\7default\0\4\1\0\1\0\0\t0.0.0.0:0\1\tmilovidov\21milovidov-desktop\21ClickHouse client\24\r\253\251\3\0\1\0\0\0\2\1\25SELECT 'Hello, world'\2\0\247\203\254l\325\\z|\265\254F\275\333\206\342\24\202\24\0\0\0\n\0\0\0\240\1\0\2\377\377\377\377\0\0\0" | nc "${CLICKHOUSE_HOST}" "${CLICKHOUSE_PORT_TCP_WITH_PROXY}" | head -c150 | grep --text -o -F 'Hello, world'
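
Note on the three payloads above: they differ only in the second client name, and the byte in front of each name equals the name's length (`\n` is 10 for `ClickHouse`, `\v` is 11 for `ClickHouse `, and octal `\21` is 17 for `ClickHouse client`). A minimal sketch of that length-prefixed encoding follows. The variable-length integer prefix is an assumption about the wire format; for these short names it is a single byte either way. `encode_varuint` and `encode_string` are hypothetical names.

```python
def encode_varuint(value: int) -> bytes:
    """Encode an unsigned integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Prefix the UTF-8 bytes of `text` with their length."""
    data = text.encode("utf-8")
    return encode_varuint(len(data)) + data


for name in ("ClickHouse", "ClickHouse ", "ClickHouse client"):
    print(encode_string(name))
# b'\nClickHouse'
# b'\x0bClickHouse '
# b'\x11ClickHouse client'
```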

View File

@ -23,3 +23,5 @@ $CLICKHOUSE_CLIENT -q "select number, number + 1, concat('string: ', toString(nu
diff $non_parallel_file $parallel_file
rm $non_parallel_file $parallel_file
# vi: ft=bash

View File

@ -12,18 +12,21 @@ sys.path.insert(0, os.path.join(CURDIR, "helpers"))
from pure_http_client import ClickHouseClient
table_engine = sys.argv[1]
client = ClickHouseClient()
# test table without partition
client.query("DROP TABLE IF EXISTS t_async_insert_dedup_no_part SYNC")
client.query(
"""
create_query = f"""
CREATE TABLE t_async_insert_dedup_no_part (
KeyID UInt32
) Engine = ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/t_async_insert_dedup', '{replica}')
) Engine = {table_engine}('/clickhouse/tables/{{shard}}/{{database}}/t_async_insert_dedup', '{{replica}}')
ORDER BY (KeyID)
"""
)
client.query(create_query)
client.query(
"insert into t_async_insert_dedup_no_part values (1), (2), (3), (4), (5)",
@ -101,22 +104,22 @@ def fetch_and_insert_data(q, client):
# main process
client.query("DROP TABLE IF EXISTS t_async_insert_dedup SYNC")
client.query(
"""
create_query = f"""
CREATE TABLE t_async_insert_dedup (
EventDate DateTime,
KeyID UInt32
) Engine = ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/t_async_insert_dedup', '{replica}')
) Engine = {table_engine}('/clickhouse/tables/{{shard}}/{{database}}/t_async_insert_dedup', '{{replica}}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY (KeyID, EventDate) SETTINGS use_async_block_ids_cache = 1
"""
)
client.query(create_query)
q = queue.Queue(100)
total_number = 10000
use_token = False
if sys.argv[-1] == "token":
if len(sys.argv) > 3 and sys.argv[2] == "token":
use_token = True
gen = Thread(target=generate_data, args=[q, total_number, use_token])
@ -158,13 +161,14 @@ while True:
break
result = client.query(
"SELECT value FROM system.metrics where metric = 'AsyncInsertCacheSize'"
"SELECT value FROM system.metrics where metric = 'AsyncInsertCacheSize'"
)
result = int(result.split()[0])
if result <= 0:
raise Exception(f"AsyncInsertCacheSize should > 0, but got {result}")
result = client.query(
"SELECT value FROM system.events where event = 'AsyncInsertCacheHits'"
"SELECT value FROM system.events where event = 'AsyncInsertCacheHits'"
)
result = int(result.split()[0])
if result <= 0:

View File

@ -6,4 +6,4 @@ CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
. "$CURDIR"/../shell_config.sh
# We should have correct env vars from shell_config.sh to run this test
python3 "$CURDIR"/02481_async_insert_dedup.python
python3 "$CURDIR"/02481_async_insert_dedup.python ReplicatedMergeTree

View File

@ -6,4 +6,4 @@ CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
. "$CURDIR"/../shell_config.sh
# We should have correct env vars from shell_config.sh to run this test
python3 "$CURDIR"/02481_async_insert_dedup.python token
python3 "$CURDIR"/02481_async_insert_dedup.python ReplicatedMergeTree token

View File

@ -8,6 +8,7 @@ CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
function read_archive_file() {
$CLICKHOUSE_LOCAL --query "SELECT * FROM file('${user_files_path}/$1') ORDER BY 1, 2"
$CLICKHOUSE_CLIENT --query "SELECT * FROM file('${user_files_path}/$1') ORDER BY 1, 2"
$CLICKHOUSE_CLIENT --query "DESC file('${user_files_path}/$1')"
$CLICKHOUSE_CLIENT --query "CREATE TABLE 02661_archive_table Engine=File('CSV', '${user_files_path}/$1')"
$CLICKHOUSE_CLIENT --query "SELECT * FROM 02661_archive_table ORDER BY 1, 2"
$CLICKHOUSE_CLIENT --query "DROP TABLE 02661_archive_table"
@ -16,15 +17,17 @@ function read_archive_file() {
function run_archive_test() {
$CLICKHOUSE_CLIENT --query "DROP TABLE IF EXISTS 02661_archive_table"
FILE_PREFIX="${CLICKHOUSE_TEST_UNIQUE_NAME}_$1_"
extension_without_dot=$(echo $1 | sed -e 's/\.//g')
FILE_PREFIX="02661_read_from_archive_${CLICKHOUSE_DATABASE}_$extension_without_dot"
user_files_path=$(clickhouse-client --query "select _path,_file from file('nonexist.txt', 'CSV', 'val1 char')" 2>&1 | grep Exception | awk '{gsub("/nonexist.txt","",$9); print $9}')
user_files_path=$(clickhouse-client --query "select _path, _file from file('nonexist.txt', 'CSV', 'val1 char')" 2>&1 | grep -o "/[^[:space:]]*nonexist.txt" | awk '{gsub("/nonexist.txt","",$1); print $1}')
touch ${FILE_PREFIX}_data0.csv
echo -e "1,2\n3,4" > ${FILE_PREFIX}_data1.csv
echo -e "5,6\n7,8" > ${FILE_PREFIX}_data2.csv
echo -e "9,10\n11,12" > ${FILE_PREFIX}_data3.csv
eval "$2 ${user_files_path}/${FILE_PREFIX}_archive1.$1 ${FILE_PREFIX}_data1.csv ${FILE_PREFIX}_data2.csv > /dev/null"
eval "$2 ${user_files_path}/${FILE_PREFIX}_archive1.$1 ${FILE_PREFIX}_data0.csv ${FILE_PREFIX}_data1.csv ${FILE_PREFIX}_data2.csv > /dev/null"
eval "$2 ${user_files_path}/${FILE_PREFIX}_archive2.$1 ${FILE_PREFIX}_data1.csv ${FILE_PREFIX}_data3.csv > /dev/null"
eval "$2 ${user_files_path}/${FILE_PREFIX}_archive3.$1 ${FILE_PREFIX}_data2.csv ${FILE_PREFIX}_data3.csv > /dev/null"
@ -41,10 +44,12 @@ function run_archive_test() {
echo "archive* {2..3}.csv"
read_archive_file "${FILE_PREFIX}_archive*.$1 :: ${FILE_PREFIX}_data{2..3}.csv"
$CLICKHOUSE_LOCAL --query "SELECT * FROM file('${user_files_path}/${FILE_PREFIX}_archive1.$1::nonexistent.csv')" 2>&1 | grep -q "CANNOT_UNPACK_ARCHIVE" && echo "OK" || echo "FAIL"
$CLICKHOUSE_LOCAL --query "SELECT * FROM file('${user_files_path}/${FILE_PREFIX}_archive3.$1::{2..3}.csv')" 2>&1 | grep -q "CANNOT_UNPACK_ARCHIVE" && echo "OK" || echo "FAIL"
$CLICKHOUSE_LOCAL --query "SELECT * FROM file('${user_files_path}/${FILE_PREFIX}_archive1.$1::nonexistent.csv')" 2>&1 | grep -q "CANNOT_EXTRACT_TABLE_STRUCTURE" && echo "OK" || echo "FAIL"
$CLICKHOUSE_LOCAL --query "SELECT * FROM file('${user_files_path}/${FILE_PREFIX}_archive3.$1::{2..3}.csv')" 2>&1 | grep -q "CANNOT_EXTRACT_TABLE_STRUCTURE" && echo "OK" || echo "FAIL"
rm ${user_files_path}/${FILE_PREFIX}_archive{1..3}.$1
rm ${FILE_PREFIX}_data{1..3}.csv
}
rm ${FILE_PREFIX}_data{0..3}.csv
}
# vi: ft=bash

View File

@ -3,6 +3,8 @@ archive1 data1.csv
3 4
1 2
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
3 4
archive{1..2} data1.csv
@ -14,6 +16,8 @@ archive{1..2} data1.csv
1 2
3 4
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -31,6 +35,8 @@ archive{1,2} data{1,3}.csv
3 4
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -46,6 +52,8 @@ archive3 data*.csv
7 8
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
7 8
9 10
@ -75,6 +83,8 @@ archive* *.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -104,6 +114,8 @@ archive* {2..3}.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
5 6
7 8

View File

@ -3,6 +3,8 @@ archive1 data1.csv
3 4
1 2
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
3 4
archive{1..2} data1.csv
@ -14,6 +16,8 @@ archive{1..2} data1.csv
1 2
3 4
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -31,6 +35,8 @@ archive{1,2} data{1,3}.csv
3 4
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -46,6 +52,8 @@ archive3 data*.csv
7 8
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
7 8
9 10
@ -75,6 +83,8 @@ archive* *.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -104,6 +114,8 @@ archive* {2..3}.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
5 6
7 8

View File

@ -3,6 +3,8 @@ archive1 data1.csv
3 4
1 2
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
3 4
archive{1..2} data1.csv
@ -14,6 +16,8 @@ archive{1..2} data1.csv
1 2
3 4
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -31,6 +35,8 @@ archive{1,2} data{1,3}.csv
3 4
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -46,6 +52,8 @@ archive3 data*.csv
7 8
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
7 8
9 10
@ -75,6 +83,8 @@ archive* *.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -104,6 +114,8 @@ archive* {2..3}.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
5 6
7 8

View File

@ -3,6 +3,8 @@ archive1 data1.csv
3 4
1 2
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
3 4
archive{1..2} data1.csv
@ -14,6 +16,8 @@ archive{1..2} data1.csv
1 2
3 4
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -31,6 +35,8 @@ archive{1,2} data{1,3}.csv
3 4
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -46,6 +52,8 @@ archive3 data*.csv
7 8
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
7 8
9 10
@ -75,6 +83,8 @@ archive* *.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -104,6 +114,8 @@ archive* {2..3}.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
5 6
7 8

View File

@ -3,6 +3,8 @@ archive1 data1.csv
3 4
1 2
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
3 4
archive{1..2} data1.csv
@ -14,6 +16,8 @@ archive{1..2} data1.csv
1 2
3 4
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -31,6 +35,8 @@ archive{1,2} data{1,3}.csv
3 4
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -46,6 +52,8 @@ archive3 data*.csv
7 8
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
7 8
9 10
@ -75,6 +83,8 @@ archive* *.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -104,6 +114,8 @@ archive* {2..3}.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
5 6
7 8

View File

@ -3,6 +3,8 @@ archive1 data1.csv
3 4
1 2
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
3 4
archive{1..2} data1.csv
@ -14,6 +16,8 @@ archive{1..2} data1.csv
1 2
3 4
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -31,6 +35,8 @@ archive{1,2} data{1,3}.csv
3 4
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -46,6 +52,8 @@ archive3 data*.csv
7 8
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
7 8
9 10
@ -75,6 +83,8 @@ archive* *.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -104,6 +114,8 @@ archive* {2..3}.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
5 6
7 8

View File

@ -3,6 +3,8 @@ archive1 data1.csv
3 4
1 2
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
3 4
archive{1..2} data1.csv
@ -14,6 +16,8 @@ archive{1..2} data1.csv
1 2
3 4
3 4
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -31,6 +35,8 @@ archive{1,2} data{1,3}.csv
3 4
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -46,6 +52,8 @@ archive3 data*.csv
7 8
9 10
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
7 8
9 10
@ -75,6 +83,8 @@ archive* *.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
1 2
1 2
3 4
@ -104,6 +114,8 @@ archive* {2..3}.csv
9 10
11 12
11 12
c1 Nullable(Int64)
c2 Nullable(Int64)
5 6
5 6
7 8

View File

@ -5,5 +5,11 @@ implicit:
4
Test 2: check Filesystem database
4
30
10
4
3
2
1
Test 3: check show database with Filesystem
test02707

View File

@ -15,6 +15,23 @@ echo '2,"def",456,"bacabaa"' >> $dir/tmp.csv
echo '3,"story",78912,"acabaab"' >> $dir/tmp.csv
echo '4,"history",21321321,"cabaaba"' >> $dir/tmp.csv
$CLICKHOUSE_LOCAL -q "insert into function file('$dir/tmp_numbers_1.csv') select * from numbers(1, 10)"
$CLICKHOUSE_LOCAL -q "insert into function file('$dir/tmp_numbers_2.csv') select * from numbers(11, 10)"
$CLICKHOUSE_LOCAL -q "insert into function file('$dir/tmp_numbers_30.csv') select * from numbers(21, 10)"
readonly nested_dir=$dir/nested
[[ -d $nested_dir ]] && rm -rd $nested_dir
mkdir $nested_dir
mkdir $nested_dir/subnested
cp ${dir}/tmp_numbers_1.csv ${nested_dir}/nested_tmp_numbers_1.csv
cp ${dir}/tmp_numbers_1.csv ${nested_dir}/subnested/subnested_tmp_numbers_1.csv
readonly other_nested_dir=$dir/other_nested
[[ -d $other_nested_dir ]] && rm -rd $other_nested_dir
mkdir $other_nested_dir
cp ${dir}/tmp_numbers_1.csv ${other_nested_dir}/tmp_numbers_1.csv
#################
echo "Test 1: check explicit and implicit call of the file table function"
@ -29,6 +46,12 @@ $CLICKHOUSE_LOCAL --multiline --multiquery -q """
DROP DATABASE IF EXISTS test;
CREATE DATABASE test ENGINE = Filesystem('${dir}');
SELECT COUNT(*) FROM test.\`tmp.csv\`;
SELECT COUNT(*) FROM test.\`tmp_numbers_*.csv\`;
SELECT COUNT(*) FROM test.\`nested/nested_tmp_numbers_1*.csv\`;
SELECT count(DISTINCT _path) FROM test.\`*.csv\`;
SELECT count(DISTINCT _path) FROM test.\`**/*.csv\`;
SELECT count(DISTINCT _path) FROM test.\`**/*.csv\` WHERE position(_path, '${nested_dir}') > 0;
SELECT count(DISTINCT _path) FROM test.\`**/*.csv\` WHERE position(_path, '${nested_dir}') = 0;
DROP DATABASE test;
"""

View File

@ -3,6 +3,14 @@ Test 1: create filesystem database and check implicit calls
test1
4
4
30
10
10
4
0
2
0
OK
4
Test 2: check DatabaseFilesystem access rights and errors handling on server
OK
@ -13,3 +21,6 @@ OK
OK
OK
OK
OK
OK
OK

View File

@ -19,11 +19,17 @@ echo '3,"story",78912,"acabaab"' >> ${user_files_tmp_dir}/tmp.csv
echo '4,"history",21321321,"cabaaba"' >> ${user_files_tmp_dir}/tmp.csv
tmp_dir=${CLICKHOUSE_TEST_UNIQUE_NAME}
$CLICKHOUSE_LOCAL -q "insert into function file('$user_files_tmp_dir/tmp_numbers_1.csv') select * from numbers(1, 10)"
$CLICKHOUSE_LOCAL -q "insert into function file('$user_files_tmp_dir/tmp_numbers_2.csv') select * from numbers(11, 10)"
$CLICKHOUSE_LOCAL -q "insert into function file('$user_files_tmp_dir/tmp_numbers_30.csv') select * from numbers(21, 10)"
[[ -d $tmp_dir ]] && rm -rd $tmp_dir
mkdir $tmp_dir
cp ${user_files_tmp_dir}/tmp.csv ${tmp_dir}/tmp.csv
cp ${user_files_tmp_dir}/tmp.csv ${user_files_tmp_dir}/tmp/tmp.csv
cp ${user_files_tmp_dir}/tmp.csv ${user_files_tmp_dir}/tmp.myext
cp ${user_files_tmp_dir}/tmp_numbers_1.csv ${user_files_tmp_dir}/tmp/tmp_numbers_1.csv
#################
echo "Test 1: create filesystem database and check implicit calls"
@ -35,6 +41,15 @@ echo $?
${CLICKHOUSE_CLIENT} --query "SHOW DATABASES" | grep "test1"
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`${unique_name}/tmp.csv\`;"
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`${unique_name}/tmp/tmp.csv\`;"
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`${unique_name}/tmp_numbers_*.csv\`;"
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`${unique_name}/tmp/*tmp_numbers_*.csv\`;"
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`${unique_name}/*/*tmp_numbers_*.csv\`;"
${CLICKHOUSE_CLIENT} --query "SELECT count(DISTINCT _path) FROM test1.\`${unique_name}/*.csv\` WHERE startsWith(_path, '${user_files_tmp_dir}')";
${CLICKHOUSE_CLIENT} --query "SELECT count(DISTINCT _path) FROM test1.\`${unique_name}/*.csv\` WHERE not startsWith(_path, '${user_files_tmp_dir}')";
# **/* does not search in the current directory but searches recursively in nested directories.
${CLICKHOUSE_CLIENT} --query "SELECT count(DISTINCT _path) FROM test1.\`${unique_name}/**/*.csv\` WHERE startsWith(_path, '${user_files_tmp_dir}')";
${CLICKHOUSE_CLIENT} --query "SELECT count(DISTINCT _path) FROM test1.\`${unique_name}/**/*.csv\` WHERE not startsWith(_path, '${user_files_tmp_dir}')";
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`tmp_numbers_*.csv\`;" 2>&1 | tr '\n' ' ' | grep -oF "CANNOT_EXTRACT_TABLE_STRUCTURE" > /dev/null && echo "OK" || echo 'FAIL' ||:
${CLICKHOUSE_LOCAL} -q "SELECT COUNT(*) FROM \"${tmp_dir}/tmp.csv\""
#################
@ -42,6 +57,9 @@ echo "Test 2: check DatabaseFilesystem access rights and errors handling on serv
# DATABASE_ACCESS_DENIED: Allows list files only inside user_files
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`../tmp.csv\`;" 2>&1 | tr '\n' ' ' | grep -oF "PATH_ACCESS_DENIED" > /dev/null && echo "OK" || echo 'FAIL' ||:
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`/tmp/tmp.csv\`;" 2>&1 | tr '\n' ' ' | grep -oF "PATH_ACCESS_DENIED" > /dev/null && echo "OK" || echo 'FAIL' ||:
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`../*/tmp_numbers_*.csv\`;" 2>&1 | tr '\n' ' ' | grep -oF "PATH_ACCESS_DENIED" > /dev/null && echo "OK" || echo 'FAIL' ||:
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`../tmp_numbers_*.csv\`;" 2>&1 | tr '\n' ' ' | grep -oF "PATH_ACCESS_DENIED" > /dev/null && echo "OK" || echo 'FAIL' ||:
${CLICKHOUSE_CLIENT} --query "SELECT COUNT(*) FROM test1.\`../*.csv\`;" 2>&1 | tr '\n' ' ' | grep -oF "PATH_ACCESS_DENIED" > /dev/null && echo "OK" || echo 'FAIL' ||:
${CLICKHOUSE_CLIENT} --multiline --multiquery --query """
USE test1;
SELECT COUNT(*) FROM \"../${tmp_dir}/tmp.csv\";
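
Note on the `**/*` comment in the hunk above: per that comment, the pattern matches files in nested directories at any depth but not files directly in the starting directory. This differs from Python's own recursive glob, where `**` can also match zero directories. A minimal sketch of the semantics described by the comment, using a hypothetical helper `nested_csv_files`:

```python
import os
from typing import List


def nested_csv_files(base: str) -> List[str]:
    """Return .csv files in subdirectories of `base`, at any depth,
    excluding files that sit directly in `base`."""
    base_abs = os.path.abspath(base)
    found = []
    for root, _dirs, files in os.walk(base_abs):
        if root == base_abs:
            continue  # skip the starting directory itself
        found.extend(
            os.path.join(root, name) for name in files if name.endswith(".csv")
        )
    return sorted(found)
```

With `tmp.csv` in the base directory and `tmp/tmp.csv` plus `tmp/tmp_numbers_1.csv` below it, the helper returns only the two nested files, which agrees with the count of 2 in the reference output for the server test.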

View File

@ -25,3 +25,37 @@ Test 3a: check literal no parsing overflow
1
Test 3b: check literal empty
1
Test 4: select using * wildcard
30
30
30
30
30
10
30
10
Test 4b: select using ? wildcard
20
10
20
10
20
Test 4c: select using '{' + '}' wildcards
20
20
1
Test 4d: select using ? and * wildcards
30
30
30
1
30
30
Test 4e: select using ?, * and '{' + '}' wildcards
10
20
20
20
Test 4f: recursive search
2
1

Some files were not shown because too many files have changed in this diff