diff --git a/.gitignore b/.gitignore index 4bc162c1b0f..8a745655cbf 100644 --- a/.gitignore +++ b/.gitignore @@ -159,6 +159,7 @@ website/package-lock.json /programs/server/store /programs/server/uuid /programs/server/coordination +/programs/server/workload # temporary test files tests/queries/0_stateless/test_* diff --git a/.gitmodules b/.gitmodules index bbc8fc7d06c..a3b6450032a 100644 --- a/.gitmodules +++ b/.gitmodules @@ -332,7 +332,7 @@ url = https://github.com/ClickHouse/usearch.git [submodule "contrib/SimSIMD"] path = contrib/SimSIMD - url = https://github.com/ashvardanian/SimSIMD.git + url = https://github.com/ClickHouse/SimSIMD.git [submodule "contrib/FP16"] path = contrib/FP16 url = https://github.com/Maratyszcza/FP16.git diff --git a/CHANGELOG.md b/CHANGELOG.md index 6c0d21a4698..90285582b4e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,4 +1,5 @@ ### Table of Contents +**[ClickHouse release v24.10, 2024-10-31](#2410)**
**[ClickHouse release v24.9, 2024-09-26](#249)**
**[ClickHouse release v24.8 LTS, 2024-08-20](#248)**
**[ClickHouse release v24.7, 2024-07-30](#247)**
@@ -12,6 +13,165 @@ # 2024 Changelog +### ClickHouse release 24.10, 2024-10-31 + +#### Backward Incompatible Change +* Allow to write `SETTINGS` before `FORMAT` in a chain of queries with `UNION` when subqueries are inside parentheses. This closes [#39712](https://github.com/ClickHouse/ClickHouse/issues/39712). Change the behavior when a query has the SETTINGS clause specified twice in a sequence. The closest SETTINGS clause now takes precedence for the corresponding subquery. In previous versions, the outermost SETTINGS clause could take precedence over the inner one. [#68614](https://github.com/ClickHouse/ClickHouse/pull/68614) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Reordering of filter conditions from the `[PRE]WHERE` clause is now allowed by default. It can be disabled by setting `allow_reorder_prewhere_conditions` to `false`. [#70657](https://github.com/ClickHouse/ClickHouse/pull/70657) ([Nikita Taranov](https://github.com/nickitat)). +* Remove the `idxd-config` library, which has an incompatible license. This also removes the experimental Intel DeflateQPL codec. [#70987](https://github.com/ClickHouse/ClickHouse/pull/70987) ([Alexey Milovidov](https://github.com/alexey-milovidov)). + +#### New Feature +* Allow to grant access to wildcard prefixes: `GRANT SELECT ON db.table_prefix_* TO user`. [#65311](https://github.com/ClickHouse/ClickHouse/pull/65311) ([pufit](https://github.com/pufit)). +* If you press the space bar during query runtime, the client will display a real-time table with detailed metrics. You can enable it globally with the new `--progress-table` option in clickhouse-client; a new `--enable-progress-table-toggle` option is associated with the `--progress-table` option and toggles the rendering of the progress table when the control key (Space) is pressed. [#63689](https://github.com/ClickHouse/ClickHouse/pull/63689) ([Maria Khristenko](https://github.com/mariaKhr)), [#70423](https://github.com/ClickHouse/ClickHouse/pull/70423) ([Julia Kartseva](https://github.com/jkartseva)). +* Allow to cache read files for object storage table engines and data lakes, using a hash of the ETag + file path as the cache key. [#70135](https://github.com/ClickHouse/ClickHouse/pull/70135) ([Kseniia Sumarokova](https://github.com/kssenii)). +* Support creating a table with a query: `CREATE TABLE ... CLONE AS ...`. It clones the source table's schema and then attaches all partitions to the newly created table. This feature is only supported with tables of the `MergeTree` family. Closes [#65015](https://github.com/ClickHouse/ClickHouse/issues/65015). [#69091](https://github.com/ClickHouse/ClickHouse/pull/69091) ([tuanpach](https://github.com/tuanpach)). +* Add a new system table, `system.query_metric_log`, which contains the history of memory and metric values from table `system.events` for individual queries, periodically flushed to disk. [#66532](https://github.com/ClickHouse/ClickHouse/pull/66532) ([Pablo Marcos](https://github.com/pamarcos)). +* A simple SELECT query can be written with an implicit SELECT to enable calculator-style expressions, e.g., `ch "1 + 2"`. This is controlled by a new setting, `implicit_select`. [#68502](https://github.com/ClickHouse/ClickHouse/pull/68502) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Support the `--copy` mode for clickhouse local as a shortcut for format conversion [#68503](https://github.com/ClickHouse/ClickHouse/issues/68503). 
[#68583](https://github.com/ClickHouse/ClickHouse/pull/68583) ([Denis Hananein](https://github.com/denis-hananein)). +* Add a built-in HTML page for visualizing merges, which is available at the `/merges` path. [#70821](https://github.com/ClickHouse/ClickHouse/pull/70821) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Add support for the `arrayUnion` function. [#68989](https://github.com/ClickHouse/ClickHouse/pull/68989) ([Peter Nguyen](https://github.com/petern48)). +* Allow parametrised SQL aliases. [#50665](https://github.com/ClickHouse/ClickHouse/pull/50665) ([Anton Kozlov](https://github.com/tonickkozlov)). +* A new aggregate function `quantileExactWeightedInterpolated`, which is an interpolated version based on `quantileExactWeighted`. Some people may wonder why we need a new `quantileExactWeightedInterpolated` since we already have `quantileExactInterpolatedWeighted`. The reason is that the new one is more accurate than the old one. This is for Spark compatibility. [#69619](https://github.com/ClickHouse/ClickHouse/pull/69619) ([李扬](https://github.com/taiyang-li)). +* A new function `arrayElementOrNull`. It returns `NULL` if the array index is out of range or a Map key is not found. [#69646](https://github.com/ClickHouse/ClickHouse/pull/69646) ([李扬](https://github.com/taiyang-li)). +* Allow users to specify regular expressions through new `message_regexp` and `message_regexp_negative` fields in the `config.xml` file to filter out logging. The filtering is applied to the formatted, un-colored text for the most intuitive developer experience. [#69657](https://github.com/ClickHouse/ClickHouse/pull/69657) ([Peter Nguyen](https://github.com/petern48)). +* Added the `RIPEMD160` function, which computes the RIPEMD-160 cryptographic hash of a string. Example: `SELECT HEX(RIPEMD160('The quick brown fox jumps over the lazy dog'))` returns `37F332F68DB77BD9D7EDD4969571AD671CF9DD3B`. [#70087](https://github.com/ClickHouse/ClickHouse/pull/70087) ([Dergousov Maxim](https://github.com/m7kss1)). +* Support reading `Iceberg` tables on `HDFS`. [#70268](https://github.com/ClickHouse/ClickHouse/pull/70268) ([flynn](https://github.com/ucasfl)). +* Support for CTE in the form of `WITH ... INSERT`, as previously we only supported `INSERT ... WITH ...`. [#70593](https://github.com/ClickHouse/ClickHouse/pull/70593) ([Shichao Jin](https://github.com/jsc0218)). +* MongoDB integration: support for all MongoDB types, support for WHERE and ORDER BY statements on the MongoDB side, and restrictions on expressions unsupported by MongoDB. Note that the new integration is disabled by default; to use it, please set `` to `false` in the server config. [#63279](https://github.com/ClickHouse/ClickHouse/pull/63279) ([Kirill Nikiforov](https://github.com/allmazz)). +* A new function `getSettingOrDefault` was added to return the default value and avoid an exception if a custom setting is not found in the current profile. [#69917](https://github.com/ClickHouse/ClickHouse/pull/69917) ([Shankar](https://github.com/shiyer7474)). + +#### Experimental feature +* Refreshable materialized views are production ready. [#70550](https://github.com/ClickHouse/ClickHouse/pull/70550) ([Michael Kolupaev](https://github.com/al13n321)). Refreshable materialized views are now supported in Replicated databases. [#60669](https://github.com/ClickHouse/ClickHouse/pull/60669) ([Michael Kolupaev](https://github.com/al13n321)). +* Parallel replicas are moved from experimental to beta. Reworked settings that control the behavior of parallel replicas algorithms. 
A quick recap: ClickHouse has four different algorithms for parallel reading involving multiple replicas, which is reflected in the setting `parallel_replicas_mode`; the default value is `read_tasks`. Additionally, the toggle-switch setting `enable_parallel_replicas` has been added. [#63151](https://github.com/ClickHouse/ClickHouse/pull/63151) ([Alexey Milovidov](https://github.com/alexey-milovidov)), ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)). +* Support for the `Dynamic` type in most functions by executing them on the internal types inside `Dynamic`. [#69691](https://github.com/ClickHouse/ClickHouse/pull/69691) ([Pavel Kruglov](https://github.com/Avogar)). +* Allow to read/write the `JSON` type as a binary string in the `RowBinary` format under the settings `input_format_binary_read_json_as_string/output_format_binary_write_json_as_string`. [#70288](https://github.com/ClickHouse/ClickHouse/pull/70288) ([Pavel Kruglov](https://github.com/Avogar)). +* Allow to serialize/deserialize a `JSON` column as a single String column in the Native format. For output, use the setting `output_format_native_write_json_as_string`. For input, use serialization version `1` before the column data. [#70312](https://github.com/ClickHouse/ClickHouse/pull/70312) ([Pavel Kruglov](https://github.com/Avogar)). +* Introduced a special (experimental) mode of the merge selector for MergeTree tables which makes it more aggressive for partitions that are close to the limit on the number of parts. It is controlled by the `merge_selector_use_blurry_base` MergeTree-level setting. [#70645](https://github.com/ClickHouse/ClickHouse/pull/70645) ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)). +* Implement generic ser/de between Avro's `Union` and ClickHouse's `Variant` types. Resolves [#69713](https://github.com/ClickHouse/ClickHouse/issues/69713). [#69712](https://github.com/ClickHouse/ClickHouse/pull/69712) ([Jiří Kozlovský](https://github.com/jirislav)). + +#### Performance Improvement +* Refactor `IDisk` and `IObjectStorage` for better performance. Tables from `plain` and `plain_rewritable` object storages will initialize faster. [#68146](https://github.com/ClickHouse/ClickHouse/pull/68146) ([Alexey Milovidov](https://github.com/alexey-milovidov), [Julia Kartseva](https://github.com/jkartseva)). Do not call the LIST object storage API when determining if a file or directory exists on the plain rewritable disk, as it can be cost-inefficient. [#70852](https://github.com/ClickHouse/ClickHouse/pull/70852) ([Julia Kartseva](https://github.com/jkartseva)). Reduce the number of object storage HEAD API requests in the plain_rewritable disk. [#70915](https://github.com/ClickHouse/ClickHouse/pull/70915) ([Julia Kartseva](https://github.com/jkartseva)). +* Added the ability to parse data directly into sparse columns. [#69828](https://github.com/ClickHouse/ClickHouse/pull/69828) ([Anton Popov](https://github.com/CurtizJ)). +* Improved performance of parsing formats with a high number of missed values (e.g. `JSONEachRow`). [#69875](https://github.com/ClickHouse/ClickHouse/pull/69875) ([Anton Popov](https://github.com/CurtizJ)). +* Support parallel reading of Parquet row groups and prefetching of row groups in single-threaded mode. [#69862](https://github.com/ClickHouse/ClickHouse/pull/69862) ([LiuNeng](https://github.com/liuneng1994)). +* Support the minmax index for `pointInPolygon`. [#62085](https://github.com/ClickHouse/ClickHouse/pull/62085) ([JackyWoo](https://github.com/JackyWoo)). 
+* Use bloom filters when reading Parquet files. [#62966](https://github.com/ClickHouse/ClickHouse/pull/62966) ([Arthur Passos](https://github.com/arthurpassos)). +* Lock-free parts rename to avoid INSERT affecting SELECT (due to the parts lock). Under normal circumstances with `fsync_part_directory`, the QPS of SELECT with INSERT in parallel increased 2x; under heavy load the effect is even bigger. Note: this only includes `ReplicatedMergeTree` for now. [#64955](https://github.com/ClickHouse/ClickHouse/pull/64955) ([Azat Khuzhin](https://github.com/azat)). +* Respect `ttl_only_drop_parts` on `materialize ttl`; only read the necessary columns to recalculate TTL and drop parts by replacing them with an empty one. [#65488](https://github.com/ClickHouse/ClickHouse/pull/65488) ([Andrey Zvonov](https://github.com/zvonand)). +* Optimized thread creation in the ThreadPool to minimize lock contention. Thread creation is now performed outside of the critical section to avoid delays in job scheduling and thread management under high load conditions. This leads to a much more responsive ClickHouse under heavy concurrent load. [#68694](https://github.com/ClickHouse/ClickHouse/pull/68694) ([filimonov](https://github.com/filimonov)). +* Enable reading `LowCardinality` string columns from `ORC`. [#69481](https://github.com/ClickHouse/ClickHouse/pull/69481) ([李扬](https://github.com/taiyang-li)). +* Use `LowCardinality` for `ProfileEvents` in system logs such as `part_log`, `query_views_log`, `filesystem_cache_log`. [#70152](https://github.com/ClickHouse/ClickHouse/pull/70152) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Improve performance of the `fromUnixTimestamp`/`toUnixTimestamp` functions. [#71042](https://github.com/ClickHouse/ClickHouse/pull/71042) ([kevinyhzou](https://github.com/KevinyhZou)). +* Don't disable nonblocking read from the page cache for the entire server when reading from blocking I/O. This led to poorer performance when a single filesystem (e.g., tmpfs) didn't support the `preadv2` syscall while others did. [#70299](https://github.com/ClickHouse/ClickHouse/pull/70299) ([Antonio Andelic](https://github.com/antonio2368)). +* `ALTER TABLE .. REPLACE PARTITION` doesn't wait anymore for mutations/merges that happen in other partitions. [#59138](https://github.com/ClickHouse/ClickHouse/pull/59138) ([Vasily Nemkov](https://github.com/Enmk)). +* Don't do validation when synchronizing ACLs from Keeper. It's validated during creation. It shouldn't matter that much, but there are installations with tens of thousands of users or even more, and the unnecessary hash validation can take a long time to finish during server startup (it synchronizes everything from Keeper). [#70644](https://github.com/ClickHouse/ClickHouse/pull/70644) ([Raúl Marín](https://github.com/Algunenano)). + +#### Improvement +* `CREATE TABLE AS` will copy `PRIMARY KEY`, `ORDER BY`, and similar clauses (of `MergeTree` tables). [#69739](https://github.com/ClickHouse/ClickHouse/pull/69739) ([sakulali](https://github.com/sakulali)). +* Support 64-bit XID in Keeper. It can be enabled with the `use_xid_64` configuration value. [#69908](https://github.com/ClickHouse/ClickHouse/pull/69908) ([Antonio Andelic](https://github.com/antonio2368)). +* Command-line arguments for Bool settings are set to true when no value is provided for the argument (e.g. `clickhouse-client --optimize_aggregation_in_order --query "SELECT 1"`). 
[#70459](https://github.com/ClickHouse/ClickHouse/pull/70459) ([davidtsuk](https://github.com/davidtsuk)). +* Added user-level settings `min_free_disk_bytes_to_throw_insert` and `min_free_disk_ratio_to_throw_insert` to prevent insertions on disks that are almost full. [#69755](https://github.com/ClickHouse/ClickHouse/pull/69755) ([Marco Vilas Boas](https://github.com/marco-vb)). +* Embedded documentation for settings will be strictly more detailed and complete than the documentation on the website. This is the first step before making the website documentation always auto-generated from the source code. This has long-standing implications: - it will be guaranteed to cover every setting; - there is no chance of obsolete default values; - we can generate this documentation for each ClickHouse version; - the documentation can be displayed by the server itself even without Internet access. Generate the docs on the website from the source code. [#70289](https://github.com/ClickHouse/ClickHouse/pull/70289) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Allow an empty needle in the function `replace`, matching the behavior of PostgreSQL. [#69918](https://github.com/ClickHouse/ClickHouse/pull/69918) ([zhanglistar](https://github.com/zhanglistar)). +* Allow an empty needle in the functions `replaceRegexp*`. [#70053](https://github.com/ClickHouse/ClickHouse/pull/70053) ([zhanglistar](https://github.com/zhanglistar)). +* Symbolic links for tables in the `data/database_name/` directory are created for the actual paths to the table's data, depending on the storage policy, instead of the `store/...` directory on the default disk. [#61777](https://github.com/ClickHouse/ClickHouse/pull/61777) ([Kirill](https://github.com/kirillgarbar)). +* While parsing an `Enum` field from `JSON`, a string containing an integer will be interpreted as the corresponding `Enum` element. This closes [#65119](https://github.com/ClickHouse/ClickHouse/issues/65119). [#66801](https://github.com/ClickHouse/ClickHouse/pull/66801) ([scanhex12](https://github.com/scanhex12)). +* Allow `TRIM`-ing a `LEADING` or `TRAILING` empty string as a no-op. Closes [#67792](https://github.com/ClickHouse/ClickHouse/issues/67792). [#68455](https://github.com/ClickHouse/ClickHouse/pull/68455) ([Peter Nguyen](https://github.com/petern48)). +* Improve compatibility of `cast(timestamp as String)` with Spark. [#69179](https://github.com/ClickHouse/ClickHouse/pull/69179) ([Wenzheng Liu](https://github.com/lwz9103)). +* Always use the new analyzer to calculate constant expressions when `enable_analyzer` is set to `true`. Support calculation of `executable` table function arguments without using a `SELECT` query for constant expressions. [#69292](https://github.com/ClickHouse/ClickHouse/pull/69292) ([Dmitry Novik](https://github.com/novikd)). +* Add a setting `enable_secure_identifiers` to disallow identifiers with special characters. [#69411](https://github.com/ClickHouse/ClickHouse/pull/69411) ([tuanpach](https://github.com/tuanpach)). +* Add `show_create_query_identifier_quoting_rule` to define identifier quoting behavior in the `SHOW CREATE TABLE` query result. Possible values: - `user_display`: When the identifier is a keyword. - `when_necessary`: When the identifier is one of `{"distinct", "all", "table"}` and when it could lead to ambiguity: column names, dictionary attribute names. - `always`: Always quote identifiers. [#69448](https://github.com/ClickHouse/ClickHouse/pull/69448) ([tuanpach](https://github.com/tuanpach)). 
+* Improve restoring of access entities' dependencies. [#69563](https://github.com/ClickHouse/ClickHouse/pull/69563) ([Vitaly Baranov](https://github.com/vitlibar)). +* If you run `clickhouse-client` or another CLI application, and it starts up slowly due to an overloaded server, and you start typing your query, such as `SELECT`, previous versions would display the remainder of the terminal echo contents before printing the greeting message, such as `SELECTClickHouse local version 24.10.1.1.` instead of `ClickHouse local version 24.10.1.1.`. Now it is fixed. This closes [#31696](https://github.com/ClickHouse/ClickHouse/issues/31696). [#69856](https://github.com/ClickHouse/ClickHouse/pull/69856) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Add a new column `readonly_duration` to the `system.replicas` table. Needed to be able to distinguish actual readonly replicas from sentinel ones in alerts. [#69871](https://github.com/ClickHouse/ClickHouse/pull/69871) ([Miсhael Stetsyuk](https://github.com/mstetsyuk)). +* Change the type of the `join_output_by_rowlist_perkey_rows_threshold` setting to unsigned integer. [#69886](https://github.com/ClickHouse/ClickHouse/pull/69886) ([kevinyhzou](https://github.com/KevinyhZou)). +* Enhance OpenTelemetry span logging to include query settings. [#70011](https://github.com/ClickHouse/ClickHouse/pull/70011) ([sharathks118](https://github.com/sharathks118)). +* Add diagnostic info about higher-order array functions if the lambda result type is unexpected. [#70093](https://github.com/ClickHouse/ClickHouse/pull/70093) ([ttanay](https://github.com/ttanay)). +* Keeper improvement: less locking during cluster changes. [#70275](https://github.com/ClickHouse/ClickHouse/pull/70275) ([Antonio Andelic](https://github.com/antonio2368)). +* Add `WITH IMPLICIT` and `FINAL` keywords to the `SHOW GRANTS` command. Fix a minor bug with implicit grants: [#70094](https://github.com/ClickHouse/ClickHouse/issues/70094). [#70293](https://github.com/ClickHouse/ClickHouse/pull/70293) ([pufit](https://github.com/pufit)). +* Respect `compatibility` for MergeTree settings. The `compatibility` value is taken from the `default` profile on server startup, and default MergeTree settings are changed accordingly. Further changes of the `compatibility` setting do not affect MergeTree settings. [#70322](https://github.com/ClickHouse/ClickHouse/pull/70322) ([Nikolai Kochetov](https://github.com/KochetovNicolai)). +* Avoid spamming the logs with large HTTP response bodies in case of errors during inter-server communication. [#70487](https://github.com/ClickHouse/ClickHouse/pull/70487) ([Vladimir Cherkasov](https://github.com/vdimir)). +* Added a new setting `max_parts_to_move` to control the maximum number of parts that can be moved at once. [#70520](https://github.com/ClickHouse/ClickHouse/pull/70520) ([Vladimir Cherkasov](https://github.com/vdimir)). +* Limit the frequency of certain log messages. [#70601](https://github.com/ClickHouse/ClickHouse/pull/70601) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* `CHECK TABLE` with the `PART` qualifier was incorrectly formatted in the client. [#70660](https://github.com/ClickHouse/ClickHouse/pull/70660) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Support writing the column index and the offset index using the Parquet native writer. [#70669](https://github.com/ClickHouse/ClickHouse/pull/70669) ([LiuNeng](https://github.com/liuneng1994)). 
+* Support parsing `DateTime64` for microseconds and timezones in Joda syntax ("Joda" is a popular Java library for date and time, and the "Joda syntax" is that library's style). [#70737](https://github.com/ClickHouse/ClickHouse/pull/70737) ([kevinyhzou](https://github.com/KevinyhZou)). +* Changed the approach used to figure out whether a cloud storage supports [batch delete](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html). [#70786](https://github.com/ClickHouse/ClickHouse/pull/70786) ([Vitaly Baranov](https://github.com/vitlibar)). +* Support for Parquet page v2 in the native reader. [#70807](https://github.com/ClickHouse/ClickHouse/pull/70807) ([Arthur Passos](https://github.com/arthurpassos)). +* Added a check for whether a table has both `storage_policy` and `disk` set, and a check that a new storage policy is compatible with the old one when the `disk` setting is used. [#70839](https://github.com/ClickHouse/ClickHouse/pull/70839) ([Kirill](https://github.com/kirillgarbar)). +* Add `system.s3_queue_settings` and `system.azure_queue_settings`. [#70841](https://github.com/ClickHouse/ClickHouse/pull/70841) ([Kseniia Sumarokova](https://github.com/kssenii)). +* Functions `base58Encode` and `base58Decode` now accept arguments of type `FixedString`. Example: `SELECT base58Encode(toFixedString('plaintext', 9));`. [#70846](https://github.com/ClickHouse/ClickHouse/pull/70846) ([Faizan Patel](https://github.com/faizan2786)). +* Add the `partition` column to every entry type of the part log. Previously, it was set only for some entries. This closes [#70819](https://github.com/ClickHouse/ClickHouse/issues/70819). [#70848](https://github.com/ClickHouse/ClickHouse/pull/70848) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Add `MergeStart` and `MutateStart` events into `system.part_log`, which helps with merge analysis and visualization. [#70850](https://github.com/ClickHouse/ClickHouse/pull/70850) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Add a profile event about the number of merged source parts. It allows monitoring the fanout of the merge tree in production. [#70908](https://github.com/ClickHouse/ClickHouse/pull/70908) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Background downloads to the filesystem cache were re-enabled. [#70929](https://github.com/ClickHouse/ClickHouse/pull/70929) ([Nikita Taranov](https://github.com/nickitat)). +* Add a new merge selector algorithm, named `Trivial`, for professional usage only. It is worse than the `Simple` merge selector. [#70969](https://github.com/ClickHouse/ClickHouse/pull/70969) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Support for atomic `CREATE OR REPLACE VIEW`. [#70536](https://github.com/ClickHouse/ClickHouse/pull/70536) ([tuanpach](https://github.com/tuanpach)). +* Added `strict_once` mode to the aggregate function `windowFunnel` to avoid counting one event several times in case it matches multiple conditions, closes [#21835](https://github.com/ClickHouse/ClickHouse/issues/21835). [#69738](https://github.com/ClickHouse/ClickHouse/pull/69738) ([Vladimir Cherkasov](https://github.com/vdimir)). + +#### Bug Fix (user-visible misbehavior in an official stable release) +* Apply configuration updates in the global context object. It fixes issues like [#62308](https://github.com/ClickHouse/ClickHouse/issues/62308). [#62944](https://github.com/ClickHouse/ClickHouse/pull/62944) ([Amos Bird](https://github.com/amosbird)). 
+* Fix `ReadSettings` not using user-set values (only the defaults were used). [#65625](https://github.com/ClickHouse/ClickHouse/pull/65625) ([Kseniia Sumarokova](https://github.com/kssenii)). +* Fix a type mismatch issue in `sumMapFiltered` when using signed arguments. [#58408](https://github.com/ClickHouse/ClickHouse/pull/58408) ([Chen768959](https://github.com/Chen768959)). +* Fix the monotonicity of `toHour`-like conversion functions when an optional time zone argument is passed. [#60264](https://github.com/ClickHouse/ClickHouse/pull/60264) ([Amos Bird](https://github.com/amosbird)). +* Relax the `supportsPrewhere` check for `Merge` tables. This fixes [#61064](https://github.com/ClickHouse/ClickHouse/issues/61064). It was hardened unnecessarily in [#60082](https://github.com/ClickHouse/ClickHouse/issues/60082). [#61091](https://github.com/ClickHouse/ClickHouse/pull/61091) ([Amos Bird](https://github.com/amosbird)). +* Fix `use_concurrency_control` setting handling for proper enforcement of the `concurrent_threads_soft_limit_num` limit. This enables concurrency control by default because previously it was broken. [#61473](https://github.com/ClickHouse/ClickHouse/pull/61473) ([Sergei Trifonov](https://github.com/serxa)). +* Fix incorrect `JOIN ON` section optimization in case of an `IS NULL` check under any other function (like `NOT`) that may lead to wrong results. Closes [#67915](https://github.com/ClickHouse/ClickHouse/issues/67915). [#68049](https://github.com/ClickHouse/ClickHouse/pull/68049) ([Vladimir Cherkasov](https://github.com/vdimir)). +* Prevent `ALTER` queries that would make the `CREATE` query of tables invalid. [#68574](https://github.com/ClickHouse/ClickHouse/pull/68574) ([János Benjamin Antal](https://github.com/antaljanosbenjamin)). +* Fix inconsistent AST formatting for `negate` (`-`) and `NOT` functions with tuples and arrays. [#68600](https://github.com/ClickHouse/ClickHouse/pull/68600) ([Vladimir Cherkasov](https://github.com/vdimir)). +* Fix insertion of an incomplete type into `Dynamic` during deserialization. It could lead to `Parameter out of bound` errors. [#69291](https://github.com/ClickHouse/ClickHouse/pull/69291) ([Pavel Kruglov](https://github.com/Avogar)). +* Zero-copy replication, which is experimental and should not be used in production: fix an infinite loop after `restore replica` in the replicated merge tree with zero copy. [#69293](https://github.com/ClickHouse/ClickHouse/pull/69293) ([MikhailBurdukov](https://github.com/MikhailBurdukov)). +* Return the default value of `processing_threads_num` back to the number of CPU cores in storage `S3Queue`. [#69384](https://github.com/ClickHouse/ClickHouse/pull/69384) ([Kseniia Sumarokova](https://github.com/kssenii)). +* Bypass the try/catch flow when de/serializing nested repeated protobuf to nested columns (fixes [#41971](https://github.com/ClickHouse/ClickHouse/issues/41971)). [#69556](https://github.com/ClickHouse/ClickHouse/pull/69556) ([Eliot Hautefeuille](https://github.com/hileef)). +* Fix a crash during insertion into a FixedString column in the PostgreSQL engine. [#69584](https://github.com/ClickHouse/ClickHouse/pull/69584) ([Pavel Kruglov](https://github.com/Avogar)). +* Fix a crash when executing `create view t as (with recursive 42 as ttt select ttt);`. [#69676](https://github.com/ClickHouse/ClickHouse/pull/69676) ([Han Fei](https://github.com/hanfei1991)). +* Fixed `maxMapState` throwing 'Bad get' if the value type is DateTime64. [#69787](https://github.com/ClickHouse/ClickHouse/pull/69787) ([Michael Kolupaev](https://github.com/al13n321)). 
+* Fix `getSubcolumn` with `LowCardinality` columns by overriding `useDefaultImplementationForLowCardinalityColumns` to return `true`. [#69831](https://github.com/ClickHouse/ClickHouse/pull/69831) ([Miсhael Stetsyuk](https://github.com/mstetsyuk)). +* Fix permanently blocked distributed sends if a DROP of a distributed table failed. [#69843](https://github.com/ClickHouse/ClickHouse/pull/69843) ([Azat Khuzhin](https://github.com/azat)). +* Fix non-cancellable queries containing WITH FILL with NaN keys. This closes [#69261](https://github.com/ClickHouse/ClickHouse/issues/69261). [#69845](https://github.com/ClickHouse/ClickHouse/pull/69845) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Fix the analyzer default with an old compatibility value. [#69895](https://github.com/ClickHouse/ClickHouse/pull/69895) ([Raúl Marín](https://github.com/Algunenano)). +* Don't check dependencies during CREATE OR REPLACE VIEW during DROP of the old table. Previously, the CREATE OR REPLACE query failed when there were dependent tables of the recreated view. [#69907](https://github.com/ClickHouse/ClickHouse/pull/69907) ([Pavel Kruglov](https://github.com/Avogar)). +* Something for Decimal. Fixes [#69730](https://github.com/ClickHouse/ClickHouse/issues/69730). [#69978](https://github.com/ClickHouse/ClickHouse/pull/69978) ([Arthur Passos](https://github.com/arthurpassos)). +* Now DEFINER/INVOKER will work with parameterized views. [#69984](https://github.com/ClickHouse/ClickHouse/pull/69984) ([pufit](https://github.com/pufit)). +* Fix parsing for view's definers. [#69985](https://github.com/ClickHouse/ClickHouse/pull/69985) ([pufit](https://github.com/pufit)). +* Fixed a bug where the timezone could change the result of a query with `Date` or `Date32` arguments. [#70036](https://github.com/ClickHouse/ClickHouse/pull/70036) ([Yarik Briukhovetskyi](https://github.com/yariks5s)). +* Fixes `Block structure mismatch` for queries with nested views and a `WHERE` condition. Fixes [#66209](https://github.com/ClickHouse/ClickHouse/issues/66209). [#70054](https://github.com/ClickHouse/ClickHouse/pull/70054) ([Nikolai Kochetov](https://github.com/KochetovNicolai)). +* Avoid reusing columns among different named tuples when evaluating `tuple` functions. This fixes [#70022](https://github.com/ClickHouse/ClickHouse/issues/70022). [#70103](https://github.com/ClickHouse/ClickHouse/pull/70103) ([Amos Bird](https://github.com/amosbird)). +* Fix a wrong LOGICAL_ERROR when replacing literals in ranges. [#70122](https://github.com/ClickHouse/ClickHouse/pull/70122) ([Pablo Marcos](https://github.com/pamarcos)). +* Check for the Nullable(Nothing) type during ALTER TABLE MODIFY COLUMN/QUERY to prevent tables with such a data type. [#70123](https://github.com/ClickHouse/ClickHouse/pull/70123) ([Pavel Kruglov](https://github.com/Avogar)). +* Proper error message for the illegal query `JOIN ... ON *`, closes [#68650](https://github.com/ClickHouse/ClickHouse/issues/68650). [#70124](https://github.com/ClickHouse/ClickHouse/pull/70124) ([Vladimir Cherkasov](https://github.com/vdimir)). +* Fix a wrong result with a skipping index. [#70127](https://github.com/ClickHouse/ClickHouse/pull/70127) ([Raúl Marín](https://github.com/Algunenano)). +* Fix a data race in the ColumnObject/ColumnTuple decompress method that could lead to heap use after free. [#70137](https://github.com/ClickHouse/ClickHouse/pull/70137) ([Pavel Kruglov](https://github.com/Avogar)). +* Fix a possible hang in ALTER COLUMN with the Dynamic type. 
[#70144](https://github.com/ClickHouse/ClickHouse/pull/70144) ([Pavel Kruglov](https://github.com/Avogar)). +* Now ClickHouse will consider more errors as retriable and will not mark data parts as broken in case of such errors. [#70145](https://github.com/ClickHouse/ClickHouse/pull/70145) ([alesapin](https://github.com/alesapin)). +* Use correct `max_types` parameter during Dynamic type creation for JSON subcolumn. [#70147](https://github.com/ClickHouse/ClickHouse/pull/70147) ([Pavel Kruglov](https://github.com/Avogar)). +* Fix the password being displayed in `system.query_log` for users with bcrypt password authentication method. [#70148](https://github.com/ClickHouse/ClickHouse/pull/70148) ([Nikolay Degterinsky](https://github.com/evillique)). +* Fix event counter for the native interface (InterfaceNativeSendBytes). [#70153](https://github.com/ClickHouse/ClickHouse/pull/70153) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)). +* Fix possible crash related to JSON columns. [#70172](https://github.com/ClickHouse/ClickHouse/pull/70172) ([Pavel Kruglov](https://github.com/Avogar)). +* Fix multiple issues with arrayMin and arrayMax. [#70207](https://github.com/ClickHouse/ClickHouse/pull/70207) ([Raúl Marín](https://github.com/Algunenano)). +* Respect setting allow_simdjson in the JSON type parser. [#70218](https://github.com/ClickHouse/ClickHouse/pull/70218) ([Pavel Kruglov](https://github.com/Avogar)). +* Fix a null pointer dereference on creating a materialized view with two selects and an `INTERSECT`, e.g. `CREATE MATERIALIZED VIEW v0 AS (SELECT 1) INTERSECT (SELECT 1);`. [#70264](https://github.com/ClickHouse/ClickHouse/pull/70264) ([Konstantin Bogdanov](https://github.com/thevar1able)). +* Don't modify global settings with startup scripts. Previously, changing a setting in a startup script would change it globally. [#70310](https://github.com/ClickHouse/ClickHouse/pull/70310) ([Antonio Andelic](https://github.com/antonio2368)). +* Fix ALTER of `Dynamic` type with reducing max_types parameter that could lead to server crash. [#70328](https://github.com/ClickHouse/ClickHouse/pull/70328) ([Pavel Kruglov](https://github.com/Avogar)). +* Fix crash when using WITH FILL incorrectly. [#70338](https://github.com/ClickHouse/ClickHouse/pull/70338) ([Raúl Marín](https://github.com/Algunenano)). +* Fix possible use-after-free in `SYSTEM DROP FORMAT SCHEMA CACHE FOR Protobuf`. [#70358](https://github.com/ClickHouse/ClickHouse/pull/70358) ([Azat Khuzhin](https://github.com/azat)). +* Fix crash during GROUP BY JSON sub-object subcolumn. [#70374](https://github.com/ClickHouse/ClickHouse/pull/70374) ([Pavel Kruglov](https://github.com/Avogar)). +* Don't prefetch parts for vertical merges if part has no rows. [#70452](https://github.com/ClickHouse/ClickHouse/pull/70452) ([Antonio Andelic](https://github.com/antonio2368)). +* Fix crash in WHERE with lambda functions. [#70464](https://github.com/ClickHouse/ClickHouse/pull/70464) ([Raúl Marín](https://github.com/Algunenano)). +* Fix table creation with `CREATE ... AS table_function(...)` with database `Replicated` and unavailable table function source on secondary replica. [#70511](https://github.com/ClickHouse/ClickHouse/pull/70511) ([Kseniia Sumarokova](https://github.com/kssenii)). +* Ignore all output on async insert with `wait_for_async_insert=1`. Closes [#62644](https://github.com/ClickHouse/ClickHouse/issues/62644). [#70530](https://github.com/ClickHouse/ClickHouse/pull/70530) ([Konstantin Bogdanov](https://github.com/thevar1able)). 
+* Ignore frozen_metadata.txt while traversing shadow directory from system.remote_data_paths. [#70590](https://github.com/ClickHouse/ClickHouse/pull/70590) ([Aleksei Filatov](https://github.com/aalexfvk)). +* Fix creation of stateful window functions on misaligned memory. [#70631](https://github.com/ClickHouse/ClickHouse/pull/70631) ([Raúl Marín](https://github.com/Algunenano)). +* Fixed rare crashes in `SELECT`-s and merges after adding a column of `Array` type with non-empty default expression. [#70695](https://github.com/ClickHouse/ClickHouse/pull/70695) ([Anton Popov](https://github.com/CurtizJ)). +* Insert into table function s3 will respect query settings. [#70696](https://github.com/ClickHouse/ClickHouse/pull/70696) ([Vladimir Cherkasov](https://github.com/vdimir)). +* Fix infinite recursion when inferring a protobuf schema when skipping unsupported fields is enabled. [#70697](https://github.com/ClickHouse/ClickHouse/pull/70697) ([Raúl Marín](https://github.com/Algunenano)). +* Disable enable_named_columns_in_function_tuple by default. [#70833](https://github.com/ClickHouse/ClickHouse/pull/70833) ([Raúl Marín](https://github.com/Algunenano)). +* Fix S3Queue table engine setting processing_threads_num not being effective in case it was deduced from the number of cpu cores on the server. [#70837](https://github.com/ClickHouse/ClickHouse/pull/70837) ([Kseniia Sumarokova](https://github.com/kssenii)). +* Normalize named tuple arguments in aggregation states. This fixes [#69732](https://github.com/ClickHouse/ClickHouse/issues/69732) . [#70853](https://github.com/ClickHouse/ClickHouse/pull/70853) ([Amos Bird](https://github.com/amosbird)). +* Fix a logical error due to negative zeros in the two-level hash table. This closes [#70973](https://github.com/ClickHouse/ClickHouse/issues/70973). [#70979](https://github.com/ClickHouse/ClickHouse/pull/70979) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Fix `limit by`, `limit with ties` for distributed and parallel replicas. [#70880](https://github.com/ClickHouse/ClickHouse/pull/70880) ([Nikita Taranov](https://github.com/nickitat)). + + ### ClickHouse release 24.9, 2024-09-26 #### Backward Incompatible Change diff --git a/README.md b/README.md index 3b5209dcbe9..dcaeda13acd 100644 --- a/README.md +++ b/README.md @@ -42,31 +42,19 @@ Keep an eye out for upcoming meetups and events around the world. 
Somewhere else Upcoming meetups -* [Jakarta Meetup](https://www.meetup.com/clickhouse-indonesia-user-group/events/303191359/) - October 1 -* [Singapore Meetup](https://www.meetup.com/clickhouse-singapore-meetup-group/events/303212064/) - October 3 -* [Madrid Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096564/) - October 22 -* [Oslo Meetup](https://www.meetup.com/open-source-real-time-data-warehouse-real-time-analytics/events/302938622) - October 31 * [Barcelona Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096876/) - November 12 * [Ghent Meetup](https://www.meetup.com/clickhouse-belgium-user-group/events/303049405/) - November 19 * [Dubai Meetup](https://www.meetup.com/clickhouse-dubai-meetup-group/events/303096989/) - November 21 * [Paris Meetup](https://www.meetup.com/clickhouse-france-user-group/events/303096434) - November 26 +* [Amsterdam Meetup](https://www.meetup.com/clickhouse-netherlands-user-group/events/303638814) - December 3 +* [New York Meetup](https://www.meetup.com/clickhouse-new-york-user-group/events/304268174) - December 9 +* [San Francisco Meetup](https://www.meetup.com/clickhouse-silicon-valley-meetup-group/events/304286951/) - December 12 Recently completed meetups -* [ClickHouse Guangzhou User Group Meetup](https://mp.weixin.qq.com/s/GSvo-7xUoVzCsuUvlLTpCw) - August 25 -* [Seattle Meetup (Statsig)](https://www.meetup.com/clickhouse-seattle-user-group/events/302518075/) - August 27 -* [Melbourne Meetup](https://www.meetup.com/clickhouse-australia-user-group/events/302732666/) - August 27 -* [Sydney Meetup](https://www.meetup.com/clickhouse-australia-user-group/events/302862966/) - September 5 -* [Zurich Meetup](https://www.meetup.com/clickhouse-switzerland-meetup-group/events/302267429/) - September 5 -* [San Francisco Meetup (Cloudflare)](https://www.meetup.com/clickhouse-silicon-valley-meetup-group/events/302540575) - September 5 -* [Raleigh Meetup (Deutsche Bank)](https://www.meetup.com/triangletechtalks/events/302723486/) - September 9 -* [New York Meetup (Rokt)](https://www.meetup.com/clickhouse-new-york-user-group/events/302575342) - September 10 -* [Toronto Meetup (Shopify)](https://www.meetup.com/clickhouse-toronto-user-group/events/301490855/) - September 10 -* [Chicago Meetup (Jump Capital)](https://lu.ma/43tvmrfw) - September 12 -* [London Meetup](https://www.meetup.com/clickhouse-london-user-group/events/302977267) - September 17 -* [Austin Meetup](https://www.meetup.com/clickhouse-austin-user-group/events/302558689/) - September 17 -* [Bangalore Meetup](https://www.meetup.com/clickhouse-bangalore-user-group/events/303208274/) - September 18 -* [Tel Aviv Meetup](https://www.meetup.com/clickhouse-meetup-israel/events/303095121) - September 22 +* [Madrid Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096564/) - October 22 +* [Singapore Meetup](https://www.meetup.com/clickhouse-singapore-meetup-group/events/303212064/) - October 3 +* [Jakarta Meetup](https://www.meetup.com/clickhouse-indonesia-user-group/events/303191359/) - October 1 ## Recent Recordings * **Recent Meetup Videos**: [Meetup Playlist](https://www.youtube.com/playlist?list=PL0Z2YDlm0b3iNDUzpY1S3L_iV4nARda_U) Whenever possible recordings of the ClickHouse Community Meetups are edited and presented as individual talks. 
Current featuring "Modern SQL in 2023", "Fast, Concurrent, and Consistent Asynchronous INSERTS in ClickHouse", and "Full-Text Indices: Design and Experiments" diff --git a/base/base/chrono_io.h b/base/base/chrono_io.h index 4ee8dec6634..d55aa11bc1d 100644 --- a/base/base/chrono_io.h +++ b/base/base/chrono_io.h @@ -4,6 +4,7 @@ #include #include #include +#include inline std::string to_string(const std::time_t & time) @@ -11,18 +12,6 @@ inline std::string to_string(const std::time_t & time) return cctz::format("%Y-%m-%d %H:%M:%S", std::chrono::system_clock::from_time_t(time), cctz::local_time_zone()); } -template -std::string to_string(const std::chrono::time_point & tp) -{ - // Don't use DateLUT because it shows weird characters for - // TimePoint::max(). I wish we could use C++20 format, but it's not - // there yet. - // return DateLUT::instance().timeToString(std::chrono::system_clock::to_time_t(tp)); - - auto in_time_t = std::chrono::system_clock::to_time_t(tp); - return to_string(in_time_t); -} - template > std::string to_string(const std::chrono::duration & duration) { @@ -33,6 +22,20 @@ std::string to_string(const std::chrono::duration & duration) return std::to_string(seconds_as_double.count()) + "s"; } +template +std::string to_string(const std::chrono::time_point & tp) +{ + // Don't use DateLUT because it shows weird characters for + // TimePoint::max(). I wish we could use C++20 format, but it's not + // there yet. + // return DateLUT::instance().timeToString(std::chrono::system_clock::to_time_t(tp)); + + if constexpr (std::is_same_v) + return to_string(std::chrono::system_clock::to_time_t(tp)); + else + return to_string(tp.time_since_epoch()); +} + template std::ostream & operator<<(std::ostream & o, const std::chrono::time_point & tp) { @@ -44,3 +47,23 @@ std::ostream & operator<<(std::ostream & o, const std::chrono::duration +struct fmt::formatter> : fmt::formatter +{ + template + auto format(const std::chrono::time_point & tp, FormatCtx & ctx) const + { + return fmt::formatter::format(::to_string(tp), ctx); + } +}; + +template +struct fmt::formatter> : fmt::formatter +{ + template + auto format(const std::chrono::duration & duration, FormatCtx & ctx) const + { + return fmt::formatter::format(::to_string(duration), ctx); + } +}; diff --git a/base/glibc-compatibility/musl/getauxval.c b/base/glibc-compatibility/musl/getauxval.c index ec2cce1e4aa..cc0cdf25b03 100644 --- a/base/glibc-compatibility/musl/getauxval.c +++ b/base/glibc-compatibility/musl/getauxval.c @@ -25,9 +25,10 @@ // We don't have libc struct available here. // Compute aux vector manually (from /proc/self/auxv). // -// Right now there is only 51 AT_* constants, -// so 64 should be enough until this implementation will be replaced with musl. -static unsigned long __auxv_procfs[64]; +// Right now there are 51 AT_* constants. Custom kernels have been encountered +// making use of up to 71. 128 should be enough until this implementation is +// replaced with musl. 
+static unsigned long __auxv_procfs[128]; static unsigned long __auxv_secure = 0; // Common static unsigned long * __auxv_environ = NULL; diff --git a/contrib/SimSIMD b/contrib/SimSIMD index ff51434d90c..935fef2964b 160000 --- a/contrib/SimSIMD +++ b/contrib/SimSIMD @@ -1 +1 @@ -Subproject commit ff51434d90c66f916e94ff05b24530b127aa4cff +Subproject commit 935fef2964bc38e995c5f465b42259a35b8cf0d3 diff --git a/contrib/arrow b/contrib/arrow index 5cfccd8ea65..6e2574f5013 160000 --- a/contrib/arrow +++ b/contrib/arrow @@ -1 +1 @@ -Subproject commit 5cfccd8ea65f33d4517e7409815d761c7650b45d +Subproject commit 6e2574f5013a005c050c9a7787d341aef09d0063 diff --git a/contrib/arrow-cmake/CMakeLists.txt b/contrib/arrow-cmake/CMakeLists.txt index 96d1f4adda7..208d48df178 100644 --- a/contrib/arrow-cmake/CMakeLists.txt +++ b/contrib/arrow-cmake/CMakeLists.txt @@ -213,13 +213,19 @@ target_include_directories(_orc SYSTEM PRIVATE set(LIBRARY_DIR "${ClickHouse_SOURCE_DIR}/contrib/arrow/cpp/src/arrow") # arrow/cpp/src/arrow/CMakeLists.txt (ARROW_SRCS + ARROW_COMPUTE + ARROW_IPC) +# find . \( -iname \*.cc -o -iname \*.cpp -o -iname \*.c \) | sort | awk '{print "\"${LIBRARY_DIR}" substr($1,2) "\"" }' | grep -v 'test.cc' | grep -v 'json' | grep -v 'flight' \| +# grep -v 'csv' | grep -v 'acero' | grep -v 'dataset' | grep -v 'testing' | grep -v 'gpu' | grep -v 'engine' | grep -v 'filesystem' | grep -v 'benchmark.cc' set(ARROW_SRCS + "${LIBRARY_DIR}/adapters/orc/adapter.cc" + "${LIBRARY_DIR}/adapters/orc/options.cc" + "${LIBRARY_DIR}/adapters/orc/util.cc" "${LIBRARY_DIR}/array/array_base.cc" "${LIBRARY_DIR}/array/array_binary.cc" "${LIBRARY_DIR}/array/array_decimal.cc" "${LIBRARY_DIR}/array/array_dict.cc" "${LIBRARY_DIR}/array/array_nested.cc" "${LIBRARY_DIR}/array/array_primitive.cc" + "${LIBRARY_DIR}/array/array_run_end.cc" "${LIBRARY_DIR}/array/builder_adaptive.cc" "${LIBRARY_DIR}/array/builder_base.cc" "${LIBRARY_DIR}/array/builder_binary.cc" @@ -227,124 +233,26 @@ set(ARROW_SRCS "${LIBRARY_DIR}/array/builder_dict.cc" "${LIBRARY_DIR}/array/builder_nested.cc" "${LIBRARY_DIR}/array/builder_primitive.cc" - "${LIBRARY_DIR}/array/builder_union.cc" "${LIBRARY_DIR}/array/builder_run_end.cc" - "${LIBRARY_DIR}/array/array_run_end.cc" + "${LIBRARY_DIR}/array/builder_union.cc" "${LIBRARY_DIR}/array/concatenate.cc" "${LIBRARY_DIR}/array/data.cc" "${LIBRARY_DIR}/array/diff.cc" "${LIBRARY_DIR}/array/util.cc" "${LIBRARY_DIR}/array/validate.cc" - "${LIBRARY_DIR}/builder.cc" "${LIBRARY_DIR}/buffer.cc" - "${LIBRARY_DIR}/chunked_array.cc" - "${LIBRARY_DIR}/chunk_resolver.cc" - "${LIBRARY_DIR}/compare.cc" - "${LIBRARY_DIR}/config.cc" - "${LIBRARY_DIR}/datum.cc" - "${LIBRARY_DIR}/device.cc" - "${LIBRARY_DIR}/extension_type.cc" - "${LIBRARY_DIR}/memory_pool.cc" - "${LIBRARY_DIR}/pretty_print.cc" - "${LIBRARY_DIR}/record_batch.cc" - "${LIBRARY_DIR}/result.cc" - "${LIBRARY_DIR}/scalar.cc" - "${LIBRARY_DIR}/sparse_tensor.cc" - "${LIBRARY_DIR}/status.cc" - "${LIBRARY_DIR}/table.cc" - "${LIBRARY_DIR}/table_builder.cc" - "${LIBRARY_DIR}/tensor.cc" - "${LIBRARY_DIR}/tensor/coo_converter.cc" - "${LIBRARY_DIR}/tensor/csf_converter.cc" - "${LIBRARY_DIR}/tensor/csx_converter.cc" - "${LIBRARY_DIR}/type.cc" - "${LIBRARY_DIR}/visitor.cc" + "${LIBRARY_DIR}/builder.cc" "${LIBRARY_DIR}/c/bridge.cc" - "${LIBRARY_DIR}/io/buffered.cc" - "${LIBRARY_DIR}/io/caching.cc" - "${LIBRARY_DIR}/io/compressed.cc" - "${LIBRARY_DIR}/io/file.cc" - "${LIBRARY_DIR}/io/hdfs.cc" - "${LIBRARY_DIR}/io/hdfs_internal.cc" - "${LIBRARY_DIR}/io/interfaces.cc" - 
"${LIBRARY_DIR}/io/memory.cc" - "${LIBRARY_DIR}/io/slow.cc" - "${LIBRARY_DIR}/io/stdio.cc" - "${LIBRARY_DIR}/io/transform.cc" - "${LIBRARY_DIR}/util/async_util.cc" - "${LIBRARY_DIR}/util/basic_decimal.cc" - "${LIBRARY_DIR}/util/bit_block_counter.cc" - "${LIBRARY_DIR}/util/bit_run_reader.cc" - "${LIBRARY_DIR}/util/bit_util.cc" - "${LIBRARY_DIR}/util/bitmap.cc" - "${LIBRARY_DIR}/util/bitmap_builders.cc" - "${LIBRARY_DIR}/util/bitmap_ops.cc" - "${LIBRARY_DIR}/util/bpacking.cc" - "${LIBRARY_DIR}/util/cancel.cc" - "${LIBRARY_DIR}/util/compression.cc" - "${LIBRARY_DIR}/util/counting_semaphore.cc" - "${LIBRARY_DIR}/util/cpu_info.cc" - "${LIBRARY_DIR}/util/decimal.cc" - "${LIBRARY_DIR}/util/delimiting.cc" - "${LIBRARY_DIR}/util/formatting.cc" - "${LIBRARY_DIR}/util/future.cc" - "${LIBRARY_DIR}/util/int_util.cc" - "${LIBRARY_DIR}/util/io_util.cc" - "${LIBRARY_DIR}/util/logging.cc" - "${LIBRARY_DIR}/util/key_value_metadata.cc" - "${LIBRARY_DIR}/util/memory.cc" - "${LIBRARY_DIR}/util/mutex.cc" - "${LIBRARY_DIR}/util/string.cc" - "${LIBRARY_DIR}/util/string_builder.cc" - "${LIBRARY_DIR}/util/task_group.cc" - "${LIBRARY_DIR}/util/tdigest.cc" - "${LIBRARY_DIR}/util/thread_pool.cc" - "${LIBRARY_DIR}/util/time.cc" - "${LIBRARY_DIR}/util/trie.cc" - "${LIBRARY_DIR}/util/unreachable.cc" - "${LIBRARY_DIR}/util/uri.cc" - "${LIBRARY_DIR}/util/utf8.cc" - "${LIBRARY_DIR}/util/value_parsing.cc" - "${LIBRARY_DIR}/util/byte_size.cc" - "${LIBRARY_DIR}/util/debug.cc" - "${LIBRARY_DIR}/util/tracing.cc" - "${LIBRARY_DIR}/util/atfork_internal.cc" - "${LIBRARY_DIR}/util/crc32.cc" - "${LIBRARY_DIR}/util/hashing.cc" - "${LIBRARY_DIR}/util/ree_util.cc" - "${LIBRARY_DIR}/util/union_util.cc" - "${LIBRARY_DIR}/vendored/base64.cpp" - "${LIBRARY_DIR}/vendored/datetime/tz.cpp" - "${LIBRARY_DIR}/vendored/musl/strptime.c" - "${LIBRARY_DIR}/vendored/uriparser/UriCommon.c" - "${LIBRARY_DIR}/vendored/uriparser/UriCompare.c" - "${LIBRARY_DIR}/vendored/uriparser/UriEscape.c" - "${LIBRARY_DIR}/vendored/uriparser/UriFile.c" - "${LIBRARY_DIR}/vendored/uriparser/UriIp4Base.c" - "${LIBRARY_DIR}/vendored/uriparser/UriIp4.c" - "${LIBRARY_DIR}/vendored/uriparser/UriMemory.c" - "${LIBRARY_DIR}/vendored/uriparser/UriNormalizeBase.c" - "${LIBRARY_DIR}/vendored/uriparser/UriNormalize.c" - "${LIBRARY_DIR}/vendored/uriparser/UriParseBase.c" - "${LIBRARY_DIR}/vendored/uriparser/UriParse.c" - "${LIBRARY_DIR}/vendored/uriparser/UriQuery.c" - "${LIBRARY_DIR}/vendored/uriparser/UriRecompose.c" - "${LIBRARY_DIR}/vendored/uriparser/UriResolve.c" - "${LIBRARY_DIR}/vendored/uriparser/UriShorten.c" - "${LIBRARY_DIR}/vendored/double-conversion/bignum.cc" - "${LIBRARY_DIR}/vendored/double-conversion/bignum-dtoa.cc" - "${LIBRARY_DIR}/vendored/double-conversion/cached-powers.cc" - "${LIBRARY_DIR}/vendored/double-conversion/double-to-string.cc" - "${LIBRARY_DIR}/vendored/double-conversion/fast-dtoa.cc" - "${LIBRARY_DIR}/vendored/double-conversion/fixed-dtoa.cc" - "${LIBRARY_DIR}/vendored/double-conversion/string-to-double.cc" - "${LIBRARY_DIR}/vendored/double-conversion/strtod.cc" - + "${LIBRARY_DIR}/c/dlpack.cc" + "${LIBRARY_DIR}/chunk_resolver.cc" + "${LIBRARY_DIR}/chunked_array.cc" + "${LIBRARY_DIR}/compare.cc" "${LIBRARY_DIR}/compute/api_aggregate.cc" "${LIBRARY_DIR}/compute/api_scalar.cc" "${LIBRARY_DIR}/compute/api_vector.cc" "${LIBRARY_DIR}/compute/cast.cc" "${LIBRARY_DIR}/compute/exec.cc" + "${LIBRARY_DIR}/compute/expression.cc" "${LIBRARY_DIR}/compute/function.cc" "${LIBRARY_DIR}/compute/function_internal.cc" "${LIBRARY_DIR}/compute/kernel.cc" @@ -355,6 
+263,7 @@ set(ARROW_SRCS "${LIBRARY_DIR}/compute/kernels/aggregate_var_std.cc" "${LIBRARY_DIR}/compute/kernels/codegen_internal.cc" "${LIBRARY_DIR}/compute/kernels/hash_aggregate.cc" + "${LIBRARY_DIR}/compute/kernels/ree_util_internal.cc" "${LIBRARY_DIR}/compute/kernels/row_encoder.cc" "${LIBRARY_DIR}/compute/kernels/scalar_arithmetic.cc" "${LIBRARY_DIR}/compute/kernels/scalar_boolean.cc" @@ -382,30 +291,139 @@ set(ARROW_SRCS "${LIBRARY_DIR}/compute/kernels/vector_cumulative_ops.cc" "${LIBRARY_DIR}/compute/kernels/vector_hash.cc" "${LIBRARY_DIR}/compute/kernels/vector_nested.cc" + "${LIBRARY_DIR}/compute/kernels/vector_pairwise.cc" "${LIBRARY_DIR}/compute/kernels/vector_rank.cc" "${LIBRARY_DIR}/compute/kernels/vector_replace.cc" + "${LIBRARY_DIR}/compute/kernels/vector_run_end_encode.cc" "${LIBRARY_DIR}/compute/kernels/vector_select_k.cc" "${LIBRARY_DIR}/compute/kernels/vector_selection.cc" - "${LIBRARY_DIR}/compute/kernels/vector_sort.cc" - "${LIBRARY_DIR}/compute/kernels/vector_selection_internal.cc" "${LIBRARY_DIR}/compute/kernels/vector_selection_filter_internal.cc" + "${LIBRARY_DIR}/compute/kernels/vector_selection_internal.cc" "${LIBRARY_DIR}/compute/kernels/vector_selection_take_internal.cc" - "${LIBRARY_DIR}/compute/light_array.cc" - "${LIBRARY_DIR}/compute/registry.cc" - "${LIBRARY_DIR}/compute/expression.cc" + "${LIBRARY_DIR}/compute/kernels/vector_sort.cc" + "${LIBRARY_DIR}/compute/key_hash_internal.cc" + "${LIBRARY_DIR}/compute/key_map_internal.cc" + "${LIBRARY_DIR}/compute/light_array_internal.cc" "${LIBRARY_DIR}/compute/ordering.cc" + "${LIBRARY_DIR}/compute/registry.cc" "${LIBRARY_DIR}/compute/row/compare_internal.cc" "${LIBRARY_DIR}/compute/row/encode_internal.cc" "${LIBRARY_DIR}/compute/row/grouper.cc" "${LIBRARY_DIR}/compute/row/row_internal.cc" - + "${LIBRARY_DIR}/compute/util.cc" + "${LIBRARY_DIR}/config.cc" + "${LIBRARY_DIR}/datum.cc" + "${LIBRARY_DIR}/device.cc" + "${LIBRARY_DIR}/extension_type.cc" + "${LIBRARY_DIR}/integration/c_data_integration_internal.cc" + "${LIBRARY_DIR}/io/buffered.cc" + "${LIBRARY_DIR}/io/caching.cc" + "${LIBRARY_DIR}/io/compressed.cc" + "${LIBRARY_DIR}/io/file.cc" + "${LIBRARY_DIR}/io/hdfs.cc" + "${LIBRARY_DIR}/io/hdfs_internal.cc" + "${LIBRARY_DIR}/io/interfaces.cc" + "${LIBRARY_DIR}/io/memory.cc" + "${LIBRARY_DIR}/io/slow.cc" + "${LIBRARY_DIR}/io/stdio.cc" + "${LIBRARY_DIR}/io/transform.cc" "${LIBRARY_DIR}/ipc/dictionary.cc" "${LIBRARY_DIR}/ipc/feather.cc" + "${LIBRARY_DIR}/ipc/file_to_stream.cc" "${LIBRARY_DIR}/ipc/message.cc" "${LIBRARY_DIR}/ipc/metadata_internal.cc" "${LIBRARY_DIR}/ipc/options.cc" "${LIBRARY_DIR}/ipc/reader.cc" + "${LIBRARY_DIR}/ipc/stream_to_file.cc" "${LIBRARY_DIR}/ipc/writer.cc" + "${LIBRARY_DIR}/memory_pool.cc" + "${LIBRARY_DIR}/pretty_print.cc" + "${LIBRARY_DIR}/record_batch.cc" + "${LIBRARY_DIR}/result.cc" + "${LIBRARY_DIR}/scalar.cc" + "${LIBRARY_DIR}/sparse_tensor.cc" + "${LIBRARY_DIR}/status.cc" + "${LIBRARY_DIR}/table.cc" + "${LIBRARY_DIR}/table_builder.cc" + "${LIBRARY_DIR}/tensor.cc" + "${LIBRARY_DIR}/tensor/coo_converter.cc" + "${LIBRARY_DIR}/tensor/csf_converter.cc" + "${LIBRARY_DIR}/tensor/csx_converter.cc" + "${LIBRARY_DIR}/type.cc" + "${LIBRARY_DIR}/type_traits.cc" + "${LIBRARY_DIR}/util/align_util.cc" + "${LIBRARY_DIR}/util/async_util.cc" + "${LIBRARY_DIR}/util/atfork_internal.cc" + "${LIBRARY_DIR}/util/basic_decimal.cc" + "${LIBRARY_DIR}/util/bit_block_counter.cc" + "${LIBRARY_DIR}/util/bit_run_reader.cc" + "${LIBRARY_DIR}/util/bit_util.cc" + "${LIBRARY_DIR}/util/bitmap.cc" + 
"${LIBRARY_DIR}/util/bitmap_builders.cc" + "${LIBRARY_DIR}/util/bitmap_ops.cc" + "${LIBRARY_DIR}/util/bpacking.cc" + "${LIBRARY_DIR}/util/byte_size.cc" + "${LIBRARY_DIR}/util/cancel.cc" + "${LIBRARY_DIR}/util/compression.cc" + "${LIBRARY_DIR}/util/counting_semaphore.cc" + "${LIBRARY_DIR}/util/cpu_info.cc" + "${LIBRARY_DIR}/util/crc32.cc" + "${LIBRARY_DIR}/util/debug.cc" + "${LIBRARY_DIR}/util/decimal.cc" + "${LIBRARY_DIR}/util/delimiting.cc" + "${LIBRARY_DIR}/util/dict_util.cc" + "${LIBRARY_DIR}/util/float16.cc" + "${LIBRARY_DIR}/util/formatting.cc" + "${LIBRARY_DIR}/util/future.cc" + "${LIBRARY_DIR}/util/hashing.cc" + "${LIBRARY_DIR}/util/int_util.cc" + "${LIBRARY_DIR}/util/io_util.cc" + "${LIBRARY_DIR}/util/key_value_metadata.cc" + "${LIBRARY_DIR}/util/list_util.cc" + "${LIBRARY_DIR}/util/logging.cc" + "${LIBRARY_DIR}/util/memory.cc" + "${LIBRARY_DIR}/util/mutex.cc" + "${LIBRARY_DIR}/util/ree_util.cc" + "${LIBRARY_DIR}/util/string.cc" + "${LIBRARY_DIR}/util/string_builder.cc" + "${LIBRARY_DIR}/util/task_group.cc" + "${LIBRARY_DIR}/util/tdigest.cc" + "${LIBRARY_DIR}/util/thread_pool.cc" + "${LIBRARY_DIR}/util/time.cc" + "${LIBRARY_DIR}/util/tracing.cc" + "${LIBRARY_DIR}/util/trie.cc" + "${LIBRARY_DIR}/util/union_util.cc" + "${LIBRARY_DIR}/util/unreachable.cc" + "${LIBRARY_DIR}/util/uri.cc" + "${LIBRARY_DIR}/util/utf8.cc" + "${LIBRARY_DIR}/util/value_parsing.cc" + "${LIBRARY_DIR}/vendored/base64.cpp" + "${LIBRARY_DIR}/vendored/datetime/tz.cpp" + "${LIBRARY_DIR}/vendored/double-conversion/bignum-dtoa.cc" + "${LIBRARY_DIR}/vendored/double-conversion/bignum.cc" + "${LIBRARY_DIR}/vendored/double-conversion/cached-powers.cc" + "${LIBRARY_DIR}/vendored/double-conversion/double-to-string.cc" + "${LIBRARY_DIR}/vendored/double-conversion/fast-dtoa.cc" + "${LIBRARY_DIR}/vendored/double-conversion/fixed-dtoa.cc" + "${LIBRARY_DIR}/vendored/double-conversion/string-to-double.cc" + "${LIBRARY_DIR}/vendored/double-conversion/strtod.cc" + "${LIBRARY_DIR}/vendored/musl/strptime.c" + "${LIBRARY_DIR}/vendored/uriparser/UriCommon.c" + "${LIBRARY_DIR}/vendored/uriparser/UriCompare.c" + "${LIBRARY_DIR}/vendored/uriparser/UriEscape.c" + "${LIBRARY_DIR}/vendored/uriparser/UriFile.c" + "${LIBRARY_DIR}/vendored/uriparser/UriIp4.c" + "${LIBRARY_DIR}/vendored/uriparser/UriIp4Base.c" + "${LIBRARY_DIR}/vendored/uriparser/UriMemory.c" + "${LIBRARY_DIR}/vendored/uriparser/UriNormalize.c" + "${LIBRARY_DIR}/vendored/uriparser/UriNormalizeBase.c" + "${LIBRARY_DIR}/vendored/uriparser/UriParse.c" + "${LIBRARY_DIR}/vendored/uriparser/UriParseBase.c" + "${LIBRARY_DIR}/vendored/uriparser/UriQuery.c" + "${LIBRARY_DIR}/vendored/uriparser/UriRecompose.c" + "${LIBRARY_DIR}/vendored/uriparser/UriResolve.c" + "${LIBRARY_DIR}/vendored/uriparser/UriShorten.c" + "${LIBRARY_DIR}/visitor.cc" "${ARROW_SRC_DIR}/arrow/adapters/orc/adapter.cc" "${ARROW_SRC_DIR}/arrow/adapters/orc/util.cc" @@ -465,22 +483,38 @@ set(PARQUET_SRCS "${LIBRARY_DIR}/arrow/schema.cc" "${LIBRARY_DIR}/arrow/schema_internal.cc" "${LIBRARY_DIR}/arrow/writer.cc" + "${LIBRARY_DIR}/benchmark_util.cc" "${LIBRARY_DIR}/bloom_filter.cc" + "${LIBRARY_DIR}/bloom_filter_reader.cc" "${LIBRARY_DIR}/column_reader.cc" "${LIBRARY_DIR}/column_scanner.cc" "${LIBRARY_DIR}/column_writer.cc" "${LIBRARY_DIR}/encoding.cc" + "${LIBRARY_DIR}/encryption/crypto_factory.cc" "${LIBRARY_DIR}/encryption/encryption.cc" "${LIBRARY_DIR}/encryption/encryption_internal.cc" + "${LIBRARY_DIR}/encryption/encryption_internal_nossl.cc" + "${LIBRARY_DIR}/encryption/file_key_unwrapper.cc" + 
"${LIBRARY_DIR}/encryption/file_key_wrapper.cc" + "${LIBRARY_DIR}/encryption/file_system_key_material_store.cc" "${LIBRARY_DIR}/encryption/internal_file_decryptor.cc" "${LIBRARY_DIR}/encryption/internal_file_encryptor.cc" + "${LIBRARY_DIR}/encryption/key_material.cc" + "${LIBRARY_DIR}/encryption/key_metadata.cc" + "${LIBRARY_DIR}/encryption/key_toolkit.cc" + "${LIBRARY_DIR}/encryption/key_toolkit_internal.cc" + "${LIBRARY_DIR}/encryption/kms_client.cc" + "${LIBRARY_DIR}/encryption/local_wrap_kms_client.cc" + "${LIBRARY_DIR}/encryption/openssl_internal.cc" "${LIBRARY_DIR}/exception.cc" "${LIBRARY_DIR}/file_reader.cc" "${LIBRARY_DIR}/file_writer.cc" - "${LIBRARY_DIR}/page_index.cc" - "${LIBRARY_DIR}/level_conversion.cc" "${LIBRARY_DIR}/level_comparison.cc" + "${LIBRARY_DIR}/level_comparison_avx2.cc" + "${LIBRARY_DIR}/level_conversion.cc" + "${LIBRARY_DIR}/level_conversion_bmi2.cc" "${LIBRARY_DIR}/metadata.cc" + "${LIBRARY_DIR}/page_index.cc" "${LIBRARY_DIR}/platform.cc" "${LIBRARY_DIR}/printer.cc" "${LIBRARY_DIR}/properties.cc" @@ -489,7 +523,6 @@ set(PARQUET_SRCS "${LIBRARY_DIR}/stream_reader.cc" "${LIBRARY_DIR}/stream_writer.cc" "${LIBRARY_DIR}/types.cc" - "${LIBRARY_DIR}/bloom_filter_reader.cc" "${LIBRARY_DIR}/xxhasher.cc" "${GEN_LIBRARY_DIR}/parquet_constants.cpp" @@ -520,6 +553,9 @@ endif () add_definitions(-DPARQUET_THRIFT_VERSION_MAJOR=0) add_definitions(-DPARQUET_THRIFT_VERSION_MINOR=16) +# As per https://github.com/apache/arrow/pull/35672 you need to enable it explicitly. +add_definitions(-DARROW_ENABLE_THREADING) + # === tools set(TOOLS_DIR "${ClickHouse_SOURCE_DIR}/contrib/arrow/cpp/tools/parquet") diff --git a/contrib/flatbuffers b/contrib/flatbuffers index eb3f8279482..0100f6a5779 160000 --- a/contrib/flatbuffers +++ b/contrib/flatbuffers @@ -1 +1 @@ -Subproject commit eb3f827948241ce0e701516f16cd67324802bce9 +Subproject commit 0100f6a5779831fa7a651e4b67ef389a8752bd9b diff --git a/contrib/numactl b/contrib/numactl index 8d13d63a05f..ff32c618d63 160000 --- a/contrib/numactl +++ b/contrib/numactl @@ -1 +1 @@ -Subproject commit 8d13d63a05f0c3cd88bf777cbb61541202b7da08 +Subproject commit ff32c618d63ca7ac48cce366c5a04bb3563683a0 diff --git a/contrib/usearch b/contrib/usearch index 1706420acaf..53799b84ca9 160000 --- a/contrib/usearch +++ b/contrib/usearch @@ -1 +1 @@ -Subproject commit 1706420acafbd83d852c512dcf343af0a4059e48 +Subproject commit 53799b84ca9ad708b060d0b1cfa5f039371721cd diff --git a/docker/test/base/setup_export_logs.sh b/docker/test/base/setup_export_logs.sh index a39f96867be..12f1cc4d357 100755 --- a/docker/test/base/setup_export_logs.sh +++ b/docker/test/base/setup_export_logs.sh @@ -25,7 +25,7 @@ EXTRA_COLUMNS_EXPRESSION_TRACE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayMap(x -> d # coverage_log needs more columns for symbolization, but only symbol names (the line numbers are too heavy to calculate) EXTRA_COLUMNS_COVERAGE_LOG="${EXTRA_COLUMNS} symbols Array(LowCardinality(String)), " -EXTRA_COLUMNS_EXPRESSION_COVERAGE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayMap(x -> demangle(addressToSymbol(x)), coverage)::Array(LowCardinality(String)) AS symbols" +EXTRA_COLUMNS_EXPRESSION_COVERAGE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayDistinct(arrayMap(x -> demangle(addressToSymbol(x)), coverage))::Array(LowCardinality(String)) AS symbols" function __set_connection_args diff --git a/docker/test/style/Dockerfile b/docker/test/style/Dockerfile index fa6b087eb7d..564301f447c 100644 --- a/docker/test/style/Dockerfile +++ b/docker/test/style/Dockerfile @@ -28,7 +28,7 @@ COPY 
requirements.txt / RUN pip3 install --no-cache-dir -r requirements.txt RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && locale-gen en_US.UTF-8 -ENV LC_ALL en_US.UTF-8 +ENV LC_ALL=en_US.UTF-8 # Architecture of the image when BuildKit/buildx is used ARG TARGETARCH diff --git a/docker/test/style/requirements.txt b/docker/test/style/requirements.txt index cc87f6e548d..aab20b5bee0 100644 --- a/docker/test/style/requirements.txt +++ b/docker/test/style/requirements.txt @@ -12,6 +12,7 @@ charset-normalizer==3.3.2 click==8.1.7 codespell==2.2.1 cryptography==43.0.1 +datacompy==0.7.3 Deprecated==1.2.14 dill==0.3.8 flake8==4.0.1 @@ -23,6 +24,7 @@ mccabe==0.6.1 multidict==6.0.5 mypy==1.8.0 mypy-extensions==1.0.0 +pandas==2.2.3 packaging==24.1 pathspec==0.9.0 pip==24.1.1 diff --git a/docs/en/engines/table-engines/integrations/s3.md b/docs/en/engines/table-engines/integrations/s3.md index 2675c193519..fd27d4b6ed9 100644 --- a/docs/en/engines/table-engines/integrations/s3.md +++ b/docs/en/engines/table-engines/integrations/s3.md @@ -290,6 +290,7 @@ The following settings can be specified in configuration file for given endpoint - `expiration_window_seconds` — Grace period for checking if expiration-based credentials have expired. Optional, default value is `120`. - `no_sign_request` - Ignore all the credentials so requests are not signed. Useful for accessing public buckets. - `header` — Adds specified HTTP header to a request to given endpoint. Optional, can be specified multiple times. +- `access_header` - Adds specified HTTP header to a request to given endpoint, in cases where there are no other credentials from another source. - `server_side_encryption_customer_key_base64` — If specified, required headers for accessing S3 objects with SSE-C encryption will be set. Optional. - `server_side_encryption_kms_key_id` - If specified, required headers for accessing S3 objects with [SSE-KMS encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) will be set. If an empty string is specified, the AWS managed S3 key will be used. Optional. - `server_side_encryption_kms_encryption_context` - If specified alongside `server_side_encryption_kms_key_id`, the given encryption context header for SSE-KMS will be set. Optional. @@ -320,6 +321,32 @@ The following settings can be specified in configuration file for given endpoint ``` +## Working with archives + +Suppose that we have several archive files with following URIs on S3: + +- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-10.csv.zip' +- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-11.csv.zip' +- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-12.csv.zip' + +Extracting data from these archives is possible using ::. Globs can be used both in the url part as well as in the part after :: (responsible for the name of a file inside the archive). + +``` sql +SELECT * +FROM s3( + 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-1{0..2}.csv.zip :: *.csv' +); +``` + +:::note +ClickHouse supports three archive formats: +ZIP +TAR +7Z +While ZIP and TAR archives can be accessed from any supported storage location, 7Z archives can only be read from the local filesystem where ClickHouse is installed. +::: + + ## Accessing public buckets ClickHouse tries to fetch credentials from many different types of sources. 
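As a quick illustration of the archive syntax documented in the hunk above, the following sketch reads one named CSV file out of each matching ZIP archive; the bucket, key pattern, and file name are hypothetical and not taken from the patch:

``` sql
-- Globs may appear both in the URL part and after the :: separator,
-- which addresses a file inside the archive.
SELECT count()
FROM s3('https://example-bucket.s3.amazonaws.com/archives/data-{1..3}.zip :: daily/top-1m.csv', 'CSV');
```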
diff --git a/docs/en/getting-started/index.md b/docs/en/getting-started/index.md index b520220984c..7898ca01129 100644 --- a/docs/en/getting-started/index.md +++ b/docs/en/getting-started/index.md @@ -23,6 +23,7 @@ functions in ClickHouse. The sample datasets include: - The [NYPD Complaint Data](../getting-started/example-datasets/nypd_complaint_data.md) demonstrates how to use data inference to simplify creating tables - The ["What's on the Menu?" dataset](../getting-started/example-datasets/menus.md) has an example of denormalizing data - The [Laion dataset](../getting-started/example-datasets/laion.md) has an example of [Approximate nearest neighbor search indexes](../engines/table-engines/mergetree-family/annindexes.md) usage +- The [TPC-H](../getting-started/example-datasets/tpch.md), [TPC-DS](../getting-started/example-datasets/tpcds.md), and [Star Schema (SSB)](../getting-started/example-datasets/star-schema.md) industry benchmarks for analytics databases - [Getting Data Into ClickHouse - Part 1](https://clickhouse.com/blog/getting-data-into-clickhouse-part-1) provides examples of defining a schema and loading a small Hacker News dataset - [Getting Data Into ClickHouse - Part 3 - Using S3](https://clickhouse.com/blog/getting-data-into-clickhouse-part-3-s3) has examples of loading data from s3 - [Generating random data in ClickHouse](https://clickhouse.com/blog/generating-random-test-distribution-data-for-clickhouse) shows how to generate random data if none of the above fit your needs. diff --git a/docs/en/interfaces/cli.md b/docs/en/interfaces/cli.md index 66291014ed7..504f6eec6de 100644 --- a/docs/en/interfaces/cli.md +++ b/docs/en/interfaces/cli.md @@ -190,6 +190,7 @@ You can pass parameters to `clickhouse-client` (all parameters have a default va - `--config-file` – The name of the configuration file. - `--secure` – If specified, will connect to server over secure connection (TLS). You might need to configure your CA certificates in the [configuration file](#configuration_files). The available configuration settings are the same as for [server-side TLS configuration](../operations/server-configuration-parameters/settings.md#openssl). - `--history_file` — Path to a file containing command history. +- `--history_max_entries` — Maximum number of entries in the history file. Default value: 1 000 000. - `--param_` — Value for a [query with parameters](#cli-queries-with-parameters). - `--hardware-utilization` — Print hardware utilization information in progress bar. - `--print-profile-events` – Print `ProfileEvents` packets. diff --git a/docs/en/operations/server-configuration-parameters/settings.md b/docs/en/operations/server-configuration-parameters/settings.md index bb40e55133a..02fa5a8ca58 100644 --- a/docs/en/operations/server-configuration-parameters/settings.md +++ b/docs/en/operations/server-configuration-parameters/settings.md @@ -1975,6 +1975,22 @@ The default is `false`. true ``` +## async_load_system_database {#async_load_system_database} + +Asynchronous loading of system tables. Helpful if there is a high amount of log tables and parts in the `system` database. Independent of the `async_load_databases` setting. + +If set to `true`, all system databases with `Ordinary`, `Atomic`, and `Replicated` engines will be loaded asynchronously after the ClickHouse server starts. See `system.asynchronous_loader` table, `tables_loader_background_pool_size` and `tables_loader_foreground_pool_size` server settings. 
Any query that tries to access a system table, that is not yet loaded, will wait for exactly this table to be started up. The table that is waited for by at least one query will be loaded with higher priority. Also consider setting the `max_waiting_queries` setting to limit the total number of waiting queries. + +If `false`, system database loads before server start. + +The default is `false`. + +**Example** + +``` xml +true +``` + ## tables_loader_foreground_pool_size {#tables_loader_foreground_pool_size} Sets the number of threads performing load jobs in foreground pool. The foreground pool is used for loading table synchronously before server start listening on a port and for loading tables that are waited for. Foreground pool has higher priority than background pool. It means that no job starts in background pool while there are jobs running in foreground pool. @@ -3208,6 +3224,34 @@ Default value: "default" **See Also** - [Workload Scheduling](/docs/en/operations/workload-scheduling.md) +## workload_path {#workload_path} + +The directory used as a storage for all `CREATE WORKLOAD` and `CREATE RESOURCE` queries. By default `/workload/` folder under server working directory is used. + +**Example** + +``` xml +/var/lib/clickhouse/workload/ +``` + +**See Also** +- [Workload Hierarchy](/docs/en/operations/workload-scheduling.md#workloads) +- [workload_zookeeper_path](#workload_zookeeper_path) + +## workload_zookeeper_path {#workload_zookeeper_path} + +The path to a ZooKeeper node, which is used as a storage for all `CREATE WORKLOAD` and `CREATE RESOURCE` queries. For consistency all SQL definitions are stored as a value of this single znode. By default ZooKeeper is not used and definitions are stored on [disk](#workload_path). + +**Example** + +``` xml +/clickhouse/workload/definitions.sql +``` + +**See Also** +- [Workload Hierarchy](/docs/en/operations/workload-scheduling.md#workloads) +- [workload_path](#workload_path) + ## max_authentication_methods_per_user {#max_authentication_methods_per_user} The maximum number of authentication methods a user can be created with or altered to. diff --git a/docs/en/operations/system-tables/merge_tree_settings.md b/docs/en/operations/system-tables/merge_tree_settings.md index 48217d63f9d..473315d3941 100644 --- a/docs/en/operations/system-tables/merge_tree_settings.md +++ b/docs/en/operations/system-tables/merge_tree_settings.md @@ -18,6 +18,11 @@ Columns: - `1` — Current user can’t change the setting. - `type` ([String](../../sql-reference/data-types/string.md)) — Setting type (implementation specific string value). - `is_obsolete` ([UInt8](../../sql-reference/data-types/int-uint.md#uint-ranges)) - Shows whether a setting is obsolete. +- `tier` ([Enum8](../../sql-reference/data-types/enum.md)) — Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their development and the expectations one might have when using them. Values: + - `'Production'` — The feature is stable, safe to use and does not have issues interacting with other **production** features. . + - `'Beta'` — The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome. + - `'Experimental'` — The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time. + - `'Obsolete'` — No longer supported. 
Either it is already removed or it will be removed in future releases. **Example** ```sql diff --git a/docs/en/operations/system-tables/resources.md b/docs/en/operations/system-tables/resources.md new file mode 100644 index 00000000000..6329f05f610 --- /dev/null +++ b/docs/en/operations/system-tables/resources.md @@ -0,0 +1,37 @@ +--- +slug: /en/operations/system-tables/resources +--- +# resources + +Contains information for [resources](/docs/en/operations/workload-scheduling.md#workload_entity_storage) residing on the local server. The table contains a row for every resource. + +Example: + +``` sql +SELECT * +FROM system.resources +FORMAT Vertical +``` + +``` text +Row 1: +────── +name: io_read +read_disks: ['s3'] +write_disks: [] +create_query: CREATE RESOURCE io_read (READ DISK s3) + +Row 2: +────── +name: io_write +read_disks: [] +write_disks: ['s3'] +create_query: CREATE RESOURCE io_write (WRITE DISK s3) +``` + +Columns: + +- `name` (`String`) - Resource name. +- `read_disks` (`Array(String)`) - The array of disk names that uses this resource for read operations. +- `write_disks` (`Array(String)`) - The array of disk names that uses this resource for write operations. +- `create_query` (`String`) - The definition of the resource. diff --git a/docs/en/operations/system-tables/settings.md b/docs/en/operations/system-tables/settings.md index a04e095e990..1cfee0ba5f4 100644 --- a/docs/en/operations/system-tables/settings.md +++ b/docs/en/operations/system-tables/settings.md @@ -18,6 +18,11 @@ Columns: - `1` — Current user can’t change the setting. - `default` ([String](../../sql-reference/data-types/string.md)) — Setting default value. - `is_obsolete` ([UInt8](../../sql-reference/data-types/int-uint.md#uint-ranges)) - Shows whether a setting is obsolete. +- `tier` ([Enum8](../../sql-reference/data-types/enum.md)) — Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their development and the expectations one might have when using them. Values: + - `'Production'` — The feature is stable, safe to use and does not have issues interacting with other **production** features. . + - `'Beta'` — The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome. + - `'Experimental'` — The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time. + - `'Obsolete'` — No longer supported. Either it is already removed or it will be removed in future releases. **Example** @@ -26,19 +31,99 @@ The following example shows how to get information about settings which name con ``` sql SELECT * FROM system.settings -WHERE name LIKE '%min_i%' +WHERE name LIKE '%min_insert_block_size_%' +FORMAT Vertical ``` ``` text -┌─name───────────────────────────────────────────────_─value─────_─changed─_─description───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────_─min──_─max──_─readonly─_─type─────────_─default───_─alias_for─_─is_obsolete─┐ -│ min_insert_block_size_rows │ 1048449 │ 0 │ Squash blocks passed to INSERT query to specified size in rows, if blocks are not big enough. 
│ ____ │ ____ │ 0 │ UInt64 │ 1048449 │ │ 0 │ -│ min_insert_block_size_bytes │ 268402944 │ 0 │ Squash blocks passed to INSERT query to specified size in bytes, if blocks are not big enough. │ ____ │ ____ │ 0 │ UInt64 │ 268402944 │ │ 0 │ -│ min_insert_block_size_rows_for_materialized_views │ 0 │ 0 │ Like min_insert_block_size_rows, but applied only during pushing to MATERIALIZED VIEW (default: min_insert_block_size_rows) │ ____ │ ____ │ 0 │ UInt64 │ 0 │ │ 0 │ -│ min_insert_block_size_bytes_for_materialized_views │ 0 │ 0 │ Like min_insert_block_size_bytes, but applied only during pushing to MATERIALIZED VIEW (default: min_insert_block_size_bytes) │ ____ │ ____ │ 0 │ UInt64 │ 0 │ │ 0 │ -│ read_backoff_min_interval_between_events_ms │ 1000 │ 0 │ Settings to reduce the number of threads in case of slow reads. Do not pay attention to the event, if the previous one has passed less than a certain amount of time. │ ____ │ ____ │ 0 │ Milliseconds │ 1000 │ │ 0 │ -└────────────────────────────────────────────────────┴───────────┴─────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────── -──────────────────────────────────────────────────────┴──────┴──────┴──────────┴──────────────┴───────────┴───────────┴─────────────┘ -``` +Row 1: +────── +name: min_insert_block_size_rows +value: 1048449 +changed: 0 +description: Sets the minimum number of rows in the block that can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. + +Possible values: + +- Positive integer. +- 0 — Squashing disabled. +min: ᴺᵁᴸᴸ +max: ᴺᵁᴸᴸ +readonly: 0 +type: UInt64 +default: 1048449 +alias_for: +is_obsolete: 0 +tier: Production + +Row 2: +────── +name: min_insert_block_size_bytes +value: 268402944 +changed: 0 +description: Sets the minimum number of bytes in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. + +Possible values: + +- Positive integer. +- 0 — Squashing disabled. +min: ᴺᵁᴸᴸ +max: ᴺᵁᴸᴸ +readonly: 0 +type: UInt64 +default: 268402944 +alias_for: +is_obsolete: 0 +tier: Production + +Row 3: +────── +name: min_insert_block_size_rows_for_materialized_views +value: 0 +changed: 0 +description: Sets the minimum number of rows in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. This setting is applied only for blocks inserted into [materialized view](../../sql-reference/statements/create/view.md). By adjusting this setting, you control blocks squashing while pushing to materialized view and avoid excessive memory usage. + +Possible values: + +- Any positive integer. +- 0 — Squashing disabled. + +**See Also** + +- [min_insert_block_size_rows](#min-insert-block-size-rows) +min: ᴺᵁᴸᴸ +max: ᴺᵁᴸᴸ +readonly: 0 +type: UInt64 +default: 0 +alias_for: +is_obsolete: 0 +tier: Production + +Row 4: +────── +name: min_insert_block_size_bytes_for_materialized_views +value: 0 +changed: 0 +description: Sets the minimum number of bytes in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. This setting is applied only for blocks inserted into [materialized view](../../sql-reference/statements/create/view.md). By adjusting this setting, you control blocks squashing while pushing to materialized view and avoid excessive memory usage. + +Possible values: + +- Any positive integer. +- 0 — Squashing disabled. 
+ +**See also** + +- [min_insert_block_size_bytes](#min-insert-block-size-bytes) +min: ᴺᵁᴸᴸ +max: ᴺᵁᴸᴸ +readonly: 0 +type: UInt64 +default: 0 +alias_for: +is_obsolete: 0 +tier: Production + ``` Using of `WHERE changed` can be useful, for example, when you want to check: diff --git a/docs/en/operations/system-tables/workloads.md b/docs/en/operations/system-tables/workloads.md new file mode 100644 index 00000000000..d9c62372044 --- /dev/null +++ b/docs/en/operations/system-tables/workloads.md @@ -0,0 +1,40 @@ +--- +slug: /en/operations/system-tables/workloads +--- +# workloads + +Contains information for [workloads](/docs/en/operations/workload-scheduling.md#workload_entity_storage) residing on the local server. The table contains a row for every workload. + +Example: + +``` sql +SELECT * +FROM system.workloads +FORMAT Vertical +``` + +``` text +Row 1: +────── +name: production +parent: all +create_query: CREATE WORKLOAD production IN `all` SETTINGS weight = 9 + +Row 2: +────── +name: development +parent: all +create_query: CREATE WORKLOAD development IN `all` + +Row 3: +────── +name: all +parent: +create_query: CREATE WORKLOAD `all` +``` + +Columns: + +- `name` (`String`) - Workload name. +- `parent` (`String`) - Parent workload name. +- `create_query` (`String`) - The definition of the workload. diff --git a/docs/en/operations/workload-scheduling.md b/docs/en/operations/workload-scheduling.md index 08629492ec6..a43bea7a5b1 100644 --- a/docs/en/operations/workload-scheduling.md +++ b/docs/en/operations/workload-scheduling.md @@ -43,6 +43,20 @@ Example: ``` +An alternative way to express which disks are used by a resource is SQL syntax: + +```sql +CREATE RESOURCE resource_name (WRITE DISK disk1, READ DISK disk2) +``` + +Resource could be used for any number of disk for READ or WRITE or both for READ and WRITE. There a syntax allowing to use a resource for all the disks: + +```sql +CREATE RESOURCE all_io (READ ANY DISK, WRITE ANY DISK); +``` + +Note that server configuration options have priority over SQL way to define resources. + ## Workload markup {#workload_markup} Queries can be marked with setting `workload` to distinguish different workloads. If `workload` is not set, than value "default" is used. Note that you are able to specify the other value using settings profiles. Setting constraints can be used to make `workload` constant if you want all queries from the user to be marked with fixed value of `workload` setting. @@ -153,9 +167,48 @@ Example: ``` +## Workload hierarchy (SQL only) {#workloads} + +Defining resources and classifiers in XML could be challenging. ClickHouse provides SQL syntax that is much more convenient. All resources that were created with `CREATE RESOURCE` share the same structure of the hierarchy, but could differ in some aspects. Every workload created with `CREATE WORKLOAD` maintains a few automatically created scheduling nodes for every resource. A child workload can be created inside another parent workload. Here is the example that defines exactly the same hierarchy as XML configuration above: + +```sql +CREATE RESOURCE network_write (WRITE DISK s3) +CREATE RESOURCE network_read (READ DISK s3) +CREATE WORKLOAD all SETTINGS max_requests = 100 +CREATE WORKLOAD development IN all +CREATE WORKLOAD production IN all SETTINGS weight = 3 +``` + +The name of a leaf workload without children could be used in query settings `SETTINGS workload = 'name'`. Note that workload classifiers are also created automatically when using SQL syntax. 
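To make the point about leaf workloads concrete: once the hierarchy above exists, a query can be attributed to one of its leaves through the `workload` query setting. A minimal sketch, assuming a hypothetical table name:

``` sql
-- Attribute this query's resource consumption to the 'production' leaf workload.
SELECT count()
FROM my_table
SETTINGS workload = 'production';
```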
+ +To customize workload the following settings could be used: +* `priority` - sibling workloads are served according to static priority values (lower value means higher priority). +* `weight` - sibling workloads having the same static priority share resources according to weights. +* `max_requests` - the limit on the number of concurrent resource requests in this workload. +* `max_cost` - the limit on the total inflight bytes count of concurrent resource requests in this workload. +* `max_speed` - the limit on byte processing rate of this workload (the limit is independent for every resource). +* `max_burst` - maximum number of bytes that could be processed by the workload without being throttled (for every resource independently). + +Note that workload settings are translated into a proper set of scheduling nodes. For more details, see the description of the scheduling node [types and options](#hierarchy). + +There is no way to specify different hierarchies of workloads for different resources. But there is a way to specify different workload setting value for a specific resource: + +```sql +CREATE OR REPLACE WORKLOAD all SETTINGS max_requests = 100, max_speed = 1000000 FOR network_read, max_speed = 2000000 FOR network_write +``` + +Also note that workload or resource could not be dropped if it is referenced from another workload. To update a definition of a workload use `CREATE OR REPLACE WORKLOAD` query. + +## Workloads and resources storage {#workload_entity_storage} +Definitions of all workloads and resources in the form of `CREATE WORKLOAD` and `CREATE RESOURCE` queries are stored persistently either on disk at `workload_path` or in ZooKeeper at `workload_zookeeper_path`. ZooKeeper storage is recommended to achieve consistency between nodes. Alternatively `ON CLUSTER` clause could be used along with disk storage. + ## See also - [system.scheduler](/docs/en/operations/system-tables/scheduler.md) + - [system.workloads](/docs/en/operations/system-tables/workloads.md) + - [system.resources](/docs/en/operations/system-tables/resources.md) - [merge_workload](/docs/en/operations/settings/merge-tree-settings.md#merge_workload) merge tree setting - [merge_workload](/docs/en/operations/server-configuration-parameters/settings.md#merge_workload) global server setting - [mutation_workload](/docs/en/operations/settings/merge-tree-settings.md#mutation_workload) merge tree setting - [mutation_workload](/docs/en/operations/server-configuration-parameters/settings.md#mutation_workload) global server setting + - [workload_path](/docs/en/operations/server-configuration-parameters/settings.md#workload_path) global server setting + - [workload_zookeeper_path](/docs/en/operations/server-configuration-parameters/settings.md#workload_zookeeper_path) global server setting diff --git a/docs/en/sql-reference/aggregate-functions/reference/anylast.md b/docs/en/sql-reference/aggregate-functions/reference/anylast.md index 202d2e9fb10..4fe21531c76 100644 --- a/docs/en/sql-reference/aggregate-functions/reference/anylast.md +++ b/docs/en/sql-reference/aggregate-functions/reference/anylast.md @@ -17,7 +17,7 @@ anyLast(column) [RESPECT NULLS] - `column`: The column name. :::note -Supports the `RESPECT NULLS` modifier after the function name. Using this modifier will ensure the function selects the first value passed, regardless of whether it is `NULL` or not. +Supports the `RESPECT NULLS` modifier after the function name. 
Using this modifier will ensure the function selects the last value passed, regardless of whether it is `NULL` or not. ::: **Returned value** @@ -40,4 +40,4 @@ SELECT anyLast(city) FROM any_last_nulls; ┌─anyLast(city)─┐ │ Valencia │ └───────────────┘ -``` \ No newline at end of file +``` diff --git a/docs/en/sql-reference/data-types/newjson.md b/docs/en/sql-reference/data-types/newjson.md index 68952590eb9..7e6d4dd934f 100644 --- a/docs/en/sql-reference/data-types/newjson.md +++ b/docs/en/sql-reference/data-types/newjson.md @@ -5,7 +5,7 @@ sidebar_label: JSON keywords: [json, data type] --- -# JSON +# JSON Data Type Stores JavaScript Object Notation (JSON) documents in a single column. diff --git a/docs/en/sql-reference/statements/alter/user.md b/docs/en/sql-reference/statements/alter/user.md index a56532e2ab0..1514b16a657 100644 --- a/docs/en/sql-reference/statements/alter/user.md +++ b/docs/en/sql-reference/statements/alter/user.md @@ -12,7 +12,7 @@ Syntax: ``` sql ALTER USER [IF EXISTS] name1 [RENAME TO new_name |, name2 [,...]] [ON CLUSTER cluster_name] - [NOT IDENTIFIED | RESET AUTHENTICATION METHODS TO NEW | {IDENTIFIED | ADD IDENTIFIED} {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} + [NOT IDENTIFIED | RESET AUTHENTICATION METHODS TO NEW | {IDENTIFIED | ADD IDENTIFIED} {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} [VALID UNTIL datetime] [, {[{plaintext_password | sha256_password | sha256_hash | ...}] BY {'password' | 'hash'}} | {ldap SERVER 'server_name'} | {...} | ... [,...]]] [[ADD | DROP] HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE] [VALID UNTIL datetime] @@ -91,3 +91,15 @@ Reset authentication methods and keep the most recent added one: ``` sql ALTER USER user1 RESET AUTHENTICATION METHODS TO NEW ``` + +## VALID UNTIL Clause + +Allows you to specify the expiration date and, optionally, the time for an authentication method. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`. +The `VALID UNTIL` clause can only be specified along with an authentication method, except for the case where no authentication method has been specified in the query. In this scenario, the `VALID UNTIL` clause will be applied to all existing authentication methods. 
+ +Examples: + +- `ALTER USER name1 VALID UNTIL '2025-01-01'` +- `ALTER USER name1 VALID UNTIL '2025-01-01 12:00:00 UTC'` +- `ALTER USER name1 VALID UNTIL 'infinity'` +- `ALTER USER name1 IDENTIFIED WITH plaintext_password BY 'no_expiration', bcrypt_password BY 'expiration_set' VALID UNTIL'2025-01-01''` diff --git a/docs/en/sql-reference/statements/create/user.md b/docs/en/sql-reference/statements/create/user.md index a018e28306c..03d93fc3365 100644 --- a/docs/en/sql-reference/statements/create/user.md +++ b/docs/en/sql-reference/statements/create/user.md @@ -11,7 +11,7 @@ Syntax: ``` sql CREATE USER [IF NOT EXISTS | OR REPLACE] name1 [, name2 [,...]] [ON CLUSTER cluster_name] - [NOT IDENTIFIED | IDENTIFIED {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} + [NOT IDENTIFIED | IDENTIFIED {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} [VALID UNTIL datetime] [, {[{plaintext_password | sha256_password | sha256_hash | ...}] BY {'password' | 'hash'}} | {ldap SERVER 'server_name'} | {...} | ... [,...]]] [HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE] [VALID UNTIL datetime] @@ -178,13 +178,16 @@ ClickHouse treats `user_name@'address'` as a username as a whole. Thus, technica ## VALID UNTIL Clause -Allows you to specify the expiration date and, optionally, the time for user credentials. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`. +Allows you to specify the expiration date and, optionally, the time for an authentication method. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`. +The `VALID UNTIL` clause can only be specified along with an authentication method, except for the case where no authentication method has been specified in the query. In this scenario, the `VALID UNTIL` clause will be applied to all existing authentication methods. Examples: - `CREATE USER name1 VALID UNTIL '2025-01-01'` - `CREATE USER name1 VALID UNTIL '2025-01-01 12:00:00 UTC'` - `CREATE USER name1 VALID UNTIL 'infinity'` +- ```CREATE USER name1 VALID UNTIL '2025-01-01 12:00:00 `Asia/Tokyo`'``` +- `CREATE USER name1 IDENTIFIED WITH plaintext_password BY 'no_expiration', bcrypt_password BY 'expiration_set' VALID UNTIL '2025-01-01''` ## GRANTEES Clause diff --git a/docs/en/sql-reference/statements/create/view.md b/docs/en/sql-reference/statements/create/view.md index 0e5d5250e0f..c770348bce0 100644 --- a/docs/en/sql-reference/statements/create/view.md +++ b/docs/en/sql-reference/statements/create/view.md @@ -55,7 +55,7 @@ SELECT * FROM view(column1=value1, column2=value2 ...) 
## Materialized View ``` sql -CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] [TO[db.]name] [ENGINE = engine] [POPULATE] +CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster_name] [TO[db.]name] [ENGINE = engine] [POPULATE] [DEFINER = { user | CURRENT_USER }] [SQL SECURITY { DEFINER | INVOKER | NONE }] AS SELECT ... [COMMENT 'comment'] diff --git a/docs/en/sql-reference/statements/grant.md b/docs/en/sql-reference/statements/grant.md index c11299baf38..6decaf19d5b 100644 --- a/docs/en/sql-reference/statements/grant.md +++ b/docs/en/sql-reference/statements/grant.md @@ -78,6 +78,10 @@ Specifying privileges you can use asterisk (`*`) instead of a table or a databas Also, you can omit database name. In this case privileges are granted for current database. For example, `GRANT SELECT ON * TO john` grants the privilege on all the tables in the current database, `GRANT SELECT ON mytable TO john` grants the privilege on the `mytable` table in the current database. +:::note +The feature described below is available starting with the 24.10 ClickHouse version. +::: + You can also put asterisks at the end of a table or a database name. This feature allows you to grant privileges on an abstract prefix of the table's path. Example: `GRANT SELECT ON db.my_tables* TO john`. This query allows `john` to execute the `SELECT` query over all the `db` database tables with the prefix `my_tables*`. @@ -113,6 +117,7 @@ GRANT SELECT ON db*.* TO john -- correct GRANT SELECT ON *.my_table TO john -- wrong GRANT SELECT ON foo*bar TO john -- wrong GRANT SELECT ON *suffix TO john -- wrong +GRANT SELECT(foo) ON db.table* TO john -- wrong ``` ## Privileges @@ -238,10 +243,13 @@ Hierarchy of privileges: - `HDFS` - `HIVE` - `JDBC` + - `KAFKA` - `MONGO` - `MYSQL` + - `NATS` - `ODBC` - `POSTGRES` + - `RABBITMQ` - `REDIS` - `REMOTE` - `S3` @@ -520,10 +528,13 @@ Allows using external data sources. Applies to [table engines](../../engines/tab - `HDFS`. Level: `GLOBAL` - `HIVE`. Level: `GLOBAL` - `JDBC`. Level: `GLOBAL` + - `KAFKA`. Level: `GLOBAL` - `MONGO`. Level: `GLOBAL` - `MYSQL`. Level: `GLOBAL` + - `NATS`. Level: `GLOBAL` - `ODBC`. Level: `GLOBAL` - `POSTGRES`. Level: `GLOBAL` + - `RABBITMQ`. Level: `GLOBAL` - `REDIS`. Level: `GLOBAL` - `REMOTE`. Level: `GLOBAL` - `S3`. Level: `GLOBAL` diff --git a/docs/en/sql-reference/statements/kill.md b/docs/en/sql-reference/statements/kill.md index 667a5b51f5c..ff6f64a97fe 100644 --- a/docs/en/sql-reference/statements/kill.md +++ b/docs/en/sql-reference/statements/kill.md @@ -83,7 +83,7 @@ The presence of long-running or incomplete mutations often indicates that a Clic - Or manually kill some of these mutations by sending a `KILL` command. ``` sql -KILL MUTATION [ON CLUSTER cluster] +KILL MUTATION WHERE [TEST] [FORMAT format] @@ -135,7 +135,6 @@ KILL MUTATION WHERE database = 'default' AND table = 'table' -- Cancel the specific mutation: KILL MUTATION WHERE database = 'default' AND table = 'table' AND mutation_id = 'mutation_3.txt' ``` -:::tip If you are killing a mutation in ClickHouse Cloud or in a self-managed cluster, then be sure to use the ```ON CLUSTER [cluster-name]``` option, in order to ensure the mutation is killed on all replicas::: The query is useful when a mutation is stuck and cannot finish (e.g. if some function in the mutation query throws an exception when applied to the data contained in the table). 
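Tying together the new source-level privileges added in the grant.md hunk above (`KAFKA`, `NATS`, `RABBITMQ`), here is a minimal sketch of granting and revoking them; since these are global-level source privileges, they are granted `ON *.*`, and the user name is only an example:

``` sql
GRANT KAFKA, NATS, RABBITMQ ON *.* TO john;
REVOKE NATS ON *.* FROM john;
```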
diff --git a/docs/en/sql-reference/table-functions/s3.md b/docs/en/sql-reference/table-functions/s3.md index 181c92b92d4..b14eb84392f 100644 --- a/docs/en/sql-reference/table-functions/s3.md +++ b/docs/en/sql-reference/table-functions/s3.md @@ -93,7 +93,6 @@ LIMIT 5; ClickHouse also can determine the compression method of the file. For example, if the file was zipped up with a `.csv.gz` extension, ClickHouse would decompress the file automatically. ::: - ## Usage Suppose that we have several files with following URIs on S3: @@ -248,6 +247,25 @@ FROM s3( LIMIT 5; ``` +## Using S3 credentials (ClickHouse Cloud) + +For non-public buckets, users can pass an `aws_access_key_id` and `aws_secret_access_key` to the function. For example: + +```sql +SELECT count() FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/mta/*.tsv', '', '','TSVWithNames') +``` + +This is appropriate for one-off accesses or in cases where credentials can easily be rotated. However, this is not recommended as a long-term solution for repeated access or where credentials are sensitive. In this case, we recommend users rely on role-based access. + +Role-based access for S3 in ClickHouse Cloud is documented [here](/docs/en/cloud/security/secure-s3#access-your-s3-bucket-with-the-clickhouseaccess-role). + +Once configured, a `roleARN` can be passed to the s3 function via an `extra_credentials` parameter. For example: + +```sql +SELECT count() FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/mta/*.tsv','CSVWithNames',extra_credentials(role_arn = 'arn:aws:iam::111111111111:role/ClickHouseAccessRole-001')) +``` + +Further examples can be found [here](/docs/en/cloud/security/secure-s3#access-your-s3-bucket-with-the-clickhouseaccess-role) ## Working with archives @@ -266,6 +284,14 @@ FROM s3( ); ``` +:::note +ClickHouse supports three archive formats: +ZIP +TAR +7Z +While ZIP and TAR archives can be accessed from any supported storage location, 7Z archives can only be read from the local filesystem where ClickHouse is installed. +::: + ## Virtual Columns {#virtual-columns} diff --git a/docs/en/sql-reference/table-functions/s3Cluster.md b/docs/en/sql-reference/table-functions/s3Cluster.md index 0eb17751c27..0aa4784d27a 100644 --- a/docs/en/sql-reference/table-functions/s3Cluster.md +++ b/docs/en/sql-reference/table-functions/s3Cluster.md @@ -70,10 +70,15 @@ SELECT count(*) FROM s3Cluster( ) ``` +## Accessing private and public buckets + +Users can use the same approaches as document for the s3 function [here](/docs/en/sql-reference/table-functions/s3#accessing-public-buckets). + ## Optimizing performance For details on optimizing the performance of the s3 function see [our detailed guide](/docs/en/integrations/s3/performance). + **See Also** - [S3 engine](../../engines/table-engines/integrations/s3.md) diff --git a/docs/ru/engines/table-engines/integrations/s3.md b/docs/ru/engines/table-engines/integrations/s3.md index a1c69df4d0a..2bab78c0612 100644 --- a/docs/ru/engines/table-engines/integrations/s3.md +++ b/docs/ru/engines/table-engines/integrations/s3.md @@ -138,6 +138,7 @@ CREATE TABLE table_with_asterisk (name String, value UInt32) - `use_insecure_imds_request` — признак использования менее безопасного соединения при выполнении запроса к IMDS при получении учётных данных из метаданных Amazon EC2. Значение по умолчанию — `false`. - `region` — название региона S3. - `header` — добавляет указанный HTTP-заголовок к запросу на заданную точку приема запроса. Может быть определен несколько раз. 
+- `access_header` - добавляет указанный HTTP-заголовок к запросу на заданную точку приема запроса, в случая если не указаны другие способы авторизации. - `server_side_encryption_customer_key_base64` — устанавливает необходимые заголовки для доступа к объектам S3 с шифрованием SSE-C. - `single_read_retries` — Максимальное количество попыток запроса при единичном чтении. Значение по умолчанию — `4`. diff --git a/docs/ru/getting-started/install.md b/docs/ru/getting-started/install.md index f8a660fbec9..083ddc8c39c 100644 --- a/docs/ru/getting-started/install.md +++ b/docs/ru/getting-started/install.md @@ -95,7 +95,7 @@ sudo yum install -y clickhouse-server clickhouse-client sudo systemctl enable clickhouse-server sudo systemctl start clickhouse-server sudo systemctl status clickhouse-server -clickhouse-client # илм "clickhouse-client --password" если установлен пароль +clickhouse-client # или "clickhouse-client --password" если установлен пароль ``` Для использования наиболее свежих версий нужно заменить `stable` на `testing` (рекомендуется для тестовых окружений). Также иногда доступен `prestable`. diff --git a/docs/ru/sql-reference/statements/create/view.md b/docs/ru/sql-reference/statements/create/view.md index 8fa30446bb3..5dbffd90205 100644 --- a/docs/ru/sql-reference/statements/create/view.md +++ b/docs/ru/sql-reference/statements/create/view.md @@ -39,7 +39,7 @@ SELECT a, b, c FROM (SELECT ...) ## Материализованные представления {#materialized} ``` sql -CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] [TO[db.]name] [ENGINE = engine] [POPULATE] +CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster_name] [TO[db.]name] [ENGINE = engine] [POPULATE] [DEFINER = { user | CURRENT_USER }] [SQL SECURITY { DEFINER | INVOKER | NONE }] AS SELECT ... ``` diff --git a/docs/ru/sql-reference/statements/grant.md b/docs/ru/sql-reference/statements/grant.md index 2ccc2d05452..79682dc42cd 100644 --- a/docs/ru/sql-reference/statements/grant.md +++ b/docs/ru/sql-reference/statements/grant.md @@ -192,14 +192,23 @@ GRANT SELECT(x,y) ON db.table TO john WITH GRANT OPTION - `addressToSymbol` - `demangle` - [SOURCES](#grant-sources) + - `AZURE` - `FILE` - - `URL` - - `REMOTE` - - `MYSQL` - - `ODBC` - - `JDBC` - `HDFS` + - `HIVE` + - `JDBC` + - `KAFKA` + - `MONGO` + - `MYSQL` + - `NATS` + - `ODBC` + - `POSTGRES` + - `RABBITMQ` + - `REDIS` + - `REMOTE` - `S3` + - `SQLITE` + - `URL` - [dictGet](#grant-dictget) Примеры того, как трактуется данная иерархия: @@ -461,14 +470,23 @@ GRANT INSERT(x,y) ON db.table TO john Разрешает использовать внешние источники данных. Применяется к [движкам таблиц](../../engines/table-engines/index.md) и [табличным функциям](../table-functions/index.md#table-functions). - `SOURCES`. Уровень: `GROUP` + - `AZURE`. Уровень: `GLOBAL` - `FILE`. Уровень: `GLOBAL` - - `URL`. Уровень: `GLOBAL` - - `REMOTE`. Уровень: `GLOBAL` - - `MYSQL`. Уровень: `GLOBAL` - - `ODBC`. Уровень: `GLOBAL` - - `JDBC`. Уровень: `GLOBAL` - `HDFS`. Уровень: `GLOBAL` + - `HIVE`. Уровень: `GLOBAL` + - `JDBC`. Уровень: `GLOBAL` + - `KAFKA`. Уровень: `GLOBAL` + - `MONGO`. Уровень: `GLOBAL` + - `MYSQL`. Уровень: `GLOBAL` + - `NATS`. Уровень: `GLOBAL` + - `ODBC`. Уровень: `GLOBAL` + - `POSTGRES`. Уровень: `GLOBAL` + - `RABBITMQ`. Уровень: `GLOBAL` + - `REDIS`. Уровень: `GLOBAL` + - `REMOTE`. Уровень: `GLOBAL` - `S3`. Уровень: `GLOBAL` + - `SQLITE`. Уровень: `GLOBAL` + - `URL`. Уровень: `GLOBAL` Привилегия `SOURCES` разрешает использование всех источников. 
Также вы можете присвоить привилегию для каждого источника отдельно. Для использования источников необходимы дополнительные привилегии. diff --git a/docs/zh/sql-reference/statements/create/view.md b/docs/zh/sql-reference/statements/create/view.md index 49a1d66bdf1..6c93240644d 100644 --- a/docs/zh/sql-reference/statements/create/view.md +++ b/docs/zh/sql-reference/statements/create/view.md @@ -39,7 +39,7 @@ SELECT a, b, c FROM (SELECT ...) ## Materialized {#materialized} ``` sql -CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ... +CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster_name] [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ... ``` 物化视图存储由相应的[SELECT](../../../sql-reference/statements/select/index.md)管理. diff --git a/docs/zh/sql-reference/statements/grant.md b/docs/zh/sql-reference/statements/grant.md index fea51d590d5..3fd314c791f 100644 --- a/docs/zh/sql-reference/statements/grant.md +++ b/docs/zh/sql-reference/statements/grant.md @@ -170,14 +170,23 @@ GRANT SELECT(x,y) ON db.table TO john WITH GRANT OPTION - `addressToSymbol` - `demangle` - [SOURCES](#grant-sources) + - `AZURE` - `FILE` - - `URL` - - `REMOTE` - - `YSQL` - - `ODBC` - - `JDBC` - `HDFS` + - `HIVE` + - `JDBC` + - `KAFKA` + - `MONGO` + - `MYSQL` + - `NATS` + - `ODBC` + - `POSTGRES` + - `RABBITMQ` + - `REDIS` + - `REMOTE` - `S3` + - `SQLITE` + - `URL` - [dictGet](#grant-dictget) 如何对待该层级的示例: @@ -428,14 +437,23 @@ GRANT INSERT(x,y) ON db.table TO john 允许在 [table engines](../../engines/table-engines/index.md) 和 [table functions](../../sql-reference/table-functions/index.md#table-functions)中使用外部数据源。 - `SOURCES`. 级别: `GROUP` + - `AZURE`. 级别: `GLOBAL` - `FILE`. 级别: `GLOBAL` - - `URL`. 级别: `GLOBAL` - - `REMOTE`. 级别: `GLOBAL` - - `YSQL`. 级别: `GLOBAL` - - `ODBC`. 级别: `GLOBAL` - - `JDBC`. 级别: `GLOBAL` - `HDFS`. 级别: `GLOBAL` + - `HIVE`. 级别: `GLOBAL` + - `JDBC`. 级别: `GLOBAL` + - `KAFKA`. 级别: `GLOBAL` + - `MONGO`. 级别: `GLOBAL` + - `MYSQL`. 级别: `GLOBAL` + - `NATS`. 级别: `GLOBAL` + - `ODBC`. 级别: `GLOBAL` + - `POSTGRES`. 级别: `GLOBAL` + - `RABBITMQ`. 级别: `GLOBAL` + - `REDIS`. 级别: `GLOBAL` + - `REMOTE`. 级别: `GLOBAL` - `S3`. 级别: `GLOBAL` + - `SQLITE`. 级别: `GLOBAL` + - `URL`. 
级别: `GLOBAL` `SOURCES` 权限允许使用所有数据源。当然也可以单独对每个数据源进行授权。要使用数据源时,还需要额外的权限。 diff --git a/programs/client/Client.cpp b/programs/client/Client.cpp index 4aab7fcae14..d7190444f0b 100644 --- a/programs/client/Client.cpp +++ b/programs/client/Client.cpp @@ -192,6 +192,10 @@ void Client::parseConnectionsCredentials(Poco::Util::AbstractConfiguration & con history_file = home_path + "/" + history_file.substr(1); config.setString("history_file", history_file); } + if (config.has(prefix + ".history_max_entries")) + { + config.setUInt("history_max_entries", history_max_entries); + } if (config.has(prefix + ".accept-invalid-certificate")) config.setBool("accept-invalid-certificate", config.getBool(prefix + ".accept-invalid-certificate")); } diff --git a/programs/disks/DisksApp.cpp b/programs/disks/DisksApp.cpp index 5fddfce0678..610d8eaa638 100644 --- a/programs/disks/DisksApp.cpp +++ b/programs/disks/DisksApp.cpp @@ -236,6 +236,7 @@ void DisksApp::runInteractiveReplxx() ReplxxLineReader lr( suggest, history_file, + history_max_entries, /* multiline= */ false, /* ignore_shell_suspend= */ false, query_extenders, @@ -398,6 +399,8 @@ void DisksApp::initializeHistoryFile() throw; } } + + history_max_entries = config().getUInt("history-max-entries", 1000000); } void DisksApp::init(const std::vector & common_arguments) diff --git a/programs/disks/DisksApp.h b/programs/disks/DisksApp.h index 5b240648508..4f2bd7fcad6 100644 --- a/programs/disks/DisksApp.h +++ b/programs/disks/DisksApp.h @@ -62,6 +62,8 @@ private: // Fields responsible for the REPL work String history_file; + UInt32 history_max_entries = 0; /// Maximum number of entries in the history file. Needs to be initialized to 0 since we don't have a proper constructor. Worry not, actual value is set within the initializeHistoryFile method. + LineReader::Suggest suggest; static LineReader::Patterns query_extenders; static LineReader::Patterns query_delimiters; diff --git a/programs/keeper-client/KeeperClient.cpp b/programs/keeper-client/KeeperClient.cpp index 101ed270fc5..2a426fad7ac 100644 --- a/programs/keeper-client/KeeperClient.cpp +++ b/programs/keeper-client/KeeperClient.cpp @@ -243,6 +243,8 @@ void KeeperClient::initialize(Poco::Util::Application & /* self */) } } + history_max_entries = config().getUInt("history-max-entries", 1000000); + String default_log_level; if (config().has("query")) /// We don't want to see any information log in query mode, unless it was set explicitly @@ -319,6 +321,7 @@ void KeeperClient::runInteractiveReplxx() ReplxxLineReader lr( suggest, history_file, + history_max_entries, /* multiline= */ false, /* ignore_shell_suspend= */ false, query_extenders, diff --git a/programs/keeper-client/KeeperClient.h b/programs/keeper-client/KeeperClient.h index 0d3db3c2f02..359663c6a13 100644 --- a/programs/keeper-client/KeeperClient.h +++ b/programs/keeper-client/KeeperClient.h @@ -59,6 +59,8 @@ protected: std::vector getCompletions(const String & prefix) const; String history_file; + UInt32 history_max_entries; /// Maximum number of entries in the history file. 
+ LineReader::Suggest suggest; zkutil::ZooKeeperArgs zk_args; diff --git a/programs/keeper/Keeper.cpp b/programs/keeper/Keeper.cpp index 3007df60765..74af9950e13 100644 --- a/programs/keeper/Keeper.cpp +++ b/programs/keeper/Keeper.cpp @@ -590,6 +590,7 @@ try #if USE_SSL CertificateReloader::instance().tryLoad(*config); + CertificateReloader::instance().tryLoadClient(*config); #endif }); diff --git a/programs/local/LocalServer.cpp b/programs/local/LocalServer.cpp index b6b67724b0a..1dcef5eb25e 100644 --- a/programs/local/LocalServer.cpp +++ b/programs/local/LocalServer.cpp @@ -821,11 +821,11 @@ void LocalServer::processConfig() status.emplace(fs::path(path) / "status", StatusFile::write_full_info); LOG_DEBUG(log, "Loading metadata from {}", path); - auto startup_system_tasks = loadMetadataSystem(global_context); + auto load_system_metadata_tasks = loadMetadataSystem(global_context); attachSystemTablesServer(global_context, *createMemoryDatabaseIfNotExists(global_context, DatabaseCatalog::SYSTEM_DATABASE), false); attachInformationSchema(global_context, *createMemoryDatabaseIfNotExists(global_context, DatabaseCatalog::INFORMATION_SCHEMA)); attachInformationSchema(global_context, *createMemoryDatabaseIfNotExists(global_context, DatabaseCatalog::INFORMATION_SCHEMA_UPPERCASE)); - waitLoad(TablesLoaderForegroundPoolId, startup_system_tasks); + waitLoad(TablesLoaderForegroundPoolId, load_system_metadata_tasks); if (!getClientConfiguration().has("only-system-tables")) { diff --git a/programs/server/Server.cpp b/programs/server/Server.cpp index c106a68f360..1f481381b2b 100644 --- a/programs/server/Server.cpp +++ b/programs/server/Server.cpp @@ -86,7 +86,7 @@ #include #include #include -#include +#include #include #include #include "MetricsTransmitter.h" @@ -168,9 +168,11 @@ namespace ServerSetting { extern const ServerSettingsUInt32 asynchronous_heavy_metrics_update_period_s; extern const ServerSettingsUInt32 asynchronous_metrics_update_period_s; + extern const ServerSettingsBool asynchronous_metrics_enable_heavy_metrics; extern const ServerSettingsBool async_insert_queue_flush_on_shutdown; extern const ServerSettingsUInt64 async_insert_threads; extern const ServerSettingsBool async_load_databases; + extern const ServerSettingsBool async_load_system_database; extern const ServerSettingsUInt64 background_buffer_flush_schedule_pool_size; extern const ServerSettingsUInt64 background_common_pool_size; extern const ServerSettingsUInt64 background_distributed_schedule_pool_size; @@ -205,7 +207,6 @@ namespace ServerSetting extern const ServerSettingsBool format_alter_operations_with_parentheses; extern const ServerSettingsUInt64 global_profiler_cpu_time_period_ns; extern const ServerSettingsUInt64 global_profiler_real_time_period_ns; - extern const ServerSettingsDouble gwp_asan_force_sample_probability; extern const ServerSettingsUInt64 http_connections_soft_limit; extern const ServerSettingsUInt64 http_connections_store_limit; extern const ServerSettingsUInt64 http_connections_warn_limit; @@ -620,7 +621,7 @@ void sanityChecks(Server & server) #if defined(OS_LINUX) try { - const std::unordered_set fastClockSources = { + const std::unordered_set fast_clock_sources = { // ARM clock "arch_sys_counter", // KVM guest clock @@ -629,7 +630,7 @@ void sanityChecks(Server & server) "tsc", }; const char * filename = "/sys/devices/system/clocksource/clocksource0/current_clocksource"; - if (!fastClockSources.contains(readLine(filename))) + if (!fast_clock_sources.contains(readLine(filename))) 
server.context()->addWarningMessage("Linux is not using a fast clock source. Performance can be degraded. Check " + String(filename)); } catch (...) // NOLINT(bugprone-empty-catch) @@ -919,7 +920,6 @@ try registerFormats(); registerRemoteFileMetadatas(); registerSchedulerNodes(); - registerResourceManagers(); CurrentMetrics::set(CurrentMetrics::Revision, ClickHouseRevision::getVersionRevision()); CurrentMetrics::set(CurrentMetrics::VersionInteger, ClickHouseRevision::getVersionInteger()); @@ -1060,6 +1060,7 @@ try ServerAsynchronousMetrics async_metrics( global_context, server_settings[ServerSetting::asynchronous_metrics_update_period_s], + server_settings[ServerSetting::asynchronous_metrics_enable_heavy_metrics], server_settings[ServerSetting::asynchronous_heavy_metrics_update_period_s], [&]() -> std::vector { @@ -1927,10 +1928,6 @@ try if (global_context->isServerCompletelyStarted()) CannotAllocateThreadFaultInjector::setFaultProbability(new_server_settings[ServerSetting::cannot_allocate_thread_fault_injection_probability]); -#if USE_GWP_ASAN - GWPAsan::setForceSampleProbability(new_server_settings[ServerSetting::gwp_asan_force_sample_probability]); -#endif - ProfileEvents::increment(ProfileEvents::MainConfigLoads); /// Must be the last. @@ -2199,6 +2196,7 @@ try LOG_INFO(log, "Loading metadata from {}", path_str); + LoadTaskPtrs load_system_metadata_tasks; LoadTaskPtrs load_metadata_tasks; // Make sure that if exception is thrown during startup async, new async loading jobs are not going to be called. @@ -2222,12 +2220,8 @@ try auto & database_catalog = DatabaseCatalog::instance(); /// We load temporary database first, because projections need it. database_catalog.initializeAndLoadTemporaryDatabase(); - auto system_startup_tasks = loadMetadataSystem(global_context); - maybeConvertSystemDatabase(global_context, system_startup_tasks); - /// This has to be done before the initialization of system logs, - /// otherwise there is a race condition between the system database initialization - /// and creation of new tables in the database. - waitLoad(TablesLoaderForegroundPoolId, system_startup_tasks); + load_system_metadata_tasks = loadMetadataSystem(global_context, server_settings[ServerSetting::async_load_system_database]); + maybeConvertSystemDatabase(global_context, load_system_metadata_tasks); /// Startup scripts can depend on the system log tables. if (config().has("startup_scripts") && !server_settings[ServerSetting::prepare_system_log_tables_on_startup].changed) @@ -2258,6 +2252,8 @@ try database_catalog.assertDatabaseExists(default_database); /// Load user-defined SQL functions. global_context->getUserDefinedSQLObjectsStorage().loadObjects(); + /// Load WORKLOADs and RESOURCEs. 
+ global_context->getWorkloadEntityStorage().loadEntities(); global_context->getRefreshSet().setRefreshesStopped(false); } @@ -2271,10 +2267,19 @@ try if (has_zookeeper && global_context->getMacros()->getMacroMap().contains("replica")) { - auto zookeeper = global_context->getZooKeeper(); - String stop_flag_path = "/clickhouse/stop_replicated_ddl_queries/{replica}"; - stop_flag_path = global_context->getMacros()->expand(stop_flag_path); - found_stop_flag = zookeeper->exists(stop_flag_path); + try + { + auto zookeeper = global_context->getZooKeeper(); + String stop_flag_path = "/clickhouse/stop_replicated_ddl_queries/{replica}"; + stop_flag_path = global_context->getMacros()->expand(stop_flag_path); + found_stop_flag = zookeeper->exists(stop_flag_path); + } + catch (const Coordination::Exception & e) + { + if (e.code != Coordination::Error::ZCONNECTIONLOSS) + throw; + tryLogCurrentException(log); + } } if (found_stop_flag) @@ -2336,6 +2341,7 @@ try #if USE_SSL CertificateReloader::instance().tryLoad(config()); + CertificateReloader::instance().tryLoadClient(config()); #endif /// Must be done after initialization of `servers`, because async_metrics will access `servers` variable from its thread. @@ -2384,17 +2390,28 @@ try if (has_zookeeper && config().has("distributed_ddl")) { /// DDL worker should be started after all tables were loaded - String ddl_zookeeper_path = config().getString("distributed_ddl.path", "/clickhouse/task_queue/ddl/"); + String ddl_queue_path = config().getString("distributed_ddl.path", "/clickhouse/task_queue/ddl/"); + String ddl_replicas_path = config().getString("distributed_ddl.replicas_path", "/clickhouse/task_queue/replicas/"); int pool_size = config().getInt("distributed_ddl.pool_size", 1); if (pool_size < 1) throw Exception(ErrorCodes::ARGUMENT_OUT_OF_BOUND, "distributed_ddl.pool_size should be greater then 0"); - global_context->setDDLWorker(std::make_unique(pool_size, ddl_zookeeper_path, global_context, &config(), - "distributed_ddl", "DDLWorker", - &CurrentMetrics::MaxDDLEntryID, &CurrentMetrics::MaxPushedDDLEntryID), - load_metadata_tasks); + global_context->setDDLWorker( + std::make_unique( + pool_size, + ddl_queue_path, + ddl_replicas_path, + global_context, + &config(), + "distributed_ddl", + "DDLWorker", + &CurrentMetrics::MaxDDLEntryID, + &CurrentMetrics::MaxPushedDDLEntryID), + joinTasks(load_system_metadata_tasks, load_metadata_tasks)); } /// Do not keep tasks in server, they should be kept inside databases. Used here to make dependent tasks only. + load_system_metadata_tasks.clear(); + load_system_metadata_tasks.shrink_to_fit(); load_metadata_tasks.clear(); load_metadata_tasks.shrink_to_fit(); @@ -2420,7 +2437,6 @@ try #if USE_GWP_ASAN GWPAsan::initFinished(); - GWPAsan::setForceSampleProbability(server_settings[ServerSetting::gwp_asan_force_sample_probability]); #endif try @@ -3014,7 +3030,7 @@ void Server::updateServers( for (auto * server : all_servers) { - if (!server->isStopping()) + if (server->supportsRuntimeReconfiguration() && !server->isStopping()) { std::string port_name = server->getPortName(); bool has_host = false; diff --git a/programs/server/config.xml b/programs/server/config.xml index 28f1f465c71..9807f8c0d5a 100644 --- a/programs/server/config.xml +++ b/programs/server/config.xml @@ -1399,6 +1399,10 @@ If not specified they will be stored locally. 
--> + + + @@ -1450,6 +1454,8 @@ /clickhouse/task_queue/ddl + + /clickhouse/task_queue/replicas diff --git a/src/Access/AuthenticationData.cpp b/src/Access/AuthenticationData.cpp index 57a1cd756ff..37a4e356af8 100644 --- a/src/Access/AuthenticationData.cpp +++ b/src/Access/AuthenticationData.cpp @@ -1,12 +1,16 @@ #include #include #include +#include #include #include #include #include #include #include +#include +#include +#include #include #include @@ -113,7 +117,8 @@ bool operator ==(const AuthenticationData & lhs, const AuthenticationData & rhs) && (lhs.ssh_keys == rhs.ssh_keys) #endif && (lhs.http_auth_scheme == rhs.http_auth_scheme) - && (lhs.http_auth_server_name == rhs.http_auth_server_name); + && (lhs.http_auth_server_name == rhs.http_auth_server_name) + && (lhs.valid_until == rhs.valid_until); } @@ -384,14 +389,34 @@ std::shared_ptr AuthenticationData::toAST() const throw Exception(ErrorCodes::LOGICAL_ERROR, "AST: Unexpected authentication type {}", toString(auth_type)); } + + if (valid_until) + { + WriteBufferFromOwnString out; + writeDateTimeText(valid_until, out); + + node->valid_until = std::make_shared(out.str()); + } + return node; } AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & query, ContextPtr context, bool validate) { + time_t valid_until = 0; + + if (query.valid_until) + { + valid_until = getValidUntilFromAST(query.valid_until, context); + } + if (query.type && query.type == AuthenticationType::NO_PASSWORD) - return AuthenticationData(); + { + AuthenticationData auth_data; + auth_data.setValidUntil(valid_until); + return auth_data; + } /// For this type of authentication we have ASTPublicSSHKey as children for ASTAuthenticationData if (query.type && query.type == AuthenticationType::SSH_KEY) @@ -418,6 +443,7 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que } auth_data.setSSHKeys(std::move(keys)); + auth_data.setValidUntil(valid_until); return auth_data; #else throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "SSH is disabled, because ClickHouse is built without libssh"); @@ -451,6 +477,8 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que AuthenticationData auth_data(current_type); + auth_data.setValidUntil(valid_until); + if (validate) context->getAccessControl().checkPasswordComplexityRules(value); @@ -494,6 +522,7 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que } AuthenticationData auth_data(*query.type); + auth_data.setValidUntil(valid_until); if (query.contains_hash) { diff --git a/src/Access/AuthenticationData.h b/src/Access/AuthenticationData.h index a0c100264f8..2d8d008c925 100644 --- a/src/Access/AuthenticationData.h +++ b/src/Access/AuthenticationData.h @@ -74,6 +74,9 @@ public: const String & getHTTPAuthenticationServerName() const { return http_auth_server_name; } void setHTTPAuthenticationServerName(const String & name) { http_auth_server_name = name; } + time_t getValidUntil() const { return valid_until; } + void setValidUntil(time_t valid_until_) { valid_until = valid_until_; } + friend bool operator ==(const AuthenticationData & lhs, const AuthenticationData & rhs); friend bool operator !=(const AuthenticationData & lhs, const AuthenticationData & rhs) { return !(lhs == rhs); } @@ -106,6 +109,7 @@ private: /// HTTP authentication properties String http_auth_server_name; HTTPAuthenticationScheme http_auth_scheme = HTTPAuthenticationScheme::BASIC; + time_t valid_until = 0; }; } diff --git a/src/Access/Common/AccessType.h 
b/src/Access/Common/AccessType.h index 383e7f70420..ec543104167 100644 --- a/src/Access/Common/AccessType.h +++ b/src/Access/Common/AccessType.h @@ -99,6 +99,8 @@ enum class AccessType : uint8_t M(CREATE_ARBITRARY_TEMPORARY_TABLE, "", GLOBAL, CREATE) /* allows to create and manipulate temporary tables with arbitrary table engine */\ M(CREATE_FUNCTION, "", GLOBAL, CREATE) /* allows to execute CREATE FUNCTION */ \ + M(CREATE_WORKLOAD, "", GLOBAL, CREATE) /* allows to execute CREATE WORKLOAD */ \ + M(CREATE_RESOURCE, "", GLOBAL, CREATE) /* allows to execute CREATE RESOURCE */ \ M(CREATE_NAMED_COLLECTION, "", NAMED_COLLECTION, NAMED_COLLECTION_ADMIN) /* allows to execute CREATE NAMED COLLECTION */ \ M(CREATE, "", GROUP, ALL) /* allows to execute {CREATE|ATTACH} */ \ \ @@ -108,6 +110,8 @@ enum class AccessType : uint8_t implicitly enabled by the grant DROP_TABLE */\ M(DROP_DICTIONARY, "", DICTIONARY, DROP) /* allows to execute {DROP|DETACH} DICTIONARY */\ M(DROP_FUNCTION, "", GLOBAL, DROP) /* allows to execute DROP FUNCTION */\ + M(DROP_WORKLOAD, "", GLOBAL, DROP) /* allows to execute DROP WORKLOAD */\ + M(DROP_RESOURCE, "", GLOBAL, DROP) /* allows to execute DROP RESOURCE */\ M(DROP_NAMED_COLLECTION, "", NAMED_COLLECTION, NAMED_COLLECTION_ADMIN) /* allows to execute DROP NAMED COLLECTION */\ M(DROP, "", GROUP, ALL) /* allows to execute {DROP|DETACH} */\ \ @@ -159,6 +163,7 @@ enum class AccessType : uint8_t M(SYSTEM_SHUTDOWN, "SYSTEM KILL, SHUTDOWN", GLOBAL, SYSTEM) \ M(SYSTEM_DROP_DNS_CACHE, "SYSTEM DROP DNS, DROP DNS CACHE, DROP DNS", GLOBAL, SYSTEM_DROP_CACHE) \ M(SYSTEM_DROP_CONNECTIONS_CACHE, "SYSTEM DROP CONNECTIONS CACHE, DROP CONNECTIONS CACHE", GLOBAL, SYSTEM_DROP_CACHE) \ + M(SYSTEM_PREWARM_MARK_CACHE, "SYSTEM PREWARM MARK, PREWARM MARK CACHE, PREWARM MARKS", GLOBAL, SYSTEM_DROP_CACHE) \ M(SYSTEM_DROP_MARK_CACHE, "SYSTEM DROP MARK, DROP MARK CACHE, DROP MARKS", GLOBAL, SYSTEM_DROP_CACHE) \ M(SYSTEM_DROP_UNCOMPRESSED_CACHE, "SYSTEM DROP UNCOMPRESSED, DROP UNCOMPRESSED CACHE, DROP UNCOMPRESSED", GLOBAL, SYSTEM_DROP_CACHE) \ M(SYSTEM_DROP_MMAP_CACHE, "SYSTEM DROP MMAP, DROP MMAP CACHE, DROP MMAP", GLOBAL, SYSTEM_DROP_CACHE) \ @@ -238,6 +243,9 @@ enum class AccessType : uint8_t M(S3, "", GLOBAL, SOURCES) \ M(HIVE, "", GLOBAL, SOURCES) \ M(AZURE, "", GLOBAL, SOURCES) \ + M(KAFKA, "", GLOBAL, SOURCES) \ + M(NATS, "", GLOBAL, SOURCES) \ + M(RABBITMQ, "", GLOBAL, SOURCES) \ M(SOURCES, "", GROUP, ALL) \ \ M(CLUSTER, "", GLOBAL, ALL) /* ON CLUSTER queries */ \ diff --git a/src/Access/ContextAccess.cpp b/src/Access/ContextAccess.cpp index 949fd37e403..06e89d78339 100644 --- a/src/Access/ContextAccess.cpp +++ b/src/Access/ContextAccess.cpp @@ -52,7 +52,10 @@ namespace {AccessType::HDFS, "HDFS"}, {AccessType::S3, "S3"}, {AccessType::HIVE, "Hive"}, - {AccessType::AZURE, "AzureBlobStorage"} + {AccessType::AZURE, "AzureBlobStorage"}, + {AccessType::KAFKA, "Kafka"}, + {AccessType::NATS, "NATS"}, + {AccessType::RABBITMQ, "RabbitMQ"} }; @@ -701,15 +704,17 @@ bool ContextAccess::checkAccessImplHelper(const ContextPtr & context, AccessFlag const AccessFlags dictionary_ddl = AccessType::CREATE_DICTIONARY | AccessType::DROP_DICTIONARY; const AccessFlags function_ddl = AccessType::CREATE_FUNCTION | AccessType::DROP_FUNCTION; + const AccessFlags workload_ddl = AccessType::CREATE_WORKLOAD | AccessType::DROP_WORKLOAD; + const AccessFlags resource_ddl = AccessType::CREATE_RESOURCE | AccessType::DROP_RESOURCE; const AccessFlags table_and_dictionary_ddl = table_ddl | dictionary_ddl; const AccessFlags 
table_and_dictionary_and_function_ddl = table_ddl | dictionary_ddl | function_ddl; const AccessFlags write_table_access = AccessType::INSERT | AccessType::OPTIMIZE; const AccessFlags write_dcl_access = AccessType::ACCESS_MANAGEMENT - AccessType::SHOW_ACCESS; - const AccessFlags not_readonly_flags = write_table_access | table_and_dictionary_and_function_ddl | write_dcl_access | AccessType::SYSTEM | AccessType::KILL_QUERY; + const AccessFlags not_readonly_flags = write_table_access | table_and_dictionary_and_function_ddl | workload_ddl | resource_ddl | write_dcl_access | AccessType::SYSTEM | AccessType::KILL_QUERY; const AccessFlags not_readonly_1_flags = AccessType::CREATE_TEMPORARY_TABLE; - const AccessFlags ddl_flags = table_ddl | dictionary_ddl | function_ddl; + const AccessFlags ddl_flags = table_ddl | dictionary_ddl | function_ddl | workload_ddl | resource_ddl; const AccessFlags introspection_flags = AccessType::INTROSPECTION; }; static const PrecalculatedFlags precalc; diff --git a/src/Access/Credentials.h b/src/Access/Credentials.h index f220b8d2c48..b21b7e6921f 100644 --- a/src/Access/Credentials.h +++ b/src/Access/Credentials.h @@ -15,6 +15,9 @@ public: explicit Credentials() = default; explicit Credentials(const String & user_name_); + Credentials(const Credentials &) = default; + Credentials(Credentials &&) = default; + virtual ~Credentials() = default; const String & getUserName() const; diff --git a/src/Access/IAccessStorage.cpp b/src/Access/IAccessStorage.cpp index 3249d89ba87..72e0933e214 100644 --- a/src/Access/IAccessStorage.cpp +++ b/src/Access/IAccessStorage.cpp @@ -554,7 +554,7 @@ std::optional IAccessStorage::authenticateImpl( continue; } - if (areCredentialsValid(user->getName(), user->valid_until, auth_method, credentials, external_authenticators, auth_result.settings)) + if (areCredentialsValid(user->getName(), auth_method, credentials, external_authenticators, auth_result.settings)) { auth_result.authentication_data = auth_method; return auth_result; @@ -579,7 +579,6 @@ std::optional IAccessStorage::authenticateImpl( bool IAccessStorage::areCredentialsValid( const std::string & user_name, - time_t valid_until, const AuthenticationData & authentication_method, const Credentials & credentials, const ExternalAuthenticators & external_authenticators, @@ -591,6 +590,7 @@ bool IAccessStorage::areCredentialsValid( if (credentials.getUserName() != user_name) return false; + auto valid_until = authentication_method.getValidUntil(); if (valid_until) { const time_t now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now()); diff --git a/src/Access/IAccessStorage.h b/src/Access/IAccessStorage.h index 84cbdd0a751..4e2b27a1864 100644 --- a/src/Access/IAccessStorage.h +++ b/src/Access/IAccessStorage.h @@ -236,7 +236,6 @@ protected: bool allow_plaintext_password) const; virtual bool areCredentialsValid( const std::string & user_name, - time_t valid_until, const AuthenticationData & authentication_method, const Credentials & credentials, const ExternalAuthenticators & external_authenticators, diff --git a/src/Access/User.cpp b/src/Access/User.cpp index 887abc213f9..1c92f467003 100644 --- a/src/Access/User.cpp +++ b/src/Access/User.cpp @@ -19,8 +19,7 @@ bool User::equal(const IAccessEntity & other) const return (authentication_methods == other_user.authentication_methods) && (allowed_client_hosts == other_user.allowed_client_hosts) && (access == other_user.access) && (granted_roles == other_user.granted_roles) && (default_roles == other_user.default_roles) - && 
(settings == other_user.settings) && (grantees == other_user.grantees) && (default_database == other_user.default_database) - && (valid_until == other_user.valid_until); + && (settings == other_user.settings) && (grantees == other_user.grantees) && (default_database == other_user.default_database); } void User::setName(const String & name_) @@ -88,7 +87,6 @@ void User::clearAllExceptDependencies() access = {}; settings.removeSettingsKeepProfiles(); default_database = {}; - valid_until = 0; } } diff --git a/src/Access/User.h b/src/Access/User.h index 03d62bf2277..f54e74a305d 100644 --- a/src/Access/User.h +++ b/src/Access/User.h @@ -23,7 +23,6 @@ struct User : public IAccessEntity SettingsProfileElements settings; RolesOrUsersSet grantees = RolesOrUsersSet::AllTag{}; String default_database; - time_t valid_until = 0; bool equal(const IAccessEntity & other) const override; std::shared_ptr clone() const override { return cloneImpl(); } diff --git a/src/AggregateFunctions/AggregateFunctionGroupArraySorted.cpp b/src/AggregateFunctions/AggregateFunctionGroupArraySorted.cpp index 86f7661e53f..061a1e519e1 100644 --- a/src/AggregateFunctions/AggregateFunctionGroupArraySorted.cpp +++ b/src/AggregateFunctions/AggregateFunctionGroupArraySorted.cpp @@ -59,13 +59,13 @@ constexpr size_t group_array_sorted_sort_strategy_max_elements_threshold = 10000 template struct GroupArraySortedData { + static constexpr bool is_value_generic_field = std::is_same_v; + using Allocator = MixedAlignedArenaAllocator; - using Array = PODArray; + using Array = typename std::conditional_t, PODArray>; static constexpr size_t partial_sort_max_elements_factor = 2; - static constexpr bool is_value_generic_field = std::is_same_v; - Array values; static bool compare(const T & lhs, const T & rhs) @@ -144,7 +144,7 @@ struct GroupArraySortedData } if (values.size() > max_elements) - values.resize(max_elements, arena); + resize(max_elements, arena); } ALWAYS_INLINE void partialSortAndLimitIfNeeded(size_t max_elements, Arena * arena) @@ -153,7 +153,23 @@ struct GroupArraySortedData return; ::nth_element(values.begin(), values.begin() + max_elements, values.end(), Comparator()); - values.resize(max_elements, arena); + resize(max_elements, arena); + } + + ALWAYS_INLINE void resize(size_t n, Arena * arena) + { + if constexpr (is_value_generic_field) + values.resize(n); + else + values.resize(n, arena); + } + + ALWAYS_INLINE void push_back(T && element, Arena * arena) + { + if constexpr (is_value_generic_field) + values.push_back(element); + else + values.push_back(element, arena); } ALWAYS_INLINE void addElement(T && element, size_t max_elements, Arena * arena) @@ -171,12 +187,12 @@ struct GroupArraySortedData return; } - values.push_back(std::move(element), arena); + push_back(std::move(element), arena); std::push_heap(values.begin(), values.end(), Comparator()); } else { - values.push_back(std::move(element), arena); + push_back(std::move(element), arena); partialSortAndLimitIfNeeded(max_elements, arena); } } @@ -210,14 +226,6 @@ struct GroupArraySortedData result_array_data[result_array_data_insert_begin + i] = values[i]; } } - - ~GroupArraySortedData() - { - for (auto & value : values) - { - value.~T(); - } - } }; template @@ -313,14 +321,12 @@ public: throw Exception(ErrorCodes::TOO_LARGE_ARRAY_SIZE, "Too large array size, it should not exceed {}", max_elements); auto & values = this->data(place).values; - values.resize_exact(size, arena); - if constexpr (std::is_same_v) + if constexpr (Data::is_value_generic_field) { + 
values.resize(size); for (Field & element : values) { - /// We must initialize the Field type since some internal functions (like operator=) use them - new (&element) Field; bool has_value = false; readBinary(has_value, buf); if (has_value) @@ -329,6 +335,7 @@ public: } else { + values.resize_exact(size, arena); if constexpr (std::endian::native == std::endian::little) { buf.readStrict(reinterpret_cast(values.data()), size * sizeof(values[0])); diff --git a/src/Analyzer/Resolve/QueryAnalyzer.cpp b/src/Analyzer/Resolve/QueryAnalyzer.cpp index 381edee607d..cb3087af707 100644 --- a/src/Analyzer/Resolve/QueryAnalyzer.cpp +++ b/src/Analyzer/Resolve/QueryAnalyzer.cpp @@ -227,8 +227,13 @@ void QueryAnalyzer::resolveConstantExpression(QueryTreeNodePtr & node, const Que scope.context = context; auto node_type = node->getNodeType(); + if (node_type == QueryTreeNodeType::QUERY || node_type == QueryTreeNodeType::UNION) + { + evaluateScalarSubqueryIfNeeded(node, scope); + return; + } - if (table_expression && node_type != QueryTreeNodeType::QUERY && node_type != QueryTreeNodeType::UNION) + if (table_expression) { scope.expression_join_tree_node = table_expression; validateTableExpressionModifiers(scope.expression_join_tree_node, scope); diff --git a/src/Backups/BackupConcurrencyCheck.cpp b/src/Backups/BackupConcurrencyCheck.cpp new file mode 100644 index 00000000000..8b29ae41b53 --- /dev/null +++ b/src/Backups/BackupConcurrencyCheck.cpp @@ -0,0 +1,135 @@ +#include + +#include +#include + + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int CONCURRENT_ACCESS_NOT_SUPPORTED; +} + + +BackupConcurrencyCheck::BackupConcurrencyCheck( + const UUID & backup_or_restore_uuid_, + bool is_restore_, + bool on_cluster_, + bool allow_concurrency_, + BackupConcurrencyCounters & counters_) + : is_restore(is_restore_), backup_or_restore_uuid(backup_or_restore_uuid_), on_cluster(on_cluster_), counters(counters_) +{ + std::lock_guard lock{counters.mutex}; + + if (!allow_concurrency_) + { + bool found_concurrent_operation = false; + if (is_restore) + { + size_t num_local_restores = counters.local_restores; + size_t num_on_cluster_restores = counters.on_cluster_restores.size(); + if (on_cluster) + { + if (!counters.on_cluster_restores.contains(backup_or_restore_uuid)) + ++num_on_cluster_restores; + } + else + { + ++num_local_restores; + } + found_concurrent_operation = (num_local_restores + num_on_cluster_restores > 1); + } + else + { + size_t num_local_backups = counters.local_backups; + size_t num_on_cluster_backups = counters.on_cluster_backups.size(); + if (on_cluster) + { + if (!counters.on_cluster_backups.contains(backup_or_restore_uuid)) + ++num_on_cluster_backups; + } + else + { + ++num_local_backups; + } + found_concurrent_operation = (num_local_backups + num_on_cluster_backups > 1); + } + + if (found_concurrent_operation) + throwConcurrentOperationNotAllowed(is_restore); + } + + if (on_cluster) + { + if (is_restore) + ++counters.on_cluster_restores[backup_or_restore_uuid]; + else + ++counters.on_cluster_backups[backup_or_restore_uuid]; + } + else + { + if (is_restore) + ++counters.local_restores; + else + ++counters.local_backups; + } +} + + +BackupConcurrencyCheck::~BackupConcurrencyCheck() +{ + std::lock_guard lock{counters.mutex}; + + if (on_cluster) + { + if (is_restore) + { + auto it = counters.on_cluster_restores.find(backup_or_restore_uuid); + if (it != counters.on_cluster_restores.end()) + { + if (!--it->second) + counters.on_cluster_restores.erase(it); + } + } + else + { + auto it = 
counters.on_cluster_backups.find(backup_or_restore_uuid); + if (it != counters.on_cluster_backups.end()) + { + if (!--it->second) + counters.on_cluster_backups.erase(it); + } + } + } + else + { + if (is_restore) + --counters.local_restores; + else + --counters.local_backups; + } +} + + +void BackupConcurrencyCheck::throwConcurrentOperationNotAllowed(bool is_restore) +{ + throw Exception( + ErrorCodes::CONCURRENT_ACCESS_NOT_SUPPORTED, + "Concurrent {} are not allowed, turn on setting '{}'", + is_restore ? "restores" : "backups", + is_restore ? "allow_concurrent_restores" : "allow_concurrent_backups"); +} + + +BackupConcurrencyCounters::BackupConcurrencyCounters() = default; + + +BackupConcurrencyCounters::~BackupConcurrencyCounters() +{ + if (local_backups > 0 || local_restores > 0 || !on_cluster_backups.empty() || !on_cluster_restores.empty()) + LOG_ERROR(getLogger(__PRETTY_FUNCTION__), "Some backups or restores are processing"); +} + +} diff --git a/src/Backups/BackupConcurrencyCheck.h b/src/Backups/BackupConcurrencyCheck.h new file mode 100644 index 00000000000..048a23a716a --- /dev/null +++ b/src/Backups/BackupConcurrencyCheck.h @@ -0,0 +1,55 @@ +#pragma once + +#include +#include +#include +#include + + +namespace DB +{ +class BackupConcurrencyCounters; + +/// Local checker for concurrent BACKUP or RESTORE operations. +/// This class is used by implementations of IBackupCoordination and IRestoreCoordination +/// to throw an exception if concurrent backups or restores are not allowed. +class BackupConcurrencyCheck +{ +public: + /// Checks concurrency of a BACKUP operation or a RESTORE operation. + /// Keep a constructed instance of BackupConcurrencyCheck until the operation is done. + BackupConcurrencyCheck( + const UUID & backup_or_restore_uuid_, + bool is_restore_, + bool on_cluster_, + bool allow_concurrency_, + BackupConcurrencyCounters & counters_); + + ~BackupConcurrencyCheck(); + + [[noreturn]] static void throwConcurrentOperationNotAllowed(bool is_restore); + +private: + const bool is_restore; + const UUID backup_or_restore_uuid; + const bool on_cluster; + BackupConcurrencyCounters & counters; +}; + + +class BackupConcurrencyCounters +{ +public: + BackupConcurrencyCounters(); + ~BackupConcurrencyCounters(); + +private: + friend class BackupConcurrencyCheck; + size_t local_backups TSA_GUARDED_BY(mutex) = 0; + size_t local_restores TSA_GUARDED_BY(mutex) = 0; + std::unordered_map on_cluster_backups TSA_GUARDED_BY(mutex); + std::unordered_map on_cluster_restores TSA_GUARDED_BY(mutex); + std::mutex mutex; +}; + +} diff --git a/src/Backups/BackupCoordinationCleaner.cpp b/src/Backups/BackupCoordinationCleaner.cpp new file mode 100644 index 00000000000..1f5068a94de --- /dev/null +++ b/src/Backups/BackupCoordinationCleaner.cpp @@ -0,0 +1,64 @@ +#include + + +namespace DB +{ + +BackupCoordinationCleaner::BackupCoordinationCleaner(const String & zookeeper_path_, const WithRetries & with_retries_, LoggerPtr log_) + : zookeeper_path(zookeeper_path_), with_retries(with_retries_), log(log_) +{ +} + +void BackupCoordinationCleaner::cleanup() +{ + tryRemoveAllNodes(/* throw_if_error = */ true, /* retries_kind = */ WithRetries::kNormal); +} + +bool BackupCoordinationCleaner::tryCleanupAfterError() noexcept +{ + return tryRemoveAllNodes(/* throw_if_error = */ false, /* retries_kind = */ WithRetries::kNormal); +} + +bool BackupCoordinationCleaner::tryRemoveAllNodes(bool throw_if_error, WithRetries::Kind retries_kind) +{ + { + std::lock_guard lock{mutex}; + if (cleanup_result.succeeded) + return 
true; + if (cleanup_result.exception) + { + if (throw_if_error) + std::rethrow_exception(cleanup_result.exception); + return false; + } + } + + try + { + LOG_TRACE(log, "Removing nodes from ZooKeeper"); + auto holder = with_retries.createRetriesControlHolder("removeAllNodes", retries_kind); + holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zookeeper); + zookeeper->removeRecursive(zookeeper_path); + }); + + std::lock_guard lock{mutex}; + cleanup_result.succeeded = true; + return true; + } + catch (...) + { + LOG_TRACE(log, "Caught exception while removing nodes from ZooKeeper for this restore: {}", + getCurrentExceptionMessage(/* with_stacktrace= */ false, /* check_embedded_stacktrace= */ true)); + + std::lock_guard lock{mutex}; + cleanup_result.exception = std::current_exception(); + + if (throw_if_error) + throw; + return false; + } +} + +} diff --git a/src/Backups/BackupCoordinationCleaner.h b/src/Backups/BackupCoordinationCleaner.h new file mode 100644 index 00000000000..43e095d9f33 --- /dev/null +++ b/src/Backups/BackupCoordinationCleaner.h @@ -0,0 +1,40 @@ +#pragma once + +#include + + +namespace DB +{ + +/// Removes all the nodes from ZooKeeper used to coordinate a BACKUP ON CLUSTER operation or +/// a RESTORE ON CLUSTER operation (successful or not). +/// This class is used by BackupCoordinationOnCluster and RestoreCoordinationOnCluster to cleanup. +class BackupCoordinationCleaner +{ +public: + BackupCoordinationCleaner(const String & zookeeper_path_, const WithRetries & with_retries_, LoggerPtr log_); + + void cleanup(); + bool tryCleanupAfterError() noexcept; + +private: + bool tryRemoveAllNodes(bool throw_if_error, WithRetries::Kind retries_kind); + + const String zookeeper_path; + + /// A reference to a field of the parent object which is either BackupCoordinationOnCluster or RestoreCoordinationOnCluster. 
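// Illustrative sketch, not the real BackupCoordinationCleaner: tryRemoveAllNodes() above runs
// the removal at most once and caches the outcome, so cleanup() and tryCleanupAfterError() can
// both be called (even repeatedly) and later calls just replay the stored result. Simplified
// here to hold one lock around the whole attempt; the real code releases it during the
// ZooKeeper call and goes through the retries helper.
#include <exception>
#include <mutex>

class OneShotCleanupSketch
{
public:
    /// Returns true on success; rethrows a cached failure only when throw_if_error is set.
    bool run(bool throw_if_error)
    {
        std::lock_guard lock{mutex};
        if (succeeded)
            return true;
        if (failure)
        {
            if (throw_if_error)
                std::rethrow_exception(failure);
            return false;
        }
        try
        {
            doCleanup(); // stand-in for removing the coordination nodes recursively
            succeeded = true;
            return true;
        }
        catch (...)
        {
            failure = std::current_exception();
            if (throw_if_error)
                throw;
            return false;
        }
    }

private:
    void doCleanup() {} // e.g. zookeeper->removeRecursive(zookeeper_path) in the real code

    std::mutex mutex;
    bool succeeded = false;
    std::exception_ptr failure;
};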
+ const WithRetries & with_retries; + + const LoggerPtr log; + + struct CleanupResult + { + bool succeeded = false; + std::exception_ptr exception; + }; + CleanupResult cleanup_result TSA_GUARDED_BY(mutex); + + std::mutex mutex; +}; + +} diff --git a/src/Backups/BackupCoordinationLocal.cpp b/src/Backups/BackupCoordinationLocal.cpp index efdc18cc29c..8bd6b4d327d 100644 --- a/src/Backups/BackupCoordinationLocal.cpp +++ b/src/Backups/BackupCoordinationLocal.cpp @@ -1,5 +1,7 @@ #include + #include +#include #include #include #include @@ -8,27 +10,20 @@ namespace DB { -BackupCoordinationLocal::BackupCoordinationLocal(bool plain_backup_) - : log(getLogger("BackupCoordinationLocal")), file_infos(plain_backup_) +BackupCoordinationLocal::BackupCoordinationLocal( + const UUID & backup_uuid_, + bool is_plain_backup_, + bool allow_concurrent_backup_, + BackupConcurrencyCounters & concurrency_counters_) + : log(getLogger("BackupCoordinationLocal")) + , concurrency_check(backup_uuid_, /* is_restore = */ false, /* on_cluster = */ false, allow_concurrent_backup_, concurrency_counters_) + , file_infos(is_plain_backup_) { } BackupCoordinationLocal::~BackupCoordinationLocal() = default; -void BackupCoordinationLocal::setStage(const String &, const String &) -{ -} - -void BackupCoordinationLocal::setError(const Exception &) -{ -} - -Strings BackupCoordinationLocal::waitForStage(const String &) -{ - return {}; -} - -Strings BackupCoordinationLocal::waitForStage(const String &, std::chrono::milliseconds) +ZooKeeperRetriesInfo BackupCoordinationLocal::getOnClusterInitializationKeeperRetriesInfo() const { return {}; } @@ -135,15 +130,4 @@ bool BackupCoordinationLocal::startWritingFile(size_t data_file_index) return writing_files.emplace(data_file_index).second; } - -bool BackupCoordinationLocal::hasConcurrentBackups(const std::atomic & num_active_backups) const -{ - if (num_active_backups > 1) - { - LOG_WARNING(log, "Found concurrent backups: num_active_backups={}", num_active_backups); - return true; - } - return false; -} - } diff --git a/src/Backups/BackupCoordinationLocal.h b/src/Backups/BackupCoordinationLocal.h index a7f15c79649..09991c0d301 100644 --- a/src/Backups/BackupCoordinationLocal.h +++ b/src/Backups/BackupCoordinationLocal.h @@ -1,6 +1,7 @@ #pragma once #include +#include #include #include #include @@ -21,13 +22,21 @@ namespace DB class BackupCoordinationLocal : public IBackupCoordination { public: - explicit BackupCoordinationLocal(bool plain_backup_); + explicit BackupCoordinationLocal( + const UUID & backup_uuid_, + bool is_plain_backup_, + bool allow_concurrent_backup_, + BackupConcurrencyCounters & concurrency_counters_); + ~BackupCoordinationLocal() override; - void setStage(const String & new_stage, const String & message) override; - void setError(const Exception & exception) override; - Strings waitForStage(const String & stage_to_wait) override; - Strings waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) override; + Strings setStage(const String &, const String &, bool) override { return {}; } + void setBackupQueryWasSentToOtherHosts() override {} + bool trySetError(std::exception_ptr) override { return true; } + void finish() override {} + bool tryFinishAfterError() noexcept override { return true; } + void waitForOtherHostsToFinish() override {} + bool tryWaitForOtherHostsToFinishAfterError() noexcept override { return true; } void addReplicatedPartNames(const String & table_zk_path, const String & table_name_for_logs, const String & replica_name, const 
std::vector & part_names_and_checksums) override; @@ -54,17 +63,18 @@ public: BackupFileInfos getFileInfosForAllHosts() const override; bool startWritingFile(size_t data_file_index) override; - bool hasConcurrentBackups(const std::atomic & num_active_backups) const override; + ZooKeeperRetriesInfo getOnClusterInitializationKeeperRetriesInfo() const override; private: LoggerPtr const log; + BackupConcurrencyCheck concurrency_check; - BackupCoordinationReplicatedTables TSA_GUARDED_BY(replicated_tables_mutex) replicated_tables; - BackupCoordinationReplicatedAccess TSA_GUARDED_BY(replicated_access_mutex) replicated_access; - BackupCoordinationReplicatedSQLObjects TSA_GUARDED_BY(replicated_sql_objects_mutex) replicated_sql_objects; - BackupCoordinationFileInfos TSA_GUARDED_BY(file_infos_mutex) file_infos; + BackupCoordinationReplicatedTables replicated_tables TSA_GUARDED_BY(replicated_tables_mutex); + BackupCoordinationReplicatedAccess replicated_access TSA_GUARDED_BY(replicated_access_mutex); + BackupCoordinationReplicatedSQLObjects replicated_sql_objects TSA_GUARDED_BY(replicated_sql_objects_mutex); + BackupCoordinationFileInfos file_infos TSA_GUARDED_BY(file_infos_mutex); BackupCoordinationKeeperMapTables keeper_map_tables TSA_GUARDED_BY(keeper_map_tables_mutex); - std::unordered_set TSA_GUARDED_BY(writing_files_mutex) writing_files; + std::unordered_set writing_files TSA_GUARDED_BY(writing_files_mutex); mutable std::mutex replicated_tables_mutex; mutable std::mutex replicated_access_mutex; diff --git a/src/Backups/BackupCoordinationRemote.cpp b/src/Backups/BackupCoordinationOnCluster.cpp similarity index 73% rename from src/Backups/BackupCoordinationRemote.cpp rename to src/Backups/BackupCoordinationOnCluster.cpp index a60ac0c636f..dc34939f805 100644 --- a/src/Backups/BackupCoordinationRemote.cpp +++ b/src/Backups/BackupCoordinationOnCluster.cpp @@ -1,7 +1,4 @@ -#include - -#include -#include +#include #include #include @@ -26,8 +23,6 @@ namespace ErrorCodes extern const int LOGICAL_ERROR; } -namespace Stage = BackupCoordinationStage; - namespace { using PartNameAndChecksum = IBackupCoordination::PartNameAndChecksum; @@ -149,144 +144,152 @@ namespace }; } -size_t BackupCoordinationRemote::findCurrentHostIndex(const Strings & all_hosts, const String & current_host) +Strings BackupCoordinationOnCluster::excludeInitiator(const Strings & all_hosts) +{ + Strings all_hosts_without_initiator = all_hosts; + bool has_initiator = (std::erase(all_hosts_without_initiator, kInitiator) > 0); + chassert(has_initiator); + return all_hosts_without_initiator; +} + +size_t BackupCoordinationOnCluster::findCurrentHostIndex(const String & current_host, const Strings & all_hosts) { auto it = std::find(all_hosts.begin(), all_hosts.end(), current_host); if (it == all_hosts.end()) - return 0; + return all_hosts.size(); return it - all_hosts.begin(); } -BackupCoordinationRemote::BackupCoordinationRemote( - zkutil::GetZooKeeper get_zookeeper_, + +BackupCoordinationOnCluster::BackupCoordinationOnCluster( + const UUID & backup_uuid_, + bool is_plain_backup_, const String & root_zookeeper_path_, + zkutil::GetZooKeeper get_zookeeper_, const BackupKeeperSettings & keeper_settings_, - const String & backup_uuid_, - const Strings & all_hosts_, const String & current_host_, - bool plain_backup_, - bool is_internal_, + const Strings & all_hosts_, + bool allow_concurrent_backup_, + BackupConcurrencyCounters & concurrency_counters_, + ThreadPoolCallbackRunnerUnsafe schedule_, QueryStatusPtr process_list_element_) : 
root_zookeeper_path(root_zookeeper_path_) - , zookeeper_path(root_zookeeper_path_ + "/backup-" + backup_uuid_) + , zookeeper_path(root_zookeeper_path_ + "/backup-" + toString(backup_uuid_)) , keeper_settings(keeper_settings_) , backup_uuid(backup_uuid_) , all_hosts(all_hosts_) + , all_hosts_without_initiator(excludeInitiator(all_hosts)) , current_host(current_host_) - , current_host_index(findCurrentHostIndex(all_hosts, current_host)) - , plain_backup(plain_backup_) - , is_internal(is_internal_) - , log(getLogger("BackupCoordinationRemote")) - , with_retries( - log, - get_zookeeper_, - keeper_settings, - process_list_element_, - [my_zookeeper_path = zookeeper_path, my_current_host = current_host, my_is_internal = is_internal] - (WithRetries::FaultyKeeper & zk) - { - /// Recreate this ephemeral node to signal that we are alive. - if (my_is_internal) - { - String alive_node_path = my_zookeeper_path + "/stage/alive|" + my_current_host; - - /// Delete the ephemeral node from the previous connection so we don't have to wait for keeper to do it automatically. - zk->tryRemove(alive_node_path); - - zk->createAncestors(alive_node_path); - zk->create(alive_node_path, "", zkutil::CreateMode::Ephemeral); - } - }) + , current_host_index(findCurrentHostIndex(current_host, all_hosts)) + , plain_backup(is_plain_backup_) + , log(getLogger("BackupCoordinationOnCluster")) + , with_retries(log, get_zookeeper_, keeper_settings, process_list_element_, [root_zookeeper_path_](Coordination::ZooKeeperWithFaultInjection::Ptr zk) { zk->sync(root_zookeeper_path_); }) + , concurrency_check(backup_uuid_, /* is_restore = */ false, /* on_cluster = */ true, allow_concurrent_backup_, concurrency_counters_) + , stage_sync(/* is_restore = */ false, fs::path{zookeeper_path} / "stage", current_host, all_hosts, allow_concurrent_backup_, with_retries, schedule_, process_list_element_, log) + , cleaner(zookeeper_path, with_retries, log) { createRootNodes(); - - stage_sync.emplace( - zookeeper_path, - with_retries, - log); } -BackupCoordinationRemote::~BackupCoordinationRemote() +BackupCoordinationOnCluster::~BackupCoordinationOnCluster() { - try - { - if (!is_internal) - removeAllNodes(); - } - catch (...) 
- { - tryLogCurrentException(__PRETTY_FUNCTION__); - } + tryFinishImpl(); } -void BackupCoordinationRemote::createRootNodes() +void BackupCoordinationOnCluster::createRootNodes() { - auto holder = with_retries.createRetriesControlHolder("createRootNodes"); + auto holder = with_retries.createRetriesControlHolder("createRootNodes", WithRetries::kInitialization); holder.retries_ctl.retryLoop( [&, &zk = holder.faulty_zookeeper]() { with_retries.renewZooKeeper(zk); zk->createAncestors(zookeeper_path); - - Coordination::Requests ops; - Coordination::Responses responses; - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path, "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_part_names", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_mutations", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_data_paths", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_access", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_sql_objects", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/keeper_map_tables", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/file_infos", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/writing_files", "", zkutil::CreateMode::Persistent)); - zk->tryMulti(ops, responses); + zk->createIfNotExists(zookeeper_path, ""); + zk->createIfNotExists(zookeeper_path + "/repl_part_names", ""); + zk->createIfNotExists(zookeeper_path + "/repl_mutations", ""); + zk->createIfNotExists(zookeeper_path + "/repl_data_paths", ""); + zk->createIfNotExists(zookeeper_path + "/repl_access", ""); + zk->createIfNotExists(zookeeper_path + "/repl_sql_objects", ""); + zk->createIfNotExists(zookeeper_path + "/keeper_map_tables", ""); + zk->createIfNotExists(zookeeper_path + "/file_infos", ""); + zk->createIfNotExists(zookeeper_path + "/writing_files", ""); }); } -void BackupCoordinationRemote::removeAllNodes() +Strings BackupCoordinationOnCluster::setStage(const String & new_stage, const String & message, bool sync) { - auto holder = with_retries.createRetriesControlHolder("removeAllNodes"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() + stage_sync.setStage(new_stage, message); + + if (!sync) + return {}; + + return stage_sync.waitForHostsToReachStage(new_stage, all_hosts_without_initiator); +} + +void BackupCoordinationOnCluster::setBackupQueryWasSentToOtherHosts() +{ + backup_query_was_sent_to_other_hosts = true; +} + +bool BackupCoordinationOnCluster::trySetError(std::exception_ptr exception) +{ + return stage_sync.trySetError(exception); +} + +void BackupCoordinationOnCluster::finish() +{ + bool other_hosts_also_finished = false; + stage_sync.finish(other_hosts_also_finished); + + if ((current_host == kInitiator) && (other_hosts_also_finished || !backup_query_was_sent_to_other_hosts)) + cleaner.cleanup(); +} + +bool BackupCoordinationOnCluster::tryFinishAfterError() noexcept +{ + return tryFinishImpl(); +} + +bool BackupCoordinationOnCluster::tryFinishImpl() noexcept +{ + bool other_hosts_also_finished = false; + if (!stage_sync.tryFinishAfterError(other_hosts_also_finished)) + return false; + + if ((current_host == kInitiator) && 
(other_hosts_also_finished || !backup_query_was_sent_to_other_hosts)) { - /// Usually this function is called by the initiator when a backup is complete so we don't need the coordination anymore. - /// - /// However there can be a rare situation when this function is called after an error occurs on the initiator of a query - /// while some hosts are still making the backup. Removing all the nodes will remove the parent node of the backup coordination - /// at `zookeeper_path` which might cause such hosts to stop with exception "ZNONODE". Or such hosts might still do some useless part - /// of their backup work before that. Anyway in this case backup won't be finalized (because only an initiator can do that). - with_retries.renewZooKeeper(zk); - zk->removeRecursive(zookeeper_path); - }); + if (!cleaner.tryCleanupAfterError()) + return false; + } + + return true; } - -void BackupCoordinationRemote::setStage(const String & new_stage, const String & message) +void BackupCoordinationOnCluster::waitForOtherHostsToFinish() { - if (is_internal) - stage_sync->set(current_host, new_stage, message); - else - stage_sync->set(current_host, new_stage, /* message */ "", /* all_hosts */ true); + if ((current_host != kInitiator) || !backup_query_was_sent_to_other_hosts) + return; + stage_sync.waitForOtherHostsToFinish(); } -void BackupCoordinationRemote::setError(const Exception & exception) +bool BackupCoordinationOnCluster::tryWaitForOtherHostsToFinishAfterError() noexcept { - stage_sync->setError(current_host, exception); + if (current_host != kInitiator) + return false; + if (!backup_query_was_sent_to_other_hosts) + return true; + return stage_sync.tryWaitForOtherHostsToFinishAfterError(); } -Strings BackupCoordinationRemote::waitForStage(const String & stage_to_wait) +ZooKeeperRetriesInfo BackupCoordinationOnCluster::getOnClusterInitializationKeeperRetriesInfo() const { - return stage_sync->wait(all_hosts, stage_to_wait); + return ZooKeeperRetriesInfo{keeper_settings.max_retries_while_initializing, + static_cast(keeper_settings.retry_initial_backoff_ms.count()), + static_cast(keeper_settings.retry_max_backoff_ms.count())}; } -Strings BackupCoordinationRemote::waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) -{ - return stage_sync->waitFor(all_hosts, stage_to_wait, timeout); -} - - -void BackupCoordinationRemote::serializeToMultipleZooKeeperNodes(const String & path, const String & value, const String & logging_name) +void BackupCoordinationOnCluster::serializeToMultipleZooKeeperNodes(const String & path, const String & value, const String & logging_name) { { auto holder = with_retries.createRetriesControlHolder(logging_name + "::create"); @@ -301,7 +304,7 @@ void BackupCoordinationRemote::serializeToMultipleZooKeeperNodes(const String & if (value.empty()) return; - size_t max_part_size = keeper_settings.keeper_value_max_size; + size_t max_part_size = keeper_settings.value_max_size; if (!max_part_size) max_part_size = value.size(); @@ -324,7 +327,7 @@ void BackupCoordinationRemote::serializeToMultipleZooKeeperNodes(const String & } } -String BackupCoordinationRemote::deserializeFromMultipleZooKeeperNodes(const String & path, const String & logging_name) const +String BackupCoordinationOnCluster::deserializeFromMultipleZooKeeperNodes(const String & path, const String & logging_name) const { Strings part_names; @@ -357,7 +360,7 @@ String BackupCoordinationRemote::deserializeFromMultipleZooKeeperNodes(const Str } -void BackupCoordinationRemote::addReplicatedPartNames( +void 
BackupCoordinationOnCluster::addReplicatedPartNames( const String & table_zk_path, const String & table_name_for_logs, const String & replica_name, @@ -381,14 +384,14 @@ void BackupCoordinationRemote::addReplicatedPartNames( }); } -Strings BackupCoordinationRemote::getReplicatedPartNames(const String & table_zk_path, const String & replica_name) const +Strings BackupCoordinationOnCluster::getReplicatedPartNames(const String & table_zk_path, const String & replica_name) const { std::lock_guard lock{replicated_tables_mutex}; prepareReplicatedTables(); return replicated_tables->getPartNames(table_zk_path, replica_name); } -void BackupCoordinationRemote::addReplicatedMutations( +void BackupCoordinationOnCluster::addReplicatedMutations( const String & table_zk_path, const String & table_name_for_logs, const String & replica_name, @@ -412,7 +415,7 @@ void BackupCoordinationRemote::addReplicatedMutations( }); } -std::vector BackupCoordinationRemote::getReplicatedMutations(const String & table_zk_path, const String & replica_name) const +std::vector BackupCoordinationOnCluster::getReplicatedMutations(const String & table_zk_path, const String & replica_name) const { std::lock_guard lock{replicated_tables_mutex}; prepareReplicatedTables(); @@ -420,7 +423,7 @@ std::vector BackupCoordinationRemote::getRepl } -void BackupCoordinationRemote::addReplicatedDataPath( +void BackupCoordinationOnCluster::addReplicatedDataPath( const String & table_zk_path, const String & data_path) { { @@ -441,7 +444,7 @@ void BackupCoordinationRemote::addReplicatedDataPath( }); } -Strings BackupCoordinationRemote::getReplicatedDataPaths(const String & table_zk_path) const +Strings BackupCoordinationOnCluster::getReplicatedDataPaths(const String & table_zk_path) const { std::lock_guard lock{replicated_tables_mutex}; prepareReplicatedTables(); @@ -449,7 +452,7 @@ Strings BackupCoordinationRemote::getReplicatedDataPaths(const String & table_zk } -void BackupCoordinationRemote::prepareReplicatedTables() const +void BackupCoordinationOnCluster::prepareReplicatedTables() const { if (replicated_tables) return; @@ -536,7 +539,7 @@ void BackupCoordinationRemote::prepareReplicatedTables() const replicated_tables->addDataPath(std::move(data_paths)); } -void BackupCoordinationRemote::addReplicatedAccessFilePath(const String & access_zk_path, AccessEntityType access_entity_type, const String & file_path) +void BackupCoordinationOnCluster::addReplicatedAccessFilePath(const String & access_zk_path, AccessEntityType access_entity_type, const String & file_path) { { std::lock_guard lock{replicated_access_mutex}; @@ -558,14 +561,14 @@ void BackupCoordinationRemote::addReplicatedAccessFilePath(const String & access }); } -Strings BackupCoordinationRemote::getReplicatedAccessFilePaths(const String & access_zk_path, AccessEntityType access_entity_type) const +Strings BackupCoordinationOnCluster::getReplicatedAccessFilePaths(const String & access_zk_path, AccessEntityType access_entity_type) const { std::lock_guard lock{replicated_access_mutex}; prepareReplicatedAccess(); return replicated_access->getFilePaths(access_zk_path, access_entity_type, current_host); } -void BackupCoordinationRemote::prepareReplicatedAccess() const +void BackupCoordinationOnCluster::prepareReplicatedAccess() const { if (replicated_access) return; @@ -601,7 +604,7 @@ void BackupCoordinationRemote::prepareReplicatedAccess() const replicated_access->addFilePath(std::move(file_path)); } -void BackupCoordinationRemote::addReplicatedSQLObjectsDir(const String & 
loader_zk_path, UserDefinedSQLObjectType object_type, const String & dir_path) +void BackupCoordinationOnCluster::addReplicatedSQLObjectsDir(const String & loader_zk_path, UserDefinedSQLObjectType object_type, const String & dir_path) { { std::lock_guard lock{replicated_sql_objects_mutex}; @@ -631,14 +634,14 @@ void BackupCoordinationRemote::addReplicatedSQLObjectsDir(const String & loader_ }); } -Strings BackupCoordinationRemote::getReplicatedSQLObjectsDirs(const String & loader_zk_path, UserDefinedSQLObjectType object_type) const +Strings BackupCoordinationOnCluster::getReplicatedSQLObjectsDirs(const String & loader_zk_path, UserDefinedSQLObjectType object_type) const { std::lock_guard lock{replicated_sql_objects_mutex}; prepareReplicatedSQLObjects(); return replicated_sql_objects->getDirectories(loader_zk_path, object_type, current_host); } -void BackupCoordinationRemote::prepareReplicatedSQLObjects() const +void BackupCoordinationOnCluster::prepareReplicatedSQLObjects() const { if (replicated_sql_objects) return; @@ -674,7 +677,7 @@ void BackupCoordinationRemote::prepareReplicatedSQLObjects() const replicated_sql_objects->addDirectory(std::move(directory)); } -void BackupCoordinationRemote::addKeeperMapTable(const String & table_zookeeper_root_path, const String & table_id, const String & data_path_in_backup) +void BackupCoordinationOnCluster::addKeeperMapTable(const String & table_zookeeper_root_path, const String & table_id, const String & data_path_in_backup) { { std::lock_guard lock{keeper_map_tables_mutex}; @@ -695,7 +698,7 @@ void BackupCoordinationRemote::addKeeperMapTable(const String & table_zookeeper_ }); } -void BackupCoordinationRemote::prepareKeeperMapTables() const +void BackupCoordinationOnCluster::prepareKeeperMapTables() const { if (keeper_map_tables) return; @@ -740,7 +743,7 @@ void BackupCoordinationRemote::prepareKeeperMapTables() const } -String BackupCoordinationRemote::getKeeperMapDataPath(const String & table_zookeeper_root_path) const +String BackupCoordinationOnCluster::getKeeperMapDataPath(const String & table_zookeeper_root_path) const { std::lock_guard lock(keeper_map_tables_mutex); prepareKeeperMapTables(); @@ -748,7 +751,7 @@ String BackupCoordinationRemote::getKeeperMapDataPath(const String & table_zooke } -void BackupCoordinationRemote::addFileInfos(BackupFileInfos && file_infos_) +void BackupCoordinationOnCluster::addFileInfos(BackupFileInfos && file_infos_) { { std::lock_guard lock{file_infos_mutex}; @@ -761,21 +764,21 @@ void BackupCoordinationRemote::addFileInfos(BackupFileInfos && file_infos_) serializeToMultipleZooKeeperNodes(zookeeper_path + "/file_infos/" + current_host, file_infos_str, "addFileInfos"); } -BackupFileInfos BackupCoordinationRemote::getFileInfos() const +BackupFileInfos BackupCoordinationOnCluster::getFileInfos() const { std::lock_guard lock{file_infos_mutex}; prepareFileInfos(); return file_infos->getFileInfos(current_host); } -BackupFileInfos BackupCoordinationRemote::getFileInfosForAllHosts() const +BackupFileInfos BackupCoordinationOnCluster::getFileInfosForAllHosts() const { std::lock_guard lock{file_infos_mutex}; prepareFileInfos(); return file_infos->getFileInfosForAllHosts(); } -void BackupCoordinationRemote::prepareFileInfos() const +void BackupCoordinationOnCluster::prepareFileInfos() const { if (file_infos) return; @@ -801,7 +804,7 @@ void BackupCoordinationRemote::prepareFileInfos() const } } -bool BackupCoordinationRemote::startWritingFile(size_t data_file_index) +bool 
BackupCoordinationOnCluster::startWritingFile(size_t data_file_index) { { /// Check if this host is already writing this file. @@ -842,66 +845,4 @@ bool BackupCoordinationRemote::startWritingFile(size_t data_file_index) } } -bool BackupCoordinationRemote::hasConcurrentBackups(const std::atomic &) const -{ - /// If its internal concurrency will be checked for the base backup - if (is_internal) - return false; - - std::string backup_stage_path = zookeeper_path + "/stage"; - - bool result = false; - - auto holder = with_retries.createRetriesControlHolder("getAllArchiveSuffixes"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - - if (!zk->exists(root_zookeeper_path)) - zk->createAncestors(root_zookeeper_path); - - for (size_t attempt = 0; attempt < MAX_ZOOKEEPER_ATTEMPTS; ++attempt) - { - Coordination::Stat stat; - zk->get(root_zookeeper_path, &stat); - Strings existing_backup_paths = zk->getChildren(root_zookeeper_path); - - for (const auto & existing_backup_path : existing_backup_paths) - { - if (startsWith(existing_backup_path, "restore-")) - continue; - - String existing_backup_uuid = existing_backup_path; - existing_backup_uuid.erase(0, String("backup-").size()); - - if (existing_backup_uuid == toString(backup_uuid)) - continue; - - String status; - if (zk->tryGet(root_zookeeper_path + "/" + existing_backup_path + "/stage", status)) - { - /// Check if some other backup is in progress - if (status == Stage::SCHEDULED_TO_START) - { - LOG_WARNING(log, "Found a concurrent backup: {}, current backup: {}", existing_backup_uuid, toString(backup_uuid)); - result = true; - return; - } - } - } - - zk->createIfNotExists(backup_stage_path, ""); - auto code = zk->trySet(backup_stage_path, Stage::SCHEDULED_TO_START, stat.version); - if (code == Coordination::Error::ZOK) - break; - bool is_last_attempt = (attempt == MAX_ZOOKEEPER_ATTEMPTS - 1); - if ((code != Coordination::Error::ZBADVERSION) || is_last_attempt) - throw zkutil::KeeperException::fromPath(code, backup_stage_path); - } - }); - - return result; -} - } diff --git a/src/Backups/BackupCoordinationRemote.h b/src/Backups/BackupCoordinationOnCluster.h similarity index 67% rename from src/Backups/BackupCoordinationRemote.h rename to src/Backups/BackupCoordinationOnCluster.h index 7a56b1a4eb8..7369c2cc746 100644 --- a/src/Backups/BackupCoordinationRemote.h +++ b/src/Backups/BackupCoordinationOnCluster.h @@ -1,6 +1,8 @@ #pragma once #include +#include +#include #include #include #include @@ -13,32 +15,35 @@ namespace DB { -/// We try to store data to zookeeper several times due to possible version conflicts. -constexpr size_t MAX_ZOOKEEPER_ATTEMPTS = 10; - /// Implementation of the IBackupCoordination interface performing coordination via ZooKeeper. It's necessary for "BACKUP ON CLUSTER". -class BackupCoordinationRemote : public IBackupCoordination +class BackupCoordinationOnCluster : public IBackupCoordination { public: - using BackupKeeperSettings = WithRetries::KeeperSettings; + /// Empty string as the current host is used to mark the initiator of a BACKUP ON CLUSTER query. 
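// Illustrative sketch mirroring the helpers declared below (standalone stand-ins, not the real
// implementation): the initiator of an ON CLUSTER query is encoded as an empty host name,
// excludeInitiator() drops it from the host list, and findCurrentHostIndex() now returns
// all_hosts.size() instead of 0 when the current host is not in the list.
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

constexpr std::string_view kInitiatorSketch{}; // empty string marks the initiator

std::vector<std::string> excludeInitiatorSketch(std::vector<std::string> hosts)
{
    std::erase(hosts, std::string(kInitiatorSketch)); // C++20 uniform erase
    return hosts;
}

std::size_t findCurrentHostIndexSketch(const std::string & current_host, const std::vector<std::string> & all_hosts)
{
    auto it = std::find(all_hosts.begin(), all_hosts.end(), current_host);
    return static_cast<std::size_t>(it - all_hosts.begin()); // equals all_hosts.size() when absent
}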
+ static const constexpr std::string_view kInitiator; - BackupCoordinationRemote( - zkutil::GetZooKeeper get_zookeeper_, + BackupCoordinationOnCluster( + const UUID & backup_uuid_, + bool is_plain_backup_, const String & root_zookeeper_path_, + zkutil::GetZooKeeper get_zookeeper_, const BackupKeeperSettings & keeper_settings_, - const String & backup_uuid_, - const Strings & all_hosts_, const String & current_host_, - bool plain_backup_, - bool is_internal_, + const Strings & all_hosts_, + bool allow_concurrent_backup_, + BackupConcurrencyCounters & concurrency_counters_, + ThreadPoolCallbackRunnerUnsafe schedule_, QueryStatusPtr process_list_element_); - ~BackupCoordinationRemote() override; + ~BackupCoordinationOnCluster() override; - void setStage(const String & new_stage, const String & message) override; - void setError(const Exception & exception) override; - Strings waitForStage(const String & stage_to_wait) override; - Strings waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) override; + Strings setStage(const String & new_stage, const String & message, bool sync) override; + void setBackupQueryWasSentToOtherHosts() override; + bool trySetError(std::exception_ptr exception) override; + void finish() override; + bool tryFinishAfterError() noexcept override; + void waitForOtherHostsToFinish() override; + bool tryWaitForOtherHostsToFinishAfterError() noexcept override; void addReplicatedPartNames( const String & table_zk_path, @@ -73,13 +78,14 @@ public: BackupFileInfos getFileInfosForAllHosts() const override; bool startWritingFile(size_t data_file_index) override; - bool hasConcurrentBackups(const std::atomic & num_active_backups) const override; + ZooKeeperRetriesInfo getOnClusterInitializationKeeperRetriesInfo() const override; - static size_t findCurrentHostIndex(const Strings & all_hosts, const String & current_host); + static Strings excludeInitiator(const Strings & all_hosts); + static size_t findCurrentHostIndex(const String & current_host, const Strings & all_hosts); private: void createRootNodes(); - void removeAllNodes(); + bool tryFinishImpl() noexcept; void serializeToMultipleZooKeeperNodes(const String & path, const String & value, const String & logging_name); String deserializeFromMultipleZooKeeperNodes(const String & path, const String & logging_name) const; @@ -96,26 +102,27 @@ private: const String root_zookeeper_path; const String zookeeper_path; const BackupKeeperSettings keeper_settings; - const String backup_uuid; + const UUID backup_uuid; const Strings all_hosts; + const Strings all_hosts_without_initiator; const String current_host; const size_t current_host_index; const bool plain_backup; - const bool is_internal; LoggerPtr const log; - /// The order of these two fields matters, because stage_sync holds a reference to with_retries object - mutable WithRetries with_retries; - std::optional stage_sync; + const WithRetries with_retries; + BackupConcurrencyCheck concurrency_check; + BackupCoordinationStageSync stage_sync; + BackupCoordinationCleaner cleaner; + std::atomic backup_query_was_sent_to_other_hosts = false; - mutable std::optional TSA_GUARDED_BY(replicated_tables_mutex) replicated_tables; - mutable std::optional TSA_GUARDED_BY(replicated_access_mutex) replicated_access; - mutable std::optional TSA_GUARDED_BY(replicated_sql_objects_mutex) replicated_sql_objects; - mutable std::optional TSA_GUARDED_BY(file_infos_mutex) file_infos; + mutable std::optional replicated_tables TSA_GUARDED_BY(replicated_tables_mutex); + mutable 
std::optional replicated_access TSA_GUARDED_BY(replicated_access_mutex); + mutable std::optional replicated_sql_objects TSA_GUARDED_BY(replicated_sql_objects_mutex); + mutable std::optional file_infos TSA_GUARDED_BY(file_infos_mutex); mutable std::optional keeper_map_tables TSA_GUARDED_BY(keeper_map_tables_mutex); - std::unordered_set TSA_GUARDED_BY(writing_files_mutex) writing_files; + std::unordered_set writing_files TSA_GUARDED_BY(writing_files_mutex); - mutable std::mutex zookeeper_mutex; mutable std::mutex replicated_tables_mutex; mutable std::mutex replicated_access_mutex; mutable std::mutex replicated_sql_objects_mutex; diff --git a/src/Backups/BackupCoordinationStage.h b/src/Backups/BackupCoordinationStage.h index 9abdc019784..2cd1efb5404 100644 --- a/src/Backups/BackupCoordinationStage.h +++ b/src/Backups/BackupCoordinationStage.h @@ -8,10 +8,6 @@ namespace DB namespace BackupCoordinationStage { - /// This stage is set after concurrency check so ensure we dont start other backup/restores - /// when concurrent backup/restores are not allowed - constexpr const char * SCHEDULED_TO_START = "scheduled to start"; - /// Finding all tables and databases which we're going to put to the backup and collecting their metadata. constexpr const char * GATHERING_METADATA = "gathering metadata"; @@ -46,10 +42,6 @@ namespace BackupCoordinationStage /// Coordination stage meaning that a host finished its work. constexpr const char * COMPLETED = "completed"; - - /// Coordination stage meaning that backup/restore has failed due to an error - /// Check '/error' for the error message - constexpr const char * ERROR = "error"; } } diff --git a/src/Backups/BackupCoordinationStageSync.cpp b/src/Backups/BackupCoordinationStageSync.cpp index 17ef163ce35..9a05f9490c2 100644 --- a/src/Backups/BackupCoordinationStageSync.cpp +++ b/src/Backups/BackupCoordinationStageSync.cpp @@ -9,267 +9,1117 @@ #include #include #include +#include +#include +#include + namespace DB { -namespace Stage = BackupCoordinationStage; - namespace ErrorCodes { extern const int FAILED_TO_SYNC_BACKUP_OR_RESTORE; + extern const int LOGICAL_ERROR; } +namespace +{ + /// The coordination version is stored in the 'start' node for each host + /// by each host when it starts working on this backup or restore. + enum Version + { + kInitialVersion = 1, + + /// This old version didn't create the 'finish' node, it uses stage "completed" to tell other hosts that the work is done. + /// If an error happened this old version didn't change any nodes to tell other hosts that the error handling is done. + /// So while using this old version hosts couldn't know when other hosts are done with the error handling, + /// and that situation caused weird errors in the logs somehow. + /// Also this old version didn't create the 'start' node for the initiator. + kVersionWithoutFinishNode = 1, + + /// Now we create the 'finish' node both if the work is done or if the error handling is done. + + kCurrentVersion = 2, + }; + + /// Empty string as the current host is used to mark the initiator of a BACKUP ON CLUSTER or RESTORE ON CLUSTER query. + const constexpr std::string_view kInitiator; +} + +bool BackupCoordinationStageSync::HostInfo::operator ==(const HostInfo & other) const +{ + /// We don't compare `last_connection_time` here. 
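// Illustrative sketch (field names are placeholders): like HostInfo's operator== below, equality
// deliberately skips the last-connection timestamps, which are refreshed as hosts reconnect,
// and compares only whether an exception is present, since std::exception_ptr offers no
// meaningful content comparison.
#include <chrono>
#include <exception>
#include <string>
#include <vector>

struct HostStateSketch
{
    std::string host;
    bool started = false;
    bool connected = false;
    bool finished = false;
    std::vector<std::string> stages;
    std::exception_ptr exception;
    std::chrono::system_clock::time_point last_connection_time; // intentionally not compared

    friend bool operator==(const HostStateSketch & lhs, const HostStateSketch & rhs)
    {
        return lhs.host == rhs.host && lhs.started == rhs.started && lhs.connected == rhs.connected
            && lhs.finished == rhs.finished && lhs.stages == rhs.stages
            && static_cast<bool>(lhs.exception) == static_cast<bool>(rhs.exception);
    }
};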
+ return (host == other.host) && (started == other.started) && (connected == other.connected) && (finished == other.finished) + && (stages == other.stages) && (!!exception == !!other.exception); +} + +bool BackupCoordinationStageSync::HostInfo::operator !=(const HostInfo & other) const +{ + return !(*this == other); +} + +bool BackupCoordinationStageSync::State::operator ==(const State & other) const = default; +bool BackupCoordinationStageSync::State::operator !=(const State & other) const = default; + BackupCoordinationStageSync::BackupCoordinationStageSync( - const String & root_zookeeper_path_, - WithRetries & with_retries_, - LoggerPtr log_) - : zookeeper_path(root_zookeeper_path_ + "/stage") + bool is_restore_, + const String & zookeeper_path_, + const String & current_host_, + const Strings & all_hosts_, + bool allow_concurrency_, + const WithRetries & with_retries_, + ThreadPoolCallbackRunnerUnsafe schedule_, + QueryStatusPtr process_list_element_, + LoggerPtr log_) + : is_restore(is_restore_) + , operation_name(is_restore ? "restore" : "backup") + , current_host(current_host_) + , current_host_desc(getHostDesc(current_host)) + , all_hosts(all_hosts_) + , allow_concurrency(allow_concurrency_) , with_retries(with_retries_) + , schedule(schedule_) + , process_list_element(process_list_element_) , log(log_) + , failure_after_host_disconnected_for_seconds(with_retries.getKeeperSettings().failure_after_host_disconnected_for_seconds) + , finish_timeout_after_error(with_retries.getKeeperSettings().finish_timeout_after_error) + , sync_period_ms(with_retries.getKeeperSettings().sync_period_ms) + , max_attempts_after_bad_version(with_retries.getKeeperSettings().max_attempts_after_bad_version) + , zookeeper_path(zookeeper_path_) + , root_zookeeper_path(zookeeper_path.parent_path().parent_path()) + , operation_node_path(zookeeper_path.parent_path()) + , operation_node_name(zookeeper_path.parent_path().filename()) + , stage_node_path(zookeeper_path) + , start_node_path(zookeeper_path / ("started|" + current_host)) + , finish_node_path(zookeeper_path / ("finished|" + current_host)) + , num_hosts_node_path(zookeeper_path / "num_hosts") + , alive_node_path(zookeeper_path / ("alive|" + current_host)) + , alive_tracker_node_path(fs::path{root_zookeeper_path} / "alive_tracker") + , error_node_path(zookeeper_path / "error") + , zk_nodes_changed(std::make_shared()) { + if ((zookeeper_path.filename() != "stage") || !operation_node_name.starts_with(is_restore ? "restore-" : "backup-") + || (root_zookeeper_path == operation_node_path)) + { + throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected path in ZooKeeper specified: {}", zookeeper_path); + } + + initializeState(); createRootNodes(); + + try + { + createStartAndAliveNodes(); + startWatchingThread(); + } + catch (...) 
+ { + trySetError(std::current_exception()); + tryFinishImpl(); + throw; + } } + +BackupCoordinationStageSync::~BackupCoordinationStageSync() +{ + tryFinishImpl(); +} + + +void BackupCoordinationStageSync::initializeState() +{ + std::lock_guard lock{mutex}; + auto now = std::chrono::system_clock::now(); + auto monotonic_now = std::chrono::steady_clock::now(); + + for (const String & host : all_hosts) + state.hosts.emplace(host, HostInfo{.host = host, .last_connection_time = now, .last_connection_time_monotonic = monotonic_now}); +} + + +String BackupCoordinationStageSync::getHostDesc(const String & host) +{ + String res; + if (host.empty()) + { + res = "the initiator"; + } + else + { + try + { + res = "host "; + Poco::URI::decode(host, res); /// Append the decoded host name to `res`. + } + catch (const Poco::URISyntaxException &) + { + res = "host " + host; + } + } + return res; +} + + +String BackupCoordinationStageSync::getHostsDesc(const Strings & hosts) +{ + String res = "["; + for (const String & host : hosts) + { + if (res != "[") + res += ", "; + res += getHostDesc(host); + } + res += "]"; + return res; +} + + void BackupCoordinationStageSync::createRootNodes() { - auto holder = with_retries.createRetriesControlHolder("createRootNodes"); + auto holder = with_retries.createRetriesControlHolder("BackupStageSync::createRootNodes", WithRetries::kInitialization); holder.retries_ctl.retryLoop( [&, &zookeeper = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zookeeper); + zookeeper->createAncestors(root_zookeeper_path); + zookeeper->createIfNotExists(root_zookeeper_path, ""); + }); +} + + +void BackupCoordinationStageSync::createStartAndAliveNodes() +{ + auto holder = with_retries.createRetriesControlHolder("BackupStageSync::createStartAndAliveNodes", WithRetries::kInitialization); + holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]() { with_retries.renewZooKeeper(zookeeper); - zookeeper->createAncestors(zookeeper_path); - zookeeper->createIfNotExists(zookeeper_path, ""); + createStartAndAliveNodes(zookeeper); }); } -void BackupCoordinationStageSync::set(const String & current_host, const String & new_stage, const String & message, const bool & all_hosts) -{ - auto holder = with_retries.createRetriesControlHolder("set"); - holder.retries_ctl.retryLoop( - [&, &zookeeper = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zookeeper); - if (all_hosts) +void BackupCoordinationStageSync::createStartAndAliveNodes(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper) +{ + /// The "num_hosts" node keeps the number of hosts which started (created the "started" node) + /// but not yet finished (not created the "finished" node). + /// The number of alive hosts can be less than that. + + /// The "alive_tracker" node always keeps an empty string, we track its version only. + /// The "alive_tracker" node increases its version each time when any "alive" nodes are created + /// so we use it to check concurrent backups/restores. 
+ zookeeper->createIfNotExists(alive_tracker_node_path, ""); + + std::optional num_hosts; + int num_hosts_version = -1; + + bool check_concurrency = !allow_concurrency; + int alive_tracker_version = -1; + + for (size_t attempt_no = 1; attempt_no <= max_attempts_after_bad_version; ++attempt_no) + { + if (!num_hosts) { - auto code = zookeeper->trySet(zookeeper_path, new_stage); - if (code != Coordination::Error::ZOK) - throw zkutil::KeeperException::fromPath(code, zookeeper_path); + String num_hosts_str; + Coordination::Stat stat; + if (zookeeper->tryGet(num_hosts_node_path, num_hosts_str, &stat)) + { + num_hosts = parseFromString(num_hosts_str); + num_hosts_version = stat.version; + } + } + + String serialized_error; + if (zookeeper->tryGet(error_node_path, serialized_error)) + { + auto [exception, host] = parseErrorNode(serialized_error); + if (exception) + std::rethrow_exception(exception); + } + + if (check_concurrency) + { + Coordination::Stat stat; + zookeeper->exists(alive_tracker_node_path, &stat); + alive_tracker_version = stat.version; + + checkConcurrency(zookeeper); + check_concurrency = false; + } + + Coordination::Requests requests; + requests.reserve(6); + + size_t operation_node_path_pos = static_cast(-1); + if (!zookeeper->exists(operation_node_path)) + { + operation_node_path_pos = requests.size(); + requests.emplace_back(zkutil::makeCreateRequest(operation_node_path, "", zkutil::CreateMode::Persistent)); + } + + size_t stage_node_path_pos = static_cast(-1); + if (!zookeeper->exists(stage_node_path)) + { + stage_node_path_pos = requests.size(); + requests.emplace_back(zkutil::makeCreateRequest(stage_node_path, "", zkutil::CreateMode::Persistent)); + } + + size_t num_hosts_node_path_pos = requests.size(); + if (num_hosts) + requests.emplace_back(zkutil::makeSetRequest(num_hosts_node_path, toString(*num_hosts + 1), num_hosts_version)); + else + requests.emplace_back(zkutil::makeCreateRequest(num_hosts_node_path, "1", zkutil::CreateMode::Persistent)); + + size_t alive_tracker_node_path_pos = requests.size(); + requests.emplace_back(zkutil::makeSetRequest(alive_tracker_node_path, "", alive_tracker_version)); + + requests.emplace_back(zkutil::makeCreateRequest(start_node_path, std::to_string(kCurrentVersion), zkutil::CreateMode::Persistent)); + requests.emplace_back(zkutil::makeCreateRequest(alive_node_path, "", zkutil::CreateMode::Ephemeral)); + + Coordination::Responses responses; + auto code = zookeeper->tryMulti(requests, responses); + + if (code == Coordination::Error::ZOK) + { + LOG_INFO(log, "Created start node #{} in ZooKeeper for {} (coordination version: {})", + num_hosts.value_or(0) + 1, current_host_desc, kCurrentVersion); + return; + } + + auto show_error_before_next_attempt = [&](const String & message) + { + bool will_try_again = (attempt_no < max_attempts_after_bad_version); + LOG_TRACE(log, "{} (attempt #{}){}", message, attempt_no, will_try_again ? 
", will try again" : ""); + }; + + if ((responses.size() > operation_node_path_pos) && + (responses[operation_node_path_pos]->error == Coordination::Error::ZNODEEXISTS)) + { + show_error_before_next_attempt(fmt::format("Node {} in ZooKeeper already exists", operation_node_path)); + /// needs another attempt + } + else if ((responses.size() > stage_node_path_pos) && + (responses[stage_node_path_pos]->error == Coordination::Error::ZNODEEXISTS)) + { + show_error_before_next_attempt(fmt::format("Node {} in ZooKeeper already exists", stage_node_path)); + /// needs another attempt + } + else if ((responses.size() > num_hosts_node_path_pos) && num_hosts && + (responses[num_hosts_node_path_pos]->error == Coordination::Error::ZBADVERSION)) + { + show_error_before_next_attempt("Other host changed the 'num_hosts' node in ZooKeeper"); + num_hosts.reset(); /// needs to reread 'num_hosts' again + } + else if ((responses.size() > num_hosts_node_path_pos) && num_hosts && + (responses[num_hosts_node_path_pos]->error == Coordination::Error::ZNONODE)) + { + show_error_before_next_attempt("Other host removed the 'num_hosts' node in ZooKeeper"); + num_hosts.reset(); /// needs to reread 'num_hosts' again + } + else if ((responses.size() > num_hosts_node_path_pos) && !num_hosts && + (responses[num_hosts_node_path_pos]->error == Coordination::Error::ZNODEEXISTS)) + { + show_error_before_next_attempt("Other host created the 'num_hosts' node in ZooKeeper"); + /// needs another attempt + } + else if ((responses.size() > alive_tracker_node_path_pos) && + (responses[alive_tracker_node_path_pos]->error == Coordination::Error::ZBADVERSION)) + { + show_error_before_next_attempt("Concurrent backup or restore changed some 'alive' nodes in ZooKeeper"); + check_concurrency = true; /// needs to recheck for concurrency again } else { - zookeeper->createIfNotExists(zookeeper_path + "/started|" + current_host, ""); - zookeeper->createIfNotExists(zookeeper_path + "/current|" + current_host + "|" + new_stage, message); + zkutil::KeeperMultiException::check(code, requests, responses); } + } + + throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, + "Couldn't create the 'start' node in ZooKeeper for {} after {} attempts", + current_host_desc, max_attempts_after_bad_version); +} + + +void BackupCoordinationStageSync::checkConcurrency(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper) +{ + if (allow_concurrency) + return; + + Strings found_operations; + auto code = zookeeper->tryGetChildren(root_zookeeper_path, found_operations); + + if (!((code == Coordination::Error::ZOK) || (code == Coordination::Error::ZNONODE))) + throw zkutil::KeeperException::fromPath(code, root_zookeeper_path); + + if (code == Coordination::Error::ZNONODE) + return; + + for (const String & found_operation : found_operations) + { + if (found_operation.starts_with(is_restore ? 
"restore-" : "backup-") && (found_operation != operation_node_name)) + { + Strings stages; + code = zookeeper->tryGetChildren(fs::path{root_zookeeper_path} / found_operation / "stage", stages); + + if (!((code == Coordination::Error::ZOK) || (code == Coordination::Error::ZNONODE))) + throw zkutil::KeeperException::fromPath(code, fs::path{root_zookeeper_path} / found_operation / "stage"); + + if (code == Coordination::Error::ZOK) + { + for (const String & stage : stages) + { + if (stage.starts_with("alive")) + BackupConcurrencyCheck::throwConcurrentOperationNotAllowed(is_restore); + } + } + } + } +} + + +void BackupCoordinationStageSync::startWatchingThread() +{ + watching_thread_future = schedule([this]() { watchingThread(); }, Priority{}); +} + + +void BackupCoordinationStageSync::stopWatchingThread() +{ + should_stop_watching_thread = true; + + /// Wake up waiting threads. + if (zk_nodes_changed) + zk_nodes_changed->set(); + state_changed.notify_all(); + + if (watching_thread_future.valid()) + watching_thread_future.wait(); +} + + +void BackupCoordinationStageSync::watchingThread() +{ + while (!should_stop_watching_thread) + { + try + { + /// Check if the current BACKUP or RESTORE command is already cancelled. + checkIfQueryCancelled(); + + /// Reset the `connected` flag for each host, we'll set them to true again after we find the 'alive' nodes. + resetConnectedFlag(); + + /// Recreate the 'alive' node if necessary and read a new state from ZooKeeper. + auto holder = with_retries.createRetriesControlHolder("BackupStageSync::watchingThread"); + auto & zookeeper = holder.faulty_zookeeper; + with_retries.renewZooKeeper(zookeeper); + + if (should_stop_watching_thread) + return; + + /// Recreate the 'alive' node if it was removed. + createAliveNode(zookeeper); + + /// Reads the current state from nodes in ZooKeeper. + readCurrentState(zookeeper); + } + catch (...) + { + tryLogCurrentException(log, "Caugth exception while watching"); + } + + try + { + /// Cancel the query if there is an error on another host or if some host was disconnected too long. + cancelQueryIfError(); + cancelQueryIfDisconnectedTooLong(); + } + catch (...) + { + tryLogCurrentException(log, "Caugth exception while checking if the query should be cancelled"); + } + + zk_nodes_changed->tryWait(sync_period_ms.count()); + } +} + + +void BackupCoordinationStageSync::createAliveNode(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper) +{ + if (zookeeper->exists(alive_node_path)) + return; + + Coordination::Requests requests; + requests.emplace_back(zkutil::makeCreateRequest(alive_node_path, "", zkutil::CreateMode::Ephemeral)); + requests.emplace_back(zkutil::makeSetRequest(alive_tracker_node_path, "", -1)); + zookeeper->multi(requests); + + LOG_INFO(log, "The alive node was recreated for {}", current_host_desc); +} + + +void BackupCoordinationStageSync::resetConnectedFlag() +{ + std::lock_guard lock{mutex}; + for (auto & [_, host_info] : state.hosts) + host_info.connected = false; +} + + +void BackupCoordinationStageSync::readCurrentState(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper) +{ + zk_nodes_changed->reset(); + + /// Get zk nodes and subscribe on their changes. + Strings new_zk_nodes = zookeeper->getChildren(stage_node_path, nullptr, zk_nodes_changed); + std::sort(new_zk_nodes.begin(), new_zk_nodes.end()); /// Sorting is necessary because we compare the list of zk nodes with its previous versions. 
+ + State new_state; + + { + std::lock_guard lock{mutex}; + + /// Log all changes in zookeeper nodes in the "stage" folder to make debugging easier. + Strings added_zk_nodes, removed_zk_nodes; + std::set_difference(new_zk_nodes.begin(), new_zk_nodes.end(), zk_nodes.begin(), zk_nodes.end(), back_inserter(added_zk_nodes)); + std::set_difference(zk_nodes.begin(), zk_nodes.end(), new_zk_nodes.begin(), new_zk_nodes.end(), back_inserter(removed_zk_nodes)); + if (!added_zk_nodes.empty()) + LOG_TRACE(log, "Detected new zookeeper nodes appeared in the stage folder: {}", boost::algorithm::join(added_zk_nodes, ", ")); + if (!removed_zk_nodes.empty()) + LOG_TRACE(log, "Detected that some zookeeper nodes disappeared from the stage folder: {}", boost::algorithm::join(removed_zk_nodes, ", ")); + + zk_nodes = new_zk_nodes; + new_state = state; + } + + auto get_host_info = [&](const String & host) -> HostInfo * + { + auto it = new_state.hosts.find(host); + if (it == new_state.hosts.end()) + return nullptr; + return &it->second; + }; + + auto now = std::chrono::system_clock::now(); + auto monotonic_now = std::chrono::steady_clock::now(); + + /// Read the current state from zookeeper nodes. + for (const auto & zk_node : new_zk_nodes) + { + if (zk_node == "error") + { + if (!new_state.host_with_error) + { + String serialized_error = zookeeper->get(error_node_path); + auto [exception, host] = parseErrorNode(serialized_error); + if (auto * host_info = get_host_info(host)) + { + host_info->exception = exception; + new_state.host_with_error = host; + } + } + } + else if (zk_node.starts_with("started|")) + { + String host = zk_node.substr(strlen("started|")); + if (auto * host_info = get_host_info(host)) + { + if (!host_info->started) + { + host_info->version = parseStartNode(zookeeper->get(zookeeper_path / zk_node), host); + host_info->started = true; + } + } + } + else if (zk_node.starts_with("finished|")) + { + String host = zk_node.substr(strlen("finished|")); + if (auto * host_info = get_host_info(host)) + host_info->finished = true; + } + else if (zk_node.starts_with("alive|")) + { + String host = zk_node.substr(strlen("alive|")); + if (auto * host_info = get_host_info(host)) + { + host_info->connected = true; + host_info->last_connection_time = now; + host_info->last_connection_time_monotonic = monotonic_now; + } + } + else if (zk_node.starts_with("current|")) + { + String host_and_stage = zk_node.substr(strlen("current|")); + size_t separator_pos = host_and_stage.find('|'); + if (separator_pos != String::npos) + { + String host = host_and_stage.substr(0, separator_pos); + String stage = host_and_stage.substr(separator_pos + 1); + if (auto * host_info = get_host_info(host)) + { + String result = zookeeper->get(fs::path{zookeeper_path} / zk_node); + host_info->stages[stage] = std::move(result); + + /// That old version didn't create the 'finish' node so we consider that a host finished its work + /// if it reached the "completed" stage. + if ((host_info->version == kVersionWithoutFinishNode) && (stage == BackupCoordinationStage::COMPLETED)) + host_info->finished = true; + } + } + } + } + + /// Check if the state has been just changed, and if so then wake up waiting threads (see waitHostsReachStage()). 
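Everything readCurrentState() learns about a host is encoded in the child names of the "stage" folder: started|&lt;host&gt;, alive|&lt;host&gt;, finished|&lt;host&gt; and current|&lt;host&gt;|&lt;stage&gt;. A reduced, ZooKeeper-free parser over such names could look like the sketch below; the HostState struct is invented for the example.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    struct HostState
    {
        bool started = false;
        bool connected = false;
        bool finished = false;
        std::set<std::string> stages;
    };

    /// Turns the node names found in the "stage" folder into per-host flags,
    /// following the naming convention used above.
    std::map<std::string, HostState> parseStageNodes(const std::vector<std::string> & node_names)
    {
        std::map<std::string, HostState> hosts;
        for (const auto & name : node_names)
        {
            if (name.starts_with("started|"))
                hosts[name.substr(8)].started = true;
            else if (name.starts_with("alive|"))
                hosts[name.substr(6)].connected = true;
            else if (name.starts_with("finished|"))
                hosts[name.substr(9)].finished = true;
            else if (name.starts_with("current|"))
            {
                std::string rest = name.substr(8);   /// "<host>|<stage>"
                auto sep = rest.find('|');
                if (sep != std::string::npos)
                    hosts[rest.substr(0, sep)].stages.insert(rest.substr(sep + 1));
            }
        }
        return hosts;
    }

    int main()
    {
        auto hosts = parseStageNodes({"started|hostA", "alive|hostA", "current|hostA|gathering metadata", "finished|hostB"});
        for (const auto & [host, info] : hosts)
            std::cout << host << ": started=" << info.started << " connected=" << info.connected
                      << " finished=" << info.finished << " stages=" << info.stages.size() << '\n';
    }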
+ bool was_state_changed = false; + + { + std::lock_guard lock{mutex}; + was_state_changed = (new_state != state); + state = std::move(new_state); + } + + if (was_state_changed) + state_changed.notify_all(); +} + + +int BackupCoordinationStageSync::parseStartNode(const String & start_node_contents, const String & host) const +{ + int version; + if (start_node_contents.empty()) + { + version = kInitialVersion; + } + else if (!tryParse(version, start_node_contents) || (version < kInitialVersion)) + { + throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, + "Coordination version {} used by {} is not supported", start_node_contents, getHostDesc(host)); + } + + if (version < kCurrentVersion) + LOG_WARNING(log, "Coordination version {} used by {} is outdated", version, getHostDesc(host)); + return version; +} + + +std::pair BackupCoordinationStageSync::parseErrorNode(const String & error_node_contents) +{ + ReadBufferFromOwnString buf{error_node_contents}; + String host; + readStringBinary(host, buf); + auto exception = std::make_exception_ptr(readException(buf, fmt::format("Got error from {}", getHostDesc(host)))); + return {exception, host}; +} + + +void BackupCoordinationStageSync::checkIfQueryCancelled() +{ + if (process_list_element->checkTimeLimitSoft()) + return; /// Not cancelled. + + std::lock_guard lock{mutex}; + if (state.cancelled) + return; /// Already marked as cancelled. + + state.cancelled = true; + state_changed.notify_all(); +} + + +void BackupCoordinationStageSync::cancelQueryIfError() +{ + std::exception_ptr exception; + + { + std::lock_guard lock{mutex}; + if (state.cancelled || !state.host_with_error) + return; + + state.cancelled = true; + exception = state.hosts.at(*state.host_with_error).exception; + } + + process_list_element->cancelQuery(false, exception); + state_changed.notify_all(); +} + + +void BackupCoordinationStageSync::cancelQueryIfDisconnectedTooLong() +{ + std::exception_ptr exception; + + { + std::lock_guard lock{mutex}; + if (state.cancelled || state.host_with_error || ((failure_after_host_disconnected_for_seconds.count() == 0))) + return; + + auto monotonic_now = std::chrono::steady_clock::now(); + bool info_shown = false; + + for (auto & [host, host_info] : state.hosts) + { + if (!host_info.connected && !host_info.finished && (host != current_host)) + { + auto disconnected_duration = std::chrono::duration_cast(monotonic_now - host_info.last_connection_time_monotonic); + if (disconnected_duration > failure_after_host_disconnected_for_seconds) + { + /// Host `host` was disconnected too long. + /// We can't just throw an exception here because readCurrentState() is called from a background thread. + /// So here we're writing the error to the `process_list_element` and let it be thrown later + /// from `process_list_element->checkTimeLimit()`. + String message = fmt::format("The 'alive' node hasn't been updated in ZooKeeper for {} for {} " "which is more than the specified timeout {}. 
Last time the 'alive' node was detected at {}", + getHostDesc(host), disconnected_duration, failure_after_host_disconnected_for_seconds, + host_info.last_connection_time); + LOG_WARNING(log, "Lost connection to {}: {}", getHostDesc(host), message); + exception = std::make_exception_ptr(Exception{ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, "Lost connection to {}: {}", getHostDesc(host), message}); + break; + } + + if ((disconnected_duration >= std::chrono::seconds{1}) && !info_shown) + { + LOG_TRACE(log, "The 'alive' node hasn't been updated in ZooKeeper for {} for {}", getHostDesc(host), disconnected_duration); + info_shown = true; + } + } + } + + if (!exception) + return; + + state.cancelled = true; + } + + process_list_element->cancelQuery(false, exception); + state_changed.notify_all(); +} + + +void BackupCoordinationStageSync::setStage(const String & stage, const String & stage_result) +{ + LOG_INFO(log, "{} reached stage {}", current_host_desc, stage); + auto holder = with_retries.createRetriesControlHolder("BackupStageSync::setStage"); + holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zookeeper); + zookeeper->createIfNotExists(getStageNodePath(stage), stage_result); }); } -void BackupCoordinationStageSync::setError(const String & current_host, const Exception & exception) + +String BackupCoordinationStageSync::getStageNodePath(const String & stage) const { - auto holder = with_retries.createRetriesControlHolder("setError"); - holder.retries_ctl.retryLoop( - [&, &zookeeper = holder.faulty_zookeeper]() + return fs::path{zookeeper_path} / ("current|" + current_host + "|" + stage); +} + + +bool BackupCoordinationStageSync::trySetError(std::exception_ptr exception) noexcept +{ + try + { + std::rethrow_exception(exception); + } + catch (const Exception & e) + { + return trySetError(e); + } + catch (...) + { + return trySetError(Exception(getCurrentExceptionMessageAndPattern(true, true), getCurrentExceptionCode())); + } +} + + +bool BackupCoordinationStageSync::trySetError(const Exception & exception) +{ + try + { + setError(exception); + return true; + } + catch (...) + { + return false; + } +} + + +void BackupCoordinationStageSync::setError(const Exception & exception) +{ + /// Most likely this exception has been already logged so here we're logging it without stacktrace. + String exception_message = getExceptionMessage(exception, /* with_stacktrace= */ false, /* check_embedded_stacktrace= */ true); + LOG_INFO(log, "Sending exception from {} to other hosts: {}", current_host_desc, exception_message); + + auto holder = with_retries.createRetriesControlHolder("BackupStageSync::setError", WithRetries::kErrorHandling); + + holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]() { with_retries.renewZooKeeper(zookeeper); WriteBufferFromOwnString buf; writeStringBinary(current_host, buf); writeException(exception, buf, true); - zookeeper->createIfNotExists(zookeeper_path + "/error", buf.str()); + auto code = zookeeper->tryCreate(error_node_path, buf.str(), zkutil::CreateMode::Persistent); - /// When backup/restore fails, it removes the nodes from Zookeeper. - /// Sometimes it fails to remove all nodes. It's possible that it removes /error node, but fails to remove /stage node, - /// so the following line tries to preserve the error status. 
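cancelQueryIfDisconnectedTooLong() comes down to a timing decision per host: not connected, not finished, and silent for longer than backup_restore_failure_after_host_disconnected_for_seconds (a zero value disables the check). A minimal sketch of just that decision, with made-up names and a steady clock:

    #include <chrono>
    #include <iostream>

    using Clock = std::chrono::steady_clock;

    /// Returns true if a host should be treated as lost: it is neither connected
    /// nor finished and its last 'alive' sighting is older than the limit.
    /// A zero limit disables the check, like the setting mentioned above.
    bool disconnectedTooLong(bool connected, bool finished, Clock::time_point last_seen,
                             std::chrono::seconds limit, Clock::time_point now = Clock::now())
    {
        if (connected || finished || limit.count() == 0)
            return false;
        return (now - last_seen) > limit;
    }

    int main()
    {
        auto last_seen = Clock::now() - std::chrono::seconds(10);
        std::cout << std::boolalpha
                  << disconnectedTooLong(false, false, last_seen, std::chrono::seconds(5)) << '\n'    // true: lost
                  << disconnectedTooLong(false, false, last_seen, std::chrono::seconds(30)) << '\n'   // false: within limit
                  << disconnectedTooLong(true, false, last_seen, std::chrono::seconds(5)) << '\n';    // false: still connected
    }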
- auto code = zookeeper->trySet(zookeeper_path, Stage::ERROR); - if (code != Coordination::Error::ZOK) - throw zkutil::KeeperException::fromPath(code, zookeeper_path); + if (code == Coordination::Error::ZOK) + { + LOG_TRACE(log, "Sent exception from {} to other hosts", current_host_desc); + } + else if (code == Coordination::Error::ZNODEEXISTS) + { + LOG_INFO(log, "An error has been already assigned for this {}", operation_name); + } + else + { + throw zkutil::KeeperException::fromPath(code, error_node_path); + } }); } -Strings BackupCoordinationStageSync::wait(const Strings & all_hosts, const String & stage_to_wait) + +Strings BackupCoordinationStageSync::waitForHostsToReachStage(const String & stage_to_wait, const Strings & hosts, std::optional timeout) const { - return waitImpl(all_hosts, stage_to_wait, {}); -} - -Strings BackupCoordinationStageSync::waitFor(const Strings & all_hosts, const String & stage_to_wait, std::chrono::milliseconds timeout) -{ - return waitImpl(all_hosts, stage_to_wait, timeout); -} - -namespace -{ - struct UnreadyHost - { - String host; - bool started = false; - }; -} - -struct BackupCoordinationStageSync::State -{ - std::optional results; - std::optional> error; - std::optional disconnected_host; - std::optional unready_host; -}; - -BackupCoordinationStageSync::State BackupCoordinationStageSync::readCurrentState( - WithRetries::RetriesControlHolder & retries_control_holder, - const Strings & zk_nodes, - const Strings & all_hosts, - const String & stage_to_wait) const -{ - auto zookeeper = retries_control_holder.faulty_zookeeper; - auto & retries_ctl = retries_control_holder.retries_ctl; - - std::unordered_set zk_nodes_set{zk_nodes.begin(), zk_nodes.end()}; - - State state; - if (zk_nodes_set.contains("error")) - { - String errors = zookeeper->get(zookeeper_path + "/error"); - ReadBufferFromOwnString buf{errors}; - String host; - readStringBinary(host, buf); - state.error = std::make_pair(host, readException(buf, fmt::format("Got error from {}", host))); - return state; - } - - std::optional unready_host; - - for (const auto & host : all_hosts) - { - if (!zk_nodes_set.contains("current|" + host + "|" + stage_to_wait)) - { - const String started_node_name = "started|" + host; - const String alive_node_name = "alive|" + host; - - bool started = zk_nodes_set.contains(started_node_name); - bool alive = zk_nodes_set.contains(alive_node_name); - - if (!alive) - { - /// If the "alive" node doesn't exist then we don't have connection to the corresponding host. - /// This node is ephemeral so probably it will be recreated soon. We use zookeeper retries to wait. - /// In worst case when we won't manage to see the alive node for a long time we will just abort the backup. - const auto * const suffix = retries_ctl.isLastRetry() ? 
"" : ", will retry"; - if (started) - retries_ctl.setUserError(Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, - "Lost connection to host {}{}", host, suffix)); - else - retries_ctl.setUserError(Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, - "No connection to host {} yet{}", host, suffix)); - - state.disconnected_host = host; - return state; - } - - if (!unready_host) - unready_host.emplace(UnreadyHost{.host = host, .started = started}); - } - } - - if (unready_host) - { - state.unready_host = std::move(unready_host); - return state; - } - Strings results; - for (const auto & host : all_hosts) - results.emplace_back(zookeeper->get(zookeeper_path + "/current|" + host + "|" + stage_to_wait)); - state.results = std::move(results); + results.resize(hosts.size()); - return state; + std::unique_lock lock{mutex}; + + /// TSA_NO_THREAD_SAFETY_ANALYSIS is here because Clang Thread Safety Analysis doesn't understand std::unique_lock. + auto check_if_hosts_ready = [&](bool time_is_out) TSA_NO_THREAD_SAFETY_ANALYSIS + { + return checkIfHostsReachStage(hosts, stage_to_wait, time_is_out, timeout, results); + }; + + if (timeout) + { + if (!state_changed.wait_for(lock, *timeout, [&] { return check_if_hosts_ready(/* time_is_out = */ false); })) + check_if_hosts_ready(/* time_is_out = */ true); + } + else + { + state_changed.wait(lock, [&] { return check_if_hosts_ready(/* time_is_out = */ false); }); + } + + return results; } -Strings BackupCoordinationStageSync::waitImpl( - const Strings & all_hosts, const String & stage_to_wait, std::optional timeout) const + +bool BackupCoordinationStageSync::checkIfHostsReachStage( + const Strings & hosts, + const String & stage_to_wait, + bool time_is_out, + std::optional timeout, + Strings & results) const { - if (all_hosts.empty()) - return {}; + if (should_stop_watching_thread) + throw Exception(ErrorCodes::LOGICAL_ERROR, "finish() was called while waiting for a stage"); - /// Wait until all hosts are ready or an error happens or time is out. + process_list_element->checkTimeLimit(); - bool use_timeout = timeout.has_value(); - std::chrono::steady_clock::time_point end_of_timeout; - if (use_timeout) - end_of_timeout = std::chrono::steady_clock::now() + std::chrono::duration_cast(*timeout); - - State state; - for (;;) + for (size_t i = 0; i != hosts.size(); ++i) { - LOG_INFO(log, "Waiting for the stage {}", stage_to_wait); - /// Set by ZooKepper when list of zk nodes have changed. - auto watch = std::make_shared(); - Strings zk_nodes; - { - auto holder = with_retries.createRetriesControlHolder("waitImpl"); - holder.retries_ctl.retryLoop( - [&, &zookeeper = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zookeeper); - watch->reset(); - /// Get zk nodes and subscribe on their changes. - zk_nodes = zookeeper->getChildren(zookeeper_path, nullptr, watch); + const String & host = hosts[i]; + auto it = state.hosts.find(host); - /// Read the current state of zk nodes. - state = readCurrentState(holder, zk_nodes, all_hosts, stage_to_wait); - }); + if (it == state.hosts.end()) + throw Exception(ErrorCodes::LOGICAL_ERROR, "waitForHostsToReachStage() was called for unexpected {}, all hosts are {}", getHostDesc(host), getHostsDesc(all_hosts)); + + const HostInfo & host_info = it->second; + auto stage_it = host_info.stages.find(stage_to_wait); + if (stage_it != host_info.stages.end()) + { + results[i] = stage_it->second; + continue; } - /// Analyze the current state of zk nodes. 
- chassert(state.results || state.error || state.disconnected_host || state.unready_host); - - if (state.results || state.error || state.disconnected_host) - break; /// Everything is ready or error happened. - - /// Log what we will wait. - const auto & unready_host = *state.unready_host; - LOG_INFO(log, "Waiting on ZooKeeper watch for any node to be changed (currently waiting for host {}{})", - unready_host.host, - (!unready_host.started ? " which didn't start the operation yet" : "")); - - /// Wait until `watch_callback` is called by ZooKeeper meaning that zk nodes have changed. + if (host_info.finished) { - if (use_timeout) + throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, + "{} finished without coming to stage {}", getHostDesc(host), stage_to_wait); + } + + String host_status; + if (!host_info.started) + host_status = fmt::format(": the host hasn't started working on this {} yet", operation_name); + else if (!host_info.connected) + host_status = fmt::format(": the host is currently disconnected, last connection was at {}", host_info.last_connection_time); + + if (!time_is_out) + { + LOG_TRACE(log, "Waiting for {} to reach stage {}{}", getHostDesc(host), stage_to_wait, host_status); + return false; + } + else + { + throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, + "Waited longer than timeout {} for {} to reach stage {}{}", + *timeout, getHostDesc(host), stage_to_wait, host_status); + } + } + + LOG_INFO(log, "Hosts {} reached stage {}", getHostsDesc(hosts), stage_to_wait); + return true; +} + + +void BackupCoordinationStageSync::finish(bool & other_hosts_also_finished) +{ + tryFinishImpl(other_hosts_also_finished, /* throw_if_error = */ true, /* retries_kind = */ WithRetries::kNormal); +} + + +bool BackupCoordinationStageSync::tryFinishAfterError(bool & other_hosts_also_finished) noexcept +{ + return tryFinishImpl(other_hosts_also_finished, /* throw_if_error = */ false, /* retries_kind = */ WithRetries::kErrorHandling); +} + + +bool BackupCoordinationStageSync::tryFinishImpl() +{ + bool other_hosts_also_finished; + return tryFinishAfterError(other_hosts_also_finished); +} + + +bool BackupCoordinationStageSync::tryFinishImpl(bool & other_hosts_also_finished, bool throw_if_error, WithRetries::Kind retries_kind) +{ + auto get_value_other_hosts_also_finished = [&] TSA_REQUIRES(mutex) + { + other_hosts_also_finished = true; + for (const auto & [host, host_info] : state.hosts) + { + if ((host != current_host) && !host_info.finished) + other_hosts_also_finished = false; + } + }; + + { + std::lock_guard lock{mutex}; + if (finish_result.succeeded) + { + get_value_other_hosts_also_finished(); + return true; + } + if (finish_result.exception) + { + if (throw_if_error) + std::rethrow_exception(finish_result.exception); + return false; + } + } + + try + { + stopWatchingThread(); + + auto holder = with_retries.createRetriesControlHolder("BackupStageSync::finish", retries_kind); + holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zookeeper); + createFinishNodeAndRemoveAliveNode(zookeeper); + }); + + std::lock_guard lock{mutex}; + finish_result.succeeded = true; + get_value_other_hosts_also_finished(); + return true; + } + catch (...) 
+ { + LOG_TRACE(log, "Caught exception while creating the 'finish' node for {}: {}", + current_host_desc, + getCurrentExceptionMessage(/* with_stacktrace= */ false, /* check_embedded_stacktrace= */ true)); + + std::lock_guard lock{mutex}; + finish_result.exception = std::current_exception(); + if (throw_if_error) + throw; + return false; + } +} + + +void BackupCoordinationStageSync::createFinishNodeAndRemoveAliveNode(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper) +{ + if (zookeeper->exists(finish_node_path)) + return; + + /// If the initiator of the query has that old version then it doesn't expect us to create the 'finish' node and moreover + /// the initiator can start removing all the nodes immediately after all hosts report about reaching the "completed" status. + /// So to avoid weird errors in the logs we won't create the 'finish' node if the initiator of the query has that old version. + if ((getInitiatorVersion() == kVersionWithoutFinishNode) && (current_host != kInitiator)) + { + LOG_INFO(log, "Skipped creating the 'finish' node because the initiator uses outdated version {}", getInitiatorVersion()); + return; + } + + std::optional num_hosts; + int num_hosts_version = -1; + + for (size_t attempt_no = 1; attempt_no <= max_attempts_after_bad_version; ++attempt_no) + { + if (!num_hosts) + { + Coordination::Stat stat; + num_hosts = parseFromString(zookeeper->get(num_hosts_node_path, &stat)); + num_hosts_version = stat.version; + } + + Coordination::Requests requests; + requests.reserve(3); + + requests.emplace_back(zkutil::makeCreateRequest(finish_node_path, "", zkutil::CreateMode::Persistent)); + + size_t num_hosts_node_path_pos = requests.size(); + requests.emplace_back(zkutil::makeSetRequest(num_hosts_node_path, toString(*num_hosts - 1), num_hosts_version)); + + size_t alive_node_path_pos = static_cast(-1); + if (zookeeper->exists(alive_node_path)) + { + alive_node_path_pos = requests.size(); + requests.emplace_back(zkutil::makeRemoveRequest(alive_node_path, -1)); + } + + Coordination::Responses responses; + auto code = zookeeper->tryMulti(requests, responses); + + if (code == Coordination::Error::ZOK) + { + --*num_hosts; + String hosts_left_desc = ((*num_hosts == 0) ? "no hosts left" : fmt::format("{} hosts left", *num_hosts)); + LOG_INFO(log, "Created the 'finish' node in ZooKeeper for {}, {}", current_host_desc, hosts_left_desc); + return; + } + + auto show_error_before_next_attempt = [&](const String & message) + { + bool will_try_again = (attempt_no < max_attempts_after_bad_version); + LOG_TRACE(log, "{} (attempt #{}){}", message, attempt_no, will_try_again ? 
", will try again" : ""); + }; + + if ((responses.size() > num_hosts_node_path_pos) && + (responses[num_hosts_node_path_pos]->error == Coordination::Error::ZBADVERSION)) + { + show_error_before_next_attempt("Other host changed the 'num_hosts' node in ZooKeeper"); + num_hosts.reset(); /// needs to reread 'num_hosts' again + } + else if ((responses.size() > alive_node_path_pos) && + (responses[alive_node_path_pos]->error == Coordination::Error::ZNONODE)) + { + show_error_before_next_attempt(fmt::format("Node {} in ZooKeeper doesn't exist", alive_node_path_pos)); + /// needs another attempt + } + else + { + zkutil::KeeperMultiException::check(code, requests, responses); + } + } + + throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, + "Couldn't create the 'finish' node for {} after {} attempts", + current_host_desc, max_attempts_after_bad_version); +} + + +int BackupCoordinationStageSync::getInitiatorVersion() const +{ + std::lock_guard lock{mutex}; + auto it = state.hosts.find(String{kInitiator}); + if (it == state.hosts.end()) + throw Exception(ErrorCodes::LOGICAL_ERROR, "There is no initiator of this {} query, it's a bug", operation_name); + const HostInfo & host_info = it->second; + return host_info.version; +} + + +void BackupCoordinationStageSync::waitForOtherHostsToFinish() const +{ + tryWaitForOtherHostsToFinishImpl(/* reason = */ "", /* throw_if_error = */ true, /* timeout = */ {}); +} + + +bool BackupCoordinationStageSync::tryWaitForOtherHostsToFinishAfterError() const noexcept +{ + std::optional timeout; + if (finish_timeout_after_error.count() != 0) + timeout = finish_timeout_after_error; + + String reason = fmt::format("{} needs other hosts to finish before cleanup", current_host_desc); + return tryWaitForOtherHostsToFinishImpl(reason, /* throw_if_error = */ false, timeout); +} + + +bool BackupCoordinationStageSync::tryWaitForOtherHostsToFinishImpl(const String & reason, bool throw_if_error, std::optional timeout) const +{ + std::unique_lock lock{mutex}; + + /// TSA_NO_THREAD_SAFETY_ANALYSIS is here because Clang Thread Safety Analysis doesn't understand std::unique_lock. + auto check_if_other_hosts_finish = [&](bool time_is_out) TSA_NO_THREAD_SAFETY_ANALYSIS + { + return checkIfOtherHostsFinish(reason, throw_if_error, time_is_out, timeout); + }; + + if (timeout) + { + if (state_changed.wait_for(lock, *timeout, [&] { return check_if_other_hosts_finish(/* time_is_out = */ false); })) + return true; + return check_if_other_hosts_finish(/* time_is_out = */ true); + } + else + { + state_changed.wait(lock, [&] { return check_if_other_hosts_finish(/* time_is_out = */ false); }); + return true; + } +} + + +bool BackupCoordinationStageSync::checkIfOtherHostsFinish(const String & reason, bool throw_if_error, bool time_is_out, std::optional timeout) const +{ + if (should_stop_watching_thread) + throw Exception(ErrorCodes::LOGICAL_ERROR, "finish() was called while waiting for other hosts to finish"); + + if (throw_if_error) + process_list_element->checkTimeLimit(); + + for (const auto & [host, host_info] : state.hosts) + { + if ((host == current_host) || host_info.finished) + continue; + + String host_status; + if (!host_info.started) + host_status = fmt::format(": the host hasn't started working on this {} yet", operation_name); + else if (!host_info.connected) + host_status = fmt::format(": the host is currently disconnected, last connection was at {}", host_info.last_connection_time); + + if (!time_is_out) + { + String reason_text = reason.empty() ? 
"" : (" because " + reason); + LOG_TRACE(log, "Waiting for {} to finish{}{}", getHostDesc(host), reason_text, host_status); + return false; + } + else + { + String reason_text = reason.empty() ? "" : fmt::format(" (reason of waiting: {})", reason); + if (!throw_if_error) { - auto current_time = std::chrono::steady_clock::now(); - if ((current_time > end_of_timeout) - || !watch->tryWait(std::chrono::duration_cast(end_of_timeout - current_time).count())) - break; + LOG_INFO(log, "Waited longer than timeout {} for {} to finish{}{}", + *timeout, getHostDesc(host), host_status, reason_text); + return false; } else { - watch->wait(); + throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, + "Waited longer than timeout {} for {} to finish{}{}", + *timeout, getHostDesc(host), host_status, reason_text); } } } - /// Rethrow an error raised originally on another host. - if (state.error) - state.error->second.rethrow(); - - /// Another host terminated without errors. - if (state.disconnected_host) - throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, "No connection to host {}", *state.disconnected_host); - - /// Something's unready, timeout is probably not enough. - if (state.unready_host) - { - const auto & unready_host = *state.unready_host; - throw Exception( - ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, - "Waited for host {} too long (> {}){}", - unready_host.host, - to_string(*timeout), - unready_host.started ? "" : ": Operation didn't start"); - } - - LOG_TRACE(log, "Everything is Ok. All hosts achieved stage {}", stage_to_wait); - return std::move(*state.results); + LOG_TRACE(log, "Other hosts finished working on this {}", operation_name); + return true; } } diff --git a/src/Backups/BackupCoordinationStageSync.h b/src/Backups/BackupCoordinationStageSync.h index a06c5c61041..dc0d3c3c83d 100644 --- a/src/Backups/BackupCoordinationStageSync.h +++ b/src/Backups/BackupCoordinationStageSync.h @@ -10,33 +10,193 @@ class BackupCoordinationStageSync { public: BackupCoordinationStageSync( - const String & root_zookeeper_path_, - WithRetries & with_retries_, + bool is_restore_, /// true if this is a RESTORE ON CLUSTER command, false if this is a BACKUP ON CLUSTER command + const String & zookeeper_path_, /// path to the "stage" folder in ZooKeeper + const String & current_host_, /// the current host, or an empty string if it's the initiator of the BACKUP/RESTORE ON CLUSTER command + const Strings & all_hosts_, /// all the hosts (including the initiator and the current host) performing the BACKUP/RESTORE ON CLUSTER command + bool allow_concurrency_, /// whether it's allowed to have concurrent backups or restores. + const WithRetries & with_retries_, + ThreadPoolCallbackRunnerUnsafe schedule_, + QueryStatusPtr process_list_element_, LoggerPtr log_); + ~BackupCoordinationStageSync(); + /// Sets the stage of the current host and signal other hosts if there were other hosts waiting for that. - void set(const String & current_host, const String & new_stage, const String & message, const bool & all_hosts = false); - void setError(const String & current_host, const Exception & exception); + void setStage(const String & stage, const String & stage_result = {}); - /// Sets the stage of the current host and waits until all hosts come to the same stage. - /// The function returns the messages all hosts set when they come to the required stage. - Strings wait(const Strings & all_hosts, const String & stage_to_wait); + /// Waits until all the specified hosts come to the specified stage. 
+ /// The function returns the results which the specified hosts set when they came to the required stage. + /// If it doesn't happen before the timeout then the function will stop waiting and throw an exception. + Strings waitForHostsToReachStage(const String & stage_to_wait, const Strings & hosts, std::optional timeout = {}) const; - /// Almost the same as setAndWait() but this one stops waiting and throws an exception after a specific amount of time. - Strings waitFor(const Strings & all_hosts, const String & stage_to_wait, std::chrono::milliseconds timeout); + /// Waits until all the other hosts finish their work. + /// Stops waiting and throws an exception if another host encounters an error or if some host gets cancelled. + void waitForOtherHostsToFinish() const; + + /// Lets other hosts know that the current host has finished its work. + void finish(bool & other_hosts_also_finished); + + /// Lets other hosts know that the current host has encountered an error. + bool trySetError(std::exception_ptr exception) noexcept; + + /// Waits until all the other hosts finish their work (as a part of error-handling process). + /// Doesn't stop waiting if some host encounters an error or gets cancelled. + bool tryWaitForOtherHostsToFinishAfterError() const noexcept; + + /// Lets other hosts know that the current host has finished its work (as a part of error-handling process). + bool tryFinishAfterError(bool & other_hosts_also_finished) noexcept; + + /// Returns a printable name of a specific host. For empty host the function returns "initiator". + static String getHostDesc(const String & host); + static String getHostsDesc(const Strings & hosts); private: + /// Initializes the original state. It will then be updated with readCurrentState(). + void initializeState(); + + /// Creates the root node in ZooKeeper. void createRootNodes(); - struct State; - State readCurrentState(WithRetries::RetriesControlHolder & retries_control_holder, const Strings & zk_nodes, const Strings & all_hosts, const String & stage_to_wait) const; + /// Atomically creates both 'start' and 'alive' nodes and also checks that there is no concurrent backup or restore if `allow_concurrency` is false. + void createStartAndAliveNodes(); + void createStartAndAliveNodes(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper); - Strings waitImpl(const Strings & all_hosts, const String & stage_to_wait, std::optional timeout) const; + /// Deserializes the version stored in the 'start' node. + int parseStartNode(const String & start_node_contents, const String & host) const; - String zookeeper_path; - /// A reference to the field of parent object - BackupCoordinationRemote or RestoreCoordinationRemote - WithRetries & with_retries; - LoggerPtr log; + /// Recreates the 'alive' node if it doesn't exist. It's an ephemeral node so it's removed automatically after disconnections. + void createAliveNode(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper); + + /// Checks that there is no concurrent backup or restore if `allow_concurrency` is false. + void checkConcurrency(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper); + + /// Watching thread periodically reads the current state from ZooKeeper and recreates the 'alive' node. + void startWatchingThread(); + void stopWatchingThread(); + void watchingThread(); + + /// Reads the current state from ZooKeeper without throwing exceptions. 
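Reading the public methods above together suggests the intended flow: a host repeatedly calls setStage()/waitForHostsToReachStage() while it works, then finish(); on failure it falls back to the noexcept counterparts, and the constructor's own catch block earlier in the .cpp does exactly trySetError() followed by a finish attempt. The toy class below only mimics the method names to make that ordering concrete; it is not the real coordination logic.

    #include <exception>
    #include <iostream>
    #include <string>

    /// Toy stand-in that merely records calls; the real class coordinates through ZooKeeper.
    class StageSyncMock
    {
    public:
        void setStage(const std::string & stage) { log("setStage(" + stage + ")"); }
        void waitForHostsToReachStage(const std::string & stage) { log("waitForHostsToReachStage(" + stage + ")"); }
        void finish(bool & other_hosts_also_finished) { log("finish"); other_hosts_also_finished = true; }
        void waitForOtherHostsToFinish() { log("waitForOtherHostsToFinish"); }

        bool trySetError(std::exception_ptr) noexcept { log("trySetError"); return true; }
        bool tryFinishAfterError(bool & other_hosts_also_finished) noexcept
        {
            log("tryFinishAfterError");
            other_hosts_also_finished = true;
            return true;
        }
        bool tryWaitForOtherHostsToFinishAfterError() noexcept { log("tryWaitForOtherHostsToFinishAfterError"); return true; }

    private:
        void log(const std::string & what) { std::cout << what << '\n'; }
    };

    int main()
    {
        StageSyncMock sync;
        bool others_finished = false;
        try
        {
            sync.setStage("gathering metadata");
            sync.waitForHostsToReachStage("gathering metadata");
            sync.finish(others_finished);
            sync.waitForOtherHostsToFinish();
        }
        catch (...)
        {
            /// One plausible error path: report the error, try to finish, and wait for the
            /// others with a bounded timeout (finish_timeout_after_error in the real class).
            sync.trySetError(std::current_exception());
            sync.tryFinishAfterError(others_finished);
            sync.tryWaitForOtherHostsToFinishAfterError();
        }
    }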
+ void readCurrentState(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper); + String getStageNodePath(const String & stage) const; + + /// Lets other hosts know that the current host has encountered an error. + bool trySetError(const Exception & exception); + void setError(const Exception & exception); + + /// Deserializes an error stored in the error node. + static std::pair parseErrorNode(const String & error_node_contents); + + /// Reset the `connected` flag for each host. + void resetConnectedFlag(); + + /// Checks if the current query is cancelled, and if so then the function sets the `cancelled` flag in the current state. + void checkIfQueryCancelled(); + + /// Checks if the current state contains an error, and if so then the function passes this error to the query status + /// to cancel the current BACKUP or RESTORE command. + void cancelQueryIfError(); + + /// Checks if some host was disconnected for too long, and if so then the function generates an error and passes it to the query status + /// to cancel the current BACKUP or RESTORE command. + void cancelQueryIfDisconnectedTooLong(); + + /// Used by waitForHostsToReachStage() to check if everything is ready to return. + bool checkIfHostsReachStage(const Strings & hosts, const String & stage_to_wait, bool time_is_out, std::optional timeout, Strings & results) const TSA_REQUIRES(mutex); + + /// Creates the 'finish' node. + bool tryFinishImpl(); + bool tryFinishImpl(bool & other_hosts_also_finished, bool throw_if_error, WithRetries::Kind retries_kind); + void createFinishNodeAndRemoveAliveNode(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper); + + /// Returns the version used by the initiator. + int getInitiatorVersion() const; + + /// Waits until all the other hosts finish their work. + bool tryWaitForOtherHostsToFinishImpl(const String & reason, bool throw_if_error, std::optional timeout) const; + bool checkIfOtherHostsFinish(const String & reason, bool throw_if_error, bool time_is_out, std::optional timeout) const TSA_REQUIRES(mutex); + + const bool is_restore; + const String operation_name; + const String current_host; + const String current_host_desc; + const Strings all_hosts; + const bool allow_concurrency; + + /// A reference to a field of the parent object which is either BackupCoordinationOnCluster or RestoreCoordinationOnCluster. + const WithRetries & with_retries; + + const ThreadPoolCallbackRunnerUnsafe schedule; + const QueryStatusPtr process_list_element; + const LoggerPtr log; + + const std::chrono::seconds failure_after_host_disconnected_for_seconds; + const std::chrono::seconds finish_timeout_after_error; + const std::chrono::milliseconds sync_period_ms; + const size_t max_attempts_after_bad_version; + + /// Paths in ZooKeeper. + const std::filesystem::path zookeeper_path; + const String root_zookeeper_path; + const String operation_node_path; + const String operation_node_name; + const String stage_node_path; + const String start_node_path; + const String finish_node_path; + const String num_hosts_node_path; + const String alive_node_path; + const String alive_tracker_node_path; + const String error_node_path; + + std::shared_ptr zk_nodes_changed; + + /// We store the list of previously found ZooKeeper nodes to show better logging messages. + Strings zk_nodes; + + /// Information about one host read from ZooKeeper. 
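Several members declared here (max_attempts_after_bad_version, the num_hosts node path) exist for one pattern used by both createStartAndAliveNodes() and createFinishNodeAndRemoveAliveNode(): read a counter together with its version, submit a versioned set, and reread and retry when another host won the race. Below is a sketch of that compare-and-set loop against a trivial in-memory node; the VersionedNode type is invented for the example and ZBADVERSION is simulated by a boolean.

    #include <iostream>
    #include <stdexcept>
    #include <string>

    /// Trivial stand-in for a ZooKeeper node: a value plus a version that grows on every successful set.
    struct VersionedNode
    {
        std::string value;
        int version = 0;

        bool trySet(const std::string & new_value, int expected_version)
        {
            if (expected_version != version)
                return false;   /// Same effect as ZBADVERSION: somebody else changed the node first.
            value = new_value;
            ++version;
            return true;
        }
    };

    /// Increments the counter stored in `node`, rereading it after every lost race
    /// and giving up after `max_attempts` attempts.
    void incrementWithRetries(VersionedNode & node, size_t max_attempts)
    {
        for (size_t attempt_no = 1; attempt_no <= max_attempts; ++attempt_no)
        {
            int expected_version = node.version;
            int current = node.value.empty() ? 0 : std::stoi(node.value);
            if (node.trySet(std::to_string(current + 1), expected_version))
                return;
            std::cout << "lost the race (attempt #" << attempt_no << "), rereading the counter\n";
        }
        throw std::runtime_error("couldn't update the counter, too many concurrent changes");
    }

    int main()
    {
        VersionedNode num_hosts{"2", 5};
        incrementWithRetries(num_hosts, /*max_attempts=*/ 10);
        std::cout << "num_hosts = " << num_hosts.value << " (version " << num_hosts.version << ")\n";
    }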
+ struct HostInfo + { + String host; + bool started = false; + bool connected = false; + bool finished = false; + int version = 1; + std::map stages = {}; /// std::map because we need to compare states + std::exception_ptr exception = nullptr; + + std::chrono::time_point last_connection_time = {}; + std::chrono::time_point last_connection_time_monotonic = {}; + + bool operator ==(const HostInfo & other) const; + bool operator !=(const HostInfo & other) const; + }; + + /// Information about all the hosts participating in the current BACKUP or RESTORE operation. + struct State + { + std::map hosts; /// std::map because we need to compare states + std::optional host_with_error; + bool cancelled = false; + + bool operator ==(const State & other) const; + bool operator !=(const State & other) const; + }; + + State state TSA_GUARDED_BY(mutex); + mutable std::condition_variable state_changed; + + std::future watching_thread_future; + std::atomic should_stop_watching_thread = false; + + struct FinishResult + { + bool succeeded = false; + std::exception_ptr exception; + bool other_hosts_also_finished = false; + }; + FinishResult finish_result TSA_GUARDED_BY(mutex); + + mutable std::mutex mutex; }; } diff --git a/src/Backups/BackupEntriesCollector.cpp b/src/Backups/BackupEntriesCollector.cpp index ae73630d41c..00a4471d994 100644 --- a/src/Backups/BackupEntriesCollector.cpp +++ b/src/Backups/BackupEntriesCollector.cpp @@ -102,7 +102,6 @@ BackupEntriesCollector::BackupEntriesCollector( , read_settings(read_settings_) , context(context_) , process_list_element(context->getProcessListElement()) - , on_cluster_first_sync_timeout(context->getConfigRef().getUInt64("backups.on_cluster_first_sync_timeout", 180000)) , collect_metadata_timeout(context->getConfigRef().getUInt64( "backups.collect_metadata_timeout", context->getConfigRef().getUInt64("backups.consistent_metadata_snapshot_timeout", 600000))) , attempts_to_collect_metadata_before_sleep(context->getConfigRef().getUInt("backups.attempts_to_collect_metadata_before_sleep", 2)) @@ -176,21 +175,7 @@ Strings BackupEntriesCollector::setStage(const String & new_stage, const String checkIsQueryCancelled(); current_stage = new_stage; - backup_coordination->setStage(new_stage, message); - - if (new_stage == Stage::formatGatheringMetadata(0)) - { - return backup_coordination->waitForStage(new_stage, on_cluster_first_sync_timeout); - } - if (new_stage.starts_with(Stage::GATHERING_METADATA)) - { - auto current_time = std::chrono::steady_clock::now(); - auto end_of_timeout = std::max(current_time, collect_metadata_end_time); - return backup_coordination->waitForStage( - new_stage, std::chrono::duration_cast(end_of_timeout - current_time)); - } - - return backup_coordination->waitForStage(new_stage); + return backup_coordination->setStage(new_stage, message, /* sync = */ true); } void BackupEntriesCollector::checkIsQueryCancelled() const diff --git a/src/Backups/BackupEntriesCollector.h b/src/Backups/BackupEntriesCollector.h index ae076a84c8b..504489cce6b 100644 --- a/src/Backups/BackupEntriesCollector.h +++ b/src/Backups/BackupEntriesCollector.h @@ -111,10 +111,6 @@ private: ContextPtr context; QueryStatusPtr process_list_element; - /// The time a BACKUP ON CLUSTER or RESTORE ON CLUSTER command will wait until all the nodes receive the BACKUP (or RESTORE) query and start working. - /// This setting is similar to `distributed_ddl_task_timeout`. 
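The reason the HostInfo and State structs above use std::map and plain value members is spelled out in their comments: the watching thread needs a cheap whole-state comparison to decide whether to notify waiters. A minimal illustration with C++20 defaulted comparisons (field names are trimmed down for the example):

    #include <iostream>
    #include <map>
    #include <string>

    struct HostInfo
    {
        bool started = false;
        bool finished = false;

        bool operator ==(const HostInfo &) const = default;
    };

    struct State
    {
        std::map<std::string, HostInfo> hosts;   /// std::map keeps the whole struct comparable
        bool cancelled = false;

        bool operator ==(const State &) const = default;
    };

    int main()
    {
        State previous;
        previous.hosts["hostA"].started = true;

        State current = previous;
        std::cout << std::boolalpha << (current != previous) << '\n';   // false: nothing changed, no notification needed

        current.hosts["hostA"].finished = true;
        std::cout << (current != previous) << '\n';                     // true: waiters should be notified
    }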
- const std::chrono::milliseconds on_cluster_first_sync_timeout; - /// The time a BACKUP command will try to collect the metadata of tables & databases. const std::chrono::milliseconds collect_metadata_timeout; diff --git a/src/Backups/BackupIO.h b/src/Backups/BackupIO.h index ee2f38c785b..c9e0f25f9a0 100644 --- a/src/Backups/BackupIO.h +++ b/src/Backups/BackupIO.h @@ -5,6 +5,7 @@ namespace DB { + class IDisk; using DiskPtr = std::shared_ptr; class SeekableReadBuffer; @@ -63,9 +64,13 @@ public: virtual void copyFile(const String & destination, const String & source, size_t size) = 0; + /// Removes a file written to the backup, if it still exists. virtual void removeFile(const String & file_name) = 0; virtual void removeFiles(const Strings & file_names) = 0; + /// Removes the backup folder if it's empty or contains empty subfolders. + virtual void removeEmptyDirectories() = 0; + virtual const ReadSettings & getReadSettings() const = 0; virtual const WriteSettings & getWriteSettings() const = 0; virtual size_t getWriteBufferSize() const = 0; diff --git a/src/Backups/BackupIO_AzureBlobStorage.h b/src/Backups/BackupIO_AzureBlobStorage.h index c3b88f245ab..c90a030a1e7 100644 --- a/src/Backups/BackupIO_AzureBlobStorage.h +++ b/src/Backups/BackupIO_AzureBlobStorage.h @@ -81,6 +81,7 @@ public: void removeFile(const String & file_name) override; void removeFiles(const Strings & file_names) override; + void removeEmptyDirectories() override {} private: std::unique_ptr readFile(const String & file_name, size_t expected_file_size) override; diff --git a/src/Backups/BackupIO_Disk.cpp b/src/Backups/BackupIO_Disk.cpp index aeb07b154f5..794fb5be936 100644 --- a/src/Backups/BackupIO_Disk.cpp +++ b/src/Backups/BackupIO_Disk.cpp @@ -91,16 +91,36 @@ std::unique_ptr BackupWriterDisk::writeFile(const String & file_nam void BackupWriterDisk::removeFile(const String & file_name) { disk->removeFileIfExists(root_path / file_name); - if (disk->existsDirectory(root_path) && disk->isDirectoryEmpty(root_path)) - disk->removeDirectory(root_path); } void BackupWriterDisk::removeFiles(const Strings & file_names) { for (const auto & file_name : file_names) disk->removeFileIfExists(root_path / file_name); - if (disk->existsDirectory(root_path) && disk->isDirectoryEmpty(root_path)) - disk->removeDirectory(root_path); +} + +void BackupWriterDisk::removeEmptyDirectories() +{ + removeEmptyDirectoriesImpl(root_path); +} + +void BackupWriterDisk::removeEmptyDirectoriesImpl(const fs::path & current_dir) +{ + if (!disk->existsDirectory(current_dir)) + return; + + if (disk->isDirectoryEmpty(current_dir)) + { + disk->removeDirectory(current_dir); + return; + } + + /// Backups are not too deep, so recursion is good enough here. 
+ for (auto it = disk->iterateDirectory(current_dir); it->isValid(); it->next()) + removeEmptyDirectoriesImpl(current_dir / it->name()); + + if (disk->isDirectoryEmpty(current_dir)) + disk->removeDirectory(current_dir); } void BackupWriterDisk::copyFileFromDisk(const String & path_in_backup, DiskPtr src_disk, const String & src_path, diff --git a/src/Backups/BackupIO_Disk.h b/src/Backups/BackupIO_Disk.h index 3d3253877bd..c77513935a9 100644 --- a/src/Backups/BackupIO_Disk.h +++ b/src/Backups/BackupIO_Disk.h @@ -50,9 +50,11 @@ public: void removeFile(const String & file_name) override; void removeFiles(const Strings & file_names) override; + void removeEmptyDirectories() override; private: std::unique_ptr readFile(const String & file_name, size_t expected_file_size) override; + void removeEmptyDirectoriesImpl(const std::filesystem::path & current_dir); const DiskPtr disk; const std::filesystem::path root_path; diff --git a/src/Backups/BackupIO_File.cpp b/src/Backups/BackupIO_File.cpp index 681513bf7ce..80f084d241c 100644 --- a/src/Backups/BackupIO_File.cpp +++ b/src/Backups/BackupIO_File.cpp @@ -106,16 +106,36 @@ std::unique_ptr BackupWriterFile::writeFile(const String & file_nam void BackupWriterFile::removeFile(const String & file_name) { (void)fs::remove(root_path / file_name); - if (fs::is_directory(root_path) && fs::is_empty(root_path)) - (void)fs::remove(root_path); } void BackupWriterFile::removeFiles(const Strings & file_names) { for (const auto & file_name : file_names) (void)fs::remove(root_path / file_name); - if (fs::is_directory(root_path) && fs::is_empty(root_path)) - (void)fs::remove(root_path); +} + +void BackupWriterFile::removeEmptyDirectories() +{ + removeEmptyDirectoriesImpl(root_path); +} + +void BackupWriterFile::removeEmptyDirectoriesImpl(const fs::path & current_dir) +{ + if (!fs::is_directory(current_dir)) + return; + + if (fs::is_empty(current_dir)) + { + (void)fs::remove(current_dir); + return; + } + + /// Backups are not too deep, so recursion is good enough here. 
+ for (const auto & it : std::filesystem::directory_iterator{current_dir}) + removeEmptyDirectoriesImpl(it.path()); + + if (fs::is_empty(current_dir)) + (void)fs::remove(current_dir); } void BackupWriterFile::copyFileFromDisk(const String & path_in_backup, DiskPtr src_disk, const String & src_path, diff --git a/src/Backups/BackupIO_File.h b/src/Backups/BackupIO_File.h index ebe9a0f02cb..a2169ac7b4b 100644 --- a/src/Backups/BackupIO_File.h +++ b/src/Backups/BackupIO_File.h @@ -42,9 +42,11 @@ public: void removeFile(const String & file_name) override; void removeFiles(const Strings & file_names) override; + void removeEmptyDirectories() override; private: std::unique_ptr readFile(const String & file_name, size_t expected_file_size) override; + void removeEmptyDirectoriesImpl(const std::filesystem::path & current_dir); const std::filesystem::path root_path; const DataSourceDescription data_source_description; diff --git a/src/Backups/BackupIO_S3.h b/src/Backups/BackupIO_S3.h index a04f1c915b9..4ccf477b369 100644 --- a/src/Backups/BackupIO_S3.h +++ b/src/Backups/BackupIO_S3.h @@ -74,6 +74,7 @@ public: void removeFile(const String & file_name) override; void removeFiles(const Strings & file_names) override; + void removeEmptyDirectories() override {} private: std::unique_ptr readFile(const String & file_name, size_t expected_file_size) override; diff --git a/src/Backups/BackupImpl.cpp b/src/Backups/BackupImpl.cpp index b95a2e10b4d..af3fa5531b8 100644 --- a/src/Backups/BackupImpl.cpp +++ b/src/Backups/BackupImpl.cpp @@ -147,11 +147,11 @@ BackupImpl::BackupImpl( BackupImpl::~BackupImpl() { - if ((open_mode == OpenMode::WRITE) && !is_internal_backup && !writing_finalized && !std::uncaught_exceptions() && !std::current_exception()) + if ((open_mode == OpenMode::WRITE) && !writing_finalized && !corrupted) { /// It is suspicious to destroy BackupImpl without finalization while writing a backup when there is no exception. - LOG_ERROR(log, "BackupImpl is not finalized when destructor is called. Stack trace: {}", StackTrace().toString()); - chassert(false && "BackupImpl is not finalized when destructor is called."); + LOG_ERROR(log, "BackupImpl is not finalized or marked as corrupted when destructor is called. Stack trace: {}", StackTrace().toString()); + chassert(false, "BackupImpl is not finalized or marked as corrupted when destructor is called."); } try @@ -196,9 +196,6 @@ void BackupImpl::open() if (open_mode == OpenMode::READ) readBackupMetadata(); - - if ((open_mode == OpenMode::WRITE) && base_backup_info) - base_backup_uuid = getBaseBackupUnlocked()->getUUID(); } void BackupImpl::close() @@ -280,6 +277,8 @@ std::shared_ptr BackupImpl::getBaseBackupUnlocked() const toString(base_backup->getUUID()), (base_backup_uuid ? toString(*base_backup_uuid) : "")); } + + base_backup_uuid = base_backup->getUUID(); } return base_backup; } @@ -369,7 +368,7 @@ void BackupImpl::writeBackupMetadata() if (base_backup_in_use) { *out << "" << xml << base_backup_info->toString() << ""; - *out << "" << toString(*base_backup_uuid) << ""; + *out << "" << getBaseBackupUnlocked()->getUUID() << ""; } } @@ -594,9 +593,6 @@ bool BackupImpl::checkLockFile(bool throw_if_failed) const void BackupImpl::removeLockFile() { - if (is_internal_backup) - return; /// Internal backup must not remove the lock file (it's still used by the initiator). 
- if (checkLockFile(false)) writer->removeFile(lock_file_name); } @@ -989,8 +985,11 @@ void BackupImpl::finalizeWriting() if (open_mode != OpenMode::WRITE) throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup is not opened for writing"); + if (corrupted) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup can't be finalized after an error happened"); + if (writing_finalized) - throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup is already finalized"); + return; if (!is_internal_backup) { @@ -1015,20 +1014,58 @@ void BackupImpl::setCompressedSize() } -void BackupImpl::tryRemoveAllFiles() +bool BackupImpl::setIsCorrupted() noexcept { - if (open_mode != OpenMode::WRITE) - throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup is not opened for writing"); - - if (is_internal_backup) - return; - try { - LOG_INFO(log, "Removing all files of backup {}", backup_name_for_logging); + std::lock_guard lock{mutex}; + if (open_mode != OpenMode::WRITE) + { + LOG_ERROR(log, "Backup is not opened for writing. Stack trace: {}", StackTrace().toString()); + chassert(false, "Backup is not opened for writing when setIsCorrupted() is called"); + return false; + } + + if (writing_finalized) + { + LOG_WARNING(log, "An error happened after the backup was completed successfully, the backup must be correct!"); + return false; + } + + if (corrupted) + return true; + + LOG_WARNING(log, "An error happened, the backup won't be completed"); + closeArchive(/* finalize= */ false); + corrupted = true; + return true; + } + catch (...) + { + DB::tryLogCurrentException(log, "Caught exception while setting that the backup was corrupted"); + return false; + } +} + + +bool BackupImpl::tryRemoveAllFiles() noexcept +{ + try + { + std::lock_guard lock{mutex}; + if (!corrupted) + { + LOG_ERROR(log, "Backup is not set as corrupted. Stack trace: {}", StackTrace().toString()); + chassert(false, "Backup is not set as corrupted when tryRemoveAllFiles() is called"); + return false; + } + + LOG_INFO(log, "Removing all files of backup {}", backup_name_for_logging); + Strings files_to_remove; + if (use_archive) { files_to_remove.push_back(archive_params.archive_name); @@ -1041,14 +1078,17 @@ void BackupImpl::tryRemoveAllFiles() } if (!checkLockFile(false)) - return; + return false; writer->removeFiles(files_to_remove); removeLockFile(); + writer->removeEmptyDirectories(); + return true; } catch (...) 
{ - DB::tryLogCurrentException(__PRETTY_FUNCTION__); + DB::tryLogCurrentException(log, "Caught exception while removing files of a corrupted backup"); + return false; } } diff --git a/src/Backups/BackupImpl.h b/src/Backups/BackupImpl.h index d7846104c4c..4b0f9f879ec 100644 --- a/src/Backups/BackupImpl.h +++ b/src/Backups/BackupImpl.h @@ -86,7 +86,8 @@ public: void writeFile(const BackupFileInfo & info, BackupEntryPtr entry) override; bool supportsWritingInMultipleThreads() const override { return !use_archive; } void finalizeWriting() override; - void tryRemoveAllFiles() override; + bool setIsCorrupted() noexcept override; + bool tryRemoveAllFiles() noexcept override; private: void open(); @@ -146,13 +147,14 @@ private: int version; mutable std::optional base_backup_info; mutable std::shared_ptr base_backup; - std::optional base_backup_uuid; + mutable std::optional base_backup_uuid; std::shared_ptr archive_reader; std::shared_ptr archive_writer; String lock_file_name; std::atomic lock_file_before_first_file_checked = false; bool writing_finalized = false; + bool corrupted = false; bool deduplicate_files = true; bool use_same_s3_credentials_for_base_backup = false; bool use_same_password_for_base_backup = false; diff --git a/src/Backups/BackupKeeperSettings.cpp b/src/Backups/BackupKeeperSettings.cpp new file mode 100644 index 00000000000..180633cea1f --- /dev/null +++ b/src/Backups/BackupKeeperSettings.cpp @@ -0,0 +1,58 @@ +#include + +#include +#include +#include + + +namespace DB +{ + +namespace Setting +{ + extern const SettingsUInt64 backup_restore_keeper_max_retries; + extern const SettingsUInt64 backup_restore_keeper_retry_initial_backoff_ms; + extern const SettingsUInt64 backup_restore_keeper_retry_max_backoff_ms; + extern const SettingsUInt64 backup_restore_failure_after_host_disconnected_for_seconds; + extern const SettingsUInt64 backup_restore_keeper_max_retries_while_initializing; + extern const SettingsUInt64 backup_restore_keeper_max_retries_while_handling_error; + extern const SettingsUInt64 backup_restore_finish_timeout_after_error_sec; + extern const SettingsUInt64 backup_restore_keeper_value_max_size; + extern const SettingsUInt64 backup_restore_batch_size_for_keeper_multi; + extern const SettingsUInt64 backup_restore_batch_size_for_keeper_multiread; + extern const SettingsFloat backup_restore_keeper_fault_injection_probability; + extern const SettingsUInt64 backup_restore_keeper_fault_injection_seed; +} + +BackupKeeperSettings BackupKeeperSettings::fromContext(const ContextPtr & context) +{ + BackupKeeperSettings keeper_settings; + + const auto & settings = context->getSettingsRef(); + const auto & config = context->getConfigRef(); + + keeper_settings.max_retries = settings[Setting::backup_restore_keeper_max_retries]; + keeper_settings.retry_initial_backoff_ms = std::chrono::milliseconds{settings[Setting::backup_restore_keeper_retry_initial_backoff_ms]}; + keeper_settings.retry_max_backoff_ms = std::chrono::milliseconds{settings[Setting::backup_restore_keeper_retry_max_backoff_ms]}; + + keeper_settings.failure_after_host_disconnected_for_seconds = std::chrono::seconds{settings[Setting::backup_restore_failure_after_host_disconnected_for_seconds]}; + keeper_settings.max_retries_while_initializing = settings[Setting::backup_restore_keeper_max_retries_while_initializing]; + keeper_settings.max_retries_while_handling_error = settings[Setting::backup_restore_keeper_max_retries_while_handling_error]; + keeper_settings.finish_timeout_after_error = 
std::chrono::seconds(settings[Setting::backup_restore_finish_timeout_after_error_sec]); + + if (config.has("backups.sync_period_ms")) + keeper_settings.sync_period_ms = std::chrono::milliseconds{config.getUInt64("backups.sync_period_ms")}; + + if (config.has("backups.max_attempts_after_bad_version")) + keeper_settings.max_attempts_after_bad_version = config.getUInt64("backups.max_attempts_after_bad_version"); + + keeper_settings.value_max_size = settings[Setting::backup_restore_keeper_value_max_size]; + keeper_settings.batch_size_for_multi = settings[Setting::backup_restore_batch_size_for_keeper_multi]; + keeper_settings.batch_size_for_multiread = settings[Setting::backup_restore_batch_size_for_keeper_multiread]; + keeper_settings.fault_injection_probability = settings[Setting::backup_restore_keeper_fault_injection_probability]; + keeper_settings.fault_injection_seed = settings[Setting::backup_restore_keeper_fault_injection_seed]; + + return keeper_settings; +} + +} diff --git a/src/Backups/BackupKeeperSettings.h b/src/Backups/BackupKeeperSettings.h new file mode 100644 index 00000000000..6c4b2187094 --- /dev/null +++ b/src/Backups/BackupKeeperSettings.h @@ -0,0 +1,64 @@ +#pragma once + +#include + + +namespace DB +{ + +/// Settings for [Zoo]Keeper-related works during BACKUP or RESTORE. +struct BackupKeeperSettings +{ + /// Maximum number of retries in the middle of a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation. + /// Should be big enough so the whole operation won't be cancelled in the middle of it because of a temporary ZooKeeper failure. + UInt64 max_retries{1000}; + + /// Initial backoff timeout for ZooKeeper operations during backup or restore. + std::chrono::milliseconds retry_initial_backoff_ms{100}; + + /// Max backoff timeout for ZooKeeper operations during backup or restore. + std::chrono::milliseconds retry_max_backoff_ms{5000}; + + /// If a host during BACKUP ON CLUSTER or RESTORE ON CLUSTER doesn't recreate its 'alive' node in ZooKeeper + /// for this amount of time then the whole backup or restore is considered as failed. + /// Should be bigger than any reasonable time for a host to reconnect to ZooKeeper after a failure. + /// Set to zero to disable (if it's zero and some host crashed then BACKUP ON CLUSTER or RESTORE ON CLUSTER will be waiting + /// for the crashed host forever until the operation is explicitly cancelled with KILL QUERY). + std::chrono::seconds failure_after_host_disconnected_for_seconds{3600}; + + /// Maximum number of retries during the initialization of a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation. + /// Shouldn't be too big because if the operation is going to fail then it's better if it fails faster. + UInt64 max_retries_while_initializing{20}; + + /// Maximum number of retries while handling an error of a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation. + /// Shouldn't be too big because those retries are just for cleanup after the operation has failed already. + UInt64 max_retries_while_handling_error{20}; + + /// How long the initiator should wait for other host to handle the 'error' node and finish their work. + std::chrono::seconds finish_timeout_after_error{180}; + + /// How often the "stage" folder in ZooKeeper must be scanned in a background thread to track changes done by other hosts. + std::chrono::milliseconds sync_period_ms{5000}; + + /// Number of attempts after getting error ZBADVERSION from ZooKeeper. + size_t max_attempts_after_bad_version{10}; + + /// Maximum size of data of a ZooKeeper's node during backup. 
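+    /// The 1 MiB default is in line with ZooKeeper's usual limit on the size of a single znode.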
+ UInt64 value_max_size{1048576}; + + /// Maximum size of a batch for a multi request. + UInt64 batch_size_for_multi{1000}; + + /// Maximum size of a batch for a multiread request. + UInt64 batch_size_for_multiread{10000}; + + /// Approximate probability of failure for a keeper request during backup or restore. Valid value is in interval [0.0f, 1.0f]. + Float64 fault_injection_probability{0}; + + /// Seed for `fault_injection_probability`: 0 - random seed, otherwise the setting value. + UInt64 fault_injection_seed{0}; + + static BackupKeeperSettings fromContext(const ContextPtr & context); +}; + +} diff --git a/src/Backups/BackupSettings.cpp b/src/Backups/BackupSettings.cpp index 9b8117c6587..915989735c3 100644 --- a/src/Backups/BackupSettings.cpp +++ b/src/Backups/BackupSettings.cpp @@ -74,6 +74,17 @@ BackupSettings BackupSettings::fromBackupQuery(const ASTBackupQuery & query) return res; } +bool BackupSettings::isAsync(const ASTBackupQuery & query) +{ + if (query.settings) + { + const auto * field = query.settings->as().changes.tryGet("async"); + if (field) + return field->safeGet(); + } + return false; /// `async` is false by default. +} + void BackupSettings::copySettingsToQuery(ASTBackupQuery & query) const { auto query_settings = std::make_shared(); diff --git a/src/Backups/BackupSettings.h b/src/Backups/BackupSettings.h index 8c2ea21df01..fa1e5025935 100644 --- a/src/Backups/BackupSettings.h +++ b/src/Backups/BackupSettings.h @@ -101,6 +101,8 @@ struct BackupSettings static BackupSettings fromBackupQuery(const ASTBackupQuery & query); void copySettingsToQuery(ASTBackupQuery & query) const; + static bool isAsync(const ASTBackupQuery & query); + struct Util { static std::vector clusterHostIDsFromAST(const IAST & ast); diff --git a/src/Backups/BackupsWorker.cpp b/src/Backups/BackupsWorker.cpp index d3889295598..8480dc5d64d 100644 --- a/src/Backups/BackupsWorker.cpp +++ b/src/Backups/BackupsWorker.cpp @@ -1,4 +1,6 @@ #include + +#include #include #include #include @@ -6,9 +8,9 @@ #include #include #include -#include +#include #include -#include +#include #include #include #include @@ -43,21 +45,11 @@ namespace CurrentMetrics namespace DB { -namespace Setting -{ - extern const SettingsUInt64 backup_restore_batch_size_for_keeper_multiread; - extern const SettingsUInt64 backup_restore_keeper_max_retries; - extern const SettingsUInt64 backup_restore_keeper_retry_initial_backoff_ms; - extern const SettingsUInt64 backup_restore_keeper_retry_max_backoff_ms; - extern const SettingsUInt64 backup_restore_keeper_fault_injection_seed; - extern const SettingsFloat backup_restore_keeper_fault_injection_probability; -} namespace ErrorCodes { extern const int BAD_ARGUMENTS; extern const int LOGICAL_ERROR; - extern const int CONCURRENT_ACCESS_NOT_SUPPORTED; extern const int QUERY_WAS_CANCELLED; } @@ -66,102 +58,6 @@ namespace Stage = BackupCoordinationStage; namespace { - std::shared_ptr makeBackupCoordination(const ContextPtr & context, const BackupSettings & backup_settings, bool remote) - { - if (remote) - { - String root_zk_path = context->getConfigRef().getString("backups.zookeeper_path", "/clickhouse/backups"); - - auto get_zookeeper = [global_context = context->getGlobalContext()] { return global_context->getZooKeeper(); }; - - BackupCoordinationRemote::BackupKeeperSettings keeper_settings = WithRetries::KeeperSettings::fromContext(context); - - auto all_hosts = BackupSettings::Util::filterHostIDs( - backup_settings.cluster_host_ids, backup_settings.shard_num, backup_settings.replica_num); - - 
return std::make_shared( - get_zookeeper, - root_zk_path, - keeper_settings, - toString(*backup_settings.backup_uuid), - all_hosts, - backup_settings.host_id, - !backup_settings.deduplicate_files, - backup_settings.internal, - context->getProcessListElement()); - } - - return std::make_shared(!backup_settings.deduplicate_files); - } - - std::shared_ptr - makeRestoreCoordination(const ContextPtr & context, const RestoreSettings & restore_settings, bool remote) - { - if (remote) - { - String root_zk_path = context->getConfigRef().getString("backups.zookeeper_path", "/clickhouse/backups"); - - auto get_zookeeper = [global_context = context->getGlobalContext()] { return global_context->getZooKeeper(); }; - - RestoreCoordinationRemote::RestoreKeeperSettings keeper_settings - { - .keeper_max_retries = context->getSettingsRef()[Setting::backup_restore_keeper_max_retries], - .keeper_retry_initial_backoff_ms = context->getSettingsRef()[Setting::backup_restore_keeper_retry_initial_backoff_ms], - .keeper_retry_max_backoff_ms = context->getSettingsRef()[Setting::backup_restore_keeper_retry_max_backoff_ms], - .batch_size_for_keeper_multiread = context->getSettingsRef()[Setting::backup_restore_batch_size_for_keeper_multiread], - .keeper_fault_injection_probability = context->getSettingsRef()[Setting::backup_restore_keeper_fault_injection_probability], - .keeper_fault_injection_seed = context->getSettingsRef()[Setting::backup_restore_keeper_fault_injection_seed] - }; - - auto all_hosts = BackupSettings::Util::filterHostIDs( - restore_settings.cluster_host_ids, restore_settings.shard_num, restore_settings.replica_num); - - return std::make_shared( - get_zookeeper, - root_zk_path, - keeper_settings, - toString(*restore_settings.restore_uuid), - all_hosts, - restore_settings.host_id, - restore_settings.internal, - context->getProcessListElement()); - } - - return std::make_shared(); - } - - /// Sends information about an exception to IBackupCoordination or IRestoreCoordination. - template - void sendExceptionToCoordination(std::shared_ptr coordination, const Exception & exception) - { - try - { - if (coordination) - coordination->setError(exception); - } - catch (...) // NOLINT(bugprone-empty-catch) - { - } - } - - /// Sends information about the current exception to IBackupCoordination or IRestoreCoordination. - template - void sendCurrentExceptionToCoordination(std::shared_ptr coordination) - { - try - { - throw; - } - catch (const Exception & e) - { - sendExceptionToCoordination(coordination, e); - } - catch (...) - { - sendExceptionToCoordination(coordination, Exception(getCurrentExceptionMessageAndPattern(true, true), getCurrentExceptionCode())); - } - } - bool isFinishedSuccessfully(BackupStatus status) { return (status == BackupStatus::BACKUP_CREATED) || (status == BackupStatus::RESTORED); @@ -262,24 +158,27 @@ namespace /// while the thread pool is still occupied with the waiting task then a scheduled task can be never executed). enum class BackupsWorker::ThreadPoolId : uint8_t { - /// "BACKUP ON CLUSTER ASYNC" waits in background while "BACKUP ASYNC" is finished on the nodes of the cluster, then finalizes the backup. - BACKUP_ASYNC_ON_CLUSTER = 0, + /// Making a list of files to copy or copying those files. + BACKUP, - /// "BACKUP ASYNC" waits in background while all file infos are built and then it copies the backup's files. - BACKUP_ASYNC = 1, + /// Creating of tables and databases during RESTORE and filling them with data. 
+ RESTORE, - /// Making a list of files to copy and copying of those files is always sequential, so those operations can share one thread pool. - BACKUP_MAKE_FILES_LIST = 2, - BACKUP_COPY_FILES = BACKUP_MAKE_FILES_LIST, + /// We need background threads for ASYNC backups and restores. + ASYNC_BACKGROUND_BACKUP, + ASYNC_BACKGROUND_RESTORE, - /// "RESTORE ON CLUSTER ASYNC" waits in background while "BACKUP ASYNC" is finished on the nodes of the cluster, then finalizes the backup. - RESTORE_ASYNC_ON_CLUSTER = 3, + /// We need background threads for coordination workers (see BackgroundCoordinationStageSync). + ON_CLUSTER_COORDINATION_BACKUP, + ON_CLUSTER_COORDINATION_RESTORE, - /// "RESTORE ASYNC" waits in background while the data of all tables are restored. - RESTORE_ASYNC = 4, - - /// Restores from backups. - RESTORE = 5, + /// We need separate threads for internal backups and restores. + /// An internal backup is a helper backup invoked on some shard and replica by a BACKUP ON CLUSTER command, + /// (see BackupSettings.internal); and the same for restores. + ASYNC_BACKGROUND_INTERNAL_BACKUP, + ASYNC_BACKGROUND_INTERNAL_RESTORE, + ON_CLUSTER_COORDINATION_INTERNAL_BACKUP, + ON_CLUSTER_COORDINATION_INTERNAL_RESTORE, }; @@ -312,22 +211,26 @@ public: switch (thread_pool_id) { - case ThreadPoolId::BACKUP_ASYNC: - case ThreadPoolId::BACKUP_ASYNC_ON_CLUSTER: - case ThreadPoolId::BACKUP_COPY_FILES: + case ThreadPoolId::BACKUP: + case ThreadPoolId::ASYNC_BACKGROUND_BACKUP: + case ThreadPoolId::ON_CLUSTER_COORDINATION_BACKUP: + case ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_BACKUP: + case ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_BACKUP: { metric_threads = CurrentMetrics::BackupsThreads; metric_active_threads = CurrentMetrics::BackupsThreadsActive; metric_active_threads = CurrentMetrics::BackupsThreadsScheduled; max_threads = num_backup_threads; /// We don't use thread pool queues for thread pools with a lot of tasks otherwise that queue could be memory-wasting. - use_queue = (thread_pool_id != ThreadPoolId::BACKUP_COPY_FILES); + use_queue = (thread_pool_id != ThreadPoolId::BACKUP); break; } - case ThreadPoolId::RESTORE_ASYNC: - case ThreadPoolId::RESTORE_ASYNC_ON_CLUSTER: case ThreadPoolId::RESTORE: + case ThreadPoolId::ASYNC_BACKGROUND_RESTORE: + case ThreadPoolId::ON_CLUSTER_COORDINATION_RESTORE: + case ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_RESTORE: + case ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_RESTORE: { metric_threads = CurrentMetrics::RestoreThreads; metric_active_threads = CurrentMetrics::RestoreThreadsActive; @@ -352,12 +255,20 @@ public: void wait() { auto wait_sequence = { - ThreadPoolId::RESTORE_ASYNC_ON_CLUSTER, - ThreadPoolId::RESTORE_ASYNC, + /// ASYNC_BACKGROUND_BACKUP must be before ASYNC_BACKGROUND_INTERNAL_BACKUP, + /// ASYNC_BACKGROUND_RESTORE must be before ASYNC_BACKGROUND_INTERNAL_RESTORE, + /// and everything else is after those ones. 
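+        /// In other words, a pool whose tasks schedule or wait for work running in the other pools has to be drained before those other pools.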
+ ThreadPoolId::ASYNC_BACKGROUND_BACKUP, + ThreadPoolId::ASYNC_BACKGROUND_RESTORE, + ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_BACKUP, + ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_RESTORE, + /// Others: + ThreadPoolId::BACKUP, ThreadPoolId::RESTORE, - ThreadPoolId::BACKUP_ASYNC_ON_CLUSTER, - ThreadPoolId::BACKUP_ASYNC, - ThreadPoolId::BACKUP_COPY_FILES, + ThreadPoolId::ON_CLUSTER_COORDINATION_BACKUP, + ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_BACKUP, + ThreadPoolId::ON_CLUSTER_COORDINATION_RESTORE, + ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_RESTORE, }; for (auto thread_pool_id : wait_sequence) @@ -392,6 +303,7 @@ BackupsWorker::BackupsWorker(ContextMutablePtr global_context, size_t num_backup , log(getLogger("BackupsWorker")) , backup_log(global_context->getBackupLog()) , process_list(global_context->getProcessList()) + , concurrency_counters(std::make_unique()) { } @@ -405,7 +317,7 @@ ThreadPool & BackupsWorker::getThreadPool(ThreadPoolId thread_pool_id) } -OperationID BackupsWorker::start(const ASTPtr & backup_or_restore_query, ContextMutablePtr context) +std::pair BackupsWorker::start(const ASTPtr & backup_or_restore_query, ContextMutablePtr context) { const ASTBackupQuery & backup_query = typeid_cast(*backup_or_restore_query); if (backup_query.kind == ASTBackupQuery::Kind::BACKUP) @@ -414,180 +326,147 @@ OperationID BackupsWorker::start(const ASTPtr & backup_or_restore_query, Context } -OperationID BackupsWorker::startMakingBackup(const ASTPtr & query, const ContextPtr & context) +struct BackupsWorker::BackupStarter { - auto backup_query = std::static_pointer_cast(query->clone()); - auto backup_settings = BackupSettings::fromBackupQuery(*backup_query); - - auto backup_info = BackupInfo::fromAST(*backup_query->backup_name); - String backup_name_for_logging = backup_info.toStringForLogging(); - - if (!backup_settings.backup_uuid) - backup_settings.backup_uuid = UUIDHelpers::generateV4(); - - /// `backup_id` will be used as a key to the `infos` map, so it should be unique. - OperationID backup_id; - if (backup_settings.internal) - backup_id = "internal-" + toString(UUIDHelpers::generateV4()); /// Always generate `backup_id` for internal backup to avoid collision if both internal and non-internal backups are on the same host - else if (!backup_settings.id.empty()) - backup_id = backup_settings.id; - else - backup_id = toString(*backup_settings.backup_uuid); - + BackupsWorker & backups_worker; + std::shared_ptr backup_query; + ContextPtr query_context; /// We have to keep `query_context` until the end of the operation because a pointer to it is stored inside the ThreadGroup we're using. + ContextMutablePtr backup_context; + BackupSettings backup_settings; + BackupInfo backup_info; + String backup_id; + String backup_name_for_logging; + bool on_cluster; + bool is_internal_backup; std::shared_ptr backup_coordination; + ClusterPtr cluster; BackupMutablePtr backup; + std::shared_ptr process_list_element_holder; - /// Called in exception handlers below. This lambda function can be called on a separate thread, so it can't capture local variables by reference. 
- auto on_exception = [this](BackupMutablePtr & backup_, const OperationID & backup_id_, const String & backup_name_for_logging_, - const BackupSettings & backup_settings_, const std::shared_ptr & backup_coordination_) + BackupStarter(BackupsWorker & backups_worker_, const ASTPtr & query_, const ContextPtr & context_) + : backups_worker(backups_worker_) + , backup_query(std::static_pointer_cast(query_->clone())) + , query_context(context_) + , backup_context(Context::createCopy(query_context)) { - /// Something bad happened, the backup has not built. - tryLogCurrentException(log, fmt::format("Failed to make {} {}", (backup_settings_.internal ? "internal backup" : "backup"), backup_name_for_logging_)); - setStatusSafe(backup_id_, getBackupStatusFromCurrentException()); - sendCurrentExceptionToCoordination(backup_coordination_); + backup_context->makeQueryContext(); + backup_settings = BackupSettings::fromBackupQuery(*backup_query); + backup_info = BackupInfo::fromAST(*backup_query->backup_name); + backup_name_for_logging = backup_info.toStringForLogging(); + is_internal_backup = backup_settings.internal; + on_cluster = !backup_query->cluster.empty() || is_internal_backup; - if (backup_ && remove_backup_files_after_failure) - backup_->tryRemoveAllFiles(); - backup_.reset(); - }; + if (!backup_settings.backup_uuid) + backup_settings.backup_uuid = UUIDHelpers::generateV4(); + + /// `backup_id` will be used as a key to the `infos` map, so it should be unique. + if (is_internal_backup) + backup_id = "internal-" + toString(UUIDHelpers::generateV4()); /// Always generate `backup_id` for internal backup to avoid collision if both internal and non-internal backups are on the same host + else if (!backup_settings.id.empty()) + backup_id = backup_settings.id; + else + backup_id = toString(*backup_settings.backup_uuid); - try - { String base_backup_name; if (backup_settings.base_backup_info) base_backup_name = backup_settings.base_backup_info->toStringForLogging(); - addInfo(backup_id, + /// process_list_element_holder is used to make an element in ProcessList live while BACKUP is working asynchronously. + auto process_list_element = backup_context->getProcessListElement(); + if (process_list_element) + process_list_element_holder = process_list_element->getProcessListEntry(); + + backups_worker.addInfo(backup_id, backup_name_for_logging, base_backup_name, - context->getCurrentQueryId(), - backup_settings.internal, - context->getProcessListElement(), + backup_context->getCurrentQueryId(), + is_internal_backup, + process_list_element, BackupStatus::CREATING_BACKUP); + } - if (backup_settings.internal) + void doBackup() + { + chassert(!backup_coordination); + if (on_cluster && !is_internal_backup) { - /// The following call of makeBackupCoordination() is not essential because doBackup() will later create a backup coordination - /// if it's not created here. However to handle errors better it's better to make a coordination here because this way - /// if an exception will be thrown in startMakingBackup() other hosts will know about that. 
- backup_coordination = makeBackupCoordination(context, backup_settings, /* remote= */ true); + backup_query->cluster = backup_context->getMacros()->expand(backup_query->cluster); + cluster = backup_context->getCluster(backup_query->cluster); + backup_settings.cluster_host_ids = cluster->getHostIDs(); + } + backup_coordination = backups_worker.makeBackupCoordination(on_cluster, backup_settings, backup_context); + + chassert(!backup); + backup = backups_worker.openBackupForWriting(backup_info, backup_settings, backup_coordination, backup_context); + + backups_worker.doBackup( + backup, backup_query, backup_id, backup_name_for_logging, backup_settings, backup_coordination, backup_context, + on_cluster, cluster); + } + + void onException() + { + /// Something bad happened, the backup has not built. + tryLogCurrentException(backups_worker.log, fmt::format("Failed to make {} {}", + (is_internal_backup ? "internal backup" : "backup"), + backup_name_for_logging)); + + bool should_remove_files_in_backup = backup && !is_internal_backup && backups_worker.remove_backup_files_after_failure; + + if (backup && !backup->setIsCorrupted()) + should_remove_files_in_backup = false; + + if (backup_coordination && backup_coordination->trySetError(std::current_exception())) + { + bool other_hosts_finished = backup_coordination->tryWaitForOtherHostsToFinishAfterError(); + + if (should_remove_files_in_backup && other_hosts_finished) + backup->tryRemoveAllFiles(); + + backup_coordination->tryFinishAfterError(); } - /// Prepare context to use. - ContextPtr context_in_use = context; - ContextMutablePtr mutable_context; - bool on_cluster = !backup_query->cluster.empty(); - if (on_cluster || backup_settings.async) - { - /// We have to clone the query context here because: - /// if this is an "ON CLUSTER" query we need to change some settings, and - /// if this is an "ASYNC" query it's going to be executed in another thread. - context_in_use = mutable_context = Context::createCopy(context); - mutable_context->makeQueryContext(); - } + backups_worker.setStatusSafe(backup_id, getBackupStatusFromCurrentException()); + } +}; - if (backup_settings.async) - { - auto & thread_pool = getThreadPool(on_cluster ? ThreadPoolId::BACKUP_ASYNC_ON_CLUSTER : ThreadPoolId::BACKUP_ASYNC); - /// process_list_element_holder is used to make an element in ProcessList live while BACKUP is working asynchronously. - auto process_list_element = context_in_use->getProcessListElement(); +std::pair BackupsWorker::startMakingBackup(const ASTPtr & query, const ContextPtr & context) +{ + auto starter = std::make_shared(*this, query, context); - thread_pool.scheduleOrThrowOnError( - [this, - backup_query, - backup_id, - backup_name_for_logging, - backup_info, - backup_settings, - backup_coordination, - context_in_use, - mutable_context, - on_exception, - process_list_element_holder = process_list_element ? process_list_element->getProcessListEntry() : nullptr] + try + { + auto thread_pool_id = starter->is_internal_backup ? ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_BACKUP: ThreadPoolId::ASYNC_BACKGROUND_BACKUP; + String thread_name = starter->is_internal_backup ? 
"BackupAsyncInt" : "BackupAsync"; + auto schedule = threadPoolCallbackRunnerUnsafe(thread_pools->getThreadPool(thread_pool_id), thread_name); + + schedule([starter] + { + try { - BackupMutablePtr backup_async; - try - { - setThreadName("BackupWorker"); - CurrentThread::QueryScope query_scope(context_in_use); - doBackup( - backup_async, - backup_query, - backup_id, - backup_name_for_logging, - backup_info, - backup_settings, - backup_coordination, - context_in_use, - mutable_context); - } - catch (...) - { - on_exception(backup_async, backup_id, backup_name_for_logging, backup_settings, backup_coordination); - } - }); - } - else - { - doBackup( - backup, - backup_query, - backup_id, - backup_name_for_logging, - backup_info, - backup_settings, - backup_coordination, - context_in_use, - mutable_context); - } + starter->doBackup(); + } + catch (...) + { + starter->onException(); + } + }, + Priority{}); - return backup_id; + return {starter->backup_id, BackupStatus::CREATING_BACKUP}; } catch (...) { - on_exception(backup, backup_id, backup_name_for_logging, backup_settings, backup_coordination); + starter->onException(); throw; } } -void BackupsWorker::doBackup( - BackupMutablePtr & backup, - const std::shared_ptr & backup_query, - const OperationID & backup_id, - const String & backup_name_for_logging, - const BackupInfo & backup_info, - BackupSettings backup_settings, - std::shared_ptr backup_coordination, - const ContextPtr & context, - ContextMutablePtr mutable_context) +BackupMutablePtr BackupsWorker::openBackupForWriting(const BackupInfo & backup_info, const BackupSettings & backup_settings, std::shared_ptr backup_coordination, const ContextPtr & context) const { - bool on_cluster = !backup_query->cluster.empty(); - assert(!on_cluster || mutable_context); - - /// Checks access rights if this is not ON CLUSTER query. - /// (If this is ON CLUSTER query executeDDLQueryOnCluster() will check access rights later.) - auto required_access = BackupUtils::getRequiredAccessToBackup(backup_query->elements); - if (!on_cluster) - context->checkAccess(required_access); - - ClusterPtr cluster; - if (on_cluster) - { - backup_query->cluster = context->getMacros()->expand(backup_query->cluster); - cluster = context->getCluster(backup_query->cluster); - backup_settings.cluster_host_ids = cluster->getHostIDs(); - } - - /// Make a backup coordination. - if (!backup_coordination) - backup_coordination = makeBackupCoordination(context, backup_settings, /* remote= */ on_cluster); - - if (!allow_concurrent_backups && backup_coordination->hasConcurrentBackups(std::ref(num_active_backups))) - throw Exception(ErrorCodes::CONCURRENT_ACCESS_NOT_SUPPORTED, "Concurrent backups not supported, turn on setting 'allow_concurrent_backups'"); - - /// Opens a backup for writing. 
+ LOG_TRACE(log, "Opening backup for writing"); BackupFactory::CreateParams backup_create_params; backup_create_params.open_mode = IBackup::OpenMode::WRITE; backup_create_params.context = context; @@ -608,37 +487,57 @@ void BackupsWorker::doBackup( backup_create_params.azure_attempt_to_create_container = backup_settings.azure_attempt_to_create_container; backup_create_params.read_settings = getReadSettingsForBackup(context, backup_settings); backup_create_params.write_settings = getWriteSettingsForBackup(context); - backup = BackupFactory::instance().createBackup(backup_create_params); + auto backup = BackupFactory::instance().createBackup(backup_create_params); + LOG_INFO(log, "Opened backup for writing"); + return backup; +} + + +void BackupsWorker::doBackup( + BackupMutablePtr backup, + const std::shared_ptr & backup_query, + const OperationID & backup_id, + const String & backup_name_for_logging, + const BackupSettings & backup_settings, + std::shared_ptr backup_coordination, + ContextMutablePtr context, + bool on_cluster, + const ClusterPtr & cluster) +{ + bool is_internal_backup = backup_settings.internal; + + /// Checks access rights if this is not ON CLUSTER query. + /// (If this is ON CLUSTER query executeDDLQueryOnCluster() will check access rights later.) + auto required_access = BackupUtils::getRequiredAccessToBackup(backup_query->elements); + if (!on_cluster) + context->checkAccess(required_access); + + maybeSleepForTesting(); /// Write the backup. - if (on_cluster) + if (on_cluster && !is_internal_backup) { - DDLQueryOnClusterParams params; - params.cluster = cluster; - params.only_shard_num = backup_settings.shard_num; - params.only_replica_num = backup_settings.replica_num; - params.access_to_check = required_access; + /// Send the BACKUP query to other hosts. backup_settings.copySettingsToQuery(*backup_query); - - // executeDDLQueryOnCluster() will return without waiting for completion - mutable_context->setSetting("distributed_ddl_task_timeout", Field{0}); - mutable_context->setSetting("distributed_ddl_output_mode", Field{"none"}); - executeDDLQueryOnCluster(backup_query, mutable_context, params); + sendQueryToOtherHosts(*backup_query, cluster, backup_settings.shard_num, backup_settings.replica_num, + context, required_access, backup_coordination->getOnClusterInitializationKeeperRetriesInfo()); + backup_coordination->setBackupQueryWasSentToOtherHosts(); /// Wait until all the hosts have written their backup entries. - backup_coordination->waitForStage(Stage::COMPLETED); - backup_coordination->setStage(Stage::COMPLETED,""); + backup_coordination->waitForOtherHostsToFinish(); } else { backup_query->setCurrentDatabase(context->getCurrentDatabase()); + auto read_settings = getReadSettingsForBackup(context, backup_settings); + /// Prepare backup entries. 
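+        /// BackupEntriesCollector walks the databases and tables referenced by the query elements and produces the list of entries (files) that have to be written into the backup.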
BackupEntries backup_entries; { BackupEntriesCollector backup_entries_collector( backup_query->elements, backup_settings, backup_coordination, - backup_create_params.read_settings, context, getThreadPool(ThreadPoolId::BACKUP_MAKE_FILES_LIST)); + read_settings, context, getThreadPool(ThreadPoolId::BACKUP)); backup_entries = backup_entries_collector.run(); } @@ -646,11 +545,11 @@ void BackupsWorker::doBackup( chassert(backup); chassert(backup_coordination); chassert(context); - buildFileInfosForBackupEntries(backup, backup_entries, backup_create_params.read_settings, backup_coordination, context->getProcessListElement()); - writeBackupEntries(backup, std::move(backup_entries), backup_id, backup_coordination, backup_settings.internal, context->getProcessListElement()); + buildFileInfosForBackupEntries(backup, backup_entries, read_settings, backup_coordination, context->getProcessListElement()); + writeBackupEntries(backup, std::move(backup_entries), backup_id, backup_coordination, is_internal_backup, context->getProcessListElement()); - /// We have written our backup entries, we need to tell other hosts (they could be waiting for it). - backup_coordination->setStage(Stage::COMPLETED,""); + /// We have written our backup entries (there is no need to sync it with other hosts because it's the last stage). + backup_coordination->setStage(Stage::COMPLETED, "", /* sync = */ false); } size_t num_files = 0; @@ -660,9 +559,9 @@ void BackupsWorker::doBackup( UInt64 compressed_size = 0; /// Finalize backup (write its metadata). - if (!backup_settings.internal) + backup->finalizeWriting(); + if (!is_internal_backup) { - backup->finalizeWriting(); num_files = backup->getNumFiles(); total_size = backup->getTotalSize(); num_entries = backup->getNumEntries(); @@ -673,19 +572,22 @@ void BackupsWorker::doBackup( /// Close the backup. backup.reset(); - LOG_INFO(log, "{} {} was created successfully", (backup_settings.internal ? "Internal backup" : "Backup"), backup_name_for_logging); + /// The backup coordination is not needed anymore. + backup_coordination->finish(); + /// NOTE: we need to update metadata again after backup->finalizeWriting(), because backup metadata is written there. setNumFilesAndSize(backup_id, num_files, total_size, num_entries, uncompressed_size, compressed_size, 0, 0); + /// NOTE: setStatus is called after setNumFilesAndSize in order to have actual information in a backup log record + LOG_INFO(log, "{} {} was created successfully", (is_internal_backup ? 
"Internal backup" : "Backup"), backup_name_for_logging); setStatus(backup_id, BackupStatus::BACKUP_CREATED); } void BackupsWorker::buildFileInfosForBackupEntries(const BackupPtr & backup, const BackupEntries & backup_entries, const ReadSettings & read_settings, std::shared_ptr backup_coordination, QueryStatusPtr process_list_element) { - backup_coordination->setStage(Stage::BUILDING_FILE_INFOS, ""); - backup_coordination->waitForStage(Stage::BUILDING_FILE_INFOS); - backup_coordination->addFileInfos(::DB::buildFileInfosForBackupEntries(backup_entries, backup->getBaseBackup(), read_settings, getThreadPool(ThreadPoolId::BACKUP_MAKE_FILES_LIST), process_list_element)); + backup_coordination->setStage(Stage::BUILDING_FILE_INFOS, "", /* sync = */ true); + backup_coordination->addFileInfos(::DB::buildFileInfosForBackupEntries(backup_entries, backup->getBaseBackup(), read_settings, getThreadPool(ThreadPoolId::BACKUP), process_list_element)); } @@ -694,12 +596,11 @@ void BackupsWorker::writeBackupEntries( BackupEntries && backup_entries, const OperationID & backup_id, std::shared_ptr backup_coordination, - bool internal, + bool is_internal_backup, QueryStatusPtr process_list_element) { LOG_TRACE(log, "{}, num backup entries={}", Stage::WRITING_BACKUP, backup_entries.size()); - backup_coordination->setStage(Stage::WRITING_BACKUP, ""); - backup_coordination->waitForStage(Stage::WRITING_BACKUP); + backup_coordination->setStage(Stage::WRITING_BACKUP, "", /* sync = */ true); auto file_infos = backup_coordination->getFileInfos(); if (file_infos.size() != backup_entries.size()) @@ -715,7 +616,7 @@ void BackupsWorker::writeBackupEntries( std::atomic_bool failed = false; bool always_single_threaded = !backup->supportsWritingInMultipleThreads(); - auto & thread_pool = getThreadPool(ThreadPoolId::BACKUP_COPY_FILES); + auto & thread_pool = getThreadPool(ThreadPoolId::BACKUP); std::vector writing_order; if (test_randomize_order) @@ -751,7 +652,7 @@ void BackupsWorker::writeBackupEntries( maybeSleepForTesting(); // Update metadata - if (!internal) + if (!is_internal_backup) { setNumFilesAndSize( backup_id, @@ -783,142 +684,139 @@ void BackupsWorker::writeBackupEntries( } -OperationID BackupsWorker::startRestoring(const ASTPtr & query, ContextMutablePtr context) +struct BackupsWorker::RestoreStarter { - auto restore_query = std::static_pointer_cast(query->clone()); - auto restore_settings = RestoreSettings::fromRestoreQuery(*restore_query); - - auto backup_info = BackupInfo::fromAST(*restore_query->backup_name); - String backup_name_for_logging = backup_info.toStringForLogging(); - - if (!restore_settings.restore_uuid) - restore_settings.restore_uuid = UUIDHelpers::generateV4(); - - /// `restore_id` will be used as a key to the `infos` map, so it should be unique. - OperationID restore_id; - if (restore_settings.internal) - restore_id = "internal-" + toString(UUIDHelpers::generateV4()); /// Always generate `restore_id` for internal restore to avoid collision if both internal and non-internal restores are on the same host - else if (!restore_settings.id.empty()) - restore_id = restore_settings.id; - else - restore_id = toString(*restore_settings.restore_uuid); - + BackupsWorker & backups_worker; + std::shared_ptr restore_query; + ContextPtr query_context; /// We have to keep `query_context` until the end of the operation because a pointer to it is stored inside the ThreadGroup we're using. 
+ ContextMutablePtr restore_context; + RestoreSettings restore_settings; + BackupInfo backup_info; + String restore_id; + String backup_name_for_logging; + bool on_cluster; + bool is_internal_restore; std::shared_ptr restore_coordination; + ClusterPtr cluster; + std::shared_ptr process_list_element_holder; - /// Called in exception handlers below. This lambda function can be called on a separate thread, so it can't capture local variables by reference. - auto on_exception = [this](const OperationID & restore_id_, const String & backup_name_for_logging_, - const RestoreSettings & restore_settings_, const std::shared_ptr & restore_coordination_) + RestoreStarter(BackupsWorker & backups_worker_, const ASTPtr & query_, const ContextPtr & context_) + : backups_worker(backups_worker_) + , restore_query(std::static_pointer_cast(query_->clone())) + , query_context(context_) + , restore_context(Context::createCopy(query_context)) { - /// Something bad happened, some data were not restored. - tryLogCurrentException(log, fmt::format("Failed to restore from {} {}", (restore_settings_.internal ? "internal backup" : "backup"), backup_name_for_logging_)); - setStatusSafe(restore_id_, getRestoreStatusFromCurrentException()); - sendCurrentExceptionToCoordination(restore_coordination_); - }; + restore_context->makeQueryContext(); + restore_settings = RestoreSettings::fromRestoreQuery(*restore_query); + backup_info = BackupInfo::fromAST(*restore_query->backup_name); + backup_name_for_logging = backup_info.toStringForLogging(); + is_internal_restore = restore_settings.internal; + on_cluster = !restore_query->cluster.empty() || is_internal_restore; + + if (!restore_settings.restore_uuid) + restore_settings.restore_uuid = UUIDHelpers::generateV4(); + + /// `restore_id` will be used as a key to the `infos` map, so it should be unique. + if (is_internal_restore) + restore_id = "internal-" + toString(UUIDHelpers::generateV4()); /// Always generate `restore_id` for internal restore to avoid collision if both internal and non-internal restores are on the same host + else if (!restore_settings.id.empty()) + restore_id = restore_settings.id; + else + restore_id = toString(*restore_settings.restore_uuid); - try - { String base_backup_name; if (restore_settings.base_backup_info) base_backup_name = restore_settings.base_backup_info->toStringForLogging(); - addInfo(restore_id, + /// process_list_element_holder is used to make an element in ProcessList live while BACKUP is working asynchronously. + auto process_list_element = restore_context->getProcessListElement(); + if (process_list_element) + process_list_element_holder = process_list_element->getProcessListEntry(); + + backups_worker.addInfo(restore_id, backup_name_for_logging, base_backup_name, - context->getCurrentQueryId(), - restore_settings.internal, - context->getProcessListElement(), + restore_context->getCurrentQueryId(), + is_internal_restore, + process_list_element, BackupStatus::RESTORING); + } - if (restore_settings.internal) + void doRestore() + { + chassert(!restore_coordination); + if (on_cluster && !is_internal_restore) { - /// The following call of makeRestoreCoordination() is not essential because doRestore() will later create a restore coordination - /// if it's not created here. However to handle errors better it's better to make a coordination here because this way - /// if an exception will be thrown in startRestoring() other hosts will know about that. 
- restore_coordination = makeRestoreCoordination(context, restore_settings, /* remote= */ true); + restore_query->cluster = restore_context->getMacros()->expand(restore_query->cluster); + cluster = restore_context->getCluster(restore_query->cluster); + restore_settings.cluster_host_ids = cluster->getHostIDs(); + } + restore_coordination = backups_worker.makeRestoreCoordination(on_cluster, restore_settings, restore_context); + + backups_worker.doRestore( + restore_query, + restore_id, + backup_name_for_logging, + backup_info, + restore_settings, + restore_coordination, + restore_context, + on_cluster, + cluster); + } + + void onException() + { + /// Something bad happened, some data were not restored. + tryLogCurrentException(backups_worker.log, fmt::format("Failed to restore from {} {}", (is_internal_restore ? "internal backup" : "backup"), backup_name_for_logging)); + + if (restore_coordination && restore_coordination->trySetError(std::current_exception())) + { + restore_coordination->tryWaitForOtherHostsToFinishAfterError(); + restore_coordination->tryFinishAfterError(); } - /// Prepare context to use. - ContextMutablePtr context_in_use = context; - bool on_cluster = !restore_query->cluster.empty(); - if (restore_settings.async || on_cluster) - { - /// We have to clone the query context here because: - /// if this is an "ON CLUSTER" query we need to change some settings, and - /// if this is an "ASYNC" query it's going to be executed in another thread. - context_in_use = Context::createCopy(context); - context_in_use->makeQueryContext(); - } + backups_worker.setStatusSafe(restore_id, getRestoreStatusFromCurrentException()); + } +}; - if (restore_settings.async) - { - auto & thread_pool = getThreadPool(on_cluster ? ThreadPoolId::RESTORE_ASYNC_ON_CLUSTER : ThreadPoolId::RESTORE_ASYNC); - /// process_list_element_holder is used to make an element in ProcessList live while RESTORE is working asynchronously. - auto process_list_element = context_in_use->getProcessListElement(); +std::pair BackupsWorker::startRestoring(const ASTPtr & query, ContextMutablePtr context) +{ + auto starter = std::make_shared(*this, query, context); - thread_pool.scheduleOrThrowOnError( - [this, - restore_query, - restore_id, - backup_name_for_logging, - backup_info, - restore_settings, - restore_coordination, - context_in_use, - on_exception, - process_list_element_holder = process_list_element ? process_list_element->getProcessListEntry() : nullptr] + try + { + auto thread_pool_id = starter->is_internal_restore ? ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_RESTORE : ThreadPoolId::ASYNC_BACKGROUND_RESTORE; + String thread_name = starter->is_internal_restore ? "RestoreAsyncInt" : "RestoreAsync"; + auto schedule = threadPoolCallbackRunnerUnsafe(thread_pools->getThreadPool(thread_pool_id), thread_name); + + schedule([starter] + { + try { - try - { - setThreadName("RestorerWorker"); - CurrentThread::QueryScope query_scope(context_in_use); - doRestore( - restore_query, - restore_id, - backup_name_for_logging, - backup_info, - restore_settings, - restore_coordination, - context_in_use); - } - catch (...) - { - on_exception(restore_id, backup_name_for_logging, restore_settings, restore_coordination); - } - }); - } - else - { - doRestore( - restore_query, - restore_id, - backup_name_for_logging, - backup_info, - restore_settings, - restore_coordination, - context_in_use); - } + starter->doRestore(); + } + catch (...) 
+ { + starter->onException(); + } + }, + Priority{}); - return restore_id; + return {starter->restore_id, BackupStatus::RESTORING}; } catch (...) { - on_exception(restore_id, backup_name_for_logging, restore_settings, restore_coordination); + starter->onException(); throw; } } -void BackupsWorker::doRestore( - const std::shared_ptr & restore_query, - const OperationID & restore_id, - const String & backup_name_for_logging, - const BackupInfo & backup_info, - RestoreSettings restore_settings, - std::shared_ptr restore_coordination, - ContextMutablePtr context) +BackupPtr BackupsWorker::openBackupForReading(const BackupInfo & backup_info, const RestoreSettings & restore_settings, const ContextPtr & context) const { - /// Open the backup for reading. + LOG_TRACE(log, "Opening backup for reading"); BackupFactory::CreateParams backup_open_params; backup_open_params.open_mode = IBackup::OpenMode::READ; backup_open_params.context = context; @@ -931,32 +829,35 @@ void BackupsWorker::doRestore( backup_open_params.read_settings = getReadSettingsForRestore(context); backup_open_params.write_settings = getWriteSettingsForRestore(context); backup_open_params.is_internal_backup = restore_settings.internal; - BackupPtr backup = BackupFactory::instance().createBackup(backup_open_params); + auto backup = BackupFactory::instance().createBackup(backup_open_params); + LOG_TRACE(log, "Opened backup for reading"); + return backup; +} + + +void BackupsWorker::doRestore( + const std::shared_ptr & restore_query, + const OperationID & restore_id, + const String & backup_name_for_logging, + const BackupInfo & backup_info, + RestoreSettings restore_settings, + std::shared_ptr restore_coordination, + ContextMutablePtr context, + bool on_cluster, + const ClusterPtr & cluster) +{ + bool is_internal_restore = restore_settings.internal; + + maybeSleepForTesting(); + + /// Open the backup for reading. + BackupPtr backup = openBackupForReading(backup_info, restore_settings, context); String current_database = context->getCurrentDatabase(); + /// Checks access rights if this is ON CLUSTER query. /// (If this isn't ON CLUSTER query RestorerFromBackup will check access rights later.) - ClusterPtr cluster; - bool on_cluster = !restore_query->cluster.empty(); - - if (on_cluster) - { - restore_query->cluster = context->getMacros()->expand(restore_query->cluster); - cluster = context->getCluster(restore_query->cluster); - restore_settings.cluster_host_ids = cluster->getHostIDs(); - } - - /// Make a restore coordination. - if (!restore_coordination) - restore_coordination = makeRestoreCoordination(context, restore_settings, /* remote= */ on_cluster); - - if (!allow_concurrent_restores && restore_coordination->hasConcurrentRestores(std::ref(num_active_restores))) - throw Exception( - ErrorCodes::CONCURRENT_ACCESS_NOT_SUPPORTED, - "Concurrent restores not supported, turn on setting 'allow_concurrent_restores'"); - - - if (on_cluster) + if (on_cluster && !is_internal_restore) { /// We cannot just use access checking provided by the function executeDDLQueryOnCluster(): it would be incorrect /// because different replicas can contain different set of tables and so the required access rights can differ too. @@ -975,27 +876,21 @@ void BackupsWorker::doRestore( } /// Do RESTORE. 
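+    /// A non-internal ON CLUSTER restore forwards the query to the other hosts and waits for them to finish; otherwise the restore is performed locally by RestorerFromBackup.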
- if (on_cluster) + if (on_cluster && !is_internal_restore) { - - DDLQueryOnClusterParams params; - params.cluster = cluster; - params.only_shard_num = restore_settings.shard_num; - params.only_replica_num = restore_settings.replica_num; + /// Send the RESTORE query to other hosts. restore_settings.copySettingsToQuery(*restore_query); + sendQueryToOtherHosts(*restore_query, cluster, restore_settings.shard_num, restore_settings.replica_num, + context, {}, restore_coordination->getOnClusterInitializationKeeperRetriesInfo()); + restore_coordination->setRestoreQueryWasSentToOtherHosts(); - // executeDDLQueryOnCluster() will return without waiting for completion - context->setSetting("distributed_ddl_task_timeout", Field{0}); - context->setSetting("distributed_ddl_output_mode", Field{"none"}); - - executeDDLQueryOnCluster(restore_query, context, params); - - /// Wait until all the hosts have written their backup entries. - restore_coordination->waitForStage(Stage::COMPLETED); - restore_coordination->setStage(Stage::COMPLETED,""); + /// Wait until all the hosts have done with their restoring work. + restore_coordination->waitForOtherHostsToFinish(); } else { + maybeSleepForTesting(); + restore_query->setCurrentDatabase(current_database); auto after_task_callback = [&] @@ -1011,11 +906,115 @@ void BackupsWorker::doRestore( restorer.run(RestorerFromBackup::RESTORE); } - LOG_INFO(log, "Restored from {} {} successfully", (restore_settings.internal ? "internal backup" : "backup"), backup_name_for_logging); + /// The restore coordination is not needed anymore. + restore_coordination->finish(); + + LOG_INFO(log, "Restored from {} {} successfully", (is_internal_restore ? "internal backup" : "backup"), backup_name_for_logging); setStatus(restore_id, BackupStatus::RESTORED); } +void BackupsWorker::sendQueryToOtherHosts(const ASTBackupQuery & backup_or_restore_query, const ClusterPtr & cluster, + size_t only_shard_num, size_t only_replica_num, ContextMutablePtr context, const AccessRightsElements & access_to_check, + const ZooKeeperRetriesInfo & retries_info) const +{ + chassert(cluster); + + DDLQueryOnClusterParams params; + params.cluster = cluster; + params.only_shard_num = only_shard_num; + params.only_replica_num = only_replica_num; + params.access_to_check = access_to_check; + params.retries_info = retries_info; + + context->setSetting("distributed_ddl_task_timeout", Field{0}); + context->setSetting("distributed_ddl_output_mode", Field{"never_throw"}); + + // executeDDLQueryOnCluster() will return without waiting for completion + executeDDLQueryOnCluster(backup_or_restore_query.clone(), context, params); + + maybeSleepForTesting(); +} + + +std::shared_ptr +BackupsWorker::makeBackupCoordination(bool on_cluster, const BackupSettings & backup_settings, const ContextPtr & context) const +{ + if (!on_cluster) + { + return std::make_shared( + *backup_settings.backup_uuid, !backup_settings.deduplicate_files, allow_concurrent_backups, *concurrency_counters); + } + + bool is_internal_backup = backup_settings.internal; + + String root_zk_path = context->getConfigRef().getString("backups.zookeeper_path", "/clickhouse/backups"); + auto get_zookeeper = [global_context = context->getGlobalContext()] { return global_context->getZooKeeper(); }; + auto keeper_settings = BackupKeeperSettings::fromContext(context); + + auto all_hosts = BackupSettings::Util::filterHostIDs( + backup_settings.cluster_host_ids, backup_settings.shard_num, backup_settings.replica_num); + 
all_hosts.emplace_back(BackupCoordinationOnCluster::kInitiator); + + String current_host = is_internal_backup ? backup_settings.host_id : String{BackupCoordinationOnCluster::kInitiator}; + + auto thread_pool_id = is_internal_backup ? ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_BACKUP : ThreadPoolId::ON_CLUSTER_COORDINATION_BACKUP; + String thread_name = is_internal_backup ? "BackupCoordInt" : "BackupCoord"; + auto schedule = threadPoolCallbackRunnerUnsafe(thread_pools->getThreadPool(thread_pool_id), thread_name); + + return std::make_shared( + *backup_settings.backup_uuid, + !backup_settings.deduplicate_files, + root_zk_path, + get_zookeeper, + keeper_settings, + current_host, + all_hosts, + allow_concurrent_backups, + *concurrency_counters, + schedule, + context->getProcessListElement()); +} + +std::shared_ptr +BackupsWorker::makeRestoreCoordination(bool on_cluster, const RestoreSettings & restore_settings, const ContextPtr & context) const +{ + if (!on_cluster) + { + return std::make_shared( + *restore_settings.restore_uuid, allow_concurrent_restores, *concurrency_counters); + } + + bool is_internal_restore = restore_settings.internal; + + String root_zk_path = context->getConfigRef().getString("backups.zookeeper_path", "/clickhouse/backups"); + auto get_zookeeper = [global_context = context->getGlobalContext()] { return global_context->getZooKeeper(); }; + auto keeper_settings = BackupKeeperSettings::fromContext(context); + + auto all_hosts = BackupSettings::Util::filterHostIDs( + restore_settings.cluster_host_ids, restore_settings.shard_num, restore_settings.replica_num); + all_hosts.emplace_back(BackupCoordinationOnCluster::kInitiator); + + String current_host = is_internal_restore ? restore_settings.host_id : String{RestoreCoordinationOnCluster::kInitiator}; + + auto thread_pool_id = is_internal_restore ? ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_RESTORE : ThreadPoolId::ON_CLUSTER_COORDINATION_RESTORE; + String thread_name = is_internal_restore ? "RestoreCoordInt" : "RestoreCoord"; + auto schedule = threadPoolCallbackRunnerUnsafe(thread_pools->getThreadPool(thread_pool_id), thread_name); + + return std::make_shared( + *restore_settings.restore_uuid, + root_zk_path, + get_zookeeper, + keeper_settings, + current_host, + all_hosts, + allow_concurrent_restores, + *concurrency_counters, + schedule, + context->getProcessListElement()); +} + + void BackupsWorker::addInfo(const OperationID & id, const String & name, const String & base_backup_name, const String & query_id, bool internal, QueryStatusPtr process_list_element, BackupStatus status) { @@ -1135,23 +1134,25 @@ void BackupsWorker::maybeSleepForTesting() const } -void BackupsWorker::wait(const OperationID & backup_or_restore_id, bool rethrow_exception) +BackupStatus BackupsWorker::wait(const OperationID & backup_or_restore_id, bool rethrow_exception) { std::unique_lock lock{infos_mutex}; + BackupStatus current_status; status_changed.wait(lock, [&] { auto it = infos.find(backup_or_restore_id); if (it == infos.end()) throw Exception(ErrorCodes::LOGICAL_ERROR, "Unknown backup ID {}", backup_or_restore_id); const auto & info = it->second.info; - auto current_status = info.status; + current_status = info.status; if (rethrow_exception && isFailedOrCancelled(current_status)) std::rethrow_exception(info.exception); if (isFinalStatus(current_status)) return true; - LOG_INFO(log, "Waiting {} {}", isBackupStatus(info.status) ? "backup" : "restore", info.name); + LOG_INFO(log, "Waiting {} {} to complete", isBackupStatus(current_status) ? 
"backup" : "restore", info.name); return false; }); + return current_status; } void BackupsWorker::waitAll() @@ -1175,9 +1176,11 @@ void BackupsWorker::waitAll() LOG_INFO(log, "Backups and restores finished"); } -void BackupsWorker::cancel(const BackupOperationID & backup_or_restore_id, bool wait_) +BackupStatus BackupsWorker::cancel(const BackupOperationID & backup_or_restore_id, bool wait_) { QueryStatusPtr process_list_element; + BackupStatus current_status; + { std::unique_lock lock{infos_mutex}; auto it = infos.find(backup_or_restore_id); @@ -1186,17 +1189,20 @@ void BackupsWorker::cancel(const BackupOperationID & backup_or_restore_id, bool const auto & extended_info = it->second; const auto & info = extended_info.info; - if (isFinalStatus(info.status) || !extended_info.process_list_element) - return; + current_status = info.status; + if (isFinalStatus(current_status) || !extended_info.process_list_element) + return current_status; - LOG_INFO(log, "Cancelling {} {}", isBackupStatus(info.status) ? "backup" : "restore", info.name); + LOG_INFO(log, "Cancelling {} {}", isBackupStatus(current_status) ? "backup" : "restore", info.name); process_list_element = extended_info.process_list_element; } process_list.sendCancelToQuery(process_list_element); - if (wait_) - wait(backup_or_restore_id, /* rethrow_exception= */ false); + if (!wait_) + return current_status; + + return wait(backup_or_restore_id, /* rethrow_exception= */ false); } diff --git a/src/Backups/BackupsWorker.h b/src/Backups/BackupsWorker.h index 946562b575f..37f91e269a9 100644 --- a/src/Backups/BackupsWorker.h +++ b/src/Backups/BackupsWorker.h @@ -23,6 +23,7 @@ using BackupMutablePtr = std::shared_ptr; using BackupPtr = std::shared_ptr; class IBackupEntry; using BackupEntries = std::vector>>; +class BackupConcurrencyCounters; using DataRestoreTasks = std::vector>; struct ReadSettings; class BackupLog; @@ -31,6 +32,10 @@ using ThreadGroupPtr = std::shared_ptr; class QueryStatus; using QueryStatusPtr = std::shared_ptr; class ProcessList; +class Cluster; +using ClusterPtr = std::shared_ptr; +class AccessRightsElements; +struct ZooKeeperRetriesInfo; /// Manager of backups and restores: executes backups and restores' threads in the background. @@ -47,18 +52,18 @@ public: /// Starts executing a BACKUP or RESTORE query. Returns ID of the operation. /// For asynchronous operations the function throws no exceptions on failure usually, /// call getInfo() on a returned operation id to check for errors. - BackupOperationID start(const ASTPtr & backup_or_restore_query, ContextMutablePtr context); + std::pair start(const ASTPtr & backup_or_restore_query, ContextMutablePtr context); /// Waits until the specified backup or restore operation finishes or stops. /// The function returns immediately if the operation is already finished. - void wait(const BackupOperationID & backup_or_restore_id, bool rethrow_exception = true); + BackupStatus wait(const BackupOperationID & backup_or_restore_id, bool rethrow_exception = true); /// Waits until all running backup and restore operations finish or stop. void waitAll(); /// Cancels the specified backup or restore operation. /// The function does nothing if this operation has already finished. - void cancel(const BackupOperationID & backup_or_restore_id, bool wait_ = true); + BackupStatus cancel(const BackupOperationID & backup_or_restore_id, bool wait_ = true); /// Cancels all running backup and restore operations. 
void cancelAll(bool wait_ = true); @@ -67,26 +72,32 @@ public: std::vector getAllInfos() const; private: - BackupOperationID startMakingBackup(const ASTPtr & query, const ContextPtr & context); + std::pair startMakingBackup(const ASTPtr & query, const ContextPtr & context); + struct BackupStarter; + + BackupMutablePtr openBackupForWriting(const BackupInfo & backup_info, const BackupSettings & backup_settings, std::shared_ptr backup_coordination, const ContextPtr & context) const; void doBackup( - BackupMutablePtr & backup, + BackupMutablePtr backup, const std::shared_ptr & backup_query, const BackupOperationID & backup_id, const String & backup_name_for_logging, - const BackupInfo & backup_info, - BackupSettings backup_settings, + const BackupSettings & backup_settings, std::shared_ptr backup_coordination, - const ContextPtr & context, - ContextMutablePtr mutable_context); + ContextMutablePtr context, + bool on_cluster, + const ClusterPtr & cluster); /// Builds file infos for specified backup entries. void buildFileInfosForBackupEntries(const BackupPtr & backup, const BackupEntries & backup_entries, const ReadSettings & read_settings, std::shared_ptr backup_coordination, QueryStatusPtr process_list_element); /// Write backup entries to an opened backup. - void writeBackupEntries(BackupMutablePtr backup, BackupEntries && backup_entries, const BackupOperationID & backup_id, std::shared_ptr backup_coordination, bool internal, QueryStatusPtr process_list_element); + void writeBackupEntries(BackupMutablePtr backup, BackupEntries && backup_entries, const BackupOperationID & backup_id, std::shared_ptr backup_coordination, bool is_internal_backup, QueryStatusPtr process_list_element); - BackupOperationID startRestoring(const ASTPtr & query, ContextMutablePtr context); + std::pair startRestoring(const ASTPtr & query, ContextMutablePtr context); + struct RestoreStarter; + + BackupPtr openBackupForReading(const BackupInfo & backup_info, const RestoreSettings & restore_settings, const ContextPtr & context) const; void doRestore( const std::shared_ptr & restore_query, @@ -95,7 +106,17 @@ private: const BackupInfo & backup_info, RestoreSettings restore_settings, std::shared_ptr restore_coordination, - ContextMutablePtr context); + ContextMutablePtr context, + bool on_cluster, + const ClusterPtr & cluster); + + std::shared_ptr makeBackupCoordination(bool on_cluster, const BackupSettings & backup_settings, const ContextPtr & context) const; + std::shared_ptr makeRestoreCoordination(bool on_cluster, const RestoreSettings & restore_settings, const ContextPtr & context) const; + + /// Sends a BACKUP or RESTORE query to other hosts. + void sendQueryToOtherHosts(const ASTBackupQuery & backup_or_restore_query, const ClusterPtr & cluster, + size_t only_shard_num, size_t only_replica_num, ContextMutablePtr context, const AccessRightsElements & access_to_check, + const ZooKeeperRetriesInfo & retries_info) const; /// Run data restoring tasks which insert data to tables. 
void restoreTablesData(const BackupOperationID & restore_id, BackupPtr backup, DataRestoreTasks && tasks, ThreadPool & thread_pool, QueryStatusPtr process_list_element); @@ -139,6 +160,8 @@ private: std::shared_ptr backup_log; ProcessList & process_list; + + std::unique_ptr concurrency_counters; }; } diff --git a/src/Backups/IBackup.h b/src/Backups/IBackup.h index 0aa2d34657f..126b4d764da 100644 --- a/src/Backups/IBackup.h +++ b/src/Backups/IBackup.h @@ -121,8 +121,13 @@ public: /// Finalizes writing the backup, should be called after all entries have been successfully written. virtual void finalizeWriting() = 0; - /// Try to remove all files copied to the backup. Used after an exception or it the backup was cancelled. - virtual void tryRemoveAllFiles() = 0; + /// Sets that a non-retriable error happened while the backup was being written which means that + /// the backup is most likely corrupted and it can't be finalized. + /// This function is called while handling an exception or if the backup was cancelled. + virtual bool setIsCorrupted() noexcept = 0; + + /// Try to remove all files copied to the backup. Could be used after setIsCorrupted(). + virtual bool tryRemoveAllFiles() noexcept = 0; }; using BackupPtr = std::shared_ptr; diff --git a/src/Backups/IBackupCoordination.h b/src/Backups/IBackupCoordination.h index 166a2c5bbbc..c0eb90de89b 100644 --- a/src/Backups/IBackupCoordination.h +++ b/src/Backups/IBackupCoordination.h @@ -5,26 +5,44 @@ namespace DB { -class Exception; struct BackupFileInfo; using BackupFileInfos = std::vector; enum class AccessEntityType : uint8_t; enum class UserDefinedSQLObjectType : uint8_t; +struct ZooKeeperRetriesInfo; /// Replicas use this class to coordinate what they're writing to a backup while executing BACKUP ON CLUSTER. -/// There are two implementation of this interface: BackupCoordinationLocal and BackupCoordinationRemote. +/// There are two implementations of this interface: BackupCoordinationLocal and BackupCoordinationOnCluster. /// BackupCoordinationLocal is used while executing BACKUP without ON CLUSTER and performs coordination in memory. -/// BackupCoordinationRemote is used while executing BACKUP with ON CLUSTER and performs coordination via ZooKeeper. +/// BackupCoordinationOnCluster is used while executing BACKUP with ON CLUSTER and performs coordination via ZooKeeper. class IBackupCoordination { public: virtual ~IBackupCoordination() = default; /// Sets the current stage and waits for other hosts to come to this stage too. - virtual void setStage(const String & new_stage, const String & message) = 0; - virtual void setError(const Exception & exception) = 0; - virtual Strings waitForStage(const String & stage_to_wait) = 0; - virtual Strings waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) = 0; + virtual Strings setStage(const String & new_stage, const String & message, bool sync) = 0; + + /// Sets that the backup query was sent to other hosts. + /// Function waitForOtherHostsToFinish() will check that to find out if it should really wait or not. + virtual void setBackupQueryWasSentToOtherHosts() = 0; + + /// Lets other hosts know that the current host has encountered an error. + virtual bool trySetError(std::exception_ptr exception) = 0; + + /// Lets other hosts know that the current host has finished its work. + virtual void finish() = 0; + + /// Lets other hosts know that the current host has finished its work (as a part of error-handling process). 
+ virtual bool tryFinishAfterError() noexcept = 0; + + /// Waits until all the other hosts finish their work. + /// Stops waiting and throws an exception if another host encounters an error or if some host gets cancelled. + virtual void waitForOtherHostsToFinish() = 0; + + /// Waits until all the other hosts finish their work (as a part of error-handling process). + /// Doesn't stop waiting if some host encounters an error or gets cancelled. + virtual bool tryWaitForOtherHostsToFinishAfterError() noexcept = 0; struct PartNameAndChecksum { @@ -87,9 +105,7 @@ public: /// Starts writing a specified file, the function returns false if that file is already being written concurrently. virtual bool startWritingFile(size_t data_file_index) = 0; - /// This function is used to check if concurrent backups are running - /// other than the backup passed to the function - virtual bool hasConcurrentBackups(const std::atomic & num_active_backups) const = 0; + virtual ZooKeeperRetriesInfo getOnClusterInitializationKeeperRetriesInfo() const = 0; }; } diff --git a/src/Backups/IRestoreCoordination.h b/src/Backups/IRestoreCoordination.h index 37229534286..daabf1745f3 100644 --- a/src/Backups/IRestoreCoordination.h +++ b/src/Backups/IRestoreCoordination.h @@ -5,26 +5,42 @@ namespace DB { -class Exception; enum class UserDefinedSQLObjectType : uint8_t; class ASTCreateQuery; +struct ZooKeeperRetriesInfo; /// Replicas use this class to coordinate what they're reading from a backup while executing RESTORE ON CLUSTER. -/// There are two implementation of this interface: RestoreCoordinationLocal and RestoreCoordinationRemote. +/// There are two implementations of this interface: RestoreCoordinationLocal and RestoreCoordinationOnCluster. /// RestoreCoordinationLocal is used while executing RESTORE without ON CLUSTER and performs coordination in memory. -/// RestoreCoordinationRemote is used while executing RESTORE with ON CLUSTER and performs coordination via ZooKeeper. +/// RestoreCoordinationOnCluster is used while executing RESTORE with ON CLUSTER and performs coordination via ZooKeeper. class IRestoreCoordination { public: virtual ~IRestoreCoordination() = default; /// Sets the current stage and waits for other hosts to come to this stage too. - virtual void setStage(const String & new_stage, const String & message) = 0; - virtual void setError(const Exception & exception) = 0; - virtual Strings waitForStage(const String & stage_to_wait) = 0; - virtual Strings waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) = 0; + virtual Strings setStage(const String & new_stage, const String & message, bool sync) = 0; - static constexpr const char * kErrorStatus = "error"; + /// Sets that the restore query was sent to other hosts. + /// Function waitForOtherHostsToFinish() will check that to find out if it should really wait or not. + virtual void setRestoreQueryWasSentToOtherHosts() = 0; + + /// Lets other hosts know that the current host has encountered an error. + virtual bool trySetError(std::exception_ptr exception) = 0; + + /// Lets other hosts know that the current host has finished its work. + virtual void finish() = 0; + + /// Lets other hosts know that the current host has finished its work (as a part of error-handling process). + virtual bool tryFinishAfterError() noexcept = 0; + + /// Waits until all the other hosts finish their work. + /// Stops waiting and throws an exception if another host encounters an error or if some host gets cancelled. 
+ virtual void waitForOtherHostsToFinish() = 0; + + /// Waits until all the other hosts finish their work (as a part of error-handling process). + /// Doesn't stop waiting if some host encounters an error or gets cancelled. + virtual bool tryWaitForOtherHostsToFinishAfterError() noexcept = 0; /// Starts creating a table in a replicated database. Returns false if there is another host which is already creating this table. virtual bool acquireCreatingTableInReplicatedDatabase(const String & database_zk_path, const String & table_name) = 0; @@ -49,9 +65,7 @@ public: /// (because otherwise the macro "{uuid}" in the ZooKeeper path will not work correctly). virtual void generateUUIDForTable(ASTCreateQuery & create_query) = 0; - /// This function is used to check if concurrent restores are running - /// other than the restore passed to the function - virtual bool hasConcurrentRestores(const std::atomic & num_active_restores) const = 0; + virtual ZooKeeperRetriesInfo getOnClusterInitializationKeeperRetriesInfo() const = 0; }; } diff --git a/src/Backups/RestoreCoordinationLocal.cpp b/src/Backups/RestoreCoordinationLocal.cpp index 9fe22f874b4..569f58f1909 100644 --- a/src/Backups/RestoreCoordinationLocal.cpp +++ b/src/Backups/RestoreCoordinationLocal.cpp @@ -1,32 +1,24 @@ #include + #include #include +#include #include namespace DB { -RestoreCoordinationLocal::RestoreCoordinationLocal() : log(getLogger("RestoreCoordinationLocal")) +RestoreCoordinationLocal::RestoreCoordinationLocal( + const UUID & restore_uuid, bool allow_concurrent_restore_, BackupConcurrencyCounters & concurrency_counters_) + : log(getLogger("RestoreCoordinationLocal")) + , concurrency_check(restore_uuid, /* is_restore = */ true, /* on_cluster = */ false, allow_concurrent_restore_, concurrency_counters_) { } RestoreCoordinationLocal::~RestoreCoordinationLocal() = default; -void RestoreCoordinationLocal::setStage(const String &, const String &) -{ -} - -void RestoreCoordinationLocal::setError(const Exception &) -{ -} - -Strings RestoreCoordinationLocal::waitForStage(const String &) -{ - return {}; -} - -Strings RestoreCoordinationLocal::waitForStage(const String &, std::chrono::milliseconds) +ZooKeeperRetriesInfo RestoreCoordinationLocal::getOnClusterInitializationKeeperRetriesInfo() const { return {}; } @@ -63,7 +55,7 @@ void RestoreCoordinationLocal::generateUUIDForTable(ASTCreateQuery & create_quer { String query_str = serializeAST(create_query); - auto find_in_map = [&] + auto find_in_map = [&]() TSA_REQUIRES(mutex) { auto it = create_query_uuids.find(query_str); if (it != create_query_uuids.end()) @@ -91,14 +83,4 @@ void RestoreCoordinationLocal::generateUUIDForTable(ASTCreateQuery & create_quer } } -bool RestoreCoordinationLocal::hasConcurrentRestores(const std::atomic & num_active_restores) const -{ - if (num_active_restores > 1) - { - LOG_WARNING(log, "Found concurrent backups: num_active_restores={}", num_active_restores); - return true; - } - return false; -} - } diff --git a/src/Backups/RestoreCoordinationLocal.h b/src/Backups/RestoreCoordinationLocal.h index 35f93574b68..6be357c4b7e 100644 --- a/src/Backups/RestoreCoordinationLocal.h +++ b/src/Backups/RestoreCoordinationLocal.h @@ -1,6 +1,7 @@ #pragma once #include +#include #include #include #include @@ -12,19 +13,20 @@ namespace DB { class ASTCreateQuery; - /// Implementation of the IRestoreCoordination interface performing coordination in memory. 
class RestoreCoordinationLocal : public IRestoreCoordination { public: - RestoreCoordinationLocal(); + RestoreCoordinationLocal(const UUID & restore_uuid_, bool allow_concurrent_restore_, BackupConcurrencyCounters & concurrency_counters_); ~RestoreCoordinationLocal() override; - /// Sets the current stage and waits for other hosts to come to this stage too. - void setStage(const String & new_stage, const String & message) override; - void setError(const Exception & exception) override; - Strings waitForStage(const String & stage_to_wait) override; - Strings waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) override; + Strings setStage(const String &, const String &, bool) override { return {}; } + void setRestoreQueryWasSentToOtherHosts() override {} + bool trySetError(std::exception_ptr) override { return true; } + void finish() override {} + bool tryFinishAfterError() noexcept override { return true; } + void waitForOtherHostsToFinish() override {} + bool tryWaitForOtherHostsToFinishAfterError() noexcept override { return true; } /// Starts creating a table in a replicated database. Returns false if there is another host which is already creating this table. bool acquireCreatingTableInReplicatedDatabase(const String & database_zk_path, const String & table_name) override; @@ -49,15 +51,16 @@ public: /// (because otherwise the macro "{uuid}" in the ZooKeeper path will not work correctly). void generateUUIDForTable(ASTCreateQuery & create_query) override; - bool hasConcurrentRestores(const std::atomic & num_active_restores) const override; + ZooKeeperRetriesInfo getOnClusterInitializationKeeperRetriesInfo() const override; private: LoggerPtr const log; + BackupConcurrencyCheck concurrency_check; - std::set> acquired_tables_in_replicated_databases; - std::unordered_set acquired_data_in_replicated_tables; - std::unordered_map create_query_uuids; - std::unordered_set acquired_data_in_keeper_map_tables; + std::set> acquired_tables_in_replicated_databases TSA_GUARDED_BY(mutex); + std::unordered_set acquired_data_in_replicated_tables TSA_GUARDED_BY(mutex); + std::unordered_map create_query_uuids TSA_GUARDED_BY(mutex); + std::unordered_set acquired_data_in_keeper_map_tables TSA_GUARDED_BY(mutex); mutable std::mutex mutex; }; diff --git a/src/Backups/RestoreCoordinationOnCluster.cpp b/src/Backups/RestoreCoordinationOnCluster.cpp new file mode 100644 index 00000000000..2029ad8b072 --- /dev/null +++ b/src/Backups/RestoreCoordinationOnCluster.cpp @@ -0,0 +1,318 @@ +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +namespace DB +{ + +RestoreCoordinationOnCluster::RestoreCoordinationOnCluster( + const UUID & restore_uuid_, + const String & root_zookeeper_path_, + zkutil::GetZooKeeper get_zookeeper_, + const BackupKeeperSettings & keeper_settings_, + const String & current_host_, + const Strings & all_hosts_, + bool allow_concurrent_restore_, + BackupConcurrencyCounters & concurrency_counters_, + ThreadPoolCallbackRunnerUnsafe schedule_, + QueryStatusPtr process_list_element_) + : root_zookeeper_path(root_zookeeper_path_) + , keeper_settings(keeper_settings_) + , restore_uuid(restore_uuid_) + , zookeeper_path(root_zookeeper_path_ + "/restore-" + toString(restore_uuid_)) + , all_hosts(all_hosts_) + , all_hosts_without_initiator(BackupCoordinationOnCluster::excludeInitiator(all_hosts)) + , current_host(current_host_) + , current_host_index(BackupCoordinationOnCluster::findCurrentHostIndex(current_host, all_hosts)) + , 
log(getLogger("RestoreCoordinationOnCluster")) + , with_retries(log, get_zookeeper_, keeper_settings, process_list_element_, [root_zookeeper_path_](Coordination::ZooKeeperWithFaultInjection::Ptr zk) { zk->sync(root_zookeeper_path_); }) + , concurrency_check(restore_uuid_, /* is_restore = */ true, /* on_cluster = */ true, allow_concurrent_restore_, concurrency_counters_) + , stage_sync(/* is_restore = */ true, fs::path{zookeeper_path} / "stage", current_host, all_hosts, allow_concurrent_restore_, with_retries, schedule_, process_list_element_, log) + , cleaner(zookeeper_path, with_retries, log) +{ + createRootNodes(); +} + +RestoreCoordinationOnCluster::~RestoreCoordinationOnCluster() +{ + tryFinishImpl(); +} + +void RestoreCoordinationOnCluster::createRootNodes() +{ + auto holder = with_retries.createRetriesControlHolder("createRootNodes", WithRetries::kInitialization); + holder.retries_ctl.retryLoop( + [&, &zk = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zk); + + zk->createAncestors(zookeeper_path); + zk->createIfNotExists(zookeeper_path, ""); + zk->createIfNotExists(zookeeper_path + "/repl_databases_tables_acquired", ""); + zk->createIfNotExists(zookeeper_path + "/repl_tables_data_acquired", ""); + zk->createIfNotExists(zookeeper_path + "/repl_access_storages_acquired", ""); + zk->createIfNotExists(zookeeper_path + "/repl_sql_objects_acquired", ""); + zk->createIfNotExists(zookeeper_path + "/keeper_map_tables", ""); + zk->createIfNotExists(zookeeper_path + "/table_uuids", ""); + }); +} + +Strings RestoreCoordinationOnCluster::setStage(const String & new_stage, const String & message, bool sync) +{ + stage_sync.setStage(new_stage, message); + + if (!sync) + return {}; + + return stage_sync.waitForHostsToReachStage(new_stage, all_hosts_without_initiator); +} + +void RestoreCoordinationOnCluster::setRestoreQueryWasSentToOtherHosts() +{ + restore_query_was_sent_to_other_hosts = true; +} + +bool RestoreCoordinationOnCluster::trySetError(std::exception_ptr exception) +{ + return stage_sync.trySetError(exception); +} + +void RestoreCoordinationOnCluster::finish() +{ + bool other_hosts_also_finished = false; + stage_sync.finish(other_hosts_also_finished); + + if ((current_host == kInitiator) && (other_hosts_also_finished || !restore_query_was_sent_to_other_hosts)) + cleaner.cleanup(); +} + +bool RestoreCoordinationOnCluster::tryFinishAfterError() noexcept +{ + return tryFinishImpl(); +} + +bool RestoreCoordinationOnCluster::tryFinishImpl() noexcept +{ + bool other_hosts_also_finished = false; + if (!stage_sync.tryFinishAfterError(other_hosts_also_finished)) + return false; + + if ((current_host == kInitiator) && (other_hosts_also_finished || !restore_query_was_sent_to_other_hosts)) + { + if (!cleaner.tryCleanupAfterError()) + return false; + } + + return true; +} + +void RestoreCoordinationOnCluster::waitForOtherHostsToFinish() +{ + if ((current_host != kInitiator) || !restore_query_was_sent_to_other_hosts) + return; + stage_sync.waitForOtherHostsToFinish(); +} + +bool RestoreCoordinationOnCluster::tryWaitForOtherHostsToFinishAfterError() noexcept +{ + if (current_host != kInitiator) + return false; + if (!restore_query_was_sent_to_other_hosts) + return true; + return stage_sync.tryWaitForOtherHostsToFinishAfterError(); +} + +ZooKeeperRetriesInfo RestoreCoordinationOnCluster::getOnClusterInitializationKeeperRetriesInfo() const +{ + return ZooKeeperRetriesInfo{keeper_settings.max_retries_while_initializing, + static_cast(keeper_settings.retry_initial_backoff_ms.count()), + 
static_cast(keeper_settings.retry_max_backoff_ms.count())}; +} + +bool RestoreCoordinationOnCluster::acquireCreatingTableInReplicatedDatabase(const String & database_zk_path, const String & table_name) +{ + bool result = false; + auto holder = with_retries.createRetriesControlHolder("acquireCreatingTableInReplicatedDatabase"); + holder.retries_ctl.retryLoop( + [&, &zk = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zk); + + String path = zookeeper_path + "/repl_databases_tables_acquired/" + escapeForFileName(database_zk_path); + zk->createIfNotExists(path, ""); + + path += "/" + escapeForFileName(table_name); + auto code = zk->tryCreate(path, toString(current_host_index), zkutil::CreateMode::Persistent); + if ((code != Coordination::Error::ZOK) && (code != Coordination::Error::ZNODEEXISTS)) + throw zkutil::KeeperException::fromPath(code, path); + + if (code == Coordination::Error::ZOK) + { + result = true; + return; + } + + /// We need to check who created that node + result = zk->get(path) == toString(current_host_index); + }); + return result; +} + +bool RestoreCoordinationOnCluster::acquireInsertingDataIntoReplicatedTable(const String & table_zk_path) +{ + bool result = false; + auto holder = with_retries.createRetriesControlHolder("acquireInsertingDataIntoReplicatedTable"); + holder.retries_ctl.retryLoop( + [&, &zk = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zk); + + String path = zookeeper_path + "/repl_tables_data_acquired/" + escapeForFileName(table_zk_path); + auto code = zk->tryCreate(path, toString(current_host_index), zkutil::CreateMode::Persistent); + if ((code != Coordination::Error::ZOK) && (code != Coordination::Error::ZNODEEXISTS)) + throw zkutil::KeeperException::fromPath(code, path); + + if (code == Coordination::Error::ZOK) + { + result = true; + return; + } + + /// We need to check who created that node + result = zk->get(path) == toString(current_host_index); + }); + return result; +} + +bool RestoreCoordinationOnCluster::acquireReplicatedAccessStorage(const String & access_storage_zk_path) +{ + bool result = false; + auto holder = with_retries.createRetriesControlHolder("acquireReplicatedAccessStorage"); + holder.retries_ctl.retryLoop( + [&, &zk = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zk); + + String path = zookeeper_path + "/repl_access_storages_acquired/" + escapeForFileName(access_storage_zk_path); + auto code = zk->tryCreate(path, toString(current_host_index), zkutil::CreateMode::Persistent); + if ((code != Coordination::Error::ZOK) && (code != Coordination::Error::ZNODEEXISTS)) + throw zkutil::KeeperException::fromPath(code, path); + + if (code == Coordination::Error::ZOK) + { + result = true; + return; + } + + /// We need to check who created that node + result = zk->get(path) == toString(current_host_index); + }); + return result; +} + +bool RestoreCoordinationOnCluster::acquireReplicatedSQLObjects(const String & loader_zk_path, UserDefinedSQLObjectType object_type) +{ + bool result = false; + auto holder = with_retries.createRetriesControlHolder("acquireReplicatedSQLObjects"); + holder.retries_ctl.retryLoop( + [&, &zk = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zk); + + String path = zookeeper_path + "/repl_sql_objects_acquired/" + escapeForFileName(loader_zk_path); + zk->createIfNotExists(path, ""); + + path += "/"; + switch (object_type) + { + case UserDefinedSQLObjectType::Function: + path += "functions"; + break; + } + + auto code = zk->tryCreate(path, "", 
zkutil::CreateMode::Persistent); + if ((code != Coordination::Error::ZOK) && (code != Coordination::Error::ZNODEEXISTS)) + throw zkutil::KeeperException::fromPath(code, path); + + if (code == Coordination::Error::ZOK) + { + result = true; + return; + } + + /// We need to check who created that node + result = zk->get(path) == toString(current_host_index); + }); + return result; +} + +bool RestoreCoordinationOnCluster::acquireInsertingDataForKeeperMap(const String & root_zk_path, const String & table_unique_id) +{ + bool lock_acquired = false; + auto holder = with_retries.createRetriesControlHolder("acquireInsertingDataForKeeperMap"); + holder.retries_ctl.retryLoop( + [&, &zk = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zk); + + /// we need to remove leading '/' from root_zk_path + auto normalized_root_zk_path = root_zk_path.substr(1); + std::string restore_lock_path = fs::path(zookeeper_path) / "keeper_map_tables" / escapeForFileName(normalized_root_zk_path); + zk->createAncestors(restore_lock_path); + auto code = zk->tryCreate(restore_lock_path, table_unique_id, zkutil::CreateMode::Persistent); + + if (code == Coordination::Error::ZOK) + { + lock_acquired = true; + return; + } + + if (code == Coordination::Error::ZNODEEXISTS) + lock_acquired = table_unique_id == zk->get(restore_lock_path); + else + zkutil::KeeperException::fromPath(code, restore_lock_path); + }); + return lock_acquired; +} + +void RestoreCoordinationOnCluster::generateUUIDForTable(ASTCreateQuery & create_query) +{ + String query_str = serializeAST(create_query); + CreateQueryUUIDs new_uuids{create_query, /* generate_random= */ true, /* force_random= */ true}; + String new_uuids_str = new_uuids.toString(); + + auto holder = with_retries.createRetriesControlHolder("generateUUIDForTable"); + holder.retries_ctl.retryLoop( + [&, &zk = holder.faulty_zookeeper]() + { + with_retries.renewZooKeeper(zk); + + String path = zookeeper_path + "/table_uuids/" + escapeForFileName(query_str); + Coordination::Error res = zk->tryCreate(path, new_uuids_str, zkutil::CreateMode::Persistent); + + if (res == Coordination::Error::ZOK) + { + new_uuids.copyToQuery(create_query); + return; + } + + if (res == Coordination::Error::ZNODEEXISTS) + { + CreateQueryUUIDs::fromString(zk->get(path)).copyToQuery(create_query); + return; + } + + zkutil::KeeperException::fromPath(res, path); + }); +} + +} diff --git a/src/Backups/RestoreCoordinationRemote.h b/src/Backups/RestoreCoordinationOnCluster.h similarity index 62% rename from src/Backups/RestoreCoordinationRemote.h rename to src/Backups/RestoreCoordinationOnCluster.h index a3d57e9a4d0..87a8dd3ce83 100644 --- a/src/Backups/RestoreCoordinationRemote.h +++ b/src/Backups/RestoreCoordinationOnCluster.h @@ -1,6 +1,8 @@ #pragma once #include +#include +#include #include #include @@ -9,28 +11,33 @@ namespace DB { /// Implementation of the IRestoreCoordination interface performing coordination via ZooKeeper. It's necessary for "RESTORE ON CLUSTER". -class RestoreCoordinationRemote : public IRestoreCoordination +class RestoreCoordinationOnCluster : public IRestoreCoordination { public: - using RestoreKeeperSettings = WithRetries::KeeperSettings; + /// Empty string as the current host is used to mark the initiator of a RESTORE ON CLUSTER query. 
+ static const constexpr std::string_view kInitiator; - RestoreCoordinationRemote( - zkutil::GetZooKeeper get_zookeeper_, + RestoreCoordinationOnCluster( + const UUID & restore_uuid_, const String & root_zookeeper_path_, - const RestoreKeeperSettings & keeper_settings_, - const String & restore_uuid_, - const Strings & all_hosts_, + zkutil::GetZooKeeper get_zookeeper_, + const BackupKeeperSettings & keeper_settings_, const String & current_host_, - bool is_internal_, + const Strings & all_hosts_, + bool allow_concurrent_restore_, + BackupConcurrencyCounters & concurrency_counters_, + ThreadPoolCallbackRunnerUnsafe schedule_, QueryStatusPtr process_list_element_); - ~RestoreCoordinationRemote() override; + ~RestoreCoordinationOnCluster() override; - /// Sets the current stage and waits for other hosts to come to this stage too. - void setStage(const String & new_stage, const String & message) override; - void setError(const Exception & exception) override; - Strings waitForStage(const String & stage_to_wait) override; - Strings waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) override; + Strings setStage(const String & new_stage, const String & message, bool sync) override; + void setRestoreQueryWasSentToOtherHosts() override; + bool trySetError(std::exception_ptr exception) override; + void finish() override; + bool tryFinishAfterError() noexcept override; + void waitForOtherHostsToFinish() override; + bool tryWaitForOtherHostsToFinishAfterError() noexcept override; /// Starts creating a table in a replicated database. Returns false if there is another host which is already creating this table. bool acquireCreatingTableInReplicatedDatabase(const String & database_zk_path, const String & table_name) override; @@ -55,27 +62,27 @@ public: /// (because otherwise the macro "{uuid}" in the ZooKeeper path will not work correctly). 
void generateUUIDForTable(ASTCreateQuery & create_query) override; - bool hasConcurrentRestores(const std::atomic & num_active_restores) const override; + ZooKeeperRetriesInfo getOnClusterInitializationKeeperRetriesInfo() const override; private: void createRootNodes(); - void removeAllNodes(); + bool tryFinishImpl() noexcept; - /// get_zookeeper will provide a zookeeper client without any fault injection - const zkutil::GetZooKeeper get_zookeeper; const String root_zookeeper_path; - const RestoreKeeperSettings keeper_settings; - const String restore_uuid; + const BackupKeeperSettings keeper_settings; + const UUID restore_uuid; const String zookeeper_path; const Strings all_hosts; + const Strings all_hosts_without_initiator; const String current_host; const size_t current_host_index; - const bool is_internal; LoggerPtr const log; - mutable WithRetries with_retries; - std::optional stage_sync; - mutable std::mutex mutex; + const WithRetries with_retries; + BackupConcurrencyCheck concurrency_check; + BackupCoordinationStageSync stage_sync; + BackupCoordinationCleaner cleaner; + std::atomic restore_query_was_sent_to_other_hosts = false; }; } diff --git a/src/Backups/RestoreCoordinationRemote.cpp b/src/Backups/RestoreCoordinationRemote.cpp deleted file mode 100644 index 0a69bc0eafb..00000000000 --- a/src/Backups/RestoreCoordinationRemote.cpp +++ /dev/null @@ -1,379 +0,0 @@ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - - -namespace DB -{ - -namespace Stage = BackupCoordinationStage; - -RestoreCoordinationRemote::RestoreCoordinationRemote( - zkutil::GetZooKeeper get_zookeeper_, - const String & root_zookeeper_path_, - const RestoreKeeperSettings & keeper_settings_, - const String & restore_uuid_, - const Strings & all_hosts_, - const String & current_host_, - bool is_internal_, - QueryStatusPtr process_list_element_) - : get_zookeeper(get_zookeeper_) - , root_zookeeper_path(root_zookeeper_path_) - , keeper_settings(keeper_settings_) - , restore_uuid(restore_uuid_) - , zookeeper_path(root_zookeeper_path_ + "/restore-" + restore_uuid_) - , all_hosts(all_hosts_) - , current_host(current_host_) - , current_host_index(BackupCoordinationRemote::findCurrentHostIndex(all_hosts, current_host)) - , is_internal(is_internal_) - , log(getLogger("RestoreCoordinationRemote")) - , with_retries( - log, - get_zookeeper_, - keeper_settings, - process_list_element_, - [my_zookeeper_path = zookeeper_path, my_current_host = current_host, my_is_internal = is_internal] - (WithRetries::FaultyKeeper & zk) - { - /// Recreate this ephemeral node to signal that we are alive. - if (my_is_internal) - { - String alive_node_path = my_zookeeper_path + "/stage/alive|" + my_current_host; - - /// Delete the ephemeral node from the previous connection so we don't have to wait for keeper to do it automatically. - zk->tryRemove(alive_node_path); - - zk->createAncestors(alive_node_path); - zk->create(alive_node_path, "", zkutil::CreateMode::Ephemeral); - } - }) -{ - createRootNodes(); - - stage_sync.emplace( - zookeeper_path, - with_retries, - log); -} - -RestoreCoordinationRemote::~RestoreCoordinationRemote() -{ - try - { - if (!is_internal) - removeAllNodes(); - } - catch (...) 
- { - tryLogCurrentException(__PRETTY_FUNCTION__); - } -} - -void RestoreCoordinationRemote::createRootNodes() -{ - auto holder = with_retries.createRetriesControlHolder("createRootNodes"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - zk->createAncestors(zookeeper_path); - - Coordination::Requests ops; - Coordination::Responses responses; - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path, "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_databases_tables_acquired", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_tables_data_acquired", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_access_storages_acquired", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_sql_objects_acquired", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/keeper_map_tables", "", zkutil::CreateMode::Persistent)); - ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/table_uuids", "", zkutil::CreateMode::Persistent)); - zk->tryMulti(ops, responses); - }); -} - -void RestoreCoordinationRemote::setStage(const String & new_stage, const String & message) -{ - if (is_internal) - stage_sync->set(current_host, new_stage, message); - else - stage_sync->set(current_host, new_stage, /* message */ "", /* all_hosts */ true); -} - -void RestoreCoordinationRemote::setError(const Exception & exception) -{ - stage_sync->setError(current_host, exception); -} - -Strings RestoreCoordinationRemote::waitForStage(const String & stage_to_wait) -{ - return stage_sync->wait(all_hosts, stage_to_wait); -} - -Strings RestoreCoordinationRemote::waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) -{ - return stage_sync->waitFor(all_hosts, stage_to_wait, timeout); -} - -bool RestoreCoordinationRemote::acquireCreatingTableInReplicatedDatabase(const String & database_zk_path, const String & table_name) -{ - bool result = false; - auto holder = with_retries.createRetriesControlHolder("acquireCreatingTableInReplicatedDatabase"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - - String path = zookeeper_path + "/repl_databases_tables_acquired/" + escapeForFileName(database_zk_path); - zk->createIfNotExists(path, ""); - - path += "/" + escapeForFileName(table_name); - auto code = zk->tryCreate(path, toString(current_host_index), zkutil::CreateMode::Persistent); - if ((code != Coordination::Error::ZOK) && (code != Coordination::Error::ZNODEEXISTS)) - throw zkutil::KeeperException::fromPath(code, path); - - if (code == Coordination::Error::ZOK) - { - result = true; - return; - } - - /// We need to check who created that node - result = zk->get(path) == toString(current_host_index); - }); - return result; -} - -bool RestoreCoordinationRemote::acquireInsertingDataIntoReplicatedTable(const String & table_zk_path) -{ - bool result = false; - auto holder = with_retries.createRetriesControlHolder("acquireInsertingDataIntoReplicatedTable"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - - String path = zookeeper_path + "/repl_tables_data_acquired/" + escapeForFileName(table_zk_path); - auto code = zk->tryCreate(path, toString(current_host_index), 
zkutil::CreateMode::Persistent); - if ((code != Coordination::Error::ZOK) && (code != Coordination::Error::ZNODEEXISTS)) - throw zkutil::KeeperException::fromPath(code, path); - - if (code == Coordination::Error::ZOK) - { - result = true; - return; - } - - /// We need to check who created that node - result = zk->get(path) == toString(current_host_index); - }); - return result; -} - -bool RestoreCoordinationRemote::acquireReplicatedAccessStorage(const String & access_storage_zk_path) -{ - bool result = false; - auto holder = with_retries.createRetriesControlHolder("acquireReplicatedAccessStorage"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - - String path = zookeeper_path + "/repl_access_storages_acquired/" + escapeForFileName(access_storage_zk_path); - auto code = zk->tryCreate(path, toString(current_host_index), zkutil::CreateMode::Persistent); - if ((code != Coordination::Error::ZOK) && (code != Coordination::Error::ZNODEEXISTS)) - throw zkutil::KeeperException::fromPath(code, path); - - if (code == Coordination::Error::ZOK) - { - result = true; - return; - } - - /// We need to check who created that node - result = zk->get(path) == toString(current_host_index); - }); - return result; -} - -bool RestoreCoordinationRemote::acquireReplicatedSQLObjects(const String & loader_zk_path, UserDefinedSQLObjectType object_type) -{ - bool result = false; - auto holder = with_retries.createRetriesControlHolder("acquireReplicatedSQLObjects"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - - String path = zookeeper_path + "/repl_sql_objects_acquired/" + escapeForFileName(loader_zk_path); - zk->createIfNotExists(path, ""); - - path += "/"; - switch (object_type) - { - case UserDefinedSQLObjectType::Function: - path += "functions"; - break; - } - - auto code = zk->tryCreate(path, "", zkutil::CreateMode::Persistent); - if ((code != Coordination::Error::ZOK) && (code != Coordination::Error::ZNODEEXISTS)) - throw zkutil::KeeperException::fromPath(code, path); - - if (code == Coordination::Error::ZOK) - { - result = true; - return; - } - - /// We need to check who created that node - result = zk->get(path) == toString(current_host_index); - }); - return result; -} - -bool RestoreCoordinationRemote::acquireInsertingDataForKeeperMap(const String & root_zk_path, const String & table_unique_id) -{ - bool lock_acquired = false; - auto holder = with_retries.createRetriesControlHolder("acquireInsertingDataForKeeperMap"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - - /// we need to remove leading '/' from root_zk_path - auto normalized_root_zk_path = root_zk_path.substr(1); - std::string restore_lock_path = fs::path(zookeeper_path) / "keeper_map_tables" / escapeForFileName(normalized_root_zk_path); - zk->createAncestors(restore_lock_path); - auto code = zk->tryCreate(restore_lock_path, table_unique_id, zkutil::CreateMode::Persistent); - - if (code == Coordination::Error::ZOK) - { - lock_acquired = true; - return; - } - - if (code == Coordination::Error::ZNODEEXISTS) - lock_acquired = table_unique_id == zk->get(restore_lock_path); - else - zkutil::KeeperException::fromPath(code, restore_lock_path); - }); - return lock_acquired; -} - -void RestoreCoordinationRemote::generateUUIDForTable(ASTCreateQuery & create_query) -{ - String query_str = serializeAST(create_query); - CreateQueryUUIDs new_uuids{create_query, /* 
generate_random= */ true, /* force_random= */ true}; - String new_uuids_str = new_uuids.toString(); - - auto holder = with_retries.createRetriesControlHolder("generateUUIDForTable"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - - String path = zookeeper_path + "/table_uuids/" + escapeForFileName(query_str); - Coordination::Error res = zk->tryCreate(path, new_uuids_str, zkutil::CreateMode::Persistent); - - if (res == Coordination::Error::ZOK) - { - new_uuids.copyToQuery(create_query); - return; - } - - if (res == Coordination::Error::ZNODEEXISTS) - { - CreateQueryUUIDs::fromString(zk->get(path)).copyToQuery(create_query); - return; - } - - zkutil::KeeperException::fromPath(res, path); - }); -} - -void RestoreCoordinationRemote::removeAllNodes() -{ - /// Usually this function is called by the initiator when a restore operation is complete so we don't need the coordination anymore. - /// - /// However there can be a rare situation when this function is called after an error occurs on the initiator of a query - /// while some hosts are still restoring something. Removing all the nodes will remove the parent node of the restore coordination - /// at `zookeeper_path` which might cause such hosts to stop with exception "ZNONODE". Or such hosts might still do some part - /// of their restore work before that. - - auto holder = with_retries.createRetriesControlHolder("removeAllNodes"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - zk->removeRecursive(zookeeper_path); - }); -} - -bool RestoreCoordinationRemote::hasConcurrentRestores(const std::atomic &) const -{ - /// If its internal concurrency will be checked for the base restore - if (is_internal) - return false; - - bool result = false; - std::string path = zookeeper_path + "/stage"; - - auto holder = with_retries.createRetriesControlHolder("createRootNodes"); - holder.retries_ctl.retryLoop( - [&, &zk = holder.faulty_zookeeper]() - { - with_retries.renewZooKeeper(zk); - - if (! 
zk->exists(root_zookeeper_path)) - zk->createAncestors(root_zookeeper_path); - - for (size_t attempt = 0; attempt < MAX_ZOOKEEPER_ATTEMPTS; ++attempt) - { - Coordination::Stat stat; - zk->get(root_zookeeper_path, &stat); - Strings existing_restore_paths = zk->getChildren(root_zookeeper_path); - for (const auto & existing_restore_path : existing_restore_paths) - { - if (startsWith(existing_restore_path, "backup-")) - continue; - - String existing_restore_uuid = existing_restore_path; - existing_restore_uuid.erase(0, String("restore-").size()); - - if (existing_restore_uuid == toString(restore_uuid)) - continue; - - String status; - if (zk->tryGet(root_zookeeper_path + "/" + existing_restore_path + "/stage", status)) - { - /// Check if some other restore is in progress - if (status == Stage::SCHEDULED_TO_START) - { - LOG_WARNING(log, "Found a concurrent restore: {}, current restore: {}", existing_restore_uuid, toString(restore_uuid)); - result = true; - return; - } - } - } - - zk->createIfNotExists(path, ""); - auto code = zk->trySet(path, Stage::SCHEDULED_TO_START, stat.version); - if (code == Coordination::Error::ZOK) - break; - bool is_last_attempt = (attempt == MAX_ZOOKEEPER_ATTEMPTS - 1); - if ((code != Coordination::Error::ZBADVERSION) || is_last_attempt) - throw zkutil::KeeperException::fromPath(code, path); - } - }); - - return result; -} - -} diff --git a/src/Backups/RestorerFromBackup.cpp b/src/Backups/RestorerFromBackup.cpp index eb4ba9424ff..29579aa7348 100644 --- a/src/Backups/RestorerFromBackup.cpp +++ b/src/Backups/RestorerFromBackup.cpp @@ -100,7 +100,6 @@ RestorerFromBackup::RestorerFromBackup( , context(context_) , process_list_element(context->getProcessListElement()) , after_task_callback(after_task_callback_) - , on_cluster_first_sync_timeout(context->getConfigRef().getUInt64("backups.on_cluster_first_sync_timeout", 180000)) , create_table_timeout(context->getConfigRef().getUInt64("backups.create_table_timeout", 300000)) , log(getLogger("RestorerFromBackup")) , tables_dependencies("RestorerFromBackup") @@ -119,12 +118,14 @@ RestorerFromBackup::~RestorerFromBackup() } } -void RestorerFromBackup::run(Mode mode) +void RestorerFromBackup::run(Mode mode_) { /// run() can be called onle once. if (!current_stage.empty()) throw Exception(ErrorCodes::LOGICAL_ERROR, "Already restoring"); + mode = mode_; + /// Find other hosts working along with us to execute this ON CLUSTER query. all_hosts = BackupSettings::Util::filterHostIDs( restore_settings.cluster_host_ids, restore_settings.shard_num, restore_settings.replica_num); @@ -139,6 +140,7 @@ void RestorerFromBackup::run(Mode mode) setStage(Stage::FINDING_TABLES_IN_BACKUP); findDatabasesAndTablesInBackup(); waitFutures(); + logNumberOfDatabasesAndTablesToRestore(); /// Check access rights. setStage(Stage::CHECKING_ACCESS_RIGHTS); @@ -228,20 +230,8 @@ void RestorerFromBackup::setStage(const String & new_stage, const String & messa if (restore_coordination) { - restore_coordination->setStage(new_stage, message); - - /// The initiator of a RESTORE ON CLUSTER query waits for other hosts to complete their work (see waitForStage(Stage::COMPLETED) in BackupsWorker::doRestore), - /// but other hosts shouldn't wait for each others' completion. (That's simply unnecessary and also - /// the initiator may start cleaning up (e.g. removing restore-coordination ZooKeeper nodes) once all other hosts are in Stage::COMPLETED.) 
- bool need_wait = (new_stage != Stage::COMPLETED); - - if (need_wait) - { - if (new_stage == Stage::FINDING_TABLES_IN_BACKUP) - restore_coordination->waitForStage(new_stage, on_cluster_first_sync_timeout); - else - restore_coordination->waitForStage(new_stage); - } + /// There is no need to sync Stage::COMPLETED with other hosts because it's the last stage. + restore_coordination->setStage(new_stage, message, /* sync = */ (new_stage != Stage::COMPLETED)); } } @@ -384,8 +374,12 @@ void RestorerFromBackup::findDatabasesAndTablesInBackup() } } } +} - LOG_INFO(log, "Will restore {} databases and {} tables", getNumDatabases(), getNumTables()); +void RestorerFromBackup::logNumberOfDatabasesAndTablesToRestore() const +{ + std::string_view action = (mode == CHECK_ACCESS_ONLY) ? "check access rights for restoring" : "restore"; + LOG_INFO(log, "Will {} {} databases and {} tables", action, getNumDatabases(), getNumTables()); } void RestorerFromBackup::findTableInBackup(const QualifiedTableName & table_name_in_backup, bool skip_if_inner_table, const std::optional & partitions) diff --git a/src/Backups/RestorerFromBackup.h b/src/Backups/RestorerFromBackup.h index e0130ccfcb4..87290618487 100644 --- a/src/Backups/RestorerFromBackup.h +++ b/src/Backups/RestorerFromBackup.h @@ -53,7 +53,7 @@ public: using DataRestoreTasks = std::vector; /// Restores the metadata of databases and tables and returns tasks to restore the data of tables. - void run(Mode mode); + void run(Mode mode_); BackupPtr getBackup() const { return backup; } const RestoreSettings & getRestoreSettings() const { return restore_settings; } @@ -80,10 +80,10 @@ private: ContextMutablePtr context; QueryStatusPtr process_list_element; std::function after_task_callback; - std::chrono::milliseconds on_cluster_first_sync_timeout; std::chrono::milliseconds create_table_timeout; LoggerPtr log; + Mode mode = Mode::RESTORE; Strings all_hosts; DDLRenamingMap renaming_map; std::vector root_paths_in_backup; @@ -97,6 +97,7 @@ private: void findDatabaseInBackupImpl(const String & database_name_in_backup, const std::set & except_table_names); void findEverythingInBackup(const std::set & except_database_names, const std::set & except_table_names); + void logNumberOfDatabasesAndTablesToRestore() const; size_t getNumDatabases() const; size_t getNumTables() const; diff --git a/src/Backups/WithRetries.cpp b/src/Backups/WithRetries.cpp index 772f746e40a..9c18be3ca9e 100644 --- a/src/Backups/WithRetries.cpp +++ b/src/Backups/WithRetries.cpp @@ -1,57 +1,34 @@ #include -#include #include + namespace DB { -namespace Setting -{ - extern const SettingsUInt64 backup_restore_keeper_max_retries; - extern const SettingsUInt64 backup_restore_keeper_retry_initial_backoff_ms; - extern const SettingsUInt64 backup_restore_keeper_retry_max_backoff_ms; - extern const SettingsUInt64 backup_restore_batch_size_for_keeper_multiread; - extern const SettingsFloat backup_restore_keeper_fault_injection_probability; - extern const SettingsUInt64 backup_restore_keeper_fault_injection_seed; - extern const SettingsUInt64 backup_restore_keeper_value_max_size; - extern const SettingsUInt64 backup_restore_batch_size_for_keeper_multi; -} - -WithRetries::KeeperSettings WithRetries::KeeperSettings::fromContext(ContextPtr context) -{ - return - { - .keeper_max_retries = context->getSettingsRef()[Setting::backup_restore_keeper_max_retries], - .keeper_retry_initial_backoff_ms = context->getSettingsRef()[Setting::backup_restore_keeper_retry_initial_backoff_ms], - .keeper_retry_max_backoff_ms = 
context->getSettingsRef()[Setting::backup_restore_keeper_retry_max_backoff_ms], - .batch_size_for_keeper_multiread = context->getSettingsRef()[Setting::backup_restore_batch_size_for_keeper_multiread], - .keeper_fault_injection_probability = context->getSettingsRef()[Setting::backup_restore_keeper_fault_injection_probability], - .keeper_fault_injection_seed = context->getSettingsRef()[Setting::backup_restore_keeper_fault_injection_seed], - .keeper_value_max_size = context->getSettingsRef()[Setting::backup_restore_keeper_value_max_size], - .batch_size_for_keeper_multi = context->getSettingsRef()[Setting::backup_restore_batch_size_for_keeper_multi], - }; -} WithRetries::WithRetries( - LoggerPtr log_, zkutil::GetZooKeeper get_zookeeper_, const KeeperSettings & settings_, QueryStatusPtr process_list_element_, RenewerCallback callback_) + LoggerPtr log_, zkutil::GetZooKeeper get_zookeeper_, const BackupKeeperSettings & settings_, QueryStatusPtr process_list_element_, RenewerCallback callback_) : log(log_) , get_zookeeper(get_zookeeper_) , settings(settings_) , process_list_element(process_list_element_) , callback(callback_) - , global_zookeeper_retries_info( - settings.keeper_max_retries, settings.keeper_retry_initial_backoff_ms, settings.keeper_retry_max_backoff_ms) {} -WithRetries::RetriesControlHolder::RetriesControlHolder(const WithRetries * parent, const String & name) - : info(parent->global_zookeeper_retries_info) - , retries_ctl(name, parent->log, info, parent->process_list_element) +WithRetries::RetriesControlHolder::RetriesControlHolder(const WithRetries * parent, const String & name, Kind kind) + : info( (kind == kInitialization) ? parent->settings.max_retries_while_initializing + : (kind == kErrorHandling) ? parent->settings.max_retries_while_handling_error + : parent->settings.max_retries, + parent->settings.retry_initial_backoff_ms.count(), + parent->settings.retry_max_backoff_ms.count()) + /// We don't use process_list_element while handling an error because the error handling can't be cancellable. + , retries_ctl(name, parent->log, info, (kind == kErrorHandling) ? nullptr : parent->process_list_element) , faulty_zookeeper(parent->getFaultyZooKeeper()) {} -WithRetries::RetriesControlHolder WithRetries::createRetriesControlHolder(const String & name) +WithRetries::RetriesControlHolder WithRetries::createRetriesControlHolder(const String & name, Kind kind) const { - return RetriesControlHolder(this, name); + return RetriesControlHolder(this, name, kind); } void WithRetries::renewZooKeeper(FaultyKeeper my_faulty_zookeeper) const @@ -62,8 +39,8 @@ void WithRetries::renewZooKeeper(FaultyKeeper my_faulty_zookeeper) const { zookeeper = get_zookeeper(); my_faulty_zookeeper->setKeeper(zookeeper); - - callback(my_faulty_zookeeper); + if (callback) + callback(my_faulty_zookeeper); } else { @@ -71,7 +48,7 @@ void WithRetries::renewZooKeeper(FaultyKeeper my_faulty_zookeeper) const } } -const WithRetries::KeeperSettings & WithRetries::getKeeperSettings() const +const BackupKeeperSettings & WithRetries::getKeeperSettings() const { return settings; } @@ -88,8 +65,8 @@ WithRetries::FaultyKeeper WithRetries::getFaultyZooKeeper() const /// The reason is that ZooKeeperWithFaultInjection may reset the underlying pointer and there could be a race condition /// when the same object is used from multiple threads. 
auto faulty_zookeeper = ZooKeeperWithFaultInjection::createInstance( - settings.keeper_fault_injection_probability, - settings.keeper_fault_injection_seed, + settings.fault_injection_probability, + settings.fault_injection_seed, current_zookeeper, log->name(), log); diff --git a/src/Backups/WithRetries.h b/src/Backups/WithRetries.h index f795a963911..e465fbb1e50 100644 --- a/src/Backups/WithRetries.h +++ b/src/Backups/WithRetries.h @@ -1,9 +1,11 @@ #pragma once -#include +#include #include +#include #include + namespace DB { @@ -15,20 +17,13 @@ class WithRetries { public: using FaultyKeeper = Coordination::ZooKeeperWithFaultInjection::Ptr; - using RenewerCallback = std::function; + using RenewerCallback = std::function; - struct KeeperSettings + enum Kind { - UInt64 keeper_max_retries{0}; - UInt64 keeper_retry_initial_backoff_ms{0}; - UInt64 keeper_retry_max_backoff_ms{0}; - UInt64 batch_size_for_keeper_multiread{10000}; - Float64 keeper_fault_injection_probability{0}; - UInt64 keeper_fault_injection_seed{42}; - UInt64 keeper_value_max_size{1048576}; - UInt64 batch_size_for_keeper_multi{1000}; - - static KeeperSettings fromContext(ContextPtr context); + kNormal, + kInitialization, + kErrorHandling, }; /// For simplicity a separate ZooKeeperRetriesInfo and a faulty [Zoo]Keeper client @@ -48,23 +43,23 @@ public: private: friend class WithRetries; - RetriesControlHolder(const WithRetries * parent, const String & name); + RetriesControlHolder(const WithRetries * parent, const String & name, Kind kind); }; - RetriesControlHolder createRetriesControlHolder(const String & name); - WithRetries(LoggerPtr log, zkutil::GetZooKeeper get_zookeeper_, const KeeperSettings & settings, QueryStatusPtr process_list_element_, RenewerCallback callback); + RetriesControlHolder createRetriesControlHolder(const String & name, Kind kind = Kind::kNormal) const; + WithRetries(LoggerPtr log, zkutil::GetZooKeeper get_zookeeper_, const BackupKeeperSettings & settings, QueryStatusPtr process_list_element_, RenewerCallback callback = {}); /// Used to re-establish new connection inside a retry loop. void renewZooKeeper(FaultyKeeper my_faulty_zookeeper) const; - const KeeperSettings & getKeeperSettings() const; + const BackupKeeperSettings & getKeeperSettings() const; private: /// This will provide a special wrapper which is useful for testing FaultyKeeper getFaultyZooKeeper() const; LoggerPtr log; zkutil::GetZooKeeper get_zookeeper; - KeeperSettings settings; + BackupKeeperSettings settings; QueryStatusPtr process_list_element; /// This callback is called each time when a new [Zoo]Keeper session is created. @@ -76,7 +71,6 @@ private: /// it could lead just to a failed backup which could possibly be successful /// if there were a little bit more retries. 
RenewerCallback callback; - ZooKeeperRetriesInfo global_zookeeper_retries_info; /// This is needed only to protect zookeeper object mutable std::mutex zookeeper_mutex; diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index 39499cc577d..3627d760d4c 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -136,6 +136,7 @@ add_headers_and_sources(dbms Storages/ObjectStorage/HDFS) add_headers_and_sources(dbms Storages/ObjectStorage/Local) add_headers_and_sources(dbms Storages/ObjectStorage/DataLakes) add_headers_and_sources(dbms Common/NamedCollections) +add_headers_and_sources(dbms Common/Scheduler/Workload) if (TARGET ch_contrib::amqp_cpp) add_headers_and_sources(dbms Storages/RabbitMQ) diff --git a/src/Client/ClientApplicationBase.cpp b/src/Client/ClientApplicationBase.cpp index d26641fe5f9..f506d7c99f5 100644 --- a/src/Client/ClientApplicationBase.cpp +++ b/src/Client/ClientApplicationBase.cpp @@ -167,7 +167,8 @@ void ClientApplicationBase::init(int argc, char ** argv) ("query_kind", po::value()->default_value("initial_query"), "One of initial_query/secondary_query/no_query") ("query_id", po::value(), "query_id") - ("history_file", po::value(), "path to history file") + ("history_file", po::value(), "Path to a file containing command history.") + ("history_max_entries", po::value()->default_value(1000000), "Maximum number of entries in the history file.") ("stage", po::value()->default_value("complete"), "Request query processing up to specified stage: complete,fetch_columns,with_mergeable_state,with_mergeable_state_after_aggregation,with_mergeable_state_after_aggregation_and_limit") ("progress", po::value()->implicit_value(ProgressOption::TTY, "tty")->default_value(ProgressOption::DEFAULT, "default"), "Print progress of queries execution - to TTY: tty|on|1|true|yes; to STDERR non-interactive mode: err; OFF: off|0|false|no; DEFAULT - interactive to TTY, non-interactive is off") @@ -350,6 +351,8 @@ void ClientApplicationBase::init(int argc, char ** argv) getClientConfiguration().setBool("highlight", options["highlight"].as()); if (options.count("history_file")) getClientConfiguration().setString("history_file", options["history_file"].as()); + if (options.count("history_max_entries")) + getClientConfiguration().setUInt("history_max_entries", options["history_max_entries"].as()); if (options.count("interactive")) getClientConfiguration().setBool("interactive", true); if (options.count("pager")) @@ -418,7 +421,7 @@ void ClientApplicationBase::init(int argc, char ** argv) UInt64 max_client_memory_usage_int = parseWithSizeSuffix(max_client_memory_usage.c_str(), max_client_memory_usage.length()); total_memory_tracker.setHardLimit(max_client_memory_usage_int); - total_memory_tracker.setDescription("(total)"); + total_memory_tracker.setDescription("Global"); total_memory_tracker.setMetric(CurrentMetrics::MemoryTracking); } diff --git a/src/Client/ClientBase.cpp b/src/Client/ClientBase.cpp index 23aa7e841cb..0a824753dc0 100644 --- a/src/Client/ClientBase.cpp +++ b/src/Client/ClientBase.cpp @@ -470,8 +470,7 @@ void ClientBase::onData(Block & block, ASTPtr parsed_query) { if (!need_render_progress && select_into_file && !select_into_file_and_stdout) error_stream << "\r"; - bool toggle_enabled = getClientConfiguration().getBool("enable-progress-table-toggle", true); - progress_table.writeTable(*tty_buf, progress_table_toggle_on.load(), toggle_enabled); + progress_table.writeTable(*tty_buf, progress_table_toggle_on.load(), progress_table_toggle_enabled); } } @@ -825,6 +824,9 @@ void 
ClientBase::initTTYBuffer(ProgressOption progress_option, ProgressOption pr if (!need_render_progress && !need_render_progress_table) return; + progress_table_toggle_enabled = getClientConfiguration().getBool("enable-progress-table-toggle"); + progress_table_toggle_on = !progress_table_toggle_enabled; + /// If need_render_progress and need_render_progress_table are enabled, /// use ProgressOption that was set for the progress bar for progress table as well. ProgressOption progress = progress_option ? progress_option : progress_table_option; @@ -881,7 +883,7 @@ void ClientBase::initTTYBuffer(ProgressOption progress_option, ProgressOption pr void ClientBase::initKeystrokeInterceptor() { - if (is_interactive && need_render_progress_table && getClientConfiguration().getBool("enable-progress-table-toggle", true)) + if (is_interactive && need_render_progress_table && progress_table_toggle_enabled) { keystroke_interceptor = std::make_unique(in_fd, error_stream); keystroke_interceptor->registerCallback(' ', [this]() { progress_table_toggle_on = !progress_table_toggle_on; }); @@ -1151,6 +1153,7 @@ void ClientBase::receiveResult(ASTPtr parsed_query, Int32 signals_before_stop, b if (keystroke_interceptor) { + progress_table_toggle_on = false; try { keystroke_interceptor->startIntercept(); @@ -1446,10 +1449,27 @@ void ClientBase::onProfileEvents(Block & block) /// Flush all buffers. void ClientBase::resetOutput() { + if (need_render_progress_table && tty_buf) + progress_table.clearTableOutput(*tty_buf); + /// Order is important: format, compression, file - if (output_format) - output_format->finalize(); + try + { + if (output_format) + output_format->finalize(); + } + catch (...) + { + /// We need to make sure we continue resetting output_format (will stop threads on parallel output) + /// as well as cleaning other output related setup + if (!have_error) + { + client_exception + = std::make_unique(getCurrentExceptionMessageAndPattern(print_stack_trace), getCurrentExceptionCode()); + have_error = true; + } + } output_format.reset(); logs_out_stream.reset(); @@ -2645,6 +2665,8 @@ void ClientBase::runInteractive() } } + history_max_entries = getClientConfiguration().getUInt("history_max_entries"); + LineReader::Patterns query_extenders = {"\\"}; LineReader::Patterns query_delimiters = {";", "\\G", "\\G;"}; char word_break_characters[] = " \t\v\f\a\b\r\n`~!@#$%^&*()-=+[{]}\\|;:'\",<.>/?"; @@ -2657,6 +2679,7 @@ void ClientBase::runInteractive() ReplxxLineReader lr( *suggest, history_file, + history_max_entries, getClientConfiguration().has("multiline"), getClientConfiguration().getBool("ignore_shell_suspend", true), query_extenders, diff --git a/src/Client/ClientBase.h b/src/Client/ClientBase.h index b06958f1d14..6b261714ff6 100644 --- a/src/Client/ClientBase.h +++ b/src/Client/ClientBase.h @@ -328,6 +328,7 @@ protected: String home_path; String history_file; /// Path to a file containing command history. + UInt32 history_max_entries; /// Maximum number of entries in the history file. 
String current_profile; @@ -340,6 +341,7 @@ protected: ProgressTable progress_table; bool need_render_progress = true; bool need_render_progress_table = true; + bool progress_table_toggle_enabled = true; std::atomic_bool progress_table_toggle_on = false; bool need_render_profile_events = true; bool written_first_block = false; diff --git a/src/Client/ProgressTable.cpp b/src/Client/ProgressTable.cpp index 15da659d3fb..f63935440e4 100644 --- a/src/Client/ProgressTable.cpp +++ b/src/Client/ProgressTable.cpp @@ -180,9 +180,12 @@ void writeWithWidth(Out & out, std::string_view s, size_t width) template void writeWithWidthStrict(Out & out, std::string_view s, size_t width) { - chassert(width != 0); + constexpr std::string_view ellipsis = "…"; if (s.size() > width) - out << s.substr(0, width - 1) << "…"; + if (width <= ellipsis.size()) + out << s.substr(0, width); + else + out << s.substr(0, width - ellipsis.size()) << ellipsis; else out << s; } @@ -219,7 +222,9 @@ void ProgressTable::writeTable(WriteBufferFromFileDescriptor & message, bool sho writeWithWidth(message, COLUMN_EVENT_NAME, column_event_name_width); writeWithWidth(message, COLUMN_VALUE, COLUMN_VALUE_WIDTH); writeWithWidth(message, COLUMN_PROGRESS, COLUMN_PROGRESS_WIDTH); - writeWithWidth(message, COLUMN_DOCUMENTATION_NAME, COLUMN_DOCUMENTATION_WIDTH); + auto col_doc_width = getColumnDocumentationWidth(terminal_width); + if (col_doc_width) + writeWithWidth(message, COLUMN_DOCUMENTATION_NAME, col_doc_width); message << CLEAR_TO_END_OF_LINE; double elapsed_sec = watch.elapsedSeconds(); @@ -257,9 +262,12 @@ void ProgressTable::writeTable(WriteBufferFromFileDescriptor & message, bool sho writeWithWidth(message, formatReadableValue(value_type, progress) + "/s", COLUMN_PROGRESS_WIDTH); - message << setColorForDocumentation(); - const auto * doc = getDocumentation(event_name_to_event.at(name)); - writeWithWidthStrict(message, doc, COLUMN_DOCUMENTATION_WIDTH); + if (col_doc_width) + { + message << setColorForDocumentation(); + const auto * doc = getDocumentation(event_name_to_event.at(name)); + writeWithWidthStrict(message, doc, col_doc_width); + } message << RESET_COLOR; message << CLEAR_TO_END_OF_LINE; @@ -372,6 +380,14 @@ size_t ProgressTable::tableSize() const return metrics.empty() ? 
0 : metrics.size() + 1; } +size_t ProgressTable::getColumnDocumentationWidth(size_t terminal_width) const +{ + auto fixed_columns_width = column_event_name_width + COLUMN_VALUE_WIDTH + COLUMN_PROGRESS_WIDTH; + if (terminal_width < fixed_columns_width + COLUMN_DOCUMENTATION_MIN_WIDTH) + return 0; + return terminal_width - fixed_columns_width; +} + ProgressTable::MetricInfo::MetricInfo(ProfileEvents::Type t) : type(t) { } diff --git a/src/Client/ProgressTable.h b/src/Client/ProgressTable.h index a55326e8d3a..f2563d91217 100644 --- a/src/Client/ProgressTable.h +++ b/src/Client/ProgressTable.h @@ -87,6 +87,7 @@ private: }; size_t tableSize() const; + size_t getColumnDocumentationWidth(size_t terminal_width) const; using MetricName = String; @@ -110,7 +111,7 @@ private: static constexpr std::string_view COLUMN_DOCUMENTATION_NAME = "Documentation"; static constexpr size_t COLUMN_VALUE_WIDTH = 20; static constexpr size_t COLUMN_PROGRESS_WIDTH = 20; - static constexpr size_t COLUMN_DOCUMENTATION_WIDTH = 100; + static constexpr size_t COLUMN_DOCUMENTATION_MIN_WIDTH = COLUMN_DOCUMENTATION_NAME.size(); std::ostream & output_stream; int in_fd; diff --git a/src/Client/ReplxxLineReader.cpp b/src/Client/ReplxxLineReader.cpp index 37ceb471e5b..ee90a6cc7b7 100644 --- a/src/Client/ReplxxLineReader.cpp +++ b/src/Client/ReplxxLineReader.cpp @@ -293,6 +293,7 @@ void ReplxxLineReader::setLastIsDelimiter(bool flag) ReplxxLineReader::ReplxxLineReader( Suggest & suggest, const String & history_file_path_, + UInt32 history_max_entries_, bool multiline_, bool ignore_shell_suspend, Patterns extenders_, @@ -313,6 +314,8 @@ ReplxxLineReader::ReplxxLineReader( { using Replxx = replxx::Replxx; + rx.set_max_history_size(static_cast(history_max_entries_)); + if (!history_file_path.empty()) { history_file_fd = open(history_file_path.c_str(), O_RDWR); diff --git a/src/Client/ReplxxLineReader.h b/src/Client/ReplxxLineReader.h index 1dbad2c70dd..ccda47170e6 100644 --- a/src/Client/ReplxxLineReader.h +++ b/src/Client/ReplxxLineReader.h @@ -14,6 +14,7 @@ public: ( Suggest & suggest, const String & history_file_path, + UInt32 history_max_entries, bool multiline, bool ignore_shell_suspend, Patterns extenders_, diff --git a/src/Columns/ColumnArray.cpp b/src/Columns/ColumnArray.cpp index 83d4c24c769..0c6d7c4e5c6 100644 --- a/src/Columns/ColumnArray.cpp +++ b/src/Columns/ColumnArray.cpp @@ -369,6 +369,23 @@ void ColumnArray::popBack(size_t n) offsets_data.resize_assume_reserved(offsets_data.size() - n); } +ColumnCheckpointPtr ColumnArray::getCheckpoint() const +{ + return std::make_shared(size(), getData().getCheckpoint()); +} + +void ColumnArray::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + checkpoint.size = size(); + getData().updateCheckpoint(*assert_cast(checkpoint).nested); +} + +void ColumnArray::rollback(const ColumnCheckpoint & checkpoint) +{ + getOffsets().resize_assume_reserved(checkpoint.size); + getData().rollback(*assert_cast(checkpoint).nested); +} + int ColumnArray::compareAtImpl(size_t n, size_t m, const IColumn & rhs_, int nan_direction_hint, const Collator * collator) const { const ColumnArray & rhs = assert_cast(rhs_); diff --git a/src/Columns/ColumnArray.h b/src/Columns/ColumnArray.h index f77268a8be6..ec14b096055 100644 --- a/src/Columns/ColumnArray.h +++ b/src/Columns/ColumnArray.h @@ -161,6 +161,10 @@ public: ColumnPtr compress() const override; + ColumnCheckpointPtr getCheckpoint() const override; + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + void rollback(const 
ColumnCheckpoint & checkpoint) override; + void forEachSubcolumn(MutableColumnCallback callback) override { callback(offsets); diff --git a/src/Columns/ColumnDynamic.cpp b/src/Columns/ColumnDynamic.cpp index 4c7c9b53654..41a9096bc0c 100644 --- a/src/Columns/ColumnDynamic.cpp +++ b/src/Columns/ColumnDynamic.cpp @@ -1000,6 +1000,56 @@ ColumnPtr ColumnDynamic::compress() const }); } +void ColumnDynamic::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + auto & nested = assert_cast(checkpoint).nested; + const auto & variants = variant_column_ptr->getVariants(); + + size_t old_size = nested.size(); + chassert(old_size <= variants.size()); + + for (size_t i = 0; i < old_size; ++i) + { + variants[i]->updateCheckpoint(*nested[i]); + } + + /// If column has new variants since last checkpoint create checkpoints for them. + if (old_size < variants.size()) + { + nested.resize(variants.size()); + for (size_t i = old_size; i < variants.size(); ++i) + nested[i] = variants[i]->getCheckpoint(); + } + + checkpoint.size = size(); +} + +void ColumnDynamic::rollback(const ColumnCheckpoint & checkpoint) +{ + const auto & nested = assert_cast(checkpoint).nested; + chassert(nested.size() <= variant_column_ptr->getNumVariants()); + + /// The structure hasn't changed, so we can use generic rollback of Variant column + if (nested.size() == variant_column_ptr->getNumVariants()) + { + variant_column_ptr->rollback(checkpoint); + return; + } + + /// Manually rollback internals of Variant column + variant_column_ptr->getOffsets().resize_assume_reserved(checkpoint.size); + variant_column_ptr->getLocalDiscriminators().resize_assume_reserved(checkpoint.size); + + auto & variants = variant_column_ptr->getVariants(); + for (size_t i = 0; i < nested.size(); ++i) + variants[i]->rollback(*nested[i]); + + /// Keep the structure of variant as is but rollback + /// to 0 variants that are not in the checkpoint. 
+ for (size_t i = nested.size(); i < variants.size(); ++i) + variants[i] = variants[i]->cloneEmpty(); +} + String ColumnDynamic::getTypeNameAt(size_t row_num) const { const auto & variant_col = getVariantColumn(); diff --git a/src/Columns/ColumnDynamic.h b/src/Columns/ColumnDynamic.h index 17b0d80e5eb..57a1545a832 100644 --- a/src/Columns/ColumnDynamic.h +++ b/src/Columns/ColumnDynamic.h @@ -304,6 +304,15 @@ public: variant_column_ptr->protect(); } + ColumnCheckpointPtr getCheckpoint() const override + { + return variant_column_ptr->getCheckpoint(); + } + + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + + void rollback(const ColumnCheckpoint & checkpoint) override; + void forEachSubcolumn(MutableColumnCallback callback) override { callback(variant_column); diff --git a/src/Columns/ColumnMap.cpp b/src/Columns/ColumnMap.cpp index 536da4d06d0..7ebbed930d8 100644 --- a/src/Columns/ColumnMap.cpp +++ b/src/Columns/ColumnMap.cpp @@ -312,6 +312,21 @@ void ColumnMap::getExtremes(Field & min, Field & max) const max = std::move(map_max_value); } +ColumnCheckpointPtr ColumnMap::getCheckpoint() const +{ + return nested->getCheckpoint(); +} + +void ColumnMap::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + nested->updateCheckpoint(checkpoint); +} + +void ColumnMap::rollback(const ColumnCheckpoint & checkpoint) +{ + nested->rollback(checkpoint); +} + void ColumnMap::forEachSubcolumn(MutableColumnCallback callback) { callback(nested); diff --git a/src/Columns/ColumnMap.h b/src/Columns/ColumnMap.h index 39d15a586b9..575114f8d3a 100644 --- a/src/Columns/ColumnMap.h +++ b/src/Columns/ColumnMap.h @@ -102,6 +102,9 @@ public: size_t byteSizeAt(size_t n) const override; size_t allocatedBytes() const override; void protect() override; + ColumnCheckpointPtr getCheckpoint() const override; + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + void rollback(const ColumnCheckpoint & checkpoint) override; void forEachSubcolumn(MutableColumnCallback callback) override; void forEachSubcolumnRecursively(RecursiveMutableColumnCallback callback) override; bool structureEquals(const IColumn & rhs) const override; diff --git a/src/Columns/ColumnNullable.cpp b/src/Columns/ColumnNullable.cpp index 390df390ae6..6e8bd3fc70c 100644 --- a/src/Columns/ColumnNullable.cpp +++ b/src/Columns/ColumnNullable.cpp @@ -302,6 +302,23 @@ void ColumnNullable::popBack(size_t n) getNullMapColumn().popBack(n); } +ColumnCheckpointPtr ColumnNullable::getCheckpoint() const +{ + return std::make_shared(size(), nested_column->getCheckpoint()); +} + +void ColumnNullable::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + checkpoint.size = size(); + nested_column->updateCheckpoint(*assert_cast(checkpoint).nested); +} + +void ColumnNullable::rollback(const ColumnCheckpoint & checkpoint) +{ + getNullMapData().resize_assume_reserved(checkpoint.size); + nested_column->rollback(*assert_cast(checkpoint).nested); +} + ColumnPtr ColumnNullable::filter(const Filter & filt, ssize_t result_size_hint) const { ColumnPtr filtered_data = getNestedColumn().filter(filt, result_size_hint); diff --git a/src/Columns/ColumnNullable.h b/src/Columns/ColumnNullable.h index 78274baca51..32ce66c5965 100644 --- a/src/Columns/ColumnNullable.h +++ b/src/Columns/ColumnNullable.h @@ -143,6 +143,10 @@ public: ColumnPtr compress() const override; + ColumnCheckpointPtr getCheckpoint() const override; + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + void rollback(const ColumnCheckpoint & checkpoint) 
override; + void forEachSubcolumn(MutableColumnCallback callback) override { callback(nested_column); diff --git a/src/Columns/ColumnObject.cpp b/src/Columns/ColumnObject.cpp index b962887f7b5..18ba8ed36ee 100644 --- a/src/Columns/ColumnObject.cpp +++ b/src/Columns/ColumnObject.cpp @@ -30,6 +30,23 @@ const std::shared_ptr & getDynamicSerialization() return dynamic_serialization; } +struct ColumnObjectCheckpoint : public ColumnCheckpoint +{ + using CheckpointsMap = std::unordered_map; + + ColumnObjectCheckpoint(size_t size_, CheckpointsMap typed_paths_, CheckpointsMap dynamic_paths_, ColumnCheckpointPtr shared_data_) + : ColumnCheckpoint(size_) + , typed_paths(std::move(typed_paths_)) + , dynamic_paths(std::move(dynamic_paths_)) + , shared_data(std::move(shared_data_)) + { + } + + CheckpointsMap typed_paths; + CheckpointsMap dynamic_paths; + ColumnCheckpointPtr shared_data; +}; + } ColumnObject::ColumnObject( @@ -698,6 +715,69 @@ void ColumnObject::popBack(size_t n) shared_data->popBack(n); } +ColumnCheckpointPtr ColumnObject::getCheckpoint() const +{ + auto get_checkpoints = [](const auto & columns) + { + ColumnObjectCheckpoint::CheckpointsMap checkpoints; + for (const auto & [name, column] : columns) + checkpoints[name] = column->getCheckpoint(); + + return checkpoints; + }; + + return std::make_shared(size(), get_checkpoints(typed_paths), get_checkpoints(dynamic_paths_ptrs), shared_data->getCheckpoint()); +} + +void ColumnObject::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + auto & object_checkpoint = assert_cast(checkpoint); + + auto update_checkpoints = [&](const auto & columns_map, auto & checkpoints_map) + { + for (const auto & [name, column] : columns_map) + { + auto & nested = checkpoints_map[name]; + if (!nested) + nested = column->getCheckpoint(); + else + column->updateCheckpoint(*nested); + } + }; + + checkpoint.size = size(); + update_checkpoints(typed_paths, object_checkpoint.typed_paths); + update_checkpoints(dynamic_paths, object_checkpoint.dynamic_paths); + shared_data->updateCheckpoint(*object_checkpoint.shared_data); +} + +void ColumnObject::rollback(const ColumnCheckpoint & checkpoint) +{ + const auto & object_checkpoint = assert_cast(checkpoint); + + auto rollback_columns = [&](auto & columns_map, const auto & checkpoints_map) + { + NameSet names_to_remove; + + /// Rollback subcolumns and remove paths that were not in checkpoint. 
+ for (auto & [name, column] : columns_map) + { + auto it = checkpoints_map.find(name); + if (it == checkpoints_map.end()) + names_to_remove.insert(name); + else + column->rollback(*it->second); + } + + for (const auto & name : names_to_remove) + columns_map.erase(name); + }; + + rollback_columns(typed_paths, object_checkpoint.typed_paths); + rollback_columns(dynamic_paths, object_checkpoint.dynamic_paths); + shared_data->rollback(*object_checkpoint.shared_data); +} + StringRef ColumnObject::serializeValueIntoArena(size_t n, Arena & arena, const char *& begin) const { StringRef res(begin, 0); diff --git a/src/Columns/ColumnObject.h b/src/Columns/ColumnObject.h index a2f9964a243..74ae7e136ce 100644 --- a/src/Columns/ColumnObject.h +++ b/src/Columns/ColumnObject.h @@ -161,6 +161,9 @@ public: size_t byteSizeAt(size_t n) const override; size_t allocatedBytes() const override; void protect() override; + ColumnCheckpointPtr getCheckpoint() const override; + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + void rollback(const ColumnCheckpoint & checkpoint) override; void forEachSubcolumn(MutableColumnCallback callback) override; diff --git a/src/Columns/ColumnSparse.cpp b/src/Columns/ColumnSparse.cpp index a908d970a15..a0e47e65fc6 100644 --- a/src/Columns/ColumnSparse.cpp +++ b/src/Columns/ColumnSparse.cpp @@ -308,6 +308,28 @@ void ColumnSparse::popBack(size_t n) _size = new_size; } +ColumnCheckpointPtr ColumnSparse::getCheckpoint() const +{ + return std::make_shared(size(), values->getCheckpoint()); +} + +void ColumnSparse::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + checkpoint.size = size(); + values->updateCheckpoint(*assert_cast(checkpoint).nested); +} + +void ColumnSparse::rollback(const ColumnCheckpoint & checkpoint) +{ + _size = checkpoint.size; + + const auto & nested = *assert_cast(checkpoint).nested; + chassert(nested.size > 0); + + values->rollback(nested); + getOffsetsData().resize_assume_reserved(nested.size - 1); +} + ColumnPtr ColumnSparse::filter(const Filter & filt, ssize_t) const { if (_size != filt.size()) diff --git a/src/Columns/ColumnSparse.h b/src/Columns/ColumnSparse.h index 7a4d914e62a..619dce63c1e 100644 --- a/src/Columns/ColumnSparse.h +++ b/src/Columns/ColumnSparse.h @@ -149,6 +149,10 @@ public: ColumnPtr compress() const override; + ColumnCheckpointPtr getCheckpoint() const override; + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + void rollback(const ColumnCheckpoint & checkpoint) override; + void forEachSubcolumn(MutableColumnCallback callback) override; void forEachSubcolumnRecursively(RecursiveMutableColumnCallback callback) override; diff --git a/src/Columns/ColumnString.cpp b/src/Columns/ColumnString.cpp index 00cf3bd9c30..269c20397b4 100644 --- a/src/Columns/ColumnString.cpp +++ b/src/Columns/ColumnString.cpp @@ -240,6 +240,23 @@ ColumnPtr ColumnString::permute(const Permutation & perm, size_t limit) const return permuteImpl(*this, perm, limit); } +ColumnCheckpointPtr ColumnString::getCheckpoint() const +{ + auto nested = std::make_shared(chars.size()); + return std::make_shared(size(), std::move(nested)); +} + +void ColumnString::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + checkpoint.size = size(); + assert_cast(checkpoint).nested->size = chars.size(); +} + +void ColumnString::rollback(const ColumnCheckpoint & checkpoint) +{ + offsets.resize_assume_reserved(checkpoint.size); + chars.resize_assume_reserved(assert_cast(checkpoint).nested->size); +} void 
ColumnString::collectSerializedValueSizes(PaddedPODArray & sizes, const UInt8 * is_null) const { diff --git a/src/Columns/ColumnString.h b/src/Columns/ColumnString.h index ec0563b3f00..c2371412437 100644 --- a/src/Columns/ColumnString.h +++ b/src/Columns/ColumnString.h @@ -194,6 +194,10 @@ public: offsets.resize_assume_reserved(offsets.size() - n); } + ColumnCheckpointPtr getCheckpoint() const override; + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + void rollback(const ColumnCheckpoint & checkpoint) override; + void collectSerializedValueSizes(PaddedPODArray & sizes, const UInt8 * is_null) const override; StringRef serializeValueIntoArena(size_t n, Arena & arena, char const *& begin) const override; diff --git a/src/Columns/ColumnTuple.cpp b/src/Columns/ColumnTuple.cpp index 1cb6d0b60d8..c3f7d10f650 100644 --- a/src/Columns/ColumnTuple.cpp +++ b/src/Columns/ColumnTuple.cpp @@ -254,6 +254,37 @@ void ColumnTuple::popBack(size_t n) column->popBack(n); } +ColumnCheckpointPtr ColumnTuple::getCheckpoint() const +{ + ColumnCheckpoints checkpoints; + checkpoints.reserve(columns.size()); + + for (const auto & column : columns) + checkpoints.push_back(column->getCheckpoint()); + + return std::make_shared(size(), std::move(checkpoints)); +} + +void ColumnTuple::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + auto & checkpoints = assert_cast(checkpoint).nested; + chassert(checkpoints.size() == columns.size()); + + checkpoint.size = size(); + for (size_t i = 0; i < columns.size(); ++i) + columns[i]->updateCheckpoint(*checkpoints[i]); +} + +void ColumnTuple::rollback(const ColumnCheckpoint & checkpoint) +{ + column_length = checkpoint.size; + const auto & checkpoints = assert_cast(checkpoint).nested; + + chassert(columns.size() == checkpoints.size()); + for (size_t i = 0; i < columns.size(); ++i) + columns[i]->rollback(*checkpoints[i]); +} + StringRef ColumnTuple::serializeValueIntoArena(size_t n, Arena & arena, char const *& begin) const { if (columns.empty()) diff --git a/src/Columns/ColumnTuple.h b/src/Columns/ColumnTuple.h index 6968294aef9..c73f90f13d9 100644 --- a/src/Columns/ColumnTuple.h +++ b/src/Columns/ColumnTuple.h @@ -118,6 +118,9 @@ public: size_t byteSizeAt(size_t n) const override; size_t allocatedBytes() const override; void protect() override; + ColumnCheckpointPtr getCheckpoint() const override; + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + void rollback(const ColumnCheckpoint & checkpoint) override; void forEachSubcolumn(MutableColumnCallback callback) override; void forEachSubcolumnRecursively(RecursiveMutableColumnCallback callback) override; bool structureEquals(const IColumn & rhs) const override; diff --git a/src/Columns/ColumnVariant.cpp b/src/Columns/ColumnVariant.cpp index 8d7de94a319..564b60e1c1d 100644 --- a/src/Columns/ColumnVariant.cpp +++ b/src/Columns/ColumnVariant.cpp @@ -739,6 +739,39 @@ void ColumnVariant::popBack(size_t n) offsets->popBack(n); } +ColumnCheckpointPtr ColumnVariant::getCheckpoint() const +{ + ColumnCheckpoints checkpoints; + checkpoints.reserve(variants.size()); + + for (const auto & column : variants) + checkpoints.push_back(column->getCheckpoint()); + + return std::make_shared(size(), std::move(checkpoints)); +} + +void ColumnVariant::updateCheckpoint(ColumnCheckpoint & checkpoint) const +{ + auto & checkpoints = assert_cast(checkpoint).nested; + chassert(checkpoints.size() == variants.size()); + + checkpoint.size = size(); + for (size_t i = 0; i < variants.size(); ++i) + 
variants[i]->updateCheckpoint(*checkpoints[i]); +} + +void ColumnVariant::rollback(const ColumnCheckpoint & checkpoint) +{ + getOffsets().resize_assume_reserved(checkpoint.size); + getLocalDiscriminators().resize_assume_reserved(checkpoint.size); + + const auto & checkpoints = assert_cast(checkpoint).nested; + chassert(variants.size() == checkpoints.size()); + + for (size_t i = 0; i < variants.size(); ++i) + variants[i]->rollback(*checkpoints[i]); +} + StringRef ColumnVariant::serializeValueIntoArena(size_t n, Arena & arena, char const *& begin) const { /// During any serialization/deserialization we should always use global discriminators. diff --git a/src/Columns/ColumnVariant.h b/src/Columns/ColumnVariant.h index a6c45ee27b8..f90a812703d 100644 --- a/src/Columns/ColumnVariant.h +++ b/src/Columns/ColumnVariant.h @@ -248,6 +248,9 @@ public: size_t byteSizeAt(size_t n) const override; size_t allocatedBytes() const override; void protect() override; + ColumnCheckpointPtr getCheckpoint() const override; + void updateCheckpoint(ColumnCheckpoint & checkpoint) const override; + void rollback(const ColumnCheckpoint & checkpoint) override; void forEachSubcolumn(MutableColumnCallback callback) override; void forEachSubcolumnRecursively(RecursiveMutableColumnCallback callback) override; bool structureEquals(const IColumn & rhs) const override; diff --git a/src/Columns/IColumn.h b/src/Columns/IColumn.h index e4fe233ffdf..95becba3fdb 100644 --- a/src/Columns/IColumn.h +++ b/src/Columns/IColumn.h @@ -49,6 +49,40 @@ struct EqualRange using EqualRanges = std::vector; +/// A checkpoint that contains size of column and all its subcolumns. +/// It can be used to rollback column to the previous state, for example +/// after failed parsing when column may be in inconsistent state. +struct ColumnCheckpoint +{ + size_t size; + + explicit ColumnCheckpoint(size_t size_) : size(size_) {} + virtual ~ColumnCheckpoint() = default; +}; + +using ColumnCheckpointPtr = std::shared_ptr; +using ColumnCheckpoints = std::vector; + +struct ColumnCheckpointWithNested : public ColumnCheckpoint +{ + ColumnCheckpointWithNested(size_t size_, ColumnCheckpointPtr nested_) + : ColumnCheckpoint(size_), nested(std::move(nested_)) + { + } + + ColumnCheckpointPtr nested; +}; + +struct ColumnCheckpointWithMultipleNested : public ColumnCheckpoint +{ + ColumnCheckpointWithMultipleNested(size_t size_, ColumnCheckpoints nested_) + : ColumnCheckpoint(size_), nested(std::move(nested_)) + { + } + + ColumnCheckpoints nested; +}; + /// Declares interface to store columns in memory. class IColumn : public COW { @@ -509,6 +543,17 @@ public: /// The operation is slow and performed only for debug builds. virtual void protect() {} + /// Returns checkpoint of current state of column. + virtual ColumnCheckpointPtr getCheckpoint() const { return std::make_shared(size()); } + + /// Updates the checkpoint with current state. It is used to avoid extra allocations in 'getCheckpoint'. + virtual void updateCheckpoint(ColumnCheckpoint & checkpoint) const { checkpoint.size = size(); } + + /// Rollbacks column to the checkpoint. + /// Unlike 'popBack' this method should work correctly even if column has invalid state. + /// Sizes of columns in checkpoint must be less or equal than current size. + virtual void rollback(const ColumnCheckpoint & checkpoint) { popBack(size() - checkpoint.size); } + /// If the column contains subcolumns (such as Array, Nullable, etc), do callback on them. /// Shallow: doesn't do recursive calls; don't do call for itself. 
diff --git a/src/Columns/tests/gtest_column_dynamic.cpp b/src/Columns/tests/gtest_column_dynamic.cpp index de76261229d..9a435a97a07 100644 --- a/src/Columns/tests/gtest_column_dynamic.cpp +++ b/src/Columns/tests/gtest_column_dynamic.cpp @@ -920,3 +920,71 @@ TEST(ColumnDynamic, compare) ASSERT_EQ(column_from->compareAt(3, 2, *column_from, -1), -1); ASSERT_EQ(column_from->compareAt(3, 4, *column_from, -1), -1); } + +TEST(ColumnDynamic, rollback) +{ + auto check_variant = [](const ColumnVariant & column_variant, std::vector sizes) + { + ASSERT_EQ(column_variant.getNumVariants(), sizes.size()); + size_t num_rows = 0; + + for (size_t i = 0; i < sizes.size(); ++i) + { + ASSERT_EQ(column_variant.getVariants()[i]->size(), sizes[i]); + num_rows += sizes[i]; + } + + ASSERT_EQ(num_rows, column_variant.size()); + }; + + auto check_checkpoint = [&](const ColumnCheckpoint & cp, std::vector sizes) + { + const auto & nested = assert_cast(cp).nested; + size_t num_rows = 0; + + for (size_t i = 0; i < nested.size(); ++i) + { + ASSERT_EQ(nested[i]->size, sizes[i]); + num_rows += sizes[i]; + } + + ASSERT_EQ(num_rows, cp.size); + }; + + std::vector>> checkpoints; + + auto column = ColumnDynamic::create(2); + auto checkpoint = column->getCheckpoint(); + + column->insert(Field(42)); + + column->updateCheckpoint(*checkpoint); + checkpoints.emplace_back(checkpoint, std::vector{0, 1, 0}); + + column->insert(Field("str1")); + column->rollback(*checkpoint); + + check_checkpoint(*checkpoint, checkpoints.back().second); + check_variant(column->getVariantColumn(), checkpoints.back().second); + + column->insert("str1"); + checkpoints.emplace_back(column->getCheckpoint(), std::vector{0, 1, 1}); + + column->insert("str2"); + checkpoints.emplace_back(column->getCheckpoint(), std::vector{0, 1, 2}); + + column->insert(Array({1, 2})); + checkpoints.emplace_back(column->getCheckpoint(), std::vector{1, 1, 2}); + + column->insert(Field(42.42)); + checkpoints.emplace_back(column->getCheckpoint(), std::vector{2, 1, 2}); + + for (const auto & [cp, sizes] : checkpoints) + { + auto column_copy = column->clone(); + column_copy->rollback(*cp); + + check_checkpoint(*cp, sizes); + check_variant(assert_cast(*column_copy).getVariantColumn(), sizes); + } +} diff --git a/src/Columns/tests/gtest_column_object.cpp b/src/Columns/tests/gtest_column_object.cpp index f6a1da64ba3..a20bd26fabd 100644 --- a/src/Columns/tests/gtest_column_object.cpp +++ b/src/Columns/tests/gtest_column_object.cpp @@ -5,6 +5,7 @@ #include #include +#include "Core/Field.h" #include using namespace DB; @@ -349,3 +350,65 @@ TEST(ColumnObject, SkipSerializedInArena) pos = col2->skipSerializedInArena(pos); ASSERT_EQ(pos, end); } + +TEST(ColumnObject, rollback) +{ + auto type = DataTypeFactory::instance().get("JSON(max_dynamic_types=10, max_dynamic_paths=2, a.a UInt32, a.b UInt32)"); + auto col = type->createColumn(); + auto & col_object = assert_cast(*col); + const auto & typed_paths = col_object.getTypedPaths(); + const auto & dynamic_paths = col_object.getDynamicPaths(); + const auto & shared_data = col_object.getSharedDataColumn(); + + auto assert_sizes = [&](size_t size) + { + for (const auto & [name, column] : typed_paths) + ASSERT_EQ(column->size(), size); + + for (const auto & [name, column] : dynamic_paths) + ASSERT_EQ(column->size(), size); + + ASSERT_EQ(shared_data.size(), size); + }; + + auto checkpoint = col_object.getCheckpoint(); + + col_object.insert(Object{{"a.a", Field{1u}}}); + col_object.updateCheckpoint(*checkpoint); + + col_object.insert(Object{{"a.b", 
Field{2u}}}); + col_object.insert(Object{{"a.a", Field{3u}}}); + + col_object.rollback(*checkpoint); + + assert_sizes(1); + ASSERT_EQ(typed_paths.size(), 2); + ASSERT_EQ(dynamic_paths.size(), 0); + + ASSERT_EQ((*typed_paths.at("a.a"))[0], Field{1u}); + ASSERT_EQ((*typed_paths.at("a.b"))[0], Field{0u}); + + col_object.insert(Object{{"a.c", Field{"ccc"}}}); + + checkpoint = col_object.getCheckpoint(); + + col_object.insert(Object{{"a.d", Field{"ddd"}}}); + col_object.insert(Object{{"a.e", Field{"eee"}}}); + + assert_sizes(4); + ASSERT_EQ(typed_paths.size(), 2); + ASSERT_EQ(dynamic_paths.size(), 2); + + ASSERT_EQ((*typed_paths.at("a.a"))[0], Field{1u}); + ASSERT_EQ((*dynamic_paths.at("a.c"))[1], Field{"ccc"}); + ASSERT_EQ((*dynamic_paths.at("a.d"))[2], Field{"ddd"}); + + col_object.rollback(*checkpoint); + + assert_sizes(2); + ASSERT_EQ(typed_paths.size(), 2); + ASSERT_EQ(dynamic_paths.size(), 1); + + ASSERT_EQ((*typed_paths.at("a.a"))[0], Field{1u}); + ASSERT_EQ((*dynamic_paths.at("a.c"))[1], Field{"ccc"}); +} diff --git a/src/Common/CurrentMetrics.cpp b/src/Common/CurrentMetrics.cpp index c4318fb0fda..0c850fd4d36 100644 --- a/src/Common/CurrentMetrics.cpp +++ b/src/Common/CurrentMetrics.cpp @@ -183,8 +183,14 @@ M(BuildVectorSimilarityIndexThreadsScheduled, "Number of queued or active jobs in the build vector similarity index thread pool.") \ \ M(DiskPlainRewritableAzureDirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \ + M(DiskPlainRewritableAzureFileCount, "Number of file entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \ + M(DiskPlainRewritableAzureUniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \ M(DiskPlainRewritableLocalDirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \ + M(DiskPlainRewritableLocalFileCount, "Number of file entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \ + M(DiskPlainRewritableLocalUniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \ M(DiskPlainRewritableS3DirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \ + M(DiskPlainRewritableS3FileCount, "Number of file entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \ + M(DiskPlainRewritableS3UniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \ \ M(MergeTreePartsLoaderThreads, "Number of threads in the MergeTree parts loader thread pool.") \ M(MergeTreePartsLoaderThreadsActive, "Number of threads in the MergeTree parts loader thread pool running a task.") \ diff --git a/src/Common/Exception.cpp b/src/Common/Exception.cpp index d68537513da..320fc06cb2f 100644 --- a/src/Common/Exception.cpp +++ b/src/Common/Exception.cpp @@ -627,7 +627,7 @@ PreformattedMessage getExceptionMessageAndPattern(const Exception & e, bool with return PreformattedMessage{stream.str(), e.tryGetMessageFormatString(), e.getMessageFormatStringArgs()}; } -std::string getExceptionMessage(std::exception_ptr e, bool with_stacktrace) +std::string getExceptionMessage(std::exception_ptr e, bool with_stacktrace, bool check_embedded_stacktrace) { try { @@ -635,7 +635,7 @@ std::string getExceptionMessage(std::exception_ptr e, bool with_stacktrace) } catch (...) 
{ - return getCurrentExceptionMessage(with_stacktrace); + return getCurrentExceptionMessage(with_stacktrace, check_embedded_stacktrace); } } diff --git a/src/Common/Exception.h b/src/Common/Exception.h index a4f55f41caa..8ec640ff642 100644 --- a/src/Common/Exception.h +++ b/src/Common/Exception.h @@ -329,7 +329,7 @@ void tryLogException(std::exception_ptr e, const AtomicLogger & logger, const st std::string getExceptionMessage(const Exception & e, bool with_stacktrace, bool check_embedded_stacktrace = false); PreformattedMessage getExceptionMessageAndPattern(const Exception & e, bool with_stacktrace, bool check_embedded_stacktrace = false); -std::string getExceptionMessage(std::exception_ptr e, bool with_stacktrace); +std::string getExceptionMessage(std::exception_ptr e, bool with_stacktrace, bool check_embedded_stacktrace = false); template diff --git a/src/Common/GWPAsan.cpp b/src/Common/GWPAsan.cpp index de6991191ea..a210fb3a73a 100644 --- a/src/Common/GWPAsan.cpp +++ b/src/Common/GWPAsan.cpp @@ -57,7 +57,7 @@ static bool guarded_alloc_initialized = [] opts.MaxSimultaneousAllocations = 1024; if (!env_options_raw || !std::string_view{env_options_raw}.contains("SampleRate")) - opts.SampleRate = 10000; + opts.SampleRate = 0; const char * collect_stacktraces = std::getenv("GWP_ASAN_COLLECT_STACKTRACES"); // NOLINT(concurrency-mt-unsafe) if (collect_stacktraces && std::string_view{collect_stacktraces} == "1") diff --git a/src/Common/GWPAsan.h b/src/Common/GWPAsan.h index 846c3417db4..c01a1130739 100644 --- a/src/Common/GWPAsan.h +++ b/src/Common/GWPAsan.h @@ -8,7 +8,6 @@ #include #include -#include namespace GWPAsan { @@ -39,14 +38,6 @@ inline bool shouldSample() return init_finished.load(std::memory_order_relaxed) && GuardedAlloc.shouldSample(); } -inline bool shouldForceSample() -{ - if (!init_finished.load(std::memory_order_relaxed)) - return false; - std::bernoulli_distribution dist(force_sample_probability.load(std::memory_order_relaxed)); - return dist(thread_local_rng); -} - } #endif diff --git a/src/Common/MemoryTracker.cpp b/src/Common/MemoryTracker.cpp index 3ed943f217d..f4af019605e 100644 --- a/src/Common/MemoryTracker.cpp +++ b/src/Common/MemoryTracker.cpp @@ -68,15 +68,15 @@ inline std::string_view toDescription(OvercommitResult result) case OvercommitResult::NONE: return ""; case OvercommitResult::DISABLED: - return "Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero."; + return "Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero"; case OvercommitResult::MEMORY_FREED: throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "OvercommitResult::MEMORY_FREED shouldn't be asked for description"); case OvercommitResult::SELECTED: - return "Query was selected to stop by OvercommitTracker."; + return "Query was selected to stop by OvercommitTracker"; case OvercommitResult::TIMEOUTED: - return "Waiting timeout for memory to be freed is reached."; + return "Waiting timeout for memory to be freed is reached"; case OvercommitResult::NOT_ENOUGH_FREED: - return "Memory overcommit has freed not enough memory."; + return "Memory overcommit has not freed enough memory"; } } @@ -150,15 +150,23 @@ void MemoryTracker::logPeakMemoryUsage() auto peak_bytes = peak.load(std::memory_order::relaxed); if (peak_bytes < 128 * 1024) return; - LOG_DEBUG(getLogger("MemoryTracker"), - "Peak memory usage{}: {}.", (description ? 
" " + std::string(description) : ""), ReadableSize(peak_bytes)); + LOG_DEBUG( + getLogger("MemoryTracker"), + "{}{} memory usage: {}.", + description ? std::string(description) : "", + description ? " peak" : "Peak", + ReadableSize(peak_bytes)); } void MemoryTracker::logMemoryUsage(Int64 current) const { const auto * description = description_ptr.load(std::memory_order_relaxed); - LOG_DEBUG(getLogger("MemoryTracker"), - "Current memory usage{}: {}.", (description ? " " + std::string(description) : ""), ReadableSize(current)); + LOG_DEBUG( + getLogger("MemoryTracker"), + "{}{} memory usage: {}.", + description ? std::string(description) : "", + description ? " current" : "Current", + ReadableSize(current)); } void MemoryTracker::injectFault() const @@ -178,9 +186,9 @@ void MemoryTracker::injectFault() const const auto * description = description_ptr.load(std::memory_order_relaxed); throw DB::Exception( DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED, - "Memory tracker{}{}: fault injected (at specific point)", - description ? " " : "", - description ? description : ""); + "{}{}: fault injected (at specific point)", + description ? description : "", + description ? " memory tracker" : "Memory tracker"); } void MemoryTracker::debugLogBigAllocationWithoutCheck(Int64 size [[maybe_unused]]) @@ -282,9 +290,9 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed const auto * description = description_ptr.load(std::memory_order_relaxed); throw DB::Exception( DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED, - "Memory tracker{}{}: fault injected. Would use {} (attempt to allocate chunk of {} bytes), maximum: {}", - description ? " " : "", + "{}{}: fault injected. Would use {} (attempt to allocate chunk of {} bytes), maximum: {}", description ? description : "", + description ? " memory tracker" : "Memory tracker", formatReadableSizeWithBinarySuffix(will_be), size, formatReadableSizeWithBinarySuffix(current_hard_limit)); @@ -305,6 +313,8 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed if (overcommit_result != OvercommitResult::MEMORY_FREED) { + bool overcommit_result_ignore + = overcommit_result == OvercommitResult::NONE || overcommit_result == OvercommitResult::DISABLED; /// Revert amount.fetch_sub(size, std::memory_order_relaxed); rss.fetch_sub(size, std::memory_order_relaxed); @@ -314,18 +324,18 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed ProfileEvents::increment(ProfileEvents::QueryMemoryLimitExceeded); const auto * description = description_ptr.load(std::memory_order_relaxed); throw DB::Exception( - DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED, - "Memory limit{}{} exceeded: " - "would use {} (attempt to allocate chunk of {} bytes), current RSS {}, maximum: {}." - "{}{}", - description ? " " : "", - description ? description : "", - formatReadableSizeWithBinarySuffix(will_be), - size, - formatReadableSizeWithBinarySuffix(rss.load(std::memory_order_relaxed)), - formatReadableSizeWithBinarySuffix(current_hard_limit), - overcommit_result == OvercommitResult::NONE ? "" : " OvercommitTracker decision: ", - toDescription(overcommit_result)); + DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED, + "{}{} exceeded: " + "would use {} (attempt to allocate chunk of {} bytes), current RSS {}, maximum: {}." + "{}{}", + description ? description : "", + description ? 
" memory limit" : "Memory limit", + formatReadableSizeWithBinarySuffix(will_be), + size, + formatReadableSizeWithBinarySuffix(rss.load(std::memory_order_relaxed)), + formatReadableSizeWithBinarySuffix(current_hard_limit), + overcommit_result_ignore ? "" : " OvercommitTracker decision: ", + overcommit_result_ignore ? "" : toDescription(overcommit_result)); } // If OvercommitTracker::needToStopQuery returned false, it guarantees that enough memory is freed. diff --git a/src/Common/PODArray.h b/src/Common/PODArray.h index 48f2ffee8ce..2d69b8ac26c 100644 --- a/src/Common/PODArray.h +++ b/src/Common/PODArray.h @@ -115,11 +115,6 @@ protected: template void alloc(size_t bytes, TAllocatorParams &&... allocator_params) { -#if USE_GWP_ASAN - if (unlikely(GWPAsan::shouldForceSample())) - gwp_asan::getThreadLocals()->NextSampleCounter = 1; -#endif - char * allocated = reinterpret_cast(TAllocator::alloc(bytes, std::forward(allocator_params)...)); c_start = allocated + pad_left; @@ -149,11 +144,6 @@ protected: return; } -#if USE_GWP_ASAN - if (unlikely(GWPAsan::shouldForceSample())) - gwp_asan::getThreadLocals()->NextSampleCounter = 1; -#endif - unprotect(); ptrdiff_t end_diff = c_end - c_start; diff --git a/src/Common/Priority.h b/src/Common/Priority.h index 8952fe4dd5a..f0e5787ae91 100644 --- a/src/Common/Priority.h +++ b/src/Common/Priority.h @@ -6,6 +6,7 @@ /// Separate type (rather than `Int64` is used just to avoid implicit conversion errors and to default-initialize struct Priority { - Int64 value = 0; /// Note that lower value means higher priority. - constexpr operator Int64() const { return value; } /// NOLINT + using Value = Int64; + Value value = 0; /// Note that lower value means higher priority. + constexpr operator Value() const { return value; } /// NOLINT }; diff --git a/src/Common/Scheduler/IResourceManager.h b/src/Common/Scheduler/IResourceManager.h index 8a7077ac3d5..c6f41346e11 100644 --- a/src/Common/Scheduler/IResourceManager.h +++ b/src/Common/Scheduler/IResourceManager.h @@ -26,6 +26,9 @@ class IClassifier : private boost::noncopyable public: virtual ~IClassifier() = default; + /// Returns true iff resource access is allowed by this classifier + virtual bool has(const String & resource_name) = 0; + /// Returns ResourceLink that should be used to access resource. /// Returned link is valid until classifier destruction. virtual ResourceLink get(const String & resource_name) = 0; @@ -46,12 +49,15 @@ public: /// Initialize or reconfigure manager. virtual void updateConfiguration(const Poco::Util::AbstractConfiguration & config) = 0; + /// Returns true iff given resource is controlled through this manager. + virtual bool hasResource(const String & resource_name) const = 0; + /// Obtain a classifier instance required to get access to resources. /// Note that it holds resource configuration, so should be destructed when query is done. virtual ClassifierPtr acquire(const String & classifier_name) = 0; /// For introspection, see `system.scheduler` table - using VisitorFunc = std::function; + using VisitorFunc = std::function; virtual void forEachNode(VisitorFunc visitor) = 0; }; diff --git a/src/Common/Scheduler/ISchedulerConstraint.h b/src/Common/Scheduler/ISchedulerConstraint.h index a976206de74..3bee9c1b424 100644 --- a/src/Common/Scheduler/ISchedulerConstraint.h +++ b/src/Common/Scheduler/ISchedulerConstraint.h @@ -15,8 +15,7 @@ namespace DB * When constraint is again satisfied, scheduleActivation() is called from finishRequest(). 
* * Derived class behaviour requirements: - * - dequeueRequest() must fill `request->constraint` iff it is nullptr; - * - finishRequest() must be recursive: call to `parent_constraint->finishRequest()`. + * - dequeueRequest() must call `request->addConstraint()`. */ class ISchedulerConstraint : public ISchedulerNode { @@ -25,34 +24,16 @@ public: : ISchedulerNode(event_queue_, config, config_prefix) {} + ISchedulerConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_) + : ISchedulerNode(event_queue_, info_) + {} + /// Resource consumption by `request` is finished. /// Should be called outside of scheduling subsystem, implementation must be thread-safe. virtual void finishRequest(ResourceRequest * request) = 0; - void setParent(ISchedulerNode * parent_) override - { - ISchedulerNode::setParent(parent_); - - // Assign `parent_constraint` to the nearest parent derived from ISchedulerConstraint - for (ISchedulerNode * node = parent_; node != nullptr; node = node->parent) - { - if (auto * constraint = dynamic_cast(node)) - { - parent_constraint = constraint; - break; - } - } - } - /// For introspection of current state (true = satisfied, false = violated) virtual bool isSatisfied() = 0; - -protected: - // Reference to nearest parent that is also derived from ISchedulerConstraint. - // Request can traverse through multiple constraints while being dequeue from hierarchy, - // while finishing request should traverse the same chain in reverse order. - // NOTE: it must be immutable after initialization, because it is accessed in not thread-safe way from finishRequest() - ISchedulerConstraint * parent_constraint = nullptr; }; } diff --git a/src/Common/Scheduler/ISchedulerNode.h b/src/Common/Scheduler/ISchedulerNode.h index 0705c4f0a35..5e1239de274 100644 --- a/src/Common/Scheduler/ISchedulerNode.h +++ b/src/Common/Scheduler/ISchedulerNode.h @@ -57,7 +57,13 @@ struct SchedulerNodeInfo SchedulerNodeInfo() = default; - explicit SchedulerNodeInfo(const Poco::Util::AbstractConfiguration & config = emptyConfig(), const String & config_prefix = {}) + explicit SchedulerNodeInfo(double weight_, Priority priority_ = {}) + { + setWeight(weight_); + setPriority(priority_); + } + + explicit SchedulerNodeInfo(const Poco::Util::AbstractConfiguration & config, const String & config_prefix = {}) { setWeight(config.getDouble(config_prefix + ".weight", weight)); setPriority(config.getInt64(config_prefix + ".priority", priority)); @@ -68,7 +74,7 @@ struct SchedulerNodeInfo if (value <= 0 || !isfinite(value)) throw Exception( ErrorCodes::INVALID_SCHEDULER_NODE, - "Negative and non-finite node weights are not allowed: {}", + "Zero, negative and non-finite node weights are not allowed: {}", value); weight = value; } @@ -78,6 +84,11 @@ struct SchedulerNodeInfo priority.value = value; } + void setPriority(Priority value) + { + priority = value; + } + // To check if configuration update required bool equals(const SchedulerNodeInfo & o) const { @@ -123,7 +134,14 @@ public: , info(config, config_prefix) {} - virtual ~ISchedulerNode() = default; + ISchedulerNode(EventQueue * event_queue_, const SchedulerNodeInfo & info_) + : event_queue(event_queue_) + , info(info_) + {} + + virtual ~ISchedulerNode(); + + virtual const String & getTypeName() const = 0; /// Checks if two nodes configuration is equal virtual bool equals(ISchedulerNode * other) @@ -134,10 +152,11 @@ public: /// Attach new child virtual void attachChild(const std::shared_ptr & child) = 0; - /// Detach and destroy child + /// Detach child + /// NOTE: 
child might be destroyed if the only reference was stored in parent virtual void removeChild(ISchedulerNode * child) = 0; - /// Get attached child by name + /// Get attached child by name (for tests only) virtual ISchedulerNode * getChild(const String & child_name) = 0; /// Activation of child due to the first pending request @@ -147,7 +166,7 @@ public: /// Returns true iff node is active virtual bool isActive() = 0; - /// Returns number of active children + /// Returns number of active children (for introspection only). virtual size_t activeChildren() = 0; /// Returns the first request to be executed as the first component of resulting pair. @@ -155,10 +174,10 @@ public: virtual std::pair dequeueRequest() = 0; /// Returns full path string using names of every parent - String getPath() + String getPath() const { String result; - ISchedulerNode * ptr = this; + const ISchedulerNode * ptr = this; while (ptr->parent) { result = "/" + ptr->basename + result; @@ -168,10 +187,7 @@ public: } /// Attach to a parent (used by attachChild) - virtual void setParent(ISchedulerNode * parent_) - { - parent = parent_; - } + void setParent(ISchedulerNode * parent_); protected: /// Notify parents about the first pending request or constraint becoming satisfied. @@ -307,6 +323,15 @@ public: pending.notify_one(); } + /// Removes an activation from queue + void cancelActivation(ISchedulerNode * node) + { + std::unique_lock lock{mutex}; + if (node->is_linked()) + activations.erase(activations.iterator_to(*node)); + node->activation_event_id = 0; + } + /// Process single event if it exists /// Note that postponing constraint are ignored, use it to empty the queue including postponed events on shutdown /// Returns `true` iff event has been processed @@ -471,6 +496,20 @@ private: std::atomic manual_time{TimePoint()}; // for tests only }; +inline ISchedulerNode::~ISchedulerNode() +{ + // Make sure there is no dangling reference in activations queue + event_queue->cancelActivation(this); +} + +inline void ISchedulerNode::setParent(ISchedulerNode * parent_) +{ + parent = parent_; + // Avoid activation of a detached node + if (parent == nullptr) + event_queue->cancelActivation(this); +} + inline void ISchedulerNode::scheduleActivation() { if (likely(parent)) diff --git a/src/Common/Scheduler/ISchedulerQueue.h b/src/Common/Scheduler/ISchedulerQueue.h index b7a51870a24..6c77cee6b9d 100644 --- a/src/Common/Scheduler/ISchedulerQueue.h +++ b/src/Common/Scheduler/ISchedulerQueue.h @@ -21,6 +21,10 @@ public: : ISchedulerNode(event_queue_, config, config_prefix) {} + ISchedulerQueue(EventQueue * event_queue_, const SchedulerNodeInfo & info_) + : ISchedulerNode(event_queue_, info_) + {} + // Wrapper for `enqueueRequest()` that should be used to account for available resource budget // Returns `estimated_cost` that should be passed later to `adjustBudget()` [[ nodiscard ]] ResourceCost enqueueRequestUsingBudget(ResourceRequest * request) @@ -47,6 +51,11 @@ public: /// Should be called outside of scheduling subsystem, implementation must be thread-safe. virtual bool cancelRequest(ResourceRequest * request) = 0; + /// Fails all the resource requests in queue and marks this queue as not usable. + /// Afterwards any new request will be failed on `enqueueRequest()`. + /// NOTE: This is done for queues that are about to be destructed. 
+ virtual void purgeQueue() = 0; + /// For introspection ResourceCost getBudget() const { diff --git a/src/Common/Scheduler/Nodes/ClassifiersConfig.cpp b/src/Common/Scheduler/Nodes/ClassifiersConfig.cpp index 3be61801149..455d0880aa6 100644 --- a/src/Common/Scheduler/Nodes/ClassifiersConfig.cpp +++ b/src/Common/Scheduler/Nodes/ClassifiersConfig.cpp @@ -5,11 +5,6 @@ namespace DB { -namespace ErrorCodes -{ - extern const int RESOURCE_NOT_FOUND; -} - ClassifierDescription::ClassifierDescription(const Poco::Util::AbstractConfiguration & config, const String & config_prefix) { Poco::Util::AbstractConfiguration::Keys keys; @@ -31,9 +26,11 @@ ClassifiersConfig::ClassifiersConfig(const Poco::Util::AbstractConfiguration & c const ClassifierDescription & ClassifiersConfig::get(const String & classifier_name) { + static ClassifierDescription empty; if (auto it = classifiers.find(classifier_name); it != classifiers.end()) return it->second; - throw Exception(ErrorCodes::RESOURCE_NOT_FOUND, "Unknown workload classifier '{}' to access resources", classifier_name); + else + return empty; } } diff --git a/src/Common/Scheduler/Nodes/ClassifiersConfig.h b/src/Common/Scheduler/Nodes/ClassifiersConfig.h index 186c49943ad..62db719568b 100644 --- a/src/Common/Scheduler/Nodes/ClassifiersConfig.h +++ b/src/Common/Scheduler/Nodes/ClassifiersConfig.h @@ -10,6 +10,7 @@ namespace DB /// Mapping of resource name into path string (e.g. "disk1" -> "/path/to/class") struct ClassifierDescription : std::unordered_map { + ClassifierDescription() = default; ClassifierDescription(const Poco::Util::AbstractConfiguration & config, const String & config_prefix); }; diff --git a/src/Common/Scheduler/Nodes/DynamicResourceManager.cpp b/src/Common/Scheduler/Nodes/CustomResourceManager.cpp similarity index 84% rename from src/Common/Scheduler/Nodes/DynamicResourceManager.cpp rename to src/Common/Scheduler/Nodes/CustomResourceManager.cpp index 5bf884fc3df..b9ab89ee2b8 100644 --- a/src/Common/Scheduler/Nodes/DynamicResourceManager.cpp +++ b/src/Common/Scheduler/Nodes/CustomResourceManager.cpp @@ -1,7 +1,6 @@ -#include +#include #include -#include #include #include @@ -21,7 +20,7 @@ namespace ErrorCodes extern const int INVALID_SCHEDULER_NODE; } -DynamicResourceManager::State::State(EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config) +CustomResourceManager::State::State(EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config) : classifiers(config) { Poco::Util::AbstractConfiguration::Keys keys; @@ -35,7 +34,7 @@ DynamicResourceManager::State::State(EventQueue * event_queue, const Poco::Util: } } -DynamicResourceManager::State::Resource::Resource( +CustomResourceManager::State::Resource::Resource( const String & name, EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config, @@ -92,7 +91,7 @@ DynamicResourceManager::State::Resource::Resource( throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "undefined root node path '/' for resource '{}'", name); } -DynamicResourceManager::State::Resource::~Resource() +CustomResourceManager::State::Resource::~Resource() { // NOTE: we should rely on `attached_to` and cannot use `parent`, // NOTE: because `parent` can be `nullptr` in case attachment is still in event queue @@ -106,14 +105,14 @@ DynamicResourceManager::State::Resource::~Resource() } } -DynamicResourceManager::State::Node::Node(const String & name, EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix) 
+CustomResourceManager::State::Node::Node(const String & name, EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix) : type(config.getString(config_prefix + ".type", "fifo")) , ptr(SchedulerNodeFactory::instance().get(type, event_queue, config, config_prefix)) { ptr->basename = name; } -bool DynamicResourceManager::State::Resource::equals(const DynamicResourceManager::State::Resource & o) const +bool CustomResourceManager::State::Resource::equals(const CustomResourceManager::State::Resource & o) const { if (nodes.size() != o.nodes.size()) return false; @@ -130,14 +129,14 @@ bool DynamicResourceManager::State::Resource::equals(const DynamicResourceManage return true; } -bool DynamicResourceManager::State::Node::equals(const DynamicResourceManager::State::Node & o) const +bool CustomResourceManager::State::Node::equals(const CustomResourceManager::State::Node & o) const { if (type != o.type) return false; return ptr->equals(o.ptr.get()); } -DynamicResourceManager::Classifier::Classifier(const DynamicResourceManager::StatePtr & state_, const String & classifier_name) +CustomResourceManager::Classifier::Classifier(const CustomResourceManager::StatePtr & state_, const String & classifier_name) : state(state_) { // State is immutable, but nodes are mutable and thread-safe @@ -162,20 +161,25 @@ DynamicResourceManager::Classifier::Classifier(const DynamicResourceManager::Sta } } -ResourceLink DynamicResourceManager::Classifier::get(const String & resource_name) +bool CustomResourceManager::Classifier::has(const String & resource_name) +{ + return resources.contains(resource_name); +} + +ResourceLink CustomResourceManager::Classifier::get(const String & resource_name) { if (auto iter = resources.find(resource_name); iter != resources.end()) return iter->second; throw Exception(ErrorCodes::RESOURCE_ACCESS_DENIED, "Access denied to resource '{}'", resource_name); } -DynamicResourceManager::DynamicResourceManager() +CustomResourceManager::CustomResourceManager() : state(new State()) { scheduler.start(); } -void DynamicResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration & config) +void CustomResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration & config) { StatePtr new_state = std::make_shared(scheduler.event_queue, config); @@ -217,7 +221,13 @@ void DynamicResourceManager::updateConfiguration(const Poco::Util::AbstractConfi // NOTE: after mutex unlock `state` became available for Classifier(s) and must be immutable } -ClassifierPtr DynamicResourceManager::acquire(const String & classifier_name) +bool CustomResourceManager::hasResource(const String & resource_name) const +{ + std::lock_guard lock{mutex}; + return state->resources.contains(resource_name); +} + +ClassifierPtr CustomResourceManager::acquire(const String & classifier_name) { // Acquire a reference to the current state StatePtr state_ref; @@ -229,7 +239,7 @@ ClassifierPtr DynamicResourceManager::acquire(const String & classifier_name) return std::make_shared(state_ref, classifier_name); } -void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor) +void CustomResourceManager::forEachNode(IResourceManager::VisitorFunc visitor) { // Acquire a reference to the current state StatePtr state_ref; @@ -244,7 +254,7 @@ void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor) { for (auto & [name, resource] : state_ref->resources) for (auto & [path, node] : resource->nodes) - visitor(name, path, 
node.type, node.ptr); + visitor(name, path, node.ptr.get()); promise.set_value(); }); @@ -252,9 +262,4 @@ void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor) future.get(); } -void registerDynamicResourceManager(ResourceManagerFactory & factory) -{ - factory.registerMethod("dynamic"); -} - } diff --git a/src/Common/Scheduler/Nodes/DynamicResourceManager.h b/src/Common/Scheduler/Nodes/CustomResourceManager.h similarity index 86% rename from src/Common/Scheduler/Nodes/DynamicResourceManager.h rename to src/Common/Scheduler/Nodes/CustomResourceManager.h index 4b0a3a48b61..900a9c4e50b 100644 --- a/src/Common/Scheduler/Nodes/DynamicResourceManager.h +++ b/src/Common/Scheduler/Nodes/CustomResourceManager.h @@ -10,7 +10,9 @@ namespace DB { /* - * Implementation of `IResourceManager` supporting arbitrary dynamic hierarchy of scheduler nodes. + * Implementation of `IResourceManager` supporting arbitrary hierarchy of scheduler nodes. + * Scheduling hierarchies for every resource is described through server xml or yaml configuration. + * Configuration could be changed dynamically without server restart. * All resources are controlled by single root `SchedulerRoot`. * * State of manager is set of resources attached to the scheduler. States are referenced by classifiers. @@ -24,11 +26,12 @@ namespace DB * violation will apply to fairness. Old version exists as long as there is at least one classifier * instance referencing it. Classifiers are typically attached to queries and will be destructed with them. */ -class DynamicResourceManager : public IResourceManager +class CustomResourceManager : public IResourceManager { public: - DynamicResourceManager(); + CustomResourceManager(); void updateConfiguration(const Poco::Util::AbstractConfiguration & config) override; + bool hasResource(const String & resource_name) const override; ClassifierPtr acquire(const String & classifier_name) override; void forEachNode(VisitorFunc visitor) override; @@ -79,6 +82,7 @@ private: { public: Classifier(const StatePtr & state_, const String & classifier_name); + bool has(const String & resource_name) override; ResourceLink get(const String & resource_name) override; private: std::unordered_map resources; // accessible resources by names @@ -86,7 +90,7 @@ private: }; SchedulerRoot scheduler; - std::mutex mutex; + mutable std::mutex mutex; StatePtr state; }; diff --git a/src/Common/Scheduler/Nodes/FairPolicy.h b/src/Common/Scheduler/Nodes/FairPolicy.h index 246642ff2fd..a865711c460 100644 --- a/src/Common/Scheduler/Nodes/FairPolicy.h +++ b/src/Common/Scheduler/Nodes/FairPolicy.h @@ -28,7 +28,7 @@ namespace ErrorCodes * of a child is set to vruntime of "start" of the last request. This guarantees immediate processing * of at least single request of newly activated children and thus best isolation and scheduling latency. 
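The fair policy comment above refers to the vruntime concept. A simplified standalone illustration of weighted fair sharing via minimal virtual runtime (this is not the exact FairPolicy algorithm, only the general principle; the child names, weights and costs are made up):

#include <cstdio>
#include <map>
#include <string>

struct Child
{
    double weight = 1.0;
    double vruntime = 0.0;
};

int main()
{
    std::map<std::string, Child> children{{"prod", Child{.weight = 3.0}}, {"dev", Child{.weight = 1.0}}};

    for (int i = 0; i < 8; ++i)
    {
        // Serve the child with the smallest virtual runtime
        auto best = children.begin();
        for (auto it = children.begin(); it != children.end(); ++it)
            if (it->second.vruntime < best->second.vruntime)
                best = it;

        const double cost = 1.0; // pretend every request costs one unit
        best->second.vruntime += cost / best->second.weight; // heavier children advance slower, so they are served more often
        std::printf("request %d served from '%s'\n", i, best->first.c_str());
    }
}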
*/ -class FairPolicy : public ISchedulerNode +class FairPolicy final : public ISchedulerNode { /// Scheduling state of a child struct Item @@ -48,6 +48,23 @@ public: : ISchedulerNode(event_queue_, config, config_prefix) {} + FairPolicy(EventQueue * event_queue_, const SchedulerNodeInfo & info_) + : ISchedulerNode(event_queue_, info_) + {} + + ~FairPolicy() override + { + // We need to clear `parent` in all children to avoid dangling references + while (!children.empty()) + removeChild(children.begin()->second.get()); + } + + const String & getTypeName() const override + { + static String type_name("fair"); + return type_name; + } + bool equals(ISchedulerNode * other) override { if (!ISchedulerNode::equals(other)) diff --git a/src/Common/Scheduler/Nodes/FifoQueue.h b/src/Common/Scheduler/Nodes/FifoQueue.h index 90f8fffe665..9502fae1a45 100644 --- a/src/Common/Scheduler/Nodes/FifoQueue.h +++ b/src/Common/Scheduler/Nodes/FifoQueue.h @@ -23,13 +23,28 @@ namespace ErrorCodes /* * FIFO queue to hold pending resource requests */ -class FifoQueue : public ISchedulerQueue +class FifoQueue final : public ISchedulerQueue { public: FifoQueue(EventQueue * event_queue_, const Poco::Util::AbstractConfiguration & config, const String & config_prefix) : ISchedulerQueue(event_queue_, config, config_prefix) {} + FifoQueue(EventQueue * event_queue_, const SchedulerNodeInfo & info_) + : ISchedulerQueue(event_queue_, info_) + {} + + ~FifoQueue() override + { + purgeQueue(); + } + + const String & getTypeName() const override + { + static String type_name("fifo"); + return type_name; + } + bool equals(ISchedulerNode * other) override { if (!ISchedulerNode::equals(other)) @@ -42,6 +57,8 @@ public: void enqueueRequest(ResourceRequest * request) override { std::lock_guard lock(mutex); + if (is_not_usable) + throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Scheduler queue is about to be destructed"); queue_cost += request->cost; bool was_empty = requests.empty(); requests.push_back(*request); @@ -66,6 +83,8 @@ public: bool cancelRequest(ResourceRequest * request) override { std::lock_guard lock(mutex); + if (is_not_usable) + return false; // Any request should already be failed or executed if (request->is_linked()) { // It's impossible to check that `request` is indeed inserted to this queue and not another queue. 
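The FifoQueue changes above introduce a purge-before-destruction pattern: purging fails every pending request and marks the queue unusable, so late enqueues are rejected rather than silently lost. A condensed standalone sketch with hypothetical names (PurgeableQueue, PendingRequest), not the real FifoQueue:

#include <deque>
#include <exception>
#include <functional>
#include <stdexcept>

struct PendingRequest
{
    std::function<void(std::exception_ptr)> on_failure; // invoked if the queue dies before the request is served
};

class PurgeableQueue
{
public:
    void enqueue(PendingRequest request)
    {
        if (purged)
            throw std::runtime_error("Queue is about to be destructed");
        requests.push_back(std::move(request));
    }

    void purge()
    {
        purged = true; // any later enqueue() is rejected
        while (!requests.empty())
        {
            auto request = std::move(requests.front());
            requests.pop_front();
            request.on_failure(std::make_exception_ptr(std::runtime_error("Queue destroyed with pending request")));
        }
    }

    ~PurgeableQueue() { purge(); }

private:
    std::deque<PendingRequest> requests;
    bool purged = false;
};

int main()
{
    PurgeableQueue queue;
    queue.enqueue({.on_failure = [](std::exception_ptr) { /* notify the waiting query */ }});
} // destructor purges and fails the remaining request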
@@ -88,6 +107,19 @@ public: return false; } + void purgeQueue() override + { + std::lock_guard lock(mutex); + is_not_usable = true; + while (!requests.empty()) + { + ResourceRequest * request = &requests.front(); + requests.pop_front(); + request->failed(std::make_exception_ptr( + Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Scheduler queue with resource request is about to be destructed"))); + } + } + bool isActive() override { std::lock_guard lock(mutex); @@ -131,6 +163,7 @@ private: std::mutex mutex; Int64 queue_cost = 0; boost::intrusive::list requests; + bool is_not_usable = false; }; } diff --git a/src/Common/Scheduler/Nodes/IOResourceManager.cpp b/src/Common/Scheduler/Nodes/IOResourceManager.cpp new file mode 100644 index 00000000000..e2042a29a80 --- /dev/null +++ b/src/Common/Scheduler/Nodes/IOResourceManager.cpp @@ -0,0 +1,532 @@ +#include + +#include +#include + +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include +#include + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int RESOURCE_NOT_FOUND; + extern const int INVALID_SCHEDULER_NODE; + extern const int LOGICAL_ERROR; +} + +namespace +{ + String getEntityName(const ASTPtr & ast) + { + if (auto * create = typeid_cast(ast.get())) + return create->getWorkloadName(); + if (auto * create = typeid_cast(ast.get())) + return create->getResourceName(); + return "unknown-workload-entity"; + } +} + +IOResourceManager::NodeInfo::NodeInfo(const ASTPtr & ast, const String & resource_name) +{ + auto * create = assert_cast(ast.get()); + name = create->getWorkloadName(); + parent = create->getWorkloadParent(); + settings.updateFromChanges(create->changes, resource_name); +} + +IOResourceManager::Resource::Resource(const ASTPtr & resource_entity_) + : resource_entity(resource_entity_) + , resource_name(getEntityName(resource_entity)) +{ + scheduler.start(); +} + +IOResourceManager::Resource::~Resource() +{ + scheduler.stop(); +} + +void IOResourceManager::Resource::createNode(const NodeInfo & info) +{ + if (info.name.empty()) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Workload must have a name in resource '{}'", + resource_name); + + if (info.name == info.parent) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Self-referencing workload '{}' is not allowed in resource '{}'", + info.name, resource_name); + + if (node_for_workload.contains(info.name)) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Node for creating workload '{}' already exist in resource '{}'", + info.name, resource_name); + + if (!info.parent.empty() && !node_for_workload.contains(info.parent)) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Parent node '{}' for creating workload '{}' does not exist in resource '{}'", + info.parent, info.name, resource_name); + + if (info.parent.empty() && root_node) + throw Exception(ErrorCodes::LOGICAL_ERROR, "The second root workload '{}' is not allowed (current root '{}') in resource '{}'", + info.name, root_node->basename, resource_name); + + executeInSchedulerThread([&, this] + { + auto node = std::make_shared(scheduler.event_queue, info.settings); + node->basename = info.name; + if (!info.parent.empty()) + node_for_workload[info.parent]->attachUnifiedChild(node); + else + { + root_node = node; + scheduler.attachChild(root_node); + } + node_for_workload[info.name] = node; + + updateCurrentVersion(); + }); +} + +void IOResourceManager::Resource::deleteNode(const NodeInfo & info) +{ + if (!node_for_workload.contains(info.name)) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Node 
for removing workload '{}' does not exist in resource '{}'", + info.name, resource_name); + + if (!info.parent.empty() && !node_for_workload.contains(info.parent)) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Parent node '{}' for removing workload '{}' does not exist in resource '{}'", + info.parent, info.name, resource_name); + + auto node = node_for_workload[info.name]; + + if (node->hasUnifiedChildren()) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Removing workload '{}' with children in resource '{}'", + info.name, resource_name); + + executeInSchedulerThread([&] + { + if (!info.parent.empty()) + node_for_workload[info.parent]->detachUnifiedChild(node); + else + { + chassert(node == root_node); + scheduler.removeChild(root_node.get()); + root_node.reset(); + } + + node_for_workload.erase(info.name); + + updateCurrentVersion(); + }); +} + +void IOResourceManager::Resource::updateNode(const NodeInfo & old_info, const NodeInfo & new_info) +{ + if (old_info.name != new_info.name) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Updating a name of workload '{}' to '{}' is not allowed in resource '{}'", + old_info.name, new_info.name, resource_name); + + if (old_info.parent != new_info.parent && (old_info.parent.empty() || new_info.parent.empty())) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Workload '{}' invalid update of parent from '{}' to '{}' in resource '{}'", + old_info.name, old_info.parent, new_info.parent, resource_name); + + if (!node_for_workload.contains(old_info.name)) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Node for updating workload '{}' does not exist in resource '{}'", + old_info.name, resource_name); + + if (!old_info.parent.empty() && !node_for_workload.contains(old_info.parent)) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Old parent node '{}' for updating workload '{}' does not exist in resource '{}'", + old_info.parent, old_info.name, resource_name); + + if (!new_info.parent.empty() && !node_for_workload.contains(new_info.parent)) + throw Exception(ErrorCodes::LOGICAL_ERROR, "New parent node '{}' for updating workload '{}' does not exist in resource '{}'", + new_info.parent, new_info.name, resource_name); + + executeInSchedulerThread([&, this] + { + auto node = node_for_workload[old_info.name]; + bool detached = false; + if (UnifiedSchedulerNode::updateRequiresDetach(old_info.parent, new_info.parent, old_info.settings, new_info.settings)) + { + if (!old_info.parent.empty()) + node_for_workload[old_info.parent]->detachUnifiedChild(node); + detached = true; + } + + node->updateSchedulingSettings(new_info.settings); + + if (detached) + { + if (!new_info.parent.empty()) + node_for_workload[new_info.parent]->attachUnifiedChild(node); + } + updateCurrentVersion(); + }); +} + +void IOResourceManager::Resource::updateCurrentVersion() +{ + auto previous_version = current_version; + + // Create a full list of constraints and queues in the current hierarchy + current_version = std::make_shared(); + if (root_node) + root_node->addRawPointerNodes(current_version->nodes); + + // See details in version control section of description in IOResourceManager.h + if (previous_version) + { + previous_version->newer_version = current_version; + previous_version.reset(); // Destroys previous version nodes if there are no classifiers referencing it + } +} + +IOResourceManager::Workload::Workload(IOResourceManager * resource_manager_, const ASTPtr & workload_entity_) + : resource_manager(resource_manager_) + , workload_entity(workload_entity_) +{ + try + { + for (auto & [resource_name, 
resource] : resource_manager->resources) + resource->createNode(NodeInfo(workload_entity, resource_name)); + } + catch (...) + { + throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}", + getCurrentExceptionMessage(/* with_stacktrace = */ true)); + } +} + +IOResourceManager::Workload::~Workload() +{ + try + { + for (auto & [resource_name, resource] : resource_manager->resources) + resource->deleteNode(NodeInfo(workload_entity, resource_name)); + } + catch (...) + { + throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}", + getCurrentExceptionMessage(/* with_stacktrace = */ true)); + } +} + +void IOResourceManager::Workload::updateWorkload(const ASTPtr & new_entity) +{ + try + { + for (auto & [resource_name, resource] : resource_manager->resources) + resource->updateNode(NodeInfo(workload_entity, resource_name), NodeInfo(new_entity, resource_name)); + workload_entity = new_entity; + } + catch (...) + { + throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}", + getCurrentExceptionMessage(/* with_stacktrace = */ true)); + } +} + +String IOResourceManager::Workload::getParent() const +{ + return assert_cast(workload_entity.get())->getWorkloadParent(); +} + +IOResourceManager::IOResourceManager(IWorkloadEntityStorage & storage_) + : storage(storage_) + , log{getLogger("IOResourceManager")} +{ + subscription = storage.getAllEntitiesAndSubscribe( + [this] (const std::vector & events) + { + for (const auto & [entity_type, entity_name, entity] : events) + { + switch (entity_type) + { + case WorkloadEntityType::Workload: + { + if (entity) + createOrUpdateWorkload(entity_name, entity); + else + deleteWorkload(entity_name); + break; + } + case WorkloadEntityType::Resource: + { + if (entity) + createOrUpdateResource(entity_name, entity); + else + deleteResource(entity_name); + break; + } + case WorkloadEntityType::MAX: break; + } + } + }); +} + +IOResourceManager::~IOResourceManager() +{ + subscription.reset(); + resources.clear(); + workloads.clear(); +} + +void IOResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration &) +{ + // No-op +} + +void IOResourceManager::createOrUpdateWorkload(const String & workload_name, const ASTPtr & ast) +{ + std::unique_lock lock{mutex}; + if (auto workload_iter = workloads.find(workload_name); workload_iter != workloads.end()) + workload_iter->second->updateWorkload(ast); + else + workloads.emplace(workload_name, std::make_shared(this, ast)); +} + +void IOResourceManager::deleteWorkload(const String & workload_name) +{ + std::unique_lock lock{mutex}; + if (auto workload_iter = workloads.find(workload_name); workload_iter != workloads.end()) + { + // Note that we rely of the fact that workload entity storage will not drop workload that is used as a parent + workloads.erase(workload_iter); + } + else // Workload to be deleted does not exist -- do nothing, throwing exceptions from a subscription is pointless + LOG_ERROR(log, "Delete workload that doesn't exist: {}", workload_name); +} + +void IOResourceManager::createOrUpdateResource(const String & resource_name, const ASTPtr & ast) +{ + std::unique_lock lock{mutex}; + if (auto resource_iter = resources.find(resource_name); resource_iter != resources.end()) + resource_iter->second->updateResource(ast); + else + { + // Add all workloads into the new resource + auto resource = std::make_shared(ast); + for (Workload * workload : topologicallySortedWorkloads()) + 
resource->createNode(NodeInfo(workload->workload_entity, resource_name)); + + // Attach the resource + resources.emplace(resource_name, resource); + } +} + +void IOResourceManager::deleteResource(const String & resource_name) +{ + std::unique_lock lock{mutex}; + if (auto resource_iter = resources.find(resource_name); resource_iter != resources.end()) + { + resources.erase(resource_iter); + } + else // Resource to be deleted does not exist -- do nothing, throwing exceptions from a subscription is pointless + LOG_ERROR(log, "Delete resource that doesn't exist: {}", resource_name); +} + +IOResourceManager::Classifier::~Classifier() +{ + // Detach classifier from all resources in parallel (executed in every scheduler thread) + std::vector> futures; + { + std::unique_lock lock{mutex}; + futures.reserve(attachments.size()); + for (auto & [resource_name, attachment] : attachments) + { + futures.emplace_back(attachment.resource->detachClassifier(std::move(attachment.version))); + attachment.link.reset(); // Just in case because it is not valid any longer + } + } + + // Wait for all tasks to finish (to avoid races in case of exceptions) + for (auto & future : futures) + future.wait(); + + // There should not be any exceptions because it just destruct few objects, but let's rethrow just in case + for (auto & future : futures) + future.get(); + + // This unreferences and probably destroys `Resource` objects. + // NOTE: We cannot do it in the scheduler threads (because thread cannot join itself). + attachments.clear(); +} + +std::future IOResourceManager::Resource::detachClassifier(VersionPtr && version) +{ + auto detach_promise = std::make_shared>(); // event queue task is std::function, which requires copy semanticss + auto future = detach_promise->get_future(); + scheduler.event_queue->enqueue([detached_version = std::move(version), promise = std::move(detach_promise)] mutable + { + try + { + // Unreferences and probably destroys the version and scheduler nodes it owns. + // The main reason from moving destruction into the scheduler thread is to + // free memory in the same thread it was allocated to avoid memtrackers drift. + detached_version.reset(); + promise->set_value(); + } + catch (...) 
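The classifier destructor above fans the cleanup out to per-resource scheduler threads and then joins the results through futures, rethrowing only after all of them have finished. A condensed standalone sketch of that promise/future handshake, using a single worker thread and hypothetical names (the real code uses one EventQueue per resource):

#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

class Worker
{
public:
    Worker() : thread([this] { run(); }) {}

    ~Worker()
    {
        {
            std::lock_guard lock(mutex);
            stopped = true;
        }
        cv.notify_one();
        thread.join();
    }

    // Enqueue a task and return a future that becomes ready after the worker executed it
    std::future<void> submit(std::function<void()> task)
    {
        auto promise = std::make_shared<std::promise<void>>(); // shared_ptr: queued tasks must stay copyable
        auto future = promise->get_future();
        {
            std::lock_guard lock(mutex);
            tasks.push_back([task = std::move(task), promise]
            {
                try
                {
                    task();
                    promise->set_value();
                }
                catch (...)
                {
                    promise->set_exception(std::current_exception());
                }
            });
        }
        cv.notify_one();
        return future;
    }

private:
    void run()
    {
        while (true)
        {
            std::function<void()> task;
            {
                std::unique_lock lock(mutex);
                cv.wait(lock, [this] { return stopped || !tasks.empty(); });
                if (tasks.empty())
                    return; // stopped and drained
                task = std::move(tasks.front());
                tasks.pop_front();
            }
            task();
        }
    }

    std::mutex mutex;
    std::condition_variable cv;
    std::deque<std::function<void()>> tasks;
    bool stopped = false;
    std::thread thread; // declared last so the queue is initialized before the thread starts
};

int main()
{
    Worker worker;
    auto f1 = worker.submit([] { /* e.g. drop a Version inside the scheduler thread */ });
    auto f2 = worker.submit([] { /* another resource */ });
    f1.wait();
    f2.wait(); // wait for everything first to avoid races on exceptions
    f1.get();
    f2.get();
}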
+ { + promise->set_exception(std::current_exception()); + } + }); + return future; +} + +bool IOResourceManager::Classifier::has(const String & resource_name) +{ + std::unique_lock lock{mutex}; + return attachments.contains(resource_name); +} + +ResourceLink IOResourceManager::Classifier::get(const String & resource_name) +{ + std::unique_lock lock{mutex}; + if (auto iter = attachments.find(resource_name); iter != attachments.end()) + return iter->second.link; + else + throw Exception(ErrorCodes::RESOURCE_NOT_FOUND, "Access denied to resource '{}'", resource_name); +} + +void IOResourceManager::Classifier::attach(const ResourcePtr & resource, const VersionPtr & version, ResourceLink link) +{ + std::unique_lock lock{mutex}; + chassert(!attachments.contains(resource->getName())); + attachments[resource->getName()] = Attachment{.resource = resource, .version = version, .link = link}; +} + +void IOResourceManager::Resource::updateResource(const ASTPtr & new_resource_entity) +{ + chassert(getEntityName(new_resource_entity) == resource_name); + resource_entity = new_resource_entity; +} + +std::future IOResourceManager::Resource::attachClassifier(Classifier & classifier, const String & workload_name) +{ + auto attach_promise = std::make_shared>(); // event queue task is std::function, which requires copy semantics + auto future = attach_promise->get_future(); + scheduler.event_queue->enqueue([&, this, promise = std::move(attach_promise)] + { + try + { + if (auto iter = node_for_workload.find(workload_name); iter != node_for_workload.end()) + { + auto queue = iter->second->getQueue(); + if (!queue) + throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Unable to use workload '{}' that have children for resource '{}'", + workload_name, resource_name); + classifier.attach(shared_from_this(), current_version, ResourceLink{.queue = queue.get()}); + } + else + { + // This resource does not have specified workload. It is either unknown or managed by another resource manager. + // We leave this resource not attached to the classifier. Access denied will be thrown later on `classifier->get(resource_name)` + } + promise->set_value(); + } + catch (...) 
+ { + promise->set_exception(std::current_exception()); + } + }); + return future; +} + +bool IOResourceManager::hasResource(const String & resource_name) const +{ + std::unique_lock lock{mutex}; + return resources.contains(resource_name); +} + +ClassifierPtr IOResourceManager::acquire(const String & workload_name) +{ + auto classifier = std::make_shared(); + + // Attach classifier to all resources in parallel (executed in every scheduler thread) + std::vector> futures; + { + std::unique_lock lock{mutex}; + futures.reserve(resources.size()); + for (auto & [resource_name, resource] : resources) + futures.emplace_back(resource->attachClassifier(*classifier, workload_name)); + } + + // Wait for all tasks to finish (to avoid races in case of exceptions) + for (auto & future : futures) + future.wait(); + + // Rethrow exceptions if any + for (auto & future : futures) + future.get(); + + return classifier; +} + +void IOResourceManager::Resource::forEachResourceNode(IResourceManager::VisitorFunc & visitor) +{ + executeInSchedulerThread([&, this] + { + for (auto & [path, node] : node_for_workload) + { + node->forEachSchedulerNode([&] (ISchedulerNode * scheduler_node) + { + visitor(resource_name, scheduler_node->getPath(), scheduler_node); + }); + } + }); +} + +void IOResourceManager::forEachNode(IResourceManager::VisitorFunc visitor) +{ + // Copy resource to avoid holding mutex for a long time + std::unordered_map resources_copy; + { + std::unique_lock lock{mutex}; + resources_copy = resources; + } + + /// Run tasks one by one to avoid concurrent calls to visitor + for (auto & [resource_name, resource] : resources_copy) + resource->forEachResourceNode(visitor); +} + +void IOResourceManager::topologicallySortedWorkloadsImpl(Workload * workload, std::unordered_set & visited, std::vector & sorted_workloads) +{ + if (visited.contains(workload)) + return; + visited.insert(workload); + + // Recurse into parent (if any) + String parent = workload->getParent(); + if (!parent.empty()) + { + auto parent_iter = workloads.find(parent); + chassert(parent_iter != workloads.end()); // validations check that all parents exist + topologicallySortedWorkloadsImpl(parent_iter->second.get(), visited, sorted_workloads); + } + + sorted_workloads.push_back(workload); +} + +std::vector IOResourceManager::topologicallySortedWorkloads() +{ + std::vector sorted_workloads; + std::unordered_set visited; + for (auto & [workload_name, workload] : workloads) + topologicallySortedWorkloadsImpl(workload.get(), visited, sorted_workloads); + return sorted_workloads; +} + +} diff --git a/src/Common/Scheduler/Nodes/IOResourceManager.h b/src/Common/Scheduler/Nodes/IOResourceManager.h new file mode 100644 index 00000000000..cfd8a234b37 --- /dev/null +++ b/src/Common/Scheduler/Nodes/IOResourceManager.h @@ -0,0 +1,281 @@ +#pragma once + +#include +#include + +#include +#include +#include +#include +#include +#include + +#include + +#include + +#include +#include +#include +#include +#include + +namespace DB +{ + +/* + * Implementation of `IResourceManager` that creates hierarchy of scheduler nodes according to + * workload entities (WORKLOADs and RESOURCEs). It subscribes for updates in IWorkloadEntityStorage and + * creates hierarchy of UnifiedSchedulerNode identical to the hierarchy of WORKLOADs. + * For every RESOURCE an independent hierarchy of scheduler nodes is created. + * + * Manager process updates of WORKLOADs and RESOURCEs: CREATE/DROP/ALTER. 
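The topologicallySortedWorkloadsImpl() helper above orders workloads so that every parent is created before its children. A compact standalone sketch of the same parent-first DFS over hypothetical data (no cycle handling, since the workload entity storage is assumed to reject cyclic parents):

#include <cstdio>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Parents = std::unordered_map<std::string, std::string>; // workload -> parent ("" for the root)

void visit(const std::string & name, const Parents & parents, std::unordered_set<std::string> & visited, std::vector<std::string> & sorted)
{
    if (!visited.insert(name).second)
        return; // already placed
    if (const auto & parent = parents.at(name); !parent.empty())
        visit(parent, parents, visited, sorted); // recurse into the parent first
    sorted.push_back(name);
}

int main()
{
    Parents parents{{"all", ""}, {"production", "all"}, {"development", "all"}, {"adhoc", "development"}};
    std::unordered_set<std::string> visited;
    std::vector<std::string> sorted;
    for (const auto & [name, parent] : parents)
        visit(name, parents, visited, sorted);
    for (const auto & name : sorted)
        std::printf("%s\n", name.c_str()); // parents always precede their children
}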
+ * When a RESOURCE is created (dropped) a corresponding scheduler nodes hierarchy is created (destroyed). + * After DROP RESOURCE parts of hierarchy might be kept alive while at least one query uses it. + * + * Manager is specific to IO only because it create scheduler node hierarchies for RESOURCEs having + * WRITE DISK and/or READ DISK definitions. CPU and memory resources are managed separately. + * + * Classifiers are used (1) to access IO resources and (2) to keep shared ownership of scheduling nodes. + * This allows `ResourceRequest` and `ResourceLink` to hold raw pointers as long as + * `ClassifierPtr` is acquired and held. + * + * === RESOURCE ARCHITECTURE === + * Let's consider how a single resource is implemented. Every workload is represented by corresponding UnifiedSchedulerNode. + * Every UnifiedSchedulerNode manages its own subtree of ISchedulerNode objects (see details in UnifiedSchedulerNode.h) + * UnifiedSchedulerNode for workload w/o children has a queue, which provide a ResourceLink for consumption. + * Parent of the root workload for a resource is SchedulerRoot with its own scheduler thread. + * So every resource has its dedicated thread for processing of resource request and other events (see EventQueue). + * + * Here is an example of SQL and corresponding hierarchy of scheduler nodes: + * CREATE RESOURCE my_io_resource (...) + * CREATE WORKLOAD all + * CREATE WORKLOAD production PARENT all + * CREATE WORKLOAD development PARENT all + * + * root - SchedulerRoot (with scheduler thread and EventQueue) + * | + * all - UnifiedSchedulerNode + * | + * p0_fair - FairPolicy (part of parent UnifiedSchedulerNode internal structure) + * / \ + * production development - UnifiedSchedulerNode + * | | + * queue queue - FifoQueue (part of parent UnifiedSchedulerNode internal structure) + * + * === UPDATING WORKLOADS === + * Workload may be created, updated or deleted. + * Updating a child of a workload might lead to updating other workloads: + * 1. Workload itself: it's structure depend on settings of children workloads + * (e.g. fifo node of a leaf workload is remove when the first child is added; + * and a fair node is inserted after the first two children are added). + * 2. Other children: for them path to root might be changed (e.g. intermediate priority node is inserted) + * + * === VERSION CONTROL === + * Versions are created on hierarchy updates and hold ownership of nodes that are used through raw pointers. + * Classifier reference version of every resource it use. Older version reference newer version. + * Here is a diagram explaining version control based on Version objects (for 1 resource): + * + * [nodes] [nodes] [nodes] + * ^ ^ ^ + * | | | + * version1 --> version2 -...-> versionN + * ^ ^ ^ + * | | | + * old_classifier new_classifier current_version + * + * Previous version should hold reference to a newer version. It is required for proper handling of updates. + * Classifiers that were created for any of old versions may use nodes of newer version due to updateNode(). + * It may move a queue to a new position in the hierarchy or create/destroy constraints, thus resource requests + * created by old classifier may reference constraints of newer versions through `request->constraints` which + * is filled during dequeueRequest(). 
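A stripped-down model of the version chain described above, with hypothetical Node/Version types: a classifier pins the version that was current when it was created, and because every older version references the next one, nodes reachable through newer versions stay alive as well:

#include <cassert>
#include <memory>
#include <vector>

struct Node { int payload = 0; };

struct Version
{
    std::vector<std::shared_ptr<Node>> nodes; // owns nodes that are referenced elsewhere by raw pointers
    std::shared_ptr<Version> newer_version;   // older versions keep newer ones alive
};

int main()
{
    auto v1 = std::make_shared<Version>();
    v1->nodes.push_back(std::make_shared<Node>());

    auto classifier_version = v1; // a classifier pins the version current at acquire() time

    // Hierarchy update: build v2, link v1 -> v2, then the manager drops its own references
    auto v2 = std::make_shared<Version>();
    v2->nodes.push_back(std::make_shared<Node>());
    v1->newer_version = v2;
    std::weak_ptr<Node> old_node = v1->nodes.front();
    v1.reset();
    v2.reset();

    // Both versions (and their nodes) are still reachable through the classifier's reference
    assert(!old_node.expired());
    assert(classifier_version->newer_version != nullptr);

    classifier_version.reset(); // classifier destroyed -> the whole chain is released
    assert(old_node.expired());
}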
+ * + * === THREADS === + * scheduler thread: + * - one thread per resource + * - uses event_queue (per resource) for processing w/o holding mutex for every scheduler node + * - handle resource requests + * - node activations + * - scheduler hierarchy updates + * query thread: + * - multiple independent threads + * - send resource requests + * - acquire and release classifiers (via scheduler event queues) + * control thread: + * - modify workload and resources through subscription + * + * === SYNCHRONIZATION === + * List of related sync primitives and their roles: + * IOResourceManager::mutex + * - protects resource manager data structures - resource and workloads + * - serialize control thread actions + * IOResourceManager::Resource::scheduler->event_queue + * - serializes scheduler hierarchy events + * - events are created in control and query threads + * - all events are processed by specific scheduler thread + * - hierarchy-wide actions: requests dequeueing, activations propagation and nodes updates. + * - resource version control management + * FifoQueue::mutex and SemaphoreContraint::mutex + * - serializes query and scheduler threads on specific node accesses + * - resource request processing: enqueueRequest(), dequeueRequest() and finishRequest() + */ +class IOResourceManager : public IResourceManager +{ +public: + explicit IOResourceManager(IWorkloadEntityStorage & storage_); + ~IOResourceManager() override; + void updateConfiguration(const Poco::Util::AbstractConfiguration & config) override; + bool hasResource(const String & resource_name) const override; + ClassifierPtr acquire(const String & workload_name) override; + void forEachNode(VisitorFunc visitor) override; + +private: + // Forward declarations + struct NodeInfo; + struct Version; + class Resource; + struct Workload; + class Classifier; + + friend struct Workload; + + using VersionPtr = std::shared_ptr; + using ResourcePtr = std::shared_ptr; + using WorkloadPtr = std::shared_ptr; + + /// Helper for parsing workload AST for a specific resource + struct NodeInfo + { + String name; // Workload name + String parent; // Name of parent workload + SchedulingSettings settings; // Settings specific for a given resource + + NodeInfo(const ASTPtr & ast, const String & resource_name); + }; + + /// Ownership control for scheduler nodes, which could be referenced by raw pointers + struct Version + { + std::vector nodes; + VersionPtr newer_version; + }; + + /// Holds a thread and hierarchy of unified scheduler nodes for specific RESOURCE + class Resource : public std::enable_shared_from_this, boost::noncopyable + { + public: + explicit Resource(const ASTPtr & resource_entity_); + ~Resource(); + + const String & getName() const { return resource_name; } + + /// Hierarchy management + void createNode(const NodeInfo & info); + void deleteNode(const NodeInfo & info); + void updateNode(const NodeInfo & old_info, const NodeInfo & new_info); + + /// Updates resource entity + void updateResource(const ASTPtr & new_resource_entity); + + /// Updates a classifier to contain a reference for specified workload + std::future attachClassifier(Classifier & classifier, const String & workload_name); + + /// Remove classifier reference. 
This destroys scheduler nodes in proper scheduler thread + std::future detachClassifier(VersionPtr && version); + + /// Introspection + void forEachResourceNode(IOResourceManager::VisitorFunc & visitor); + + private: + void updateCurrentVersion(); + + template + void executeInSchedulerThread(Task && task) + { + std::promise promise; + auto future = promise.get_future(); + scheduler.event_queue->enqueue([&] + { + try + { + task(); + promise.set_value(); + } + catch (...) + { + promise.set_exception(std::current_exception()); + } + }); + future.get(); // Blocks until execution is done in the scheduler thread + } + + ASTPtr resource_entity; + const String resource_name; + SchedulerRoot scheduler; + + // TODO(serxa): consider using resource_manager->mutex + scheduler thread for updates and mutex only for reading to avoid slow acquire/release of classifier + /// These field should be accessed only by the scheduler thread + std::unordered_map node_for_workload; + UnifiedSchedulerNodePtr root_node; + VersionPtr current_version; + }; + + struct Workload : boost::noncopyable + { + IOResourceManager * resource_manager; + ASTPtr workload_entity; + + Workload(IOResourceManager * resource_manager_, const ASTPtr & workload_entity_); + ~Workload(); + + void updateWorkload(const ASTPtr & new_entity); + String getParent() const; + }; + + class Classifier : public IClassifier + { + public: + ~Classifier() override; + + /// Implements IClassifier interface + /// NOTE: It is called from query threads (possibly multiple) + bool has(const String & resource_name) override; + ResourceLink get(const String & resource_name) override; + + /// Attaches/detaches a specific resource + /// NOTE: It is called from scheduler threads (possibly multiple) + void attach(const ResourcePtr & resource, const VersionPtr & version, ResourceLink link); + void detach(const ResourcePtr & resource); + + private: + IOResourceManager * resource_manager; + std::mutex mutex; + struct Attachment + { + ResourcePtr resource; + VersionPtr version; + ResourceLink link; + }; + std::unordered_map attachments; // TSA_GUARDED_BY(mutex); + }; + + void createOrUpdateWorkload(const String & workload_name, const ASTPtr & ast); + void deleteWorkload(const String & workload_name); + void createOrUpdateResource(const String & resource_name, const ASTPtr & ast); + void deleteResource(const String & resource_name); + + // Topological sorting of workloads + void topologicallySortedWorkloadsImpl(Workload * workload, std::unordered_set & visited, std::vector & sorted_workloads); + std::vector topologicallySortedWorkloads(); + + IWorkloadEntityStorage & storage; + scope_guard subscription; + + mutable std::mutex mutex; + std::unordered_map workloads; // TSA_GUARDED_BY(mutex); + std::unordered_map resources; // TSA_GUARDED_BY(mutex); + + LoggerPtr log; +}; + +} diff --git a/src/Common/Scheduler/Nodes/PriorityPolicy.h b/src/Common/Scheduler/Nodes/PriorityPolicy.h index b170ab0dbee..cfbe242c13e 100644 --- a/src/Common/Scheduler/Nodes/PriorityPolicy.h +++ b/src/Common/Scheduler/Nodes/PriorityPolicy.h @@ -19,7 +19,7 @@ namespace ErrorCodes * Scheduler node that implements priority scheduling policy. * Requests are scheduled in order of priorities. 
*/ -class PriorityPolicy : public ISchedulerNode +class PriorityPolicy final : public ISchedulerNode { /// Scheduling state of a child struct Item @@ -39,6 +39,23 @@ public: : ISchedulerNode(event_queue_, config, config_prefix) {} + explicit PriorityPolicy(EventQueue * event_queue_, const SchedulerNodeInfo & node_info) + : ISchedulerNode(event_queue_, node_info) + {} + + ~PriorityPolicy() override + { + // We need to clear `parent` in all children to avoid dangling references + while (!children.empty()) + removeChild(children.begin()->second.get()); + } + + const String & getTypeName() const override + { + static String type_name("priority"); + return type_name; + } + bool equals(ISchedulerNode * other) override { if (!ISchedulerNode::equals(other)) diff --git a/src/Common/Scheduler/Nodes/SemaphoreConstraint.h b/src/Common/Scheduler/Nodes/SemaphoreConstraint.h index fe1b03b74bd..e223100a646 100644 --- a/src/Common/Scheduler/Nodes/SemaphoreConstraint.h +++ b/src/Common/Scheduler/Nodes/SemaphoreConstraint.h @@ -1,5 +1,6 @@ #pragma once +#include "Common/Scheduler/ISchedulerNode.h" #include #include @@ -13,7 +14,7 @@ namespace DB * Limited concurrency constraint. * Blocks if either number of concurrent in-flight requests exceeds `max_requests`, or their total cost exceeds `max_cost` */ -class SemaphoreConstraint : public ISchedulerConstraint +class SemaphoreConstraint final : public ISchedulerConstraint { static constexpr Int64 default_max_requests = std::numeric_limits::max(); static constexpr Int64 default_max_cost = std::numeric_limits::max(); @@ -24,6 +25,25 @@ public: , max_cost(config.getInt64(config_prefix + ".max_cost", config.getInt64(config_prefix + ".max_bytes", default_max_cost))) {} + SemaphoreConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_, Int64 max_requests_, Int64 max_cost_) + : ISchedulerConstraint(event_queue_, info_) + , max_requests(max_requests_) + , max_cost(max_cost_) + {} + + ~SemaphoreConstraint() override + { + // We need to clear `parent` in child to avoid dangling references + if (child) + removeChild(child.get()); + } + + const String & getTypeName() const override + { + static String type_name("inflight_limit"); + return type_name; + } + bool equals(ISchedulerNode * other) override { if (!ISchedulerNode::equals(other)) @@ -68,15 +88,14 @@ public: if (!request) return {nullptr, false}; - // Request has reference to the first (closest to leaf) `constraint`, which can have `parent_constraint`. - // The former is initialized here dynamically and the latter is initialized once during hierarchy construction. - if (!request->constraint) - request->constraint = this; - - // Update state on request arrival std::unique_lock lock(mutex); - requests++; - cost += request->cost; + if (request->addConstraint(this)) + { + // Update state on request arrival + requests++; + cost += request->cost; + } + child_active = child_now_active; if (!active()) busy_periods++; @@ -86,10 +105,6 @@ public: void finishRequest(ResourceRequest * request) override { - // Recursive traverse of parent flow controls in reverse order - if (parent_constraint) - parent_constraint->finishRequest(request); - // Update state on request departure std::unique_lock lock(mutex); bool was_active = active(); @@ -109,6 +124,32 @@ public: parent->activateChild(this); } + /// Update limits. 
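The semaphore constraint above admits requests only while both the number of in-flight requests and their total cost stay below the configured limits. A minimal standalone sketch of that accounting with a hypothetical InflightLimit type (the real node also integrates with activation/deactivation of its child):

#include <cassert>
#include <cstdint>
#include <mutex>

class InflightLimit
{
public:
    InflightLimit(int64_t max_requests_, int64_t max_cost_)
        : max_requests(max_requests_), max_cost(max_cost_) {}

    bool tryStart(int64_t cost_)
    {
        std::lock_guard lock(mutex);
        if (requests >= max_requests || cost >= max_cost)
            return false; // keep the request queued and retry after finish()
        ++requests;
        cost += cost_;
        return true;
    }

    void finish(int64_t cost_)
    {
        std::lock_guard lock(mutex);
        --requests;
        cost -= cost_;
    }

private:
    const int64_t max_requests;
    const int64_t max_cost;
    std::mutex mutex;
    int64_t requests = 0;
    int64_t cost = 0;
};

int main()
{
    InflightLimit limit(/*max_requests=*/2, /*max_cost=*/100);
    assert(limit.tryStart(60));
    assert(limit.tryStart(30));
    assert(!limit.tryStart(30)); // blocked: two requests already in flight
    limit.finish(60);
    assert(limit.tryStart(30));
}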
+ /// Should be called from the scheduler thread because it could lead to activation or deactivation + void updateConstraints(const SchedulerNodePtr & self, Int64 new_max_requests, UInt64 new_max_cost) + { + std::unique_lock lock(mutex); + bool was_active = active(); + max_requests = new_max_requests; + max_cost = new_max_cost; + + if (parent) + { + // Activate on transition from inactive state + if (!was_active && active()) + parent->activateChild(this); + // Deactivate on transition into inactive state + else if (was_active && !active()) + { + // Node deactivation is usually done in dequeueRequest(), but we do not want to + // do extra call to active() on every request just to make sure there was no update(). + // There is no interface method to do deactivation, so we do the following trick. + parent->removeChild(this); + parent->attachChild(self); // This call is the only reason we have `recursive_mutex` + } + } + } + bool isActive() override { std::unique_lock lock(mutex); @@ -150,10 +191,10 @@ private: return satisfied() && child_active; } - const Int64 max_requests = default_max_requests; - const Int64 max_cost = default_max_cost; + Int64 max_requests = default_max_requests; + Int64 max_cost = default_max_cost; - std::mutex mutex; + std::recursive_mutex mutex; Int64 requests = 0; Int64 cost = 0; bool child_active = false; diff --git a/src/Common/Scheduler/Nodes/ThrottlerConstraint.h b/src/Common/Scheduler/Nodes/ThrottlerConstraint.h index b279cbe972b..a2594b7ff2e 100644 --- a/src/Common/Scheduler/Nodes/ThrottlerConstraint.h +++ b/src/Common/Scheduler/Nodes/ThrottlerConstraint.h @@ -3,8 +3,6 @@ #include #include -#include -#include #include @@ -15,7 +13,7 @@ namespace DB * Limited throughput constraint. Blocks if token-bucket constraint is violated: * i.e. more than `max_burst + duration * max_speed` cost units (aka tokens) dequeued from this node in last `duration` seconds. */ -class ThrottlerConstraint : public ISchedulerConstraint +class ThrottlerConstraint final : public ISchedulerConstraint { public: static constexpr double default_burst_seconds = 1.0; @@ -28,10 +26,28 @@ public: , tokens(max_burst) {} + ThrottlerConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_, double max_speed_, double max_burst_) + : ISchedulerConstraint(event_queue_, info_) + , max_speed(max_speed_) + , max_burst(max_burst_) + , last_update(event_queue_->now()) + , tokens(max_burst) + {} + ~ThrottlerConstraint() override { // We should cancel event on destruction to avoid dangling references from event queue event_queue->cancelPostponed(postponed); + + // We need to clear `parent` in child to avoid dangling reference + if (child) + removeChild(child.get()); + } + + const String & getTypeName() const override + { + static String type_name("bandwidth_limit"); + return type_name; } bool equals(ISchedulerNode * other) override @@ -78,10 +94,7 @@ public: if (!request) return {nullptr, false}; - // Request has reference to the first (closest to leaf) `constraint`, which can have `parent_constraint`. - // The former is initialized here dynamically and the latter is initialized once during hierarchy construction. 
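The throttling constraint above is a token bucket. A simplified standalone model of the math in updateBucket(), with hypothetical names and no event queue: tokens refill at max_speed up to max_burst, each dequeued request spends its cost after the refill, and further dequeueing is postponed while the balance is negative:

#include <algorithm>
#include <cassert>

struct TokenBucket
{
    double max_speed;   // tokens per second
    double max_burst;   // bucket capacity in tokens
    double tokens;      // current balance (may go negative after a large request)
    double last_update; // seconds

    void refill(double now)
    {
        tokens = std::min(tokens + max_speed * (now - last_update), max_burst);
        last_update = now;
    }

    // Spend tokens for a dequeued request; returns true if further dequeueing must be postponed
    bool consume(double now, double cost)
    {
        refill(now);
        tokens -= cost; // subtracted after min(), so a long idle period cannot hide the cost
        return tokens < 0.0; // postpone for roughly -tokens / max_speed seconds
    }
};

int main()
{
    TokenBucket bucket{.max_speed = 10.0, .max_burst = 20.0, .tokens = 20.0, .last_update = 0.0};
    assert(!bucket.consume(/*now=*/0.0, /*cost=*/15.0)); // balance 5 left, keep going
    assert(bucket.consume(0.0, 15.0));                   // balance -10: postpone for about a second
    bucket.refill(1.0);                                  // one second later the balance is back to zero
    assert(bucket.tokens >= 0.0);
}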
- if (!request->constraint) - request->constraint = this; + // We don't do `request->addConstraint(this)` because `finishRequest()` is no-op updateBucket(request->cost); @@ -92,12 +105,8 @@ public: return {request, active()}; } - void finishRequest(ResourceRequest * request) override + void finishRequest(ResourceRequest *) override { - // Recursive traverse of parent flow controls in reverse order - if (parent_constraint) - parent_constraint->finishRequest(request); - // NOTE: Token-bucket constraint does not require any action when consumption ends } @@ -108,6 +117,21 @@ public: parent->activateChild(this); } + /// Update limits. + /// Should be called from the scheduler thread because it could lead to activation + void updateConstraints(double new_max_speed, double new_max_burst) + { + event_queue->cancelPostponed(postponed); + postponed = EventQueue::not_postponed; + bool was_active = active(); + updateBucket(0, true); // To apply previous params for duration since `last_update` + max_speed = new_max_speed; + max_burst = new_max_burst; + updateBucket(0, false); // To postpone (if needed) using new params + if (!was_active && active() && parent) + parent->activateChild(this); + } + bool isActive() override { return active(); @@ -150,7 +174,7 @@ private: parent->activateChild(this); } - void updateBucket(ResourceCost use = 0) + void updateBucket(ResourceCost use = 0, bool do_not_postpone = false) { auto now = event_queue->now(); if (max_speed > 0.0) @@ -160,7 +184,7 @@ private: tokens -= use; // This is done outside min() to avoid passing large requests w/o token consumption after long idle period // Postpone activation until there is positive amount of tokens - if (tokens < 0.0) + if (!do_not_postpone && tokens < 0.0) { auto delay_ns = std::chrono::nanoseconds(static_cast(-tokens / max_speed * 1e9)); if (postponed == EventQueue::not_postponed) @@ -184,8 +208,8 @@ private: return satisfied() && child_active; } - const double max_speed{0}; /// in tokens per second - const double max_burst{0}; /// in tokens + double max_speed{0}; /// in tokens per second + double max_burst{0}; /// in tokens EventQueue::TimePoint last_update; UInt64 postponed = EventQueue::not_postponed; diff --git a/src/Common/Scheduler/Nodes/UnifiedSchedulerNode.h b/src/Common/Scheduler/Nodes/UnifiedSchedulerNode.h new file mode 100644 index 00000000000..2c4b7c4f3bc --- /dev/null +++ b/src/Common/Scheduler/Nodes/UnifiedSchedulerNode.h @@ -0,0 +1,606 @@ +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int INVALID_SCHEDULER_NODE; + extern const int LOGICAL_ERROR; +} + +class UnifiedSchedulerNode; +using UnifiedSchedulerNodePtr = std::shared_ptr; + +/* + * Unified scheduler node combines multiple nodes internally to provide all available scheduling policies and constraints. + * Whole scheduling hierarchy could "logically" consist of unified nodes only. Physically intermediate "internal" nodes + * are also present. This approach is easiers for manipulations in runtime than using multiple types of nodes. + * + * Unified node is capable of updating its internal structure based on: + * 1. Number of children (fifo if =0 or fairness/priority if >0). + * 2. Priorities of its children (for subtree structure). + * 3. `SchedulingSettings` associated with unified node (for throttler and semaphore constraints). 
+ * + * In general, unified node has "internal" subtree with the following structure: + * + * THIS <-- UnifiedSchedulerNode object + * | + * THROTTLER <-- [Optional] Throttling scheduling constraint + * | + * [If no children]------ SEMAPHORE <-- [Optional] Semaphore constraint + * | | + * FIFO PRIORITY <-- [Optional] Scheduling policy distinguishing priorities + * .-------' '-------. + * FAIRNESS[p1] ... FAIRNESS[pN] <-- [Optional] Policies for fairness if priorities are equal + * / \ / \ + * CHILD[p1,w1] ... CHILD[p1,wM] CHILD[pN,w1] ... CHILD[pN,wM] <-- Unified children (UnifiedSchedulerNode objects) + * + * NOTE: to distinguish different kinds of children we use the following terms: + * - immediate child: child of unified object (THROTTLER); + * - unified child: leaf of this "internal" subtree (CHILD[p,w]); + * - intermediate node: any child that is not UnifiedSchedulerNode (unified child or `this`) + */ +class UnifiedSchedulerNode final : public ISchedulerNode +{ +private: + /// Helper function for managing a parent of a node + static void reparent(const SchedulerNodePtr & node, const SchedulerNodePtr & new_parent) + { + reparent(node, new_parent.get()); + } + + /// Helper function for managing a parent of a node + static void reparent(const SchedulerNodePtr & node, ISchedulerNode * new_parent) + { + chassert(node); + chassert(new_parent); + if (new_parent == node->parent) + return; + if (node->parent) + node->parent->removeChild(node.get()); + new_parent->attachChild(node); + } + + /// Helper function for managing a parent of a node + static void detach(const SchedulerNodePtr & node) + { + if (node->parent) + node->parent->removeChild(node.get()); + } + + /// A branch of the tree for a specific priority value + struct FairnessBranch + { + SchedulerNodePtr root; /// FairPolicy node is used if multiple children with the same priority are attached + std::unordered_map children; // basename -> child + + bool empty() const { return children.empty(); } + + SchedulerNodePtr getRoot() + { + chassert(!children.empty()); + if (root) + return root; + chassert(children.size() == 1); + return children.begin()->second; + } + + /// Attaches a new child. + /// Returns root node if it has been changed to a different node, otherwise returns null. + [[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child) + { + if (auto [it, inserted] = children.emplace(child->basename, child); !inserted) + throw Exception( + ErrorCodes::INVALID_SCHEDULER_NODE, + "Can't add another child with the same path: {}", + it->second->getPath()); + + if (children.size() == 2) + { + // Insert fair node if we have just added the second child + chassert(!root); + root = std::make_shared(event_queue_, SchedulerNodeInfo{}); + root->info.setPriority(child->info.priority); + root->basename = fmt::format("p{}_fair", child->info.priority.value); + for (auto & [_, node] : children) + reparent(node, root); + return root; // New root has been created + } + else if (children.size() == 1) + return child; // We have added single child so far and it is the new root + else + reparent(child, root); + return {}; // Root is the same + } + + /// Detaches a child. + /// Returns root node if it has been changed to a different node, otherwise returns null. 
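The branches above communicate shape changes to the caller by returning the new subtree root (or null if nothing moved). A toy model of that protocol with hypothetical names, showing the case where a fair node appears once a second child with the same priority is attached; reparenting of existing children is omitted for brevity:

#include <cassert>
#include <map>
#include <memory>
#include <string>

struct ToyNode
{
    std::string name;
    explicit ToyNode(std::string name_) : name(std::move(name_)) {}
};
using ToyNodePtr = std::shared_ptr<ToyNode>;

struct ToyFairnessBranch
{
    ToyNodePtr fair; // created lazily when the second child appears
    std::map<std::string, ToyNodePtr> children;

    // Returns the new root of the branch if it changed, otherwise nullptr
    ToyNodePtr attach(const ToyNodePtr & child)
    {
        children[child->name] = child;
        if (children.size() == 1)
            return child; // a single child is the root itself
        if (children.size() == 2 && !fair)
        {
            fair = std::make_shared<ToyNode>("p0_fair"); // both children are reparented under it
            return fair;
        }
        return nullptr; // third and later children just go under the existing fair node
    }
};

int main()
{
    ToyFairnessBranch branch;
    auto root1 = branch.attach(std::make_shared<ToyNode>("production"));
    assert(root1 && root1->name == "production"); // caller attaches "production" directly
    auto root2 = branch.attach(std::make_shared<ToyNode>("development"));
    assert(root2 && root2->name == "p0_fair");    // caller swaps its immediate child to the fair node
    auto root3 = branch.attach(std::make_shared<ToyNode>("adhoc"));
    assert(root3 == nullptr);                     // shape unchanged
}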
+ /// NOTE: It could also return null if `empty()` after detaching + [[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue *, const UnifiedSchedulerNodePtr & child) + { + auto it = children.find(child->basename); + if (it == children.end()) + return {}; // unknown child + + detach(child); + children.erase(it); + if (children.size() == 1) + { + // Remove fair if the only child has left + chassert(root); + detach(root); + root.reset(); + return children.begin()->second; // The last child is a new root now + } + else if (children.empty()) + return {}; // We have detached the last child + else + return {}; // Root is the same (two or more children have left) + } + }; + + /// Handles all the children nodes with intermediate fair and/or priority nodes + struct ChildrenBranch + { + SchedulerNodePtr root; /// PriorityPolicy node is used if multiple children with different priority are attached + std::unordered_map branches; /// Branches for different priority values + + // Returns true iff there are no unified children attached + bool empty() const { return branches.empty(); } + + SchedulerNodePtr getRoot() + { + chassert(!branches.empty()); + if (root) + return root; + return branches.begin()->second.getRoot(); // There should be exactly one child-branch + } + + /// Attaches a new child. + /// Returns root node if it has been changed to a different node, otherwise returns null. + [[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child) + { + auto [it, new_branch] = branches.try_emplace(child->info.priority); + auto & child_branch = it->second; + auto branch_root = child_branch.attachUnifiedChild(event_queue_, child); + if (!new_branch) + { + if (branch_root) + { + if (root) + reparent(branch_root, root); + else + return branch_root; + } + return {}; + } + else + { + chassert(branch_root); + if (branches.size() == 2) + { + // Insert priority node if we have just added the second branch + chassert(!root); + root = std::make_shared(event_queue_, SchedulerNodeInfo{}); + root->basename = "prio"; + for (auto & [_, branch] : branches) + reparent(branch.getRoot(), root); + return root; // New root has been created + } + else if (branches.size() == 1) + return child; // We have added single child so far and it is the new root + else + reparent(child, root); + return {}; // Root is the same + } + } + + /// Detaches a child. + /// Returns root node if it has been changed to a different node, otherwise returns null. 
+ /// NOTE: It could also return null if `empty()` after detaching + [[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child) + { + auto it = branches.find(child->info.priority); + if (it == branches.end()) + return {}; // unknown child + + auto & child_branch = it->second; + auto branch_root = child_branch.detachUnifiedChild(event_queue_, child); + if (child_branch.empty()) + { + branches.erase(it); + if (branches.size() == 1) + { + // Remove priority node if the only child-branch has left + chassert(root); + detach(root); + root.reset(); + return branches.begin()->second.getRoot(); // The last child-branch is a new root now + } + else if (branches.empty()) + return {}; // We have detached the last child + else + return {}; // Root is the same (two or more children-branches have left) + } + if (branch_root) + { + if (root) + reparent(branch_root, root); + else + return branch_root; + } + return {}; // Root is the same + } + }; + + /// Handles degenerate case of zero children (a fifo queue) or delegate to `ChildrenBranch`. + struct QueueOrChildrenBranch + { + SchedulerNodePtr queue; /// FifoQueue node is used if there are no children + ChildrenBranch branch; /// Used if there is at least one child + + SchedulerNodePtr getRoot() + { + if (queue) + return queue; + else + return branch.getRoot(); + } + + // Should be called after constructor, before any other methods + [[nodiscard]] SchedulerNodePtr initialize(EventQueue * event_queue_) + { + createQueue(event_queue_); + return queue; + } + + /// Attaches a new child. + /// Returns root node if it has been changed to a different node, otherwise returns null. + [[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child) + { + if (queue) + removeQueue(); + return branch.attachUnifiedChild(event_queue_, child); + } + + /// Detaches a child. + /// Returns root node if it has been changed to a different node, otherwise returns null. 
+ [[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child) + { + if (queue) + return {}; // No-op, it already has no children + auto branch_root = branch.detachUnifiedChild(event_queue_, child); + if (branch.empty()) + { + createQueue(event_queue_); + return queue; + } + return branch_root; + } + + private: + void createQueue(EventQueue * event_queue_) + { + queue = std::make_shared(event_queue_, SchedulerNodeInfo{}); + queue->basename = "fifo"; + } + + void removeQueue() + { + // This unified node will not be able to process resource requests any longer + // All remaining resource requests are be aborted on queue destruction + detach(queue); + std::static_pointer_cast(queue)->purgeQueue(); + queue.reset(); + } + }; + + /// Handles all the nodes under this unified node + /// Specifically handles constraints with `QueueOrChildrenBranch` under it + struct ConstraintsBranch + { + SchedulerNodePtr throttler; + SchedulerNodePtr semaphore; + QueueOrChildrenBranch branch; + SchedulingSettings settings; + + // Should be called after constructor, before any other methods + [[nodiscard]] SchedulerNodePtr initialize(EventQueue * event_queue_, const SchedulingSettings & settings_) + { + settings = settings_; + SchedulerNodePtr node = branch.initialize(event_queue_); + if (settings.hasSemaphore()) + { + semaphore = std::make_shared(event_queue_, SchedulerNodeInfo{}, settings.max_requests, settings.max_cost); + semaphore->basename = "semaphore"; + reparent(node, semaphore); + node = semaphore; + } + if (settings.hasThrottler()) + { + throttler = std::make_shared(event_queue_, SchedulerNodeInfo{}, settings.max_speed, settings.max_burst); + throttler->basename = "throttler"; + reparent(node, throttler); + node = throttler; + } + return node; + } + + /// Attaches a new child. + /// Returns root node if it has been changed to a different node, otherwise returns null. + [[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child) + { + if (auto branch_root = branch.attachUnifiedChild(event_queue_, child)) + { + // If both semaphore and throttler exist we should reparent to the farthest from the root + if (semaphore) + reparent(branch_root, semaphore); + else if (throttler) + reparent(branch_root, throttler); + else + return branch_root; + } + return {}; + } + + /// Detaches a child. + /// Returns root node if it has been changed to a different node, otherwise returns null. + [[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child) + { + if (auto branch_root = branch.detachUnifiedChild(event_queue_, child)) + { + if (semaphore) + reparent(branch_root, semaphore); + else if (throttler) + reparent(branch_root, throttler); + else + return branch_root; + } + return {}; + } + + /// Updates constraint-related nodes. + /// Returns root node if it has been changed to a different node, otherwise returns null. 
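The initialize() method above stacks the optional constraints on top of the queue: the queue (or the children subtree) sits at the bottom, then the semaphore, then the throttler, and the topmost node becomes the immediate child of the unified node. A tiny standalone sketch of that wiring with a hypothetical ChainNode type; the type strings mirror the getTypeName() values introduced by this patch:

#include <cassert>
#include <memory>
#include <string>

struct ChainNode
{
    std::string type;
    std::shared_ptr<ChainNode> child; // single immediate child, as in the constraint nodes
};

std::shared_ptr<ChainNode> buildChain(bool has_semaphore, bool has_throttler)
{
    auto node = std::make_shared<ChainNode>(ChainNode{.type = "fifo"});
    if (has_semaphore)
        node = std::make_shared<ChainNode>(ChainNode{.type = "inflight_limit", .child = node});
    if (has_throttler)
        node = std::make_shared<ChainNode>(ChainNode{.type = "bandwidth_limit", .child = node});
    return node; // becomes the immediate child of the unified node
}

int main()
{
    auto chain = buildChain(/*has_semaphore=*/true, /*has_throttler=*/true);
    assert(chain->type == "bandwidth_limit");
    assert(chain->child->type == "inflight_limit");
    assert(chain->child->child->type == "fifo");
}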
+ [[nodiscard]] SchedulerNodePtr updateSchedulingSettings(EventQueue * event_queue_, const SchedulingSettings & new_settings) + { + SchedulerNodePtr node = branch.getRoot(); + + if (!settings.hasSemaphore() && new_settings.hasSemaphore()) // Add semaphore + { + semaphore = std::make_shared(event_queue_, SchedulerNodeInfo{}, new_settings.max_requests, new_settings.max_cost); + semaphore->basename = "semaphore"; + reparent(node, semaphore); + node = semaphore; + } + else if (settings.hasSemaphore() && !new_settings.hasSemaphore()) // Remove semaphore + { + detach(semaphore); + semaphore.reset(); + } + else if (settings.hasSemaphore() && new_settings.hasSemaphore()) // Update semaphore + { + static_cast(*semaphore).updateConstraints(semaphore, new_settings.max_requests, new_settings.max_cost); + node = semaphore; + } + + if (!settings.hasThrottler() && new_settings.hasThrottler()) // Add throttler + { + throttler = std::make_shared(event_queue_, SchedulerNodeInfo{}, new_settings.max_speed, new_settings.max_burst); + throttler->basename = "throttler"; + reparent(node, throttler); + node = throttler; + } + else if (settings.hasThrottler() && !new_settings.hasThrottler()) // Remove throttler + { + detach(throttler); + throttler.reset(); + } + else if (settings.hasThrottler() && new_settings.hasThrottler()) // Update throttler + { + static_cast(*throttler).updateConstraints(new_settings.max_speed, new_settings.max_burst); + node = throttler; + } + + settings = new_settings; + return node; + } + }; + +public: + explicit UnifiedSchedulerNode(EventQueue * event_queue_, const SchedulingSettings & settings) + : ISchedulerNode(event_queue_, SchedulerNodeInfo(settings.weight, settings.priority)) + { + immediate_child = impl.initialize(event_queue, settings); + reparent(immediate_child, this); + } + + ~UnifiedSchedulerNode() override + { + // We need to clear `parent` in child to avoid dangling references + if (immediate_child) + removeChild(immediate_child.get()); + } + + /// Attaches a unified child as a leaf of internal subtree and insert or update all the intermediate nodes + /// NOTE: Do not confuse with `attachChild()` which is used only for immediate children + void attachUnifiedChild(const UnifiedSchedulerNodePtr & child) + { + if (auto new_child = impl.attachUnifiedChild(event_queue, child)) + reparent(new_child, this); + } + + /// Detaches unified child and update all the intermediate nodes. + /// Detached child could be safely attached to another parent. + /// NOTE: Do not confuse with `removeChild()` which is used only for immediate children + void detachUnifiedChild(const UnifiedSchedulerNodePtr & child) + { + if (auto new_child = impl.detachUnifiedChild(event_queue, child)) + reparent(new_child, this); + } + + static bool updateRequiresDetach(const String & old_parent, const String & new_parent, const SchedulingSettings & old_settings, const SchedulingSettings & new_settings) + { + return old_parent != new_parent || old_settings.priority != new_settings.priority; + } + + /// Updates scheduling settings. Set of constraints might change. 
+ /// NOTE: Caller is responsible for detaching and attaching if `updateRequiresDetach` returns true + void updateSchedulingSettings(const SchedulingSettings & new_settings) + { + info.setPriority(new_settings.priority); + info.setWeight(new_settings.weight); + if (auto new_child = impl.updateSchedulingSettings(event_queue, new_settings)) + reparent(new_child, this); + } + + const SchedulingSettings & getSettings() const + { + return impl.settings; + } + + /// Returns the queue to be used for resource requests or `nullptr` if it has unified children + std::shared_ptr getQueue() const + { + return static_pointer_cast(impl.branch.queue); + } + + /// Collects nodes that could be accessed with raw pointers by resource requests (queue and constraints) + /// NOTE: This is a building block for classifier. Note that due to possible movement of a queue, set of constraints + /// for that queue might change in future, and `request->constraints` might reference nodes not in + /// the initial set of nodes returned by `addRawPointerNodes()`. To avoid destruction of such additional nodes + /// classifier must (indirectly) hold nodes return by `addRawPointerNodes()` for all future versions of + /// all unified nodes. Such a version control is done by `IOResourceManager`. + void addRawPointerNodes(std::vector & nodes) + { + // NOTE: `impl.throttler` could be skipped, because ThrottlerConstraint does not call `request->addConstraint()` + if (impl.semaphore) + nodes.push_back(impl.semaphore); + if (impl.branch.queue) + nodes.push_back(impl.branch.queue); + for (auto & [_0, branch] : impl.branch.branch.branches) + { + for (auto & [_1, child] : branch.children) + child->addRawPointerNodes(nodes); + } + } + + bool hasUnifiedChildren() const + { + return impl.branch.queue == nullptr; + } + + /// Introspection. Calls a visitor for self and every internal node. Do not recurse into unified children. 
+ void forEachSchedulerNode(std::function visitor) + { + visitor(this); + if (impl.throttler) + visitor(impl.throttler.get()); + if (impl.semaphore) + visitor(impl.semaphore.get()); + if (impl.branch.queue) + visitor(impl.branch.queue.get()); + if (impl.branch.branch.root) // priority + visitor(impl.branch.branch.root.get()); + for (auto & [_, branch] : impl.branch.branch.branches) + { + if (branch.root) // fairness + visitor(branch.root.get()); + } + } + +protected: // Hide all the ISchedulerNode interface methods as an implementation details + const String & getTypeName() const override + { + static String type_name("unified"); + return type_name; + } + + bool equals(ISchedulerNode *) override + { + throw Exception(ErrorCodes::LOGICAL_ERROR, "UnifiedSchedulerNode should not be used with CustomResourceManager"); + } + + /// Attaches an immediate child (used through `reparent()`) + void attachChild(const SchedulerNodePtr & child_) override + { + immediate_child = child_; + immediate_child->setParent(this); + + // Activate if required + if (immediate_child->isActive()) + activateChild(immediate_child.get()); + } + + /// Removes an immediate child (used through `reparent()`) + void removeChild(ISchedulerNode * child) override + { + if (immediate_child.get() == child) + { + child_active = false; // deactivate + immediate_child->setParent(nullptr); // detach + immediate_child.reset(); + } + } + + ISchedulerNode * getChild(const String & child_name) override + { + if (immediate_child->basename == child_name) + return immediate_child.get(); + else + return nullptr; + } + + std::pair dequeueRequest() override + { + auto [request, child_now_active] = immediate_child->dequeueRequest(); + if (!request) + return {nullptr, false}; + + child_active = child_now_active; + if (!child_active) + busy_periods++; + incrementDequeued(request->cost); + return {request, child_active}; + } + + bool isActive() override + { + return child_active; + } + + /// Shows number of immediate active children (for introspection) + size_t activeChildren() override + { + return child_active; + } + + /// Activate an immediate child + void activateChild(ISchedulerNode * child) override + { + if (child == immediate_child.get()) + if (!std::exchange(child_active, true) && parent) + parent->activateChild(this); + } + +private: + ConstraintsBranch impl; + SchedulerNodePtr immediate_child; // An immediate child (actually the root of the whole subtree) + bool child_active = false; +}; + +} diff --git a/src/Common/Scheduler/Nodes/registerResourceManagers.cpp b/src/Common/Scheduler/Nodes/registerResourceManagers.cpp deleted file mode 100644 index c5d5ba5b981..00000000000 --- a/src/Common/Scheduler/Nodes/registerResourceManagers.cpp +++ /dev/null @@ -1,15 +0,0 @@ -#include -#include - -namespace DB -{ - -void registerDynamicResourceManager(ResourceManagerFactory &); - -void registerResourceManagers() -{ - auto & factory = ResourceManagerFactory::instance(); - registerDynamicResourceManager(factory); -} - -} diff --git a/src/Common/Scheduler/Nodes/registerResourceManagers.h b/src/Common/Scheduler/Nodes/registerResourceManagers.h deleted file mode 100644 index 243b25a9587..00000000000 --- a/src/Common/Scheduler/Nodes/registerResourceManagers.h +++ /dev/null @@ -1,8 +0,0 @@ -#pragma once - -namespace DB -{ - -void registerResourceManagers(); - -} diff --git a/src/Common/Scheduler/Nodes/tests/ResourceTest.h b/src/Common/Scheduler/Nodes/tests/ResourceTest.h index c787a686a09..927f87d5aa6 100644 --- 
a/src/Common/Scheduler/Nodes/tests/ResourceTest.h +++ b/src/Common/Scheduler/Nodes/tests/ResourceTest.h @@ -1,5 +1,8 @@ #pragma once +#include + +#include #include #include #include @@ -7,26 +10,35 @@ #include #include #include +#include #include -#include #include #include #include +#include +#include +#include #include #include #include #include +#include namespace DB { +namespace ErrorCodes +{ + extern const int RESOURCE_ACCESS_DENIED; +} + struct ResourceTestBase { ResourceTestBase() { - [[maybe_unused]] static bool typesRegistered = [] { registerSchedulerNodes(); registerResourceManagers(); return true; }(); + [[maybe_unused]] static bool typesRegistered = [] { registerSchedulerNodes(); return true; }(); } template @@ -37,10 +49,16 @@ struct ResourceTestBase Poco::AutoPtr config{new Poco::Util::XMLConfiguration(stream)}; String config_prefix = "node"; + return add(event_queue, root_node, path, std::ref(*config), config_prefix); + } + + template + static TClass * add(EventQueue * event_queue, SchedulerNodePtr & root_node, const String & path, Args... args) + { if (path == "/") { EXPECT_TRUE(root_node.get() == nullptr); - root_node.reset(new TClass(event_queue, *config, config_prefix)); + root_node.reset(new TClass(event_queue, std::forward(args)...)); return static_cast(root_node.get()); } @@ -65,73 +83,114 @@ struct ResourceTestBase } EXPECT_TRUE(!child_name.empty()); // wrong path - SchedulerNodePtr node = std::make_shared(event_queue, *config, config_prefix); + SchedulerNodePtr node = std::make_shared(event_queue, std::forward(args)...); node->basename = child_name; parent->attachChild(node); return static_cast(node.get()); } }; - -struct ConstraintTest : public SemaphoreConstraint -{ - explicit ConstraintTest(EventQueue * event_queue_, const Poco::Util::AbstractConfiguration & config = emptyConfig(), const String & config_prefix = {}) - : SemaphoreConstraint(event_queue_, config, config_prefix) - {} - - std::pair dequeueRequest() override - { - auto [request, active] = SemaphoreConstraint::dequeueRequest(); - if (request) - { - std::unique_lock lock(mutex); - requests.insert(request); - } - return {request, active}; - } - - void finishRequest(ResourceRequest * request) override - { - { - std::unique_lock lock(mutex); - requests.erase(request); - } - SemaphoreConstraint::finishRequest(request); - } - - std::mutex mutex; - std::set requests; -}; - class ResourceTestClass : public ResourceTestBase { struct Request : public ResourceRequest { + ResourceTestClass * test; String name; - Request(ResourceCost cost_, const String & name_) + Request(ResourceTestClass * test_, ResourceCost cost_, const String & name_) : ResourceRequest(cost_) + , test(test_) , name(name_) {} void execute() override { } + + void failed(const std::exception_ptr &) override + { + test->failed_cost += cost; + delete this; + } }; public: + ~ResourceTestClass() + { + if (root_node) + dequeue(); // Just to avoid any leaks of `Request` object + } + template void add(const String & path, const String & xml = {}) { ResourceTestBase::add(&event_queue, root_node, path, xml); } + template + void addCustom(const String & path, Args... 
args) + { + ResourceTestBase::add(&event_queue, root_node, path, std::forward(args)...); + } + + UnifiedSchedulerNodePtr createUnifiedNode(const String & basename, const SchedulingSettings & settings = {}) + { + return createUnifiedNode(basename, {}, settings); + } + + UnifiedSchedulerNodePtr createUnifiedNode(const String & basename, const UnifiedSchedulerNodePtr & parent, const SchedulingSettings & settings = {}) + { + auto node = std::make_shared(&event_queue, settings); + node->basename = basename; + if (parent) + { + parent->attachUnifiedChild(node); + } + else + { + EXPECT_TRUE(root_node.get() == nullptr); + root_node = node; + } + return node; + } + + // Updates the parent and/or scheduling settings for a specidfied `node`. + // Unit test implementation must make sure that all needed queues and constraints are not going to be destroyed. + // Normally it is the responsibility of IOResourceManager, but we do not use it here, so manual version control is required. + // (see IOResourceManager::Resource::updateCurrentVersion() fo details) + void updateUnifiedNode(const UnifiedSchedulerNodePtr & node, const UnifiedSchedulerNodePtr & old_parent, const UnifiedSchedulerNodePtr & new_parent, const SchedulingSettings & new_settings) + { + EXPECT_TRUE((old_parent && new_parent) || (!old_parent && !new_parent)); // changing root node is not supported + bool detached = false; + if (UnifiedSchedulerNode::updateRequiresDetach( + old_parent ? old_parent->basename : "", + new_parent ? new_parent->basename : "", + node->getSettings(), + new_settings)) + { + if (old_parent) + old_parent->detachUnifiedChild(node); + detached = true; + } + + node->updateSchedulingSettings(new_settings); + + if (detached && new_parent) + new_parent->attachUnifiedChild(node); + } + + + void enqueue(const UnifiedSchedulerNodePtr & node, const std::vector & costs) + { + enqueueImpl(node->getQueue().get(), costs, node->basename); + } + void enqueue(const String & path, const std::vector & costs) { ASSERT_TRUE(root_node.get() != nullptr); // root should be initialized first ISchedulerNode * node = root_node.get(); size_t pos = 1; - while (pos < path.length()) + while (node && pos < path.length()) { size_t slash = path.find('/', pos); if (slash != String::npos) @@ -146,13 +205,17 @@ public: pos = String::npos; } } - ISchedulerQueue * queue = dynamic_cast(node); - ASSERT_TRUE(queue != nullptr); // not a queue + if (node) + enqueueImpl(dynamic_cast(node), costs); + } + void enqueueImpl(ISchedulerQueue * queue, const std::vector & costs, const String & name = {}) + { + ASSERT_TRUE(queue != nullptr); // not a queue + if (!queue) + return; // to make clang-analyzer-core.NonNullParamChecker happy for (ResourceCost cost : costs) - { - queue->enqueueRequest(new Request(cost, queue->basename)); - } + queue->enqueueRequest(new Request(this, cost, name.empty() ? 
queue->basename : name)); processEvents(); // to activate queues } @@ -208,6 +271,12 @@ public: consumed_cost[name] -= value; } + void failed(ResourceCost value) + { + EXPECT_EQ(failed_cost, value); + failed_cost -= value; + } + void processEvents() { while (event_queue.tryProcess()) {} @@ -217,8 +286,11 @@ private: EventQueue event_queue; SchedulerNodePtr root_node; std::unordered_map consumed_cost; + ResourceCost failed_cost = 0; }; +enum EnqueueOnlyEnum { EnqueueOnly }; + template struct ResourceTestManager : public ResourceTestBase { @@ -230,16 +302,49 @@ struct ResourceTestManager : public ResourceTestBase struct Guard : public ResourceGuard { ResourceTestManager & t; + ResourceCost cost; - Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost) - : ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost, Lock::Defer) + /// Works like regular ResourceGuard, ready for consumption after constructor + Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost_) + : ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost_, Lock::Defer) , t(t_) + , cost(cost_) { t.onEnqueue(link); + waitExecute(); + } + + /// Just enqueue resource request, do not block (needed for tests to sync). Call `waitExecuted()` afterwards + Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost_, EnqueueOnlyEnum) + : ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost_, Lock::Defer) + , t(t_) + , cost(cost_) + { + t.onEnqueue(link); + } + + /// Waits for ResourceRequest::execute() to be called for enqueued request + void waitExecute() + { lock(); t.onExecute(link); consume(cost); } + + /// Waits for ResourceRequest::failure() to be called for enqueued request + void waitFailed(const String & pattern) + { + try + { + lock(); + FAIL(); + } + catch (Exception & e) + { + ASSERT_EQ(e.code(), ErrorCodes::RESOURCE_ACCESS_DENIED); + ASSERT_TRUE(e.message().contains(pattern)); + } + } }; struct TItem @@ -264,10 +369,24 @@ struct ResourceTestManager : public ResourceTestBase , busy_period(thread_count) {} + enum DoNotInitManagerEnum { DoNotInitManager }; + + explicit ResourceTestManager(size_t thread_count, DoNotInitManagerEnum) + : busy_period(thread_count) + {} + ~ResourceTestManager() + { + wait(); + } + + void wait() { for (auto & thread : threads) - thread.join(); + { + if (thread.joinable()) + thread.join(); + } } void update(const String & xml) diff --git a/src/Common/Scheduler/Nodes/tests/gtest_dynamic_resource_manager.cpp b/src/Common/Scheduler/Nodes/tests/gtest_custom_resource_manager.cpp similarity index 82% rename from src/Common/Scheduler/Nodes/tests/gtest_dynamic_resource_manager.cpp rename to src/Common/Scheduler/Nodes/tests/gtest_custom_resource_manager.cpp index 3328196cced..37432128606 100644 --- a/src/Common/Scheduler/Nodes/tests/gtest_dynamic_resource_manager.cpp +++ b/src/Common/Scheduler/Nodes/tests/gtest_custom_resource_manager.cpp @@ -2,15 +2,15 @@ #include -#include +#include #include using namespace DB; -using ResourceTest = ResourceTestManager; +using ResourceTest = ResourceTestManager; using TestGuard = ResourceTest::Guard; -TEST(SchedulerDynamicResourceManager, Smoke) +TEST(SchedulerCustomResourceManager, Smoke) { ResourceTest t; @@ -31,25 +31,25 @@ TEST(SchedulerDynamicResourceManager, Smoke) )CONFIG"); - ClassifierPtr cA = t.manager->acquire("A"); - ClassifierPtr cB = t.manager->acquire("B"); + ClassifierPtr c_a = t.manager->acquire("A"); + ClassifierPtr c_b = t.manager->acquire("B"); for (int i = 0; i < 10; i++) { - 
ResourceGuard gA(ResourceGuard::Metrics::getIOWrite(), cA->get("res1"), 1, ResourceGuard::Lock::Defer); - gA.lock(); - gA.consume(1); - gA.unlock(); + ResourceGuard g_a(ResourceGuard::Metrics::getIOWrite(), c_a->get("res1"), 1, ResourceGuard::Lock::Defer); + g_a.lock(); + g_a.consume(1); + g_a.unlock(); - ResourceGuard gB(ResourceGuard::Metrics::getIOWrite(), cB->get("res1")); - gB.unlock(); + ResourceGuard g_b(ResourceGuard::Metrics::getIOWrite(), c_b->get("res1")); + g_b.unlock(); - ResourceGuard gC(ResourceGuard::Metrics::getIORead(), cB->get("res1")); - gB.consume(2); + ResourceGuard g_c(ResourceGuard::Metrics::getIORead(), c_b->get("res1")); + g_b.consume(2); } } -TEST(SchedulerDynamicResourceManager, Fairness) +TEST(SchedulerCustomResourceManager, Fairness) { // Total cost for A and B cannot differ for more than 1 (every request has cost equal to 1). // Requests from A use `value = 1` and from B `value = -1` is used. diff --git a/src/Common/Scheduler/Nodes/tests/gtest_event_queue.cpp b/src/Common/Scheduler/Nodes/tests/gtest_event_queue.cpp index 07798f78080..9989215ba7b 100644 --- a/src/Common/Scheduler/Nodes/tests/gtest_event_queue.cpp +++ b/src/Common/Scheduler/Nodes/tests/gtest_event_queue.cpp @@ -13,6 +13,12 @@ public: , log(log_) {} + const String & getTypeName() const override + { + static String type_name("fake"); + return type_name; + } + void attachChild(const SchedulerNodePtr & child) override { log += " +" + child->basename; diff --git a/src/Common/Scheduler/Nodes/tests/gtest_io_resource_manager.cpp b/src/Common/Scheduler/Nodes/tests/gtest_io_resource_manager.cpp new file mode 100644 index 00000000000..2bac69185d3 --- /dev/null +++ b/src/Common/Scheduler/Nodes/tests/gtest_io_resource_manager.cpp @@ -0,0 +1,335 @@ +#include + +#include +#include + +#include +#include +#include + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace DB; + +class WorkloadEntityTestStorage : public WorkloadEntityStorageBase +{ +public: + WorkloadEntityTestStorage() + : WorkloadEntityStorageBase(Context::getGlobalContextInstance()) + {} + + void loadEntities() override {} + + void executeQuery(const String & query) + { + ParserCreateWorkloadQuery create_workload_p; + ParserDropWorkloadQuery drop_workload_p; + ParserCreateResourceQuery create_resource_p; + ParserDropResourceQuery drop_resource_p; + + auto parse = [&] (IParser & parser) + { + String error; + const char * end = query.data(); + return tryParseQuery( + parser, + end, + query.data() + query.size(), + error, + false, + "", + false, + 0, + DBMS_DEFAULT_MAX_PARSER_DEPTH, + DBMS_DEFAULT_MAX_PARSER_BACKTRACKS, + true); + }; + + if (ASTPtr create_workload = parse(create_workload_p)) + { + auto & parsed = create_workload->as(); + auto workload_name = parsed.getWorkloadName(); + bool throw_if_exists = !parsed.if_not_exists && !parsed.or_replace; + bool replace_if_exists = parsed.or_replace; + + storeEntity( + nullptr, + WorkloadEntityType::Workload, + workload_name, + create_workload, + throw_if_exists, + replace_if_exists, + {}); + } + else if (ASTPtr create_resource = parse(create_resource_p)) + { + auto & parsed = create_resource->as(); + auto resource_name = parsed.getResourceName(); + bool throw_if_exists = !parsed.if_not_exists && !parsed.or_replace; + bool replace_if_exists = parsed.or_replace; + + storeEntity( + nullptr, + WorkloadEntityType::Resource, + resource_name, + create_resource, + throw_if_exists, + replace_if_exists, + {}); + } + else if (ASTPtr 
drop_workload = parse(drop_workload_p)) + { + auto & parsed = drop_workload->as(); + bool throw_if_not_exists = !parsed.if_exists; + removeEntity( + nullptr, + WorkloadEntityType::Workload, + parsed.workload_name, + throw_if_not_exists); + } + else if (ASTPtr drop_resource = parse(drop_resource_p)) + { + auto & parsed = drop_resource->as(); + bool throw_if_not_exists = !parsed.if_exists; + removeEntity( + nullptr, + WorkloadEntityType::Resource, + parsed.resource_name, + throw_if_not_exists); + } + else + throw Exception(ErrorCodes::LOGICAL_ERROR, "Invalid query in WorkloadEntityTestStorage: {}", query); + } + +private: + WorkloadEntityStorageBase::OperationResult storeEntityImpl( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool throw_if_exists, + bool replace_if_exists, + const Settings & settings) override + { + UNUSED(current_context, entity_type, entity_name, create_entity_query, throw_if_exists, replace_if_exists, settings); + return OperationResult::Ok; + } + + WorkloadEntityStorageBase::OperationResult removeEntityImpl( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + bool throw_if_not_exists) override + { + UNUSED(current_context, entity_type, entity_name, throw_if_not_exists); + return OperationResult::Ok; + } +}; + +struct ResourceTest : ResourceTestManager +{ + WorkloadEntityTestStorage storage; + + explicit ResourceTest(size_t thread_count = 1) + : ResourceTestManager(thread_count, DoNotInitManager) + { + manager = std::make_shared(storage); + } + + void query(const String & query_str) + { + storage.executeQuery(query_str); + } + + template + void async(const String & workload, Func func) + { + threads.emplace_back([=, this, func2 = std::move(func)] + { + ClassifierPtr classifier = manager->acquire(workload); + func2(classifier); + }); + } + + template + void async(const String & workload, const String & resource, Func func) + { + threads.emplace_back([=, this, func2 = std::move(func)] + { + ClassifierPtr classifier = manager->acquire(workload); + ResourceLink link = classifier->get(resource); + func2(link); + }); + } +}; + +using TestGuard = ResourceTest::Guard; + +TEST(SchedulerIOResourceManager, Smoke) +{ + ResourceTest t; + + t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)"); + t.query("CREATE WORKLOAD all SETTINGS max_requests = 10"); + t.query("CREATE WORKLOAD A in all"); + t.query("CREATE WORKLOAD B in all SETTINGS weight = 3"); + + ClassifierPtr c_a = t.manager->acquire("A"); + ClassifierPtr c_b = t.manager->acquire("B"); + + for (int i = 0; i < 10; i++) + { + ResourceGuard g_a(ResourceGuard::Metrics::getIOWrite(), c_a->get("res1"), 1, ResourceGuard::Lock::Defer); + g_a.lock(); + g_a.consume(1); + g_a.unlock(); + + ResourceGuard g_b(ResourceGuard::Metrics::getIOWrite(), c_b->get("res1")); + g_b.unlock(); + + ResourceGuard g_c(ResourceGuard::Metrics::getIORead(), c_b->get("res1")); + g_b.consume(2); + } +} + +TEST(SchedulerIOResourceManager, Fairness) +{ + // Total cost for A and B cannot differ for more than 1 (every request has cost equal to 1). + // Requests from A use `value = 1` and from B `value = -1` is used. 
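The comment above states the invariant this test checks, and it can be reasoned about in isolation: with `max_requests = 1` only one request is in flight at a time, and a fair scheduler with equal weights keeps admitting from the queue that has completed less total cost, so the difference between completed A and B costs never exceeds one request. A toy single-threaded model of that argument (not the real scheduler, just the bookkeeping):

```cpp
// Toy model of the fairness bound asserted below: always admit from the queue
// with the smaller completed cost; the difference stays within one request.
#include <cassert>
#include <cstdlib>

int main()
{
    const int cost = 1; // every request has cost 1, as in the test
    int completed_a = 0;
    int completed_b = 0;
    for (int i = 0; i < 200; ++i)
    {
        if (completed_a <= completed_b)
            completed_a += cost; // admit one request from A
        else
            completed_b += cost; // admit one request from B
        assert(std::abs(completed_a - completed_b) <= cost);
    }
    return 0;
}
```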
+ std::atomic unfairness = 0; + auto fairness_diff = [&] (Int64 value) + { + Int64 cur_unfairness = unfairness.fetch_add(value, std::memory_order_relaxed) + value; + EXPECT_NEAR(cur_unfairness, 0, 1); + }; + + constexpr size_t threads_per_queue = 2; + int requests_per_thread = 100; + ResourceTest t(2 * threads_per_queue + 1); + + t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)"); + t.query("CREATE WORKLOAD all SETTINGS max_requests = 1"); + t.query("CREATE WORKLOAD A IN all"); + t.query("CREATE WORKLOAD B IN all"); + t.query("CREATE WORKLOAD leader IN all"); + + for (int thread = 0; thread < threads_per_queue; thread++) + { + t.threads.emplace_back([&] + { + ClassifierPtr c = t.manager->acquire("A"); + ResourceLink link = c->get("res1"); + t.startBusyPeriod(link, 1, requests_per_thread); + for (int request = 0; request < requests_per_thread; request++) + { + TestGuard g(t, link, 1); + fairness_diff(1); + } + }); + } + + for (int thread = 0; thread < threads_per_queue; thread++) + { + t.threads.emplace_back([&] + { + ClassifierPtr c = t.manager->acquire("B"); + ResourceLink link = c->get("res1"); + t.startBusyPeriod(link, 1, requests_per_thread); + for (int request = 0; request < requests_per_thread; request++) + { + TestGuard g(t, link, 1); + fairness_diff(-1); + } + }); + } + + ClassifierPtr c = t.manager->acquire("leader"); + ResourceLink link = c->get("res1"); + t.blockResource(link); + + t.wait(); // Wait for threads to finish before destructing locals +} + +TEST(SchedulerIOResourceManager, DropNotEmptyQueue) +{ + ResourceTest t; + + t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)"); + t.query("CREATE WORKLOAD all SETTINGS max_requests = 1"); + t.query("CREATE WORKLOAD intermediate IN all"); + + std::barrier sync_before_enqueue(2); + std::barrier sync_before_drop(3); + std::barrier sync_after_drop(2); + t.async("intermediate", "res1", [&] (ResourceLink link) + { + TestGuard g(t, link, 1); + sync_before_enqueue.arrive_and_wait(); + sync_before_drop.arrive_and_wait(); // 1st resource request is consuming + sync_after_drop.arrive_and_wait(); // 1st resource request is still consuming + }); + + sync_before_enqueue.arrive_and_wait(); // to maintain correct order of resource requests + + t.async("intermediate", "res1", [&] (ResourceLink link) + { + TestGuard g(t, link, 1, EnqueueOnly); + sync_before_drop.arrive_and_wait(); // 2nd resource request is enqueued + g.waitFailed("is about to be destructed"); + }); + + sync_before_drop.arrive_and_wait(); // main thread triggers FifoQueue destruction by adding a unified child + t.query("CREATE WORKLOAD leaf IN intermediate"); + sync_after_drop.arrive_and_wait(); + + t.wait(); // Wait for threads to finish before destructing locals +} + +TEST(SchedulerIOResourceManager, DropNotEmptyQueueLong) +{ + ResourceTest t; + + t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)"); + t.query("CREATE WORKLOAD all SETTINGS max_requests = 1"); + t.query("CREATE WORKLOAD intermediate IN all"); + + static constexpr int queue_size = 100; + std::barrier sync_before_enqueue(2); + std::barrier sync_before_drop(2 + queue_size); + std::barrier sync_after_drop(2); + t.async("intermediate", "res1", [&] (ResourceLink link) + { + TestGuard g(t, link, 1); + sync_before_enqueue.arrive_and_wait(); + sync_before_drop.arrive_and_wait(); // 1st resource request is consuming + sync_after_drop.arrive_and_wait(); // 1st resource request is still consuming + }); + + sync_before_enqueue.arrive_and_wait(); // to maintain correct order of 
resource requests + + for (int i = 0; i < queue_size; i++) + { + t.async("intermediate", "res1", [&] (ResourceLink link) + { + TestGuard g(t, link, 1, EnqueueOnly); + sync_before_drop.arrive_and_wait(); // many resource requests are enqueued + g.waitFailed("is about to be destructed"); + }); + } + + sync_before_drop.arrive_and_wait(); // main thread triggers FifoQueue destruction by adding a unified child + t.query("CREATE WORKLOAD leaf IN intermediate"); + sync_after_drop.arrive_and_wait(); + + t.wait(); // Wait for threads to finish before destructing locals +} diff --git a/src/Common/Scheduler/Nodes/tests/gtest_resource_class_fair.cpp b/src/Common/Scheduler/Nodes/tests/gtest_resource_class_fair.cpp index 16cce309c2a..d859693eba5 100644 --- a/src/Common/Scheduler/Nodes/tests/gtest_resource_class_fair.cpp +++ b/src/Common/Scheduler/Nodes/tests/gtest_resource_class_fair.cpp @@ -8,18 +8,17 @@ using namespace DB; using ResourceTest = ResourceTestClass; -/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678 - -TEST(DISABLED_SchedulerFairPolicy, Factory) +TEST(SchedulerFairPolicy, Factory) { ResourceTest t; Poco::AutoPtr cfg = new Poco::Util::XMLConfiguration(); - SchedulerNodePtr fair = SchedulerNodeFactory::instance().get("fair", /* event_queue = */ nullptr, *cfg, ""); + EventQueue event_queue; + SchedulerNodePtr fair = SchedulerNodeFactory::instance().get("fair", &event_queue, *cfg, ""); EXPECT_TRUE(dynamic_cast(fair.get()) != nullptr); } -TEST(DISABLED_SchedulerFairPolicy, FairnessWeights) +TEST(SchedulerFairPolicy, FairnessWeights) { ResourceTest t; @@ -43,7 +42,7 @@ TEST(DISABLED_SchedulerFairPolicy, FairnessWeights) t.consumed("B", 20); } -TEST(DISABLED_SchedulerFairPolicy, Activation) +TEST(SchedulerFairPolicy, Activation) { ResourceTest t; @@ -79,7 +78,7 @@ TEST(DISABLED_SchedulerFairPolicy, Activation) t.consumed("B", 10); } -TEST(DISABLED_SchedulerFairPolicy, FairnessMaxMin) +TEST(SchedulerFairPolicy, FairnessMaxMin) { ResourceTest t; @@ -103,7 +102,7 @@ TEST(DISABLED_SchedulerFairPolicy, FairnessMaxMin) t.consumed("A", 20); } -TEST(DISABLED_SchedulerFairPolicy, HierarchicalFairness) +TEST(SchedulerFairPolicy, HierarchicalFairness) { ResourceTest t; diff --git a/src/Common/Scheduler/Nodes/tests/gtest_resource_class_priority.cpp b/src/Common/Scheduler/Nodes/tests/gtest_resource_class_priority.cpp index d3d38aae048..ab248209635 100644 --- a/src/Common/Scheduler/Nodes/tests/gtest_resource_class_priority.cpp +++ b/src/Common/Scheduler/Nodes/tests/gtest_resource_class_priority.cpp @@ -8,18 +8,17 @@ using namespace DB; using ResourceTest = ResourceTestClass; -/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678 - -TEST(DISABLED_SchedulerPriorityPolicy, Factory) +TEST(SchedulerPriorityPolicy, Factory) { ResourceTest t; Poco::AutoPtr cfg = new Poco::Util::XMLConfiguration(); - SchedulerNodePtr prio = SchedulerNodeFactory::instance().get("priority", /* event_queue = */ nullptr, *cfg, ""); + EventQueue event_queue; + SchedulerNodePtr prio = SchedulerNodeFactory::instance().get("priority", &event_queue, *cfg, ""); EXPECT_TRUE(dynamic_cast(prio.get()) != nullptr); } -TEST(DISABLED_SchedulerPriorityPolicy, Priorities) +TEST(SchedulerPriorityPolicy, Priorities) { ResourceTest t; @@ -53,7 +52,7 @@ TEST(DISABLED_SchedulerPriorityPolicy, Priorities) t.consumed("C", 0); } -TEST(DISABLED_SchedulerPriorityPolicy, Activation) +TEST(SchedulerPriorityPolicy, Activation) { ResourceTest 
t; @@ -94,7 +93,7 @@ TEST(DISABLED_SchedulerPriorityPolicy, Activation) t.consumed("C", 0); } -TEST(DISABLED_SchedulerPriorityPolicy, SinglePriority) +TEST(SchedulerPriorityPolicy, SinglePriority) { ResourceTest t; diff --git a/src/Common/Scheduler/Nodes/tests/gtest_resource_scheduler.cpp b/src/Common/Scheduler/Nodes/tests/gtest_resource_scheduler.cpp index ddfe0cfbc6f..85d35fab0a6 100644 --- a/src/Common/Scheduler/Nodes/tests/gtest_resource_scheduler.cpp +++ b/src/Common/Scheduler/Nodes/tests/gtest_resource_scheduler.cpp @@ -1,5 +1,6 @@ #include +#include #include #include @@ -101,6 +102,11 @@ struct MyRequest : public ResourceRequest if (on_execute) on_execute(); } + + void failed(const std::exception_ptr &) override + { + FAIL(); + } }; TEST(SchedulerRoot, Smoke) @@ -108,14 +114,14 @@ TEST(SchedulerRoot, Smoke) ResourceTest t; ResourceHolder r1(t); - auto * fc1 = r1.add("/", "1"); + auto * fc1 = r1.add("/", "1"); r1.add("/prio"); auto a = r1.addQueue("/prio/A", "1"); auto b = r1.addQueue("/prio/B", "2"); r1.registerResource(); ResourceHolder r2(t); - auto * fc2 = r2.add("/", "1"); + auto * fc2 = r2.add("/", "1"); r2.add("/prio"); auto c = r2.addQueue("/prio/C", "-1"); auto d = r2.addQueue("/prio/D", "-2"); @@ -123,25 +129,25 @@ TEST(SchedulerRoot, Smoke) { ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), a); - EXPECT_TRUE(fc1->requests.contains(&rg.request)); + EXPECT_TRUE(fc1->getInflights().first == 1); rg.consume(1); } { ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), b); - EXPECT_TRUE(fc1->requests.contains(&rg.request)); + EXPECT_TRUE(fc1->getInflights().first == 1); rg.consume(1); } { ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), c); - EXPECT_TRUE(fc2->requests.contains(&rg.request)); + EXPECT_TRUE(fc2->getInflights().first == 1); rg.consume(1); } { ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), d); - EXPECT_TRUE(fc2->requests.contains(&rg.request)); + EXPECT_TRUE(fc2->getInflights().first == 1); rg.consume(1); } } @@ -151,7 +157,7 @@ TEST(SchedulerRoot, Budget) ResourceTest t; ResourceHolder r1(t); - r1.add("/", "1"); + r1.add("/", "1"); r1.add("/prio"); auto a = r1.addQueue("/prio/A", ""); r1.registerResource(); @@ -176,7 +182,7 @@ TEST(SchedulerRoot, Cancel) ResourceTest t; ResourceHolder r1(t); - auto * fc1 = r1.add("/", "1"); + auto * fc1 = r1.add("/", "1"); r1.add("/prio"); auto a = r1.addQueue("/prio/A", "1"); auto b = r1.addQueue("/prio/B", "2"); @@ -189,7 +195,7 @@ TEST(SchedulerRoot, Cancel) MyRequest request(1,[&] { sync.arrive_and_wait(); // (A) - EXPECT_TRUE(fc1->requests.contains(&request)); + EXPECT_TRUE(fc1->getInflights().first == 1); sync.arrive_and_wait(); // (B) request.finish(); destruct_sync.arrive_and_wait(); // (C) @@ -214,5 +220,5 @@ TEST(SchedulerRoot, Cancel) consumer1.join(); consumer2.join(); - EXPECT_TRUE(fc1->requests.empty()); + EXPECT_TRUE(fc1->getInflights().first == 0); } diff --git a/src/Common/Scheduler/Nodes/tests/gtest_throttler_constraint.cpp b/src/Common/Scheduler/Nodes/tests/gtest_throttler_constraint.cpp index 2bc24cdb292..585bb738b27 100644 --- a/src/Common/Scheduler/Nodes/tests/gtest_throttler_constraint.cpp +++ b/src/Common/Scheduler/Nodes/tests/gtest_throttler_constraint.cpp @@ -10,9 +10,7 @@ using namespace DB; using ResourceTest = ResourceTestClass; -/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678 - -TEST(DISABLED_SchedulerThrottlerConstraint, LeakyBucketConstraint) +TEST(SchedulerThrottlerConstraint, LeakyBucketConstraint) { ResourceTest 
t; EventQueue::TimePoint start = std::chrono::system_clock::now(); @@ -42,7 +40,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, LeakyBucketConstraint) t.consumed("A", 10); } -TEST(DISABLED_SchedulerThrottlerConstraint, Unlimited) +TEST(SchedulerThrottlerConstraint, Unlimited) { ResourceTest t; EventQueue::TimePoint start = std::chrono::system_clock::now(); @@ -59,7 +57,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, Unlimited) } } -TEST(DISABLED_SchedulerThrottlerConstraint, Pacing) +TEST(SchedulerThrottlerConstraint, Pacing) { ResourceTest t; EventQueue::TimePoint start = std::chrono::system_clock::now(); @@ -79,7 +77,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, Pacing) } } -TEST(DISABLED_SchedulerThrottlerConstraint, BucketFilling) +TEST(SchedulerThrottlerConstraint, BucketFilling) { ResourceTest t; EventQueue::TimePoint start = std::chrono::system_clock::now(); @@ -113,7 +111,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, BucketFilling) t.consumed("A", 3); } -TEST(DISABLED_SchedulerThrottlerConstraint, PeekAndAvgLimits) +TEST(SchedulerThrottlerConstraint, PeekAndAvgLimits) { ResourceTest t; EventQueue::TimePoint start = std::chrono::system_clock::now(); @@ -141,7 +139,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, PeekAndAvgLimits) } } -TEST(DISABLED_SchedulerThrottlerConstraint, ThrottlerAndFairness) +TEST(SchedulerThrottlerConstraint, ThrottlerAndFairness) { ResourceTest t; EventQueue::TimePoint start = std::chrono::system_clock::now(); @@ -160,22 +158,22 @@ TEST(DISABLED_SchedulerThrottlerConstraint, ThrottlerAndFairness) t.enqueue("/fair/B", {req_cost}); } - double shareA = 0.1; - double shareB = 0.9; + double share_a = 0.1; + double share_b = 0.9; // Bandwidth-latency coupling due to fairness: worst latency is inversely proportional to share - auto max_latencyA = static_cast(req_cost * (1.0 + 1.0 / shareA)); - auto max_latencyB = static_cast(req_cost * (1.0 + 1.0 / shareB)); + auto max_latency_a = static_cast(req_cost * (1.0 + 1.0 / share_a)); + auto max_latency_b = static_cast(req_cost * (1.0 + 1.0 / share_b)); - double consumedA = 0; - double consumedB = 0; + double consumed_a = 0; + double consumed_b = 0; for (int seconds = 0; seconds < 100; seconds++) { t.process(start + std::chrono::seconds(seconds)); double arrival_curve = 100.0 + 10.0 * seconds + req_cost; - t.consumed("A", static_cast(arrival_curve * shareA - consumedA), max_latencyA); - t.consumed("B", static_cast(arrival_curve * shareB - consumedB), max_latencyB); - consumedA = arrival_curve * shareA; - consumedB = arrival_curve * shareB; + t.consumed("A", static_cast(arrival_curve * share_a - consumed_a), max_latency_a); + t.consumed("B", static_cast(arrival_curve * share_b - consumed_b), max_latency_b); + consumed_a = arrival_curve * share_a; + consumed_b = arrival_curve * share_b; } } diff --git a/src/Common/Scheduler/Nodes/tests/gtest_unified_scheduler_node.cpp b/src/Common/Scheduler/Nodes/tests/gtest_unified_scheduler_node.cpp new file mode 100644 index 00000000000..b5bcc07f71a --- /dev/null +++ b/src/Common/Scheduler/Nodes/tests/gtest_unified_scheduler_node.cpp @@ -0,0 +1,748 @@ +#include +#include + +#include +#include +#include + +#include +#include +#include + +using namespace DB; + +using ResourceTest = ResourceTestClass; + +TEST(SchedulerUnifiedNode, Smoke) +{ + ResourceTest t; + + t.addCustom("/", SchedulingSettings{}); + + t.enqueue("/fifo", {10, 10}); + t.dequeue(2); + t.consumed("fifo", 20); +} + +TEST(SchedulerUnifiedNode, FairnessWeight) +{ + ResourceTest t; + + auto all = 
t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}}); + auto b = t.createUnifiedNode("B", all, {.weight = 3.0, .priority = Priority{}}); + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10}); + + t.dequeue(4); + t.consumed("A", 10); + t.consumed("B", 30); + + t.dequeue(4); + t.consumed("A", 10); + t.consumed("B", 30); + + t.dequeue(); + t.consumed("A", 60); + t.consumed("B", 20); +} + +TEST(SchedulerUnifiedNode, FairnessActivation) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all); + auto b = t.createUnifiedNode("B", all); + auto c = t.createUnifiedNode("C", all); + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10}); + t.enqueue(c, {10, 10}); + + t.dequeue(3); + t.consumed("A", 10); + t.consumed("B", 10); + t.consumed("C", 10); + + t.dequeue(4); + t.consumed("A", 30); + t.consumed("B", 0); + t.consumed("C", 10); + + t.enqueue(b, {10, 10}); + t.dequeue(1); + t.consumed("B", 10); + + t.enqueue(c, {10, 10}); + t.dequeue(1); + t.consumed("C", 10); + + t.dequeue(2); // A B or B A + t.consumed("A", 10); + t.consumed("B", 10); +} + +TEST(SchedulerUnifiedNode, FairnessMaxMin) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all); + auto b = t.createUnifiedNode("B", all); + + t.enqueue(a, {10, 10}); // make sure A is never empty + + for (int i = 0; i < 10; i++) + { + t.enqueue(a, {10, 10, 10, 10}); + t.enqueue(b, {10, 10}); + + t.dequeue(6); + t.consumed("A", 40); + t.consumed("B", 20); + } + + t.dequeue(2); + t.consumed("A", 20); +} + +TEST(SchedulerUnifiedNode, FairnessHierarchical) +{ + ResourceTest t; + + + auto all = t.createUnifiedNode("all"); + auto x = t.createUnifiedNode("X", all); + auto y = t.createUnifiedNode("Y", all); + auto a = t.createUnifiedNode("A", x); + auto b = t.createUnifiedNode("B", x); + auto c = t.createUnifiedNode("C", y); + auto d = t.createUnifiedNode("D", y); + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10}); + for (int i = 0; i < 4; i++) + { + t.dequeue(8); + t.consumed("A", 20); + t.consumed("B", 20); + t.consumed("C", 20); + t.consumed("D", 20); + } + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10}); + for (int i = 0; i < 4; i++) + { + t.dequeue(8); + t.consumed("A", 40); + t.consumed("C", 20); + t.consumed("D", 20); + } + + t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10}); + for (int i = 0; i < 4; i++) + { + t.dequeue(8); + t.consumed("B", 40); + t.consumed("C", 20); + t.consumed("D", 20); + } + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10}); + for (int i = 0; i < 4; i++) + { + t.dequeue(8); + t.consumed("A", 20); + t.consumed("B", 20); + t.consumed("C", 40); + } + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(d, {10, 10, 10, 10, 10, 
10, 10, 10}); + for (int i = 0; i < 4; i++) + { + t.dequeue(8); + t.consumed("A", 20); + t.consumed("B", 20); + t.consumed("D", 40); + } + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10}); + for (int i = 0; i < 4; i++) + { + t.dequeue(8); + t.consumed("A", 40); + t.consumed("D", 40); + } +} + +TEST(SchedulerUnifiedNode, Priority) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all, {.priority = Priority{3}}); + auto b = t.createUnifiedNode("B", all, {.priority = Priority{2}}); + auto c = t.createUnifiedNode("C", all, {.priority = Priority{1}}); + + t.enqueue(a, {10, 10, 10}); + t.enqueue(b, {10, 10, 10}); + t.enqueue(c, {10, 10, 10}); + + t.dequeue(2); + t.consumed("A", 0); + t.consumed("B", 0); + t.consumed("C", 20); + + t.dequeue(2); + t.consumed("A", 0); + t.consumed("B", 10); + t.consumed("C", 10); + + t.dequeue(2); + t.consumed("A", 0); + t.consumed("B", 20); + t.consumed("C", 0); + + t.dequeue(); + t.consumed("A", 30); + t.consumed("B", 0); + t.consumed("C", 0); +} + +TEST(SchedulerUnifiedNode, PriorityActivation) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all, {.priority = Priority{3}}); + auto b = t.createUnifiedNode("B", all, {.priority = Priority{2}}); + auto c = t.createUnifiedNode("C", all, {.priority = Priority{1}}); + + t.enqueue(a, {10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10}); + t.enqueue(c, {10, 10}); + + t.dequeue(3); + t.consumed("A", 0); + t.consumed("B", 10); + t.consumed("C", 20); + + t.dequeue(2); + t.consumed("A", 20); + t.consumed("B", 0); + t.consumed("C", 0); + + t.enqueue(b, {10, 10, 10}); + t.dequeue(2); + t.consumed("A", 0); + t.consumed("B", 20); + t.consumed("C", 0); + + t.enqueue(c, {10, 10}); + t.dequeue(3); + t.consumed("A", 0); + t.consumed("B", 10); + t.consumed("C", 20); + + t.dequeue(2); + t.consumed("A", 20); + t.consumed("B", 0); + t.consumed("C", 0); +} + +TEST(SchedulerUnifiedNode, List) +{ + ResourceTest t; + + std::list list; + list.push_back(t.createUnifiedNode("all")); + + for (int length = 1; length < 5; length++) + { + String name = fmt::format("L{}", length); + list.push_back(t.createUnifiedNode(name, list.back())); + + for (int i = 0; i < 3; i++) + { + t.enqueue(list.back(), {10, 10}); + t.dequeue(1); + t.consumed(name, 10); + + for (int j = 0; j < 3; j++) + { + t.enqueue(list.back(), {10, 10, 10}); + t.dequeue(1); + t.consumed(name, 10); + t.dequeue(1); + t.consumed(name, 10); + t.dequeue(1); + t.consumed(name, 10); + } + + t.dequeue(1); + t.consumed(name, 10); + } + } +} + +TEST(SchedulerUnifiedNode, ThrottlerLeakyBucket) +{ + ResourceTest t; + EventQueue::TimePoint start = std::chrono::system_clock::now(); + t.process(start, 0); + + auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 20.0}); + + t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10}); + + t.process(start + std::chrono::seconds(0)); + t.consumed("all", 30); // It is allowed to go below zero for exactly one resource request + + t.process(start + std::chrono::seconds(1)); + t.consumed("all", 10); + + t.process(start + std::chrono::seconds(2)); + t.consumed("all", 10); + + t.process(start + std::chrono::seconds(3)); + t.consumed("all", 10); + + t.process(start + std::chrono::seconds(4)); + t.consumed("all", 10); + + t.process(start + std::chrono::seconds(100500)); + 
t.consumed("all", 10); +} + +TEST(SchedulerUnifiedNode, ThrottlerPacing) +{ + ResourceTest t; + EventQueue::TimePoint start = std::chrono::system_clock::now(); + t.process(start, 0); + + // Zero burst allows you to send one request of any `size` and than throttle for `size/max_speed` seconds. + // Useful if outgoing traffic should be "paced", i.e. have the least possible burstiness. + auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 1.0, .max_burst = 0.0}); + + t.enqueue(all, {1, 2, 3, 1, 2, 1}); + int output[] = {1, 2, 0, 3, 0, 0, 1, 2, 0, 1, 0}; + for (int i = 0; i < std::size(output); i++) + { + t.process(start + std::chrono::seconds(i)); + t.consumed("all", output[i]); + } +} + +TEST(SchedulerUnifiedNode, ThrottlerBucketFilling) +{ + ResourceTest t; + EventQueue::TimePoint start = std::chrono::system_clock::now(); + t.process(start, 0); + + auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0}); + + t.enqueue(all, {100}); + + t.process(start + std::chrono::seconds(0)); + t.consumed("all", 100); // consume all tokens, but it is still active (not negative) + + t.process(start + std::chrono::seconds(5)); + t.consumed("all", 0); // There was nothing to consume + + t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10, 10, 10}); + t.process(start + std::chrono::seconds(5)); + t.consumed("all", 60); // 5 sec * 10 tokens/sec = 50 tokens + 1 extra request to go below zero + + t.process(start + std::chrono::seconds(100)); + t.consumed("all", 40); // Consume rest + + t.process(start + std::chrono::seconds(200)); + + t.enqueue(all, {95, 1, 1, 1, 1, 1, 1, 1, 1, 1}); + t.process(start + std::chrono::seconds(200)); + t.consumed("all", 101); // check we cannot consume more than max_burst + 1 request + + t.process(start + std::chrono::seconds(100500)); + t.consumed("all", 3); +} + +TEST(SchedulerUnifiedNode, ThrottlerAndFairness) +{ + ResourceTest t; + EventQueue::TimePoint start = std::chrono::system_clock::now(); + t.process(start, 0); + + auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0}); + auto a = t.createUnifiedNode("A", all, {.weight = 10.0, .priority = Priority{}}); + auto b = t.createUnifiedNode("B", all, {.weight = 90.0, .priority = Priority{}}); + + ResourceCost req_cost = 1; + ResourceCost total_cost = 2000; + for (int i = 0; i < total_cost / req_cost; i++) + { + t.enqueue(a, {req_cost}); + t.enqueue(b, {req_cost}); + } + + double share_a = 0.1; + double share_b = 0.9; + + // Bandwidth-latency coupling due to fairness: worst latency is inversely proportional to share + auto max_latency_a = static_cast(req_cost * (1.0 + 1.0 / share_a)); + auto max_latency_b = static_cast(req_cost * (1.0 + 1.0 / share_b)); + + double consumed_a = 0; + double consumed_b = 0; + for (int seconds = 0; seconds < 100; seconds++) + { + t.process(start + std::chrono::seconds(seconds)); + double arrival_curve = 100.0 + 10.0 * seconds + req_cost; + t.consumed("A", static_cast(arrival_curve * share_a - consumed_a), max_latency_a); + t.consumed("B", static_cast(arrival_curve * share_b - consumed_b), max_latency_b); + consumed_a = arrival_curve * share_a; + consumed_b = arrival_curve * share_b; + } +} + +TEST(SchedulerUnifiedNode, QueueWithRequestsDestruction) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + + t.enqueue(all, {10, 10}); // enqueue reqeuests to be canceled + + // This will destroy the queue and fail both requests + auto a = t.createUnifiedNode("A", all); + t.failed(20); 
+ + // Check that everything works fine after destruction + auto b = t.createUnifiedNode("B", all); + t.enqueue(a, {10, 10}); // make sure A is never empty + for (int i = 0; i < 10; i++) + { + t.enqueue(a, {10, 10, 10, 10}); + t.enqueue(b, {10, 10}); + + t.dequeue(6); + t.consumed("A", 40); + t.consumed("B", 20); + } + t.dequeue(2); + t.consumed("A", 20); +} + +TEST(SchedulerUnifiedNode, ResourceGuardException) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + + t.enqueue(all, {10, 10}); // enqueue reqeuests to be canceled + + std::thread consumer([queue = all->getQueue()] + { + ResourceLink link{.queue = queue.get()}; + bool caught = false; + try + { + ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), link); + } + catch (...) + { + caught = true; + } + ASSERT_TRUE(caught); + }); + + // This will destroy the queue and fail both requests + auto a = t.createUnifiedNode("A", all); + t.failed(20); + consumer.join(); + + // Check that everything works fine after destruction + auto b = t.createUnifiedNode("B", all); + t.enqueue(a, {10, 10}); // make sure A is never empty + for (int i = 0; i < 10; i++) + { + t.enqueue(a, {10, 10, 10, 10}); + t.enqueue(b, {10, 10}); + + t.dequeue(6); + t.consumed("A", 40); + t.consumed("B", 20); + } + t.dequeue(2); + t.consumed("A", 20); +} + +TEST(SchedulerUnifiedNode, UpdateWeight) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}}); + auto b = t.createUnifiedNode("B", all, {.weight = 3.0, .priority = Priority{}}); + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10}); + + t.dequeue(4); + t.consumed("A", 10); + t.consumed("B", 30); + + t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{}}); + + t.dequeue(4); + t.consumed("A", 20); + t.consumed("B", 20); + + t.dequeue(4); + t.consumed("A", 20); + t.consumed("B", 20); +} + +TEST(SchedulerUnifiedNode, UpdatePriority) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}}); + auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{}}); + + t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10}); + + t.dequeue(2); + t.consumed("A", 10); + t.consumed("B", 10); + + t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{-1}}); + + t.dequeue(2); + t.consumed("A", 20); + t.consumed("B", 0); + + t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{-2}}); + + t.dequeue(2); + t.consumed("A", 0); + t.consumed("B", 20); + + t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{-2}}); + + t.dequeue(2); + t.consumed("A", 10); + t.consumed("B", 10); +} + +TEST(SchedulerUnifiedNode, UpdateParentOfLeafNode) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}}); + auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}}); + auto x = t.createUnifiedNode("X", a, {}); + auto y = t.createUnifiedNode("Y", b, {}); + + t.enqueue(x, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(y, {10, 10, 10, 10, 10, 10, 10, 10}); + + t.dequeue(2); + t.consumed("X", 20); + t.consumed("Y", 0); + + t.updateUnifiedNode(x, a, b, {}); + + t.dequeue(2); + t.consumed("X", 10); + t.consumed("Y", 10); + + t.updateUnifiedNode(y, b, a, {}); + + t.dequeue(2); + t.consumed("X", 0); 
+ t.consumed("Y", 20); + + t.updateUnifiedNode(y, a, all, {}); + t.updateUnifiedNode(x, b, all, {}); + + t.dequeue(4); + t.consumed("X", 20); + t.consumed("Y", 20); +} + +TEST(SchedulerUnifiedNode, UpdatePriorityOfIntermediateNode) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}}); + auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}}); + auto x1 = t.createUnifiedNode("X1", a, {}); + auto y1 = t.createUnifiedNode("Y1", b, {}); + auto x2 = t.createUnifiedNode("X2", a, {}); + auto y2 = t.createUnifiedNode("Y2", b, {}); + + t.enqueue(x1, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(y1, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(x2, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(y2, {10, 10, 10, 10, 10, 10, 10, 10}); + + t.dequeue(4); + t.consumed("X1", 20); + t.consumed("Y1", 0); + t.consumed("X2", 20); + t.consumed("Y2", 0); + + t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{2}}); + + t.dequeue(4); + t.consumed("X1", 10); + t.consumed("Y1", 10); + t.consumed("X2", 10); + t.consumed("Y2", 10); + + t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{1}}); + + t.dequeue(4); + t.consumed("X1", 0); + t.consumed("Y1", 20); + t.consumed("X2", 0); + t.consumed("Y2", 20); +} + +TEST(SchedulerUnifiedNode, UpdateParentOfIntermediateNode) +{ + ResourceTest t; + + auto all = t.createUnifiedNode("all"); + auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}}); + auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}}); + auto c = t.createUnifiedNode("C", a, {}); + auto d = t.createUnifiedNode("D", b, {}); + auto x1 = t.createUnifiedNode("X1", c, {}); + auto y1 = t.createUnifiedNode("Y1", d, {}); + auto x2 = t.createUnifiedNode("X2", c, {}); + auto y2 = t.createUnifiedNode("Y2", d, {}); + + t.enqueue(x1, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(y1, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(x2, {10, 10, 10, 10, 10, 10, 10, 10}); + t.enqueue(y2, {10, 10, 10, 10, 10, 10, 10, 10}); + + t.dequeue(4); + t.consumed("X1", 20); + t.consumed("Y1", 0); + t.consumed("X2", 20); + t.consumed("Y2", 0); + + t.updateUnifiedNode(c, a, b, {}); + + t.dequeue(4); + t.consumed("X1", 10); + t.consumed("Y1", 10); + t.consumed("X2", 10); + t.consumed("Y2", 10); + + t.updateUnifiedNode(d, b, a, {}); + + t.dequeue(4); + t.consumed("X1", 0); + t.consumed("Y1", 20); + t.consumed("X2", 0); + t.consumed("Y2", 20); +} + +TEST(SchedulerUnifiedNode, UpdateThrottlerMaxSpeed) +{ + ResourceTest t; + EventQueue::TimePoint start = std::chrono::system_clock::now(); + t.process(start, 0); + + auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 20.0}); + + t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10}); + + t.process(start + std::chrono::seconds(0)); + t.consumed("all", 30); // It is allowed to go below zero for exactly one resource request + + t.process(start + std::chrono::seconds(1)); + t.consumed("all", 10); + + t.process(start + std::chrono::seconds(2)); + t.consumed("all", 10); + + t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 1.0, .max_burst = 20.0}); + + t.process(start + std::chrono::seconds(12)); + t.consumed("all", 10); + + t.process(start + std::chrono::seconds(22)); + t.consumed("all", 10); + + t.process(start + std::chrono::seconds(100500)); + t.consumed("all", 10); +} + +TEST(SchedulerUnifiedNode, UpdateThrottlerMaxBurst) +{ + 
ResourceTest t; + EventQueue::TimePoint start = std::chrono::system_clock::now(); + t.process(start, 0); + + auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0}); + + t.enqueue(all, {100}); + + t.process(start + std::chrono::seconds(0)); + t.consumed("all", 100); // consume all tokens, but it is still active (not negative) + + t.process(start + std::chrono::seconds(2)); + t.consumed("all", 0); // There was nothing to consume + t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 10.0, .max_burst = 30.0}); + + t.process(start + std::chrono::seconds(5)); + t.consumed("all", 0); // There was nothing to consume + + t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10, 10, 10}); + t.process(start + std::chrono::seconds(5)); + t.consumed("all", 40); // min(30 tokens, 5 sec * 10 tokens/sec) = 30 tokens + 1 extra request to go below zero + + t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0}); + + t.process(start + std::chrono::seconds(100)); + t.consumed("all", 60); // Consume rest + + t.process(start + std::chrono::seconds(150)); + t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 100.0, .max_burst = 200.0}); + + t.process(start + std::chrono::seconds(200)); + + t.enqueue(all, {195, 1, 1, 1, 1, 1, 1, 1, 1, 1}); + t.process(start + std::chrono::seconds(200)); + t.consumed("all", 201); // check we cannot consume more than max_burst + 1 request + + t.process(start + std::chrono::seconds(100500)); + t.consumed("all", 3); +} diff --git a/src/Common/Scheduler/ResourceGuard.h b/src/Common/Scheduler/ResourceGuard.h index cf97f7acf93..ba3532598af 100644 --- a/src/Common/Scheduler/ResourceGuard.h +++ b/src/Common/Scheduler/ResourceGuard.h @@ -12,6 +12,7 @@ #include #include +#include #include @@ -34,6 +35,11 @@ namespace CurrentMetrics namespace DB { +namespace ErrorCodes +{ + extern const int RESOURCE_ACCESS_DENIED; +} + /* * Scoped resource guard. * Waits for resource to be available in constructor and releases resource in destructor @@ -109,12 +115,25 @@ public: dequeued_cv.notify_one(); } + // This function is executed inside scheduler thread and wakes thread that issued this `request`. + // That thread will throw an exception. 
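The `failed()` override added below hands an `std::exception_ptr` from the scheduler thread to the consumer thread that is blocked in `wait()`, which then rethrows it as a `RESOURCE_ACCESS_DENIED` error. A minimal standalone sketch of this wake-and-rethrow pattern (simplified: a plain `runtime_error` and no request state machine):

```cpp
// Minimal sketch of handing an exception from a scheduler thread to a waiting
// consumer thread via exception_ptr and a condition variable.
#include <condition_variable>
#include <exception>
#include <iostream>
#include <mutex>
#include <stdexcept>
#include <thread>

struct PendingRequest
{
    std::mutex mutex;
    std::condition_variable cv;
    bool done = false;
    std::exception_ptr error;

    void failed(std::exception_ptr e) // called from the scheduler thread
    {
        std::lock_guard lock(mutex);
        error = e;
        done = true;
        cv.notify_one();
    }

    void wait() // called from the consumer thread
    {
        std::unique_lock lock(mutex);
        cv.wait(lock, [this] { return done; });
        if (error)
            std::rethrow_exception(error);
    }
};

int main()
{
    PendingRequest request;
    std::thread scheduler([&] { request.failed(std::make_exception_ptr(std::runtime_error("resource unavailable"))); });
    try
    {
        request.wait();
    }
    catch (const std::exception & e)
    {
        std::cout << "consumer saw: " << e.what() << '\n';
    }
    scheduler.join();
}
```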
+ void failed(const std::exception_ptr & ptr) override + { + std::unique_lock lock(mutex); + chassert(state == Enqueued); + state = Dequeued; + exception = ptr; + dequeued_cv.notify_one(); + } + void wait() { CurrentMetrics::Increment scheduled(metrics->scheduled_count); auto timer = CurrentThread::getProfileEvents().timer(metrics->wait_microseconds); std::unique_lock lock(mutex); dequeued_cv.wait(lock, [this] { return state == Dequeued; }); + if (exception) + throw Exception(ErrorCodes::RESOURCE_ACCESS_DENIED, "Resource request failed: {}", getExceptionMessage(exception, /* with_stacktrace = */ false)); } void finish(ResourceCost real_cost_, ResourceLink link_) @@ -151,6 +170,7 @@ public: std::mutex mutex; std::condition_variable dequeued_cv; RequestState state = Finished; + std::exception_ptr exception; }; /// Creates pending request for resource; blocks while resource is not available (unless `Lock::Defer`) diff --git a/src/Common/Scheduler/ResourceManagerFactory.h b/src/Common/Scheduler/ResourceManagerFactory.h deleted file mode 100644 index 52f271e51b1..00000000000 --- a/src/Common/Scheduler/ResourceManagerFactory.h +++ /dev/null @@ -1,55 +0,0 @@ -#pragma once - -#include -#include - -#include - -#include - -#include -#include -#include - -namespace DB -{ - -namespace ErrorCodes -{ - extern const int INVALID_SCHEDULER_NODE; -} - -class ResourceManagerFactory : private boost::noncopyable -{ -public: - static ResourceManagerFactory & instance() - { - static ResourceManagerFactory ret; - return ret; - } - - ResourceManagerPtr get(const String & name) - { - std::lock_guard lock{mutex}; - if (auto iter = methods.find(name); iter != methods.end()) - return iter->second(); - throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Unknown scheduler node type: {}", name); - } - - template - void registerMethod(const String & name) - { - std::lock_guard lock{mutex}; - methods[name] = [] () - { - return std::make_shared(); - }; - } - -private: - std::mutex mutex; - using Method = std::function; - std::unordered_map methods; -}; - -} diff --git a/src/Common/Scheduler/ResourceRequest.cpp b/src/Common/Scheduler/ResourceRequest.cpp index 26e8084cdfa..674c7650adf 100644 --- a/src/Common/Scheduler/ResourceRequest.cpp +++ b/src/Common/Scheduler/ResourceRequest.cpp @@ -1,13 +1,34 @@ #include #include +#include + +#include + namespace DB { void ResourceRequest::finish() { - if (constraint) - constraint->finishRequest(this); + // Iterate over constraints in reverse order + for (ISchedulerConstraint * constraint : std::ranges::reverse_view(constraints)) + { + if (constraint) + constraint->finishRequest(this); + } +} + +bool ResourceRequest::addConstraint(ISchedulerConstraint * new_constraint) +{ + for (auto & constraint : constraints) + { + if (!constraint) + { + constraint = new_constraint; + return true; + } + } + return false; } } diff --git a/src/Common/Scheduler/ResourceRequest.h b/src/Common/Scheduler/ResourceRequest.h index 7b6a5af0fe6..bb9bfbfc8fd 100644 --- a/src/Common/Scheduler/ResourceRequest.h +++ b/src/Common/Scheduler/ResourceRequest.h @@ -2,7 +2,9 @@ #include #include +#include #include +#include namespace DB { @@ -15,6 +17,9 @@ class ISchedulerConstraint; using ResourceCost = Int64; constexpr ResourceCost ResourceCostMax = std::numeric_limits::max(); +/// Max number of constraints for a request to pass though (depth of constraints chain) +constexpr size_t ResourceMaxConstraints = 8; + /* * Request for a resource consumption. The main moving part of the scheduling subsystem. 
* Resource requests processing workflow: @@ -39,8 +44,7 @@ constexpr ResourceCost ResourceCostMax = std::numeric_limits::max(); * * Request can also be canceled before (3) using ISchedulerQueue::cancelRequest(). * Returning false means it is too late for request to be canceled. It should be processed in a regular way. - * Returning true means successful cancel and therefore steps (4) and (5) are not going to happen - * and step (6) MUST be omitted. + * Returning true means successful cancel and therefore steps (4) and (5) are not going to happen. */ class ResourceRequest : public boost::intrusive::list_base_hook<> { @@ -49,9 +53,10 @@ public: /// NOTE: If cost is not known in advance, ResourceBudget should be used (note that every ISchedulerQueue has it) ResourceCost cost; - /// Scheduler node to be notified on consumption finish - /// Auto-filled during request enqueue/dequeue - ISchedulerConstraint * constraint; + /// Scheduler nodes to be notified on consumption finish + /// Auto-filled during request dequeue + /// Vector is not used to avoid allocations in the scheduler thread + std::array constraints; explicit ResourceRequest(ResourceCost cost_ = 1) { @@ -62,7 +67,8 @@ public: void reset(ResourceCost cost_) { cost = cost_; - constraint = nullptr; + for (auto & constraint : constraints) + constraint = nullptr; // Note that list_base_hook should be reset independently (by intrusive list) } @@ -74,11 +80,18 @@ public: /// (e.g. setting an std::promise or creating a job in a thread pool) virtual void execute() = 0; + /// Callback to trigger an error in case if resource is unavailable. + virtual void failed(const std::exception_ptr & ptr) = 0; + /// Stop resource consumption and notify resource scheduler. /// Should be called when resource consumption is finished by consumer. /// ResourceRequest should not be destructed or reset before calling to `finish()`. - /// WARNING: this function MUST not be called if request was canceled. + /// It is okay to call finish() even for failed and canceled requests (it will be no-op) void finish(); + + /// Is called from the scheduler thread to fill `constraints` chain + /// Returns `true` iff constraint was added successfully + bool addConstraint(ISchedulerConstraint * new_constraint); }; } diff --git a/src/Common/Scheduler/SchedulerRoot.h b/src/Common/Scheduler/SchedulerRoot.h index 6a3c3962eb1..451f29f33f2 100644 --- a/src/Common/Scheduler/SchedulerRoot.h +++ b/src/Common/Scheduler/SchedulerRoot.h @@ -28,27 +28,27 @@ namespace ErrorCodes * Resource scheduler root node with a dedicated thread. * Immediate children correspond to different resources. 
*/ -class SchedulerRoot : public ISchedulerNode +class SchedulerRoot final : public ISchedulerNode { private: - struct TResource + struct Resource { SchedulerNodePtr root; // Intrusive cyclic list of active resources - TResource * next = nullptr; - TResource * prev = nullptr; + Resource * next = nullptr; + Resource * prev = nullptr; - explicit TResource(const SchedulerNodePtr & root_) + explicit Resource(const SchedulerNodePtr & root_) : root(root_) { root->info.parent.ptr = this; } // Get pointer stored by ctor in info - static TResource * get(SchedulerNodeInfo & info) + static Resource * get(SchedulerNodeInfo & info) { - return reinterpret_cast(info.parent.ptr); + return reinterpret_cast(info.parent.ptr); } }; @@ -60,6 +60,8 @@ public: ~SchedulerRoot() override { stop(); + while (!children.empty()) + removeChild(children.begin()->first); } /// Runs separate scheduler thread @@ -95,6 +97,12 @@ public: } } + const String & getTypeName() const override + { + static String type_name("scheduler"); + return type_name; + } + bool equals(ISchedulerNode * other) override { if (!ISchedulerNode::equals(other)) @@ -179,16 +187,11 @@ public: void activateChild(ISchedulerNode * child) override { - activate(TResource::get(child->info)); - } - - void setParent(ISchedulerNode *) override - { - abort(); // scheduler must be the root and this function should not be called + activate(Resource::get(child->info)); } private: - void activate(TResource * value) + void activate(Resource * value) { assert(value->next == nullptr && value->prev == nullptr); if (current == nullptr) // No active children @@ -206,7 +209,7 @@ private: } } - void deactivate(TResource * value) + void deactivate(Resource * value) { if (value->next == nullptr) return; // Already deactivated @@ -251,8 +254,8 @@ private: request->execute(); } - TResource * current = nullptr; // round-robin pointer - std::unordered_map children; // resources by pointer + Resource * current = nullptr; // round-robin pointer + std::unordered_map children; // resources by pointer std::atomic stop_flag = false; EventQueue events; ThreadFromGlobalPool scheduler; diff --git a/src/Common/Scheduler/SchedulingSettings.cpp b/src/Common/Scheduler/SchedulingSettings.cpp new file mode 100644 index 00000000000..60319cdd54c --- /dev/null +++ b/src/Common/Scheduler/SchedulingSettings.cpp @@ -0,0 +1,130 @@ +#include +#include +#include +#include + + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int BAD_ARGUMENTS; +} + +void SchedulingSettings::updateFromChanges(const ASTCreateWorkloadQuery::SettingsChanges & changes, const String & resource_name) +{ + struct { + std::optional new_weight; + std::optional new_priority; + std::optional new_max_speed; + std::optional new_max_burst; + std::optional new_max_requests; + std::optional new_max_cost; + + static Float64 getNotNegativeFloat64(const String & name, const Field & field) + { + { + UInt64 val; + if (field.tryGet(val)) + return static_cast(val); // We dont mind slight loss of precision + } + + { + Int64 val; + if (field.tryGet(val)) + { + if (val < 0) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Unexpected negative Int64 value for workload setting '{}'", name); + return static_cast(val); // We dont mind slight loss of precision + } + } + + return field.safeGet(); + } + + static Int64 getNotNegativeInt64(const String & name, const Field & field) + { + { + UInt64 val; + if (field.tryGet(val)) + { + // Saturate on overflow + if (val > static_cast(std::numeric_limits::max())) + val = std::numeric_limits::max(); + 
return static_cast(val); + } + } + + { + Int64 val; + if (field.tryGet(val)) + { + if (val < 0) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Unexpected negative Int64 value for workload setting '{}'", name); + return val; + } + } + + return field.safeGet(); + } + + void read(const String & name, const Field & value) + { + if (name == "weight") + new_weight = getNotNegativeFloat64(name, value); + else if (name == "priority") + new_priority = Priority{value.safeGet()}; + else if (name == "max_speed") + new_max_speed = getNotNegativeFloat64(name, value); + else if (name == "max_burst") + new_max_burst = getNotNegativeFloat64(name, value); + else if (name == "max_requests") + new_max_requests = getNotNegativeInt64(name, value); + else if (name == "max_cost") + new_max_cost = getNotNegativeInt64(name, value); + } + } regular, specific; + + // Read changed setting values + for (const auto & [name, value, resource] : changes) + { + if (resource.empty()) + regular.read(name, value); + else if (resource == resource_name) + specific.read(name, value); + } + + auto get_value = [] (const std::optional & specific_new, const std::optional & regular_new, T & old) + { + if (specific_new) + return *specific_new; + if (regular_new) + return *regular_new; + return old; + }; + + // Validate that we could use values read in a scheduler node + { + SchedulerNodeInfo validating_node( + get_value(specific.new_weight, regular.new_weight, weight), + get_value(specific.new_priority, regular.new_priority, priority)); + } + + // Commit new values. + // Previous values are left intentionally for ALTER query to be able to skip not mentioned setting values + weight = get_value(specific.new_weight, regular.new_weight, weight); + priority = get_value(specific.new_priority, regular.new_priority, priority); + if (specific.new_max_speed || regular.new_max_speed) + { + max_speed = get_value(specific.new_max_speed, regular.new_max_speed, max_speed); + // We always set max_burst if max_speed is changed. + // This is done for users to be able to ignore more advanced max_burst setting and rely only on max_speed + max_burst = default_burst_seconds * max_speed; + } + max_burst = get_value(specific.new_max_burst, regular.new_max_burst, max_burst); + max_requests = get_value(specific.new_max_requests, regular.new_max_requests, max_requests); + max_cost = get_value(specific.new_max_cost, regular.new_max_cost, max_cost); +} + +} diff --git a/src/Common/Scheduler/SchedulingSettings.h b/src/Common/Scheduler/SchedulingSettings.h new file mode 100644 index 00000000000..6db3ef0dce9 --- /dev/null +++ b/src/Common/Scheduler/SchedulingSettings.h @@ -0,0 +1,39 @@ +#pragma once + +#include + +#include +#include + +#include + +namespace DB +{ + +struct SchedulingSettings +{ + /// Priority and weight among siblings + Float64 weight = 1.0; + Priority priority; + + /// Throttling constraints. + /// Up to 2 independent throttlers: one for average speed and one for peek speed. 
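Taken together with the throttler tests earlier in this diff, the pair of constraints declared just below (max_speed, max_burst) behaves like a token bucket; the formula is an assumption inferred from those tests, not something stated in this header:

    available(t) = min(max_burst, available(t0) + max_speed * (t - t0))

For example, with max_speed = 10 and max_burst = 20 (the values used by UpdateThrottlerMaxSpeed), a fresh node can serve roughly 20 units of cost at once and then sustains 10 units per second; one request is always allowed to push the balance below zero, which is why that test observes 30 units consumed in its very first dequeue.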
+ static constexpr Float64 default_burst_seconds = 1.0; + Float64 max_speed = 0; // Zero means unlimited + Float64 max_burst = 0; // default is `default_burst_seconds * max_speed` + + /// Limits total number of concurrent resource requests that are allowed to consume + static constexpr Int64 default_max_requests = std::numeric_limits::max(); + Int64 max_requests = default_max_requests; + + /// Limits total cost of concurrent resource requests that are allowed to consume + static constexpr Int64 default_max_cost = std::numeric_limits::max(); + Int64 max_cost = default_max_cost; + + bool hasThrottler() const { return max_speed != 0; } + bool hasSemaphore() const { return max_requests != default_max_requests || max_cost != default_max_cost; } + + void updateFromChanges(const ASTCreateWorkloadQuery::SettingsChanges & changes, const String & resource_name = {}); +}; + +} diff --git a/src/Common/Scheduler/Workload/IWorkloadEntityStorage.h b/src/Common/Scheduler/Workload/IWorkloadEntityStorage.h new file mode 100644 index 00000000000..adb3a808eea --- /dev/null +++ b/src/Common/Scheduler/Workload/IWorkloadEntityStorage.h @@ -0,0 +1,91 @@ +#pragma once + +#include +#include + +#include + +#include + + +namespace DB +{ + +class IAST; +struct Settings; + +enum class WorkloadEntityType : uint8_t +{ + Workload, + Resource, + + MAX +}; + +/// Interface for a storage of workload entities (WORKLOAD and RESOURCE). +class IWorkloadEntityStorage +{ +public: + virtual ~IWorkloadEntityStorage() = default; + + /// Whether this storage can replicate entities to another node. + virtual bool isReplicated() const { return false; } + virtual String getReplicationID() const { return ""; } + + /// Loads all entities. Can be called once - if entities are already loaded the function does nothing. + virtual void loadEntities() = 0; + + /// Get entity by name. If no entity stored with entity_name throws exception. + virtual ASTPtr get(const String & entity_name) const = 0; + + /// Get entity by name. If no entity stored with entity_name return nullptr. + virtual ASTPtr tryGet(const String & entity_name) const = 0; + + /// Check if entity with entity_name is stored. + virtual bool has(const String & entity_name) const = 0; + + /// Get all entity names. + virtual std::vector getAllEntityNames() const = 0; + + /// Get all entity names of specified type. + virtual std::vector getAllEntityNames(WorkloadEntityType entity_type) const = 0; + + /// Get all entities. + virtual std::vector> getAllEntities() const = 0; + + /// Check whether any entity have been stored. + virtual bool empty() const = 0; + + /// Stops watching. + virtual void stopWatching() {} + + /// Stores an entity. + virtual bool storeEntity( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool throw_if_exists, + bool replace_if_exists, + const Settings & settings) = 0; + + /// Removes an entity. + virtual bool removeEntity( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + bool throw_if_not_exists) = 0; + + struct Event + { + WorkloadEntityType type; + String name; + ASTPtr entity; /// new or changed entity, null if removed + }; + using OnChangedHandler = std::function &)>; + + /// Gets all current entries, pass them through `handler` and subscribes for all later changes. 
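A sketch of how a subscriber might use this method (hypothetical caller; the element type of the handler's argument is assumed to be `Event`, based on how the base class invokes handlers later in this diff):

    // `storage` is an IWorkloadEntityStorage &; apply_* are hypothetical helpers of the caller.
    scope_guard subscription = storage.getAllEntitiesAndSubscribe(
        [&](const std::vector<DB::IWorkloadEntityStorage::Event> & events)
        {
            for (const auto & e : events)
            {
                if (e.entity)
                    apply_create_or_replace(e.type, e.name, e.entity); // CREATE or CREATE OR REPLACE
                else
                    apply_drop(e.type, e.name);                        // DROP
            }
        });

All currently stored entities are delivered to the handler right away; later changes arrive as batches of events, and destroying `subscription` removes the handler.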
+ virtual scope_guard getAllEntitiesAndSubscribe(const OnChangedHandler & handler) = 0; +}; + +} diff --git a/src/Common/Scheduler/Workload/WorkloadEntityDiskStorage.cpp b/src/Common/Scheduler/Workload/WorkloadEntityDiskStorage.cpp new file mode 100644 index 00000000000..1bff672c150 --- /dev/null +++ b/src/Common/Scheduler/Workload/WorkloadEntityDiskStorage.cpp @@ -0,0 +1,287 @@ +#include + +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include + +#include + +#include +#include +#include +#include + +#include +#include + +#include + +namespace fs = std::filesystem; + + +namespace DB +{ + +namespace Setting +{ + extern const SettingsUInt64 max_parser_backtracks; + extern const SettingsUInt64 max_parser_depth; + extern const SettingsBool fsync_metadata; +} + +namespace ErrorCodes +{ + extern const int DIRECTORY_DOESNT_EXIST; + extern const int BAD_ARGUMENTS; +} + + +namespace +{ + constexpr std::string_view workload_prefix = "workload_"; + constexpr std::string_view resource_prefix = "resource_"; + constexpr std::string_view sql_suffix = ".sql"; + + /// Converts a path to an absolute path and append it with a separator. + String makeDirectoryPathCanonical(const String & directory_path) + { + auto canonical_directory_path = std::filesystem::weakly_canonical(directory_path); + if (canonical_directory_path.has_filename()) + canonical_directory_path += std::filesystem::path::preferred_separator; + return canonical_directory_path; + } +} + +WorkloadEntityDiskStorage::WorkloadEntityDiskStorage(const ContextPtr & global_context_, const String & dir_path_) + : WorkloadEntityStorageBase(global_context_) + , dir_path{makeDirectoryPathCanonical(dir_path_)} +{ + log = getLogger("WorkloadEntityDiskStorage"); +} + + +ASTPtr WorkloadEntityDiskStorage::tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name) +{ + return tryLoadEntity(entity_type, entity_name, getFilePath(entity_type, entity_name), /* check_file_exists= */ true); +} + + +ASTPtr WorkloadEntityDiskStorage::tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name, const String & path, bool check_file_exists) +{ + LOG_DEBUG(log, "Loading workload entity {} from file {}", backQuote(entity_name), path); + + try + { + if (check_file_exists && !fs::exists(path)) + return nullptr; + + /// There is .sql file with workload entity creation statement. + ReadBufferFromFile in(path); + + String entity_create_query; + readStringUntilEOF(entity_create_query, in); + + auto parse = [&] (auto parser) + { + return parseQuery( + parser, + entity_create_query.data(), + entity_create_query.data() + entity_create_query.size(), + "", + 0, + global_context->getSettingsRef()[Setting::max_parser_depth], + global_context->getSettingsRef()[Setting::max_parser_backtracks]); + }; + + switch (entity_type) + { + case WorkloadEntityType::Workload: return parse(ParserCreateWorkloadQuery()); + case WorkloadEntityType::Resource: return parse(ParserCreateResourceQuery()); + case WorkloadEntityType::MAX: return nullptr; + } + } + catch (...) 
+ { + tryLogCurrentException(log, fmt::format("while loading workload entity {} from path {}", backQuote(entity_name), path)); + return nullptr; /// Failed to load this entity, will ignore it + } +} + + +void WorkloadEntityDiskStorage::loadEntities() +{ + if (!entities_loaded) + loadEntitiesImpl(); +} + + +void WorkloadEntityDiskStorage::loadEntitiesImpl() +{ + LOG_INFO(log, "Loading workload entities from {}", dir_path); + + if (!std::filesystem::exists(dir_path)) + { + LOG_DEBUG(log, "The directory for workload entities ({}) does not exist: nothing to load", dir_path); + return; + } + + std::vector> entities_name_and_queries; + + Poco::DirectoryIterator dir_end; + for (Poco::DirectoryIterator it(dir_path); it != dir_end; ++it) + { + if (it->isDirectory()) + continue; + + const String & file_name = it.name(); + + if (file_name.starts_with(workload_prefix) && file_name.ends_with(sql_suffix)) + { + String name = unescapeForFileName(file_name.substr( + workload_prefix.size(), + file_name.size() - workload_prefix.size() - sql_suffix.size())); + + if (name.empty()) + continue; + + ASTPtr ast = tryLoadEntity(WorkloadEntityType::Workload, name, dir_path + it.name(), /* check_file_exists= */ false); + if (ast) + entities_name_and_queries.emplace_back(name, ast); + } + + if (file_name.starts_with(resource_prefix) && file_name.ends_with(sql_suffix)) + { + String name = unescapeForFileName(file_name.substr( + resource_prefix.size(), + file_name.size() - resource_prefix.size() - sql_suffix.size())); + + if (name.empty()) + continue; + + ASTPtr ast = tryLoadEntity(WorkloadEntityType::Resource, name, dir_path + it.name(), /* check_file_exists= */ false); + if (ast) + entities_name_and_queries.emplace_back(name, ast); + } + } + + setAllEntities(entities_name_and_queries); + entities_loaded = true; + + LOG_DEBUG(log, "Workload entities loaded"); +} + + +void WorkloadEntityDiskStorage::createDirectory() +{ + std::error_code create_dir_error_code; + fs::create_directories(dir_path, create_dir_error_code); + if (!fs::exists(dir_path) || !fs::is_directory(dir_path) || create_dir_error_code) + throw Exception(ErrorCodes::DIRECTORY_DOESNT_EXIST, "Couldn't create directory {} reason: '{}'", + dir_path, create_dir_error_code.message()); +} + + +WorkloadEntityStorageBase::OperationResult WorkloadEntityDiskStorage::storeEntityImpl( + const ContextPtr & /*current_context*/, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool throw_if_exists, + bool replace_if_exists, + const Settings & settings) +{ + createDirectory(); + String file_path = getFilePath(entity_type, entity_name); + LOG_DEBUG(log, "Storing workload entity {} to file {}", backQuote(entity_name), file_path); + + if (fs::exists(file_path)) + { + if (throw_if_exists) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists", entity_name); + else if (!replace_if_exists) + return OperationResult::Failed; + } + + + String temp_file_path = file_path + ".tmp"; + + try + { + WriteBufferFromFile out(temp_file_path); + formatAST(*create_entity_query, out, false); + writeChar('\n', out); + out.next(); + if (settings[Setting::fsync_metadata]) + out.sync(); + out.close(); + + if (replace_if_exists) + fs::rename(temp_file_path, file_path); + else + renameNoReplace(temp_file_path, file_path); + } + catch (...) 
+ { + fs::remove(temp_file_path); + throw; + } + + LOG_TRACE(log, "Entity {} stored", backQuote(entity_name)); + return OperationResult::Ok; +} + + +WorkloadEntityStorageBase::OperationResult WorkloadEntityDiskStorage::removeEntityImpl( + const ContextPtr & /*current_context*/, + WorkloadEntityType entity_type, + const String & entity_name, + bool throw_if_not_exists) +{ + String file_path = getFilePath(entity_type, entity_name); + LOG_DEBUG(log, "Removing workload entity {} stored in file {}", backQuote(entity_name), file_path); + + bool existed = fs::remove(file_path); + + if (!existed) + { + if (throw_if_not_exists) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' doesn't exist", entity_name); + else + return OperationResult::Failed; + } + + LOG_TRACE(log, "Entity {} removed", backQuote(entity_name)); + return OperationResult::Ok; +} + + +String WorkloadEntityDiskStorage::getFilePath(WorkloadEntityType entity_type, const String & entity_name) const +{ + String file_path; + switch (entity_type) + { + case WorkloadEntityType::Workload: + { + file_path = dir_path + "workload_" + escapeForFileName(entity_name) + ".sql"; + break; + } + case WorkloadEntityType::Resource: + { + file_path = dir_path + "resource_" + escapeForFileName(entity_name) + ".sql"; + break; + } + case WorkloadEntityType::MAX: break; + } + return file_path; +} + +} diff --git a/src/Common/Scheduler/Workload/WorkloadEntityDiskStorage.h b/src/Common/Scheduler/Workload/WorkloadEntityDiskStorage.h new file mode 100644 index 00000000000..cb3fb600182 --- /dev/null +++ b/src/Common/Scheduler/Workload/WorkloadEntityDiskStorage.h @@ -0,0 +1,44 @@ +#pragma once + +#include +#include +#include + + +namespace DB +{ + +/// Loads workload entities from a specified folder. +class WorkloadEntityDiskStorage : public WorkloadEntityStorageBase +{ +public: + WorkloadEntityDiskStorage(const ContextPtr & global_context_, const String & dir_path_); + void loadEntities() override; + +private: + OperationResult storeEntityImpl( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool throw_if_exists, + bool replace_if_exists, + const Settings & settings) override; + + OperationResult removeEntityImpl( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + bool throw_if_not_exists) override; + + void createDirectory(); + void loadEntitiesImpl(); + ASTPtr tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name); + ASTPtr tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name, const String & file_path, bool check_file_exists); + String getFilePath(WorkloadEntityType entity_type, const String & entity_name) const; + + String dir_path; + std::atomic entities_loaded = false; +}; + +} diff --git a/src/Common/Scheduler/Workload/WorkloadEntityKeeperStorage.cpp b/src/Common/Scheduler/Workload/WorkloadEntityKeeperStorage.cpp new file mode 100644 index 00000000000..4b60a7ec57e --- /dev/null +++ b/src/Common/Scheduler/Workload/WorkloadEntityKeeperStorage.cpp @@ -0,0 +1,273 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace DB +{ +namespace Setting +{ +extern const SettingsUInt64 max_parser_backtracks; +extern const SettingsUInt64 max_parser_depth; +} + +namespace ErrorCodes +{ + extern const int BAD_ARGUMENTS; + extern const int LOGICAL_ERROR; +} + 
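For orientation, the disk storage above keeps one SQL statement per file, named by getFilePath (paths and entity names below are illustrative; only the workload_/resource_ prefixes, the .sql suffix and the name escaping come from the code):

    <server path>/workload/workload_production.sql     -- CREATE WORKLOAD ... for entity `production`
    <server path>/workload/resource_network_read.sql   -- CREATE RESOURCE ... for entity `network_read`

Each write goes through a `.tmp` file that is synced when `fsync_metadata` is enabled and then renamed into place, so a partially written definition is never picked up by loadEntitiesImpl.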
+WorkloadEntityKeeperStorage::WorkloadEntityKeeperStorage( + const ContextPtr & global_context_, const String & zookeeper_path_) + : WorkloadEntityStorageBase(global_context_) + , zookeeper_getter{[global_context_]() { return global_context_->getZooKeeper(); }} + , zookeeper_path{zookeeper_path_} + , watch{std::make_shared()} +{ + log = getLogger("WorkloadEntityKeeperStorage"); + if (zookeeper_path.empty()) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "ZooKeeper path must be non-empty"); + + if (zookeeper_path.back() == '/') + zookeeper_path.pop_back(); + + /// If zookeeper chroot prefix is used, path should start with '/', because chroot concatenates without it. + if (zookeeper_path.front() != '/') + zookeeper_path = "/" + zookeeper_path; +} + +WorkloadEntityKeeperStorage::~WorkloadEntityKeeperStorage() +{ + SCOPE_EXIT_SAFE(stopWatchingThread()); +} + +void WorkloadEntityKeeperStorage::startWatchingThread() +{ + if (!watching_flag.exchange(true)) + watching_thread = ThreadFromGlobalPool(&WorkloadEntityKeeperStorage::processWatchQueue, this); +} + +void WorkloadEntityKeeperStorage::stopWatchingThread() +{ + if (watching_flag.exchange(false)) + { + watch->cv.notify_one(); + if (watching_thread.joinable()) + watching_thread.join(); + } +} + +zkutil::ZooKeeperPtr WorkloadEntityKeeperStorage::getZooKeeper() +{ + auto [zookeeper, session_status] = zookeeper_getter.getZooKeeper(); + + if (session_status == zkutil::ZooKeeperCachingGetter::SessionStatus::New) + { + /// It's possible that we connected to different [Zoo]Keeper instance + /// so we may read a bit stale state. + zookeeper->sync(zookeeper_path); + + createRootNodes(zookeeper); + + auto lock = getLock(); + refreshEntities(zookeeper); + } + + return zookeeper; +} + +void WorkloadEntityKeeperStorage::loadEntities() +{ + /// loadEntities() is called at start from Server::main(), so it's better not to stop here on no connection to ZooKeeper or any other error. + /// However the watching thread must be started anyway in case the connection will be established later. + try + { + auto lock = getLock(); + refreshEntities(getZooKeeper()); + } + catch (...) + { + tryLogCurrentException(log, "Failed to load workload entities"); + } + startWatchingThread(); +} + + +void WorkloadEntityKeeperStorage::processWatchQueue() +{ + LOG_DEBUG(log, "Started watching thread"); + setThreadName("WrkldEntWatch"); + + UInt64 handled = 0; + while (watching_flag) + { + try + { + /// Re-initialize ZooKeeper session if expired + getZooKeeper(); + + { + std::unique_lock lock{watch->mutex}; + if (!watch->cv.wait_for(lock, std::chrono::seconds(10), [&] { return !watching_flag || handled != watch->triggered; })) + continue; + handled = watch->triggered; + } + + auto lock = getLock(); + refreshEntities(getZooKeeper()); + } + catch (...) 
+ { + tryLogCurrentException(log, "Will try to restart watching thread after error"); + zookeeper_getter.resetCache(); + sleepForSeconds(5); + } + } + + LOG_DEBUG(log, "Stopped watching thread"); +} + + +void WorkloadEntityKeeperStorage::stopWatching() +{ + stopWatchingThread(); +} + +void WorkloadEntityKeeperStorage::createRootNodes(const zkutil::ZooKeeperPtr & zookeeper) +{ + zookeeper->createAncestors(zookeeper_path); + // If node does not exist we consider it to be equal to empty node: no workload entities + zookeeper->createIfNotExists(zookeeper_path, ""); +} + +WorkloadEntityStorageBase::OperationResult WorkloadEntityKeeperStorage::storeEntityImpl( + const ContextPtr & /*current_context*/, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool /*throw_if_exists*/, + bool /*replace_if_exists*/, + const Settings &) +{ + LOG_DEBUG(log, "Storing workload entity {}", backQuote(entity_name)); + + String new_data = serializeAllEntities(Event{entity_type, entity_name, create_entity_query}); + auto zookeeper = getZooKeeper(); + + Coordination::Stat stat; + auto code = zookeeper->trySet(zookeeper_path, new_data, current_version, &stat); + if (code != Coordination::Error::ZOK) + { + refreshEntities(zookeeper); + return OperationResult::Retry; + } + + current_version = stat.version; + + LOG_DEBUG(log, "Workload entity {} stored", backQuote(entity_name)); + + return OperationResult::Ok; +} + + +WorkloadEntityStorageBase::OperationResult WorkloadEntityKeeperStorage::removeEntityImpl( + const ContextPtr & /*current_context*/, + WorkloadEntityType entity_type, + const String & entity_name, + bool /*throw_if_not_exists*/) +{ + LOG_DEBUG(log, "Removing workload entity {}", backQuote(entity_name)); + + String new_data = serializeAllEntities(Event{entity_type, entity_name, {}}); + auto zookeeper = getZooKeeper(); + + Coordination::Stat stat; + auto code = zookeeper->trySet(zookeeper_path, new_data, current_version, &stat); + if (code != Coordination::Error::ZOK) + { + refreshEntities(zookeeper); + return OperationResult::Retry; + } + + current_version = stat.version; + + LOG_DEBUG(log, "Workload entity {} removed", backQuote(entity_name)); + + return OperationResult::Ok; +} + +std::pair WorkloadEntityKeeperStorage::getDataAndSetWatch(const zkutil::ZooKeeperPtr & zookeeper) +{ + const auto data_watcher = [my_watch = watch](const Coordination::WatchResponse & response) + { + if (response.type == Coordination::Event::CHANGED) + { + std::unique_lock lock{my_watch->mutex}; + my_watch->triggered++; + my_watch->cv.notify_one(); + } + }; + + Coordination::Stat stat; + String data; + bool exists = zookeeper->tryGetWatch(zookeeper_path, data, &stat, data_watcher); + if (!exists) + { + createRootNodes(zookeeper); + data = zookeeper->getWatch(zookeeper_path, &stat, data_watcher); + } + return {data, stat.version}; +} + +void WorkloadEntityKeeperStorage::refreshEntities(const zkutil::ZooKeeperPtr & zookeeper) +{ + auto [data, version] = getDataAndSetWatch(zookeeper); + if (version == current_version) + return; + + LOG_DEBUG(log, "Refreshing workload entities from keeper"); + ASTs queries; + ParserCreateWorkloadEntity parser; + const char * begin = data.data(); /// begin of current query + const char * pos = begin; /// parser moves pos from begin to the end of current query + const char * end = begin + data.size(); + while (pos < end) + { + queries.emplace_back(parseQueryAndMovePosition(parser, pos, end, "", true, 0, DBMS_DEFAULT_MAX_PARSER_DEPTH, 
DBMS_DEFAULT_MAX_PARSER_BACKTRACKS)); + while (isWhitespaceASCII(*pos) || *pos == ';') + ++pos; + } + + /// Read and parse all SQL entities from data we just read from ZooKeeper + std::vector> new_entities; + for (const auto & query : queries) + { + LOG_TRACE(log, "Read keeper entity definition: {}", serializeAST(*query)); + if (auto * create_workload_query = query->as()) + new_entities.emplace_back(create_workload_query->getWorkloadName(), query); + else if (auto * create_resource_query = query->as()) + new_entities.emplace_back(create_resource_query->getResourceName(), query); + else + throw Exception(ErrorCodes::LOGICAL_ERROR, "Invalid workload entity query in keeper storage: {}", query->getID()); + } + + setAllEntities(new_entities); + current_version = version; + + LOG_DEBUG(log, "Workload entities refreshing is done"); +} + +} + diff --git a/src/Common/Scheduler/Workload/WorkloadEntityKeeperStorage.h b/src/Common/Scheduler/Workload/WorkloadEntityKeeperStorage.h new file mode 100644 index 00000000000..25dcd6d8c9a --- /dev/null +++ b/src/Common/Scheduler/Workload/WorkloadEntityKeeperStorage.h @@ -0,0 +1,71 @@ +#pragma once + +#include +#include +#include +#include +#include + +#include +#include + +namespace DB +{ + +/// Loads RESOURCE and WORKLOAD sql objects from Keeper. +class WorkloadEntityKeeperStorage : public WorkloadEntityStorageBase +{ +public: + WorkloadEntityKeeperStorage(const ContextPtr & global_context_, const String & zookeeper_path_); + ~WorkloadEntityKeeperStorage() override; + + bool isReplicated() const override { return true; } + String getReplicationID() const override { return zookeeper_path; } + + void loadEntities() override; + void stopWatching() override; + +private: + OperationResult storeEntityImpl( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool throw_if_exists, + bool replace_if_exists, + const Settings & settings) override; + + OperationResult removeEntityImpl( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + bool throw_if_not_exists) override; + + void processWatchQueue(); + + zkutil::ZooKeeperPtr getZooKeeper(); + + void startWatchingThread(); + void stopWatchingThread(); + + void createRootNodes(const zkutil::ZooKeeperPtr & zookeeper); + std::pair getDataAndSetWatch(const zkutil::ZooKeeperPtr & zookeeper); + void refreshEntities(const zkutil::ZooKeeperPtr & zookeeper); + + zkutil::ZooKeeperCachingGetter zookeeper_getter; + String zookeeper_path; + Int32 current_version = 0; + + ThreadFromGlobalPool watching_thread; + std::atomic watching_flag = false; + + struct WatchEvent + { + std::mutex mutex; + std::condition_variable cv; + UInt64 triggered = 0; + }; + std::shared_ptr watch; +}; + +} diff --git a/src/Common/Scheduler/Workload/WorkloadEntityStorageBase.cpp b/src/Common/Scheduler/Workload/WorkloadEntityStorageBase.cpp new file mode 100644 index 00000000000..c758111a53e --- /dev/null +++ b/src/Common/Scheduler/Workload/WorkloadEntityStorageBase.cpp @@ -0,0 +1,773 @@ +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include +#include + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int BAD_ARGUMENTS; + extern const int LOGICAL_ERROR; +} + +namespace +{ + +/// Removes details from a CREATE query to be used as workload entity definition +ASTPtr normalizeCreateWorkloadEntityQuery(const IAST & create_query) +{ + auto ptr = 
create_query.clone(); + if (auto * res = typeid_cast(ptr.get())) + { + res->if_not_exists = false; + res->or_replace = false; + } + if (auto * res = typeid_cast(ptr.get())) + { + res->if_not_exists = false; + res->or_replace = false; + } + return ptr; +} + +/// Returns a type of a workload entity `ptr` +WorkloadEntityType getEntityType(const ASTPtr & ptr) +{ + if (auto * res = typeid_cast(ptr.get()); res) + return WorkloadEntityType::Workload; + if (auto * res = typeid_cast(ptr.get()); res) + return WorkloadEntityType::Resource; + chassert(false); + return WorkloadEntityType::MAX; +} + +bool entityEquals(const ASTPtr & lhs, const ASTPtr & rhs) +{ + if (auto * a = typeid_cast(lhs.get())) + { + if (auto * b = typeid_cast(rhs.get())) + { + return std::forward_as_tuple(a->getWorkloadName(), a->getWorkloadParent(), a->changes) + == std::forward_as_tuple(b->getWorkloadName(), b->getWorkloadParent(), b->changes); + } + } + if (auto * a = typeid_cast(lhs.get())) + { + if (auto * b = typeid_cast(rhs.get())) + return std::forward_as_tuple(a->getResourceName(), a->operations) + == std::forward_as_tuple(b->getResourceName(), b->operations); + } + return false; +} + +/// Workload entities could reference each other. +/// This enum defines all possible reference types +enum class ReferenceType +{ + Parent, // Source workload references target workload as a parent + ForResource // Source workload references target resource in its `SETTINGS x = y FOR resource` clause +}; + +/// Runs a `func` callback for every reference from `source` to `target`. +/// This function is the source of truth defining what `target` references are stored in a workload `source_entity` +void forEachReference( + const ASTPtr & source_entity, + std::function func) +{ + if (auto * res = typeid_cast(source_entity.get())) + { + // Parent reference + String parent = res->getWorkloadParent(); + if (!parent.empty()) + func(parent, res->getWorkloadName(), ReferenceType::Parent); + + // References to RESOURCEs mentioned in SETTINGS clause after FOR keyword + std::unordered_set resources; + for (const auto & [name, value, resource] : res->changes) + { + if (!resource.empty()) + resources.insert(resource); + } + for (const String & resource : resources) + func(resource, res->getWorkloadName(), ReferenceType::ForResource); + } + if (auto * res = typeid_cast(source_entity.get()); res) + { + // RESOURCE has no references to be validated, we allow mentioned disks to be created later + } +} + +/// Helper for recursive DFS +void topologicallySortedWorkloadsImpl(const String & name, const ASTPtr & ast, const std::unordered_map & workloads, std::unordered_set & visited, std::vector> & sorted_workloads) +{ + if (visited.contains(name)) + return; + visited.insert(name); + + // Recurse into parent (if any) + String parent = typeid_cast(ast.get())->getWorkloadParent(); + if (!parent.empty()) + { + auto parent_iter = workloads.find(parent); + if (parent_iter == workloads.end()) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Workload metadata inconsistency: Workload '{}' parent '{}' does not exist. 
This must be fixed manually.", name, parent); + topologicallySortedWorkloadsImpl(parent, parent_iter->second, workloads, visited, sorted_workloads); + } + + sorted_workloads.emplace_back(name, ast); +} + +/// Returns pairs {worload_name, create_workload_ast} in order that respect child-parent relation (parent first, then children) +std::vector> topologicallySortedWorkloads(const std::unordered_map & workloads) +{ + std::vector> sorted_workloads; + std::unordered_set visited; + for (const auto & [name, ast] : workloads) + topologicallySortedWorkloadsImpl(name, ast, workloads, visited, sorted_workloads); + return sorted_workloads; +} + +/// Helper for recursive DFS +void topologicallySortedDependenciesImpl( + const String & name, + const std::unordered_map> & dependencies, + std::unordered_set & visited, + std::vector & result) +{ + if (visited.contains(name)) + return; + visited.insert(name); + + if (auto it = dependencies.find(name); it != dependencies.end()) + { + for (const String & dep : it->second) + topologicallySortedDependenciesImpl(dep, dependencies, visited, result); + } + + result.emplace_back(name); +} + +/// Returns nodes in topological order that respect `dependencies` (key is node name, value is set of dependencies) +std::vector topologicallySortedDependencies(const std::unordered_map> & dependencies) +{ + std::unordered_set visited; // Set to track visited nodes + std::vector result; // Result to store nodes in topologically sorted order + + // Perform DFS for each node in the graph + for (const auto & [name, _] : dependencies) + topologicallySortedDependenciesImpl(name, dependencies, visited, result); + + return result; +} + +/// Represents a change of a workload entity (WORKLOAD or RESOURCE) +struct EntityChange +{ + String name; /// Name of entity + ASTPtr before; /// Entity before change (CREATE if not set) + ASTPtr after; /// Entity after change (DROP if not set) + + std::vector toEvents() const + { + if (!after) + return {{getEntityType(before), name, {}}}; + else if (!before) + return {{getEntityType(after), name, after}}; + else + { + auto type_before = getEntityType(before); + auto type_after = getEntityType(after); + // If type changed, we have to remove an old entity and add a new one + if (type_before != type_after) + return {{type_before, name, {}}, {type_after, name, after}}; + else + return {{type_after, name, after}}; + } + } +}; + +/// Returns `changes` ordered for execution. +/// Every intemediate state during execution will be consistent (i.e. all references will be valid) +/// NOTE: It does not validate changes, any problem will be detected during execution. +/// NOTE: There will be no error if valid order does not exist. +std::vector topologicallySortedChanges(const std::vector & changes) +{ + // Construct map from entity name into entity change + std::unordered_map change_by_name; + for (const auto & change : changes) + change_by_name[change.name] = &change; + + // Construct references maps (before changes and after changes) + std::unordered_map> old_sources; // Key is target. Value is set of names of source entities. + std::unordered_map> new_targets; // Key is source. Value is set of names of target entities. 
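To make the two consistency rules enforced just below concrete, two small assumed scenarios (illustrative names):

    // Change set A: { CREATE RESOURCE io, CREATE WORKLOAD prod SETTINGS ... FOR io }
    //   Rule 2: `prod` references `io`, and `io` is also being created -> `io` is applied first.
    // Change set B: { DROP WORKLOAD all, DROP WORKLOAD prod }, where `all` is the parent of `prod`
    //   Rule 1: `all` is still referenced by `prod` -> `prod` is dropped first.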
+ for (const auto & change : changes) + { + if (change.before) + { + forEachReference(change.before, + [&] (const String & target, const String & source, ReferenceType) + { + old_sources[target].insert(source); + }); + } + if (change.after) + { + forEachReference(change.after, + [&] (const String & target, const String & source, ReferenceType) + { + new_targets[source].insert(target); + }); + } + } + + // There are consistency rules that regulate order in which changes must be applied (see below). + // Construct DAG of dependencies between changes. + std::unordered_map> dependencies; // Key is entity name. Value is set of names of entity that should be changed first. + for (const auto & change : changes) + { + dependencies.emplace(change.name, std::unordered_set{}); // Make sure we create nodes that have no dependencies + for (const auto & event : change.toEvents()) + { + if (!event.entity) // DROP + { + // Rule 1: Entity can only be removed after all existing references to it are removed as well. + for (const String & source : old_sources[event.name]) + { + if (change_by_name.contains(source)) + dependencies[event.name].insert(source); + } + } + else // CREATE || CREATE OR REPLACE + { + // Rule 2: Entity can only be created after all entities it references are created as well. + for (const String & target : new_targets[event.name]) + { + if (auto it = change_by_name.find(target); it != change_by_name.end()) + { + const EntityChange & target_change = *it->second; + // If target is creating, it should be created first. + // (But if target is updating, there is no dependency). + if (!target_change.before) + dependencies[event.name].insert(target); + } + } + } + } + } + + // Topological sort of changes to respect consistency rules + std::vector result; + for (const String & name : topologicallySortedDependencies(dependencies)) + result.push_back(*change_by_name[name]); + + return result; +} + +} + +WorkloadEntityStorageBase::WorkloadEntityStorageBase(ContextPtr global_context_) + : handlers(std::make_shared()) + , global_context(std::move(global_context_)) + , log{getLogger("WorkloadEntityStorage")} // could be overridden in derived class +{} + +ASTPtr WorkloadEntityStorageBase::get(const String & entity_name) const +{ + if (auto result = tryGet(entity_name)) + return result; + throw Exception(ErrorCodes::BAD_ARGUMENTS, + "The workload entity name '{}' is not saved", + entity_name); +} + +ASTPtr WorkloadEntityStorageBase::tryGet(const String & entity_name) const +{ + std::lock_guard lock(mutex); + + auto it = entities.find(entity_name); + if (it == entities.end()) + return nullptr; + + return it->second; +} + +bool WorkloadEntityStorageBase::has(const String & entity_name) const +{ + return tryGet(entity_name) != nullptr; +} + +std::vector WorkloadEntityStorageBase::getAllEntityNames() const +{ + std::vector entity_names; + + std::lock_guard lock(mutex); + entity_names.reserve(entities.size()); + + for (const auto & [name, _] : entities) + entity_names.emplace_back(name); + + return entity_names; +} + +std::vector WorkloadEntityStorageBase::getAllEntityNames(WorkloadEntityType entity_type) const +{ + std::vector entity_names; + + std::lock_guard lock(mutex); + for (const auto & [name, entity] : entities) + { + if (getEntityType(entity) == entity_type) + entity_names.emplace_back(name); + } + + return entity_names; +} + +bool WorkloadEntityStorageBase::empty() const +{ + std::lock_guard lock(mutex); + return entities.empty(); +} + +bool WorkloadEntityStorageBase::storeEntity( + const 
ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool throw_if_exists, + bool replace_if_exists, + const Settings & settings) +{ + if (entity_name.empty()) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity name should not be empty."); + + create_entity_query = normalizeCreateWorkloadEntityQuery(*create_entity_query); + auto * workload = typeid_cast(create_entity_query.get()); + auto * resource = typeid_cast(create_entity_query.get()); + + while (true) + { + std::unique_lock lock{mutex}; + + ASTPtr old_entity; // entity to be REPLACED + if (auto it = entities.find(entity_name); it != entities.end()) + { + if (throw_if_exists) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists", entity_name); + else if (!replace_if_exists) + return false; + else + old_entity = it->second; + } + + // Validate CREATE OR REPLACE + if (old_entity) + { + auto * old_workload = typeid_cast(old_entity.get()); + auto * old_resource = typeid_cast(old_entity.get()); + if (workload && !old_workload) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists, but it is not a workload", entity_name); + if (resource && !old_resource) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists, but it is not a resource", entity_name); + if (workload && !old_workload->hasParent() && workload->hasParent()) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "It is not allowed to remove root workload"); + } + + // Validate workload + if (workload) + { + if (!workload->hasParent()) + { + if (!root_name.empty() && root_name != workload->getWorkloadName()) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "The second root is not allowed. You should probably add 'PARENT {}' clause.", root_name); + } + + SchedulingSettings validator; + validator.updateFromChanges(workload->changes); + } + + forEachReference(create_entity_query, + [this, workload] (const String & target, const String & source, ReferenceType type) + { + if (auto it = entities.find(target); it == entities.end()) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' references another workload entity '{}' that doesn't exist", source, target); + + switch (type) + { + case ReferenceType::Parent: + { + if (typeid_cast(entities[target].get()) == nullptr) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload parent should reference another workload, not '{}'.", target); + break; + } + case ReferenceType::ForResource: + { + if (typeid_cast(entities[target].get()) == nullptr) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload settings should reference resource in FOR clause, not '{}'.", target); + + // Validate that we could parse the settings for specific resource + SchedulingSettings validator; + validator.updateFromChanges(workload->changes, target); + break; + } + } + + // Detect reference cycles. + // The only way to create a cycle is to add an edge that will be a part of a new cycle. + // We are going to add an edge: `source` -> `target`, so we ensure there is no path back `target` -> `source`. 
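The cycle check that follows walks the existing `references` map; a small assumed example of what it rejects (illustrative names, with a PARENT clause as suggested by the error message above):

    // Existing workloads: all (root), dev (parent: all), test (parent: dev)
    // Attempt: CREATE OR REPLACE WORKLOAD dev ... PARENT test
    //   The new reference would be source = dev, target = test, but `test` already references
    //   `dev` as its parent, so a path back to the source exists; the BFS in
    //   isIndirectlyReferenced() finds it and the query fails with
    //   "Workload entity cycles are not allowed".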
+ if (isIndirectlyReferenced(source, target)) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity cycles are not allowed"); + }); + + auto result = storeEntityImpl( + current_context, + entity_type, + entity_name, + create_entity_query, + throw_if_exists, + replace_if_exists, + settings); + + if (result == OperationResult::Retry) + continue; // Entities were updated, we need to rerun all the validations + + if (result == OperationResult::Ok) + { + Event event{entity_type, entity_name, create_entity_query}; + applyEvent(lock, event); + unlockAndNotify(lock, {std::move(event)}); + } + + return result == OperationResult::Ok; + } +} + +bool WorkloadEntityStorageBase::removeEntity( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + bool throw_if_not_exists) +{ + while (true) + { + std::unique_lock lock(mutex); + auto it = entities.find(entity_name); + if (it == entities.end()) + { + if (throw_if_not_exists) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' doesn't exist", entity_name); + else + return false; + } + + if (auto reference_it = references.find(entity_name); reference_it != references.end()) + { + String names; + for (const String & name : reference_it->second) + names += " " + name; + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' cannot be dropped. It is referenced by:{}", entity_name, names); + } + + auto result = removeEntityImpl( + current_context, + entity_type, + entity_name, + throw_if_not_exists); + + if (result == OperationResult::Retry) + continue; // Entities were updated, we need to rerun all the validations + + if (result == OperationResult::Ok) + { + Event event{entity_type, entity_name, {}}; + applyEvent(lock, event); + unlockAndNotify(lock, {std::move(event)}); + } + + return result == OperationResult::Ok; + } +} + +scope_guard WorkloadEntityStorageBase::getAllEntitiesAndSubscribe(const OnChangedHandler & handler) +{ + scope_guard result; + + std::vector current_state; + { + std::lock_guard lock{mutex}; + current_state = orderEntities(entities); + + std::lock_guard lock2{handlers->mutex}; + handlers->list.push_back(handler); + auto handler_it = std::prev(handlers->list.end()); + result = [my_handlers = handlers, handler_it] + { + std::lock_guard lock3{my_handlers->mutex}; + my_handlers->list.erase(handler_it); + }; + } + + // When you subscribe you get all the entities back to your handler immediately if already loaded, or later when loaded + handler(current_state); + + return result; +} + +void WorkloadEntityStorageBase::unlockAndNotify( + std::unique_lock & lock, + std::vector tx) +{ + if (tx.empty()) + return; + + std::vector current_handlers; + { + std::lock_guard handlers_lock{handlers->mutex}; + boost::range::copy(handlers->list, std::back_inserter(current_handlers)); + } + + lock.unlock(); + + for (const auto & handler : current_handlers) + { + try + { + handler(tx); + } + catch (...) 
+ { + tryLogCurrentException(__PRETTY_FUNCTION__); + } + } +} + +std::unique_lock WorkloadEntityStorageBase::getLock() const +{ + return std::unique_lock{mutex}; +} + +void WorkloadEntityStorageBase::setAllEntities(const std::vector> & raw_new_entities) +{ + std::unordered_map new_entities; + for (const auto & [entity_name, create_query] : raw_new_entities) + new_entities[entity_name] = normalizeCreateWorkloadEntityQuery(*create_query); + + std::unique_lock lock(mutex); + + // Fill vector of `changes` based on difference between current `entities` and `new_entities` + std::vector changes; + for (const auto & [entity_name, entity] : entities) + { + if (auto it = new_entities.find(entity_name); it != new_entities.end()) + { + if (!entityEquals(entity, it->second)) + { + changes.emplace_back(entity_name, entity, it->second); // Update entities that are present in both `new_entities` and `entities` + LOG_TRACE(log, "Workload entity {} was updated", entity_name); + } + else + LOG_TRACE(log, "Workload entity {} is the same", entity_name); + } + else + { + changes.emplace_back(entity_name, entity, ASTPtr{}); // Remove entities that are not present in `new_entities` + LOG_TRACE(log, "Workload entity {} was dropped", entity_name); + } + } + for (const auto & [entity_name, entity] : new_entities) + { + if (!entities.contains(entity_name)) + { + changes.emplace_back(entity_name, ASTPtr{}, entity); // Create entities that are only present in `new_entities` + LOG_TRACE(log, "Workload entity {} was created", entity_name); + } + } + + // Sort `changes` to respect consistency of references and apply them one by one. + std::vector tx; + for (const auto & change : topologicallySortedChanges(changes)) + { + for (const auto & event : change.toEvents()) + { + // TODO(serxa): do validation and throw LOGICAL_ERROR if failed + applyEvent(lock, event); + tx.push_back(event); + } + } + + // Notify subscribers + unlockAndNotify(lock, tx); +} + +void WorkloadEntityStorageBase::applyEvent( + std::unique_lock &, + const Event & event) +{ + if (event.entity) // CREATE || CREATE OR REPLACE + { + LOG_DEBUG(log, "Create or replace workload entity: {}", serializeAST(*event.entity)); + + auto * workload = typeid_cast(event.entity.get()); + + // Validate workload + if (workload && !workload->hasParent()) + root_name = workload->getWorkloadName(); + + // Remove references of a replaced entity (only for CREATE OR REPLACE) + if (auto it = entities.find(event.name); it != entities.end()) + removeReferences(it->second); + + // Insert references of created entity + insertReferences(event.entity); + + // Store in memory + entities[event.name] = event.entity; + } + else // DROP + { + auto it = entities.find(event.name); + chassert(it != entities.end()); + + LOG_DEBUG(log, "Drop workload entity: {}", event.name); + + if (event.name == root_name) + root_name.clear(); + + // Clean up references + removeReferences(it->second); + + // Remove from memory + entities.erase(it); + } +} + +std::vector> WorkloadEntityStorageBase::getAllEntities() const +{ + std::lock_guard lock{mutex}; + std::vector> all_entities; + all_entities.reserve(entities.size()); + std::copy(entities.begin(), entities.end(), std::back_inserter(all_entities)); + return all_entities; +} + +bool WorkloadEntityStorageBase::isIndirectlyReferenced(const String & target, const String & source) +{ + std::queue bfs; + std::unordered_set visited; + visited.insert(target); + bfs.push(target); + while (!bfs.empty()) + { + String current = bfs.front(); + bfs.pop(); + if (current == 
source) + return true; + if (auto it = references.find(current); it != references.end()) + { + for (const String & node : it->second) + { + if (visited.contains(node)) + continue; + visited.insert(node); + bfs.push(node); + } + } + } + return false; +} + +void WorkloadEntityStorageBase::insertReferences(const ASTPtr & entity) +{ + if (!entity) + return; + forEachReference(entity, + [this] (const String & target, const String & source, ReferenceType) + { + references[target].insert(source); + }); +} + +void WorkloadEntityStorageBase::removeReferences(const ASTPtr & entity) +{ + if (!entity) + return; + forEachReference(entity, + [this] (const String & target, const String & source, ReferenceType) + { + references[target].erase(source); + if (references[target].empty()) + references.erase(target); + }); +} + +std::vector WorkloadEntityStorageBase::orderEntities( + const std::unordered_map & all_entities, + std::optional change) +{ + std::vector result; + + std::unordered_map workloads; + for (const auto & [entity_name, ast] : all_entities) + { + if (typeid_cast(ast.get())) + { + if (change && change->name == entity_name) + continue; // Skip this workload if it is removed or updated + workloads.emplace(entity_name, ast); + } + else if (typeid_cast(ast.get())) + { + if (change && change->name == entity_name) + continue; // Skip this resource if it is removed or updated + // Resources should go first because workloads could reference them + result.emplace_back(WorkloadEntityType::Resource, entity_name, ast); + } + else + throw Exception(ErrorCodes::LOGICAL_ERROR, "Invalid workload entity type '{}'", ast->getID()); + } + + // Introduce new entity described by `change` + if (change && change->entity) + { + if (change->type == WorkloadEntityType::Workload) + workloads.emplace(change->name, change->entity); + else if (change->type == WorkloadEntityType::Resource) + result.emplace_back(WorkloadEntityType::Resource, change->name, change->entity); + } + + // Workloads should go in an order such that children are enlisted only after its parent + for (auto & [entity_name, ast] : topologicallySortedWorkloads(workloads)) + result.emplace_back(WorkloadEntityType::Workload, entity_name, ast); + + return result; +} + +String WorkloadEntityStorageBase::serializeAllEntities(std::optional change) +{ + std::unique_lock lock; + auto ordered_entities = orderEntities(entities, change); + WriteBufferFromOwnString buf; + for (const auto & event : ordered_entities) + { + formatAST(*event.entity, buf, false, true); + buf.write(";\n", 2); + } + return buf.str(); +} + +} diff --git a/src/Common/Scheduler/Workload/WorkloadEntityStorageBase.h b/src/Common/Scheduler/Workload/WorkloadEntityStorageBase.h new file mode 100644 index 00000000000..d57bf8201b3 --- /dev/null +++ b/src/Common/Scheduler/Workload/WorkloadEntityStorageBase.h @@ -0,0 +1,126 @@ +#pragma once + +#include +#include +#include +#include + +#include +#include + +#include + +namespace DB +{ + +class WorkloadEntityStorageBase : public IWorkloadEntityStorage +{ +public: + explicit WorkloadEntityStorageBase(ContextPtr global_context_); + ASTPtr get(const String & entity_name) const override; + + ASTPtr tryGet(const String & entity_name) const override; + + bool has(const String & entity_name) const override; + + std::vector getAllEntityNames() const override; + std::vector getAllEntityNames(WorkloadEntityType entity_type) const override; + + std::vector> getAllEntities() const override; + + bool empty() const override; + + bool storeEntity( + const ContextPtr & 
current_context, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool throw_if_exists, + bool replace_if_exists, + const Settings & settings) override; + + bool removeEntity( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + bool throw_if_not_exists) override; + + scope_guard getAllEntitiesAndSubscribe( + const OnChangedHandler & handler) override; + +protected: + enum class OperationResult + { + Ok, + Failed, + Retry + }; + + virtual OperationResult storeEntityImpl( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + ASTPtr create_entity_query, + bool throw_if_exists, + bool replace_if_exists, + const Settings & settings) = 0; + + virtual OperationResult removeEntityImpl( + const ContextPtr & current_context, + WorkloadEntityType entity_type, + const String & entity_name, + bool throw_if_not_exists) = 0; + + std::unique_lock getLock() const; + + /// Replace current `entities` with `new_entities` and notifies subscribers. + /// Note that subscribers will be notified with a sequence of events. + /// It is guaranteed that all itermediate states (between every pair of consecutive events) + /// will be consistent (all references between entities will be valid) + void setAllEntities(const std::vector> & new_entities); + + /// Serialize `entities` stored in memory plus one optional `change` into multiline string + String serializeAllEntities(std::optional change = {}); + +private: + /// Change state in memory + void applyEvent(std::unique_lock & lock, const Event & event); + + /// Notify subscribers about changes describe by vector of events `tx` + void unlockAndNotify(std::unique_lock & lock, std::vector tx); + + /// Return true iff `references` has a path from `source` to `target` + bool isIndirectlyReferenced(const String & target, const String & source); + + /// Adds references that are described by `entity` to `references` + void insertReferences(const ASTPtr & entity); + + /// Removes references that are described by `entity` from `references` + void removeReferences(const ASTPtr & entity); + + /// Returns an ordered vector of `entities` + std::vector orderEntities( + const std::unordered_map & all_entities, + std::optional change = {}); + + struct Handlers + { + std::mutex mutex; + std::list list; + }; + /// shared_ptr is here for safety because WorkloadEntityStorageBase can be destroyed before all subscriptions are removed. + std::shared_ptr handlers; + + mutable std::recursive_mutex mutex; + std::unordered_map entities; /// Maps entity name into CREATE entity query + + // Validation + std::unordered_map> references; /// Keep track of references between entities. Key is target. 
Value is set of sources + String root_name; /// current root workload name + +protected: + ContextPtr global_context; + LoggerPtr log; +}; + +} diff --git a/src/Common/Scheduler/Workload/createWorkloadEntityStorage.cpp b/src/Common/Scheduler/Workload/createWorkloadEntityStorage.cpp new file mode 100644 index 00000000000..5dc1265e31d --- /dev/null +++ b/src/Common/Scheduler/Workload/createWorkloadEntityStorage.cpp @@ -0,0 +1,45 @@ +#include +#include +#include +#include +#include +#include +#include + +namespace fs = std::filesystem; + + +namespace DB +{ + + +namespace ErrorCodes +{ + extern const int INVALID_CONFIG_PARAMETER; +} + +std::unique_ptr createWorkloadEntityStorage(const ContextMutablePtr & global_context) +{ + const String zookeeper_path_key = "workload_zookeeper_path"; + const String disk_path_key = "workload_path"; + + const auto & config = global_context->getConfigRef(); + if (config.has(zookeeper_path_key)) + { + if (config.has(disk_path_key)) + { + throw Exception( + ErrorCodes::INVALID_CONFIG_PARAMETER, + "'{}' and '{}' must not be both specified in the config", + zookeeper_path_key, + disk_path_key); + } + return std::make_unique(global_context, config.getString(zookeeper_path_key)); + } + + String default_path = fs::path{global_context->getPath()} / "workload" / ""; + String path = config.getString(disk_path_key, default_path); + return std::make_unique(global_context, path); +} + +} diff --git a/src/Common/Scheduler/Workload/createWorkloadEntityStorage.h b/src/Common/Scheduler/Workload/createWorkloadEntityStorage.h new file mode 100644 index 00000000000..936e1275010 --- /dev/null +++ b/src/Common/Scheduler/Workload/createWorkloadEntityStorage.h @@ -0,0 +1,11 @@ +#pragma once + +#include +#include + +namespace DB +{ + +std::unique_ptr createWorkloadEntityStorage(const ContextMutablePtr & global_context); + +} diff --git a/src/Common/Scheduler/createResourceManager.cpp b/src/Common/Scheduler/createResourceManager.cpp new file mode 100644 index 00000000000..fd9743dbf72 --- /dev/null +++ b/src/Common/Scheduler/createResourceManager.cpp @@ -0,0 +1,104 @@ +#include +#include +#include +#include +#include + +#include +#include + + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int RESOURCE_ACCESS_DENIED; +} + +class ResourceManagerDispatcher : public IResourceManager +{ +private: + class Classifier : public IClassifier + { + public: + void addClassifier(const ClassifierPtr & classifier) + { + classifiers.push_back(classifier); + } + + bool has(const String & resource_name) override + { + for (const auto & classifier : classifiers) + { + if (classifier->has(resource_name)) + return true; + } + return false; + } + + ResourceLink get(const String & resource_name) override + { + for (auto & classifier : classifiers) + { + if (classifier->has(resource_name)) + return classifier->get(resource_name); + } + throw Exception(ErrorCodes::RESOURCE_ACCESS_DENIED, "Access denied to resource '{}'", resource_name); + } + private: + std::vector classifiers; // should be constant after initialization to avoid races + }; + +public: + void addManager(const ResourceManagerPtr & manager) + { + managers.push_back(manager); + } + + void updateConfiguration(const Poco::Util::AbstractConfiguration & config) override + { + for (auto & manager : managers) + manager->updateConfiguration(config); + } + + bool hasResource(const String & resource_name) const override + { + for (const auto & manager : managers) + { + if (manager->hasResource(resource_name)) + return true; + } + return false; + } + 
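+ // acquire() below returns a composite Classifier: every wrapped manager contributes its own
+ // classifier, and a resource is resolved through whichever manager knows it. A minimal usage
+ // sketch, given some ResourceManagerPtr resource_manager ("production" and "network_read" are
+ // illustrative names, not names used by this patch):
+ //
+ //     ClassifierPtr classifier = resource_manager->acquire("production");
+ //     if (classifier->has("network_read"))
+ //         ResourceLink link = classifier->get("network_read");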
+ ClassifierPtr acquire(const String & workload_name) override + { + auto classifier = std::make_shared(); + for (const auto & manager : managers) + classifier->addClassifier(manager->acquire(workload_name)); + return classifier; + } + + void forEachNode(VisitorFunc visitor) override + { + for (const auto & manager : managers) + manager->forEachNode(visitor); + } + +private: + std::vector managers; // Should be constant after initialization to avoid races +}; + +ResourceManagerPtr createResourceManager(const ContextMutablePtr & global_context) +{ + auto dispatcher = std::make_shared(); + + // NOTE: if the same resource is described by both managers, then manager added earlier will be used. + dispatcher->addManager(std::make_shared()); + dispatcher->addManager(std::make_shared(global_context->getWorkloadEntityStorage())); + + return dispatcher; +} + +} diff --git a/src/Common/Scheduler/createResourceManager.h b/src/Common/Scheduler/createResourceManager.h new file mode 100644 index 00000000000..d80a17f3bff --- /dev/null +++ b/src/Common/Scheduler/createResourceManager.h @@ -0,0 +1,11 @@ +#pragma once + +#include +#include + +namespace DB +{ + +ResourceManagerPtr createResourceManager(const ContextMutablePtr & global_context); + +} diff --git a/src/Common/ThreadStatus.cpp b/src/Common/ThreadStatus.cpp index e38d3480664..268d97e62ef 100644 --- a/src/Common/ThreadStatus.cpp +++ b/src/Common/ThreadStatus.cpp @@ -78,7 +78,7 @@ ThreadStatus::ThreadStatus(bool check_current_thread_on_destruction_) last_rusage = std::make_unique(); - memory_tracker.setDescription("(for thread)"); + memory_tracker.setDescription("Thread"); log = getLogger("ThreadStatus"); current_thread = this; diff --git a/src/Common/ZooKeeper/ZooKeeperArgs.cpp b/src/Common/ZooKeeper/ZooKeeperArgs.cpp index cdc9a1afe4c..c488d829b9d 100644 --- a/src/Common/ZooKeeper/ZooKeeperArgs.cpp +++ b/src/Common/ZooKeeper/ZooKeeperArgs.cpp @@ -176,6 +176,10 @@ void ZooKeeperArgs::initFromKeeperSection(const Poco::Util::AbstractConfiguratio { connection_timeout_ms = config.getInt(config_name + "." + key); } + else if (key == "num_connection_retries") + { + num_connection_retries = config.getInt(config_name + "." + key); + } else if (key == "enable_fault_injections_during_startup") { enable_fault_injections_during_startup = config.getBool(config_name + "." + key); diff --git a/src/Common/ZooKeeper/ZooKeeperArgs.h b/src/Common/ZooKeeper/ZooKeeperArgs.h index 3754c2f7aac..e790e578808 100644 --- a/src/Common/ZooKeeper/ZooKeeperArgs.h +++ b/src/Common/ZooKeeper/ZooKeeperArgs.h @@ -39,6 +39,7 @@ struct ZooKeeperArgs String sessions_path = "/clickhouse/sessions"; String client_availability_zone; int32_t connection_timeout_ms = Coordination::DEFAULT_CONNECTION_TIMEOUT_MS; + UInt64 num_connection_retries = 2; int32_t session_timeout_ms = Coordination::DEFAULT_SESSION_TIMEOUT_MS; int32_t operation_timeout_ms = Coordination::DEFAULT_OPERATION_TIMEOUT_MS; bool enable_fault_injections_during_startup = false; diff --git a/src/Common/ZooKeeper/ZooKeeperImpl.cpp b/src/Common/ZooKeeper/ZooKeeperImpl.cpp index 173f37c3454..7b027f48d4b 100644 --- a/src/Common/ZooKeeper/ZooKeeperImpl.cpp +++ b/src/Common/ZooKeeper/ZooKeeperImpl.cpp @@ -440,7 +440,9 @@ void ZooKeeper::connect( if (nodes.empty()) throw Exception::fromMessage(Error::ZBADARGUMENTS, "No nodes passed to ZooKeeper constructor"); - static constexpr size_t num_tries = 3; + /// We always have at least one attempt to connect. 
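+ /// With the default num_connection_retries = 2 (see ZooKeeperArgs.h) this preserves the previous
+ /// behaviour of three attempts; the count can now be tuned via the new num_connection_retries key,
+ /// which is read in ZooKeeperArgs::initFromKeeperSection alongside connection_timeout_ms.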
+ size_t num_tries = args.num_connection_retries + 1; + bool connected = false; bool dns_error = false; diff --git a/src/Common/ZooKeeper/ZooKeeperRetries.h b/src/Common/ZooKeeper/ZooKeeperRetries.h index b5b03971385..acea521a7ce 100644 --- a/src/Common/ZooKeeper/ZooKeeperRetries.h +++ b/src/Common/ZooKeeper/ZooKeeperRetries.h @@ -15,14 +15,15 @@ namespace ErrorCodes struct ZooKeeperRetriesInfo { + ZooKeeperRetriesInfo() = default; ZooKeeperRetriesInfo(UInt64 max_retries_, UInt64 initial_backoff_ms_, UInt64 max_backoff_ms_) : max_retries(max_retries_), initial_backoff_ms(std::min(initial_backoff_ms_, max_backoff_ms_)), max_backoff_ms(max_backoff_ms_) { } - UInt64 max_retries; - UInt64 initial_backoff_ms; - UInt64 max_backoff_ms; + UInt64 max_retries = 0; /// "max_retries = 0" means only one attempt. + UInt64 initial_backoff_ms = 100; + UInt64 max_backoff_ms = 5000; }; class ZooKeeperRetriesControl @@ -220,6 +221,7 @@ private: return false; } + /// Check if the query was cancelled. if (process_list_element) process_list_element->checkTimeLimit(); @@ -228,6 +230,10 @@ private: sleepForMilliseconds(current_backoff_ms); current_backoff_ms = std::min(current_backoff_ms * 2, retries_info.max_backoff_ms); + /// Check if the query was cancelled again after sleeping. + if (process_list_element) + process_list_element->checkTimeLimit(); + return true; } diff --git a/src/Core/BaseSettings.cpp b/src/Core/BaseSettings.cpp index c535b9ce65e..2cce94f9d0a 100644 --- a/src/Core/BaseSettings.cpp +++ b/src/Core/BaseSettings.cpp @@ -8,6 +8,7 @@ namespace DB { namespace ErrorCodes { + extern const int INCORRECT_DATA; extern const int UNKNOWN_SETTING; } @@ -31,11 +32,19 @@ void BaseSettingsHelpers::writeFlags(Flags flags, WriteBuffer & out) } -BaseSettingsHelpers::Flags BaseSettingsHelpers::readFlags(ReadBuffer & in) +UInt64 BaseSettingsHelpers::readFlags(ReadBuffer & in) { UInt64 res; readVarUInt(res, in); - return static_cast(res); + return res; +} + +SettingsTierType BaseSettingsHelpers::getTier(UInt64 flags) +{ + int8_t tier = static_cast(flags & Flags::TIER); + if (tier > SettingsTierType::BETA) + throw Exception(ErrorCodes::INCORRECT_DATA, "Unknown tier value: '{}'", tier); + return SettingsTierType{tier}; } diff --git a/src/Core/BaseSettings.h b/src/Core/BaseSettings.h index 2a2e0bb334e..949b884636f 100644 --- a/src/Core/BaseSettings.h +++ b/src/Core/BaseSettings.h @@ -2,6 +2,7 @@ #include #include +#include #include #include #include @@ -21,6 +22,27 @@ namespace DB class ReadBuffer; class WriteBuffer; +struct BaseSettingsHelpers +{ + [[noreturn]] static void throwSettingNotFound(std::string_view name); + static void warningSettingNotFound(std::string_view name); + + static void writeString(std::string_view str, WriteBuffer & out); + static String readString(ReadBuffer & in); + + enum Flags : UInt64 + { + IMPORTANT = 0x01, + CUSTOM = 0x02, + TIER = 0x0c, /// 0b1100 == 2 bits + /// If adding new flags, consider first if Tier might need more bits + }; + + static SettingsTierType getTier(UInt64 flags); + static void writeFlags(Flags flags, WriteBuffer & out); + static UInt64 readFlags(ReadBuffer & in); +}; + /** Template class to define collections of settings. 
* If you create a new setting, please also add it to ./utils/check-style/check-settings-style * for validation @@ -138,7 +160,7 @@ public: const char * getTypeName() const; const char * getDescription() const; bool isCustom() const; - bool isObsolete() const; + SettingsTierType getTier() const; bool operator==(const SettingFieldRef & other) const { return (getName() == other.getName()) && (getValue() == other.getValue()); } bool operator!=(const SettingFieldRef & other) const { return !(*this == other); } @@ -225,24 +247,6 @@ private: std::conditional_t custom_settings_map; }; -struct BaseSettingsHelpers -{ - [[noreturn]] static void throwSettingNotFound(std::string_view name); - static void warningSettingNotFound(std::string_view name); - - static void writeString(std::string_view str, WriteBuffer & out); - static String readString(ReadBuffer & in); - - enum Flags : UInt64 - { - IMPORTANT = 0x01, - CUSTOM = 0x02, - OBSOLETE = 0x04, - }; - static void writeFlags(Flags flags, WriteBuffer & out); - static Flags readFlags(ReadBuffer & in); -}; - template void BaseSettings::set(std::string_view name, const Field & value) { @@ -477,7 +481,7 @@ void BaseSettings::read(ReadBuffer & in, SettingsWriteFormat format) size_t index = accessor.find(name); using Flags = BaseSettingsHelpers::Flags; - Flags flags{0}; + UInt64 flags{0}; if (format >= SettingsWriteFormat::STRINGS_WITH_FLAGS) flags = BaseSettingsHelpers::readFlags(in); bool is_important = (flags & Flags::IMPORTANT); @@ -797,14 +801,14 @@ bool BaseSettings::SettingFieldRef::isCustom() const } template -bool BaseSettings::SettingFieldRef::isObsolete() const +SettingsTierType BaseSettings::SettingFieldRef::getTier() const { if constexpr (Traits::allow_custom_settings) { if (custom_setting) - return false; + return SettingsTierType::PRODUCTION; } - return accessor->isObsolete(index); + return accessor->getTier(index); } using AliasMap = std::unordered_map; @@ -835,8 +839,8 @@ using AliasMap = std::unordered_map; const String & getName(size_t index) const { return field_infos[index].name; } \ const char * getTypeName(size_t index) const { return field_infos[index].type; } \ const char * getDescription(size_t index) const { return field_infos[index].description; } \ - bool isImportant(size_t index) const { return field_infos[index].is_important; } \ - bool isObsolete(size_t index) const { return field_infos[index].is_obsolete; } \ + bool isImportant(size_t index) const { return field_infos[index].flags & BaseSettingsHelpers::Flags::IMPORTANT; } \ + SettingsTierType getTier(size_t index) const { return BaseSettingsHelpers::getTier(field_infos[index].flags); } \ Field castValueUtil(size_t index, const Field & value) const { return field_infos[index].cast_value_util_function(value); } \ String valueToStringUtil(size_t index, const Field & value) const { return field_infos[index].value_to_string_util_function(value); } \ Field stringToValueUtil(size_t index, const String & str) const { return field_infos[index].string_to_value_util_function(str); } \ @@ -856,8 +860,7 @@ using AliasMap = std::unordered_map; String name; \ const char * type; \ const char * description; \ - bool is_important; \ - bool is_obsolete; \ + UInt64 flags; \ Field (*cast_value_util_function)(const Field &); \ String (*value_to_string_util_function)(const Field &); \ Field (*string_to_value_util_function)(const String &); \ @@ -968,8 +971,8 @@ struct DefineAliases /// NOLINTNEXTLINE #define IMPLEMENT_SETTINGS_TRAITS_(TYPE, NAME, DEFAULT, DESCRIPTION, FLAGS) \ 
res.field_infos.emplace_back( \ - FieldInfo{#NAME, #TYPE, DESCRIPTION, (FLAGS) & IMPORTANT, \ - static_cast((FLAGS) & BaseSettingsHelpers::Flags::OBSOLETE), \ + FieldInfo{#NAME, #TYPE, DESCRIPTION, \ + static_cast(FLAGS), \ [](const Field & value) -> Field { return static_cast(SettingField##TYPE{value}); }, \ [](const Field & value) -> String { return SettingField##TYPE{value}.toString(); }, \ [](const String & str) -> Field { SettingField##TYPE temp; temp.parseFromString(str); return static_cast(temp); }, \ diff --git a/src/Core/ServerSettings.cpp b/src/Core/ServerSettings.cpp index 7c2cb49a2ba..d573377fc8b 100644 --- a/src/Core/ServerSettings.cpp +++ b/src/Core/ServerSettings.cpp @@ -58,6 +58,7 @@ namespace DB DECLARE(Double, cannot_allocate_thread_fault_injection_probability, 0, "For testing purposes.", 0) \ DECLARE(Int32, max_connections, 1024, "Max server connections.", 0) \ DECLARE(UInt32, asynchronous_metrics_update_period_s, 1, "Period in seconds for updating asynchronous metrics.", 0) \ + DECLARE(Bool, asynchronous_metrics_enable_heavy_metrics, false, "Enable the calculation of heavy asynchronous metrics.", 0) \ DECLARE(UInt32, asynchronous_heavy_metrics_update_period_s, 120, "Period in seconds for updating heavy asynchronous metrics.", 0) \ DECLARE(String, default_database, "default", "Default database name.", 0) \ DECLARE(String, tmp_policy, "", "Policy for storage with temporary data.", 0) \ @@ -99,6 +100,7 @@ namespace DB DECLARE(String, mark_cache_policy, DEFAULT_MARK_CACHE_POLICY, "Mark cache policy name.", 0) \ DECLARE(UInt64, mark_cache_size, DEFAULT_MARK_CACHE_MAX_SIZE, "Size of cache for marks (index of MergeTree family of tables).", 0) \ DECLARE(Double, mark_cache_size_ratio, DEFAULT_MARK_CACHE_SIZE_RATIO, "The size of the protected queue in the mark cache relative to the cache's total size.", 0) \ + DECLARE(Double, mark_cache_prewarm_ratio, 0.95, "The ratio of total size of mark cache to fill during prewarm.", 0) \ DECLARE(String, index_uncompressed_cache_policy, DEFAULT_INDEX_UNCOMPRESSED_CACHE_POLICY, "Secondary index uncompressed cache policy name.", 0) \ DECLARE(UInt64, index_uncompressed_cache_size, DEFAULT_INDEX_UNCOMPRESSED_CACHE_MAX_SIZE, "Size of cache for uncompressed blocks of secondary indices. Zero means disabled.", 0) \ DECLARE(Double, index_uncompressed_cache_size_ratio, DEFAULT_INDEX_UNCOMPRESSED_CACHE_SIZE_RATIO, "The size of the protected queue in the secondary index uncompressed cache relative to the cache's total size.", 0) \ @@ -147,6 +149,7 @@ namespace DB DECLARE(UInt64, tables_loader_foreground_pool_size, 0, "The maximum number of threads that will be used for foreground (that is being waited for by a query) loading of tables. Also used for synchronous loading of tables before the server start. Zero means use all CPUs.", 0) \ DECLARE(UInt64, tables_loader_background_pool_size, 0, "The maximum number of threads that will be used for background async loading of tables. Zero means use all CPUs.", 0) \ DECLARE(Bool, async_load_databases, false, "Enable asynchronous loading of databases and tables to speedup server startup. Queries to not yet loaded entity will be blocked until load is finished.", 0) \ + DECLARE(Bool, async_load_system_database, false, "Enable asynchronous loading of system tables that are not required on server startup. 
Queries to not yet loaded tables will be blocked until load is finished.", 0) \ DECLARE(Bool, display_secrets_in_show_and_select, false, "Allow showing secrets in SHOW and SELECT queries via a format setting and a grant", 0) \ DECLARE(Seconds, keep_alive_timeout, DEFAULT_HTTP_KEEP_ALIVE_TIMEOUT, "The number of seconds that ClickHouse waits for incoming requests before closing the connection.", 0) \ DECLARE(UInt64, max_keep_alive_requests, 10000, "The maximum number of requests handled via a single http keepalive connection before the server closes this connection.", 0) \ @@ -181,7 +184,6 @@ namespace DB DECLARE(String, merge_workload, "default", "Name of workload to be used to access resources for all merges (may be overridden by a merge tree setting)", 0) \ DECLARE(String, mutation_workload, "default", "Name of workload to be used to access resources for all mutations (may be overridden by a merge tree setting)", 0) \ DECLARE(Bool, prepare_system_log_tables_on_startup, false, "If true, ClickHouse creates all configured `system.*_log` tables before the startup. It can be helpful if some startup scripts depend on these tables.", 0) \ - DECLARE(Double, gwp_asan_force_sample_probability, 0.0003, "Probability that an allocation from specific places will be sampled by GWP Asan (i.e. PODArray allocations)", 0) \ DECLARE(UInt64, config_reload_interval_ms, 2000, "How often clickhouse will reload config and check for new changes", 0) \ DECLARE(UInt64, memory_worker_period_ms, 0, "Tick period of background memory worker which corrects memory tracker memory usages and cleans up unused pages during higher memory usage. If set to 0, default value will be used depending on the memory usage source", 0) \ DECLARE(Bool, disable_insertion_and_mutation, false, "Disable all insert/alter/delete queries. This setting will be enabled if someone needs read-only nodes to prevent insertion and mutation affect reading performance.", 0) \ @@ -190,6 +192,13 @@ namespace DB DECLARE(UInt64, parts_killer_pool_size, 128, "Threads for cleanup of shared merge tree outdated threads. Only available in ClickHouse Cloud", 0) \ DECLARE(UInt64, keeper_multiread_batch_size, 10'000, "Maximum size of batch for MultiRead request to [Zoo]Keeper that support batching. If set to 0, batching is disabled. Available only in ClickHouse Cloud.", 0) \ DECLARE(Bool, use_legacy_mongodb_integration, true, "Use the legacy MongoDB integration implementation. Note: it's highly recommended to set this option to false, since legacy implementation will be removed in the future. 
Please submit any issues you encounter with the new implementation.", 0) \ + \ + DECLARE(UInt64, prefetch_threadpool_pool_size, 100, "Size of background pool for prefetches for remote object storages", 0) \ + DECLARE(UInt64, prefetch_threadpool_queue_size, 1000000, "Number of tasks which is possible to push into prefetches pool", 0) \ + DECLARE(UInt64, load_marks_threadpool_pool_size, 50, "Size of background pool for marks loading", 0) \ + DECLARE(UInt64, load_marks_threadpool_queue_size, 1000000, "Number of tasks which is possible to push into prefetches pool", 0) \ + DECLARE(UInt64, threadpool_writer_pool_size, 100, "Size of background pool for write requests to object storages", 0) \ + DECLARE(UInt64, threadpool_writer_queue_size, 1000000, "Number of tasks which is possible to push into background pool for write requests to object storages", 0) /// If you add a setting which can be updated at runtime, please update 'changeable_settings' map in dumpToSystemServerSettingsColumns below @@ -337,7 +346,7 @@ void ServerSettings::dumpToSystemServerSettingsColumns(ServerSettingColumnsParam res_columns[4]->insert(setting.getDescription()); res_columns[5]->insert(setting.getTypeName()); res_columns[6]->insert(is_changeable ? changeable_settings_it->second.second : ChangeableWithoutRestart::No); - res_columns[7]->insert(setting.isObsolete()); + res_columns[7]->insert(setting.getTier() == SettingsTierType::OBSOLETE); } } } diff --git a/src/Core/Settings.cpp b/src/Core/Settings.cpp index 217be639a12..4f1d7d0469c 100644 --- a/src/Core/Settings.cpp +++ b/src/Core/Settings.cpp @@ -1,7 +1,5 @@ -#include #include #include -#include #include #include #include @@ -40,10 +38,17 @@ namespace ErrorCodes * Note: as an alternative, we could implement settings to be completely dynamic in the form of the map: String -> Field, * but we are not going to do it, because settings are used everywhere as static struct fields. * - * `flags` can be either 0 or IMPORTANT. - * A setting is "IMPORTANT" if it affects the results of queries and can't be ignored by older versions. + * `flags` can include a Tier (BETA | EXPERIMENTAL) and an optional bitwise AND with IMPORTANT. + * The default (0) means a PRODUCTION ready setting * - * When adding new or changing existing settings add them to the settings changes history in SettingsChangesHistory.h + * A setting is "IMPORTANT" if it affects the results of queries and can't be ignored by older versions. + * Tiers: + * EXPERIMENTAL: The feature is in active development stage. Mostly for developers or for ClickHouse enthusiasts. + * BETA: There are no known bugs problems in the functionality, but the outcome of using it together with other + * features/components is unknown and correctness is not guaranteed. + * PRODUCTION (Default): The feature is safe to use along with other features from the PRODUCTION tier. + * + * When adding new or changing existing settings add them to the settings changes history in SettingsChangesHistory.cpp * for tracking settings changes in different versions and for special `compatibility` settings to work correctly. 
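+ *
+ * For example, `DECLARE(Bool, allow_experimental_analyzer, true, "...", IMPORTANT | BETA)` keeps the
+ * importance bit next to the tier: the tier occupies two dedicated bits of the flags word
+ * (BaseSettingsHelpers::Flags::TIER == 0b1100), so BaseSettingsHelpers::getTier(flags) recovers BETA
+ * while the IMPORTANT bit stays intact.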
*/ @@ -437,6 +442,9 @@ Enables or disables creating a new file on each insert in azure engine tables )", 0) \ DECLARE(Bool, s3_check_objects_after_upload, false, R"( Check each uploaded object to s3 with head request to be sure that upload was successful +)", 0) \ + DECLARE(Bool, azure_check_objects_after_upload, false, R"( +Check each uploaded object in azure blob storage to be sure that upload was successful )", 0) \ DECLARE(Bool, s3_allow_parallel_part_upload, true, R"( Use multiple threads for s3 multipart upload. It may lead to slightly higher memory usage @@ -2657,29 +2665,44 @@ The maximum amount of data consumed by temporary files on disk in bytes for all The maximum amount of data consumed by temporary files on disk in bytes for all concurrently running queries. Zero means unlimited. )", 0)\ \ - DECLARE(UInt64, backup_restore_keeper_max_retries, 20, R"( -Max retries for keeper operations during backup or restore + DECLARE(UInt64, backup_restore_keeper_max_retries, 1000, R"( +Max retries for [Zoo]Keeper operations in the middle of a BACKUP or RESTORE operation. +Should be big enough so the whole operation won't fail because of a temporary [Zoo]Keeper failure. )", 0) \ DECLARE(UInt64, backup_restore_keeper_retry_initial_backoff_ms, 100, R"( Initial backoff timeout for [Zoo]Keeper operations during backup or restore )", 0) \ DECLARE(UInt64, backup_restore_keeper_retry_max_backoff_ms, 5000, R"( Max backoff timeout for [Zoo]Keeper operations during backup or restore +)", 0) \ + DECLARE(UInt64, backup_restore_failure_after_host_disconnected_for_seconds, 3600, R"( +If a host during a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation doesn't recreate its ephemeral 'alive' node in ZooKeeper for this amount of time then the whole backup or restore is considered as failed. +This value should be bigger than any reasonable time for a host to reconnect to ZooKeeper after a failure. +Zero means unlimited. +)", 0) \ + DECLARE(UInt64, backup_restore_keeper_max_retries_while_initializing, 20, R"( +Max retries for [Zoo]Keeper operations during the initialization of a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation. +)", 0) \ + DECLARE(UInt64, backup_restore_keeper_max_retries_while_handling_error, 20, R"( +Max retries for [Zoo]Keeper operations while handling an error of a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation. +)", 0) \ + DECLARE(UInt64, backup_restore_finish_timeout_after_error_sec, 180, R"( +How long the initiator should wait for other host to react to the 'error' node and stop their work on the current BACKUP ON CLUSTER or RESTORE ON CLUSTER operation. +)", 0) \ + DECLARE(UInt64, backup_restore_keeper_value_max_size, 1048576, R"( +Maximum size of data of a [Zoo]Keeper's node during backup +)", 0) \ + DECLARE(UInt64, backup_restore_batch_size_for_keeper_multi, 1000, R"( +Maximum size of batch for multi request to [Zoo]Keeper during backup or restore +)", 0) \ + DECLARE(UInt64, backup_restore_batch_size_for_keeper_multiread, 10000, R"( +Maximum size of batch for multiread request to [Zoo]Keeper during backup or restore )", 0) \ DECLARE(Float, backup_restore_keeper_fault_injection_probability, 0.0f, R"( Approximate probability of failure for a keeper request during backup or restore. 
Valid value is in interval [0.0f, 1.0f] )", 0) \ DECLARE(UInt64, backup_restore_keeper_fault_injection_seed, 0, R"( 0 - random seed, otherwise the setting value -)", 0) \ - DECLARE(UInt64, backup_restore_keeper_value_max_size, 1048576, R"( -Maximum size of data of a [Zoo]Keeper's node during backup -)", 0) \ - DECLARE(UInt64, backup_restore_batch_size_for_keeper_multiread, 10000, R"( -Maximum size of batch for multiread request to [Zoo]Keeper during backup or restore -)", 0) \ - DECLARE(UInt64, backup_restore_batch_size_for_keeper_multi, 1000, R"( -Maximum size of batch for multi request to [Zoo]Keeper during backup or restore )", 0) \ DECLARE(UInt64, backup_restore_s3_retry_attempts, 1000, R"( Setting for Aws::Client::RetryStrategy, Aws::Client does retries itself, 0 means no retries. It takes place only for backup/restore. @@ -4451,9 +4474,8 @@ Optimize GROUP BY when all keys in block are constant DECLARE(Bool, legacy_column_name_of_tuple_literal, false, R"( List all names of element of large tuple literals in their column names instead of hash. This settings exists only for compatibility reasons. It makes sense to set to 'true', while doing rolling update of cluster from version lower than 21.7 to higher. )", 0) \ - DECLARE(Bool, enable_named_columns_in_function_tuple, true, R"( + DECLARE(Bool, enable_named_columns_in_function_tuple, false, R"( Generate named tuples in function tuple() when all names are unique and can be treated as unquoted identifiers. -Beware that this setting might currently result in broken queries. It's not recommended to use in production )", 0) \ \ DECLARE(Bool, query_plan_enable_optimizations, true, R"( @@ -5104,6 +5126,9 @@ Only in ClickHouse Cloud. A maximum number of unacknowledged in-flight packets i )", 0) \ DECLARE(UInt64, distributed_cache_data_packet_ack_window, DistributedCache::ACK_DATA_PACKET_WINDOW, R"( Only in ClickHouse Cloud. A window for sending ACK for DataPacket sequence in a single distributed cache read request +)", 0) \ + DECLARE(Bool, distributed_cache_discard_connection_if_unread_data, true, R"( +Only in ClickHouse Cloud. Discard connection if some data is unread. )", 0) \ \ DECLARE(Bool, parallelize_output_from_storages, true, R"( @@ -5503,90 +5528,102 @@ For testing purposes. Replaces all external table functions to Null to not initi DECLARE(Bool, restore_replace_external_dictionary_source_to_null, false, R"( Replace external dictionary sources to Null on restore. Useful for testing purposes )", 0) \ - DECLARE(Bool, create_if_not_exists, false, R"( -Enable `IF NOT EXISTS` for `CREATE` statement by default. If either this setting or `IF NOT EXISTS` is specified and a table with the provided name already exists, no exception will be thrown. -)", 0) \ - DECLARE(Bool, enforce_strict_identifier_format, false, R"( -If enabled, only allow identifiers containing alphanumeric characters and underscores. -)", 0) \ - DECLARE(Bool, mongodb_throw_on_unsupported_query, true, R"( -If enabled, MongoDB tables will return an error when a MongoDB query cannot be built. Otherwise, ClickHouse reads the full table and processes it locally. This option does not apply to the legacy implementation or when 'allow_experimental_analyzer=0'. -)", 0) \ - \ - /* ###################################### */ \ - /* ######## EXPERIMENTAL FEATURES ####### */ \ - /* ###################################### */ \ - DECLARE(Bool, allow_experimental_materialized_postgresql_table, false, R"( -Allows to use the MaterializedPostgreSQL table engine. 
Disabled by default, because this feature is experimental -)", 0) \ - DECLARE(Bool, allow_experimental_funnel_functions, false, R"( -Enable experimental functions for funnel analysis. -)", 0) \ - DECLARE(Bool, allow_experimental_nlp_functions, false, R"( -Enable experimental functions for natural language processing. -)", 0) \ - DECLARE(Bool, allow_experimental_hash_functions, false, R"( -Enable experimental hash functions -)", 0) \ - DECLARE(Bool, allow_experimental_object_type, false, R"( -Allow Object and JSON data types -)", 0) \ - DECLARE(Bool, allow_experimental_time_series_table, false, R"( -Allows creation of tables with the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine. + /* Parallel replicas */ \ + DECLARE(UInt64, allow_experimental_parallel_reading_from_replicas, 0, R"( +Use up to `max_parallel_replicas` the number of replicas from each shard for SELECT query execution. Reading is parallelized and coordinated dynamically. 0 - disabled, 1 - enabled, silently disable them in case of failure, 2 - enabled, throw an exception in case of failure +)", BETA) ALIAS(enable_parallel_replicas) \ + DECLARE(NonZeroUInt64, max_parallel_replicas, 1, R"( +The maximum number of replicas for each shard when executing a query. Possible values: -- 0 — the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is disabled. -- 1 — the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is enabled. -)", 0) \ - DECLARE(Bool, allow_experimental_vector_similarity_index, false, R"( -Allow experimental vector similarity index -)", 0) \ - DECLARE(Bool, allow_experimental_variant_type, false, R"( -Allows creation of experimental [Variant](../../sql-reference/data-types/variant.md). -)", 0) \ - DECLARE(Bool, allow_experimental_dynamic_type, false, R"( -Allow Dynamic data type -)", 0) \ - DECLARE(Bool, allow_experimental_json_type, false, R"( -Allow JSON data type -)", 0) \ - DECLARE(Bool, allow_experimental_codecs, false, R"( -If it is set to true, allow to specify experimental compression codecs (but we don't have those yet and this option does nothing). -)", 0) \ - DECLARE(Bool, allow_experimental_shared_set_join, true, R"( -Only in ClickHouse Cloud. Allow to create ShareSet and SharedJoin -)", 0) \ - DECLARE(UInt64, max_limit_for_ann_queries, 1'000'000, R"( -SELECT queries with LIMIT bigger than this setting cannot use vector similarity indexes. Helps to prevent memory overflows in vector similarity indexes. -)", 0) \ - DECLARE(UInt64, hnsw_candidate_list_size_for_search, 256, R"( -The size of the dynamic candidate list when searching the vector similarity index, also known as 'ef_search'. 
-)", 0) \ - DECLARE(Bool, throw_on_unsupported_query_inside_transaction, true, R"( -Throw exception if unsupported query is used inside transaction -)", 0) \ - DECLARE(TransactionsWaitCSNMode, wait_changes_become_visible_after_commit_mode, TransactionsWaitCSNMode::WAIT_UNKNOWN, R"( -Wait for committed changes to become actually visible in the latest snapshot -)", 0) \ - DECLARE(Bool, implicit_transaction, false, R"( -If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback) -)", 0) \ - DECLARE(UInt64, grace_hash_join_initial_buckets, 1, R"( -Initial number of grace hash join buckets -)", 0) \ - DECLARE(UInt64, grace_hash_join_max_buckets, 1024, R"( -Limit on the number of grace hash join buckets -)", 0) \ - DECLARE(UInt64, join_to_sort_minimum_perkey_rows, 40, R"( -The lower limit of per-key average rows in the right table to determine whether to rerange the right table by key in left or inner join. This setting ensures that the optimization is not applied for sparse table keys -)", 0) \ - DECLARE(UInt64, join_to_sort_maximum_table_rows, 10000, R"( -The maximum number of rows in the right table to determine whether to rerange the right table by key in left or inner join. -)", 0) \ - DECLARE(Bool, allow_experimental_join_right_table_sorting, false, R"( -If it is set to true, and the conditions of `join_to_sort_minimum_perkey_rows` and `join_to_sort_maximum_table_rows` are met, rerange the right table by key to improve the performance in left or inner hash join. +- Positive integer. + +**Additional Info** + +This options will produce different results depending on the settings used. + +:::note +This setting will produce incorrect results when joins or subqueries are involved, and all tables don't meet certain requirements. See [Distributed Subqueries and max_parallel_replicas](../../sql-reference/operators/in.md/#max_parallel_replica-subqueries) for more details. +::: + +### Parallel processing using `SAMPLE` key + +A query may be processed faster if it is executed on several servers in parallel. But the query performance may degrade in the following cases: + +- The position of the sampling key in the partitioning key does not allow efficient range scans. +- Adding a sampling key to the table makes filtering by other columns less efficient. +- The sampling key is an expression that is expensive to calculate. +- The cluster latency distribution has a long tail, so that querying more servers increases the query overall latency. + +### Parallel processing using [parallel_replicas_custom_key](#parallel_replicas_custom_key) + +This setting is useful for any replicated table. )", 0) \ + DECLARE(ParallelReplicasMode, parallel_replicas_mode, ParallelReplicasMode::READ_TASKS, R"( +Type of filter to use with custom key for parallel replicas. default - use modulo operation on the custom key, range - use range filter on custom key using all possible values for the value type of custom key. +)", BETA) \ + DECLARE(UInt64, parallel_replicas_count, 0, R"( +This is internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. This setting will be automatically set up by the initiator server for distributed queries to the number of parallel replicas participating in query processing. +)", BETA) \ + DECLARE(UInt64, parallel_replica_offset, 0, R"( +This is internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. 
This setting will be automatically set up by the initiator server for distributed queries to the index of the replica participating in query processing among parallel replicas. +)", BETA) \ + DECLARE(String, parallel_replicas_custom_key, "", R"( +An arbitrary integer expression that can be used to split work between replicas for a specific table. +The value can be any integer expression. + +Simple expressions using primary keys are preferred. + +If the setting is used on a cluster that consists of a single shard with multiple replicas, those replicas will be converted into virtual shards. +Otherwise, it will behave same as for `SAMPLE` key, it will use multiple replicas of each shard. +)", BETA) \ + DECLARE(UInt64, parallel_replicas_custom_key_range_lower, 0, R"( +Allows the filter type `range` to split the work evenly between replicas based on the custom range `[parallel_replicas_custom_key_range_lower, INT_MAX]`. + +When used in conjunction with [parallel_replicas_custom_key_range_upper](#parallel_replicas_custom_key_range_upper), it lets the filter evenly split the work over replicas for the range `[parallel_replicas_custom_key_range_lower, parallel_replicas_custom_key_range_upper]`. + +Note: This setting will not cause any additional data to be filtered during query processing, rather it changes the points at which the range filter breaks up the range `[0, INT_MAX]` for parallel processing. +)", BETA) \ + DECLARE(UInt64, parallel_replicas_custom_key_range_upper, 0, R"( +Allows the filter type `range` to split the work evenly between replicas based on the custom range `[0, parallel_replicas_custom_key_range_upper]`. A value of 0 disables the upper bound, setting it the max value of the custom key expression. + +When used in conjunction with [parallel_replicas_custom_key_range_lower](#parallel_replicas_custom_key_range_lower), it lets the filter evenly split the work over replicas for the range `[parallel_replicas_custom_key_range_lower, parallel_replicas_custom_key_range_upper]`. + +Note: This setting will not cause any additional data to be filtered during query processing, rather it changes the points at which the range filter breaks up the range `[0, INT_MAX]` for parallel processing +)", BETA) \ + DECLARE(String, cluster_for_parallel_replicas, "", R"( +Cluster for a shard in which current server is located +)", BETA) \ + DECLARE(Bool, parallel_replicas_allow_in_with_subquery, true, R"( +If true, subquery for IN will be executed on every follower replica. +)", BETA) \ + DECLARE(Float, parallel_replicas_single_task_marks_count_multiplier, 2, R"( +A multiplier which will be added during calculation for minimal number of marks to retrieve from coordinator. This will be applied only for remote replicas. +)", BETA) \ + DECLARE(Bool, parallel_replicas_for_non_replicated_merge_tree, false, R"( +If true, ClickHouse will use parallel replicas algorithm also for non-replicated MergeTree tables +)", BETA) \ + DECLARE(UInt64, parallel_replicas_min_number_of_rows_per_replica, 0, R"( +Limit the number of replicas used in a query to (estimated rows to read / min_number_of_rows_per_replica). The max is still limited by 'max_parallel_replicas' +)", BETA) \ + DECLARE(Bool, parallel_replicas_prefer_local_join, true, R"( +If true, and JOIN can be executed with parallel replicas algorithm, and all storages of right JOIN part are *MergeTree, local JOIN will be used instead of GLOBAL JOIN. 
+)", BETA) \ + DECLARE(UInt64, parallel_replicas_mark_segment_size, 0, R"( +Parts virtually divided into segments to be distributed between replicas for parallel reading. This setting controls the size of these segments. Not recommended to change until you're absolutely sure in what you're doing. Value should be in range [128; 16384] +)", BETA) \ + DECLARE(Bool, parallel_replicas_local_plan, false, R"( +Build local plan for local replica +)", BETA) \ + \ + DECLARE(Bool, allow_experimental_analyzer, true, R"( +Allow new query analyzer. +)", IMPORTANT | BETA) ALIAS(enable_analyzer) \ + DECLARE(Bool, analyzer_compatibility_join_using_top_level_identifier, false, R"( +Force to resolve identifier in JOIN USING from projection (for example, in `SELECT a + 1 AS b FROM t1 JOIN t2 USING (b)` join will be performed by `t1.a + 1 = t2.b`, rather then `t1.b = t2.b`). +)", BETA) \ + \ DECLARE(Timezone, session_timezone, "", R"( Sets the implicit time zone of the current session or query. The implicit time zone is the time zone applied to values of type DateTime/DateTime64 which have no explicitly specified time zone. @@ -5646,126 +5683,121 @@ This happens due to different parsing pipelines: **See also** - [timezone](../server-configuration-parameters/settings.md#timezone) +)", BETA) \ +DECLARE(Bool, create_if_not_exists, false, R"( +Enable `IF NOT EXISTS` for `CREATE` statement by default. If either this setting or `IF NOT EXISTS` is specified and a table with the provided name already exists, no exception will be thrown. +)", 0) \ + DECLARE(Bool, enforce_strict_identifier_format, false, R"( +If enabled, only allow identifiers containing alphanumeric characters and underscores. +)", 0) \ + DECLARE(Bool, mongodb_throw_on_unsupported_query, true, R"( +If enabled, MongoDB tables will return an error when a MongoDB query cannot be built. Otherwise, ClickHouse reads the full table and processes it locally. This option does not apply to the legacy implementation or when 'allow_experimental_analyzer=0'. +)", 0) \ + DECLARE(Bool, implicit_select, false, R"( +Allow writing simple SELECT queries without the leading SELECT keyword, which makes it simple for calculator-style usage, e.g. `1 + 2` becomes a valid query. )", 0) \ - DECLARE(Bool, use_hive_partitioning, false, R"( -When enabled, ClickHouse will detect Hive-style partitioning in path (`/name=value/`) in file-like table engines [File](../../engines/table-engines/special/file.md#hive-style-partitioning)/[S3](../../engines/table-engines/integrations/s3.md#hive-style-partitioning)/[URL](../../engines/table-engines/special/url.md#hive-style-partitioning)/[HDFS](../../engines/table-engines/integrations/hdfs.md#hive-style-partitioning)/[AzureBlobStorage](../../engines/table-engines/integrations/azureBlobStorage.md#hive-style-partitioning) and will allow to use partition columns as virtual columns in the query. These virtual columns will have the same names as in the partitioned path, but starting with `_`. -)", 0)\ \ - DECLARE(Bool, allow_statistics_optimize, false, R"( -Allows using statistics to optimize queries -)", 0) ALIAS(allow_statistic_optimize) \ - DECLARE(Bool, allow_experimental_statistics, false, R"( -Allows defining columns with [statistics](../../engines/table-engines/mergetree-family/mergetree.md#table_engine-mergetree-creating-a-table) and [manipulate statistics](../../engines/table-engines/mergetree-family/mergetree.md#column-statistics). 
-)", 0) ALIAS(allow_experimental_statistic) \ \ - /* Parallel replicas */ \ - DECLARE(UInt64, allow_experimental_parallel_reading_from_replicas, 0, R"( -Use up to `max_parallel_replicas` the number of replicas from each shard for SELECT query execution. Reading is parallelized and coordinated dynamically. 0 - disabled, 1 - enabled, silently disable them in case of failure, 2 - enabled, throw an exception in case of failure -)", 0) ALIAS(enable_parallel_replicas) \ - DECLARE(NonZeroUInt64, max_parallel_replicas, 1, R"( -The maximum number of replicas for each shard when executing a query. + /* ####################################################### */ \ + /* ########### START OF EXPERIMENTAL FEATURES ############ */ \ + /* ## ADD PRODUCTION / BETA FEATURES BEFORE THIS BLOCK ## */ \ + /* ####################################################### */ \ + \ + DECLARE(Bool, allow_experimental_materialized_postgresql_table, false, R"( +Allows to use the MaterializedPostgreSQL table engine. Disabled by default, because this feature is experimental +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_funnel_functions, false, R"( +Enable experimental functions for funnel analysis. +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_nlp_functions, false, R"( +Enable experimental functions for natural language processing. +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_hash_functions, false, R"( +Enable experimental hash functions +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_object_type, false, R"( +Allow Object and JSON data types +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_time_series_table, false, R"( +Allows creation of tables with the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine. Possible values: -- Positive integer. - -**Additional Info** - -This options will produce different results depending on the settings used. - -:::note -This setting will produce incorrect results when joins or subqueries are involved, and all tables don't meet certain requirements. See [Distributed Subqueries and max_parallel_replicas](../../sql-reference/operators/in.md/#max_parallel_replica-subqueries) for more details. -::: - -### Parallel processing using `SAMPLE` key - -A query may be processed faster if it is executed on several servers in parallel. But the query performance may degrade in the following cases: - -- The position of the sampling key in the partitioning key does not allow efficient range scans. -- Adding a sampling key to the table makes filtering by other columns less efficient. -- The sampling key is an expression that is expensive to calculate. -- The cluster latency distribution has a long tail, so that querying more servers increases the query overall latency. - -### Parallel processing using [parallel_replicas_custom_key](#parallel_replicas_custom_key) - -This setting is useful for any replicated table. -)", 0) \ - DECLARE(ParallelReplicasMode, parallel_replicas_mode, ParallelReplicasMode::READ_TASKS, R"( -Type of filter to use with custom key for parallel replicas. default - use modulo operation on the custom key, range - use range filter on custom key using all possible values for the value type of custom key. -)", 0) \ - DECLARE(UInt64, parallel_replicas_count, 0, R"( -This is internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. 
This setting will be automatically set up by the initiator server for distributed queries to the number of parallel replicas participating in query processing. -)", 0) \ - DECLARE(UInt64, parallel_replica_offset, 0, R"( -This is internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. This setting will be automatically set up by the initiator server for distributed queries to the index of the replica participating in query processing among parallel replicas. -)", 0) \ - DECLARE(String, parallel_replicas_custom_key, "", R"( -An arbitrary integer expression that can be used to split work between replicas for a specific table. -The value can be any integer expression. - -Simple expressions using primary keys are preferred. - -If the setting is used on a cluster that consists of a single shard with multiple replicas, those replicas will be converted into virtual shards. -Otherwise, it will behave same as for `SAMPLE` key, it will use multiple replicas of each shard. -)", 0) \ - DECLARE(UInt64, parallel_replicas_custom_key_range_lower, 0, R"( -Allows the filter type `range` to split the work evenly between replicas based on the custom range `[parallel_replicas_custom_key_range_lower, INT_MAX]`. - -When used in conjunction with [parallel_replicas_custom_key_range_upper](#parallel_replicas_custom_key_range_upper), it lets the filter evenly split the work over replicas for the range `[parallel_replicas_custom_key_range_lower, parallel_replicas_custom_key_range_upper]`. - -Note: This setting will not cause any additional data to be filtered during query processing, rather it changes the points at which the range filter breaks up the range `[0, INT_MAX]` for parallel processing. -)", 0) \ - DECLARE(UInt64, parallel_replicas_custom_key_range_upper, 0, R"( -Allows the filter type `range` to split the work evenly between replicas based on the custom range `[0, parallel_replicas_custom_key_range_upper]`. A value of 0 disables the upper bound, setting it the max value of the custom key expression. - -When used in conjunction with [parallel_replicas_custom_key_range_lower](#parallel_replicas_custom_key_range_lower), it lets the filter evenly split the work over replicas for the range `[parallel_replicas_custom_key_range_lower, parallel_replicas_custom_key_range_upper]`. - -Note: This setting will not cause any additional data to be filtered during query processing, rather it changes the points at which the range filter breaks up the range `[0, INT_MAX]` for parallel processing -)", 0) \ - DECLARE(String, cluster_for_parallel_replicas, "", R"( -Cluster for a shard in which current server is located -)", 0) \ - DECLARE(Bool, parallel_replicas_allow_in_with_subquery, true, R"( -If true, subquery for IN will be executed on every follower replica. -)", 0) \ - DECLARE(Float, parallel_replicas_single_task_marks_count_multiplier, 2, R"( -A multiplier which will be added during calculation for minimal number of marks to retrieve from coordinator. This will be applied only for remote replicas. -)", 0) \ - DECLARE(Bool, parallel_replicas_for_non_replicated_merge_tree, false, R"( -If true, ClickHouse will use parallel replicas algorithm also for non-replicated MergeTree tables -)", 0) \ - DECLARE(UInt64, parallel_replicas_min_number_of_rows_per_replica, 0, R"( -Limit the number of replicas used in a query to (estimated rows to read / min_number_of_rows_per_replica). 
The max is still limited by 'max_parallel_replicas' -)", 0) \ - DECLARE(Bool, parallel_replicas_prefer_local_join, true, R"( -If true, and JOIN can be executed with parallel replicas algorithm, and all storages of right JOIN part are *MergeTree, local JOIN will be used instead of GLOBAL JOIN. -)", 0) \ - DECLARE(UInt64, parallel_replicas_mark_segment_size, 0, R"( -Parts virtually divided into segments to be distributed between replicas for parallel reading. This setting controls the size of these segments. Not recommended to change until you're absolutely sure in what you're doing. Value should be in range [128; 16384] +- 0 — the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is disabled. +- 1 — the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is enabled. )", 0) \ + DECLARE(Bool, allow_experimental_vector_similarity_index, false, R"( +Allow experimental vector similarity index +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_variant_type, false, R"( +Allows creation of experimental [Variant](../../sql-reference/data-types/variant.md). +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_dynamic_type, false, R"( +Allow Dynamic data type +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_json_type, false, R"( +Allow JSON data type +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_codecs, false, R"( +If it is set to true, allow to specify experimental compression codecs (but we don't have those yet and this option does nothing). +)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_shared_set_join, true, R"( +Only in ClickHouse Cloud. Allow to create ShareSet and SharedJoin +)", EXPERIMENTAL) \ + DECLARE(UInt64, max_limit_for_ann_queries, 1'000'000, R"( +SELECT queries with LIMIT bigger than this setting cannot use vector similarity indexes. Helps to prevent memory overflows in vector similarity indexes. +)", EXPERIMENTAL) \ + DECLARE(UInt64, hnsw_candidate_list_size_for_search, 256, R"( +The size of the dynamic candidate list when searching the vector similarity index, also known as 'ef_search'. +)", EXPERIMENTAL) \ + DECLARE(Bool, throw_on_unsupported_query_inside_transaction, true, R"( +Throw exception if unsupported query is used inside transaction +)", EXPERIMENTAL) \ + DECLARE(TransactionsWaitCSNMode, wait_changes_become_visible_after_commit_mode, TransactionsWaitCSNMode::WAIT_UNKNOWN, R"( +Wait for committed changes to become actually visible in the latest snapshot +)", EXPERIMENTAL) \ + DECLARE(Bool, implicit_transaction, false, R"( +If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback) +)", EXPERIMENTAL) \ + DECLARE(UInt64, grace_hash_join_initial_buckets, 1, R"( +Initial number of grace hash join buckets +)", EXPERIMENTAL) \ + DECLARE(UInt64, grace_hash_join_max_buckets, 1024, R"( +Limit on the number of grace hash join buckets +)", EXPERIMENTAL) \ + DECLARE(UInt64, join_to_sort_minimum_perkey_rows, 40, R"( +The lower limit of per-key average rows in the right table to determine whether to rerange the right table by key in left or inner join. This setting ensures that the optimization is not applied for sparse table keys +)", EXPERIMENTAL) \ + DECLARE(UInt64, join_to_sort_maximum_table_rows, 10000, R"( +The maximum number of rows in the right table to determine whether to rerange the right table by key in left or inner join. 
+)", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_join_right_table_sorting, false, R"( +If it is set to true, and the conditions of `join_to_sort_minimum_perkey_rows` and `join_to_sort_maximum_table_rows` are met, rerange the right table by key to improve the performance in left or inner hash join. +)", EXPERIMENTAL) \ + DECLARE(Bool, use_hive_partitioning, false, R"( +When enabled, ClickHouse will detect Hive-style partitioning in path (`/name=value/`) in file-like table engines [File](../../engines/table-engines/special/file.md#hive-style-partitioning)/[S3](../../engines/table-engines/integrations/s3.md#hive-style-partitioning)/[URL](../../engines/table-engines/special/url.md#hive-style-partitioning)/[HDFS](../../engines/table-engines/integrations/hdfs.md#hive-style-partitioning)/[AzureBlobStorage](../../engines/table-engines/integrations/azureBlobStorage.md#hive-style-partitioning) and will allow to use partition columns as virtual columns in the query. These virtual columns will have the same names as in the partitioned path, but starting with `_`. +)", EXPERIMENTAL)\ + \ + DECLARE(Bool, allow_statistics_optimize, false, R"( +Allows using statistics to optimize queries +)", EXPERIMENTAL) ALIAS(allow_statistic_optimize) \ + DECLARE(Bool, allow_experimental_statistics, false, R"( +Allows defining columns with [statistics](../../engines/table-engines/mergetree-family/mergetree.md#table_engine-mergetree-creating-a-table) and [manipulate statistics](../../engines/table-engines/mergetree-family/mergetree.md#column-statistics). +)", EXPERIMENTAL) ALIAS(allow_experimental_statistic) \ + \ DECLARE(Bool, allow_archive_path_syntax, true, R"( File/S3 engines/table function will parse paths with '::' as '\\ :: \\' if archive has correct extension -)", 0) \ - DECLARE(Bool, parallel_replicas_local_plan, false, R"( -Build local plan for local replica -)", 0) \ +)", EXPERIMENTAL) \ \ DECLARE(Bool, allow_experimental_inverted_index, false, R"( If it is set to true, allow to use experimental inverted index. -)", 0) \ +)", EXPERIMENTAL) \ DECLARE(Bool, allow_experimental_full_text_index, false, R"( If it is set to true, allow to use experimental full-text index. -)", 0) \ +)", EXPERIMENTAL) \ \ DECLARE(Bool, allow_experimental_join_condition, false, R"( Support join with inequal conditions which involve columns from both left and right table. e.g. t1.y < t2.y. -)", 0) \ - \ - DECLARE(Bool, allow_experimental_analyzer, true, R"( -Allow new query analyzer. -)", IMPORTANT) ALIAS(enable_analyzer) \ - DECLARE(Bool, analyzer_compatibility_join_using_top_level_identifier, false, R"( -Force to resolve identifier in JOIN USING from projection (for example, in `SELECT a + 1 AS b FROM t1 JOIN t2 USING (b)` join will be performed by `t1.a + 1 = t2.b`, rather then `t1.b = t2.b`). )", 0) \ \ DECLARE(Bool, allow_experimental_live_view, false, R"( @@ -5778,43 +5810,43 @@ Possible values: )", 0) \ DECLARE(Seconds, live_view_heartbeat_interval, 15, R"( The heartbeat interval in seconds to indicate live query is alive. -)", 0) \ +)", EXPERIMENTAL) \ DECLARE(UInt64, max_live_view_insert_blocks_before_refresh, 64, R"( Limit maximum number of inserted blocks after which mergeable blocks are dropped and query is re-executed. -)", 0) \ +)", EXPERIMENTAL) \ \ DECLARE(Bool, allow_experimental_window_view, false, R"( Enable WINDOW VIEW. Not mature enough. -)", 0) \ +)", EXPERIMENTAL) \ DECLARE(Seconds, window_view_clean_interval, 60, R"( The clean interval of window view in seconds to free outdated data. 
-)", 0) \ +)", EXPERIMENTAL) \ DECLARE(Seconds, window_view_heartbeat_interval, 15, R"( The heartbeat interval in seconds to indicate watch query is alive. -)", 0) \ +)", EXPERIMENTAL) \ DECLARE(Seconds, wait_for_window_view_fire_signal_timeout, 10, R"( Timeout for waiting for window view fire signal in event time processing -)", 0) \ +)", EXPERIMENTAL) \ \ DECLARE(Bool, stop_refreshable_materialized_views_on_startup, false, R"( On server startup, prevent scheduling of refreshable materialized views, as if with SYSTEM STOP VIEWS. You can manually start them with SYSTEM START VIEWS or SYSTEM START VIEW \\ afterwards. Also applies to newly created views. Has no effect on non-refreshable materialized views. -)", 0) \ +)", EXPERIMENTAL) \ \ DECLARE(Bool, allow_experimental_database_materialized_mysql, false, R"( Allow to create database with Engine=MaterializedMySQL(...). -)", 0) \ +)", EXPERIMENTAL) \ DECLARE(Bool, allow_experimental_database_materialized_postgresql, false, R"( Allow to create database with Engine=MaterializedPostgreSQL(...). -)", 0) \ +)", EXPERIMENTAL) \ \ /** Experimental feature for moving data between shards. */ \ DECLARE(Bool, allow_experimental_query_deduplication, false, R"( Experimental data deduplication for SELECT queries based on part UUIDs -)", 0) \ - DECLARE(Bool, implicit_select, false, R"( -Allow writing simple SELECT queries without the leading SELECT keyword, which makes it simple for calculator-style usage, e.g. `1 + 2` becomes a valid query. -)", 0) - +)", EXPERIMENTAL) \ + \ + /* ####################################################### */ \ + /* ############ END OF EXPERIMENTAL FEATURES ############# */ \ + /* ####################################################### */ \ // End of COMMON_SETTINGS // Please add settings related to formats in Core/FormatFactorySettings.h, move obsolete settings to OBSOLETE_SETTINGS and obsolete format settings to OBSOLETE_FORMAT_SETTINGS. @@ -5893,13 +5925,14 @@ Allow writing simple SELECT queries without the leading SELECT keyword, which ma /** The section above is for obsolete settings. Do not add anything there. */ #endif /// __CLION_IDE__ - #define LIST_OF_SETTINGS(M, ALIAS) \ COMMON_SETTINGS(M, ALIAS) \ OBSOLETE_SETTINGS(M, ALIAS) \ FORMAT_FACTORY_SETTINGS(M, ALIAS) \ OBSOLETE_FORMAT_SETTINGS(M, ALIAS) \ +// clang-format on + DECLARE_SETTINGS_TRAITS_ALLOW_CUSTOM_SETTINGS(SettingsTraits, LIST_OF_SETTINGS) IMPLEMENT_SETTINGS_TRAITS(SettingsTraits, LIST_OF_SETTINGS) @@ -6007,7 +6040,7 @@ void SettingsImpl::checkNoSettingNamesAtTopLevel(const Poco::Util::AbstractConfi { const auto & name = setting.getName(); bool should_skip_check = name == "max_table_size_to_drop" || name == "max_partition_size_to_drop"; - if (config.has(name) && !setting.isObsolete() && !should_skip_check) + if (config.has(name) && (setting.getTier() != SettingsTierType::OBSOLETE) && !should_skip_check) { throw Exception(ErrorCodes::UNKNOWN_ELEMENT_IN_CONFIG, "A setting '{}' appeared at top level in config {}." " But it is user-level setting that should be located in users.xml inside section for specific profile." 
@@ -6183,7 +6216,7 @@ std::vector Settings::getChangedAndObsoleteNames() const std::vector setting_names; for (const auto & setting : impl->allChanged()) { - if (setting.isObsolete()) + if (setting.getTier() == SettingsTierType::OBSOLETE) setting_names.emplace_back(setting.getName()); } return setting_names; @@ -6232,7 +6265,8 @@ void Settings::dumpToSystemSettingsColumns(MutableColumnsAndConstraints & params res_columns[6]->insert(writability == SettingConstraintWritability::CONST); res_columns[7]->insert(setting.getTypeName()); res_columns[8]->insert(setting.getDefaultValueString()); - res_columns[10]->insert(setting.isObsolete()); + res_columns[10]->insert(setting.getTier() == SettingsTierType::OBSOLETE); + res_columns[11]->insert(setting.getTier()); }; const auto & settings_to_aliases = SettingsImpl::Traits::settingsToAliases(); diff --git a/src/Core/SettingsChangesHistory.cpp b/src/Core/SettingsChangesHistory.cpp index 7dfcdb3c1e6..a5ddffca786 100644 --- a/src/Core/SettingsChangesHistory.cpp +++ b/src/Core/SettingsChangesHistory.cpp @@ -64,6 +64,13 @@ static std::initializer_list +#include + +namespace DB +{ + +std::shared_ptr getSettingsTierEnum() +{ + return std::make_shared( + DataTypeEnum8::Values + { + {"Production", static_cast(SettingsTierType::PRODUCTION)}, + {"Obsolete", static_cast(SettingsTierType::OBSOLETE)}, + {"Experimental", static_cast(SettingsTierType::EXPERIMENTAL)}, + {"Beta", static_cast(SettingsTierType::BETA)} + }); +} + +} diff --git a/src/Core/SettingsTierType.h b/src/Core/SettingsTierType.h new file mode 100644 index 00000000000..d8bba89bc18 --- /dev/null +++ b/src/Core/SettingsTierType.h @@ -0,0 +1,26 @@ +#pragma once + +#include + +#include +#include + +namespace DB +{ + +template +class DataTypeEnum; +using DataTypeEnum8 = DataTypeEnum; + +// Make it signed for compatibility with DataTypeEnum8 +enum SettingsTierType : int8_t +{ + PRODUCTION = 0b0000, + OBSOLETE = 0b0100, + EXPERIMENTAL = 0b1000, + BETA = 0b1100 +}; + +std::shared_ptr getSettingsTierEnum(); + +} diff --git a/src/Core/SortCursor.h b/src/Core/SortCursor.h index f41664a1607..3d568be199c 100644 --- a/src/Core/SortCursor.h +++ b/src/Core/SortCursor.h @@ -195,6 +195,15 @@ struct SortCursorHelper /// The last row of this cursor is no larger than the first row of the another cursor. return !derived().greaterAt(rhs.derived(), impl->rows - 1, 0); } + + bool ALWAYS_INLINE totallyLess(const SortCursorHelper & rhs) const + { + if (impl->rows == 0 || rhs.impl->rows == 0) + return false; + + /// The last row of this cursor is less than the first row of the another cursor. + return rhs.derived().template greaterAt(derived(), 0, impl->rows - 1); + } }; @@ -203,6 +212,7 @@ struct SortCursor : SortCursorHelper using SortCursorHelper::SortCursorHelper; /// The specified row of this cursor is greater than the specified row of another cursor. 
+ template bool ALWAYS_INLINE greaterAt(const SortCursor & rhs, size_t lhs_pos, size_t rhs_pos) const { #if USE_EMBEDDED_COMPILER @@ -218,7 +228,10 @@ struct SortCursor : SortCursorHelper if (res < 0) return false; - return impl->order > rhs.impl->order; + if constexpr (consider_order) + return impl->order > rhs.impl->order; + else + return false; } #endif @@ -235,7 +248,10 @@ struct SortCursor : SortCursorHelper return false; } - return impl->order > rhs.impl->order; + if constexpr (consider_order) + return impl->order > rhs.impl->order; + else + return false; } }; @@ -245,6 +261,7 @@ struct SimpleSortCursor : SortCursorHelper { using SortCursorHelper::SortCursorHelper; + template bool ALWAYS_INLINE greaterAt(const SimpleSortCursor & rhs, size_t lhs_pos, size_t rhs_pos) const { int res = 0; @@ -271,7 +288,10 @@ struct SimpleSortCursor : SortCursorHelper if (res < 0) return false; - return impl->order > rhs.impl->order; + if constexpr (consider_order) + return impl->order > rhs.impl->order; + else + return false; } }; @@ -280,6 +300,7 @@ struct SpecializedSingleColumnSortCursor : SortCursorHelper::SortCursorHelper; + template bool ALWAYS_INLINE greaterAt(const SortCursorHelper & rhs, size_t lhs_pos, size_t rhs_pos) const { auto & this_impl = this->impl; @@ -302,7 +323,10 @@ struct SpecializedSingleColumnSortCursor : SortCursorHelperorder > rhs.impl->order; + if constexpr (consider_order) + return this_impl->order > rhs.impl->order; + else + return false; } }; @@ -311,6 +335,7 @@ struct SortCursorWithCollation : SortCursorHelper { using SortCursorHelper::SortCursorHelper; + template bool ALWAYS_INLINE greaterAt(const SortCursorWithCollation & rhs, size_t lhs_pos, size_t rhs_pos) const { for (size_t i = 0; i < impl->sort_columns_size; ++i) @@ -330,7 +355,10 @@ struct SortCursorWithCollation : SortCursorHelper if (res < 0) return false; } - return impl->order > rhs.impl->order; + if constexpr (consider_order) + return impl->order > rhs.impl->order; + else + return false; } }; diff --git a/src/DataTypes/Serializations/ISerialization.cpp b/src/DataTypes/Serializations/ISerialization.cpp index fdcdf9e0cda..5a60dc30b02 100644 --- a/src/DataTypes/Serializations/ISerialization.cpp +++ b/src/DataTypes/Serializations/ISerialization.cpp @@ -161,7 +161,7 @@ String getNameForSubstreamPath( String stream_name, SubstreamIterator begin, SubstreamIterator end, - bool escape_tuple_delimiter) + bool escape_for_file_name) { using Substream = ISerialization::Substream; @@ -186,7 +186,7 @@ String getNameForSubstreamPath( /// Because nested data may be represented not by Array of Tuple, /// but by separate Array columns with names in a form of a.b, /// and name is encoded as a whole. - if (it->type == Substream::TupleElement && escape_tuple_delimiter) + if (it->type == Substream::TupleElement && escape_for_file_name) stream_name += escapeForFileName(substream_name); else stream_name += substream_name; @@ -206,7 +206,7 @@ String getNameForSubstreamPath( else if (it->type == SubstreamType::ObjectSharedData) stream_name += ".object_shared_data"; else if (it->type == SubstreamType::ObjectTypedPath || it->type == SubstreamType::ObjectDynamicPath) - stream_name += "." + it->object_path_name; + stream_name += "." + (escape_for_file_name ? 
escapeForFileName(it->object_path_name) : it->object_path_name); } return stream_name; @@ -434,6 +434,14 @@ bool ISerialization::isDynamicSubcolumn(const DB::ISerialization::SubstreamPath return false; } +bool ISerialization::isLowCardinalityDictionarySubcolumn(const DB::ISerialization::SubstreamPath & path) +{ + if (path.empty()) + return false; + + return path[path.size() - 1].type == SubstreamType::DictionaryKeys; +} + ISerialization::SubstreamData ISerialization::createFromPath(const SubstreamPath & path, size_t prefix_len) { assert(prefix_len <= path.size()); diff --git a/src/DataTypes/Serializations/ISerialization.h b/src/DataTypes/Serializations/ISerialization.h index 7bd58a8a981..400bdbf32d3 100644 --- a/src/DataTypes/Serializations/ISerialization.h +++ b/src/DataTypes/Serializations/ISerialization.h @@ -463,6 +463,8 @@ public: /// Returns true if stream with specified path corresponds to dynamic subcolumn. static bool isDynamicSubcolumn(const SubstreamPath & path, size_t prefix_len); + static bool isLowCardinalityDictionarySubcolumn(const SubstreamPath & path); + protected: template State * checkAndGetState(const StatePtr & state) const; diff --git a/src/DataTypes/Serializations/SerializationLowCardinality.cpp b/src/DataTypes/Serializations/SerializationLowCardinality.cpp index baaab6ba3c3..248fe2681b0 100644 --- a/src/DataTypes/Serializations/SerializationLowCardinality.cpp +++ b/src/DataTypes/Serializations/SerializationLowCardinality.cpp @@ -54,7 +54,7 @@ void SerializationLowCardinality::enumerateStreams( .withSerializationInfo(data.serialization_info); settings.path.back().data = dict_data; - dict_inner_serialization->enumerateStreams(settings, callback, dict_data); + callback(settings.path); settings.path.back() = Substream::DictionaryIndexes; settings.path.back().data = data; diff --git a/src/Databases/DatabaseReplicated.cpp b/src/Databases/DatabaseReplicated.cpp index 387667b1b42..8992a9d8548 100644 --- a/src/Databases/DatabaseReplicated.cpp +++ b/src/Databases/DatabaseReplicated.cpp @@ -4,47 +4,49 @@ #include #include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include #include #include #include #include #include +#include +#include #include #include #include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include namespace DB { @@ -55,6 +57,8 @@ namespace Setting extern const SettingsUInt64 max_parser_backtracks; extern const SettingsUInt64 max_parser_depth; extern const SettingsUInt64 max_query_size; + extern const SettingsDistributedDDLOutputMode distributed_ddl_output_mode; + extern const SettingsInt64 distributed_ddl_task_timeout; extern const SettingsBool throw_on_unsupported_query_inside_transaction; } @@ -443,7 +447,6 @@ void DatabaseReplicated::fillClusterAuthInfo(String collection_name, const Poco: cluster_auth_info.cluster_secure_connection = config_ref.getBool(config_prefix + ".cluster_secure_connection", false); } - void DatabaseReplicated::tryConnectToZooKeeperAndInitDatabase(LoadingStrictnessLevel mode) { try @@ 
-1096,7 +1099,8 @@ BlockIO DatabaseReplicated::tryEnqueueReplicatedDDL(const ASTPtr & query, Contex hosts_to_wait.push_back(unfiltered_hosts[i]); } - return getDistributedDDLStatus(node_path, entry, query_context, &hosts_to_wait); + + return getQueryStatus(node_path, fs::path(zookeeper_path) / "replicas", query_context, hosts_to_wait); } static UUID getTableUUIDIfReplicated(const String & metadata, ContextPtr context) @@ -2040,4 +2044,21 @@ void registerDatabaseReplicated(DatabaseFactory & factory) }; factory.registerDatabase("Replicated", create_fn, {.supports_arguments = true, .supports_settings = true}); } + +BlockIO DatabaseReplicated::getQueryStatus( + const String & node_path, const String & replicas_path, ContextPtr context_, const Strings & hosts_to_wait) +{ + BlockIO io; + if (context_->getSettingsRef()[Setting::distributed_ddl_task_timeout] == 0) + return io; + + auto source = std::make_shared(node_path, replicas_path, context_, hosts_to_wait); + io.pipeline = QueryPipeline(std::move(source)); + + if (context_->getSettingsRef()[Setting::distributed_ddl_output_mode] == DistributedDDLOutputMode::NONE + || context_->getSettingsRef()[Setting::distributed_ddl_output_mode] == DistributedDDLOutputMode::NONE_ONLY_ACTIVE) + io.pipeline.complete(std::make_shared(io.pipeline.getHeader())); + + return io; +} } diff --git a/src/Databases/DatabaseReplicated.h b/src/Databases/DatabaseReplicated.h index 3195de48c1f..fb239435dc1 100644 --- a/src/Databases/DatabaseReplicated.h +++ b/src/Databases/DatabaseReplicated.h @@ -151,6 +151,9 @@ private: void waitDatabaseStarted() const override; void stopLoading() override; + static BlockIO + getQueryStatus(const String & node_path, const String & replicas_path, ContextPtr context, const Strings & hosts_to_wait); + String zookeeper_path; String shard_name; String replica_name; diff --git a/src/Databases/DatabaseReplicatedWorker.cpp b/src/Databases/DatabaseReplicatedWorker.cpp index 3b70383c28b..6a711c92332 100644 --- a/src/Databases/DatabaseReplicatedWorker.cpp +++ b/src/Databases/DatabaseReplicatedWorker.cpp @@ -39,7 +39,14 @@ namespace ErrorCodes static constexpr const char * FORCE_AUTO_RECOVERY_DIGEST = "42"; DatabaseReplicatedDDLWorker::DatabaseReplicatedDDLWorker(DatabaseReplicated * db, ContextPtr context_) - : DDLWorker(/* pool_size */ 1, db->zookeeper_path + "/log", context_, nullptr, {}, fmt::format("DDLWorker({})", db->getDatabaseName())) + : DDLWorker( + /* pool_size */ 1, + db->zookeeper_path + "/log", + db->zookeeper_path + "/replicas", + context_, + nullptr, + {}, + fmt::format("DDLWorker({})", db->getDatabaseName())) , database(db) { /// Pool size must be 1 to avoid reordering of log entries. 
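The `getQueryStatus` helper added above encodes a small decision: with `distributed_ddl_task_timeout = 0` the replicated DDL query returns immediately with no status pipeline at all, while the `NONE`/`NONE_ONLY_ACTIVE` output modes still wait for replicas but discard the status rows through a null sink. A rough, stand-alone model of that decision (the enum and function names are illustrative, not the real interfaces):

```cpp
#include <iostream>

// Illustrative stand-ins for the two settings that drive the decision.
enum class OutputMode { Throw, None, NoneOnlyActive };

enum class StatusPipeline
{
    NoPipeline,     // timeout == 0: enqueue and return, do not wait
    WaitAndReturn,  // wait for replicas and stream status rows to the client
    WaitAndDiscard, // wait for replicas, but complete the pipeline with a null sink
};

StatusPipeline choosePipeline(long ddl_task_timeout, OutputMode mode)
{
    if (ddl_task_timeout == 0)
        return StatusPipeline::NoPipeline;
    if (mode == OutputMode::None || mode == OutputMode::NoneOnlyActive)
        return StatusPipeline::WaitAndDiscard;
    return StatusPipeline::WaitAndReturn;
}

int main()
{
    std::cout << static_cast<int>(choosePipeline(180, OutputMode::Throw)) << '\n'; // 1: wait, return rows
    std::cout << static_cast<int>(choosePipeline(180, OutputMode::None)) << '\n';  // 2: wait, discard rows
    std::cout << static_cast<int>(choosePipeline(0, OutputMode::None)) << '\n';    // 0: no pipeline
}
```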
@@ -192,13 +199,12 @@ void DatabaseReplicatedDDLWorker::initializeReplication() active_node_holder = zkutil::EphemeralNodeHolder::existing(active_path, *active_node_holder_zookeeper); } -String DatabaseReplicatedDDLWorker::enqueueQuery(DDLLogEntry & entry) +String DatabaseReplicatedDDLWorker::enqueueQuery(DDLLogEntry & entry, const ZooKeeperRetriesInfo &, QueryStatusPtr) { auto zookeeper = getAndSetZooKeeper(); return enqueueQueryImpl(zookeeper, entry, database); } - bool DatabaseReplicatedDDLWorker::waitForReplicaToProcessAllEntries(UInt64 timeout_ms) { auto zookeeper = getAndSetZooKeeper(); diff --git a/src/Databases/DatabaseReplicatedWorker.h b/src/Databases/DatabaseReplicatedWorker.h index e741037e702..d2385cbdba3 100644 --- a/src/Databases/DatabaseReplicatedWorker.h +++ b/src/Databases/DatabaseReplicatedWorker.h @@ -24,7 +24,7 @@ class DatabaseReplicatedDDLWorker : public DDLWorker public: DatabaseReplicatedDDLWorker(DatabaseReplicated * db, ContextPtr context_); - String enqueueQuery(DDLLogEntry & entry) override; + String enqueueQuery(DDLLogEntry & entry, const ZooKeeperRetriesInfo &, QueryStatusPtr) override; String tryEnqueueAndExecuteEntry(DDLLogEntry & entry, ContextPtr query_context); @@ -38,9 +38,14 @@ public: UInt32 getLogPointer() const; UInt64 getCurrentInitializationDurationMs() const; + private: bool initializeMainThread() override; - void initializeReplication(); + void initializeReplication() override; + + void createReplicaDirs(const ZooKeeperPtr &, const NameSet &) override { } + void markReplicasActive(bool) override { } + void initializeLogPointer(const String & processed_entry_name); DDLTaskPtr initAndCheckTask(const String & entry_name, String & out_reason, const ZooKeeperPtr & zookeeper, bool dry_run) override; diff --git a/src/Disks/IO/AsynchronousBoundedReadBuffer.cpp b/src/Disks/IO/AsynchronousBoundedReadBuffer.cpp index b24b95af85c..c405d296e60 100644 --- a/src/Disks/IO/AsynchronousBoundedReadBuffer.cpp +++ b/src/Disks/IO/AsynchronousBoundedReadBuffer.cpp @@ -365,7 +365,7 @@ AsynchronousBoundedReadBuffer::~AsynchronousBoundedReadBuffer() } catch (...) 
{ - tryLogCurrentException(__PRETTY_FUNCTION__); + tryLogCurrentException(log); } } diff --git a/src/Disks/IO/WriteBufferFromAzureBlobStorage.cpp b/src/Disks/IO/WriteBufferFromAzureBlobStorage.cpp index a9c0b26aa8d..cf88a54db86 100644 --- a/src/Disks/IO/WriteBufferFromAzureBlobStorage.cpp +++ b/src/Disks/IO/WriteBufferFromAzureBlobStorage.cpp @@ -30,6 +30,7 @@ namespace DB namespace ErrorCodes { + extern const int AZURE_BLOB_STORAGE_ERROR; extern const int LOGICAL_ERROR; } @@ -72,6 +73,7 @@ WriteBufferFromAzureBlobStorage::WriteBufferFromAzureBlobStorage( std::move(schedule_), settings_->max_inflight_parts_for_one_file, limited_log)) + , check_objects_after_upload(settings_->check_objects_after_upload) { allocateBuffer(); } @@ -178,6 +180,24 @@ void WriteBufferFromAzureBlobStorage::finalizeImpl() execWithRetry([&](){ block_blob_client.CommitBlockList(block_ids); }, max_unexpected_write_error_retries); LOG_TRACE(log, "Committed {} blocks for blob `{}`", block_ids.size(), blob_path); } + + if (check_objects_after_upload) + { + try + { + auto blob_client = blob_container_client->GetBlobClient(blob_path); + blob_client.GetProperties(); + } + catch (const Azure::Storage::StorageException & e) + { + if (e.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) + throw Exception( + ErrorCodes::AZURE_BLOB_STORAGE_ERROR, + "Object {} not uploaded to azure blob storage, it's a bug in Azure Blob Storage or its API.", + blob_path); + throw; + } + } } void WriteBufferFromAzureBlobStorage::nextImpl() diff --git a/src/Disks/IO/WriteBufferFromAzureBlobStorage.h b/src/Disks/IO/WriteBufferFromAzureBlobStorage.h index 351fb309d2d..23dd89874bc 100644 --- a/src/Disks/IO/WriteBufferFromAzureBlobStorage.h +++ b/src/Disks/IO/WriteBufferFromAzureBlobStorage.h @@ -90,6 +90,7 @@ private: size_t hidden_size = 0; std::unique_ptr task_tracker; + bool check_objects_after_upload = false; std::deque detached_part_data; }; diff --git a/src/Disks/ObjectStorages/AzureBlobStorage/AzureBlobStorageCommon.cpp b/src/Disks/ObjectStorages/AzureBlobStorage/AzureBlobStorageCommon.cpp index 931deed30ce..49355c15491 100644 --- a/src/Disks/ObjectStorages/AzureBlobStorage/AzureBlobStorageCommon.cpp +++ b/src/Disks/ObjectStorages/AzureBlobStorage/AzureBlobStorageCommon.cpp @@ -39,6 +39,7 @@ namespace Setting extern const SettingsUInt64 azure_sdk_max_retries; extern const SettingsUInt64 azure_sdk_retry_initial_backoff_ms; extern const SettingsUInt64 azure_sdk_retry_max_backoff_ms; + extern const SettingsBool azure_check_objects_after_upload; } namespace ErrorCodes @@ -352,6 +353,7 @@ std::unique_ptr getRequestSettings(const Settings & query_setti settings->sdk_max_retries = query_settings[Setting::azure_sdk_max_retries]; settings->sdk_retry_initial_backoff_ms = query_settings[Setting::azure_sdk_retry_initial_backoff_ms]; settings->sdk_retry_max_backoff_ms = query_settings[Setting::azure_sdk_retry_max_backoff_ms]; + settings->check_objects_after_upload = query_settings[Setting::azure_check_objects_after_upload]; return settings; } @@ -389,6 +391,8 @@ std::unique_ptr getRequestSettings(const Poco::Util::AbstractCo settings->sdk_retry_initial_backoff_ms = config.getUInt64(config_prefix + ".retry_initial_backoff_ms", settings_ref[Setting::azure_sdk_retry_initial_backoff_ms]); settings->sdk_retry_max_backoff_ms = config.getUInt64(config_prefix + ".retry_max_backoff_ms", settings_ref[Setting::azure_sdk_retry_max_backoff_ms]); + settings->check_objects_after_upload = config.getBool(config_prefix + ".check_objects_after_upload", 
settings_ref[Setting::azure_check_objects_after_upload]); + if (config.has(config_prefix + ".curl_ip_resolve")) { using CurlOptions = Azure::Core::Http::CurlTransportOptions; diff --git a/src/Disks/ObjectStorages/AzureBlobStorage/AzureBlobStorageCommon.h b/src/Disks/ObjectStorages/AzureBlobStorage/AzureBlobStorageCommon.h index bb2f0270924..4f7fd4e88cc 100644 --- a/src/Disks/ObjectStorages/AzureBlobStorage/AzureBlobStorageCommon.h +++ b/src/Disks/ObjectStorages/AzureBlobStorage/AzureBlobStorageCommon.h @@ -50,6 +50,7 @@ struct RequestSettings size_t sdk_retry_initial_backoff_ms = 10; size_t sdk_retry_max_backoff_ms = 1000; bool use_native_copy = false; + bool check_objects_after_upload = false; using CurlOptions = Azure::Core::Http::CurlTransportOptions; CurlOptions::CurlOptIPResolve curl_ip_resolve = CurlOptions::CURL_IPRESOLVE_WHATEVER; diff --git a/src/Disks/ObjectStorages/DiskObjectStorage.cpp b/src/Disks/ObjectStorages/DiskObjectStorage.cpp index fbab25490c1..cc8a873c544 100644 --- a/src/Disks/ObjectStorages/DiskObjectStorage.cpp +++ b/src/Disks/ObjectStorages/DiskObjectStorage.cpp @@ -18,7 +18,8 @@ #include #include #include - +#include +#include namespace DB { @@ -71,8 +72,8 @@ DiskObjectStorage::DiskObjectStorage( , metadata_storage(std::move(metadata_storage_)) , object_storage(std::move(object_storage_)) , send_metadata(config.getBool(config_prefix + ".send_metadata", false)) - , read_resource_name(config.getString(config_prefix + ".read_resource", "")) - , write_resource_name(config.getString(config_prefix + ".write_resource", "")) + , read_resource_name_from_config(config.getString(config_prefix + ".read_resource", "")) + , write_resource_name_from_config(config.getString(config_prefix + ".write_resource", "")) , metadata_helper(std::make_unique(this, ReadSettings{}, WriteSettings{})) { data_source_description = DataSourceDescription{ @@ -83,6 +84,98 @@ DiskObjectStorage::DiskObjectStorage( .is_encrypted = false, .is_cached = object_storage->supportsCache(), }; + resource_changes_subscription = Context::getGlobalContextInstance()->getWorkloadEntityStorage().getAllEntitiesAndSubscribe( + [this] (const std::vector & events) + { + std::unique_lock lock{resource_mutex}; + + // Sets of matching resource names. 
Required to resolve possible conflicts in deterministic way + std::set new_read_resource_name_from_sql; + std::set new_write_resource_name_from_sql; + std::set new_read_resource_name_from_sql_any; + std::set new_write_resource_name_from_sql_any; + + // Current state + if (!read_resource_name_from_sql.empty()) + new_read_resource_name_from_sql.insert(read_resource_name_from_sql); + if (!write_resource_name_from_sql.empty()) + new_write_resource_name_from_sql.insert(write_resource_name_from_sql); + if (!read_resource_name_from_sql_any.empty()) + new_read_resource_name_from_sql_any.insert(read_resource_name_from_sql_any); + if (!write_resource_name_from_sql_any.empty()) + new_write_resource_name_from_sql_any.insert(write_resource_name_from_sql_any); + + // Process all updates in specified order + for (const auto & [entity_type, resource_name, resource] : events) + { + if (entity_type == WorkloadEntityType::Resource) + { + if (resource) // CREATE RESOURCE + { + auto * create = typeid_cast(resource.get()); + chassert(create); + for (const auto & [mode, disk] : create->operations) + { + if (!disk) + { + switch (mode) + { + case ASTCreateResourceQuery::AccessMode::Read: new_read_resource_name_from_sql_any.insert(resource_name); break; + case ASTCreateResourceQuery::AccessMode::Write: new_write_resource_name_from_sql_any.insert(resource_name); break; + } + } + else if (*disk == name) + { + switch (mode) + { + case ASTCreateResourceQuery::AccessMode::Read: new_read_resource_name_from_sql.insert(resource_name); break; + case ASTCreateResourceQuery::AccessMode::Write: new_write_resource_name_from_sql.insert(resource_name); break; + } + } + } + } + else // DROP RESOURCE + { + new_read_resource_name_from_sql.erase(resource_name); + new_write_resource_name_from_sql.erase(resource_name); + new_read_resource_name_from_sql_any.erase(resource_name); + new_write_resource_name_from_sql_any.erase(resource_name); + } + } + } + + String old_read_resource = getReadResourceNameNoLock(); + String old_write_resource = getWriteResourceNameNoLock(); + + // Apply changes + if (!new_read_resource_name_from_sql_any.empty()) + read_resource_name_from_sql_any = *new_read_resource_name_from_sql_any.begin(); + else + read_resource_name_from_sql_any.clear(); + + if (!new_write_resource_name_from_sql_any.empty()) + write_resource_name_from_sql_any = *new_write_resource_name_from_sql_any.begin(); + else + write_resource_name_from_sql_any.clear(); + + if (!new_read_resource_name_from_sql.empty()) + read_resource_name_from_sql = *new_read_resource_name_from_sql.begin(); + else + read_resource_name_from_sql.clear(); + + if (!new_write_resource_name_from_sql.empty()) + write_resource_name_from_sql = *new_write_resource_name_from_sql.begin(); + else + write_resource_name_from_sql.clear(); + + String new_read_resource = getReadResourceNameNoLock(); + String new_write_resource = getWriteResourceNameNoLock(); + + if (old_read_resource != new_read_resource) + LOG_INFO(log, "Using resource '{}' instead of '{}' for READ", new_read_resource, old_read_resource); + if (old_write_resource != new_write_resource) + LOG_INFO(log, "Using resource '{}' instead of '{}' for WRITE", new_write_resource, old_write_resource); + }); } StoredObjects DiskObjectStorage::getStorageObjects(const String & local_path) const @@ -480,13 +573,29 @@ static inline Settings updateIOSchedulingSettings(const Settings & settings, con String DiskObjectStorage::getReadResourceName() const { std::unique_lock lock(resource_mutex); - return read_resource_name; + return 
getReadResourceNameNoLock(); } String DiskObjectStorage::getWriteResourceName() const { std::unique_lock lock(resource_mutex); - return write_resource_name; + return getWriteResourceNameNoLock(); +} + +String DiskObjectStorage::getReadResourceNameNoLock() const +{ + if (read_resource_name_from_config.empty()) + return read_resource_name_from_sql.empty() ? read_resource_name_from_sql_any : read_resource_name_from_sql; + else + return read_resource_name_from_config; +} + +String DiskObjectStorage::getWriteResourceNameNoLock() const +{ + if (write_resource_name_from_config.empty()) + return write_resource_name_from_sql.empty() ? write_resource_name_from_sql_any : write_resource_name_from_sql; + else + return write_resource_name_from_config; } std::unique_ptr DiskObjectStorage::readFile( @@ -607,10 +716,10 @@ void DiskObjectStorage::applyNewSettings( { std::unique_lock lock(resource_mutex); - if (String new_read_resource_name = config.getString(config_prefix + ".read_resource", ""); new_read_resource_name != read_resource_name) - read_resource_name = new_read_resource_name; - if (String new_write_resource_name = config.getString(config_prefix + ".write_resource", ""); new_write_resource_name != write_resource_name) - write_resource_name = new_write_resource_name; + if (String new_read_resource_name = config.getString(config_prefix + ".read_resource", ""); new_read_resource_name != read_resource_name_from_config) + read_resource_name_from_config = new_read_resource_name; + if (String new_write_resource_name = config.getString(config_prefix + ".write_resource", ""); new_write_resource_name != write_resource_name_from_config) + write_resource_name_from_config = new_write_resource_name; } IDisk::applyNewSettings(config, context_, config_prefix, disk_map); diff --git a/src/Disks/ObjectStorages/DiskObjectStorage.h b/src/Disks/ObjectStorages/DiskObjectStorage.h index b4cdf620555..6657ee352c9 100644 --- a/src/Disks/ObjectStorages/DiskObjectStorage.h +++ b/src/Disks/ObjectStorages/DiskObjectStorage.h @@ -6,6 +6,8 @@ #include #include +#include + #include "config.h" @@ -228,6 +230,8 @@ private: String getReadResourceName() const; String getWriteResourceName() const; + String getReadResourceNameNoLock() const; + String getWriteResourceNameNoLock() const; const String object_key_prefix; LoggerPtr log; @@ -246,8 +250,13 @@ private: const bool send_metadata; mutable std::mutex resource_mutex; - String read_resource_name; - String write_resource_name; + String read_resource_name_from_config; // specified in disk config.xml read_resource element + String write_resource_name_from_config; // specified in disk config.xml write_resource element + String read_resource_name_from_sql; // described by CREATE RESOURCE query with READ DISK clause + String write_resource_name_from_sql; // described by CREATE RESOURCE query with WRITE DISK clause + String read_resource_name_from_sql_any; // described by CREATE RESOURCE query with READ ANY DISK clause + String write_resource_name_from_sql_any; // described by CREATE RESOURCE query with WRITE ANY DISK clause + scope_guard resource_changes_subscription; std::unique_ptr metadata_helper; }; diff --git a/src/Disks/ObjectStorages/HDFS/HDFSObjectStorage.cpp b/src/Disks/ObjectStorages/HDFS/HDFSObjectStorage.cpp index 182534529ea..7698193ee2f 100644 --- a/src/Disks/ObjectStorages/HDFS/HDFSObjectStorage.cpp +++ b/src/Disks/ObjectStorages/HDFS/HDFSObjectStorage.cpp @@ -103,15 +103,15 @@ std::unique_ptr HDFSObjectStorage::writeObject( /// NOL ErrorCodes::UNSUPPORTED_METHOD, "HDFS API 
doesn't support custom attributes/metadata for stored objects"); - std::string path = object.remote_path; - if (path.starts_with("/")) - path = path.substr(1); - if (!path.starts_with(url)) - path = fs::path(url) / path; - + auto path = extractObjectKeyFromURL(object); /// Single O_WRONLY in libhdfs adds O_TRUNC return std::make_unique( - path, config, settings->replication, patchSettings(write_settings), buf_size, + url_without_path, + fs::path(data_directory) / path, + config, + settings->replication, + patchSettings(write_settings), + buf_size, mode == WriteMode::Rewrite ? O_WRONLY : O_WRONLY | O_APPEND); } diff --git a/src/Disks/ObjectStorages/InMemoryDirectoryPathMap.h b/src/Disks/ObjectStorages/InMemoryDirectoryPathMap.h index ac07f3558a2..117cbad6203 100644 --- a/src/Disks/ObjectStorages/InMemoryDirectoryPathMap.h +++ b/src/Disks/ObjectStorages/InMemoryDirectoryPathMap.h @@ -2,7 +2,9 @@ #include #include +#include #include +#include #include #include #include @@ -25,10 +27,19 @@ struct InMemoryDirectoryPathMap return path1 < path2; } }; + + using FileNames = std::set; + using FileNamesIterator = FileNames::iterator; + struct FileNameIteratorComparator + { + bool operator()(const FileNames::iterator & lhs, const FileNames::iterator & rhs) const { return *lhs < *rhs; } + }; + struct RemotePathInfo { std::string path; time_t last_modified = 0; + std::set filename_iterators; }; using Map = std::map; @@ -49,9 +60,11 @@ struct InMemoryDirectoryPathMap mutable SharedMutex mutex; #ifdef OS_LINUX + FileNames TSA_GUARDED_BY(mutex) unique_filenames; Map TSA_GUARDED_BY(mutex) map; /// std::shared_mutex may not be annotated with the 'capability' attribute in libcxx. #else + FileNames unique_filenames; Map map; #endif }; diff --git a/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorage.cpp b/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorage.cpp index 5462a27c0a7..d56c5d9143c 100644 --- a/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorage.cpp +++ b/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorage.cpp @@ -220,6 +220,21 @@ void MetadataStorageFromPlainObjectStorageTransaction::removeDirectory(const std } } +void MetadataStorageFromPlainObjectStorageTransaction::createEmptyMetadataFile(const std::string & path) +{ + if (metadata_storage.object_storage->isWriteOnce()) + return; + + addOperation( + std::make_unique(path, *metadata_storage.getPathMap(), object_storage)); +} + +void MetadataStorageFromPlainObjectStorageTransaction::createMetadataFile( + const std::string & path, ObjectStorageKey /*object_key*/, uint64_t /* size_in_bytes */) +{ + createEmptyMetadataFile(path); +} + void MetadataStorageFromPlainObjectStorageTransaction::createDirectory(const std::string & path) { if (metadata_storage.object_storage->isWriteOnce()) @@ -252,12 +267,6 @@ void MetadataStorageFromPlainObjectStorageTransaction::moveDirectory(const std:: metadata_storage.getMetadataKeyPrefix())); } -void MetadataStorageFromPlainObjectStorageTransaction::addBlobToMetadata( - const std::string &, ObjectStorageKey /* object_key */, uint64_t /* size_in_bytes */) -{ - /// Noop, local metadata files is only one file, it is the metadata file itself. -} - UnlinkMetadataFileOperationOutcomePtr MetadataStorageFromPlainObjectStorageTransaction::unlinkMetadata(const std::string & path) { /// The record has become stale, remove it from cache. 
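The `InMemoryDirectoryPathMap` change above de-duplicates file names: every name is stored once in a shared `std::set<std::string>`, and each directory keeps only iterators into that set (valid as long as the pointed-to element is not erased), ordered by the string they reference. A small stand-alone sketch of the idea, with illustrative names rather than the real structures:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

using FileNames = std::set<std::string>;

// Order iterators by the string they point to, as FileNameIteratorComparator does.
struct ByValue
{
    bool operator()(FileNames::iterator a, FileNames::iterator b) const { return *a < *b; }
};

struct DirectoryInfo
{
    std::set<FileNames::iterator, ByValue> files; // iterators into the shared pool
};

int main()
{
    FileNames unique_names;                    // one copy of every file name
    std::map<std::string, DirectoryInfo> dirs; // local path -> files it contains

    auto add_file = [&](const std::string & dir, const std::string & name)
    {
        auto it = unique_names.emplace(name).first; // dedup the string itself
        dirs[dir].files.emplace(it);                // the directory stores only an iterator
    };

    // "columns.txt" is stored once even though two directories contain it.
    add_file("/db/t/all_1_1_0", "columns.txt");
    add_file("/db/t/all_2_2_0", "columns.txt");
    add_file("/db/t/all_2_2_0", "checksums.txt");

    std::cout << "unique names: " << unique_names.size() << '\n';                        // 2
    std::cout << "files in all_2_2_0: " << dirs["/db/t/all_2_2_0"].files.size() << '\n'; // 2
}
```

Listing a directory then reduces to one path-map lookup plus a walk over its iterator set, which is what the reworked `getDirectChildrenOnDisk` further down does.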
@@ -269,8 +278,11 @@ UnlinkMetadataFileOperationOutcomePtr MetadataStorageFromPlainObjectStorageTrans metadata_storage.object_metadata_cache->remove(hash.get128()); } - /// No hardlinks, so will always remove file. - return std::make_shared(UnlinkMetadataFileOperationOutcome{0}); + auto result = std::make_shared(UnlinkMetadataFileOperationOutcome{0}); + if (!metadata_storage.object_storage->isWriteOnce()) + addOperation(std::make_unique( + path, *metadata_storage.getPathMap(), object_storage)); + return result; } void MetadataStorageFromPlainObjectStorageTransaction::commit() diff --git a/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorage.h b/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorage.h index c8854bc6d19..db7390af5fd 100644 --- a/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorage.h +++ b/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorage.h @@ -114,22 +114,19 @@ public: const IMetadataStorage & getStorageForNonTransactionalReads() const override; - void addBlobToMetadata(const std::string & path, ObjectStorageKey object_key, uint64_t size_in_bytes) override; + void addBlobToMetadata(const std::string & /* path */, ObjectStorageKey /* object_key */, uint64_t /* size_in_bytes */) override + { + // Noop + } void setLastModified(const String &, const Poco::Timestamp &) override { /// Noop } - void createEmptyMetadataFile(const std::string & /* path */) override - { - /// No metadata, no need to create anything. - } + void createEmptyMetadataFile(const std::string & /* path */) override; - void createMetadataFile(const std::string & /* path */, ObjectStorageKey /* object_key */, uint64_t /* size_in_bytes */) override - { - /// Noop - } + void createMetadataFile(const std::string & /* path */, ObjectStorageKey /* object_key */, uint64_t /* size_in_bytes */) override; void createDirectory(const std::string & path) override; diff --git a/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorageOperations.cpp b/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorageOperations.cpp index d2e0243a4cf..ea57d691908 100644 --- a/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorageOperations.cpp +++ b/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorageOperations.cpp @@ -1,6 +1,8 @@ #include "MetadataStorageFromPlainObjectStorageOperations.h" #include +#include +#include #include #include #include @@ -76,7 +78,7 @@ void MetadataStorageFromPlainObjectStorageCreateDirectoryOperation::execute(std: std::lock_guard lock(path_map.mutex); auto & map = path_map.map; [[maybe_unused]] auto result - = map.emplace(base_path, InMemoryDirectoryPathMap::RemotePathInfo{object_key_prefix, Poco::Timestamp{}.epochTime()}); + = map.emplace(base_path, InMemoryDirectoryPathMap::RemotePathInfo{object_key_prefix, Poco::Timestamp{}.epochTime(), {}}); chassert(result.second); } auto metric = object_storage->getMetadataStorageMetrics().directory_map_size; @@ -287,4 +289,122 @@ void MetadataStorageFromPlainObjectStorageRemoveDirectoryOperation::undo(std::un CurrentMetrics::add(metric, 1); } +MetadataStorageFromPlainObjectStorageWriteFileOperation::MetadataStorageFromPlainObjectStorageWriteFileOperation( + const std::string & path_, InMemoryDirectoryPathMap & path_map_, ObjectStoragePtr object_storage_) + : path(path_), path_map(path_map_), object_storage(object_storage_) +{ +} + +void MetadataStorageFromPlainObjectStorageWriteFileOperation::execute(std::unique_lock &) +{ + 
LOG_TEST(getLogger("MetadataStorageFromPlainObjectStorageWriteFileOperation"), "Creating metadata for a file '{}'", path); + + std::lock_guard lock(path_map.mutex); + + auto it = path_map.map.find(path.parent_path()); + /// Some paths (e.g., clickhouse_access_check) may not have parent directories. + if (it == path_map.map.end()) + LOG_TRACE( + getLogger("MetadataStorageFromPlainObjectStorageWriteFileOperation"), + "Parent directory does not exist, skipping path {}", + path); + else + { + auto [filename_it, inserted] = path_map.unique_filenames.emplace(path.filename()); + if (inserted) + { + auto metric = object_storage->getMetadataStorageMetrics().unique_filenames_count; + CurrentMetrics::add(metric, 1); + } + written = it->second.filename_iterators.emplace(filename_it).second; + if (written) + { + auto metric = object_storage->getMetadataStorageMetrics().file_count; + CurrentMetrics::add(metric, 1); + } + } +} + +void MetadataStorageFromPlainObjectStorageWriteFileOperation::undo(std::unique_lock &) +{ + if (written) + { + std::lock_guard lock(path_map.mutex); + auto it = path_map.map.find(path.parent_path()); + chassert(it != path_map.map.end()); + if (it != path_map.map.end()) + { + auto filename_it = path_map.unique_filenames.find(path.filename()); + if (filename_it != path_map.unique_filenames.end()) + { + if (it->second.filename_iterators.erase(filename_it) > 0) + { + auto metric = object_storage->getMetadataStorageMetrics().file_count; + CurrentMetrics::sub(metric, 1); + } + } + } + } +} + +MetadataStorageFromPlainObjectStorageUnlinkMetadataFileOperation::MetadataStorageFromPlainObjectStorageUnlinkMetadataFileOperation( + std::filesystem::path && path_, InMemoryDirectoryPathMap & path_map_, ObjectStoragePtr object_storage_) + : path(path_) + , remote_path(std::filesystem::path(object_storage_->generateObjectKeyForPath(path_, std::nullopt).serialize())) + , path_map(path_map_) + , object_storage(object_storage_) +{ +} + +void MetadataStorageFromPlainObjectStorageUnlinkMetadataFileOperation::execute(std::unique_lock &) +{ + LOG_TEST( + getLogger("MetadataStorageFromPlainObjectStorageUnlinkMetadataFileOperation"), + "Unlinking metadata for a write '{}' with remote path '{}'", + path, + remote_path); + + std::lock_guard lock(path_map.mutex); + auto it = path_map.map.find(path.parent_path()); + if (it == path_map.map.end()) + LOG_TRACE( + getLogger("MetadataStorageFromPlainObjectStorageUnlinkMetadataFileOperation"), + "Parent directory does not exist, skipping path {}", + path); + else + { + auto & filename_iterators = it->second.filename_iterators; + auto filename_it = path_map.unique_filenames.find(path.filename()); + if (filename_it != path_map.unique_filenames.end()) + unlinked = (filename_iterators.erase(filename_it) > 0); + + if (unlinked) + { + auto metric = object_storage->getMetadataStorageMetrics().file_count; + CurrentMetrics::sub(metric, 1); + } + } +} + +void MetadataStorageFromPlainObjectStorageUnlinkMetadataFileOperation::undo(std::unique_lock &) +{ + if (unlinked) + { + std::lock_guard lock(path_map.mutex); + auto it = path_map.map.find(path.parent_path()); + chassert(it != path_map.map.end()); + if (it != path_map.map.end()) + { + auto filename_it = path_map.unique_filenames.find(path.filename()); + if (filename_it != path_map.unique_filenames.end()) + { + if (it->second.filename_iterators.emplace(filename_it).second) + { + auto metric = object_storage->getMetadataStorageMetrics().file_count; + CurrentMetrics::add(metric, 1); + } + } + } + } +} } diff --git 
a/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorageOperations.h b/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorageOperations.h index 00f1d191b47..565d4429548 100644 --- a/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorageOperations.h +++ b/src/Disks/ObjectStorages/MetadataStorageFromPlainObjectStorageOperations.h @@ -87,4 +87,38 @@ public: void undo(std::unique_lock & metadata_lock) override; }; +class MetadataStorageFromPlainObjectStorageWriteFileOperation final : public IMetadataOperation +{ +private: + std::filesystem::path path; + InMemoryDirectoryPathMap & path_map; + ObjectStoragePtr object_storage; + + bool written = false; + +public: + MetadataStorageFromPlainObjectStorageWriteFileOperation( + const std::string & path, InMemoryDirectoryPathMap & path_map_, ObjectStoragePtr object_storage_); + + void execute(std::unique_lock & metadata_lock) override; + void undo(std::unique_lock & metadata_lock) override; +}; + +class MetadataStorageFromPlainObjectStorageUnlinkMetadataFileOperation final : public IMetadataOperation +{ +private: + std::filesystem::path path; + std::filesystem::path remote_path; + InMemoryDirectoryPathMap & path_map; + ObjectStoragePtr object_storage; + + bool unlinked = false; + +public: + MetadataStorageFromPlainObjectStorageUnlinkMetadataFileOperation( + std::filesystem::path && path_, InMemoryDirectoryPathMap & path_map_, ObjectStoragePtr object_storage_); + + void execute(std::unique_lock & metadata_lock) override; + void undo(std::unique_lock & metadata_lock) override; +}; } diff --git a/src/Disks/ObjectStorages/MetadataStorageFromPlainRewritableObjectStorage.cpp b/src/Disks/ObjectStorages/MetadataStorageFromPlainRewritableObjectStorage.cpp index 115b3bc0616..6966c0053b3 100644 --- a/src/Disks/ObjectStorages/MetadataStorageFromPlainRewritableObjectStorage.cpp +++ b/src/Disks/ObjectStorages/MetadataStorageFromPlainRewritableObjectStorage.cpp @@ -3,18 +3,24 @@ #include #include +#include +#include #include #include +#include +#include #include +#include #include #include #include #include #include -#include "Common/Exception.h" +#include #include #include #include +#include #include "CommonPathPrefixKeyGenerator.h" @@ -45,6 +51,61 @@ std::string getMetadataKeyPrefix(ObjectStoragePtr object_storage) : metadata_key_prefix; } +void loadDirectoryTree( + InMemoryDirectoryPathMap::Map & map, InMemoryDirectoryPathMap::FileNames & unique_filenames, ObjectStoragePtr object_storage) +{ + using FileNamesIterator = InMemoryDirectoryPathMap::FileNamesIterator; + using FileNameIteratorComparator = InMemoryDirectoryPathMap::FileNameIteratorComparator; + const auto common_key_prefix = object_storage->getCommonKeyPrefix(); + ThreadPool & pool = getIOThreadPool().get(); + ThreadPoolCallbackRunnerLocal runner(pool, "PlainRWTreeLoad"); + + std::atomic num_files = 0; + LOG_DEBUG(getLogger("MetadataStorageFromPlainObjectStorage"), "Loading directory tree"); + std::mutex mutex; + for (auto & item : map) + { + auto & remote_path_info = item.second; + const auto remote_path = std::filesystem::path(common_key_prefix) / remote_path_info.path / ""; + runner( + [remote_path, &mutex, &remote_path_info, &unique_filenames, &object_storage, &num_files] + { + setThreadName("PlainRWTreeLoad"); + std::set filename_iterators; + for (auto iterator = object_storage->iterate(remote_path, 0); iterator->isValid(); iterator->next()) + { + auto file = iterator->current(); + String path = file->getPath(); + chassert(path.starts_with(remote_path.string())); + auto 
filename = std::filesystem::path(path).filename(); + /// Check that the file is a direct child. + if (path.substr(remote_path.string().size()) == filename) + { + auto filename_it = unique_filenames.end(); + { + std::lock_guard lock(mutex); + filename_it = unique_filenames.emplace(filename).first; + } + auto inserted = filename_iterators.emplace(filename_it).second; + chassert(inserted); + if (inserted) + ++num_files; + } + } + + auto metric = object_storage->getMetadataStorageMetrics().file_count; + CurrentMetrics::add(metric, filename_iterators.size()); + remote_path_info.filename_iterators = std::move(filename_iterators); + }); + } + runner.waitForAllToFinishAndRethrowFirstError(); + LOG_DEBUG( + getLogger("MetadataStorageFromPlainObjectStorage"), + "Loaded directory tree for {} directories, found {} files", + map.size(), + num_files); +} + std::shared_ptr loadPathPrefixMap(const std::string & metadata_key_prefix, ObjectStoragePtr object_storage) { auto result = std::make_shared(); @@ -62,6 +123,9 @@ std::shared_ptr loadPathPrefixMap(const std::string & LOG_DEBUG(log, "Loading metadata"); size_t num_files = 0; + + std::mutex mutex; + InMemoryDirectoryPathMap::Map map; for (auto iterator = object_storage->iterate(metadata_key_prefix, 0); iterator->isValid(); iterator->next()) { ++num_files; @@ -72,7 +136,7 @@ std::shared_ptr loadPathPrefixMap(const std::string & continue; runner( - [remote_metadata_path, path, &object_storage, &result, &log, &settings, &metadata_key_prefix] + [remote_metadata_path, path, &object_storage, &mutex, &map, &log, &settings, &metadata_key_prefix] { setThreadName("PlainRWMetaLoad"); @@ -109,13 +173,13 @@ std::shared_ptr loadPathPrefixMap(const std::string & chassert(remote_metadata_path.has_parent_path()); chassert(remote_metadata_path.string().starts_with(metadata_key_prefix)); auto suffix = remote_metadata_path.string().substr(metadata_key_prefix.size()); - auto remote_path = std::filesystem::path(std::move(suffix)); + auto rel_path = std::filesystem::path(std::move(suffix)); std::pair res; { - std::lock_guard lock(result->mutex); - res = result->map.emplace( + std::lock_guard lock(mutex); + res = map.emplace( std::filesystem::path(local_path).parent_path(), - InMemoryDirectoryPathMap::RemotePathInfo{remote_path.parent_path(), last_modified.epochTime()}); + InMemoryDirectoryPathMap::RemotePathInfo{rel_path.parent_path(), last_modified.epochTime(), {}}); } /// This can happen if table replication is enabled, then the same local path is written @@ -126,14 +190,19 @@ std::shared_ptr loadPathPrefixMap(const std::string & "The local path '{}' is already mapped to a remote path '{}', ignoring: '{}'", local_path, res.first->second.path, - remote_path.parent_path().string()); + rel_path.parent_path().string()); }); } runner.waitForAllToFinishAndRethrowFirstError(); + + InMemoryDirectoryPathMap::FileNames unique_filenames; + LOG_DEBUG(log, "Loaded metadata for {} files, found {} directories", num_files, map.size()); + loadDirectoryTree(map, unique_filenames, object_storage); { - SharedLockGuard lock(result->mutex); - LOG_DEBUG(log, "Loaded metadata for {} files, found {} directories", num_files, result->map.size()); + std::lock_guard lock(result->mutex); + result->map = std::move(map); + result->unique_filenames = std::move(unique_filenames); auto metric = object_storage->getMetadataStorageMetrics().directory_map_size; CurrentMetrics::add(metric, result->map.size()); @@ -141,55 +210,6 @@ std::shared_ptr loadPathPrefixMap(const std::string & return result; } -void 
getDirectChildrenOnDiskImpl( - const std::string & storage_key, - const RelativePathsWithMetadata & remote_paths, - const std::string & local_path, - const InMemoryDirectoryPathMap & path_map, - std::unordered_set & result) -{ - /// Directories are retrieved from the in-memory path map. - { - SharedLockGuard lock(path_map.mutex); - const auto & local_path_prefixes = path_map.map; - const auto end_it = local_path_prefixes.end(); - for (auto it = local_path_prefixes.lower_bound(local_path); it != end_it; ++it) - { - const auto & [k, _] = std::make_tuple(it->first.string(), it->second); - if (!k.starts_with(local_path)) - break; - - auto slash_num = count(k.begin() + local_path.size(), k.end(), '/'); - /// The local_path_prefixes comparator ensures that the paths with the smallest number of - /// hops from the local_path are iterated first. The paths do not end with '/', hence - /// break the loop if the number of slashes is greater than 0. - if (slash_num != 0) - break; - - result.emplace(std::string(k.begin() + local_path.size(), k.end()) + "/"); - } - } - - /// Files. - auto skip_list = std::set{PREFIX_PATH_FILE_NAME}; - for (const auto & elem : remote_paths) - { - const auto & path = elem->relative_path; - chassert(path.find(storage_key) == 0); - const auto child_pos = storage_key.size(); - - auto slash_pos = path.find('/', child_pos); - - if (slash_pos == std::string::npos) - { - /// File names. - auto filename = path.substr(child_pos); - if (!skip_list.contains(filename)) - result.emplace(std::move(filename)); - } - } -} - } MetadataStorageFromPlainRewritableObjectStorage::MetadataStorageFromPlainRewritableObjectStorage( @@ -215,6 +235,9 @@ MetadataStorageFromPlainRewritableObjectStorage::MetadataStorageFromPlainRewrita auto keys_gen = std::make_shared(object_storage->getCommonKeyPrefix(), path_map); object_storage->setKeysGenerator(keys_gen); } + + auto metric = object_storage->getMetadataStorageMetrics().unique_filenames_count; + CurrentMetrics::add(metric, path_map->unique_filenames.size()); } MetadataStorageFromPlainRewritableObjectStorage::~MetadataStorageFromPlainRewritableObjectStorage() @@ -246,17 +269,8 @@ bool MetadataStorageFromPlainRewritableObjectStorage::existsDirectory(const std: std::vector MetadataStorageFromPlainRewritableObjectStorage::listDirectory(const std::string & path) const { - auto key_prefix = object_storage->generateObjectKeyForPath(path, "" /* key_prefix */).serialize(); - - RelativePathsWithMetadata files; - auto absolute_key = std::filesystem::path(object_storage->getCommonKeyPrefix()) / key_prefix / ""; - - object_storage->listObjects(absolute_key, files, 0); - - std::unordered_set directories; - getDirectChildrenOnDisk(absolute_key, files, std::filesystem::path(path) / "", directories); - - return std::vector(std::make_move_iterator(directories.begin()), std::make_move_iterator(directories.end())); + std::unordered_set result = getDirectChildrenOnDisk(std::filesystem::path(path) / ""); + return std::vector(std::make_move_iterator(result.begin()), std::make_move_iterator(result.end())); } std::optional MetadataStorageFromPlainRewritableObjectStorage::getLastModifiedIfExists(const String & path) const @@ -271,13 +285,41 @@ std::optional MetadataStorageFromPlainRewritableObjectStorage:: return std::nullopt; } -void MetadataStorageFromPlainRewritableObjectStorage::getDirectChildrenOnDisk( - const std::string & storage_key, - const RelativePathsWithMetadata & remote_paths, - const std::string & local_path, - std::unordered_set & result) const 
+std::unordered_set +MetadataStorageFromPlainRewritableObjectStorage::getDirectChildrenOnDisk(const std::filesystem::path & local_path) const { - getDirectChildrenOnDiskImpl(storage_key, remote_paths, local_path, *getPathMap(), result); + std::unordered_set result; + SharedLockGuard lock(path_map->mutex); + const auto end_it = path_map->map.end(); + /// Directories. + for (auto it = path_map->map.lower_bound(local_path); it != end_it; ++it) + { + const auto & subdirectory = it->first.string(); + if (!subdirectory.starts_with(local_path.string())) + break; + + auto slash_num = count(subdirectory.begin() + local_path.string().size(), subdirectory.end(), '/'); + /// The directory map comparator ensures that the paths with the smallest number of + /// hops from the local_path are iterated first. The paths do not end with '/', hence + /// break the loop if the number of slashes to the right from the offset is greater than 0. + if (slash_num != 0) + break; + + result.emplace(std::string(subdirectory.begin() + local_path.string().size(), subdirectory.end()) + "/"); + } + + /// Files. + auto it = path_map->map.find(local_path.parent_path()); + if (it != path_map->map.end()) + { + for (const auto & filename_it : it->second.filename_iterators) + { + chassert(filename_it != path_map->unique_filenames.end()); + result.insert(*filename_it); + } + } + + return result; } bool MetadataStorageFromPlainRewritableObjectStorage::useSeparateLayoutForMetadata() const diff --git a/src/Disks/ObjectStorages/MetadataStorageFromPlainRewritableObjectStorage.h b/src/Disks/ObjectStorages/MetadataStorageFromPlainRewritableObjectStorage.h index 31a7dbe8307..983e379d292 100644 --- a/src/Disks/ObjectStorages/MetadataStorageFromPlainRewritableObjectStorage.h +++ b/src/Disks/ObjectStorages/MetadataStorageFromPlainRewritableObjectStorage.h @@ -35,11 +35,7 @@ public: protected: std::string getMetadataKeyPrefix() const override { return metadata_key_prefix; } std::shared_ptr getPathMap() const override { return path_map; } - void getDirectChildrenOnDisk( - const std::string & storage_key, - const RelativePathsWithMetadata & remote_paths, - const std::string & local_path, - std::unordered_set & result) const; + std::unordered_set getDirectChildrenOnDisk(const std::filesystem::path & local_path) const; private: bool useSeparateLayoutForMetadata() const; diff --git a/src/Disks/ObjectStorages/MetadataStorageMetrics.h b/src/Disks/ObjectStorages/MetadataStorageMetrics.h index 365fd3c8145..ab21f68f90d 100644 --- a/src/Disks/ObjectStorages/MetadataStorageMetrics.h +++ b/src/Disks/ObjectStorages/MetadataStorageMetrics.h @@ -13,6 +13,8 @@ struct MetadataStorageMetrics const ProfileEvents::Event directory_removed = ProfileEvents::end(); CurrentMetrics::Metric directory_map_size = CurrentMetrics::end(); + CurrentMetrics::Metric unique_filenames_count = CurrentMetrics::end(); + CurrentMetrics::Metric file_count = CurrentMetrics::end(); template static MetadataStorageMetrics create() diff --git a/src/Disks/ObjectStorages/S3/diskSettings.cpp b/src/Disks/ObjectStorages/S3/diskSettings.cpp index 1ae3730e4c7..92be835560b 100644 --- a/src/Disks/ObjectStorages/S3/diskSettings.cpp +++ b/src/Disks/ObjectStorages/S3/diskSettings.cpp @@ -177,7 +177,7 @@ std::unique_ptr getClient( auth_settings[S3AuthSetting::secret_access_key], auth_settings[S3AuthSetting::server_side_encryption_customer_key_base64], auth_settings.server_side_encryption_kms_config, - auth_settings.headers, + auth_settings.getHeaders(), credentials_configuration, 
auth_settings[S3AuthSetting::session_token]); } diff --git a/src/Disks/ObjectStorages/createMetadataStorageMetrics.h b/src/Disks/ObjectStorages/createMetadataStorageMetrics.h index 5cf1fbef2ab..bc2ccec9d85 100644 --- a/src/Disks/ObjectStorages/createMetadataStorageMetrics.h +++ b/src/Disks/ObjectStorages/createMetadataStorageMetrics.h @@ -24,8 +24,14 @@ extern const Event DiskPlainRewritableS3DirectoryRemoved; namespace CurrentMetrics { extern const Metric DiskPlainRewritableAzureDirectoryMapSize; +extern const Metric DiskPlainRewritableAzureUniqueFileNamesCount; +extern const Metric DiskPlainRewritableAzureFileCount; extern const Metric DiskPlainRewritableLocalDirectoryMapSize; +extern const Metric DiskPlainRewritableLocalUniqueFileNamesCount; +extern const Metric DiskPlainRewritableLocalFileCount; extern const Metric DiskPlainRewritableS3DirectoryMapSize; +extern const Metric DiskPlainRewritableS3UniqueFileNamesCount; +extern const Metric DiskPlainRewritableS3FileCount; } namespace DB @@ -38,7 +44,9 @@ inline MetadataStorageMetrics MetadataStorageMetrics::create(column); - auto tmp_nested = col_lc.getDictionary().getNestedColumn()->cloneEmpty(); + auto tmp_nested = removeNullable(col_lc.getDictionary().getNestedColumn()->cloneEmpty())->assumeMutable(); if (!nested->insertResultToColumn(*tmp_nested, element, insert_settings, format_settings, error)) return false; diff --git a/src/Formats/MarkInCompressedFile.h b/src/Formats/MarkInCompressedFile.h index 06ed1476410..e1bcda61b39 100644 --- a/src/Formats/MarkInCompressedFile.h +++ b/src/Formats/MarkInCompressedFile.h @@ -119,4 +119,6 @@ private: std::tuple lookUpMark(size_t idx) const; }; +using PlainMarksByName = std::unordered_map>; + } diff --git a/src/Functions/FunctionsComparison.h b/src/Functions/FunctionsComparison.h index eaa5620cb38..b201d5111c5 100644 --- a/src/Functions/FunctionsComparison.h +++ b/src/Functions/FunctionsComparison.h @@ -1171,7 +1171,7 @@ public: if (left_tuple && right_tuple) { - auto func = FunctionToOverloadResolverAdaptor(std::make_shared>(check_decimal_overflow)); + auto func = std::make_shared(std::make_shared>(check_decimal_overflow)); bool has_nullable = false; bool has_null = false; @@ -1181,7 +1181,7 @@ public: { ColumnsWithTypeAndName args = {{nullptr, left_tuple->getElements()[i], ""}, {nullptr, right_tuple->getElements()[i], ""}}; - auto element_type = func.build(args)->getResultType(); + auto element_type = func->build(args)->getResultType(); has_nullable = has_nullable || element_type->isNullable(); has_null = has_null || element_type->onlyNull(); } diff --git a/src/Functions/IFunction.cpp b/src/Functions/IFunction.cpp index 10a25cfe0d0..68d4f25f08d 100644 --- a/src/Functions/IFunction.cpp +++ b/src/Functions/IFunction.cpp @@ -1,27 +1,28 @@ #include #include -#include -#include -#include +#include +#include +#include +#include +#include +#include #include #include -#include -#include -#include -#include -#include -#include +#include #include #include #include -#include #include -#include -#include +#include +#include +#include #include "config.h" +#include +#include + #if USE_EMBEDDED_COMPILER # include #endif @@ -451,6 +452,7 @@ FunctionBasePtr IFunctionOverloadResolver::build(const ColumnsWithTypeAndName & /// Use FunctionBaseDynamicAdaptor if default implementation for Dynamic is enabled and we have Dynamic type in arguments. 
if (useDefaultImplementationForDynamic()) { + checkNumberOfArguments(arguments.size()); for (const auto & arg : arguments) { if (isDynamic(arg.type)) diff --git a/src/Functions/array/FunctionsMapMiscellaneous.cpp b/src/Functions/array/FunctionsMapMiscellaneous.cpp index 368c0ad620f..c3586a57161 100644 --- a/src/Functions/array/FunctionsMapMiscellaneous.cpp +++ b/src/Functions/array/FunctionsMapMiscellaneous.cpp @@ -349,19 +349,14 @@ struct MapKeyLikeAdapter } }; -struct FunctionIdentityMap : public FunctionIdentity -{ - bool useDefaultImplementationForLowCardinalityColumns() const override { return false; } -}; - struct NameMapConcat { static constexpr auto name = "mapConcat"; }; using FunctionMapConcat = FunctionMapToArrayAdapter, NameMapConcat>; struct NameMapKeys { static constexpr auto name = "mapKeys"; }; -using FunctionMapKeys = FunctionMapToArrayAdapter, NameMapKeys>; +using FunctionMapKeys = FunctionMapToArrayAdapter, NameMapKeys>; struct NameMapValues { static constexpr auto name = "mapValues"; }; -using FunctionMapValues = FunctionMapToArrayAdapter, NameMapValues>; +using FunctionMapValues = FunctionMapToArrayAdapter, NameMapValues>; struct NameMapContains { static constexpr auto name = "mapContains"; }; using FunctionMapContains = FunctionMapToArrayAdapter, MapToSubcolumnAdapter, NameMapContains>; diff --git a/src/Functions/transform.cpp b/src/Functions/transform.cpp index 45f0a7f5c17..e5445b36809 100644 --- a/src/Functions/transform.cpp +++ b/src/Functions/transform.cpp @@ -211,7 +211,7 @@ namespace ColumnsWithTypeAndName args = arguments; args[0].column = args[0].column->cloneResized(input_rows_count)->convertToFullColumnIfConst(); - auto impl = FunctionToOverloadResolverAdaptor(std::make_shared()).build(args); + auto impl = std::make_shared(std::make_shared())->build(args); return impl->execute(args, result_type, input_rows_count); } diff --git a/src/IO/Archives/createArchiveReader.cpp b/src/IO/Archives/createArchiveReader.cpp index dfa098eede0..97597cc4db7 100644 --- a/src/IO/Archives/createArchiveReader.cpp +++ b/src/IO/Archives/createArchiveReader.cpp @@ -43,7 +43,10 @@ std::shared_ptr createArchiveReader( else if (hasSupported7zExtension(path_to_archive)) { #if USE_LIBARCHIVE - return std::make_shared(path_to_archive); + if (archive_read_function) + throw Exception(ErrorCodes::CANNOT_UNPACK_ARCHIVE, "7z archive supports only local files reading"); + else + return std::make_shared(path_to_archive); #else throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "libarchive library is disabled"); #endif diff --git a/src/IO/BufferWithOwnMemory.h b/src/IO/BufferWithOwnMemory.h index da38bccdea1..79b1bb67aaa 100644 --- a/src/IO/BufferWithOwnMemory.h +++ b/src/IO/BufferWithOwnMemory.h @@ -44,16 +44,10 @@ struct Memory : boost::noncopyable, Allocator char * m_data = nullptr; size_t alignment = 0; - [[maybe_unused]] bool allow_gwp_asan_force_sample{false}; - Memory() = default; /// If alignment != 0, then allocate memory aligned to specified value. 
- explicit Memory(size_t size_, size_t alignment_ = 0, bool allow_gwp_asan_force_sample_ = false) - : alignment(alignment_), allow_gwp_asan_force_sample(allow_gwp_asan_force_sample_) - { - alloc(size_); - } + explicit Memory(size_t size_, size_t alignment_ = 0) : alignment(alignment_) { alloc(size_); } ~Memory() { @@ -133,11 +127,6 @@ private: ProfileEvents::increment(ProfileEvents::IOBufferAllocs); ProfileEvents::increment(ProfileEvents::IOBufferAllocBytes, new_capacity); -#if USE_GWP_ASAN - if (unlikely(allow_gwp_asan_force_sample && GWPAsan::shouldForceSample())) - gwp_asan::getThreadLocals()->NextSampleCounter = 1; -#endif - m_data = static_cast(Allocator::alloc(new_capacity, alignment)); m_capacity = new_capacity; m_size = new_size; @@ -165,7 +154,7 @@ protected: public: /// If non-nullptr 'existing_memory' is passed, then buffer will not create its own memory and will use existing_memory without ownership. explicit BufferWithOwnMemory(size_t size = DBMS_DEFAULT_BUFFER_SIZE, char * existing_memory = nullptr, size_t alignment = 0) - : Base(nullptr, 0), memory(existing_memory ? 0 : size, alignment, /*allow_gwp_asan_force_sample_=*/true) + : Base(nullptr, 0), memory(existing_memory ? 0 : size, alignment) { Base::set(existing_memory ? existing_memory : memory.data(), size); Base::padded = !existing_memory; diff --git a/src/IO/S3/Credentials.cpp b/src/IO/S3/Credentials.cpp index a3f671e76d9..cde9a7a3662 100644 --- a/src/IO/S3/Credentials.cpp +++ b/src/IO/S3/Credentials.cpp @@ -1,3 +1,5 @@ +#include +#include #include #include @@ -693,6 +695,7 @@ S3CredentialsProviderChain::S3CredentialsProviderChain( static const char AWS_ECS_CONTAINER_CREDENTIALS_RELATIVE_URI[] = "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"; static const char AWS_ECS_CONTAINER_CREDENTIALS_FULL_URI[] = "AWS_CONTAINER_CREDENTIALS_FULL_URI"; static const char AWS_ECS_CONTAINER_AUTHORIZATION_TOKEN[] = "AWS_CONTAINER_AUTHORIZATION_TOKEN"; + static const char AWS_ECS_CONTAINER_AUTHORIZATION_TOKEN_PATH[] = "AWS_CONTAINER_AUTHORIZATION_TOKEN_PATH"; static const char AWS_EC2_METADATA_DISABLED[] = "AWS_EC2_METADATA_DISABLED"; /// The only difference from DefaultAWSCredentialsProviderChain::DefaultAWSCredentialsProviderChain() @@ -750,7 +753,22 @@ S3CredentialsProviderChain::S3CredentialsProviderChain( } else if (!absolute_uri.empty()) { - const auto token = Aws::Environment::GetEnv(AWS_ECS_CONTAINER_AUTHORIZATION_TOKEN); + auto token = Aws::Environment::GetEnv(AWS_ECS_CONTAINER_AUTHORIZATION_TOKEN); + const auto token_path = Aws::Environment::GetEnv(AWS_ECS_CONTAINER_AUTHORIZATION_TOKEN_PATH); + + if (!token_path.empty()) + { + LOG_INFO(logger, "The environment variable value {} is {}", AWS_ECS_CONTAINER_AUTHORIZATION_TOKEN_PATH, token_path); + + String token_from_file; + + ReadBufferFromFile in(token_path); + readStringUntilEOF(token_from_file, in); + Poco::trimInPlace(token_from_file); + + token = token_from_file; + } + AddProvider(std::make_shared(absolute_uri.c_str(), token.c_str())); /// DO NOT log the value of the authorization token for security purposes. 
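The Credentials.cpp hunk above teaches the ECS credentials provider to honor `AWS_CONTAINER_AUTHORIZATION_TOKEN_PATH`: when that variable is set, the token is read from the referenced file, trimmed, and used in place of the inline `AWS_CONTAINER_AUTHORIZATION_TOKEN` value. Below is a minimal standalone sketch of that resolution order, assuming only the C++ standard library; it replaces ClickHouse's `ReadBufferFromFile`, `readStringUntilEOF`, and `Poco::trimInPlace` with std equivalents purely for self-containment, and the `main()` wrapper and helper names are illustrative, not part of the patch.

```cpp
// Standalone sketch (not ClickHouse code): resolve the ECS task authorization token
// the same way the patch does -- a token file, if configured, overrides the inline value.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

static std::string trimCopy(const std::string & s)
{
    const char * ws = " \t\r\n";
    const auto begin = s.find_first_not_of(ws);
    if (begin == std::string::npos)
        return {};
    const auto end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}

static std::string getEnvOrEmpty(const char * name)
{
    const char * value = std::getenv(name);
    return value ? value : "";
}

int main()
{
    // Inline token, used only if no token file is configured.
    std::string token = getEnvOrEmpty("AWS_CONTAINER_AUTHORIZATION_TOKEN");

    // If a token file path is given, its trimmed contents take precedence.
    const std::string token_path = getEnvOrEmpty("AWS_CONTAINER_AUTHORIZATION_TOKEN_PATH");
    if (!token_path.empty())
    {
        std::ifstream in(token_path);
        std::ostringstream contents;
        contents << in.rdbuf();
        token = trimCopy(contents.str());
    }

    // As in the patch, never log the token itself; only report whether one was resolved.
    std::cout << "authorization token " << (token.empty() ? "not set" : "resolved") << '\n';
    return 0;
}
```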
diff --git a/src/IO/S3AuthSettings.cpp b/src/IO/S3AuthSettings.cpp index 799dc6692fa..5d7d4678977 100644 --- a/src/IO/S3AuthSettings.cpp +++ b/src/IO/S3AuthSettings.cpp @@ -105,7 +105,9 @@ S3AuthSettings::S3AuthSettings( } } - headers = getHTTPHeaders(config_prefix, config); + headers = getHTTPHeaders(config_prefix, config, "header"); + access_headers = getHTTPHeaders(config_prefix, config, "access_header"); + server_side_encryption_kms_config = getSSEKMSConfig(config_prefix, config); Poco::Util::AbstractConfiguration::Keys keys; @@ -119,6 +121,7 @@ S3AuthSettings::S3AuthSettings( S3AuthSettings::S3AuthSettings(const S3AuthSettings & settings) : headers(settings.headers) + , access_headers(settings.access_headers) , users(settings.users) , server_side_encryption_kms_config(settings.server_side_encryption_kms_config) , impl(std::make_unique(*settings.impl)) @@ -127,6 +130,7 @@ S3AuthSettings::S3AuthSettings(const S3AuthSettings & settings) S3AuthSettings::S3AuthSettings(S3AuthSettings && settings) noexcept : headers(std::move(settings.headers)) + , access_headers(std::move(settings.access_headers)) , users(std::move(settings.users)) , server_side_encryption_kms_config(std::move(settings.server_side_encryption_kms_config)) , impl(std::make_unique(std::move(*settings.impl))) @@ -145,6 +149,7 @@ S3AUTH_SETTINGS_SUPPORTED_TYPES(S3AuthSettings, IMPLEMENT_SETTING_SUBSCRIPT_OPER S3AuthSettings & S3AuthSettings::operator=(S3AuthSettings && settings) noexcept { headers = std::move(settings.headers); + access_headers = std::move(settings.access_headers); users = std::move(settings.users); server_side_encryption_kms_config = std::move(settings.server_side_encryption_kms_config); *impl = std::move(*settings.impl); @@ -157,6 +162,9 @@ bool S3AuthSettings::operator==(const S3AuthSettings & right) if (headers != right.headers) return false; + if (access_headers != right.access_headers) + return false; + if (users != right.users) return false; @@ -196,6 +204,9 @@ void S3AuthSettings::updateIfChanged(const S3AuthSettings & settings) if (!settings.headers.empty()) headers = settings.headers; + if (!settings.access_headers.empty()) + access_headers = settings.access_headers; + if (!settings.users.empty()) users.insert(settings.users.begin(), settings.users.end()); @@ -205,6 +216,17 @@ void S3AuthSettings::updateIfChanged(const S3AuthSettings & settings) server_side_encryption_kms_config = settings.server_side_encryption_kms_config; } +HTTPHeaderEntries S3AuthSettings::getHeaders() const +{ + bool auth_settings_is_default = !impl->isChanged("access_key_id"); + if (access_headers.empty() || !auth_settings_is_default) + return headers; + + HTTPHeaderEntries result(headers); + result.insert(result.end(), access_headers.begin(), access_headers.end()); + + return result; +} } } diff --git a/src/IO/S3AuthSettings.h b/src/IO/S3AuthSettings.h index 4026adb1e68..38f46cfeccd 100644 --- a/src/IO/S3AuthSettings.h +++ b/src/IO/S3AuthSettings.h @@ -55,8 +55,11 @@ struct S3AuthSettings bool hasUpdates(const S3AuthSettings & other) const; void updateIfChanged(const S3AuthSettings & settings); bool canBeUsedByUser(const String & user) const { return users.empty() || users.contains(user); } + HTTPHeaderEntries getHeaders() const; HTTPHeaderEntries headers; + HTTPHeaderEntries access_headers; + std::unordered_set users; ServerSideEncryptionKMSConfig server_side_encryption_kms_config; diff --git a/src/IO/S3Common.cpp b/src/IO/S3Common.cpp index 5c1ee6ccc78..f12de6a7b54 100644 --- a/src/IO/S3Common.cpp +++ b/src/IO/S3Common.cpp @@ 
-74,14 +74,14 @@ namespace ErrorCodes namespace S3 { -HTTPHeaderEntries getHTTPHeaders(const std::string & config_elem, const Poco::Util::AbstractConfiguration & config) +HTTPHeaderEntries getHTTPHeaders(const std::string & config_elem, const Poco::Util::AbstractConfiguration & config, const std::string header_key) { HTTPHeaderEntries headers; Poco::Util::AbstractConfiguration::Keys subconfig_keys; config.keys(config_elem, subconfig_keys); for (const std::string & subkey : subconfig_keys) { - if (subkey.starts_with("header")) + if (subkey.starts_with(header_key)) { auto header_str = config.getString(config_elem + "." + subkey); auto delimiter = header_str.find(':'); diff --git a/src/IO/S3Common.h b/src/IO/S3Common.h index 1e40108b09f..22b590dcb18 100644 --- a/src/IO/S3Common.h +++ b/src/IO/S3Common.h @@ -69,7 +69,7 @@ struct ProxyConfigurationResolver; namespace S3 { -HTTPHeaderEntries getHTTPHeaders(const std::string & config_elem, const Poco::Util::AbstractConfiguration & config); +HTTPHeaderEntries getHTTPHeaders(const std::string & config_elem, const Poco::Util::AbstractConfiguration & config, std::string header_key = "header"); ServerSideEncryptionKMSConfig getSSEKMSConfig(const std::string & config_elem, const Poco::Util::AbstractConfiguration & config); } diff --git a/src/Interpreters/Access/InterpreterCreateUserQuery.cpp b/src/Interpreters/Access/InterpreterCreateUserQuery.cpp index d1d41a45793..f2e65ca4a10 100644 --- a/src/Interpreters/Access/InterpreterCreateUserQuery.cpp +++ b/src/Interpreters/Access/InterpreterCreateUserQuery.cpp @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -44,7 +45,7 @@ namespace const std::optional & override_default_roles, const std::optional & override_settings, const std::optional & override_grantees, - const std::optional & valid_until, + const std::optional & global_valid_until, bool reset_authentication_methods, bool replace_authentication_methods, bool allow_implicit_no_password, @@ -105,12 +106,20 @@ namespace user.authentication_methods.emplace_back(authentication_method); } - bool has_no_password_authentication_method = std::find_if(user.authentication_methods.begin(), - user.authentication_methods.end(), - [](const AuthenticationData & authentication_data) - { - return authentication_data.getType() == AuthenticationType::NO_PASSWORD; - }) != user.authentication_methods.end(); + bool has_no_password_authentication_method = false; + + for (auto & authentication_method : user.authentication_methods) + { + if (global_valid_until) + { + authentication_method.setValidUntil(*global_valid_until); + } + + if (authentication_method.getType() == AuthenticationType::NO_PASSWORD) + { + has_no_password_authentication_method = true; + } + } if (has_no_password_authentication_method && user.authentication_methods.size() > 1) { @@ -133,9 +142,6 @@ namespace } } - if (valid_until) - user.valid_until = *valid_until; - if (override_name && !override_name->host_pattern.empty()) { user.allowed_client_hosts = AllowedClientHosts{}; @@ -175,34 +181,6 @@ namespace else if (query.grantees) user.grantees = *query.grantees; } - - time_t getValidUntilFromAST(ASTPtr valid_until, ContextPtr context) - { - if (context) - valid_until = evaluateConstantExpressionAsLiteral(valid_until, context); - - const String valid_until_str = checkAndGetLiteralArgument(valid_until, "valid_until"); - - if (valid_until_str == "infinity") - return 0; - - time_t time = 0; - ReadBufferFromString in(valid_until_str); - - if (context) - { - const auto & time_zone = 
DateLUT::instance(""); - const auto & utc_time_zone = DateLUT::instance("UTC"); - - parseDateTimeBestEffort(time, in, time_zone, utc_time_zone); - } - else - { - readDateTimeText(time, in); - } - - return time; - } } BlockIO InterpreterCreateUserQuery::execute() @@ -226,9 +204,9 @@ BlockIO InterpreterCreateUserQuery::execute() } } - std::optional valid_until; - if (query.valid_until) - valid_until = getValidUntilFromAST(query.valid_until, getContext()); + std::optional global_valid_until; + if (query.global_valid_until) + global_valid_until = getValidUntilFromAST(query.global_valid_until, getContext()); std::optional default_roles_from_query; if (query.default_roles) @@ -274,7 +252,7 @@ BlockIO InterpreterCreateUserQuery::execute() auto updated_user = typeid_cast>(entity->clone()); updateUserFromQueryImpl( *updated_user, query, authentication_methods, {}, default_roles_from_query, settings_from_query, grantees_from_query, - valid_until, query.reset_authentication_methods_to_new, query.replace_authentication_methods, + global_valid_until, query.reset_authentication_methods_to_new, query.replace_authentication_methods, implicit_no_password_allowed, no_password_allowed, plaintext_password_allowed, getContext()->getServerSettings()[ServerSetting::max_authentication_methods_per_user]); return updated_user; @@ -296,7 +274,7 @@ BlockIO InterpreterCreateUserQuery::execute() auto new_user = std::make_shared(); updateUserFromQueryImpl( *new_user, query, authentication_methods, name, default_roles_from_query, settings_from_query, RolesOrUsersSet::AllTag{}, - valid_until, query.reset_authentication_methods_to_new, query.replace_authentication_methods, + global_valid_until, query.reset_authentication_methods_to_new, query.replace_authentication_methods, implicit_no_password_allowed, no_password_allowed, plaintext_password_allowed, getContext()->getServerSettings()[ServerSetting::max_authentication_methods_per_user]); new_users.emplace_back(std::move(new_user)); @@ -351,9 +329,9 @@ void InterpreterCreateUserQuery::updateUserFromQuery( } } - std::optional valid_until; - if (query.valid_until) - valid_until = getValidUntilFromAST(query.valid_until, {}); + std::optional global_valid_until; + if (query.global_valid_until) + global_valid_until = getValidUntilFromAST(query.global_valid_until, {}); updateUserFromQueryImpl( user, @@ -363,7 +341,7 @@ void InterpreterCreateUserQuery::updateUserFromQuery( {}, {}, {}, - valid_until, + global_valid_until, query.reset_authentication_methods_to_new, query.replace_authentication_methods, allow_no_password, diff --git a/src/Interpreters/Access/InterpreterShowCreateAccessEntityQuery.cpp b/src/Interpreters/Access/InterpreterShowCreateAccessEntityQuery.cpp index ef6ddf1866d..8b7cef056ed 100644 --- a/src/Interpreters/Access/InterpreterShowCreateAccessEntityQuery.cpp +++ b/src/Interpreters/Access/InterpreterShowCreateAccessEntityQuery.cpp @@ -69,13 +69,6 @@ namespace query->authentication_methods.push_back(authentication_method.toAST()); } - if (user.valid_until) - { - WriteBufferFromOwnString out; - writeDateTimeText(user.valid_until, out); - query->valid_until = std::make_shared(out.str()); - } - if (!user.settings.empty()) { if (attach_mode) diff --git a/src/Interpreters/Access/getValidUntilFromAST.cpp b/src/Interpreters/Access/getValidUntilFromAST.cpp new file mode 100644 index 00000000000..caf831e61ee --- /dev/null +++ b/src/Interpreters/Access/getValidUntilFromAST.cpp @@ -0,0 +1,37 @@ +#include +#include +#include +#include +#include +#include + +namespace DB +{ + 
time_t getValidUntilFromAST(ASTPtr valid_until, ContextPtr context) + { + if (context) + valid_until = evaluateConstantExpressionAsLiteral(valid_until, context); + + const String valid_until_str = checkAndGetLiteralArgument(valid_until, "valid_until"); + + if (valid_until_str == "infinity") + return 0; + + time_t time = 0; + ReadBufferFromString in(valid_until_str); + + if (context) + { + const auto & time_zone = DateLUT::instance(""); + const auto & utc_time_zone = DateLUT::instance("UTC"); + + parseDateTimeBestEffort(time, in, time_zone, utc_time_zone); + } + else + { + readDateTimeText(time, in); + } + + return time; + } +} diff --git a/src/Interpreters/Access/getValidUntilFromAST.h b/src/Interpreters/Access/getValidUntilFromAST.h new file mode 100644 index 00000000000..ab0c6c8c9b6 --- /dev/null +++ b/src/Interpreters/Access/getValidUntilFromAST.h @@ -0,0 +1,9 @@ +#pragma once + +#include +#include + +namespace DB +{ + time_t getValidUntilFromAST(ASTPtr valid_until, ContextPtr context); +} diff --git a/src/Interpreters/AggregationCommon.h b/src/Interpreters/AggregationCommon.h index 43c80d361d1..8a81f4d4614 100644 --- a/src/Interpreters/AggregationCommon.h +++ b/src/Interpreters/AggregationCommon.h @@ -88,7 +88,7 @@ void fillFixedBatch(size_t keys_size, const ColumnRawPtrs & key_columns, const S out.resize_fill(num_rows); /// Note: here we violate strict aliasing. - /// It should be ok as log as we do not reffer to any value from `out` before filling. + /// It should be ok as long as we do not refer to any value from `out` before filling. const char * source = static_cast(column)->getRawDataBegin(); T * dest = reinterpret_cast(reinterpret_cast(out.data()) + offset); fillFixedBatch(num_rows, reinterpret_cast(source), dest); /// NOLINT(bugprone-sizeof-expression) diff --git a/src/Interpreters/AsynchronousInsertQueue.cpp b/src/Interpreters/AsynchronousInsertQueue.cpp index 5cc97effad6..1a2efa2461f 100644 --- a/src/Interpreters/AsynchronousInsertQueue.cpp +++ b/src/Interpreters/AsynchronousInsertQueue.cpp @@ -1050,15 +1050,14 @@ Chunk AsynchronousInsertQueue::processEntriesWithParsing( adding_defaults_transform = std::make_shared(header, columns, *format, insert_context); } - auto on_error = [&](const MutableColumns & result_columns, Exception & e) + auto on_error = [&](const MutableColumns & result_columns, const ColumnCheckpoints & checkpoints, Exception & e) { current_exception = e.displayText(); LOG_ERROR(logger, "Failed parsing for query '{}' with query id {}. {}", key.query_str, current_entry->query_id, current_exception); - for (const auto & column : result_columns) - if (column->size() > total_rows) - column->popBack(column->size() - total_rows); + for (size_t i = 0; i < result_columns.size(); ++i) + result_columns[i]->rollback(*checkpoints[i]); current_entry->finish(std::current_exception()); return 0; @@ -1121,6 +1120,13 @@ Chunk AsynchronousInsertQueue::processPreprocessedEntries( "Expected entry with data kind Preprocessed. 
Got: {}", entry->chunk.getDataKind()); Block block_to_insert = *block; + if (block_to_insert.rows() == 0) + { + add_to_async_insert_log(entry, /*parsing_exception=*/ "", block_to_insert.rows(), block_to_insert.bytes()); + entry->resetChunk(); + continue; + } + if (!isCompatibleHeader(block_to_insert, header)) convertBlockToHeader(block_to_insert, header); diff --git a/src/Interpreters/Cache/QueryCache.cpp b/src/Interpreters/Cache/QueryCache.cpp index c766c5209fc..7dbee567c5b 100644 --- a/src/Interpreters/Cache/QueryCache.cpp +++ b/src/Interpreters/Cache/QueryCache.cpp @@ -89,11 +89,40 @@ struct HasSystemTablesMatcher { database_table = identifier->name(); } - /// Handle SELECT [...] FROM clusterAllReplicas(, '') - else if (const auto * literal = node->as()) + /// SELECT [...] FROM clusterAllReplicas(, '
') + /// This SQL syntax is quite common but we need to be careful. A naive attempt to cast 'node' to an ASTLiteral will be too general + /// and introduce false positives in queries like + /// 'SELECT * FROM users WHERE name = 'system.metrics' SETTINGS use_query_cache = true;' + /// Therefore, make sure we are really in `clusterAllReplicas`. EXPLAIN AST for + /// 'SELECT * FROM clusterAllReplicas('default', system.one) SETTINGS use_query_cache = 1' + /// returns: + /// [...] + /// Function clusterAllReplicas (children 1) + /// ExpressionList (children 2) + /// Literal 'test_shard_localhost' + /// Literal 'system.one' + /// [...] + else if (const auto * function = node->as()) { - const auto & value = literal->value; - database_table = toString(value); + if (function->name == "clusterAllReplicas") + { + const ASTs & function_children = function->children; + if (!function_children.empty()) + { + if (const auto * expression_list = function_children[0]->as()) + { + const ASTs & expression_list_children = expression_list->children; + if (expression_list_children.size() >= 2) + { + if (const auto * literal = expression_list_children[1]->as()) + { + const auto & value = literal->value; + database_table = toString(value); + } + } + } + } + } } Tokens tokens(database_table.c_str(), database_table.c_str() + database_table.size(), /*max_query_size*/ 2048, /*skip_insignificant*/ true); diff --git a/src/Interpreters/Context.cpp b/src/Interpreters/Context.cpp index b8e178e402b..4f82ed7b046 100644 --- a/src/Interpreters/Context.cpp +++ b/src/Interpreters/Context.cpp @@ -67,7 +67,6 @@ #include #include #include -#include #include #include #include @@ -92,6 +91,8 @@ #include #include #include +#include +#include #include #include #include @@ -272,6 +273,13 @@ namespace ServerSetting extern const ServerSettingsUInt64 max_replicated_sends_network_bandwidth_for_server; extern const ServerSettingsUInt64 tables_loader_background_pool_size; extern const ServerSettingsUInt64 tables_loader_foreground_pool_size; + extern const ServerSettingsUInt64 prefetch_threadpool_pool_size; + extern const ServerSettingsUInt64 prefetch_threadpool_queue_size; + extern const ServerSettingsUInt64 load_marks_threadpool_pool_size; + extern const ServerSettingsUInt64 load_marks_threadpool_queue_size; + extern const ServerSettingsUInt64 threadpool_writer_pool_size; + extern const ServerSettingsUInt64 threadpool_writer_queue_size; + } namespace ErrorCodes @@ -370,6 +378,9 @@ struct ContextSharedPart : boost::noncopyable mutable OnceFlag user_defined_sql_objects_storage_initialized; mutable std::unique_ptr user_defined_sql_objects_storage; + mutable OnceFlag workload_entity_storage_initialized; + mutable std::unique_ptr workload_entity_storage; + #if USE_NLP mutable OnceFlag synonyms_extensions_initialized; mutable std::optional synonyms_extensions; @@ -711,6 +722,7 @@ struct ContextSharedPart : boost::noncopyable SHUTDOWN(log, "dictionaries loader", external_dictionaries_loader, enablePeriodicUpdates(false)); SHUTDOWN(log, "UDFs loader", external_user_defined_executable_functions_loader, enablePeriodicUpdates(false)); SHUTDOWN(log, "another UDFs storage", user_defined_sql_objects_storage, stopWatching()); + SHUTDOWN(log, "workload entity storage", workload_entity_storage, stopWatching()); LOG_TRACE(log, "Shutting down named sessions"); Session::shutdownNamedSessions(); @@ -742,6 +754,7 @@ struct ContextSharedPart : boost::noncopyable std::unique_ptr delete_external_dictionaries_loader; std::unique_ptr 
delete_external_user_defined_executable_functions_loader; std::unique_ptr delete_user_defined_sql_objects_storage; + std::unique_ptr delete_workload_entity_storage; std::unique_ptr delete_buffer_flush_schedule_pool; std::unique_ptr delete_schedule_pool; std::unique_ptr delete_distributed_schedule_pool; @@ -826,6 +839,7 @@ struct ContextSharedPart : boost::noncopyable delete_external_dictionaries_loader = std::move(external_dictionaries_loader); delete_external_user_defined_executable_functions_loader = std::move(external_user_defined_executable_functions_loader); delete_user_defined_sql_objects_storage = std::move(user_defined_sql_objects_storage); + delete_workload_entity_storage = std::move(workload_entity_storage); delete_buffer_flush_schedule_pool = std::move(buffer_flush_schedule_pool); delete_schedule_pool = std::move(schedule_pool); delete_distributed_schedule_pool = std::move(distributed_schedule_pool); @@ -844,6 +858,7 @@ struct ContextSharedPart : boost::noncopyable delete_external_dictionaries_loader.reset(); delete_external_user_defined_executable_functions_loader.reset(); delete_user_defined_sql_objects_storage.reset(); + delete_workload_entity_storage.reset(); delete_ddl_worker.reset(); delete_buffer_flush_schedule_pool.reset(); delete_schedule_pool.reset(); @@ -1768,7 +1783,7 @@ std::vector Context::getEnabledProfiles() const ResourceManagerPtr Context::getResourceManager() const { callOnce(shared->resource_manager_initialized, [&] { - shared->resource_manager = ResourceManagerFactory::instance().get(getConfigRef().getString("resource_manager", "dynamic")); + shared->resource_manager = createResourceManager(getGlobalContext()); }); return shared->resource_manager; @@ -3015,6 +3030,16 @@ void Context::setUserDefinedSQLObjectsStorage(std::unique_ptruser_defined_sql_objects_storage = std::move(storage); } +IWorkloadEntityStorage & Context::getWorkloadEntityStorage() const +{ + callOnce(shared->workload_entity_storage_initialized, [&] { + shared->workload_entity_storage = createWorkloadEntityStorage(getGlobalContext()); + }); + + std::lock_guard lock(shared->mutex); + return *shared->workload_entity_storage; +} + #if USE_NLP SynonymsExtensions & Context::getSynonymsExtensions() const @@ -3197,9 +3222,8 @@ void Context::clearMarkCache() const ThreadPool & Context::getLoadMarksThreadpool() const { callOnce(shared->load_marks_threadpool_initialized, [&] { - const auto & config = getConfigRef(); - auto pool_size = config.getUInt(".load_marks_threadpool_pool_size", 50); - auto queue_size = config.getUInt(".load_marks_threadpool_queue_size", 1000000); + auto pool_size = shared->server_settings[ServerSetting::load_marks_threadpool_pool_size]; + auto queue_size = shared->server_settings[ServerSetting::load_marks_threadpool_queue_size]; shared->load_marks_threadpool = std::make_unique( CurrentMetrics::MarksLoaderThreads, CurrentMetrics::MarksLoaderThreadsActive, CurrentMetrics::MarksLoaderThreadsScheduled, pool_size, pool_size, queue_size); }); @@ -3392,9 +3416,9 @@ AsynchronousMetrics * Context::getAsynchronousMetrics() const ThreadPool & Context::getPrefetchThreadpool() const { callOnce(shared->prefetch_threadpool_initialized, [&] { - const auto & config = getConfigRef(); - auto pool_size = config.getUInt(".prefetch_threadpool_pool_size", 100); - auto queue_size = config.getUInt(".prefetch_threadpool_queue_size", 1000000); + auto pool_size = shared->server_settings[ServerSetting::prefetch_threadpool_pool_size]; + auto queue_size = 
shared->server_settings[ServerSetting::prefetch_threadpool_queue_size]; + shared->prefetch_threadpool = std::make_unique( CurrentMetrics::IOPrefetchThreads, CurrentMetrics::IOPrefetchThreadsActive, CurrentMetrics::IOPrefetchThreadsScheduled, pool_size, pool_size, queue_size); }); @@ -3404,8 +3428,7 @@ ThreadPool & Context::getPrefetchThreadpool() const size_t Context::getPrefetchThreadpoolSize() const { - const auto & config = getConfigRef(); - return config.getUInt(".prefetch_threadpool_pool_size", 100); + return shared->server_settings[ServerSetting::prefetch_threadpool_pool_size]; } ThreadPool & Context::getBuildVectorSimilarityIndexThreadPool() const @@ -5678,9 +5701,8 @@ IOUringReader & Context::getIOUringReader() const ThreadPool & Context::getThreadPoolWriter() const { callOnce(shared->threadpool_writer_initialized, [&] { - const auto & config = getConfigRef(); - auto pool_size = config.getUInt(".threadpool_writer_pool_size", 100); - auto queue_size = config.getUInt(".threadpool_writer_queue_size", 1000000); + auto pool_size = shared->server_settings[ServerSetting::threadpool_writer_pool_size]; + auto queue_size = shared->server_settings[ServerSetting::threadpool_writer_queue_size]; shared->threadpool_writer = std::make_unique( CurrentMetrics::IOWriterThreads, CurrentMetrics::IOWriterThreadsActive, CurrentMetrics::IOWriterThreadsScheduled, pool_size, pool_size, queue_size); diff --git a/src/Interpreters/Context.h b/src/Interpreters/Context.h index c62c16098e5..e8ccc31f597 100644 --- a/src/Interpreters/Context.h +++ b/src/Interpreters/Context.h @@ -76,6 +76,7 @@ class EmbeddedDictionaries; class ExternalDictionariesLoader; class ExternalUserDefinedExecutableFunctionsLoader; class IUserDefinedSQLObjectsStorage; +class IWorkloadEntityStorage; class InterserverCredentials; using InterserverCredentialsPtr = std::shared_ptr; class InterserverIOHandler; @@ -893,6 +894,8 @@ public: void setUserDefinedSQLObjectsStorage(std::unique_ptr storage); void loadOrReloadUserDefinedExecutableFunctions(const Poco::Util::AbstractConfiguration & config); + IWorkloadEntityStorage & getWorkloadEntityStorage() const; + #if USE_NLP SynonymsExtensions & getSynonymsExtensions() const; Lemmatizers & getLemmatizers() const; diff --git a/src/Interpreters/DDLOnClusterQueryStatusSource.cpp b/src/Interpreters/DDLOnClusterQueryStatusSource.cpp new file mode 100644 index 00000000000..9b5215eb41a --- /dev/null +++ b/src/Interpreters/DDLOnClusterQueryStatusSource.cpp @@ -0,0 +1,157 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace DB +{ +namespace Setting +{ +extern const SettingsDistributedDDLOutputMode distributed_ddl_output_mode; +} + +namespace ErrorCodes +{ +extern const int TIMEOUT_EXCEEDED; +} + +DDLOnClusterQueryStatusSource::DDLOnClusterQueryStatusSource( + const String & zk_node_path, const String & zk_replicas_path, ContextPtr context_, const Strings & hosts_to_wait) + : DistributedQueryStatusSource( + zk_node_path, zk_replicas_path, getSampleBlock(context_), context_, hosts_to_wait, "DDLOnClusterQueryStatusSource") +{ +} + +ExecutionStatus DDLOnClusterQueryStatusSource::checkStatus(const String & host_id) +{ + fs::path status_path = fs::path(node_path) / "finished" / host_id; + return getExecutionStatus(status_path); +} + +Chunk DDLOnClusterQueryStatusSource::generateChunkWithUnfinishedHosts() const +{ + NameSet unfinished_hosts = waiting_hosts; + for (const auto & host_id : finished_hosts) + unfinished_hosts.erase(host_id); + + NameSet 
active_hosts_set = NameSet{current_active_hosts.begin(), current_active_hosts.end()}; + + /// Query is not finished on the rest hosts, so fill the corresponding rows with NULLs. + MutableColumns columns = output.getHeader().cloneEmptyColumns(); + for (const String & host_id : unfinished_hosts) + { + size_t num = 0; + auto [host, port] = parseHostAndPort(host_id); + columns[num++]->insert(host); + columns[num++]->insert(port); + columns[num++]->insert(Field{}); + columns[num++]->insert(Field{}); + columns[num++]->insert(unfinished_hosts.size()); + columns[num++]->insert(current_active_hosts.size()); + } + return Chunk(std::move(columns), unfinished_hosts.size()); +} + +Strings DDLOnClusterQueryStatusSource::getNodesToWait() +{ + return {String(fs::path(node_path) / "finished"), String(fs::path(node_path) / "active")}; +} +Chunk DDLOnClusterQueryStatusSource::handleTimeoutExceeded() +{ + timeout_exceeded = true; + + size_t num_unfinished_hosts = waiting_hosts.size() - num_hosts_finished; + size_t num_active_hosts = current_active_hosts.size(); + + constexpr auto msg_format = "Distributed DDL task {} is not finished on {} of {} hosts " + "({} of them are currently executing the task, {} are inactive). " + "They are going to execute the query in background. Was waiting for {} seconds{}"; + + if (throw_on_timeout || (throw_on_timeout_only_active && !stop_waiting_offline_hosts)) + { + if (!first_exception) + first_exception = std::make_unique(Exception( + ErrorCodes::TIMEOUT_EXCEEDED, + msg_format, + node_path, + num_unfinished_hosts, + waiting_hosts.size(), + num_active_hosts, + offline_hosts.size(), + watch.elapsedSeconds(), + stop_waiting_offline_hosts ? "" : ", which is longer than distributed_ddl_task_timeout")); + + return {}; + } + + LOG_INFO( + log, + msg_format, + node_path, + num_unfinished_hosts, + waiting_hosts.size(), + num_active_hosts, + offline_hosts.size(), + watch.elapsedSeconds(), + stop_waiting_offline_hosts ? 
"" : "which is longer than distributed_ddl_task_timeout"); + + return generateChunkWithUnfinishedHosts(); +} +Chunk DDLOnClusterQueryStatusSource::stopWaitingOfflineHosts() +{ + // Same logic as timeout exceeded + return handleTimeoutExceeded(); +} +void DDLOnClusterQueryStatusSource::handleNonZeroStatusCode(const ExecutionStatus & status, const String & host_id) +{ + assert(status.code != 0); + + if (!first_exception && context->getSettingsRef()[Setting::distributed_ddl_output_mode] != DistributedDDLOutputMode::NEVER_THROW) + { + auto [host, port] = parseHostAndPort(host_id); + first_exception + = std::make_unique(Exception(status.code, "There was an error on [{}:{}]: {}", host, port, status.message)); + } +} +void DDLOnClusterQueryStatusSource::fillHostStatus(const String & host_id, const ExecutionStatus & status, MutableColumns & columns) +{ + size_t num = 0; + auto [host, port] = parseHostAndPort(host_id); + columns[num++]->insert(host); + columns[num++]->insert(port); + columns[num++]->insert(status.code); + columns[num++]->insert(status.message); + columns[num++]->insert(waiting_hosts.size() - num_hosts_finished); + columns[num++]->insert(current_active_hosts.size()); +} + +Block DDLOnClusterQueryStatusSource::getSampleBlock(ContextPtr context_) +{ + auto output_mode = context_->getSettingsRef()[Setting::distributed_ddl_output_mode]; + + auto maybe_make_nullable = [&](const DataTypePtr & type) -> DataTypePtr + { + if (output_mode == DistributedDDLOutputMode::THROW || output_mode == DistributedDDLOutputMode::NONE + || output_mode == DistributedDDLOutputMode::NONE_ONLY_ACTIVE) + return type; + return std::make_shared(type); + }; + + + return Block{ + {std::make_shared(), "host"}, + {std::make_shared(), "port"}, + {maybe_make_nullable(std::make_shared()), "status"}, + {maybe_make_nullable(std::make_shared()), "error"}, + {std::make_shared(), "num_hosts_remaining"}, + {std::make_shared(), "num_hosts_active"}, + }; +} + +} diff --git a/src/Interpreters/DDLOnClusterQueryStatusSource.h b/src/Interpreters/DDLOnClusterQueryStatusSource.h new file mode 100644 index 00000000000..cb50bde40f3 --- /dev/null +++ b/src/Interpreters/DDLOnClusterQueryStatusSource.h @@ -0,0 +1,30 @@ +#pragma once + +#include +#include +#include +#include + +namespace DB +{ +class DDLOnClusterQueryStatusSource final : public DistributedQueryStatusSource +{ +public: + DDLOnClusterQueryStatusSource( + const String & zk_node_path, const String & zk_replicas_path, ContextPtr context_, const Strings & hosts_to_wait); + + String getName() const override { return "DDLOnClusterQueryStatus"; } + +protected: + ExecutionStatus checkStatus(const String & host_id) override; + Chunk generateChunkWithUnfinishedHosts() const override; + Strings getNodesToWait() override; + Chunk handleTimeoutExceeded() override; + Chunk stopWaitingOfflineHosts() override; + void handleNonZeroStatusCode(const ExecutionStatus & status, const String & host_id) override; + void fillHostStatus(const String & host_id, const ExecutionStatus & status, MutableColumns & columns) override; + +private: + static Block getSampleBlock(ContextPtr context_); +}; +} diff --git a/src/Interpreters/DDLWorker.cpp b/src/Interpreters/DDLWorker.cpp index 497ff7d5d07..eaba46f5d48 100644 --- a/src/Interpreters/DDLWorker.cpp +++ b/src/Interpreters/DDLWorker.cpp @@ -1,48 +1,48 @@ -#include -#include +#include +#include +#include +#include +#include +#include +#include #include +#include +#include +#include #include +#include +#include #include #include #include #include -#include 
-#include #include -#include -#include -#include #include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include #include #include -#include -#include +#include +#include +#include +#include +#include +#include +#include #include +#include +#include +#include + +#include +#include #include + #include #include #include -#include -#include - -#include namespace fs = std::filesystem; - namespace CurrentMetrics { extern const Metric DDLWorkerThreads; @@ -78,7 +78,8 @@ constexpr const char * TASK_PROCESSED_OUT_REASON = "Task has been already proces DDLWorker::DDLWorker( int pool_size_, - const std::string & zk_root_dir, + const std::string & zk_queue_dir, + const std::string & zk_replicas_dir, ContextPtr context_, const Poco::Util::AbstractConfiguration * config, const String & prefix, @@ -104,10 +105,15 @@ DDLWorker::DDLWorker( worker_pool = std::make_unique(CurrentMetrics::DDLWorkerThreads, CurrentMetrics::DDLWorkerThreadsActive, CurrentMetrics::DDLWorkerThreadsScheduled, pool_size); } - queue_dir = zk_root_dir; + queue_dir = zk_queue_dir; if (queue_dir.back() == '/') queue_dir.resize(queue_dir.size() - 1); + replicas_dir = zk_replicas_dir; + if (replicas_dir.back() == '/') + replicas_dir.resize(replicas_dir.size() - 1); + + if (config) { task_max_lifetime = config->getUInt64(prefix + ".task_max_lifetime", static_cast(task_max_lifetime)); @@ -1048,7 +1054,25 @@ void DDLWorker::createStatusDirs(const std::string & node_path, const ZooKeeperP } -String DDLWorker::enqueueQuery(DDLLogEntry & entry) +String DDLWorker::enqueueQuery(DDLLogEntry & entry, const ZooKeeperRetriesInfo & retries_info, QueryStatusPtr process_list_element) +{ + String node_path; + if (retries_info.max_retries > 0) + { + ZooKeeperRetriesControl retries_ctl{"DDLWorker::enqueueQuery", log, retries_info, process_list_element}; + retries_ctl.retryLoop([&]{ + node_path = enqueueQueryAttempt(entry); + }); + } + else + { + node_path = enqueueQueryAttempt(entry); + } + return node_path; +} + + +String DDLWorker::enqueueQueryAttempt(DDLLogEntry & entry) { if (entry.hosts.empty()) throw Exception(ErrorCodes::LOGICAL_ERROR, "Empty host list in a distributed DDL task"); @@ -1058,6 +1082,11 @@ String DDLWorker::enqueueQuery(DDLLogEntry & entry) String query_path_prefix = fs::path(queue_dir) / "query-"; zookeeper->createAncestors(query_path_prefix); + NameSet host_ids; + for (const HostID & host : entry.hosts) + host_ids.emplace(host.toString()); + createReplicaDirs(zookeeper, host_ids); + String node_path = zookeeper->create(query_path_prefix, entry.toString(), zkutil::CreateMode::PersistentSequential); if (max_pushed_entry_metric) { @@ -1097,6 +1126,7 @@ bool DDLWorker::initializeMainThread() { auto zookeeper = getAndSetZooKeeper(); zookeeper->createAncestors(fs::path(queue_dir) / ""); + initializeReplication(); initialized = true; return true; } @@ -1158,6 +1188,14 @@ void DDLWorker::runMainThread() } cleanup_event->set(); + try + { + markReplicasActive(reinitialized); + } + catch (...) 
+ { + tryLogCurrentException(log, "An error occurred when markReplicasActive: "); + } scheduleTasks(reinitialized); subsequent_errors_count = 0; @@ -1215,6 +1253,97 @@ void DDLWorker::runMainThread() } +void DDLWorker::initializeReplication() +{ + auto zookeeper = getAndSetZooKeeper(); + + zookeeper->createAncestors(fs::path(replicas_dir) / ""); + + NameSet host_id_set; + for (const auto & it : context->getClusters()) + { + auto cluster = it.second; + for (const auto & host_ids : cluster->getHostIDs()) + for (const auto & host_id : host_ids) + host_id_set.emplace(host_id); + } + + createReplicaDirs(zookeeper, host_id_set); +} + +void DDLWorker::createReplicaDirs(const ZooKeeperPtr & zookeeper, const NameSet & host_ids) +{ + for (const auto & host_id : host_ids) + zookeeper->createAncestors(fs::path(replicas_dir) / host_id / ""); +} + +void DDLWorker::markReplicasActive(bool reinitialized) +{ + auto zookeeper = getAndSetZooKeeper(); + + if (reinitialized) + { + // Reset all active_node_holders + for (auto & it : active_node_holders) + { + auto & active_node_holder = it.second.second; + if (active_node_holder) + active_node_holder->setAlreadyRemoved(); + active_node_holder.reset(); + } + + active_node_holders.clear(); + } + + const auto maybe_secure_port = context->getTCPPortSecure(); + const auto port = context->getTCPPort(); + + Coordination::Stat replicas_stat; + Strings host_ids = zookeeper->getChildren(replicas_dir, &replicas_stat); + NameSet local_host_ids; + for (const auto & host_id : host_ids) + { + if (active_node_holders.contains(host_id)) + continue; + + try + { + HostID host = HostID::fromString(host_id); + /// The port is considered local if it matches TCP or TCP secure port that the server is listening. + bool is_local_host = (maybe_secure_port && host.isLocalAddress(*maybe_secure_port)) || host.isLocalAddress(port); + + if (is_local_host) + local_host_ids.emplace(host_id); + } + catch (const Exception & e) + { + LOG_WARNING(log, "Unable to check if host {} is a local address, exception: {}", host_id, e.displayText()); + continue; + } + } + + for (const auto & host_id : local_host_ids) + { + auto it = active_node_holders.find(host_id); + if (it != active_node_holders.end()) + { + continue; + } + + String active_path = fs::path(replicas_dir) / host_id / "active"; + if (zookeeper->exists(active_path)) + continue; + + String active_id = toString(ServerUUID::get()); + LOG_TRACE(log, "Trying to mark a replica active: active_path={}, active_id={}", active_path, active_id); + + zookeeper->create(active_path, active_id, zkutil::CreateMode::Ephemeral); + auto active_node_holder_zookeeper = zookeeper; + auto active_node_holder = zkutil::EphemeralNodeHolder::existing(active_path, *active_node_holder_zookeeper); + active_node_holders[host_id] = {active_node_holder_zookeeper, active_node_holder}; + } +} + void DDLWorker::runCleanupThread() { setThreadName("DDLWorkerClnr"); diff --git a/src/Interpreters/DDLWorker.h b/src/Interpreters/DDLWorker.h index ac07b086242..a5f47a51bb3 100644 --- a/src/Interpreters/DDLWorker.h +++ b/src/Interpreters/DDLWorker.h @@ -1,24 +1,24 @@ #pragma once -#include +#include +#include +#include #include +#include #include #include #include -#include -#include +#include #include - #include + #include -#include -#include #include #include #include -#include #include + namespace zkutil { class ZooKeeper; @@ -48,16 +48,27 @@ struct DDLTaskBase; using DDLTaskPtr = std::unique_ptr; using ZooKeeperPtr = std::shared_ptr; class AccessRightsElements; +struct 
ZooKeeperRetriesInfo; +class QueryStatus; +using QueryStatusPtr = std::shared_ptr; class DDLWorker { public: - DDLWorker(int pool_size_, const std::string & zk_root_dir, ContextPtr context_, const Poco::Util::AbstractConfiguration * config, const String & prefix, - const String & logger_name = "DDLWorker", const CurrentMetrics::Metric * max_entry_metric_ = nullptr, const CurrentMetrics::Metric * max_pushed_entry_metric_ = nullptr); + DDLWorker( + int pool_size_, + const std::string & zk_queue_dir, + const std::string & zk_replicas_dir, + ContextPtr context_, + const Poco::Util::AbstractConfiguration * config, + const String & prefix, + const String & logger_name = "DDLWorker", + const CurrentMetrics::Metric * max_entry_metric_ = nullptr, + const CurrentMetrics::Metric * max_pushed_entry_metric_ = nullptr); virtual ~DDLWorker(); /// Pushes query into DDL queue, returns path to created node - virtual String enqueueQuery(DDLLogEntry & entry); + virtual String enqueueQuery(DDLLogEntry & entry, const ZooKeeperRetriesInfo & retries_info, QueryStatusPtr process_list_element); /// Host ID (name:port) for logging purposes /// Note that in each task hosts are identified individually by name:port from initiator server cluster config @@ -71,6 +82,8 @@ public: return queue_dir; } + std::string getReplicasDir() const { return replicas_dir; } + void startup(); virtual void shutdown(); @@ -110,6 +123,9 @@ protected: mutable std::shared_mutex mtx; }; + /// Pushes query into DDL queue, returns path to created node + String enqueueQueryAttempt(DDLLogEntry & entry); + /// Iterates through queue tasks in ZooKeeper, runs execution of new tasks void scheduleTasks(bool reinitialized); @@ -149,6 +165,10 @@ protected: /// Return false if the worker was stopped (stop_flag = true) virtual bool initializeMainThread(); + virtual void initializeReplication(); + + virtual void createReplicaDirs(const ZooKeeperPtr & zookeeper, const NameSet & host_ids); + virtual void markReplicasActive(bool reinitialized); void runMainThread(); void runCleanupThread(); @@ -160,7 +180,8 @@ protected: std::string host_fqdn; /// current host domain name std::string host_fqdn_id; /// host_name:port - std::string queue_dir; /// dir with queue of queries + std::string queue_dir; /// dir with queue of queries + std::string replicas_dir; mutable std::mutex zookeeper_mutex; ZooKeeperPtr current_zookeeper TSA_GUARDED_BY(zookeeper_mutex); @@ -202,6 +223,8 @@ protected: const CurrentMetrics::Metric * max_entry_metric; const CurrentMetrics::Metric * max_pushed_entry_metric; + + std::unordered_map> active_node_holders; }; diff --git a/src/Interpreters/DistributedQueryStatusSource.cpp b/src/Interpreters/DistributedQueryStatusSource.cpp new file mode 100644 index 00000000000..83701d41c57 --- /dev/null +++ b/src/Interpreters/DistributedQueryStatusSource.cpp @@ -0,0 +1,270 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace DB +{ +namespace Setting +{ +extern const SettingsDistributedDDLOutputMode distributed_ddl_output_mode; +extern const SettingsInt64 distributed_ddl_task_timeout; +} +namespace ErrorCodes +{ +extern const int UNFINISHED; +} + +DistributedQueryStatusSource::DistributedQueryStatusSource( + const String & zk_node_path, + const String & zk_replicas_path, + Block block, + ContextPtr context_, + const Strings & hosts_to_wait, + const char * logger_name) + : ISource(block) + , node_path(zk_node_path) + , replicas_path(zk_replicas_path) + , context(context_) + , 
watch(CLOCK_MONOTONIC_COARSE) + , log(getLogger(logger_name)) +{ + auto output_mode = context->getSettingsRef()[Setting::distributed_ddl_output_mode]; + throw_on_timeout = output_mode == DistributedDDLOutputMode::THROW || output_mode == DistributedDDLOutputMode::NONE; + throw_on_timeout_only_active + = output_mode == DistributedDDLOutputMode::THROW_ONLY_ACTIVE || output_mode == DistributedDDLOutputMode::NONE_ONLY_ACTIVE; + + waiting_hosts = NameSet(hosts_to_wait.begin(), hosts_to_wait.end()); + + only_running_hosts = output_mode == DistributedDDLOutputMode::THROW_ONLY_ACTIVE + || output_mode == DistributedDDLOutputMode::NULL_STATUS_ON_TIMEOUT_ONLY_ACTIVE + || output_mode == DistributedDDLOutputMode::NONE_ONLY_ACTIVE; + + addTotalRowsApprox(waiting_hosts.size()); + timeout_seconds = context->getSettingsRef()[Setting::distributed_ddl_task_timeout]; +} + + +IProcessor::Status DistributedQueryStatusSource::prepare() +{ + /// This method is overloaded to throw exception after all data is read. + /// Exception is pushed into pipe (instead of simply being thrown) to ensure the order of data processing and exception. + + if (finished) + { + if (first_exception) + { + if (!output.canPush()) + return Status::PortFull; + + output.pushException(std::make_exception_ptr(*first_exception)); + } + + output.finish(); + return Status::Finished; + } + else + return ISource::prepare(); +} + +NameSet DistributedQueryStatusSource::getOfflineHosts(const NameSet & hosts_to_wait, const ZooKeeperPtr & zookeeper) +{ + Strings paths; + Strings hosts_array; + for (const auto & host : hosts_to_wait) + { + hosts_array.push_back(host); + paths.push_back(fs::path(replicas_path) / host / "active"); + } + + NameSet offline; + auto res = zookeeper->tryGet(paths); + for (size_t i = 0; i < res.size(); ++i) + if (res[i].error == Coordination::Error::ZNONODE) + offline.insert(hosts_array[i]); + + if (offline.size() == hosts_to_wait.size()) + { + /// Avoid reporting that all hosts are offline + LOG_WARNING(log, "Did not find active hosts, will wait for all {} hosts. 
This should not happen often", offline.size()); + return {}; + } + + return offline; +} + +Strings DistributedQueryStatusSource::getNewAndUpdate(const Strings & current_finished_hosts) +{ + Strings diff; + for (const String & host : current_finished_hosts) + { + if (!waiting_hosts.contains(host)) + { + if (!ignoring_hosts.contains(host)) + { + ignoring_hosts.emplace(host); + LOG_INFO(log, "Unexpected host {} appeared in task {}", host, node_path); + } + continue; + } + + if (!finished_hosts.contains(host)) + { + diff.emplace_back(host); + finished_hosts.emplace(host); + } + } + + return diff; +} + + +ExecutionStatus DistributedQueryStatusSource::getExecutionStatus(const fs::path & status_path) +{ + ExecutionStatus status(-1, "Cannot obtain error message"); + + String status_data; + bool finished_exists = false; + + auto retries_ctl = ZooKeeperRetriesControl( + "executeDDLQueryOnCluster", getLogger("DDLQueryStatusSource"), getRetriesInfo(), context->getProcessListElement()); + retries_ctl.retryLoop([&]() { finished_exists = context->getZooKeeper()->tryGet(status_path, status_data); }); + if (finished_exists) + status.tryDeserializeText(status_data); + + return status; +} + +ZooKeeperRetriesInfo DistributedQueryStatusSource::getRetriesInfo() +{ + const auto & config_ref = Context::getGlobalContextInstance()->getConfigRef(); + return ZooKeeperRetriesInfo( + config_ref.getInt("distributed_ddl_keeper_max_retries", 5), + config_ref.getInt("distributed_ddl_keeper_initial_backoff_ms", 100), + config_ref.getInt("distributed_ddl_keeper_max_backoff_ms", 5000)); +} + +std::pair DistributedQueryStatusSource::parseHostAndPort(const String & host_id) +{ + String host = host_id; + UInt16 port = 0; + auto host_and_port = Cluster::Address::fromString(host_id); + host = host_and_port.first; + port = host_and_port.second; + return {host, port}; +} + +Chunk DistributedQueryStatusSource::generate() +{ + bool all_hosts_finished = num_hosts_finished >= waiting_hosts.size(); + + /// Seems like num_hosts_finished cannot be strictly greater than waiting_hosts.size() + assert(num_hosts_finished <= waiting_hosts.size()); + + if (all_hosts_finished || timeout_exceeded) + return {}; + + size_t try_number = 0; + while (true) + { + if (isCancelled()) + return {}; + + if (stop_waiting_offline_hosts) + { + return stopWaitingOfflineHosts(); + } + + if ((timeout_seconds >= 0 && watch.elapsedSeconds() > timeout_seconds)) + { + return handleTimeoutExceeded(); + } + + sleepForMilliseconds(std::min(1000, 50 * try_number)); + + bool node_exists = false; + Strings tmp_hosts; + Strings tmp_active_hosts; + + { + auto retries_ctl = ZooKeeperRetriesControl( + "executeDistributedQueryOnCluster", getLogger(getName()), getRetriesInfo(), context->getProcessListElement()); + retries_ctl.retryLoop( + [&]() + { + auto zookeeper = context->getZooKeeper(); + Strings paths = getNodesToWait(); + auto res = zookeeper->tryGetChildren(paths); + for (size_t i = 0; i < res.size(); ++i) + if (res[i].error != Coordination::Error::ZOK && res[i].error != Coordination::Error::ZNONODE) + throw Coordination::Exception::fromPath(res[i].error, paths[i]); + + if (res[0].error == Coordination::Error::ZNONODE) + node_exists = zookeeper->exists(node_path); + else + node_exists = true; + tmp_hosts = res[0].names; + tmp_active_hosts = res[1].names; + + if (only_running_hosts) + offline_hosts = getOfflineHosts(waiting_hosts, zookeeper); + }); + } + + if (!node_exists) + { + /// Paradoxically, this exception will be throw even in case of "never_throw" mode. 
+ + if (!first_exception) + first_exception = std::make_unique(Exception( + ErrorCodes::UNFINISHED, + "Cannot provide query execution status. The query's node {} has been deleted by the cleaner" + " since it was finished (or its lifetime is expired)", + node_path)); + return {}; + } + + Strings new_hosts = getNewAndUpdate(tmp_hosts); + ++try_number; + + if (only_running_hosts) + { + size_t num_finished_or_offline = 0; + for (const auto & host : waiting_hosts) + num_finished_or_offline += finished_hosts.contains(host) || offline_hosts.contains(host); + + if (num_finished_or_offline == waiting_hosts.size()) + stop_waiting_offline_hosts = true; + } + + if (new_hosts.empty()) + continue; + + current_active_hosts = std::move(tmp_active_hosts); + + MutableColumns columns = output.getHeader().cloneEmptyColumns(); + for (const String & host_id : new_hosts) + { + ExecutionStatus status = checkStatus(host_id); + + if (status.code != 0) + { + handleNonZeroStatusCode(status, host_id); + } + + ++num_hosts_finished; + fillHostStatus(host_id, status, columns); + } + + return Chunk(std::move(columns), new_hosts.size()); + } +} + +} diff --git a/src/Interpreters/DistributedQueryStatusSource.h b/src/Interpreters/DistributedQueryStatusSource.h new file mode 100644 index 00000000000..4f58085a1f0 --- /dev/null +++ b/src/Interpreters/DistributedQueryStatusSource.h @@ -0,0 +1,68 @@ +#pragma once + +#include +#include +#include +#include +#include + +namespace fs = std::filesystem; + +namespace DB +{ +class DistributedQueryStatusSource : public ISource +{ +public: + DistributedQueryStatusSource( + const String & zk_node_path, + const String & zk_replicas_path, + Block block, + ContextPtr context_, + const Strings & hosts_to_wait, + const char * logger_name); + + Chunk generate() override; + Status prepare() override; + +protected: + virtual ExecutionStatus checkStatus(const String & host_id) = 0; + virtual Chunk generateChunkWithUnfinishedHosts() const = 0; + virtual Strings getNodesToWait() = 0; + virtual Chunk handleTimeoutExceeded() = 0; + virtual Chunk stopWaitingOfflineHosts() = 0; + virtual void handleNonZeroStatusCode(const ExecutionStatus & status, const String & host_id) = 0; + virtual void fillHostStatus(const String & host_id, const ExecutionStatus & status, MutableColumns & columns) = 0; + + virtual NameSet getOfflineHosts(const NameSet & hosts_to_wait, const ZooKeeperPtr & zookeeper); + + Strings getNewAndUpdate(const Strings & current_finished_hosts); + ExecutionStatus getExecutionStatus(const fs::path & status_path); + + static ZooKeeperRetriesInfo getRetriesInfo(); + static std::pair parseHostAndPort(const String & host_id); + + String node_path; + String replicas_path; + ContextPtr context; + Stopwatch watch; + LoggerPtr log; + + NameSet waiting_hosts; /// hosts from task host list + NameSet finished_hosts; /// finished hosts from host list + NameSet ignoring_hosts; /// appeared hosts that are not in hosts list + Strings current_active_hosts; /// Hosts that are currently executing the task + NameSet offline_hosts; /// Hosts that are not currently running + size_t num_hosts_finished = 0; + + /// Save the first detected error and throw it at the end of execution + std::unique_ptr first_exception; + + Int64 timeout_seconds = 120; + bool throw_on_timeout = true; + bool throw_on_timeout_only_active = false; + bool only_running_hosts = false; + + bool timeout_exceeded = false; + bool stop_waiting_offline_hosts = false; +}; +} diff --git a/src/Interpreters/InterpreterBackupQuery.cpp 
b/src/Interpreters/InterpreterBackupQuery.cpp index 6f76b21a7b8..baaa6d40f0d 100644 --- a/src/Interpreters/InterpreterBackupQuery.cpp +++ b/src/Interpreters/InterpreterBackupQuery.cpp @@ -2,6 +2,8 @@ #include #include +#include +#include #include #include #include @@ -18,13 +20,13 @@ namespace DB namespace { - Block getResultRow(const BackupOperationInfo & info) + Block getResultRow(const String & id, BackupStatus status) { auto column_id = ColumnString::create(); auto column_status = ColumnInt8::create(); - column_id->insert(info.id); - column_status->insert(static_cast(info.status)); + column_id->insert(id); + column_status->insert(static_cast(status)); Block res_columns; res_columns.insert(0, {std::move(column_id), std::make_shared(), "id"}); @@ -36,15 +38,18 @@ namespace BlockIO InterpreterBackupQuery::execute() { + const ASTBackupQuery & backup_query = query_ptr->as(); auto & backups_worker = context->getBackupsWorker(); - auto id = backups_worker.start(query_ptr, context); - auto info = backups_worker.getInfo(id); - if (info.exception) - std::rethrow_exception(info.exception); + auto [id, status] = backups_worker.start(query_ptr, context); + + /// Wait if it's a synchronous operation. + bool async = BackupSettings::isAsync(backup_query); + if (!async) + status = backups_worker.wait(id); BlockIO res_io; - res_io.pipeline = QueryPipeline(std::make_shared(getResultRow(info))); + res_io.pipeline = QueryPipeline(std::make_shared(getResultRow(id, status))); return res_io; } diff --git a/src/Interpreters/InterpreterCreateQuery.cpp b/src/Interpreters/InterpreterCreateQuery.cpp index 22bba01a60f..a38a7ab45d1 100644 --- a/src/Interpreters/InterpreterCreateQuery.cpp +++ b/src/Interpreters/InterpreterCreateQuery.cpp @@ -1987,6 +1987,12 @@ BlockIO InterpreterCreateQuery::doCreateOrReplaceTable(ASTCreateQuery & create, UInt16 hashed_zk_path = sipHash64(txn->getTaskZooKeeperPath()); random_suffix = getHexUIntLowercase(hashed_zk_path); } + else if (!current_context->getCurrentQueryId().empty()) + { + random_suffix = getRandomASCIIString(/*length=*/2); + UInt8 hashed_query_id = sipHash64(current_context->getCurrentQueryId()); + random_suffix += getHexUIntLowercase(hashed_query_id); + } else { random_suffix = getRandomASCIIString(/*length=*/4); diff --git a/src/Interpreters/InterpreterCreateResourceQuery.cpp b/src/Interpreters/InterpreterCreateResourceQuery.cpp new file mode 100644 index 00000000000..c6eca7a90d8 --- /dev/null +++ b/src/Interpreters/InterpreterCreateResourceQuery.cpp @@ -0,0 +1,68 @@ +#include +#include + +#include +#include +#include +#include +#include + + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int INCORRECT_QUERY; +} + +BlockIO InterpreterCreateResourceQuery::execute() +{ + ASTCreateResourceQuery & create_resource_query = query_ptr->as(); + + AccessRightsElements access_rights_elements; + access_rights_elements.emplace_back(AccessType::CREATE_RESOURCE); + + if (create_resource_query.or_replace) + access_rights_elements.emplace_back(AccessType::DROP_RESOURCE); + + auto current_context = getContext(); + + if (!create_resource_query.cluster.empty()) + { + if (current_context->getWorkloadEntityStorage().isReplicated()) + throw Exception(ErrorCodes::INCORRECT_QUERY, "ON CLUSTER is not allowed because workload entities are replicated automatically"); + + DDLQueryOnClusterParams params; + params.access_to_check = std::move(access_rights_elements); + return executeDDLQueryOnCluster(query_ptr, current_context, params); + } + + 
current_context->checkAccess(access_rights_elements); + + auto resource_name = create_resource_query.getResourceName(); + bool throw_if_exists = !create_resource_query.if_not_exists && !create_resource_query.or_replace; + bool replace_if_exists = create_resource_query.or_replace; + + current_context->getWorkloadEntityStorage().storeEntity( + current_context, + WorkloadEntityType::Resource, + resource_name, + query_ptr, + throw_if_exists, + replace_if_exists, + current_context->getSettingsRef()); + + return {}; +} + +void registerInterpreterCreateResourceQuery(InterpreterFactory & factory) +{ + auto create_fn = [] (const InterpreterFactory::Arguments & args) + { + return std::make_unique(args.query, args.context); + }; + factory.registerInterpreter("InterpreterCreateResourceQuery", create_fn); +} + +} diff --git a/src/Interpreters/InterpreterCreateResourceQuery.h b/src/Interpreters/InterpreterCreateResourceQuery.h new file mode 100644 index 00000000000..4bd427e5e8f --- /dev/null +++ b/src/Interpreters/InterpreterCreateResourceQuery.h @@ -0,0 +1,25 @@ +#pragma once + +#include + + +namespace DB +{ + +class Context; + +class InterpreterCreateResourceQuery : public IInterpreter, WithMutableContext +{ +public: + InterpreterCreateResourceQuery(const ASTPtr & query_ptr_, ContextMutablePtr context_) + : WithMutableContext(context_), query_ptr(query_ptr_) + { + } + + BlockIO execute() override; + +private: + ASTPtr query_ptr; +}; + +} diff --git a/src/Interpreters/InterpreterCreateWorkloadQuery.cpp b/src/Interpreters/InterpreterCreateWorkloadQuery.cpp new file mode 100644 index 00000000000..41d0f52c685 --- /dev/null +++ b/src/Interpreters/InterpreterCreateWorkloadQuery.cpp @@ -0,0 +1,68 @@ +#include +#include + +#include +#include +#include +#include +#include + + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int INCORRECT_QUERY; +} + +BlockIO InterpreterCreateWorkloadQuery::execute() +{ + ASTCreateWorkloadQuery & create_workload_query = query_ptr->as(); + + AccessRightsElements access_rights_elements; + access_rights_elements.emplace_back(AccessType::CREATE_WORKLOAD); + + if (create_workload_query.or_replace) + access_rights_elements.emplace_back(AccessType::DROP_WORKLOAD); + + auto current_context = getContext(); + + if (!create_workload_query.cluster.empty()) + { + if (current_context->getWorkloadEntityStorage().isReplicated()) + throw Exception(ErrorCodes::INCORRECT_QUERY, "ON CLUSTER is not allowed because workload entities are replicated automatically"); + + DDLQueryOnClusterParams params; + params.access_to_check = std::move(access_rights_elements); + return executeDDLQueryOnCluster(query_ptr, current_context, params); + } + + current_context->checkAccess(access_rights_elements); + + auto workload_name = create_workload_query.getWorkloadName(); + bool throw_if_exists = !create_workload_query.if_not_exists && !create_workload_query.or_replace; + bool replace_if_exists = create_workload_query.or_replace; + + current_context->getWorkloadEntityStorage().storeEntity( + current_context, + WorkloadEntityType::Workload, + workload_name, + query_ptr, + throw_if_exists, + replace_if_exists, + current_context->getSettingsRef()); + + return {}; +} + +void registerInterpreterCreateWorkloadQuery(InterpreterFactory & factory) +{ + auto create_fn = [] (const InterpreterFactory::Arguments & args) + { + return std::make_unique(args.query, args.context); + }; + factory.registerInterpreter("InterpreterCreateWorkloadQuery", create_fn); +} + +} diff --git 
a/src/Interpreters/InterpreterCreateWorkloadQuery.h b/src/Interpreters/InterpreterCreateWorkloadQuery.h new file mode 100644 index 00000000000..319388fb64c --- /dev/null +++ b/src/Interpreters/InterpreterCreateWorkloadQuery.h @@ -0,0 +1,25 @@ +#pragma once + +#include + + +namespace DB +{ + +class Context; + +class InterpreterCreateWorkloadQuery : public IInterpreter, WithMutableContext +{ +public: + InterpreterCreateWorkloadQuery(const ASTPtr & query_ptr_, ContextMutablePtr context_) + : WithMutableContext(context_), query_ptr(query_ptr_) + { + } + + BlockIO execute() override; + +private: + ASTPtr query_ptr; +}; + +} diff --git a/src/Interpreters/InterpreterDropResourceQuery.cpp b/src/Interpreters/InterpreterDropResourceQuery.cpp new file mode 100644 index 00000000000..848a74fda23 --- /dev/null +++ b/src/Interpreters/InterpreterDropResourceQuery.cpp @@ -0,0 +1,60 @@ +#include +#include + +#include +#include +#include +#include +#include + + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int INCORRECT_QUERY; +} + +BlockIO InterpreterDropResourceQuery::execute() +{ + ASTDropResourceQuery & drop_resource_query = query_ptr->as(); + + AccessRightsElements access_rights_elements; + access_rights_elements.emplace_back(AccessType::DROP_RESOURCE); + + auto current_context = getContext(); + + if (!drop_resource_query.cluster.empty()) + { + if (current_context->getWorkloadEntityStorage().isReplicated()) + throw Exception(ErrorCodes::INCORRECT_QUERY, "ON CLUSTER is not allowed because workload entities are replicated automatically"); + + DDLQueryOnClusterParams params; + params.access_to_check = std::move(access_rights_elements); + return executeDDLQueryOnCluster(query_ptr, current_context, params); + } + + current_context->checkAccess(access_rights_elements); + + bool throw_if_not_exists = !drop_resource_query.if_exists; + + current_context->getWorkloadEntityStorage().removeEntity( + current_context, + WorkloadEntityType::Resource, + drop_resource_query.resource_name, + throw_if_not_exists); + + return {}; +} + +void registerInterpreterDropResourceQuery(InterpreterFactory & factory) +{ + auto create_fn = [] (const InterpreterFactory::Arguments & args) + { + return std::make_unique(args.query, args.context); + }; + factory.registerInterpreter("InterpreterDropResourceQuery", create_fn); +} + +} diff --git a/src/Interpreters/InterpreterDropResourceQuery.h b/src/Interpreters/InterpreterDropResourceQuery.h new file mode 100644 index 00000000000..588f26fb88c --- /dev/null +++ b/src/Interpreters/InterpreterDropResourceQuery.h @@ -0,0 +1,21 @@ +#pragma once + +#include + +namespace DB +{ + +class Context; + +class InterpreterDropResourceQuery : public IInterpreter, WithMutableContext +{ +public: + InterpreterDropResourceQuery(const ASTPtr & query_ptr_, ContextMutablePtr context_) : WithMutableContext(context_), query_ptr(query_ptr_) {} + + BlockIO execute() override; + +private: + ASTPtr query_ptr; +}; + +} diff --git a/src/Interpreters/InterpreterDropWorkloadQuery.cpp b/src/Interpreters/InterpreterDropWorkloadQuery.cpp new file mode 100644 index 00000000000..bbaa2beb4cd --- /dev/null +++ b/src/Interpreters/InterpreterDropWorkloadQuery.cpp @@ -0,0 +1,60 @@ +#include +#include + +#include +#include +#include +#include +#include + + +namespace DB +{ + +namespace ErrorCodes +{ + extern const int INCORRECT_QUERY; +} + +BlockIO InterpreterDropWorkloadQuery::execute() +{ + ASTDropWorkloadQuery & drop_workload_query = query_ptr->as(); + + AccessRightsElements access_rights_elements; + 
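The resource and workload interpreters added in this patch all repeat the same `ON CLUSTER` guard: if the query names a cluster while workload entities live in replicated storage, the query is rejected, because replication already propagates the entities; otherwise it is forwarded via `executeDDLQueryOnCluster`. The helper below is a hypothetical consolidation of that shared check, not code from the patch; `EntityStorage` is a stand-in for the real `WorkloadEntityStorage` interface.

#include <stdexcept>
#include <string>

/// Hypothetical, simplified stand-in for the workload entity storage.
struct EntityStorage
{
    bool isReplicated() const { return replicated; }
    bool replicated = false;
};

enum class Route { ExecuteLocally, ForwardOnCluster };

/// Sketch of the routing decision every CREATE/DROP WORKLOAD/RESOURCE
/// interpreter in this patch makes: replicated entity storage makes
/// ON CLUSTER redundant (and therefore an error).
Route routeWorkloadDDL(const std::string & cluster, const EntityStorage & storage)
{
    if (cluster.empty())
        return Route::ExecuteLocally;

    if (storage.isReplicated())
        throw std::runtime_error(
            "ON CLUSTER is not allowed because workload entities are replicated automatically");

    return Route::ForwardOnCluster;
}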
access_rights_elements.emplace_back(AccessType::DROP_WORKLOAD); + + auto current_context = getContext(); + + if (!drop_workload_query.cluster.empty()) + { + if (current_context->getWorkloadEntityStorage().isReplicated()) + throw Exception(ErrorCodes::INCORRECT_QUERY, "ON CLUSTER is not allowed because workload entities are replicated automatically"); + + DDLQueryOnClusterParams params; + params.access_to_check = std::move(access_rights_elements); + return executeDDLQueryOnCluster(query_ptr, current_context, params); + } + + current_context->checkAccess(access_rights_elements); + + bool throw_if_not_exists = !drop_workload_query.if_exists; + + current_context->getWorkloadEntityStorage().removeEntity( + current_context, + WorkloadEntityType::Workload, + drop_workload_query.workload_name, + throw_if_not_exists); + + return {}; +} + +void registerInterpreterDropWorkloadQuery(InterpreterFactory & factory) +{ + auto create_fn = [] (const InterpreterFactory::Arguments & args) + { + return std::make_unique(args.query, args.context); + }; + factory.registerInterpreter("InterpreterDropWorkloadQuery", create_fn); +} + +} diff --git a/src/Interpreters/InterpreterDropWorkloadQuery.h b/src/Interpreters/InterpreterDropWorkloadQuery.h new file mode 100644 index 00000000000..1297c95e949 --- /dev/null +++ b/src/Interpreters/InterpreterDropWorkloadQuery.h @@ -0,0 +1,21 @@ +#pragma once + +#include + +namespace DB +{ + +class Context; + +class InterpreterDropWorkloadQuery : public IInterpreter, WithMutableContext +{ +public: + InterpreterDropWorkloadQuery(const ASTPtr & query_ptr_, ContextMutablePtr context_) : WithMutableContext(context_), query_ptr(query_ptr_) {} + + BlockIO execute() override; + +private: + ASTPtr query_ptr; +}; + +} diff --git a/src/Interpreters/InterpreterFactory.cpp b/src/Interpreters/InterpreterFactory.cpp index cfc95124895..729a7b86312 100644 --- a/src/Interpreters/InterpreterFactory.cpp +++ b/src/Interpreters/InterpreterFactory.cpp @@ -3,9 +3,13 @@ #include #include #include +#include +#include #include #include #include +#include +#include #include #include #include @@ -332,6 +336,22 @@ InterpreterFactory::InterpreterPtr InterpreterFactory::get(ASTPtr & query, Conte { interpreter_name = "InterpreterDropFunctionQuery"; } + else if (query->as()) + { + interpreter_name = "InterpreterCreateWorkloadQuery"; + } + else if (query->as()) + { + interpreter_name = "InterpreterDropWorkloadQuery"; + } + else if (query->as()) + { + interpreter_name = "InterpreterCreateResourceQuery"; + } + else if (query->as()) + { + interpreter_name = "InterpreterDropResourceQuery"; + } else if (query->as()) { interpreter_name = "InterpreterCreateIndexQuery"; diff --git a/src/Interpreters/InterpreterSystemQuery.cpp b/src/Interpreters/InterpreterSystemQuery.cpp index 8aa1bda1d1c..45636ab40b9 100644 --- a/src/Interpreters/InterpreterSystemQuery.cpp +++ b/src/Interpreters/InterpreterSystemQuery.cpp @@ -89,6 +89,9 @@ namespace CurrentMetrics extern const Metric RestartReplicaThreads; extern const Metric RestartReplicaThreadsActive; extern const Metric RestartReplicaThreadsScheduled; + extern const Metric MergeTreePartsLoaderThreads; + extern const Metric MergeTreePartsLoaderThreadsActive; + extern const Metric MergeTreePartsLoaderThreadsScheduled; } namespace DB @@ -97,6 +100,7 @@ namespace Setting { extern const SettingsSeconds lock_acquire_timeout; extern const SettingsSeconds receive_timeout; + extern const SettingsMaxThreads max_threads; } namespace ServerSetting @@ -359,6 +363,11 @@ BlockIO 
InterpreterSystemQuery::execute() HTTPConnectionPools::instance().dropCache(); break; } + case Type::PREWARM_MARK_CACHE: + { + prewarmMarkCache(); + break; + } case Type::DROP_MARK_CACHE: getContext()->checkAccess(AccessType::SYSTEM_DROP_MARK_CACHE); system_context->clearMarkCache(); @@ -1016,7 +1025,7 @@ void InterpreterSystemQuery::dropReplica(ASTSystemQuery & query) { ReplicatedTableStatus status; storage_replicated->getStatus(status); - if (status.zookeeper_info.path == query.replica_zk_path) + if (status.replica_path == remote_replica_path) throw Exception(ErrorCodes::TABLE_WAS_NOT_DROPPED, "There is a local table {}, which has the same table path in ZooKeeper. " "Please check the path in query. " @@ -1298,6 +1307,28 @@ RefreshTaskList InterpreterSystemQuery::getRefreshTasks() return tasks; } +void InterpreterSystemQuery::prewarmMarkCache() +{ + if (table_id.empty()) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Table is not specified for prewarming marks cache"); + + getContext()->checkAccess(AccessType::SYSTEM_PREWARM_MARK_CACHE, table_id); + + auto table_ptr = DatabaseCatalog::instance().getTable(table_id, getContext()); + auto * merge_tree = dynamic_cast(table_ptr.get()); + + if (!merge_tree) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Command PREWARM MARK CACHE is supported only for MergeTree table, but got: {}", table_ptr->getName()); + + ThreadPool pool( + CurrentMetrics::MergeTreePartsLoaderThreads, + CurrentMetrics::MergeTreePartsLoaderThreadsActive, + CurrentMetrics::MergeTreePartsLoaderThreadsScheduled, + getContext()->getSettingsRef()[Setting::max_threads]); + + merge_tree->prewarmMarkCache(pool); +} + AccessRightsElements InterpreterSystemQuery::getRequiredAccessForDDLOnCluster() const { @@ -1499,6 +1530,11 @@ AccessRightsElements InterpreterSystemQuery::getRequiredAccessForDDLOnCluster() required_access.emplace_back(AccessType::SYSTEM_WAIT_LOADING_PARTS, query.getDatabase(), query.getTable()); break; } + case Type::PREWARM_MARK_CACHE: + { + required_access.emplace_back(AccessType::SYSTEM_PREWARM_MARK_CACHE, query.getDatabase(), query.getTable()); + break; + } case Type::SYNC_DATABASE_REPLICA: { required_access.emplace_back(AccessType::SYSTEM_SYNC_DATABASE_REPLICA, query.getDatabase()); diff --git a/src/Interpreters/InterpreterSystemQuery.h b/src/Interpreters/InterpreterSystemQuery.h index 82d55125927..e31c6cd739b 100644 --- a/src/Interpreters/InterpreterSystemQuery.h +++ b/src/Interpreters/InterpreterSystemQuery.h @@ -82,6 +82,7 @@ private: AccessRightsElements getRequiredAccessForDDLOnCluster() const; void startStopAction(StorageActionBlockType action_type, bool start); + void prewarmMarkCache(); void stopReplicatedDDLQueries(); void startReplicatedDDLQueries(); diff --git a/src/Interpreters/MetricLog.cpp b/src/Interpreters/MetricLog.cpp index 16a88b976ba..d0d799ea693 100644 --- a/src/Interpreters/MetricLog.cpp +++ b/src/Interpreters/MetricLog.cpp @@ -70,6 +70,15 @@ void MetricLog::stepFunction(const std::chrono::system_clock::time_point current { const ProfileEvents::Count new_value = ProfileEvents::global_counters[i].load(std::memory_order_relaxed); auto & old_value = prev_profile_events[i]; + + /// Profile event counters are supposed to be monotonic. However, at least the `NetworkReceiveBytes` can be inaccurate. + /// So, since in the future the counter should always have a bigger value than in the past, we skip this event. 
+ /// It can be reproduced with the following integration tests: + /// - test_hedged_requests/test.py::test_receive_timeout2 + /// - test_secure_socket::test + if (new_value < old_value) + continue; + elem.profile_events[i] = new_value - old_value; old_value = new_value; } diff --git a/src/Interpreters/ProcessList.cpp b/src/Interpreters/ProcessList.cpp index 177468f1c8b..435fda64bc2 100644 --- a/src/Interpreters/ProcessList.cpp +++ b/src/Interpreters/ProcessList.cpp @@ -276,7 +276,7 @@ ProcessList::insert(const String & query_, const IAST * ast, ContextMutablePtr q thread_group->performance_counters.setTraceProfileEvents(settings[Setting::trace_profile_events]); } - thread_group->memory_tracker.setDescription("(for query)"); + thread_group->memory_tracker.setDescription("Query"); if (settings[Setting::memory_tracker_fault_probability] > 0.0) thread_group->memory_tracker.setFaultProbability(settings[Setting::memory_tracker_fault_probability]); @@ -311,7 +311,7 @@ ProcessList::insert(const String & query_, const IAST * ast, ContextMutablePtr q /// Track memory usage for all simultaneously running queries from single user. user_process_list.user_memory_tracker.setOrRaiseHardLimit(settings[Setting::max_memory_usage_for_user]); user_process_list.user_memory_tracker.setSoftLimit(settings[Setting::memory_overcommit_ratio_denominator_for_user]); - user_process_list.user_memory_tracker.setDescription("(for user)"); + user_process_list.user_memory_tracker.setDescription("User"); if (!total_network_throttler && settings[Setting::max_network_bandwidth_for_all_users]) { @@ -447,12 +447,16 @@ void QueryStatus::ExecutorHolder::remove() executor = nullptr; } -CancellationCode QueryStatus::cancelQuery(bool) +CancellationCode QueryStatus::cancelQuery(bool /* kill */, std::exception_ptr exception) { - if (is_killed.load()) + if (is_killed.exchange(true)) return CancellationCode::CancelSent; - is_killed.store(true); + { + std::lock_guard lock{cancellation_exception_mutex}; + if (!cancellation_exception) + cancellation_exception = exception; + } std::vector executors_snapshot; @@ -486,7 +490,7 @@ void QueryStatus::addPipelineExecutor(PipelineExecutor * e) /// addPipelineExecutor() from the cancelQuery() context, and this will /// lead to deadlock. 
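The `MetricLog::stepFunction` hunk above (and, further down, `QueryMetricLog::createLogMetricElement`) skips a profile-event sample whenever the new reading is smaller than the previous one: the counters are expected to be monotonic, but the comment notes that at least `NetworkReceiveBytes` can be inaccurate. A minimal sketch of that guard, independent of the ClickHouse counter types:

#include <cstdint>
#include <optional>

/// Returns the increment since the last reading and updates the stored value,
/// or std::nullopt when the counter appears to have moved backwards (the stale
/// sample is skipped and `last` is left untouched, as in the patch).
std::optional<uint64_t> monotonicDelta(uint64_t & last, uint64_t current)
{
    if (current < last)
        return std::nullopt;   // non-monotonic glitch: skip this sample

    uint64_t delta = current - last;
    last = current;
    return delta;
}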
if (is_killed.load()) - throw Exception(ErrorCodes::QUERY_WAS_CANCELLED, "Query was cancelled"); + throwQueryWasCancelled(); std::lock_guard lock(executors_mutex); assert(!executors.contains(e)); @@ -512,11 +516,20 @@ void QueryStatus::removePipelineExecutor(PipelineExecutor * e) bool QueryStatus::checkTimeLimit() { if (is_killed.load()) - throw Exception(ErrorCodes::QUERY_WAS_CANCELLED, "Query was cancelled"); + throwQueryWasCancelled(); return limits.checkTimeLimit(watch, overflow_mode); } +void QueryStatus::throwQueryWasCancelled() const +{ + std::lock_guard lock{cancellation_exception_mutex}; + if (cancellation_exception) + std::rethrow_exception(cancellation_exception); + else + throw Exception(ErrorCodes::QUERY_WAS_CANCELLED, "Query was cancelled"); +} + bool QueryStatus::checkTimeLimitSoft() { if (is_killed.load()) diff --git a/src/Interpreters/ProcessList.h b/src/Interpreters/ProcessList.h index b2583e74d9b..f171fe8f4d4 100644 --- a/src/Interpreters/ProcessList.h +++ b/src/Interpreters/ProcessList.h @@ -109,6 +109,9 @@ protected: /// KILL was send to the query std::atomic is_killed { false }; + std::exception_ptr cancellation_exception TSA_GUARDED_BY(cancellation_exception_mutex); + mutable std::mutex cancellation_exception_mutex; + /// All data to the client already had been sent. /// Including EndOfStream or Exception. std::atomic is_all_data_sent { false }; @@ -127,6 +130,8 @@ protected: /// A weak pointer is used here because it's a ProcessListEntry which owns this QueryStatus, and not vice versa. void setProcessListEntry(std::weak_ptr process_list_entry_); + [[noreturn]] void throwQueryWasCancelled() const; + mutable std::mutex executors_mutex; struct ExecutorHolder @@ -225,7 +230,9 @@ public: QueryStatusInfo getInfo(bool get_thread_list = false, bool get_profile_events = false, bool get_settings = false) const; - CancellationCode cancelQuery(bool kill); + /// Cancels the current query. + /// Optional argument `exception` allows to set an exception which checkTimeLimit() will throw instead of "QUERY_WAS_CANCELLED". 
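The `ProcessList` changes above let `cancelQuery()` carry an optional exception: the first caller wins the `is_killed.exchange(true)` race, the exception is stored under its own mutex, and `throwQueryWasCancelled()` (now called from `checkTimeLimit()` and `addPipelineExecutor()`) rethrows it instead of the generic `QUERY_WAS_CANCELLED` error. A condensed sketch of that pattern using only standard types:

#include <atomic>
#include <exception>
#include <mutex>
#include <stdexcept>

class CancellationState
{
public:
    /// First cancellation wins; later calls return early and neither overwrite
    /// the stored exception nor report that they cancelled anything new.
    bool cancel(std::exception_ptr reason = nullptr)
    {
        if (killed.exchange(true))
            return false;
        std::lock_guard lock{mutex};
        if (!reason_ptr)
            reason_ptr = reason;
        return true;
    }

    /// Called from checkpoints such as checkTimeLimit(): rethrow the stored
    /// reason if there is one, otherwise a generic cancellation error.
    [[noreturn]] void throwCancelled() const
    {
        std::lock_guard lock{mutex};
        if (reason_ptr)
            std::rethrow_exception(reason_ptr);
        throw std::runtime_error("Query was cancelled");
    }

    bool isCancelled() const { return killed.load(); }

private:
    std::atomic<bool> killed{false};
    mutable std::mutex mutex;
    std::exception_ptr reason_ptr;
};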
+ CancellationCode cancelQuery(bool kill, std::exception_ptr exception = nullptr); bool isKilled() const { return is_killed; } diff --git a/src/Interpreters/QueryMetricLog.cpp b/src/Interpreters/QueryMetricLog.cpp index fea2024d3e4..5ab3fe590e0 100644 --- a/src/Interpreters/QueryMetricLog.cpp +++ b/src/Interpreters/QueryMetricLog.cpp @@ -15,6 +15,7 @@ #include #include +#include #include @@ -86,11 +87,11 @@ void QueryMetricLog::shutdown() Base::shutdown(); } -void QueryMetricLog::startQuery(const String & query_id, TimePoint query_start_time, UInt64 interval_milliseconds) +void QueryMetricLog::startQuery(const String & query_id, TimePoint start_time, UInt64 interval_milliseconds) { QueryMetricLogStatus status; status.interval_milliseconds = interval_milliseconds; - status.next_collect_time = query_start_time + std::chrono::milliseconds(interval_milliseconds); + status.next_collect_time = start_time + std::chrono::milliseconds(interval_milliseconds); auto context = getContext(); const auto & process_list = context->getProcessList(); @@ -99,24 +100,21 @@ void QueryMetricLog::startQuery(const String & query_id, TimePoint query_start_t const auto query_info = process_list.getQueryInfo(query_id, false, true, false); if (!query_info) { - LOG_TRACE(logger, "Query {} is not running anymore, so we couldn't get its QueryInfo", query_id); + LOG_TRACE(logger, "Query {} is not running anymore, so we couldn't get its QueryStatusInfo", query_id); return; } auto elem = createLogMetricElement(query_id, *query_info, current_time); if (elem) add(std::move(elem.value())); - else - LOG_TRACE(logger, "Query {} finished already while this collecting task was running", query_id); }); - status.task->scheduleAfter(interval_milliseconds); - std::lock_guard lock(queries_mutex); + status.task->scheduleAfter(interval_milliseconds); queries.emplace(query_id, std::move(status)); } -void QueryMetricLog::finishQuery(const String & query_id, QueryStatusInfoPtr query_info) +void QueryMetricLog::finishQuery(const String & query_id, TimePoint finish_time, QueryStatusInfoPtr query_info) { std::unique_lock lock(queries_mutex); auto it = queries.find(query_id); @@ -128,7 +126,7 @@ void QueryMetricLog::finishQuery(const String & query_id, QueryStatusInfoPtr que if (query_info) { - auto elem = createLogMetricElement(query_id, *query_info, std::chrono::system_clock::now(), false); + auto elem = createLogMetricElement(query_id, *query_info, finish_time, false); if (elem) add(std::move(elem.value())); } @@ -137,57 +135,84 @@ void QueryMetricLog::finishQuery(const String & query_id, QueryStatusInfoPtr que /// deactivating the task, which happens automatically on its destructor. Thus, we cannot /// deactivate/destroy the task while it's running. Now, the task locks `queries_mutex` to /// prevent concurrent edition of the queries. In short, the mutex order is: exec_mutex -> - /// queries_mutex. Thus, to prevent a deadblock we need to make sure that we always lock them in + /// queries_mutex. So, to prevent a deadblock we need to make sure that we always lock them in /// that order. { - /// Take ownership of the task so that we can destroy it in this scope after unlocking `queries_lock`. + /// Take ownership of the task so that we can destroy it in this scope after unlocking `queries_mutex`. auto task = std::move(it->second.task); /// Build an empty task for the old task to make sure it does not lock any mutex on its destruction. 
it->second.task = {}; + queries.erase(query_id); + /// Ensure `queries_mutex` is unlocked before calling task's destructor at the end of this /// scope which will lock `exec_mutex`. lock.unlock(); } - - lock.lock(); - queries.erase(query_id); } -std::optional QueryMetricLog::createLogMetricElement(const String & query_id, const QueryStatusInfo & query_info, TimePoint current_time, bool schedule_next) +std::optional QueryMetricLog::createLogMetricElement(const String & query_id, const QueryStatusInfo & query_info, TimePoint query_info_time, bool schedule_next) { - LOG_DEBUG(logger, "Collecting query_metric_log for query {}. Schedule next: {}", query_id, schedule_next); - std::lock_guard lock(queries_mutex); + /// fmtlib supports subsecond formatting in 10.0.0. We're in 9.1.0, so we need to add the milliseconds ourselves. + auto seconds = std::chrono::time_point_cast(query_info_time); + auto microseconds = std::chrono::duration_cast(query_info_time - seconds).count(); + LOG_DEBUG(logger, "Collecting query_metric_log for query {} with QueryStatusInfo from {:%Y.%m.%d %H:%M:%S}.{:06}. Schedule next: {}", query_id, seconds, microseconds, schedule_next); + + std::unique_lock lock(queries_mutex); auto query_status_it = queries.find(query_id); /// The query might have finished while the scheduled task is running. if (query_status_it == queries.end()) + { + lock.unlock(); + LOG_TRACE(logger, "Query {} finished already while this collecting task was running", query_id); return {}; + } + + auto & query_status = query_status_it->second; + if (query_info_time <= query_status.last_collect_time) + { + lock.unlock(); + LOG_TRACE(logger, "Query {} has a more recent metrics collected. Skipping this one", query_id); + return {}; + } + + query_status.last_collect_time = query_info_time; QueryMetricLogElement elem; - elem.event_time = timeInSeconds(current_time); - elem.event_time_microseconds = timeInMicroseconds(current_time); + elem.event_time = timeInSeconds(query_info_time); + elem.event_time_microseconds = timeInMicroseconds(query_info_time); elem.query_id = query_status_it->first; elem.memory_usage = query_info.memory_usage > 0 ? query_info.memory_usage : 0; elem.peak_memory_usage = query_info.peak_memory_usage > 0 ? query_info.peak_memory_usage : 0; - auto & query_status = query_status_it->second; if (query_info.profile_counters) { for (ProfileEvents::Event i = ProfileEvents::Event(0), end = ProfileEvents::end(); i < end; ++i) { const auto & new_value = (*(query_info.profile_counters))[i]; - elem.profile_events[i] = new_value - query_status.last_profile_events[i]; - query_status.last_profile_events[i] = new_value; + auto & old_value = query_status.last_profile_events[i]; + + /// Profile event counters are supposed to be monotonic. However, at least the `NetworkReceiveBytes` can be inaccurate. + /// So, since in the future the counter should always have a bigger value than in the past, we skip this event. 
+ /// It can be reproduced with the following integration tests: + /// - test_hedged_requests/test.py::test_receive_timeout2 + /// - test_secure_socket::test + if (new_value < old_value) + continue; + + elem.profile_events[i] = new_value - old_value; + old_value = new_value; } } else { - elem.profile_events = query_status.last_profile_events; + LOG_TRACE(logger, "Query {} has no profile counters", query_id); + elem.profile_events = std::vector(ProfileEvents::end()); } - if (query_status.task && schedule_next) + if (schedule_next) { query_status.next_collect_time += std::chrono::milliseconds(query_status.interval_milliseconds); const auto wait_time = std::chrono::duration_cast(query_status.next_collect_time - std::chrono::system_clock::now()).count(); diff --git a/src/Interpreters/QueryMetricLog.h b/src/Interpreters/QueryMetricLog.h index d7642bf0ab1..802cee7bf26 100644 --- a/src/Interpreters/QueryMetricLog.h +++ b/src/Interpreters/QueryMetricLog.h @@ -37,6 +37,7 @@ struct QueryMetricLogElement struct QueryMetricLogStatus { UInt64 interval_milliseconds; + std::chrono::system_clock::time_point last_collect_time; std::chrono::system_clock::time_point next_collect_time; std::vector last_profile_events = std::vector(ProfileEvents::end()); BackgroundSchedulePool::TaskHolder task; @@ -52,11 +53,11 @@ public: void shutdown() final; // Both startQuery and finishQuery are called from the thread that executes the query - void startQuery(const String & query_id, TimePoint query_start_time, UInt64 interval_milliseconds); - void finishQuery(const String & query_id, QueryStatusInfoPtr query_info = nullptr); + void startQuery(const String & query_id, TimePoint start_time, UInt64 interval_milliseconds); + void finishQuery(const String & query_id, TimePoint finish_time, QueryStatusInfoPtr query_info = nullptr); private: - std::optional createLogMetricElement(const String & query_id, const QueryStatusInfo & query_info, TimePoint current_time, bool schedule_next = true); + std::optional createLogMetricElement(const String & query_id, const QueryStatusInfo & query_info, TimePoint query_info_time, bool schedule_next = true); std::recursive_mutex queries_mutex; std::unordered_map queries; diff --git a/src/Interpreters/ReplicatedDatabaseQueryStatusSource.cpp b/src/Interpreters/ReplicatedDatabaseQueryStatusSource.cpp new file mode 100644 index 00000000000..09941b09238 --- /dev/null +++ b/src/Interpreters/ReplicatedDatabaseQueryStatusSource.cpp @@ -0,0 +1,170 @@ +#include +#include +#include +#include +#include +#include +#include + +namespace DB +{ +namespace Setting +{ +extern const SettingsBool database_replicated_enforce_synchronous_settings; +extern const SettingsDistributedDDLOutputMode distributed_ddl_output_mode; +} +namespace ErrorCodes +{ +extern const int TIMEOUT_EXCEEDED; +extern const int LOGICAL_ERROR; +} + +ReplicatedDatabaseQueryStatusSource::ReplicatedDatabaseQueryStatusSource( + const String & zk_node_path, const String & zk_replicas_path, ContextPtr context_, const Strings & hosts_to_wait) + : DistributedQueryStatusSource( + zk_node_path, zk_replicas_path, getSampleBlock(), context_, hosts_to_wait, "ReplicatedDatabaseQueryStatusSource") +{ +} + +ExecutionStatus ReplicatedDatabaseQueryStatusSource::checkStatus([[maybe_unused]] const String & host_id) +{ + /// Replicated database retries in case of error, it should not write error status. 
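The `QueryMetricLog::finishQuery` hunk above is careful about lock ordering: the collection task locks `exec_mutex` and then `queries_mutex`, so its destructor (which needs `exec_mutex`) must never run while `queries_mutex` is held. The patch therefore moves the task out of the map, unlocks, and only then lets it be destroyed. A schematic of that shape, with `Task` standing in for the background task holder:

#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Task { ~Task() { /* imagine this joins the task, which locks exec_mutex */ } };

std::mutex queries_mutex;
std::map<std::string, std::unique_ptr<Task>> queries;

/// Sketch of the deadlock-avoidance pattern: the running task acquires
/// exec_mutex and then queries_mutex, so the task's destructor must not run
/// while this thread still holds queries_mutex.
void finish(const std::string & query_id)
{
    std::unique_ptr<Task> doomed;              // destroyed only after the unlock
    {
        std::unique_lock lock(queries_mutex);
        auto it = queries.find(query_id);
        if (it == queries.end())
            return;
        doomed = std::move(it->second);        // take ownership out of the map
        queries.erase(it);
    }                                          // queries_mutex released here
    // `doomed` dies at end of scope, so its destructor runs without
    // queries_mutex, preserving the exec_mutex -> queries_mutex order.
}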
+#ifdef DEBUG_OR_SANITIZER_BUILD + fs::path status_path = fs::path(node_path) / "finished" / host_id; + return getExecutionStatus(status_path); +#else + return ExecutionStatus{0}; +#endif +} + +Chunk ReplicatedDatabaseQueryStatusSource::generateChunkWithUnfinishedHosts() const +{ + NameSet unfinished_hosts = waiting_hosts; + for (const auto & host_id : finished_hosts) + unfinished_hosts.erase(host_id); + + NameSet active_hosts_set = NameSet{current_active_hosts.begin(), current_active_hosts.end()}; + + /// Query is not finished on the rest hosts, so fill the corresponding rows with NULLs. + MutableColumns columns = output.getHeader().cloneEmptyColumns(); + for (const String & host_id : unfinished_hosts) + { + size_t num = 0; + auto [shard, replica] = DatabaseReplicated::parseFullReplicaName(host_id); + columns[num++]->insert(shard); + columns[num++]->insert(replica); + if (active_hosts_set.contains(host_id)) + columns[num++]->insert(IN_PROGRESS); + else + columns[num++]->insert(QUEUED); + + columns[num++]->insert(unfinished_hosts.size()); + columns[num++]->insert(current_active_hosts.size()); + } + return Chunk(std::move(columns), unfinished_hosts.size()); +} + +Strings ReplicatedDatabaseQueryStatusSource::getNodesToWait() +{ + String node_to_wait = "finished"; + if (context->getSettingsRef()[Setting::database_replicated_enforce_synchronous_settings]) + { + node_to_wait = "synced"; + } + + return {String(fs::path(node_path) / node_to_wait), String(fs::path(node_path) / "active")}; +} + +Chunk ReplicatedDatabaseQueryStatusSource::handleTimeoutExceeded() +{ + timeout_exceeded = true; + + size_t num_unfinished_hosts = waiting_hosts.size() - num_hosts_finished; + size_t num_active_hosts = current_active_hosts.size(); + + constexpr auto msg_format = "ReplicatedDatabase DDL task {} is not finished on {} of {} hosts " + "({} of them are currently executing the task, {} are inactive). " + "They are going to execute the query in background. Was waiting for {} seconds{}"; + + if (throw_on_timeout || (throw_on_timeout_only_active && !stop_waiting_offline_hosts)) + { + if (!first_exception) + first_exception = std::make_unique(Exception( + ErrorCodes::TIMEOUT_EXCEEDED, + msg_format, + node_path, + num_unfinished_hosts, + waiting_hosts.size(), + num_active_hosts, + offline_hosts.size(), + watch.elapsedSeconds(), + stop_waiting_offline_hosts ? "" : ", which is longer than distributed_ddl_task_timeout")); + + /// For Replicated database print a list of unfinished hosts as well. Will return empty block on next iteration. + return generateChunkWithUnfinishedHosts(); + } + + LOG_INFO( + log, + msg_format, + node_path, + num_unfinished_hosts, + waiting_hosts.size(), + num_active_hosts, + offline_hosts.size(), + watch.elapsedSeconds(), + stop_waiting_offline_hosts ? 
"" : "which is longer than distributed_ddl_task_timeout"); + + return generateChunkWithUnfinishedHosts(); +} + +Chunk ReplicatedDatabaseQueryStatusSource::stopWaitingOfflineHosts() +{ + // Same logic as timeout exceeded + return handleTimeoutExceeded(); +} + +void ReplicatedDatabaseQueryStatusSource::handleNonZeroStatusCode(const ExecutionStatus & status, const String & host_id) +{ + assert(status.code != 0); + + if (!first_exception && context->getSettingsRef()[Setting::distributed_ddl_output_mode] != DistributedDDLOutputMode::NEVER_THROW) + { + throw Exception(ErrorCodes::LOGICAL_ERROR, "There was an error on {}: {} (probably it's a bug)", host_id, status.message); + } +} + +void ReplicatedDatabaseQueryStatusSource::fillHostStatus(const String & host_id, const ExecutionStatus & status, MutableColumns & columns) +{ + size_t num = 0; + if (status.code != 0) + throw Exception(ErrorCodes::LOGICAL_ERROR, "There was an error on {}: {} (probably it's a bug)", host_id, status.message); + auto [shard, replica] = DatabaseReplicated::parseFullReplicaName(host_id); + columns[num++]->insert(shard); + columns[num++]->insert(replica); + columns[num++]->insert(OK); + columns[num++]->insert(waiting_hosts.size() - num_hosts_finished); + columns[num++]->insert(current_active_hosts.size()); +} + +Block ReplicatedDatabaseQueryStatusSource::getSampleBlock() +{ + auto get_status_enum = []() + { + return std::make_shared(DataTypeEnum8::Values{ + {"OK", static_cast(OK)}, + {"IN_PROGRESS", static_cast(IN_PROGRESS)}, + {"QUEUED", static_cast(QUEUED)}, + }); + }; + + return Block{ + {std::make_shared(), "shard"}, + {std::make_shared(), "replica"}, + {get_status_enum(), "status"}, + {std::make_shared(), "num_hosts_remaining"}, + {std::make_shared(), "num_hosts_active"}, + }; +} + +} diff --git a/src/Interpreters/ReplicatedDatabaseQueryStatusSource.h b/src/Interpreters/ReplicatedDatabaseQueryStatusSource.h new file mode 100644 index 00000000000..76a2d5f3f14 --- /dev/null +++ b/src/Interpreters/ReplicatedDatabaseQueryStatusSource.h @@ -0,0 +1,40 @@ +#pragma once + +#include +#include +#include +#include + +namespace DB +{ +class ReplicatedDatabaseQueryStatusSource final : public DistributedQueryStatusSource +{ +public: + ReplicatedDatabaseQueryStatusSource( + const String & zk_node_path, const String & zk_replicas_path, ContextPtr context_, const Strings & hosts_to_wait); + + String getName() const override { return "ReplicatedDatabaseQueryStatus"; } + +protected: + ExecutionStatus checkStatus(const String & host_id) override; + Chunk generateChunkWithUnfinishedHosts() const override; + Strings getNodesToWait() override; + Chunk handleTimeoutExceeded() override; + Chunk stopWaitingOfflineHosts() override; + void handleNonZeroStatusCode(const ExecutionStatus & status, const String & host_id) override; + void fillHostStatus(const String & host_id, const ExecutionStatus & status, MutableColumns & columns) override; + +private: + static Block getSampleBlock(); + + enum ReplicatedDatabaseQueryStatus + { + /// Query is (successfully) finished + OK = 0, + /// Query is not finished yet, but replica is currently executing it + IN_PROGRESS = 1, + /// Replica is not available or busy with previous queries. 
It will process query asynchronously + QUEUED = 2, + }; +}; +} diff --git a/src/Interpreters/ServerAsynchronousMetrics.cpp b/src/Interpreters/ServerAsynchronousMetrics.cpp index 079029695c9..46a811822c2 100644 --- a/src/Interpreters/ServerAsynchronousMetrics.cpp +++ b/src/Interpreters/ServerAsynchronousMetrics.cpp @@ -54,12 +54,14 @@ void calculateMaxAndSum(Max & max, Sum & sum, T x) ServerAsynchronousMetrics::ServerAsynchronousMetrics( ContextPtr global_context_, unsigned update_period_seconds, + bool update_heavy_metrics_, unsigned heavy_metrics_update_period_seconds, const ProtocolServerMetricsFunc & protocol_server_metrics_func_, bool update_jemalloc_epoch_, bool update_rss_) : WithContext(global_context_) , AsynchronousMetrics(update_period_seconds, protocol_server_metrics_func_, update_jemalloc_epoch_, update_rss_) + , update_heavy_metrics(update_heavy_metrics_) , heavy_metric_update_period(heavy_metrics_update_period_seconds) { /// sanity check @@ -412,7 +414,8 @@ void ServerAsynchronousMetrics::updateImpl(TimePoint update_time, TimePoint curr } #endif - updateHeavyMetricsIfNeeded(current_time, update_time, force_update, first_run, new_values); + if (update_heavy_metrics) + updateHeavyMetricsIfNeeded(current_time, update_time, force_update, first_run, new_values); } void ServerAsynchronousMetrics::logImpl(AsynchronousMetricValues & new_values) @@ -459,10 +462,10 @@ void ServerAsynchronousMetrics::updateDetachedPartsStats() void ServerAsynchronousMetrics::updateHeavyMetricsIfNeeded(TimePoint current_time, TimePoint update_time, bool force_update, bool first_run, AsynchronousMetricValues & new_values) { const auto time_since_previous_update = current_time - heavy_metric_previous_update_time; - const bool update_heavy_metrics = (time_since_previous_update >= heavy_metric_update_period) || force_update || first_run; + const bool need_update_heavy_metrics = (time_since_previous_update >= heavy_metric_update_period) || force_update || first_run; Stopwatch watch; - if (update_heavy_metrics) + if (need_update_heavy_metrics) { heavy_metric_previous_update_time = update_time; if (first_run) diff --git a/src/Interpreters/ServerAsynchronousMetrics.h b/src/Interpreters/ServerAsynchronousMetrics.h index 5fab419a32b..691ddd429b4 100644 --- a/src/Interpreters/ServerAsynchronousMetrics.h +++ b/src/Interpreters/ServerAsynchronousMetrics.h @@ -13,6 +13,7 @@ public: ServerAsynchronousMetrics( ContextPtr global_context_, unsigned update_period_seconds, + bool update_heavy_metrics_, unsigned heavy_metrics_update_period_seconds, const ProtocolServerMetricsFunc & protocol_server_metrics_func_, bool update_jemalloc_epoch_, @@ -24,6 +25,7 @@ private: void updateImpl(TimePoint update_time, TimePoint current_time, bool force_update, bool first_run, AsynchronousMetricValues & new_values) override; void logImpl(AsynchronousMetricValues & new_values) override; + bool update_heavy_metrics; const Duration heavy_metric_update_period; TimePoint heavy_metric_previous_update_time; double heavy_update_interval = 0.; diff --git a/src/Interpreters/Session.cpp b/src/Interpreters/Session.cpp index d92c6c462e7..bc6555af595 100644 --- a/src/Interpreters/Session.cpp +++ b/src/Interpreters/Session.cpp @@ -51,9 +51,9 @@ using NamedSessionKey = std::pair; struct NamedSessionData { NamedSessionKey key; - UInt64 close_cycle = 0; ContextMutablePtr context; std::chrono::steady_clock::duration timeout; + std::chrono::steady_clock::time_point close_time_bucket{}; NamedSessionsStorage & parent; NamedSessionData(NamedSessionKey key_, 
ContextPtr context_, std::chrono::steady_clock::duration timeout_, NamedSessionsStorage & parent_) @@ -137,6 +137,18 @@ public: if (!isSharedPtrUnique(session)) throw Exception(ErrorCodes::SESSION_IS_LOCKED, "Session {} is locked by a concurrent client", session_id); + + if (session->close_time_bucket != std::chrono::steady_clock::time_point{}) + { + auto bucket_it = close_time_buckets.find(session->close_time_bucket); + auto & bucket_sessions = bucket_it->second; + bucket_sessions.erase(key); + if (bucket_sessions.empty()) + close_time_buckets.erase(bucket_it); + + session->close_time_bucket = std::chrono::steady_clock::time_point{}; + } + return {session, false}; } @@ -179,33 +191,31 @@ private: } }; - /// TODO it's very complicated. Make simple std::map with time_t or boost::multi_index. using Container = std::unordered_map, SessionKeyHash>; - using CloseTimes = std::deque>; Container sessions; - CloseTimes close_times; - std::chrono::steady_clock::duration close_interval = std::chrono::seconds(1); - std::chrono::steady_clock::time_point close_cycle_time = std::chrono::steady_clock::now(); - UInt64 close_cycle = 0; + + // Ordered map of close times for sessions, grouped by the next multiple of close_interval + using CloseTimes = std::map>; + CloseTimes close_time_buckets; + + constexpr static std::chrono::steady_clock::duration close_interval = std::chrono::milliseconds(1000); + constexpr static std::chrono::nanoseconds::rep close_interval_ns = std::chrono::duration_cast(close_interval).count(); void scheduleCloseSession(NamedSessionData & session, std::unique_lock &) { - /// Push it on a queue of sessions to close, on a position corresponding to the timeout. - /// (timeout is measured from current moment of time) + chassert(session.close_time_bucket == std::chrono::steady_clock::time_point{}); - const UInt64 close_index = session.timeout / close_interval + 1; - const auto new_close_cycle = close_cycle + close_index; + const auto session_close_time = std::chrono::steady_clock::now() + session.timeout; + const auto session_close_time_ns = std::chrono::duration_cast(session_close_time.time_since_epoch()).count(); + const auto bucket_padding = close_interval - std::chrono::nanoseconds(session_close_time_ns % close_interval_ns); + const auto close_time_bucket = session_close_time + bucket_padding; - if (session.close_cycle != new_close_cycle) - { - session.close_cycle = new_close_cycle; - if (close_times.size() < close_index + 1) - close_times.resize(close_index + 1); - close_times[close_index].emplace_back(session.key); - } + session.close_time_bucket = close_time_bucket; + auto & bucket_sessions = close_time_buckets[close_time_bucket]; + bucket_sessions.insert(session.key); LOG_TEST(log, "Schedule closing session with session_id: {}, user_id: {}", - session.key.second, session.key.first); + session.key.second, session.key.first); } void cleanThread() @@ -214,55 +224,46 @@ private: std::unique_lock lock{mutex}; while (!quit) { - auto interval = closeSessions(lock); - if (cond.wait_for(lock, interval, [this]() -> bool { return quit; })) + closeSessions(lock); + if (cond.wait_for(lock, close_interval, [this]() -> bool { return quit; })) break; } } - /// Close sessions, that has been expired. Returns how long to wait for next session to be expired, if no new sessions will be added. 
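The rewritten `NamedSessionsStorage` above replaces the close-cycle queue with an ordered map of close-time buckets: each session's expiration is rounded up to the next multiple of `close_interval`, and the cleaner thread erases whole buckets whose key is already in the past. A small sketch of just the rounding step, using plain `std::chrono` arithmetic:

#include <chrono>

using Clock = std::chrono::steady_clock;

/// Round `deadline` up to the next multiple of `interval` (counted from the
/// clock's epoch). Deadlines already on a boundary are pushed a full interval
/// ahead, mirroring the bucket computation in scheduleCloseSession().
Clock::time_point nextBucket(Clock::time_point deadline, Clock::duration interval)
{
    const auto since_epoch = deadline.time_since_epoch();
    const auto remainder = since_epoch % interval;   // duration modulo
    return deadline + (interval - remainder);
}

/// Example: with a 1 s interval, a session expiring at t = 3.2 s lands in the
/// bucket keyed t = 4 s, together with everything else expiring in (3 s, 4 s].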
- std::chrono::steady_clock::duration closeSessions(std::unique_lock & lock) + void closeSessions(std::unique_lock & lock) { const auto now = std::chrono::steady_clock::now(); - /// The time to close the next session did not come - if (now < close_cycle_time) - return close_cycle_time - now; /// Will sleep until it comes. - - const auto current_cycle = close_cycle; - - ++close_cycle; - close_cycle_time = now + close_interval; - - if (close_times.empty()) - return close_interval; - - auto & sessions_to_close = close_times.front(); - - for (const auto & key : sessions_to_close) + for (auto bucket_it = close_time_buckets.begin(); bucket_it != close_time_buckets.end(); bucket_it = close_time_buckets.erase(bucket_it)) { - const auto session = sessions.find(key); + const auto & [time_bucket, session_keys] = *bucket_it; + if (time_bucket > now) + break; - if (session != sessions.end() && session->second->close_cycle <= current_cycle) + for (const auto & key : session_keys) { - if (session->second.use_count() != 1) - { - LOG_TEST(log, "Delay closing session with session_id: {}, user_id: {}", key.second, key.first); + const auto & session_it = sessions.find(key); - /// Skip but move it to close on the next cycle. - session->second->timeout = std::chrono::steady_clock::duration{0}; - scheduleCloseSession(*session->second, lock); - } - else + if (session_it == sessions.end()) + continue; + + const auto & session = session_it->second; + + if (session.use_count() != 1) { - LOG_TRACE(log, "Close session with session_id: {}, user_id: {}", key.second, key.first); - sessions.erase(session); + LOG_TEST(log, "Delay closing session with session_id: {}, user_id: {}, refcount: {}", + key.second, key.first, session.use_count()); + + session->timeout = std::chrono::steady_clock::duration{0}; + scheduleCloseSession(*session, lock); + continue; } + + LOG_TRACE(log, "Close session with session_id: {}, user_id: {}", key.second, key.first); + + sessions.erase(session_it); } } - - close_times.pop_front(); - return close_interval; } std::mutex mutex; @@ -382,12 +383,12 @@ void Session::authenticate(const Credentials & credentials_, const Poco::Net::So void Session::checkIfUserIsStillValid() { - if (user && user->valid_until) + if (const auto valid_until = user_authenticated_with.getValidUntil()) { const time_t now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now()); - if (now > user->valid_until) - throw Exception(ErrorCodes::USER_EXPIRED, "User expired"); + if (now > valid_until) + throw Exception(ErrorCodes::USER_EXPIRED, "Authentication method used has expired"); } } diff --git a/src/Interpreters/Set.cpp b/src/Interpreters/Set.cpp index bf6be4c0349..c6f0455652a 100644 --- a/src/Interpreters/Set.cpp +++ b/src/Interpreters/Set.cpp @@ -6,7 +6,9 @@ #include #include +#include +#include #include #include @@ -278,6 +280,108 @@ void Set::checkIsCreated() const throw Exception(ErrorCodes::LOGICAL_ERROR, "Trying to use set before it has been built."); } +ColumnUInt8::Ptr checkDateTimePrecision(const ColumnWithTypeAndName & column_to_cast) +{ + // Handle nullable columns + const ColumnNullable * original_nullable_column = typeid_cast(column_to_cast.column.get()); + const IColumn * original_nested_column = original_nullable_column + ? 
&original_nullable_column->getNestedColumn() + : column_to_cast.column.get(); + + // Check if the original column is of ColumnDecimal type + const auto * original_decimal_column = typeid_cast *>(original_nested_column); + if (!original_decimal_column) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Expected ColumnDecimal for DateTime64"); + + // Get the data array from the original column + const auto & original_data = original_decimal_column->getData(); + size_t vec_res_size = original_data.size(); + + // Prepare the precision null map + auto precision_null_map_column = ColumnUInt8::create(vec_res_size, 0); + NullMap & precision_null_map = precision_null_map_column->getData(); + + // Determine which rows should be null based on precision loss + const auto * datetime64_type = assert_cast(column_to_cast.type.get()); + auto scale = datetime64_type->getScale(); + if (scale >= 1) + { + Int64 scale_multiplier = common::exp10_i32(scale); + for (size_t row = 0; row < vec_res_size; ++row) + { + Int64 value = original_data[row]; + if (value % scale_multiplier != 0) + precision_null_map[row] = 1; // Mark as null due to precision loss + else + precision_null_map[row] = 0; + } + } + + return precision_null_map_column; +} + +ColumnPtr mergeNullMaps(const ColumnPtr & null_map_column1, const ColumnUInt8::Ptr & null_map_column2) +{ + if (!null_map_column1) + return null_map_column2; + if (!null_map_column2) + return null_map_column1; + + const auto & null_map1 = assert_cast(*null_map_column1).getData(); + const auto & null_map2 = (*null_map_column2).getData(); + + size_t size = null_map1.size(); + if (size != null_map2.size()) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Null maps have different sizes"); + + auto merged_null_map_column = ColumnUInt8::create(size); + auto & merged_null_map = merged_null_map_column->getData(); + + for (size_t i = 0; i < size; ++i) + merged_null_map[i] = null_map1[i] || null_map2[i]; + + return merged_null_map_column; +} + +void Set::processDateTime64Column( + const ColumnWithTypeAndName & column_to_cast, + ColumnPtr & result, + ColumnPtr & null_map_holder, + ConstNullMapPtr & null_map) const +{ + // Check for sub-second precision and create a null map + ColumnUInt8::Ptr filtered_null_map_column = checkDateTimePrecision(column_to_cast); + + // Extract existing null map and nested column from the result + const ColumnNullable * result_nullable_column = typeid_cast(result.get()); + const IColumn * nested_result_column = result_nullable_column + ? &result_nullable_column->getNestedColumn() + : result.get(); + + ColumnPtr existing_null_map_column = result_nullable_column + ? 
result_nullable_column->getNullMapColumnPtr() + : nullptr; + + if (transform_null_in) + { + if (!null_map_holder) + null_map_holder = filtered_null_map_column; + else + null_map_holder = mergeNullMaps(null_map_holder, filtered_null_map_column); + + const ColumnUInt8 * null_map_column = checkAndGetColumn(null_map_holder.get()); + if (!null_map_column) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Null map must be ColumnUInt8"); + + null_map = &null_map_column->getData(); + } + else + { + ColumnPtr merged_null_map_column = mergeNullMaps(existing_null_map_column, filtered_null_map_column); + result = ColumnNullable::create(nested_result_column->getPtr(), merged_null_map_column); + } +} + ColumnPtr Set::execute(const ColumnsWithTypeAndName & columns, bool negative) const { size_t num_key_columns = columns.size(); @@ -314,6 +418,10 @@ ColumnPtr Set::execute(const ColumnsWithTypeAndName & columns, bool negative) co Columns materialized_columns; materialized_columns.reserve(num_key_columns); + /// We will check existence in Set only for keys whose components do not contain any NULL value. + ConstNullMapPtr null_map{}; + ColumnPtr null_map_holder; + for (size_t i = 0; i < num_key_columns; ++i) { ColumnPtr result; @@ -331,13 +439,17 @@ ColumnPtr Set::execute(const ColumnsWithTypeAndName & columns, bool negative) co result = castColumnAccurate(column_to_cast, data_types[i], cast_cache.get()); } - materialized_columns.emplace_back() = result; - key_columns.emplace_back() = materialized_columns.back().get(); + // If the original column is DateTime64, check for sub-second precision + if (isDateTime64(column_to_cast.column->getDataType())) + { + processDateTime64Column(column_to_cast, result, null_map_holder, null_map); + } + + // Append the result to materialized columns + materialized_columns.emplace_back(std::move(result)); + key_columns.emplace_back(materialized_columns.back().get()); } - /// We will check existence in Set only for keys whose components do not contain any NULL value. - ConstNullMapPtr null_map{}; - ColumnPtr null_map_holder; if (!transform_null_in) null_map_holder = extractNestedColumnsAndNullMap(key_columns, null_map); diff --git a/src/Interpreters/Set.h b/src/Interpreters/Set.h index 8a821d87dfb..240d651352d 100644 --- a/src/Interpreters/Set.h +++ b/src/Interpreters/Set.h @@ -61,6 +61,8 @@ public: void checkIsCreated() const; + void processDateTime64Column(const ColumnWithTypeAndName & column_to_cast, ColumnPtr & result, ColumnPtr & null_map_holder, ConstNullMapPtr & null_map) const; + /** For columns of 'block', check belonging of corresponding rows to the set. * Return UInt8 column with the result. */ diff --git a/src/Interpreters/SystemLog.cpp b/src/Interpreters/SystemLog.cpp index bbdeb4567af..aafe819967f 100644 --- a/src/Interpreters/SystemLog.cpp +++ b/src/Interpreters/SystemLog.cpp @@ -298,9 +298,6 @@ SystemLogs::SystemLogs(ContextPtr global_context, const Poco::Util::AbstractConf #undef CREATE_PUBLIC_MEMBERS /// NOLINTEND(bugprone-macro-parentheses) - if (session_log) - global_context->addWarningMessage("Table system.session_log is enabled. It's unreliable and may contain garbage. 
Do not use it for any kind of security monitoring."); - bool should_prepare = global_context->getServerSettings()[ServerSetting::prepare_system_log_tables_on_startup]; try { diff --git a/src/Interpreters/ThreadStatusExt.cpp b/src/Interpreters/ThreadStatusExt.cpp index 0544bbcc92e..4d27a840d51 100644 --- a/src/Interpreters/ThreadStatusExt.cpp +++ b/src/Interpreters/ThreadStatusExt.cpp @@ -119,7 +119,7 @@ void ThreadGroup::unlinkThread() ThreadGroupPtr ThreadGroup::createForQuery(ContextPtr query_context_, std::function fatal_error_callback_) { auto group = std::make_shared(query_context_, std::move(fatal_error_callback_)); - group->memory_tracker.setDescription("(for query)"); + group->memory_tracker.setDescription("Query"); return group; } @@ -127,7 +127,7 @@ ThreadGroupPtr ThreadGroup::createForBackgroundProcess(ContextPtr storage_contex { auto group = std::make_shared(storage_context); - group->memory_tracker.setDescription("background process to apply mutate/merge in table"); + group->memory_tracker.setDescription("Background process (mutate/merge)"); /// However settings from storage context have to be applied const Settings & settings = storage_context->getSettingsRef(); group->memory_tracker.setProfilerStep(settings[Setting::memory_profiler_step]); @@ -384,7 +384,7 @@ void ThreadStatus::initPerformanceCounters() /// TODO: make separate query_thread_performance_counters and thread_performance_counters performance_counters.resetCounters(); memory_tracker.resetCounters(); - memory_tracker.setDescription("(for thread)"); + memory_tracker.setDescription("Thread"); query_start_time.setUp(); diff --git a/src/Interpreters/executeDDLQueryOnCluster.cpp b/src/Interpreters/executeDDLQueryOnCluster.cpp index c5d58a873fb..0b88d07148c 100644 --- a/src/Interpreters/executeDDLQueryOnCluster.cpp +++ b/src/Interpreters/executeDDLQueryOnCluster.cpp @@ -1,33 +1,32 @@ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include +#include #include #include #include -#include -#include -#include "Parsers/ASTSystemQuery.h" -#include -#include -#include -#include #include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include #include #include -#include #include +#include +#include -namespace fs = std::filesystem; - namespace DB { namespace Setting @@ -41,21 +40,11 @@ namespace Setting namespace ErrorCodes { - extern const int NOT_IMPLEMENTED; - extern const int TIMEOUT_EXCEEDED; - extern const int UNFINISHED; - extern const int QUERY_IS_PROHIBITED; - extern const int LOGICAL_ERROR; +extern const int NOT_IMPLEMENTED; +extern const int QUERY_IS_PROHIBITED; +extern const int LOGICAL_ERROR; } -static ZooKeeperRetriesInfo getRetriesInfo() -{ - const auto & config_ref = Context::getGlobalContextInstance()->getConfigRef(); - return ZooKeeperRetriesInfo( - config_ref.getInt("distributed_ddl_keeper_max_retries", 5), - config_ref.getInt("distributed_ddl_keeper_initial_backoff_ms", 100), - config_ref.getInt("distributed_ddl_keeper_max_backoff_ms", 5000)); -} bool isSupportedAlterTypeForOnClusterDDLQuery(int type) { @@ -200,74 +189,21 @@ BlockIO executeDDLQueryOnCluster(const ASTPtr & query_ptr_, ContextPtr context, entry.setSettingsIfRequired(context); entry.tracing_context = OpenTelemetry::CurrentContext(); entry.initial_query_id = context->getClientInfo().initial_query_id; - String node_path = ddl_worker.enqueueQuery(entry); + String node_path = 
ddl_worker.enqueueQuery(entry, params.retries_info, context->getProcessListElement()); - return getDistributedDDLStatus(node_path, entry, context, /* hosts_to_wait */ nullptr); + return getDDLOnClusterStatus(node_path, ddl_worker.getReplicasDir(), entry, context); } - -class DDLQueryStatusSource final : public ISource -{ -public: - DDLQueryStatusSource( - const String & zk_node_path, const DDLLogEntry & entry, ContextPtr context_, const Strings * hosts_to_wait); - - String getName() const override { return "DDLQueryStatus"; } - Chunk generate() override; - Status prepare() override; - -private: - static Block getSampleBlock(ContextPtr context_, bool hosts_to_wait); - - Strings getNewAndUpdate(const Strings & current_list_of_finished_hosts); - - std::pair parseHostAndPort(const String & host_id) const; - - Chunk generateChunkWithUnfinishedHosts() const; - - enum ReplicatedDatabaseQueryStatus - { - /// Query is (successfully) finished - OK = 0, - /// Query is not finished yet, but replica is currently executing it - IN_PROGRESS = 1, - /// Replica is not available or busy with previous queries. It will process query asynchronously - QUEUED = 2, - }; - - String node_path; - ContextPtr context; - Stopwatch watch; - LoggerPtr log; - - NameSet waiting_hosts; /// hosts from task host list - NameSet finished_hosts; /// finished hosts from host list - NameSet ignoring_hosts; /// appeared hosts that are not in hosts list - Strings current_active_hosts; /// Hosts that are currently executing the task - NameSet offline_hosts; /// Hosts that are not currently running - size_t num_hosts_finished = 0; - - /// Save the first detected error and throw it at the end of execution - std::unique_ptr first_exception; - - Int64 timeout_seconds = 120; - bool is_replicated_database = false; - bool throw_on_timeout = true; - bool throw_on_timeout_only_active = false; - bool only_running_hosts = false; - - bool timeout_exceeded = false; - bool stop_waiting_offline_hosts = false; -}; - - -BlockIO getDistributedDDLStatus(const String & node_path, const DDLLogEntry & entry, ContextPtr context, const Strings * hosts_to_wait) +BlockIO getDDLOnClusterStatus(const String & node_path, const String & replicas_path, const DDLLogEntry & entry, ContextPtr context) { BlockIO io; if (context->getSettingsRef()[Setting::distributed_ddl_task_timeout] == 0) return io; + Strings hosts_to_wait; + for (const HostID & host : entry.hosts) + hosts_to_wait.push_back(host.toString()); - auto source = std::make_shared(node_path, entry, context, hosts_to_wait); + auto source = std::make_shared(node_path, replicas_path, context, hosts_to_wait); io.pipeline = QueryPipeline(std::move(source)); if (context->getSettingsRef()[Setting::distributed_ddl_output_mode] == DistributedDDLOutputMode::NONE @@ -277,394 +213,6 @@ BlockIO getDistributedDDLStatus(const String & node_path, const DDLLogEntry & en return io; } -Block DDLQueryStatusSource::getSampleBlock(ContextPtr context_, bool hosts_to_wait) -{ - auto output_mode = context_->getSettingsRef()[Setting::distributed_ddl_output_mode]; - - auto maybe_make_nullable = [&](const DataTypePtr & type) -> DataTypePtr - { - if (output_mode == DistributedDDLOutputMode::THROW || - output_mode == DistributedDDLOutputMode::NONE || - output_mode == DistributedDDLOutputMode::NONE_ONLY_ACTIVE) - return type; - return std::make_shared(type); - }; - - auto get_status_enum = []() - { - return std::make_shared( - DataTypeEnum8::Values - { - {"OK", static_cast(OK)}, - {"IN_PROGRESS", static_cast(IN_PROGRESS)}, - {"QUEUED", 
static_cast(QUEUED)}, - }); - }; - - if (hosts_to_wait) - { - return Block{ - {std::make_shared(), "shard"}, - {std::make_shared(), "replica"}, - {get_status_enum(), "status"}, - {std::make_shared(), "num_hosts_remaining"}, - {std::make_shared(), "num_hosts_active"}, - }; - } - - return Block{ - {std::make_shared(), "host"}, - {std::make_shared(), "port"}, - {maybe_make_nullable(std::make_shared()), "status"}, - {maybe_make_nullable(std::make_shared()), "error"}, - {std::make_shared(), "num_hosts_remaining"}, - {std::make_shared(), "num_hosts_active"}, - }; -} - -DDLQueryStatusSource::DDLQueryStatusSource( - const String & zk_node_path, const DDLLogEntry & entry, ContextPtr context_, const Strings * hosts_to_wait) - : ISource(getSampleBlock(context_, static_cast(hosts_to_wait))) - , node_path(zk_node_path) - , context(context_) - , watch(CLOCK_MONOTONIC_COARSE) - , log(getLogger("DDLQueryStatusSource")) -{ - auto output_mode = context->getSettingsRef()[Setting::distributed_ddl_output_mode]; - throw_on_timeout = output_mode == DistributedDDLOutputMode::THROW || output_mode == DistributedDDLOutputMode::NONE; - throw_on_timeout_only_active = output_mode == DistributedDDLOutputMode::THROW_ONLY_ACTIVE || output_mode == DistributedDDLOutputMode::NONE_ONLY_ACTIVE; - - if (hosts_to_wait) - { - waiting_hosts = NameSet(hosts_to_wait->begin(), hosts_to_wait->end()); - is_replicated_database = true; - only_running_hosts = output_mode == DistributedDDLOutputMode::THROW_ONLY_ACTIVE || - output_mode == DistributedDDLOutputMode::NULL_STATUS_ON_TIMEOUT_ONLY_ACTIVE || - output_mode == DistributedDDLOutputMode::NONE_ONLY_ACTIVE; - } - else - { - for (const HostID & host : entry.hosts) - waiting_hosts.emplace(host.toString()); - } - - addTotalRowsApprox(waiting_hosts.size()); - timeout_seconds = context->getSettingsRef()[Setting::distributed_ddl_task_timeout]; -} - -std::pair DDLQueryStatusSource::parseHostAndPort(const String & host_id) const -{ - String host = host_id; - UInt16 port = 0; - if (!is_replicated_database) - { - auto host_and_port = Cluster::Address::fromString(host_id); - host = host_and_port.first; - port = host_and_port.second; - } - return {host, port}; -} - -Chunk DDLQueryStatusSource::generateChunkWithUnfinishedHosts() const -{ - NameSet unfinished_hosts = waiting_hosts; - for (const auto & host_id : finished_hosts) - unfinished_hosts.erase(host_id); - - NameSet active_hosts_set = NameSet{current_active_hosts.begin(), current_active_hosts.end()}; - - /// Query is not finished on the rest hosts, so fill the corresponding rows with NULLs. 
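For reference: the template arguments inside the removed getSampleBlock() above were stripped when this hunk was extracted (the same applies to most angle brackets in this diff). Below is a hedged reconstruction of the status enum it built; DataTypeEnum8 and the OK/IN_PROGRESS/QUEUED values are named in the hunk itself, while the Int8 casts are assumed.

    auto get_status_enum = []()
    {
        /// Reconstruction, not part of the patch: concrete spellings of the stripped template arguments are assumed.
        return std::make_shared<DataTypeEnum8>(
            DataTypeEnum8::Values
            {
                {"OK",          static_cast<Int8>(OK)},
                {"IN_PROGRESS", static_cast<Int8>(IN_PROGRESS)},
                {"QUEUED",      static_cast<Int8>(QUEUED)},
            });
    };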
- MutableColumns columns = output.getHeader().cloneEmptyColumns(); - for (const String & host_id : unfinished_hosts) - { - size_t num = 0; - if (is_replicated_database) - { - auto [shard, replica] = DatabaseReplicated::parseFullReplicaName(host_id); - columns[num++]->insert(shard); - columns[num++]->insert(replica); - if (active_hosts_set.contains(host_id)) - columns[num++]->insert(IN_PROGRESS); - else - columns[num++]->insert(QUEUED); - } - else - { - auto [host, port] = parseHostAndPort(host_id); - columns[num++]->insert(host); - columns[num++]->insert(port); - columns[num++]->insert(Field{}); - columns[num++]->insert(Field{}); - } - columns[num++]->insert(unfinished_hosts.size()); - columns[num++]->insert(current_active_hosts.size()); - } - return Chunk(std::move(columns), unfinished_hosts.size()); -} - -static NameSet getOfflineHosts(const String & node_path, const NameSet & hosts_to_wait, const ZooKeeperPtr & zookeeper, LoggerPtr log) -{ - fs::path replicas_path; - if (node_path.ends_with('/')) - replicas_path = fs::path(node_path).parent_path().parent_path().parent_path() / "replicas"; - else - replicas_path = fs::path(node_path).parent_path().parent_path() / "replicas"; - - Strings paths; - Strings hosts_array; - for (const auto & host : hosts_to_wait) - { - hosts_array.push_back(host); - paths.push_back(replicas_path / host / "active"); - } - - NameSet offline; - auto res = zookeeper->tryGet(paths); - for (size_t i = 0; i < res.size(); ++i) - if (res[i].error == Coordination::Error::ZNONODE) - offline.insert(hosts_array[i]); - - if (offline.size() == hosts_to_wait.size()) - { - /// Avoid reporting that all hosts are offline - LOG_WARNING(log, "Did not find active hosts, will wait for all {} hosts. This should not happen often", offline.size()); - return {}; - } - - return offline; -} - -Chunk DDLQueryStatusSource::generate() -{ - bool all_hosts_finished = num_hosts_finished >= waiting_hosts.size(); - - /// Seems like num_hosts_finished cannot be strictly greater than waiting_hosts.size() - assert(num_hosts_finished <= waiting_hosts.size()); - - if (all_hosts_finished || timeout_exceeded) - return {}; - - String node_to_wait = "finished"; - if (is_replicated_database && context->getSettingsRef()[Setting::database_replicated_enforce_synchronous_settings]) - node_to_wait = "synced"; - - size_t try_number = 0; - - while (true) - { - if (isCancelled()) - return {}; - - if (stop_waiting_offline_hosts || (timeout_seconds >= 0 && watch.elapsedSeconds() > timeout_seconds)) - { - timeout_exceeded = true; - - size_t num_unfinished_hosts = waiting_hosts.size() - num_hosts_finished; - size_t num_active_hosts = current_active_hosts.size(); - - constexpr auto msg_format = "Distributed DDL task {} is not finished on {} of {} hosts " - "({} of them are currently executing the task, {} are inactive). " - "They are going to execute the query in background. Was waiting for {} seconds{}"; - - if (throw_on_timeout || (throw_on_timeout_only_active && !stop_waiting_offline_hosts)) - { - if (!first_exception) - first_exception = std::make_unique(Exception(ErrorCodes::TIMEOUT_EXCEEDED, - msg_format, node_path, num_unfinished_hosts, waiting_hosts.size(), num_active_hosts, offline_hosts.size(), - watch.elapsedSeconds(), stop_waiting_offline_hosts ? "" : ", which is longer than distributed_ddl_task_timeout")); - - /// For Replicated database print a list of unfinished hosts as well. Will return empty block on next iteration. 
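The removed getOfflineHosts() helper above derives the replicas directory from the task's znode path and treats a replica as offline when its ephemeral "active" node is absent. A condensed sketch of that check, written as if replicas_path is already known, which is exactly what the new getDDLOnClusterStatus(node_path, replicas_path, ...) signature makes explicit; variable names follow the removed code.

    /// Sketch only: offline-replica detection with an explicitly provided replicas_path.
    Strings hosts_array;
    Strings paths;
    for (const auto & host : hosts_to_wait)
    {
        hosts_array.push_back(host);
        paths.push_back(fs::path(replicas_path) / host / "active");
    }

    NameSet offline;
    auto res = zookeeper->tryGet(paths);
    for (size_t i = 0; i < res.size(); ++i)
        if (res[i].error == Coordination::Error::ZNONODE)
            offline.insert(hosts_array[i]);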
- if (is_replicated_database) - return generateChunkWithUnfinishedHosts(); - return {}; - } - - LOG_INFO(log, msg_format, node_path, num_unfinished_hosts, waiting_hosts.size(), num_active_hosts, offline_hosts.size(), - watch.elapsedSeconds(), stop_waiting_offline_hosts ? "" : "which is longer than distributed_ddl_task_timeout"); - - return generateChunkWithUnfinishedHosts(); - } - - sleepForMilliseconds(std::min(1000, 50 * try_number)); - - bool node_exists = false; - Strings tmp_hosts; - Strings tmp_active_hosts; - - { - auto retries_ctl = ZooKeeperRetriesControl( - "executeDDLQueryOnCluster", getLogger("DDLQueryStatusSource"), getRetriesInfo(), context->getProcessListElement()); - retries_ctl.retryLoop([&]() - { - auto zookeeper = context->getZooKeeper(); - Strings paths = {String(fs::path(node_path) / node_to_wait), String(fs::path(node_path) / "active")}; - auto res = zookeeper->tryGetChildren(paths); - for (size_t i = 0; i < res.size(); ++i) - if (res[i].error != Coordination::Error::ZOK && res[i].error != Coordination::Error::ZNONODE) - throw Coordination::Exception::fromPath(res[i].error, paths[i]); - - if (res[0].error == Coordination::Error::ZNONODE) - node_exists = zookeeper->exists(node_path); - else - node_exists = true; - tmp_hosts = res[0].names; - tmp_active_hosts = res[1].names; - - if (only_running_hosts) - offline_hosts = getOfflineHosts(node_path, waiting_hosts, zookeeper, log); - }); - } - - if (!node_exists) - { - /// Paradoxically, this exception will be throw even in case of "never_throw" mode. - - if (!first_exception) - first_exception = std::make_unique(Exception(ErrorCodes::UNFINISHED, - "Cannot provide query execution status. The query's node {} has been deleted by the cleaner" - " since it was finished (or its lifetime is expired)", - node_path)); - return {}; - } - - Strings new_hosts = getNewAndUpdate(tmp_hosts); - ++try_number; - - if (only_running_hosts) - { - size_t num_finished_or_offline = 0; - for (const auto & host : waiting_hosts) - num_finished_or_offline += finished_hosts.contains(host) || offline_hosts.contains(host); - - if (num_finished_or_offline == waiting_hosts.size()) - stop_waiting_offline_hosts = true; - } - - if (new_hosts.empty()) - continue; - - current_active_hosts = std::move(tmp_active_hosts); - - MutableColumns columns = output.getHeader().cloneEmptyColumns(); - for (const String & host_id : new_hosts) - { - ExecutionStatus status(-1, "Cannot obtain error message"); - - /// Replicated database retries in case of error, it should not write error status. 
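The znode polling above and the per-host status read below both wrap ZooKeeper access in a ZooKeeperRetriesControl. A minimal sketch of that pattern, assembled from the removed code; the only moving part changed by this patch is that the ZooKeeperRetriesInfo now arrives through DDLQueryOnClusterParams::retries_info instead of the deleted getRetriesInfo() helper.

    /// Sketch only: the retry-wrapped ZooKeeper read used for per-host status.
    String status_data;
    bool finished_exists = false;

    auto retries_ctl = ZooKeeperRetriesControl(
        "executeDDLQueryOnCluster",
        getLogger("DDLQueryStatusSource"),
        retries_info,                          /// now supplied via DDLQueryOnClusterParams::retries_info
        context->getProcessListElement());

    retries_ctl.retryLoop([&]()
    {
        auto zookeeper = context->getZooKeeper();
        finished_exists = zookeeper->tryGet(fs::path(node_path) / "finished" / host_id, status_data);
    });

    if (finished_exists)
        status.tryDeserializeText(status_data);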
-#ifdef DEBUG_OR_SANITIZER_BUILD - bool need_check_status = true; -#else - bool need_check_status = !is_replicated_database; -#endif - if (need_check_status) - { - String status_data; - bool finished_exists = false; - - auto retries_ctl = ZooKeeperRetriesControl( - "executeDDLQueryOnCluster", - getLogger("DDLQueryStatusSource"), - getRetriesInfo(), - context->getProcessListElement()); - retries_ctl.retryLoop([&]() - { - finished_exists = context->getZooKeeper()->tryGet(fs::path(node_path) / "finished" / host_id, status_data); - }); - if (finished_exists) - status.tryDeserializeText(status_data); - } - else - { - status = ExecutionStatus{0}; - } - - - if (status.code != 0 && !first_exception - && context->getSettingsRef()[Setting::distributed_ddl_output_mode] != DistributedDDLOutputMode::NEVER_THROW) - { - if (is_replicated_database) - throw Exception(ErrorCodes::LOGICAL_ERROR, "There was an error on {}: {} (probably it's a bug)", host_id, status.message); - - auto [host, port] = parseHostAndPort(host_id); - first_exception = std::make_unique(Exception(status.code, - "There was an error on [{}:{}]: {}", host, port, status.message)); - } - - ++num_hosts_finished; - - size_t num = 0; - if (is_replicated_database) - { - if (status.code != 0) - throw Exception(ErrorCodes::LOGICAL_ERROR, "There was an error on {}: {} (probably it's a bug)", host_id, status.message); - auto [shard, replica] = DatabaseReplicated::parseFullReplicaName(host_id); - columns[num++]->insert(shard); - columns[num++]->insert(replica); - columns[num++]->insert(OK); - } - else - { - auto [host, port] = parseHostAndPort(host_id); - columns[num++]->insert(host); - columns[num++]->insert(port); - columns[num++]->insert(status.code); - columns[num++]->insert(status.message); - } - columns[num++]->insert(waiting_hosts.size() - num_hosts_finished); - columns[num++]->insert(current_active_hosts.size()); - } - - return Chunk(std::move(columns), new_hosts.size()); - } -} - -IProcessor::Status DDLQueryStatusSource::prepare() -{ - /// This method is overloaded to throw exception after all data is read. - /// Exception is pushed into pipe (instead of simply being thrown) to ensure the order of data processing and exception. - - if (finished) - { - if (first_exception) - { - if (!output.canPush()) - return Status::PortFull; - - output.pushException(std::make_exception_ptr(*first_exception)); - } - - output.finish(); - return Status::Finished; - } - return ISource::prepare(); -} - -Strings DDLQueryStatusSource::getNewAndUpdate(const Strings & current_list_of_finished_hosts) -{ - Strings diff; - for (const String & host : current_list_of_finished_hosts) - { - if (!waiting_hosts.contains(host)) - { - if (!ignoring_hosts.contains(host)) - { - ignoring_hosts.emplace(host); - LOG_INFO(log, "Unexpected host {} appeared in task {}", host, node_path); - } - continue; - } - - if (!finished_hosts.contains(host)) - { - diff.emplace_back(host); - finished_hosts.emplace(host); - } - } - - return diff; -} - - bool maybeRemoveOnCluster(const ASTPtr & query_ptr, ContextPtr context) { const auto * query = dynamic_cast(query_ptr.get()); diff --git a/src/Interpreters/executeDDLQueryOnCluster.h b/src/Interpreters/executeDDLQueryOnCluster.h index d3365553875..69e0c38834e 100644 --- a/src/Interpreters/executeDDLQueryOnCluster.h +++ b/src/Interpreters/executeDDLQueryOnCluster.h @@ -37,13 +37,16 @@ struct DDLQueryOnClusterParams /// Privileges which the current user should have to execute a query. 
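On the caller side, the retries_info field added to DDLQueryOnClusterParams just below replaces the file-local getRetriesInfo() helper deleted above. A hedged sketch of a call site: executeSomeDDLOnCluster is a hypothetical wrapper, while the config keys and their defaults are copied from the deleted helper.

    /// Sketch only: populating DDLQueryOnClusterParams::retries_info at a call site.
    BlockIO executeSomeDDLOnCluster(const ASTPtr & query_ptr, ContextPtr context)   /// hypothetical
    {
        const auto & config = Context::getGlobalContextInstance()->getConfigRef();

        DDLQueryOnClusterParams params;
        params.retries_info = ZooKeeperRetriesInfo(
            config.getInt("distributed_ddl_keeper_max_retries", 5),
            config.getInt("distributed_ddl_keeper_initial_backoff_ms", 100),
            config.getInt("distributed_ddl_keeper_max_backoff_ms", 5000));

        return executeDDLQueryOnCluster(query_ptr, context, params);
    }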
AccessRightsElements access_to_check; + + /// Use retries when creating nodes "query-0000000000", "query-0000000001", "query-0000000002" in ZooKeeper. + ZooKeeperRetriesInfo retries_info; }; /// Pushes distributed DDL query to the queue. /// Returns DDLQueryStatusSource, which reads results of query execution on each host in the cluster. BlockIO executeDDLQueryOnCluster(const ASTPtr & query_ptr, ContextPtr context, const DDLQueryOnClusterParams & params = {}); -BlockIO getDistributedDDLStatus(const String & node_path, const DDLLogEntry & entry, ContextPtr context, const Strings * hosts_to_wait); +BlockIO getDDLOnClusterStatus(const String & node_path, const String & replicas_path, const DDLLogEntry & entry, ContextPtr context); bool maybeRemoveOnCluster(const ASTPtr & query_ptr, ContextPtr context); diff --git a/src/Interpreters/executeQuery.cpp b/src/Interpreters/executeQuery.cpp index a8fcfff65ad..9250c069283 100644 --- a/src/Interpreters/executeQuery.cpp +++ b/src/Interpreters/executeQuery.cpp @@ -81,6 +81,7 @@ #include #include +#include #include #include @@ -460,7 +461,7 @@ QueryLogElement logQueryStart( return elem; } -void logQueryMetricLogFinish(ContextPtr context, bool internal, String query_id, QueryStatusInfoPtr info) +void logQueryMetricLogFinish(ContextPtr context, bool internal, String query_id, std::chrono::system_clock::time_point finish_time, QueryStatusInfoPtr info) { if (auto query_metric_log = context->getQueryMetricLog(); query_metric_log && !internal) { @@ -475,11 +476,11 @@ void logQueryMetricLogFinish(ContextPtr context, bool internal, String query_id, /// to query the final state in query_log. auto collect_on_finish = info->elapsed_microseconds > interval_milliseconds * 1000; auto query_info = collect_on_finish ? info : nullptr; - query_metric_log->finishQuery(query_id, query_info); + query_metric_log->finishQuery(query_id, finish_time, query_info); } else { - query_metric_log->finishQuery(query_id, nullptr); + query_metric_log->finishQuery(query_id, finish_time, nullptr); } } } @@ -503,6 +504,7 @@ void logQueryFinish( /// Update performance counters before logging to query_log CurrentThread::finalizePerformanceCounters(); + auto time_now = std::chrono::system_clock::now(); QueryStatusInfo info = process_list_elem->getInfo(true, settings[Setting::log_profile_events]); elem.type = QueryLogElementType::QUERY_FINISH; @@ -597,7 +599,7 @@ void logQueryFinish( } } - logQueryMetricLogFinish(context, internal, elem.client_info.current_query_id, std::make_shared(info)); + logQueryMetricLogFinish(context, internal, elem.client_info.current_query_id, time_now, std::make_shared(info)); } if (query_span) @@ -697,7 +699,7 @@ void logQueryException( query_span->finish(); } - logQueryMetricLogFinish(context, internal, elem.client_info.current_query_id, info); + logQueryMetricLogFinish(context, internal, elem.client_info.current_query_id, time_now, info); } void logExceptionBeforeStart( @@ -796,7 +798,7 @@ void logExceptionBeforeStart( } } - logQueryMetricLogFinish(context, false, elem.client_info.current_query_id, nullptr); + logQueryMetricLogFinish(context, false, elem.client_info.current_query_id, std::chrono::system_clock::now(), nullptr); } void validateAnalyzerSettings(ASTPtr ast, bool context_value) diff --git a/src/Interpreters/loadMetadata.cpp b/src/Interpreters/loadMetadata.cpp index 4f72c29a625..12de96ef03a 100644 --- a/src/Interpreters/loadMetadata.cpp +++ b/src/Interpreters/loadMetadata.cpp @@ -380,7 +380,7 @@ static void convertOrdinaryDatabaseToAtomic(LoggerPtr log, 
ContextMutablePtr con /// Converts database with Ordinary engine to Atomic. Does nothing if database is not Ordinary. /// Can be called only during server startup when there are no queries from users. -static void maybeConvertOrdinaryDatabaseToAtomic(ContextMutablePtr context, const String & database_name, LoadTaskPtrs * startup_tasks = nullptr) +static void maybeConvertOrdinaryDatabaseToAtomic(ContextMutablePtr context, const String & database_name, const LoadTaskPtrs & load_system_metadata_tasks = {}) { LoggerPtr log = getLogger("loadMetadata"); @@ -407,12 +407,8 @@ static void maybeConvertOrdinaryDatabaseToAtomic(ContextMutablePtr context, cons try { - if (startup_tasks) // NOTE: only for system database - { - /// It's not quite correct to run DDL queries while database is not started up. - waitLoad(TablesLoaderForegroundPoolId, *startup_tasks); - startup_tasks->clear(); - } + /// It's not quite correct to run DDL queries while database is not started up. + waitLoad(TablesLoaderForegroundPoolId, load_system_metadata_tasks); auto local_context = Context::createCopy(context); @@ -462,13 +458,7 @@ static void maybeConvertOrdinaryDatabaseToAtomic(ContextMutablePtr context, cons }; TablesLoader loader{context, databases, LoadingStrictnessLevel::FORCE_RESTORE}; waitLoad(TablesLoaderForegroundPoolId, loader.loadTablesAsync()); - - /// Startup tables if they were started before conversion and detach/attach - if (startup_tasks) // NOTE: only for system database - *startup_tasks = loader.startupTablesAsync(); // We have loaded old database(s), replace tasks to startup new database - else - // An old database was already loaded, so we should load new one as well - waitLoad(TablesLoaderForegroundPoolId, loader.startupTablesAsync()); + waitLoad(TablesLoaderForegroundPoolId, loader.startupTablesAsync()); } catch (Exception & e) { @@ -480,13 +470,13 @@ static void maybeConvertOrdinaryDatabaseToAtomic(ContextMutablePtr context, cons } } -void maybeConvertSystemDatabase(ContextMutablePtr context, LoadTaskPtrs & system_startup_tasks) +void maybeConvertSystemDatabase(ContextMutablePtr context, LoadTaskPtrs & load_system_metadata_tasks) { /// TODO remove this check, convert system database unconditionally if (context->getSettingsRef()[Setting::allow_deprecated_database_ordinary]) return; - maybeConvertOrdinaryDatabaseToAtomic(context, DatabaseCatalog::SYSTEM_DATABASE, &system_startup_tasks); + maybeConvertOrdinaryDatabaseToAtomic(context, DatabaseCatalog::SYSTEM_DATABASE, load_system_metadata_tasks); } void convertDatabasesEnginesIfNeed(const LoadTaskPtrs & load_metadata, ContextMutablePtr context) @@ -509,7 +499,7 @@ void convertDatabasesEnginesIfNeed(const LoadTaskPtrs & load_metadata, ContextMu fs::remove(convert_flag_path); } -LoadTaskPtrs loadMetadataSystem(ContextMutablePtr context) +LoadTaskPtrs loadMetadataSystem(ContextMutablePtr context, bool async_load_system_database) { loadSystemDatabaseImpl(context, DatabaseCatalog::SYSTEM_DATABASE, "Atomic"); loadSystemDatabaseImpl(context, DatabaseCatalog::INFORMATION_SCHEMA, "Memory"); @@ -522,11 +512,28 @@ LoadTaskPtrs loadMetadataSystem(ContextMutablePtr context) {DatabaseCatalog::INFORMATION_SCHEMA_UPPERCASE, DatabaseCatalog::instance().getDatabase(DatabaseCatalog::INFORMATION_SCHEMA_UPPERCASE)}, }; TablesLoader loader{context, databases, LoadingStrictnessLevel::FORCE_RESTORE}; - auto tasks = loader.loadTablesAsync(); - waitLoad(TablesLoaderForegroundPoolId, tasks); - /// Will startup tables in system database after all databases are loaded. 
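For illustration, the two shapes of call sites allowed by the defaulted reference parameter above: the first mirrors maybeConvertSystemDatabase() below, the second (for an ordinary user database, where database_name is a placeholder) relies on the empty default so the waitLoad() above has nothing to wait for.

    /// Sketch only: call sites after replacing the LoadTaskPtrs pointer with a defaulted const reference.
    maybeConvertOrdinaryDatabaseToAtomic(context, DatabaseCatalog::SYSTEM_DATABASE, load_system_metadata_tasks);
    maybeConvertOrdinaryDatabaseToAtomic(context, database_name);   /// non-system database, empty default tasks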
- return loader.startupTablesAsync(); + auto load_tasks = loader.loadTablesAsync(); + auto startup_tasks = loader.startupTablesAsync(); + + if (async_load_system_database) + { + scheduleLoad(load_tasks); + scheduleLoad(startup_tasks); + + // Do NOT wait, just return tasks for continuation or later wait. + return joinTasks(load_tasks, startup_tasks); + } + else + { + waitLoad(TablesLoaderForegroundPoolId, load_tasks); + + /// This has to be done before the initialization of system logs `initializeSystemLogs()`, + /// otherwise there is a race condition between the system database initialization + /// and creation of new tables in the database. + waitLoad(TablesLoaderForegroundPoolId, startup_tasks); + return {}; + } } } diff --git a/src/Interpreters/loadMetadata.h b/src/Interpreters/loadMetadata.h index b0d97d53de3..dcf3dad9f96 100644 --- a/src/Interpreters/loadMetadata.h +++ b/src/Interpreters/loadMetadata.h @@ -8,10 +8,10 @@ namespace DB /// Load tables from system database. Only real tables like query_log, part_log. /// You should first load system database, then attach system tables that you need into it, then load other databases. -/// It returns tasks to startup system tables. +/// It returns tasks that are still in progress if `async_load_system_database = true` otherwise it wait for all jobs to be done. /// Background operations in system tables may slowdown loading of the rest tables, /// so we startup system tables after all databases are loaded. -[[nodiscard]] LoadTaskPtrs loadMetadataSystem(ContextMutablePtr context); +[[nodiscard]] LoadTaskPtrs loadMetadataSystem(ContextMutablePtr context, bool async_load_system_database = false); /// Load tables from databases and add them to context. Databases 'system' and 'information_schema' are ignored. /// Use separate function to load system tables. 
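A hedged sketch of how a server startup path can consume the tasks returned by the new asynchronous mode above; global_context stands for the server's mutable context, and the order (load, then conversion, then an explicit wait before the system tables are actually needed) follows the comments in this hunk.

    /// Sketch only: consuming the tasks returned by loadMetadataSystem() in async mode.
    LoadTaskPtrs load_system_metadata_tasks = loadMetadataSystem(global_context, /*async_load_system_database=*/ true);

    /// Conversion of the system database waits for these tasks internally (see the waitLoad() above).
    maybeConvertSystemDatabase(global_context, load_system_metadata_tasks);

    /// Anything that needs fully started system tables (e.g. system log initialization) waits explicitly first.
    waitLoad(TablesLoaderForegroundPoolId, load_system_metadata_tasks);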
@@ -20,7 +20,7 @@ namespace DB [[nodiscard]] LoadTaskPtrs loadMetadata(ContextMutablePtr context, const String & default_database_name = {}, bool async_load_databases = false); /// Converts `system` database from Ordinary to Atomic (if needed) -void maybeConvertSystemDatabase(ContextMutablePtr context, LoadTaskPtrs & system_startup_tasks); +void maybeConvertSystemDatabase(ContextMutablePtr context, LoadTaskPtrs & load_system_metadata_tasks); /// Converts all databases (except system) from Ordinary to Atomic if convert_ordinary_to_atomic flag exists /// Waits for `load_metadata` task before conversions diff --git a/src/Interpreters/registerInterpreters.cpp b/src/Interpreters/registerInterpreters.cpp index 481d0597a85..838b3a669da 100644 --- a/src/Interpreters/registerInterpreters.cpp +++ b/src/Interpreters/registerInterpreters.cpp @@ -52,6 +52,10 @@ void registerInterpreterExternalDDLQuery(InterpreterFactory & factory); void registerInterpreterTransactionControlQuery(InterpreterFactory & factory); void registerInterpreterCreateFunctionQuery(InterpreterFactory & factory); void registerInterpreterDropFunctionQuery(InterpreterFactory & factory); +void registerInterpreterCreateWorkloadQuery(InterpreterFactory & factory); +void registerInterpreterDropWorkloadQuery(InterpreterFactory & factory); +void registerInterpreterCreateResourceQuery(InterpreterFactory & factory); +void registerInterpreterDropResourceQuery(InterpreterFactory & factory); void registerInterpreterCreateIndexQuery(InterpreterFactory & factory); void registerInterpreterCreateNamedCollectionQuery(InterpreterFactory & factory); void registerInterpreterDropIndexQuery(InterpreterFactory & factory); @@ -111,6 +115,10 @@ void registerInterpreters() registerInterpreterTransactionControlQuery(factory); registerInterpreterCreateFunctionQuery(factory); registerInterpreterDropFunctionQuery(factory); + registerInterpreterCreateWorkloadQuery(factory); + registerInterpreterDropWorkloadQuery(factory); + registerInterpreterCreateResourceQuery(factory); + registerInterpreterDropResourceQuery(factory); registerInterpreterCreateIndexQuery(factory); registerInterpreterCreateNamedCollectionQuery(factory); registerInterpreterDropIndexQuery(factory); diff --git a/src/Parsers/ASTCreateResourceQuery.cpp b/src/Parsers/ASTCreateResourceQuery.cpp new file mode 100644 index 00000000000..3e40d76ba1b --- /dev/null +++ b/src/Parsers/ASTCreateResourceQuery.cpp @@ -0,0 +1,83 @@ +#include +#include +#include +#include +#include + +namespace DB +{ + +ASTPtr ASTCreateResourceQuery::clone() const +{ + auto res = std::make_shared(*this); + res->children.clear(); + + res->resource_name = resource_name->clone(); + res->children.push_back(res->resource_name); + + res->operations = operations; + + return res; +} + +void ASTCreateResourceQuery::formatImpl(const IAST::FormatSettings & format, IAST::FormatState &, IAST::FormatStateStacked) const +{ + format.ostr << (format.hilite ? hilite_keyword : "") << "CREATE "; + + if (or_replace) + format.ostr << "OR REPLACE "; + + format.ostr << "RESOURCE "; + + if (if_not_exists) + format.ostr << "IF NOT EXISTS "; + + format.ostr << (format.hilite ? hilite_none : ""); + + format.ostr << (format.hilite ? hilite_identifier : "") << backQuoteIfNeed(getResourceName()) << (format.hilite ? 
hilite_none : ""); + + formatOnCluster(format); + + format.ostr << " ("; + + bool first = true; + for (const auto & operation : operations) + { + if (!first) + format.ostr << ", "; + else + first = false; + + switch (operation.mode) + { + case AccessMode::Read: + { + format.ostr << (format.hilite ? hilite_keyword : "") << "READ "; + break; + } + case AccessMode::Write: + { + format.ostr << (format.hilite ? hilite_keyword : "") << "WRITE "; + break; + } + } + if (operation.disk) + { + format.ostr << "DISK " << (format.hilite ? hilite_none : ""); + format.ostr << (format.hilite ? hilite_identifier : "") << backQuoteIfNeed(*operation.disk) << (format.hilite ? hilite_none : ""); + } + else + format.ostr << "ANY DISK" << (format.hilite ? hilite_none : ""); + } + + format.ostr << ")"; +} + +String ASTCreateResourceQuery::getResourceName() const +{ + String name; + tryGetIdentifierNameInto(resource_name, name); + return name; +} + +} diff --git a/src/Parsers/ASTCreateResourceQuery.h b/src/Parsers/ASTCreateResourceQuery.h new file mode 100644 index 00000000000..51933a375f8 --- /dev/null +++ b/src/Parsers/ASTCreateResourceQuery.h @@ -0,0 +1,48 @@ +#pragma once + +#include +#include + + +namespace DB +{ + +class ASTCreateResourceQuery : public IAST, public ASTQueryWithOnCluster +{ +public: + enum class AccessMode + { + Read, + Write + }; + struct Operation + { + AccessMode mode; + std::optional disk; // Applies to all disks if not set + + friend bool operator ==(const Operation & lhs, const Operation & rhs) { return lhs.mode == rhs.mode && lhs.disk == rhs.disk; } + friend bool operator !=(const Operation & lhs, const Operation & rhs) { return !(lhs == rhs); } + }; + + using Operations = std::vector; + + ASTPtr resource_name; + Operations operations; /// List of operations that require this resource + + bool or_replace = false; + bool if_not_exists = false; + + String getID(char delim) const override { return "CreateResourceQuery" + (delim + getResourceName()); } + + ASTPtr clone() const override; + + void formatImpl(const FormatSettings & format, FormatState & state, FormatStateStacked frame) const override; + + ASTPtr getRewrittenASTWithoutOnCluster(const WithoutOnClusterASTRewriteParams &) const override { return removeOnCluster(clone()); } + + String getResourceName() const; + + QueryKind getQueryKind() const override { return QueryKind::Create; } +}; + +} diff --git a/src/Parsers/ASTCreateWorkloadQuery.cpp b/src/Parsers/ASTCreateWorkloadQuery.cpp new file mode 100644 index 00000000000..972ce733651 --- /dev/null +++ b/src/Parsers/ASTCreateWorkloadQuery.cpp @@ -0,0 +1,95 @@ +#include +#include +#include +#include +#include +#include + +namespace DB +{ + +ASTPtr ASTCreateWorkloadQuery::clone() const +{ + auto res = std::make_shared(*this); + res->children.clear(); + + res->workload_name = workload_name->clone(); + res->children.push_back(res->workload_name); + + if (workload_parent) + { + res->workload_parent = workload_parent->clone(); + res->children.push_back(res->workload_parent); + } + + res->changes = changes; + + return res; +} + +void ASTCreateWorkloadQuery::formatImpl(const IAST::FormatSettings & format, IAST::FormatState &, IAST::FormatStateStacked) const +{ + format.ostr << (format.hilite ? hilite_keyword : "") << "CREATE "; + + if (or_replace) + format.ostr << "OR REPLACE "; + + format.ostr << "WORKLOAD "; + + if (if_not_exists) + format.ostr << "IF NOT EXISTS "; + + format.ostr << (format.hilite ? hilite_none : ""); + + format.ostr << (format.hilite ? 
hilite_identifier : "") << backQuoteIfNeed(getWorkloadName()) << (format.hilite ? hilite_none : ""); + + formatOnCluster(format); + + if (hasParent()) + { + format.ostr << (format.hilite ? hilite_keyword : "") << " IN " << (format.hilite ? hilite_none : ""); + format.ostr << (format.hilite ? hilite_identifier : "") << backQuoteIfNeed(getWorkloadParent()) << (format.hilite ? hilite_none : ""); + } + + if (!changes.empty()) + { + format.ostr << ' ' << (format.hilite ? hilite_keyword : "") << "SETTINGS" << (format.hilite ? hilite_none : "") << ' '; + + bool first = true; + + for (const auto & change : changes) + { + if (!first) + format.ostr << ", "; + else + first = false; + format.ostr << change.name << " = " << applyVisitor(FieldVisitorToString(), change.value); + if (!change.resource.empty()) + { + format.ostr << ' ' << (format.hilite ? hilite_keyword : "") << "FOR" << (format.hilite ? hilite_none : "") << ' '; + format.ostr << (format.hilite ? hilite_identifier : "") << backQuoteIfNeed(change.resource) << (format.hilite ? hilite_none : ""); + } + } + } +} + +String ASTCreateWorkloadQuery::getWorkloadName() const +{ + String name; + tryGetIdentifierNameInto(workload_name, name); + return name; +} + +bool ASTCreateWorkloadQuery::hasParent() const +{ + return workload_parent != nullptr; +} + +String ASTCreateWorkloadQuery::getWorkloadParent() const +{ + String name; + tryGetIdentifierNameInto(workload_parent, name); + return name; +} + +} diff --git a/src/Parsers/ASTCreateWorkloadQuery.h b/src/Parsers/ASTCreateWorkloadQuery.h new file mode 100644 index 00000000000..8a4cecc001e --- /dev/null +++ b/src/Parsers/ASTCreateWorkloadQuery.h @@ -0,0 +1,53 @@ +#pragma once + +#include +#include +#include +#include + +namespace DB +{ + +class ASTCreateWorkloadQuery : public IAST, public ASTQueryWithOnCluster +{ +public: + ASTPtr workload_name; + ASTPtr workload_parent; + + /// Special version of settings that support optional `FOR resource` clause + struct SettingChange + { + String name; + Field value; + String resource; + + SettingChange() = default; + SettingChange(std::string_view name_, const Field & value_, std::string_view resource_) : name(name_), value(value_), resource(resource_) {} + SettingChange(std::string_view name_, Field && value_, std::string_view resource_) : name(name_), value(std::move(value_)), resource(resource_) {} + + friend bool operator ==(const SettingChange & lhs, const SettingChange & rhs) { return (lhs.name == rhs.name) && (lhs.value == rhs.value) && (lhs.resource == rhs.resource); } + friend bool operator !=(const SettingChange & lhs, const SettingChange & rhs) { return !(lhs == rhs); } + }; + + using SettingsChanges = std::vector; + SettingsChanges changes; + + bool or_replace = false; + bool if_not_exists = false; + + String getID(char delim) const override { return "CreateWorkloadQuery" + (delim + getWorkloadName()); } + + ASTPtr clone() const override; + + void formatImpl(const FormatSettings & format, FormatState & state, FormatStateStacked frame) const override; + + ASTPtr getRewrittenASTWithoutOnCluster(const WithoutOnClusterASTRewriteParams &) const override { return removeOnCluster(clone()); } + + String getWorkloadName() const; + bool hasParent() const; + String getWorkloadParent() const; + + QueryKind getQueryKind() const override { return QueryKind::Create; } +}; + +} diff --git a/src/Parsers/ASTDropResourceQuery.cpp b/src/Parsers/ASTDropResourceQuery.cpp new file mode 100644 index 00000000000..753ac4e30e7 --- /dev/null +++ 
b/src/Parsers/ASTDropResourceQuery.cpp @@ -0,0 +1,25 @@ +#include +#include +#include + +namespace DB +{ + +ASTPtr ASTDropResourceQuery::clone() const +{ + return std::make_shared(*this); +} + +void ASTDropResourceQuery::formatImpl(const IAST::FormatSettings & settings, IAST::FormatState &, IAST::FormatStateStacked) const +{ + settings.ostr << (settings.hilite ? hilite_keyword : "") << "DROP RESOURCE "; + + if (if_exists) + settings.ostr << "IF EXISTS "; + + settings.ostr << (settings.hilite ? hilite_none : ""); + settings.ostr << (settings.hilite ? hilite_identifier : "") << backQuoteIfNeed(resource_name) << (settings.hilite ? hilite_none : ""); + formatOnCluster(settings); +} + +} diff --git a/src/Parsers/ASTDropResourceQuery.h b/src/Parsers/ASTDropResourceQuery.h new file mode 100644 index 00000000000..e1534ea454a --- /dev/null +++ b/src/Parsers/ASTDropResourceQuery.h @@ -0,0 +1,28 @@ +#pragma once + +#include +#include + + +namespace DB +{ + +class ASTDropResourceQuery : public IAST, public ASTQueryWithOnCluster +{ +public: + String resource_name; + + bool if_exists = false; + + String getID(char) const override { return "DropResourceQuery"; } + + ASTPtr clone() const override; + + void formatImpl(const FormatSettings & s, FormatState & state, FormatStateStacked frame) const override; + + ASTPtr getRewrittenASTWithoutOnCluster(const WithoutOnClusterASTRewriteParams &) const override { return removeOnCluster(clone()); } + + QueryKind getQueryKind() const override { return QueryKind::Drop; } +}; + +} diff --git a/src/Parsers/ASTDropWorkloadQuery.cpp b/src/Parsers/ASTDropWorkloadQuery.cpp new file mode 100644 index 00000000000..3192223c4b3 --- /dev/null +++ b/src/Parsers/ASTDropWorkloadQuery.cpp @@ -0,0 +1,25 @@ +#include +#include +#include + +namespace DB +{ + +ASTPtr ASTDropWorkloadQuery::clone() const +{ + return std::make_shared(*this); +} + +void ASTDropWorkloadQuery::formatImpl(const IAST::FormatSettings & settings, IAST::FormatState &, IAST::FormatStateStacked) const +{ + settings.ostr << (settings.hilite ? hilite_keyword : "") << "DROP WORKLOAD "; + + if (if_exists) + settings.ostr << "IF EXISTS "; + + settings.ostr << (settings.hilite ? hilite_none : ""); + settings.ostr << (settings.hilite ? hilite_identifier : "") << backQuoteIfNeed(workload_name) << (settings.hilite ? 
hilite_none : ""); + formatOnCluster(settings); +} + +} diff --git a/src/Parsers/ASTDropWorkloadQuery.h b/src/Parsers/ASTDropWorkloadQuery.h new file mode 100644 index 00000000000..99c3a011447 --- /dev/null +++ b/src/Parsers/ASTDropWorkloadQuery.h @@ -0,0 +1,28 @@ +#pragma once + +#include +#include + + +namespace DB +{ + +class ASTDropWorkloadQuery : public IAST, public ASTQueryWithOnCluster +{ +public: + String workload_name; + + bool if_exists = false; + + String getID(char) const override { return "DropWorkloadQuery"; } + + ASTPtr clone() const override; + + void formatImpl(const FormatSettings & s, FormatState & state, FormatStateStacked frame) const override; + + ASTPtr getRewrittenASTWithoutOnCluster(const WithoutOnClusterASTRewriteParams &) const override { return removeOnCluster(clone()); } + + QueryKind getQueryKind() const override { return QueryKind::Drop; } +}; + +} diff --git a/src/Parsers/ASTSystemQuery.cpp b/src/Parsers/ASTSystemQuery.cpp index b5e5e0f208d..d76d33ce708 100644 --- a/src/Parsers/ASTSystemQuery.cpp +++ b/src/Parsers/ASTSystemQuery.cpp @@ -191,6 +191,7 @@ void ASTSystemQuery::formatImpl(const FormatSettings & settings, FormatState & s case Type::SYNC_REPLICA: case Type::WAIT_LOADING_PARTS: case Type::FLUSH_DISTRIBUTED: + case Type::PREWARM_MARK_CACHE: { if (table) { diff --git a/src/Parsers/ASTSystemQuery.h b/src/Parsers/ASTSystemQuery.h index d9f5b425182..d9ee4d8aa22 100644 --- a/src/Parsers/ASTSystemQuery.h +++ b/src/Parsers/ASTSystemQuery.h @@ -23,6 +23,7 @@ public: SUSPEND, DROP_DNS_CACHE, DROP_CONNECTIONS_CACHE, + PREWARM_MARK_CACHE, DROP_MARK_CACHE, DROP_UNCOMPRESSED_CACHE, DROP_INDEX_MARK_CACHE, diff --git a/src/Parsers/Access/ASTAuthenticationData.cpp b/src/Parsers/Access/ASTAuthenticationData.cpp index 7a1091d8a1a..c7a6429f6aa 100644 --- a/src/Parsers/Access/ASTAuthenticationData.cpp +++ b/src/Parsers/Access/ASTAuthenticationData.cpp @@ -14,6 +14,15 @@ namespace ErrorCodes extern const int LOGICAL_ERROR; } +namespace +{ + void formatValidUntil(const IAST & valid_until, const IAST::FormatSettings & settings) + { + settings.ostr << (settings.hilite ? IAST::hilite_keyword : "") << " VALID UNTIL " << (settings.hilite ? IAST::hilite_none : ""); + valid_until.format(settings); + } +} + std::optional ASTAuthenticationData::getPassword() const { if (contains_password) @@ -46,6 +55,12 @@ void ASTAuthenticationData::formatImpl(const FormatSettings & settings, FormatSt { settings.ostr << (settings.hilite ? IAST::hilite_keyword : "") << " no_password" << (settings.hilite ? 
IAST::hilite_none : ""); + + if (valid_until) + { + formatValidUntil(*valid_until, settings); + } + return; } @@ -205,6 +220,11 @@ void ASTAuthenticationData::formatImpl(const FormatSettings & settings, FormatSt children[1]->format(settings); } + if (valid_until) + { + formatValidUntil(*valid_until, settings); + } + } bool ASTAuthenticationData::hasSecretParts() const diff --git a/src/Parsers/Access/ASTAuthenticationData.h b/src/Parsers/Access/ASTAuthenticationData.h index 7f0644b3437..24c4c015efd 100644 --- a/src/Parsers/Access/ASTAuthenticationData.h +++ b/src/Parsers/Access/ASTAuthenticationData.h @@ -41,6 +41,7 @@ public: bool contains_password = false; bool contains_hash = false; + ASTPtr valid_until; protected: void formatImpl(const FormatSettings & settings, FormatState &, FormatStateStacked) const override; diff --git a/src/Parsers/Access/ASTCreateUserQuery.cpp b/src/Parsers/Access/ASTCreateUserQuery.cpp index ec48c32b684..eb4503acf82 100644 --- a/src/Parsers/Access/ASTCreateUserQuery.cpp +++ b/src/Parsers/Access/ASTCreateUserQuery.cpp @@ -260,8 +260,10 @@ void ASTCreateUserQuery::formatImpl(const FormatSettings & format, FormatState & formatAuthenticationData(authentication_methods, format); } - if (valid_until) - formatValidUntil(*valid_until, format); + if (global_valid_until) + { + formatValidUntil(*global_valid_until, format); + } if (hosts) formatHosts(nullptr, *hosts, format); diff --git a/src/Parsers/Access/ASTCreateUserQuery.h b/src/Parsers/Access/ASTCreateUserQuery.h index e1bae98f2f3..8926c7cad44 100644 --- a/src/Parsers/Access/ASTCreateUserQuery.h +++ b/src/Parsers/Access/ASTCreateUserQuery.h @@ -62,7 +62,7 @@ public: std::shared_ptr default_database; - ASTPtr valid_until; + ASTPtr global_valid_until; String getID(char) const override; ASTPtr clone() const override; diff --git a/src/Parsers/Access/ParserCreateUserQuery.cpp b/src/Parsers/Access/ParserCreateUserQuery.cpp index 8bfc84a28a6..657302574c2 100644 --- a/src/Parsers/Access/ParserCreateUserQuery.cpp +++ b/src/Parsers/Access/ParserCreateUserQuery.cpp @@ -43,6 +43,19 @@ namespace }); } + bool parseValidUntil(IParserBase::Pos & pos, Expected & expected, ASTPtr & valid_until) + { + return IParserBase::wrapParseImpl(pos, [&] + { + if (!ParserKeyword{Keyword::VALID_UNTIL}.ignore(pos, expected)) + return false; + + ParserStringAndSubstitution until_p; + + return until_p.parse(pos, valid_until, expected); + }); + } + bool parseAuthenticationData( IParserBase::Pos & pos, Expected & expected, @@ -223,6 +236,8 @@ namespace if (http_auth_scheme) auth_data->children.push_back(std::move(http_auth_scheme)); + parseValidUntil(pos, expected, auth_data->valid_until); + return true; }); } @@ -283,6 +298,8 @@ namespace authentication_methods.emplace_back(std::make_shared()); authentication_methods.back()->type = AuthenticationType::NO_PASSWORD; + parseValidUntil(pos, expected, authentication_methods.back()->valid_until); + return true; } @@ -471,19 +488,6 @@ namespace }); } - bool parseValidUntil(IParserBase::Pos & pos, Expected & expected, ASTPtr & valid_until) - { - return IParserBase::wrapParseImpl(pos, [&] - { - if (!ParserKeyword{Keyword::VALID_UNTIL}.ignore(pos, expected)) - return false; - - ParserStringAndSubstitution until_p; - - return until_p.parse(pos, valid_until, expected); - }); - } - bool parseAddIdentifiedWith(IParserBase::Pos & pos, Expected & expected, std::vector> & auth_data) { return IParserBase::wrapParseImpl(pos, [&] @@ -554,7 +558,7 @@ bool ParserCreateUserQuery::parseImpl(Pos & pos, ASTPtr & node, Expected 
& expec std::shared_ptr settings; std::shared_ptr grantees; std::shared_ptr default_database; - ASTPtr valid_until; + ASTPtr global_valid_until; String cluster; String storage_name; bool reset_authentication_methods_to_new = false; @@ -568,20 +572,27 @@ bool ParserCreateUserQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expec { parsed_identified_with = parseIdentifiedOrNotIdentified(pos, expected, auth_data); - if (!parsed_identified_with && alter) + if (parsed_identified_with) + { + continue; + } + else if (alter) { parsed_add_identified_with = parseAddIdentifiedWith(pos, expected, auth_data); + if (parsed_add_identified_with) + { + continue; + } } } if (!reset_authentication_methods_to_new && alter && auth_data.empty()) { reset_authentication_methods_to_new = parseResetAuthenticationMethods(pos, expected); - } - - if (!valid_until) - { - parseValidUntil(pos, expected, valid_until); + if (reset_authentication_methods_to_new) + { + continue; + } } AllowedClientHosts new_hosts; @@ -640,6 +651,14 @@ bool ParserCreateUserQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expec if (storage_name.empty() && ParserKeyword{Keyword::IN}.ignore(pos, expected) && parseAccessStorageName(pos, expected, storage_name)) continue; + if (auth_data.empty() && !global_valid_until) + { + if (parseValidUntil(pos, expected, global_valid_until)) + { + continue; + } + } + break; } @@ -674,7 +693,7 @@ bool ParserCreateUserQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expec query->settings = std::move(settings); query->grantees = std::move(grantees); query->default_database = std::move(default_database); - query->valid_until = std::move(valid_until); + query->global_valid_until = std::move(global_valid_until); query->storage_name = std::move(storage_name); query->reset_authentication_methods_to_new = reset_authentication_methods_to_new; query->add_identified_with = parsed_add_identified_with; @@ -685,8 +704,8 @@ bool ParserCreateUserQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expec query->children.push_back(authentication_method); } - if (query->valid_until) - query->children.push_back(query->valid_until); + if (query->global_valid_until) + query->children.push_back(query->global_valid_until); return true; } diff --git a/src/Parsers/Access/ParserGrantQuery.cpp b/src/Parsers/Access/ParserGrantQuery.cpp index e29cf11273b..4a0d24559a3 100644 --- a/src/Parsers/Access/ParserGrantQuery.cpp +++ b/src/Parsers/Access/ParserGrantQuery.cpp @@ -155,6 +155,9 @@ namespace for (auto & [access_flags, columns] : access_and_columns) { + if (wildcard && !columns.empty()) + return false; + AccessRightsElement element; element.access_flags = access_flags; element.columns = std::move(columns); diff --git a/src/Parsers/CommonParsers.h b/src/Parsers/CommonParsers.h index 83b7eb71d64..dd0ba91d428 100644 --- a/src/Parsers/CommonParsers.h +++ b/src/Parsers/CommonParsers.h @@ -392,6 +392,7 @@ namespace DB MR_MACROS(RANDOMIZE_FOR, "RANDOMIZE FOR") \ MR_MACROS(RANDOMIZED, "RANDOMIZED") \ MR_MACROS(RANGE, "RANGE") \ + MR_MACROS(READ, "READ") \ MR_MACROS(READONLY, "READONLY") \ MR_MACROS(REALM, "REALM") \ MR_MACROS(RECOMPRESS, "RECOMPRESS") \ @@ -411,6 +412,7 @@ namespace DB MR_MACROS(REPLACE, "REPLACE") \ MR_MACROS(RESET_SETTING, "RESET SETTING") \ MR_MACROS(RESET_AUTHENTICATION_METHODS_TO_NEW, "RESET AUTHENTICATION METHODS TO NEW") \ + MR_MACROS(RESOURCE, "RESOURCE") \ MR_MACROS(RESPECT_NULLS, "RESPECT NULLS") \ MR_MACROS(RESTORE, "RESTORE") \ MR_MACROS(RESTRICT, "RESTRICT") \ @@ -523,6 +525,7 @@ namespace DB 
MR_MACROS(WHEN, "WHEN") \ MR_MACROS(WHERE, "WHERE") \ MR_MACROS(WINDOW, "WINDOW") \ + MR_MACROS(WORKLOAD, "WORKLOAD") \ MR_MACROS(QUALIFY, "QUALIFY") \ MR_MACROS(WITH_ADMIN_OPTION, "WITH ADMIN OPTION") \ MR_MACROS(WITH_CHECK, "WITH CHECK") \ @@ -535,6 +538,7 @@ namespace DB MR_MACROS(WITH, "WITH") \ MR_MACROS(RECURSIVE, "RECURSIVE") \ MR_MACROS(WK, "WK") \ + MR_MACROS(WRITE, "WRITE") \ MR_MACROS(WRITABLE, "WRITABLE") \ MR_MACROS(WW, "WW") \ MR_MACROS(YEAR, "YEAR") \ diff --git a/src/Parsers/ParserCreateResourceQuery.cpp b/src/Parsers/ParserCreateResourceQuery.cpp new file mode 100644 index 00000000000..68c157df175 --- /dev/null +++ b/src/Parsers/ParserCreateResourceQuery.cpp @@ -0,0 +1,144 @@ +#include + +#include +#include +#include +#include +#include + + +namespace DB +{ + +namespace +{ + +bool parseOneOperation(ASTCreateResourceQuery::Operation & operation, IParser::Pos & pos, Expected & expected) +{ + ParserIdentifier disk_name_p; + + ASTCreateResourceQuery::AccessMode mode; + ASTPtr node; + std::optional disk; + + if (ParserKeyword(Keyword::WRITE).ignore(pos, expected)) + mode = ASTCreateResourceQuery::AccessMode::Write; + else if (ParserKeyword(Keyword::READ).ignore(pos, expected)) + mode = ASTCreateResourceQuery::AccessMode::Read; + else + return false; + + if (ParserKeyword(Keyword::ANY).ignore(pos, expected)) + { + if (!ParserKeyword(Keyword::DISK).ignore(pos, expected)) + return false; + } + else + { + if (!ParserKeyword(Keyword::DISK).ignore(pos, expected)) + return false; + + if (!disk_name_p.parse(pos, node, expected)) + return false; + + disk.emplace(); + if (!tryGetIdentifierNameInto(node, *disk)) + return false; + } + + operation.mode = mode; + operation.disk = std::move(disk); + + return true; +} + +bool parseOperations(IParser::Pos & pos, Expected & expected, ASTCreateResourceQuery::Operations & operations) +{ + return IParserBase::wrapParseImpl(pos, [&] + { + ParserToken s_open(TokenType::OpeningRoundBracket); + ParserToken s_close(TokenType::ClosingRoundBracket); + + if (!s_open.ignore(pos, expected)) + return false; + + ASTCreateResourceQuery::Operations res_operations; + + auto parse_operation = [&] + { + ASTCreateResourceQuery::Operation operation; + if (!parseOneOperation(operation, pos, expected)) + return false; + res_operations.push_back(std::move(operation)); + return true; + }; + + if (!ParserList::parseUtil(pos, expected, parse_operation, false)) + return false; + + if (!s_close.ignore(pos, expected)) + return false; + + operations = std::move(res_operations); + return true; + }); +} + +} + +bool ParserCreateResourceQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected & expected) +{ + ParserKeyword s_create(Keyword::CREATE); + ParserKeyword s_resource(Keyword::RESOURCE); + ParserKeyword s_or_replace(Keyword::OR_REPLACE); + ParserKeyword s_if_not_exists(Keyword::IF_NOT_EXISTS); + ParserKeyword s_on(Keyword::ON); + ParserIdentifier resource_name_p; + + ASTPtr resource_name; + + String cluster_str; + bool or_replace = false; + bool if_not_exists = false; + + if (!s_create.ignore(pos, expected)) + return false; + + if (s_or_replace.ignore(pos, expected)) + or_replace = true; + + if (!s_resource.ignore(pos, expected)) + return false; + + if (!or_replace && s_if_not_exists.ignore(pos, expected)) + if_not_exists = true; + + if (!resource_name_p.parse(pos, resource_name, expected)) + return false; + + if (s_on.ignore(pos, expected)) + { + if (!ASTQueryWithOnCluster::parse(pos, cluster_str, expected)) + return false; + } + + ASTCreateResourceQuery::Operations 
operations; + if (!parseOperations(pos, expected, operations)) + return false; + + auto create_resource_query = std::make_shared(); + node = create_resource_query; + + create_resource_query->resource_name = resource_name; + create_resource_query->children.push_back(resource_name); + + create_resource_query->or_replace = or_replace; + create_resource_query->if_not_exists = if_not_exists; + create_resource_query->cluster = std::move(cluster_str); + + create_resource_query->operations = std::move(operations); + + return true; +} + +} diff --git a/src/Parsers/ParserCreateResourceQuery.h b/src/Parsers/ParserCreateResourceQuery.h new file mode 100644 index 00000000000..1b7c9fc4a7f --- /dev/null +++ b/src/Parsers/ParserCreateResourceQuery.h @@ -0,0 +1,16 @@ +#pragma once + +#include "IParserBase.h" + +namespace DB +{ + +/// CREATE RESOURCE cache_io (WRITE DISK s3diskWithCache, READ DISK s3diskWithCache) +class ParserCreateResourceQuery : public IParserBase +{ +protected: + const char * getName() const override { return "CREATE RESOURCE query"; } + bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override; +}; + +} diff --git a/src/Parsers/ParserCreateWorkloadEntity.cpp b/src/Parsers/ParserCreateWorkloadEntity.cpp new file mode 100644 index 00000000000..013210a6d87 --- /dev/null +++ b/src/Parsers/ParserCreateWorkloadEntity.cpp @@ -0,0 +1,16 @@ +#include +#include +#include + +namespace DB +{ + +bool ParserCreateWorkloadEntity::parseImpl(Pos & pos, ASTPtr & node, Expected & expected) +{ + ParserCreateWorkloadQuery create_workload_p; + ParserCreateResourceQuery create_resource_p; + + return create_workload_p.parse(pos, node, expected) || create_resource_p.parse(pos, node, expected); +} + +} diff --git a/src/Parsers/ParserCreateWorkloadEntity.h b/src/Parsers/ParserCreateWorkloadEntity.h new file mode 100644 index 00000000000..1e7b78b3ccc --- /dev/null +++ b/src/Parsers/ParserCreateWorkloadEntity.h @@ -0,0 +1,17 @@ +#pragma once + +#include + +namespace DB +{ + +/// Special parser for the CREATE WORKLOAD and CREATE RESOURCE queries. 
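A few statements this combined parser is meant to accept, gathered here for quick reference; the CREATE examples are copied verbatim from ParserCreateResourceQuery.h and ParserCreateWorkloadQuery.h in this patch, and the DROP forms are handled by the separate parsers added further below.

/// Examples:
///     CREATE RESOURCE cache_io (WRITE DISK s3diskWithCache, READ DISK s3diskWithCache)
///     CREATE WORKLOAD production IN all SETTINGS weight = 3, max_speed = '1G' FOR network_read, max_speed = '2G' FOR network_write
/// Related statements parsed elsewhere (ParserDropResourceQuery / ParserDropWorkloadQuery):
///     DROP RESOURCE resource1
///     DROP WORKLOAD workload1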
+class ParserCreateWorkloadEntity : public IParserBase +{ +protected: + const char * getName() const override { return "CREATE workload entity query"; } + + bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override; +}; + +} diff --git a/src/Parsers/ParserCreateWorkloadQuery.cpp b/src/Parsers/ParserCreateWorkloadQuery.cpp new file mode 100644 index 00000000000..9caf474741c --- /dev/null +++ b/src/Parsers/ParserCreateWorkloadQuery.cpp @@ -0,0 +1,155 @@ +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +namespace DB +{ + +namespace +{ + +bool parseWorkloadSetting( + ASTCreateWorkloadQuery::SettingChange & change, IParser::Pos & pos, Expected & expected) +{ + ParserIdentifier name_p; + ParserLiteral value_p; + ParserToken s_eq(TokenType::Equals); + ParserIdentifier resource_name_p; + + ASTPtr name_node; + ASTPtr value_node; + ASTPtr resource_name_node; + + String name; + String resource_name; + + if (!name_p.parse(pos, name_node, expected)) + return false; + tryGetIdentifierNameInto(name_node, name); + + if (!s_eq.ignore(pos, expected)) + return false; + + if (!value_p.parse(pos, value_node, expected)) + return false; + + if (ParserKeyword(Keyword::FOR).ignore(pos, expected)) + { + if (!resource_name_p.parse(pos, resource_name_node, expected)) + return false; + tryGetIdentifierNameInto(resource_name_node, resource_name); + } + + change.name = std::move(name); + change.value = value_node->as().value; + change.resource = std::move(resource_name); + + return true; +} + +bool parseSettings(IParser::Pos & pos, Expected & expected, ASTCreateWorkloadQuery::SettingsChanges & changes) +{ + return IParserBase::wrapParseImpl(pos, [&] + { + if (!ParserKeyword(Keyword::SETTINGS).ignore(pos, expected)) + return false; + + ASTCreateWorkloadQuery::SettingsChanges res_changes; + + auto parse_setting = [&] + { + ASTCreateWorkloadQuery::SettingChange change; + if (!parseWorkloadSetting(change, pos, expected)) + return false; + res_changes.push_back(std::move(change)); + return true; + }; + + if (!ParserList::parseUtil(pos, expected, parse_setting, false)) + return false; + + changes = std::move(res_changes); + return true; + }); +} + +} + +bool ParserCreateWorkloadQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected & expected) +{ + ParserKeyword s_create(Keyword::CREATE); + ParserKeyword s_workload(Keyword::WORKLOAD); + ParserKeyword s_or_replace(Keyword::OR_REPLACE); + ParserKeyword s_if_not_exists(Keyword::IF_NOT_EXISTS); + ParserIdentifier workload_name_p; + ParserKeyword s_on(Keyword::ON); + ParserKeyword s_in(Keyword::IN); + + ASTPtr workload_name; + ASTPtr workload_parent; + + String cluster_str; + bool or_replace = false; + bool if_not_exists = false; + + if (!s_create.ignore(pos, expected)) + return false; + + if (s_or_replace.ignore(pos, expected)) + or_replace = true; + + if (!s_workload.ignore(pos, expected)) + return false; + + if (!or_replace && s_if_not_exists.ignore(pos, expected)) + if_not_exists = true; + + if (!workload_name_p.parse(pos, workload_name, expected)) + return false; + + if (s_on.ignore(pos, expected)) + { + if (!ASTQueryWithOnCluster::parse(pos, cluster_str, expected)) + return false; + } + + if (s_in.ignore(pos, expected)) + { + if (!workload_name_p.parse(pos, workload_parent, expected)) + return false; + } + + ASTCreateWorkloadQuery::SettingsChanges changes; + parseSettings(pos, expected, changes); + + auto create_workload_query = std::make_shared(); + node = create_workload_query; + + 
create_workload_query->workload_name = workload_name; + create_workload_query->children.push_back(workload_name); + + if (workload_parent) + { + create_workload_query->workload_parent = workload_parent; + create_workload_query->children.push_back(workload_parent); + } + + create_workload_query->or_replace = or_replace; + create_workload_query->if_not_exists = if_not_exists; + create_workload_query->cluster = std::move(cluster_str); + create_workload_query->changes = std::move(changes); + + + return true; +} + +} diff --git a/src/Parsers/ParserCreateWorkloadQuery.h b/src/Parsers/ParserCreateWorkloadQuery.h new file mode 100644 index 00000000000..62c89affeda --- /dev/null +++ b/src/Parsers/ParserCreateWorkloadQuery.h @@ -0,0 +1,16 @@ +#pragma once + +#include "IParserBase.h" + +namespace DB +{ + +/// CREATE WORKLOAD production IN all SETTINGS weight = 3, max_speed = '1G' FOR network_read, max_speed = '2G' FOR network_write +class ParserCreateWorkloadQuery : public IParserBase +{ +protected: + const char * getName() const override { return "CREATE WORKLOAD query"; } + bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override; +}; + +} diff --git a/src/Parsers/ParserDropResourceQuery.cpp b/src/Parsers/ParserDropResourceQuery.cpp new file mode 100644 index 00000000000..6c078281828 --- /dev/null +++ b/src/Parsers/ParserDropResourceQuery.cpp @@ -0,0 +1,52 @@ +#include +#include +#include +#include +#include + +namespace DB +{ + +bool ParserDropResourceQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected & expected) +{ + ParserKeyword s_drop(Keyword::DROP); + ParserKeyword s_resource(Keyword::RESOURCE); + ParserKeyword s_if_exists(Keyword::IF_EXISTS); + ParserKeyword s_on(Keyword::ON); + ParserIdentifier resource_name_p; + + String cluster_str; + bool if_exists = false; + + ASTPtr resource_name; + + if (!s_drop.ignore(pos, expected)) + return false; + + if (!s_resource.ignore(pos, expected)) + return false; + + if (s_if_exists.ignore(pos, expected)) + if_exists = true; + + if (!resource_name_p.parse(pos, resource_name, expected)) + return false; + + if (s_on.ignore(pos, expected)) + { + if (!ASTQueryWithOnCluster::parse(pos, cluster_str, expected)) + return false; + } + + auto drop_resource_query = std::make_shared(); + drop_resource_query->if_exists = if_exists; + drop_resource_query->cluster = std::move(cluster_str); + + node = drop_resource_query; + + drop_resource_query->resource_name = resource_name->as().name(); + + return true; +} + +} diff --git a/src/Parsers/ParserDropResourceQuery.h b/src/Parsers/ParserDropResourceQuery.h new file mode 100644 index 00000000000..651603d1e90 --- /dev/null +++ b/src/Parsers/ParserDropResourceQuery.h @@ -0,0 +1,14 @@ +#pragma once + +#include "IParserBase.h" + +namespace DB +{ +/// DROP RESOURCE resource1 +class ParserDropResourceQuery : public IParserBase +{ +protected: + const char * getName() const override { return "DROP RESOURCE query"; } + bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override; +}; +} diff --git a/src/Parsers/ParserDropWorkloadQuery.cpp b/src/Parsers/ParserDropWorkloadQuery.cpp new file mode 100644 index 00000000000..edc82c8f30a --- /dev/null +++ b/src/Parsers/ParserDropWorkloadQuery.cpp @@ -0,0 +1,52 @@ +#include +#include +#include +#include +#include + +namespace DB +{ + +bool ParserDropWorkloadQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected & expected) +{ + ParserKeyword s_drop(Keyword::DROP); + ParserKeyword s_workload(Keyword::WORKLOAD); + ParserKeyword 
s_if_exists(Keyword::IF_EXISTS); + ParserKeyword s_on(Keyword::ON); + ParserIdentifier workload_name_p; + + String cluster_str; + bool if_exists = false; + + ASTPtr workload_name; + + if (!s_drop.ignore(pos, expected)) + return false; + + if (!s_workload.ignore(pos, expected)) + return false; + + if (s_if_exists.ignore(pos, expected)) + if_exists = true; + + if (!workload_name_p.parse(pos, workload_name, expected)) + return false; + + if (s_on.ignore(pos, expected)) + { + if (!ASTQueryWithOnCluster::parse(pos, cluster_str, expected)) + return false; + } + + auto drop_workload_query = std::make_shared(); + drop_workload_query->if_exists = if_exists; + drop_workload_query->cluster = std::move(cluster_str); + + node = drop_workload_query; + + drop_workload_query->workload_name = workload_name->as().name(); + + return true; +} + +} diff --git a/src/Parsers/ParserDropWorkloadQuery.h b/src/Parsers/ParserDropWorkloadQuery.h new file mode 100644 index 00000000000..af060caf303 --- /dev/null +++ b/src/Parsers/ParserDropWorkloadQuery.h @@ -0,0 +1,14 @@ +#pragma once + +#include "IParserBase.h" + +namespace DB +{ +/// DROP WORKLOAD workload1 +class ParserDropWorkloadQuery : public IParserBase +{ +protected: + const char * getName() const override { return "DROP WORKLOAD query"; } + bool parseImpl(Pos & pos, ASTPtr & node, Expected & expected) override; +}; +} diff --git a/src/Parsers/ParserQuery.cpp b/src/Parsers/ParserQuery.cpp index d5645298ecf..4ed6e4267f4 100644 --- a/src/Parsers/ParserQuery.cpp +++ b/src/Parsers/ParserQuery.cpp @@ -1,8 +1,12 @@ #include #include +#include +#include #include #include #include +#include +#include #include #include #include @@ -51,6 +55,10 @@ bool ParserQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected) ParserCreateSettingsProfileQuery create_settings_profile_p; ParserCreateFunctionQuery create_function_p; ParserDropFunctionQuery drop_function_p; + ParserCreateWorkloadQuery create_workload_p; + ParserDropWorkloadQuery drop_workload_p; + ParserCreateResourceQuery create_resource_p; + ParserDropResourceQuery drop_resource_p; ParserCreateNamedCollectionQuery create_named_collection_p; ParserDropNamedCollectionQuery drop_named_collection_p; ParserAlterNamedCollectionQuery alter_named_collection_p; @@ -82,6 +90,10 @@ bool ParserQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected) || create_settings_profile_p.parse(pos, node, expected) || create_function_p.parse(pos, node, expected) || drop_function_p.parse(pos, node, expected) + || create_workload_p.parse(pos, node, expected) + || drop_workload_p.parse(pos, node, expected) + || create_resource_p.parse(pos, node, expected) + || drop_resource_p.parse(pos, node, expected) || create_named_collection_p.parse(pos, node, expected) || drop_named_collection_p.parse(pos, node, expected) || alter_named_collection_p.parse(pos, node, expected) diff --git a/src/Parsers/ParserSystemQuery.cpp b/src/Parsers/ParserSystemQuery.cpp index af84dd10bfa..453ae0b5032 100644 --- a/src/Parsers/ParserSystemQuery.cpp +++ b/src/Parsers/ParserSystemQuery.cpp @@ -276,6 +276,7 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected & case Type::RESTART_REPLICA: case Type::SYNC_REPLICA: case Type::WAIT_LOADING_PARTS: + case Type::PREWARM_MARK_CACHE: { if (!parseQueryWithOnCluster(res, pos, expected)) return false; diff --git a/src/Parsers/parseIdentifierOrStringLiteral.cpp b/src/Parsers/parseIdentifierOrStringLiteral.cpp index bb93145772a..71fe071ec03 100644 --- 
a/src/Parsers/parseIdentifierOrStringLiteral.cpp +++ b/src/Parsers/parseIdentifierOrStringLiteral.cpp @@ -6,11 +6,24 @@ #include #include #include +#include namespace DB { +namespace ErrorCodes +{ + extern const int CANNOT_PARSE_TEXT; +} + +namespace Setting +{ + extern const SettingsUInt64 max_query_size; + extern const SettingsUInt64 max_parser_depth; + extern const SettingsUInt64 max_parser_backtracks; +} + bool parseIdentifierOrStringLiteral(IParser::Pos & pos, Expected & expected, String & result) { return IParserBase::wrapParseImpl(pos, [&] @@ -54,4 +67,18 @@ bool parseIdentifiersOrStringLiterals(IParser::Pos & pos, Expected & expected, S return true; } +std::vector parseIdentifiersOrStringLiterals(const String & str, const Settings & settings) +{ + Tokens tokens(str.data(), str.data() + str.size(), settings[Setting::max_query_size]); + IParser::Pos pos(tokens, static_cast(settings[Setting::max_parser_depth]), static_cast(settings[Setting::max_parser_backtracks])); + + Expected expected; + std::vector res; + + if (!parseIdentifiersOrStringLiterals(pos, expected, res)) + throw Exception(ErrorCodes::CANNOT_PARSE_TEXT, "Cannot parse string ('{}') into vector of identifiers", str); + + return res; +} + } diff --git a/src/Parsers/parseIdentifierOrStringLiteral.h b/src/Parsers/parseIdentifierOrStringLiteral.h index b450ce8f2f0..867962d1a57 100644 --- a/src/Parsers/parseIdentifierOrStringLiteral.h +++ b/src/Parsers/parseIdentifierOrStringLiteral.h @@ -7,6 +7,8 @@ namespace DB { +struct Settings; + /** Parses a name of an object which could be written in the following forms: * name / `name` / "name" (identifier) or 'name'. * Note that empty strings are not allowed. @@ -16,4 +18,7 @@ bool parseIdentifierOrStringLiteral(IParser::Pos & pos, Expected & expected, Str /** Parse a list of identifiers or string literals. */ bool parseIdentifiersOrStringLiterals(IParser::Pos & pos, Expected & expected, Strings & result); +/** Parse a list of identifiers or string literals into vector of strings. 
*/ +std::vector parseIdentifiersOrStringLiterals(const String & str, const Settings & settings); + } diff --git a/src/Processors/Executors/StreamingFormatExecutor.cpp b/src/Processors/Executors/StreamingFormatExecutor.cpp index 10a7b7fd7f5..2d4b87e9f4d 100644 --- a/src/Processors/Executors/StreamingFormatExecutor.cpp +++ b/src/Processors/Executors/StreamingFormatExecutor.cpp @@ -22,8 +22,12 @@ StreamingFormatExecutor::StreamingFormatExecutor( , adding_defaults_transform(std::move(adding_defaults_transform_)) , port(format->getPort().getHeader(), format.get()) , result_columns(header.cloneEmptyColumns()) + , checkpoints(result_columns.size()) { connect(format->getPort(), port); + + for (size_t i = 0; i < result_columns.size(); ++i) + checkpoints[i] = result_columns[i]->getCheckpoint(); } MutableColumns StreamingFormatExecutor::getResultColumns() @@ -53,6 +57,9 @@ size_t StreamingFormatExecutor::execute(ReadBuffer & buffer) size_t StreamingFormatExecutor::execute() { + for (size_t i = 0; i < result_columns.size(); ++i) + result_columns[i]->updateCheckpoint(*checkpoints[i]); + try { size_t new_rows = 0; @@ -85,19 +92,19 @@ size_t StreamingFormatExecutor::execute() catch (Exception & e) { format->resetParser(); - return on_error(result_columns, e); + return on_error(result_columns, checkpoints, e); } catch (std::exception & e) { format->resetParser(); auto exception = Exception(Exception::CreateFromSTDTag{}, e); - return on_error(result_columns, exception); + return on_error(result_columns, checkpoints, exception); } catch (...) { format->resetParser(); - auto exception = Exception(ErrorCodes::UNKNOWN_EXCEPTION, "Unknowk exception while executing StreamingFormatExecutor with format {}", format->getName()); - return on_error(result_columns, exception); + auto exception = Exception(ErrorCodes::UNKNOWN_EXCEPTION, "Unknown exception while executing StreamingFormatExecutor with format {}", format->getName()); + return on_error(result_columns, checkpoints, exception); } } diff --git a/src/Processors/Executors/StreamingFormatExecutor.h b/src/Processors/Executors/StreamingFormatExecutor.h index f159178df8c..3db5a92ae98 100644 --- a/src/Processors/Executors/StreamingFormatExecutor.h +++ b/src/Processors/Executors/StreamingFormatExecutor.h @@ -19,12 +19,12 @@ public: /// and exception to rethrow it or add context to it. /// Should return number of new rows, which are added in callback /// to result columns in comparison to previous call of `execute`. - using ErrorCallback = std::function; + using ErrorCallback = std::function; StreamingFormatExecutor( const Block & header_, InputFormatPtr format_, - ErrorCallback on_error_ = [](const MutableColumns &, Exception & e) -> size_t { throw std::move(e); }, + ErrorCallback on_error_ = [](const MutableColumns &, const ColumnCheckpoints, Exception & e) -> size_t { throw std::move(e); }, SimpleTransformPtr adding_defaults_transform_ = nullptr); /// Returns numbers of new read rows. 
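For illustration, here is a minimal sketch of a callback matching the new checkpoint-aware ErrorCallback signature, following the rollback-and-insert-default pattern that the streaming sources further below adopt; the function name, the includes and the exact reference qualifiers are assumptions for the sketch, not part of this change:

    #include <Columns/IColumn.h>      /// assumed to declare MutableColumns and ColumnCheckpoints
    #include <Common/Exception.h>

    /// Hypothetical callback: restore every column to the checkpoint taken before the
    /// failed row, then append one default value per column so the failed message still
    /// occupies a row that the caller can annotate with the error message.
    size_t onParseError(const DB::MutableColumns & result_columns,
                        const DB::ColumnCheckpoints & checkpoints,
                        DB::Exception & /* e: the caller may store e.message() */)
    {
        for (size_t i = 0; i < result_columns.size(); ++i)
        {
            result_columns[i]->rollback(*checkpoints[i]);   /// undo the partially written row
            result_columns[i]->insertDefault();             /// keep one default-filled row
        }
        return 1;   /// one row was added for the failed message
    }

FileLogSource, KafkaSource and StorageKafka2 below switch to exactly this pattern in place of the old popBack()-based truncation.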
@@ -50,6 +50,7 @@ private: InputPort port; MutableColumns result_columns; + ColumnCheckpoints checkpoints; }; } diff --git a/src/Processors/Formats/IRowInputFormat.cpp b/src/Processors/Formats/IRowInputFormat.cpp index 2da5118afe1..ea316f97521 100644 --- a/src/Processors/Formats/IRowInputFormat.cpp +++ b/src/Processors/Formats/IRowInputFormat.cpp @@ -111,6 +111,10 @@ Chunk IRowInputFormat::read() for (size_t i = 0; i < num_columns; ++i) columns[i] = header.getByPosition(i).type->createColumn(*serializations[i]); + ColumnCheckpoints checkpoints(columns.size()); + for (size_t column_idx = 0; column_idx < columns.size(); ++column_idx) + checkpoints[column_idx] = columns[column_idx]->getCheckpoint(); + block_missing_values.clear(); size_t num_rows = 0; @@ -136,6 +140,9 @@ Chunk IRowInputFormat::read() { try { + for (size_t column_idx = 0; column_idx < columns.size(); ++column_idx) + columns[column_idx]->updateCheckpoint(*checkpoints[column_idx]); + info.read_columns.clear(); continue_reading = readRow(columns, info); @@ -199,14 +206,9 @@ Chunk IRowInputFormat::read() syncAfterError(); - /// Truncate all columns in block to initial size (remove values, that was appended to only part of columns). - + /// Rollback all columns in block to initial size (remove values, that was appended to only part of columns). for (size_t column_idx = 0; column_idx < num_columns; ++column_idx) - { - auto & column = columns[column_idx]; - if (column->size() > num_rows) - column->popBack(column->size() - num_rows); - } + columns[column_idx]->rollback(*checkpoints[column_idx]); } } } diff --git a/src/Processors/Formats/Impl/JSONAsStringRowInputFormat.cpp b/src/Processors/Formats/Impl/JSONAsStringRowInputFormat.cpp index 1985c7433c8..06557db9aa2 100644 --- a/src/Processors/Formats/Impl/JSONAsStringRowInputFormat.cpp +++ b/src/Processors/Formats/Impl/JSONAsStringRowInputFormat.cpp @@ -172,7 +172,7 @@ JSONAsObjectRowInputFormat::JSONAsObjectRowInputFormat( const auto & type = header_.getByPosition(0).type; if (!isObject(type) && !isObjectDeprecated(type)) throw Exception(ErrorCodes::BAD_ARGUMENTS, - "Input format JSONAsObject is only suitable for tables with a single column of type Object/JSON but the column type is {}", + "Input format JSONAsObject is only suitable for tables with a single column of type JSON but the column type is {}", type->getName()); } @@ -193,8 +193,8 @@ JSONAsObjectExternalSchemaReader::JSONAsObjectExternalSchemaReader(const FormatS if (!settings.json.allow_deprecated_object_type && !settings.json.allow_json_type) throw Exception( ErrorCodes::ILLEGAL_COLUMN, - "Cannot infer the data structure in JSONAsObject format because experimental Object/JSON type is not allowed. Set setting " - "allow_experimental_object_type = 1 or allow_experimental_json_type=1 in order to allow it"); + "Cannot infer the data structure in JSONAsObject format because experimental JSON type is not allowed. 
Set setting " + "allow_experimental_json_type = 1 in order to allow it"); } void registerInputFormatJSONAsString(FormatFactory & factory) diff --git a/src/Processors/Formats/Impl/NativeFormat.cpp b/src/Processors/Formats/Impl/NativeFormat.cpp index 5411e2e7811..022cb38596b 100644 --- a/src/Processors/Formats/Impl/NativeFormat.cpp +++ b/src/Processors/Formats/Impl/NativeFormat.cpp @@ -15,16 +15,17 @@ namespace DB class NativeInputFormat final : public IInputFormat { public: - NativeInputFormat(ReadBuffer & buf, const Block & header_, const FormatSettings & settings) + NativeInputFormat(ReadBuffer & buf, const Block & header_, const FormatSettings & settings_) : IInputFormat(header_, &buf) , reader(std::make_unique( buf, header_, 0, - settings, - settings.defaults_for_omitted_fields ? &block_missing_values : nullptr)) + settings_, + settings_.defaults_for_omitted_fields ? &block_missing_values : nullptr)) , header(header_) , block_missing_values(header.columns()) + , settings(settings_) { } @@ -55,7 +56,7 @@ public: void setReadBuffer(ReadBuffer & in_) override { - reader = std::make_unique(in_, header, 0); + reader = std::make_unique(in_, header, 0, settings, settings.defaults_for_omitted_fields ? &block_missing_values : nullptr); IInputFormat::setReadBuffer(in_); } @@ -67,6 +68,7 @@ private: std::unique_ptr reader; Block header; BlockMissingValues block_missing_values; + const FormatSettings settings; size_t approx_bytes_read_for_chunk = 0; }; diff --git a/src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.cpp b/src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.cpp index b8e4db8700c..b471989076b 100644 --- a/src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.cpp +++ b/src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.cpp @@ -296,6 +296,31 @@ void ParquetPlainValuesReader::readBatch( ); } +template +void ParquetBitPlainReader::readBatch( + MutableColumnPtr & col_ptr, LazyNullMap & null_map, UInt32 num_values) +{ + auto cursor = col_ptr->size(); + auto * column_data = getResizedPrimitiveData(*assert_cast(col_ptr.get()), cursor + num_values); + + def_level_reader->visitNullableValues( + cursor, + num_values, + max_def_level, + null_map, + /* individual_visitor */ [&](size_t nest_cursor) + { + uint8_t byte; + bit_reader->GetValue(1, &byte); + column_data[nest_cursor] = byte; + }, + /* repeated_visitor */ [&](size_t nest_cursor, UInt32 count) + { + bit_reader->GetBatch(1, &column_data[nest_cursor], count); + } + ); +} + template <> void ParquetPlainValuesReader, ParquetReaderTypes::TimestampInt96>::readBatch( @@ -561,6 +586,9 @@ template class ParquetPlainValuesReader>; template class ParquetPlainValuesReader>; template class ParquetPlainValuesReader>; template class ParquetPlainValuesReader; +template class ParquetPlainValuesReader; + +template class ParquetBitPlainReader; template class ParquetFixedLenPlainReader>; template class ParquetFixedLenPlainReader>; @@ -569,6 +597,7 @@ template class ParquetRleLCReader; template class ParquetRleLCReader; template class ParquetRleLCReader; +template class ParquetRleDictReader; template class ParquetRleDictReader; template class ParquetRleDictReader; template class ParquetRleDictReader; diff --git a/src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.h b/src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.h index fbccb612b3c..db55f7e2d6a 100644 --- a/src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.h +++ b/src/Processors/Formats/Impl/Parquet/ParquetDataValuesReader.h @@ -172,6 +172,27 @@ 
private: ParquetDataBuffer plain_data_buffer; }; +template +class ParquetBitPlainReader : public ParquetDataValuesReader +{ +public: + ParquetBitPlainReader( + Int32 max_def_level_, + std::unique_ptr def_level_reader_, + std::unique_ptr bit_reader_) + : max_def_level(max_def_level_) + , def_level_reader(std::move(def_level_reader_)) + , bit_reader(std::move(bit_reader_)) + {} + + void readBatch(MutableColumnPtr & col_ptr, LazyNullMap & null_map, UInt32 num_values) override; + +private: + Int32 max_def_level; + std::unique_ptr def_level_reader; + std::unique_ptr bit_reader; +}; + /** * The data and definition level encoding are same as ParquetPlainValuesReader. * But the element size is const and bigger than primitive data type. diff --git a/src/Processors/Formats/Impl/Parquet/ParquetLeafColReader.cpp b/src/Processors/Formats/Impl/Parquet/ParquetLeafColReader.cpp index 4b5880eba37..c3c7db510ed 100644 --- a/src/Processors/Formats/Impl/Parquet/ParquetLeafColReader.cpp +++ b/src/Processors/Formats/Impl/Parquet/ParquetLeafColReader.cpp @@ -425,16 +425,29 @@ void ParquetLeafColReader::initDataReader( degradeDictionary(); } - ParquetDataBuffer parquet_buffer = [&]() + if (col_descriptor.physical_type() == parquet::Type::BOOLEAN) { - if constexpr (!std::is_same_v, TColumn>) - return ParquetDataBuffer(buffer, max_size); + if constexpr (std::is_same_v) + { + auto bit_reader = std::make_unique(buffer, max_size); + data_values_reader = std::make_unique>(col_descriptor.max_definition_level(), + std::move(def_level_reader), + std::move(bit_reader)); + } + } + else + { + ParquetDataBuffer parquet_buffer = [&]() + { + if constexpr (!std::is_same_v, TColumn>) + return ParquetDataBuffer(buffer, max_size); - auto scale = assert_cast(*base_data_type).getScale(); - return ParquetDataBuffer(buffer, max_size, scale); - }(); - data_values_reader = createPlainReader( - col_descriptor, std::move(def_level_reader), std::move(parquet_buffer)); + auto scale = assert_cast(*base_data_type).getScale(); + return ParquetDataBuffer(buffer, max_size, scale); + }(); + data_values_reader = createPlainReader( + col_descriptor, std::move(def_level_reader), std::move(parquet_buffer)); + } break; } case parquet::Encoding::RLE_DICTIONARY: @@ -612,6 +625,12 @@ std::unique_ptr ParquetLeafColReader::createDi }); return res; } + + if (col_descriptor.physical_type() == parquet::Type::type::BOOLEAN) + { + throw Exception(ErrorCodes::NOT_IMPLEMENTED, "Dictionary encoding for booleans is not supported"); + } + return std::make_unique>( col_descriptor.max_definition_level(), std::move(def_level_reader), @@ -620,6 +639,7 @@ std::unique_ptr ParquetLeafColReader::createDi } +template class ParquetLeafColReader; template class ParquetLeafColReader; template class ParquetLeafColReader; template class ParquetLeafColReader; diff --git a/src/Processors/Formats/Impl/Parquet/ParquetRecordReader.cpp b/src/Processors/Formats/Impl/Parquet/ParquetRecordReader.cpp index acf11a30162..971bb9e1be5 100644 --- a/src/Processors/Formats/Impl/Parquet/ParquetRecordReader.cpp +++ b/src/Processors/Formats/Impl/Parquet/ParquetRecordReader.cpp @@ -263,7 +263,7 @@ std::unique_ptr ColReaderFactory::makeReader() switch (col_descriptor.physical_type()) { case parquet::Type::BOOLEAN: - break; + return makeLeafReader(); case parquet::Type::INT32: return fromInt32(); case parquet::Type::INT64: diff --git a/src/Processors/Merges/Algorithms/ReplacingSortedAlgorithm.cpp b/src/Processors/Merges/Algorithms/ReplacingSortedAlgorithm.cpp index cd347d371d9..dbce348d1aa 100644 --- 
a/src/Processors/Merges/Algorithms/ReplacingSortedAlgorithm.cpp +++ b/src/Processors/Merges/Algorithms/ReplacingSortedAlgorithm.cpp @@ -46,11 +46,28 @@ ReplacingSortedAlgorithm::ReplacingSortedAlgorithm( { if (!is_deleted_column.empty()) is_deleted_column_number = header_.getPositionByName(is_deleted_column); + if (!version_column.empty()) version_column_number = header_.getPositionByName(version_column); } void ReplacingSortedAlgorithm::insertRow() +{ + if (is_deleted_column_number != -1) + { + if (!(cleanup && assert_cast<const ColumnUInt8 &>(*(*selected_row.all_columns)[is_deleted_column_number]).getData()[selected_row.row_num])) + insertRowImpl(); + } + else + { + insertRowImpl(); + } + + /// insertRowImpl() may not have been called + saveChunkForSkippingFinalFromSelectedRow(); +} + +void ReplacingSortedAlgorithm::insertRowImpl() { if (out_row_sources_buf) { @@ -67,6 +84,7 @@ void ReplacingSortedAlgorithm::insertRow() /// We just record the position to be selected in the chunk if (!selected_row.owned_chunk->replace_final_selection) selected_row.owned_chunk->replace_final_selection = ColumnUInt64::create(); + selected_row.owned_chunk->replace_final_selection->insert(selected_row.row_num); /// This is the last row we can select from `selected_row.owned_chunk`, keep it to emit later @@ -74,7 +92,9 @@ to_be_emitted.push(std::move(selected_row.owned_chunk)); } else + { merged_data->insertRow(*selected_row.all_columns, selected_row.row_num, selected_row.owned_chunk->getNumRows()); + } selected_row.clear(); } @@ -113,30 +133,68 @@ IMergingAlgorithm::Status ReplacingSortedAlgorithm::merge() /// Write the data for the previous primary key. if (!selected_row.empty()) - { - if (is_deleted_column_number!=-1) - { - if (!(cleanup && assert_cast<const ColumnUInt8 &>(*(*selected_row.all_columns)[is_deleted_column_number]).getData()[selected_row.row_num])) - insertRow(); - } - else - insertRow(); - /// insertRow() may has not been called - saveChunkForSkippingFinalFromSelectedRow(); - } + insertRow(); selected_row.clear(); } + if (current->isFirst() + && key_differs + && is_deleted_column_number == -1 /// Ignore optimization if we need to filter deleted rows. + && sources_origin_merge_tree_part_level[current->order] > 0 + && !skipLastRowFor(current->order) /// Ignore optimization if last row should be skipped. + && (queue.size() == 1 || (queue.size() >= 2 && current.totallyLess(queue.nextChild())))) + { + /// This is a special optimization: if the current cursor is totally less than the next cursor + /// and the current chunk has no duplicates (we assume that parts with non-zero level have no duplicates), + /// we want to insert the current cursor's chunk directly into the merged data. + + /// First, if merged_data is not empty, we need to flush it. + /// We will get into the same condition on the next merge call. + if (merged_data->mergedRows() != 0) + return Status(merged_data->pull()); + + size_t source_num = current->order; + auto current_chunk = std::move(*sources[source_num].chunk); + size_t chunk_num_rows = current_chunk.getNumRows(); + + /// We will get the next block from the corresponding source, if there is one. 
+ queue.removeTop(); + + if (enable_vertical_final) + { + current_chunk.getChunkInfos().add(std::make_shared<ChunkSelectFinalAllRows>()); + Status status(std::move(current_chunk)); + status.required_source = source_num; + return status; + } + + merged_data->insertChunk(std::move(current_chunk), chunk_num_rows); + sources[source_num].chunk = {}; + + /// Write the order of rows for the other columns; this data will be used in the gather stream. + if (out_row_sources_buf) + { + /// No rows are skipped. + RowSourcePart row_source(source_num); + for (size_t i = 0; i < chunk_num_rows; ++i) + out_row_sources_buf->write(row_source.data); + } + + Status status(merged_data->pull()); + status.required_source = source_num; + return status; + } + /// Initially, skip all rows. Unskip last on insert. size_t current_pos = current_row_sources.size(); if (out_row_sources_buf) current_row_sources.emplace_back(current.impl->order, true); - if ((is_deleted_column_number!=-1)) + if (is_deleted_column_number != -1) { const UInt8 is_deleted = assert_cast<const ColumnUInt8 &>(*current->all_columns[is_deleted_column_number]).getData()[current->getRow()]; - if ((is_deleted != 1) && (is_deleted != 0)) + if (is_deleted > 1) throw Exception(ErrorCodes::INCORRECT_DATA, "Incorrect data: is_deleted = {} (must be 1 or 0).", toString(is_deleted)); } @@ -172,17 +230,7 @@ IMergingAlgorithm::Status ReplacingSortedAlgorithm::merge() /// We will write the data for the last primary key. if (!selected_row.empty()) - { - if (is_deleted_column_number!=-1) - { - if (!(cleanup && assert_cast<const ColumnUInt8 &>(*(*selected_row.all_columns)[is_deleted_column_number]).getData()[selected_row.row_num])) - insertRow(); - } - else - insertRow(); - /// insertRow() may has not been called - saveChunkForSkippingFinalFromSelectedRow(); - } + insertRow(); /// Skipping final: emit the remaining chunks if (!to_be_emitted.empty()) diff --git a/src/Processors/Merges/Algorithms/ReplacingSortedAlgorithm.h b/src/Processors/Merges/Algorithms/ReplacingSortedAlgorithm.h index 2f23f2a5c4d..ec366b900f5 100644 --- a/src/Processors/Merges/Algorithms/ReplacingSortedAlgorithm.h +++ b/src/Processors/Merges/Algorithms/ReplacingSortedAlgorithm.h @@ -13,8 +13,7 @@ class Logger; namespace DB { -/** Use in skipping final to keep list of indices of selected row after merging final - */ +/// Used in skipping final to keep the list of indices of selected rows after merging. struct ChunkSelectFinalIndices : public ChunkInfoCloneable<ChunkSelectFinalIndices> { explicit ChunkSelectFinalIndices(MutableColumnPtr select_final_indices_); @@ -24,6 +23,11 @@ struct ChunkSelectFinalIndices : public ChunkInfoCloneable<ChunkSelectFinalIndices> +struct ChunkSelectFinalAllRows : public ChunkInfoCloneable<ChunkSelectFinalAllRows> +{ +}; + /** Merges several sorted inputs into one. * For each group of consecutive identical values of the primary key (the columns by which the data is sorted), * keeps row with max `version` value. 
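As a rough summary of the fast-path condition added in the hunk above, written as a standalone predicate (the names here are hypothetical; the real check operates on the merge cursors and the queue shown in the diff):

    /// A chunk may bypass row-by-row merging only when it is the first chunk of its cursor,
    /// starts a new key, needs no is_deleted filtering, comes from a part with a non-zero
    /// merge level (assumed to be already deduplicated), keeps its last row, and does not
    /// overlap with the data still held by the other cursors (or is the only input left).
    bool canPassChunkThrough(
        bool cursor_is_first,
        bool key_differs,
        bool has_is_deleted_column,
        size_t part_level,
        bool skip_last_row,
        bool no_overlap_with_other_inputs)
    {
        return cursor_is_first
            && key_differs
            && !has_is_deleted_column
            && part_level > 0
            && !skip_last_row
            && no_overlap_with_other_inputs;
    }

When this holds and enable_vertical_final is set, the chunk is only tagged with ChunkSelectFinalAllRows instead of being copied, and SelectByIndicesTransform (changed later in this diff) passes such chunks through untouched.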
@@ -63,6 +67,7 @@ private: PODArray current_row_sources; void insertRow(); + void insertRowImpl(); /// Method for using in skipping FINAL logic /// Skipping FINAL doesn't merge rows to new chunks but marks selected rows in input chunks and emit them diff --git a/src/Processors/QueryPlan/Optimizations/optimizeUseAggregateProjection.cpp b/src/Processors/QueryPlan/Optimizations/optimizeUseAggregateProjection.cpp index 511ae274101..dee16bfcb1a 100644 --- a/src/Processors/QueryPlan/Optimizations/optimizeUseAggregateProjection.cpp +++ b/src/Processors/QueryPlan/Optimizations/optimizeUseAggregateProjection.cpp @@ -752,7 +752,8 @@ std::optional optimizeUseAggregateProjections(QueryPlan::Node & node, Qu Pipe pipe(std::make_shared(std::move(block_with_count))); projection_reading = std::make_unique(std::move(pipe)); - selected_projection_name = "Optimized trivial count"; + /// Use @minmax_count_projection name as it goes through the same optimization. + selected_projection_name = metadata->minmax_count_projection->name; has_ordinary_parts = reading->getAnalyzedResult() != nullptr; } else diff --git a/src/Processors/Transforms/SelectByIndicesTransform.h b/src/Processors/Transforms/SelectByIndicesTransform.h index b44f5a3203e..e67d3bfde51 100644 --- a/src/Processors/Transforms/SelectByIndicesTransform.h +++ b/src/Processors/Transforms/SelectByIndicesTransform.h @@ -26,8 +26,12 @@ public: void transform(Chunk & chunk) override { size_t num_rows = chunk.getNumRows(); - auto select_final_indices_info = chunk.getChunkInfos().extract(); + auto select_all_rows_info = chunk.getChunkInfos().extract(); + if (select_all_rows_info) + return; + + auto select_final_indices_info = chunk.getChunkInfos().extract(); if (!select_final_indices_info || !select_final_indices_info->select_final_indices) throw Exception(ErrorCodes::LOGICAL_ERROR, "Chunk passed to SelectByIndicesTransform without indices column"); diff --git a/src/Server/CertificateReloader.cpp b/src/Server/CertificateReloader.cpp index 5b981fc7a87..aa84b26af69 100644 --- a/src/Server/CertificateReloader.cpp +++ b/src/Server/CertificateReloader.cpp @@ -91,6 +91,12 @@ void CertificateReloader::tryLoad(const Poco::Util::AbstractConfiguration & conf } +void CertificateReloader::tryLoadClient(const Poco::Util::AbstractConfiguration & config) +{ + tryLoad(config, nullptr, Poco::Net::SSLManager::CFG_CLIENT_PREFIX); +} + + void CertificateReloader::tryLoad(const Poco::Util::AbstractConfiguration & config, SSL_CTX * ctx, const std::string & prefix) { std::lock_guard lock{data_mutex}; @@ -107,7 +113,12 @@ std::list::iterator CertificateReloader::findOrI else { if (!ctx) - ctx = Poco::Net::SSLManager::instance().defaultServerContext()->sslContext(); + { + if (prefix == Poco::Net::SSLManager::CFG_CLIENT_PREFIX) + ctx = Poco::Net::SSLManager::instance().defaultClientContext()->sslContext(); + else + ctx = Poco::Net::SSLManager::instance().defaultServerContext()->sslContext(); + } data.push_back(MultiData(ctx)); --it; data_index[prefix] = it; diff --git a/src/Server/CertificateReloader.h b/src/Server/CertificateReloader.h index 28737988fdd..0e4ea8b989e 100644 --- a/src/Server/CertificateReloader.h +++ b/src/Server/CertificateReloader.h @@ -77,6 +77,9 @@ public: /// Handle configuration reload for default path void tryLoad(const Poco::Util::AbstractConfiguration & config); + /// Handle configuration reload client for default path + void tryLoadClient(const Poco::Util::AbstractConfiguration & config); + /// Handle configuration reload void tryLoad(const 
Poco::Util::AbstractConfiguration & config, SSL_CTX * ctx, const std::string & prefix); diff --git a/src/Server/HTTP/authenticateUserByHTTP.cpp b/src/Server/HTTP/authenticateUserByHTTP.cpp index cbad91cc292..61029ed9560 100644 --- a/src/Server/HTTP/authenticateUserByHTTP.cpp +++ b/src/Server/HTTP/authenticateUserByHTTP.cpp @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -54,11 +55,13 @@ bool authenticateUserByHTTP( HTTPServerResponse & response, Session & session, std::unique_ptr & request_credentials, + const HTTPHandlerConnectionConfig & connection_config, ContextPtr global_context, LoggerPtr log) { /// Get the credentials created by the previous call of authenticateUserByHTTP() while handling the previous HTTP request. auto current_credentials = std::move(request_credentials); + const auto & config_credentials = connection_config.credentials; /// The user and password can be passed by headers (similar to X-Auth-*), /// which is used by load balancers to pass authentication information. @@ -70,6 +73,7 @@ bool authenticateUserByHTTP( /// The header 'X-ClickHouse-SSL-Certificate-Auth: on' enables checking the common name /// extracted from the SSL certificate used for this connection instead of checking password. bool has_ssl_certificate_auth = (request.get("X-ClickHouse-SSL-Certificate-Auth", "") == "on"); + bool has_config_credentials = config_credentials.has_value(); /// User name and password can be passed using HTTP Basic auth or query parameters /// (both methods are insecure). @@ -79,6 +83,10 @@ bool authenticateUserByHTTP( std::string spnego_challenge; SSLCertificateSubjects certificate_subjects; + if (config_credentials) + { + checkUserNameNotEmpty(config_credentials->getUserName(), "config authentication"); + } if (has_ssl_certificate_auth) { #if USE_SSL @@ -86,6 +94,8 @@ bool authenticateUserByHTTP( checkUserNameNotEmpty(user, "X-ClickHouse HTTP headers"); /// It is prohibited to mix different authorization schemes. + if (has_config_credentials) + throwMultipleAuthenticationMethods("SSL certificate authentication", "authentication set in config"); if (!password.empty()) throwMultipleAuthenticationMethods("SSL certificate authentication", "authentication via password"); if (has_http_credentials) @@ -109,6 +119,8 @@ bool authenticateUserByHTTP( checkUserNameNotEmpty(user, "X-ClickHouse HTTP headers"); /// It is prohibited to mix different authorization schemes. + if (has_config_credentials) + throwMultipleAuthenticationMethods("X-ClickHouse HTTP headers", "authentication set in config"); if (has_http_credentials) throwMultipleAuthenticationMethods("X-ClickHouse HTTP headers", "Authorization HTTP header"); if (has_credentials_in_query_params) @@ -117,6 +129,8 @@ bool authenticateUserByHTTP( else if (has_http_credentials) { /// It is prohibited to mix different authorization schemes. + if (has_config_credentials) + throwMultipleAuthenticationMethods("Authorization HTTP header", "authentication set in config"); if (has_credentials_in_query_params) throwMultipleAuthenticationMethods("Authorization HTTP header", "authentication via parameters"); @@ -190,6 +204,10 @@ bool authenticateUserByHTTP( return false; } } + else if (has_config_credentials) + { + current_credentials = std::make_unique(*config_credentials); + } else // I.e., now using user name and password strings ("Basic"). 
{ if (!current_credentials) diff --git a/src/Server/HTTP/authenticateUserByHTTP.h b/src/Server/HTTP/authenticateUserByHTTP.h index 3b5a04cae68..02dcf828faa 100644 --- a/src/Server/HTTP/authenticateUserByHTTP.h +++ b/src/Server/HTTP/authenticateUserByHTTP.h @@ -11,13 +11,22 @@ class HTMLForm; class HTTPServerResponse; class Session; class Credentials; +class BasicCredentials; +struct HTTPHandlerConnectionConfig; /// Authenticates a user via HTTP protocol and initializes a session. +/// /// Usually retrieves the name and the password for that user from either the request's headers or from the query parameters. -/// Returns true when the user successfully authenticated, -/// the session instance will be configured accordingly, and the request_credentials instance will be dropped. -/// Returns false when the user is not authenticated yet, and the HTTP_UNAUTHORIZED response is sent with the "WWW-Authenticate" header, -/// in this case the `request_credentials` instance must be preserved until the next request or until any exception. +/// You can also pass user/password explicitly via `config_credentials`. +/// +/// Returns true when the user successfully authenticated: +/// - the session instance will be configured accordingly +/// - and the request_credentials instance will be dropped. +/// +/// Returns false when the user is not authenticated yet: +/// - the HTTP_UNAUTHORIZED response is sent with the "WWW-Authenticate" header +/// - the `request_credentials` instance must be preserved until the next request or until any exception. +/// /// Throws an exception if authentication failed. bool authenticateUserByHTTP( const HTTPServerRequest & request, @@ -25,6 +34,7 @@ bool authenticateUserByHTTP( HTTPServerResponse & response, Session & session, std::unique_ptr & request_credentials, + const HTTPHandlerConnectionConfig & connection_config, ContextPtr global_context, LoggerPtr log); diff --git a/src/Server/HTTPHandler.cpp b/src/Server/HTTPHandler.cpp index 8a9ae05b355..5fd92d99b3c 100644 --- a/src/Server/HTTPHandler.cpp +++ b/src/Server/HTTPHandler.cpp @@ -1,6 +1,5 @@ #include -#include #include #include #include @@ -145,6 +144,15 @@ static std::chrono::steady_clock::duration parseSessionTimeout( return std::chrono::seconds(session_timeout); } +HTTPHandlerConnectionConfig::HTTPHandlerConnectionConfig(const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix) +{ + if (config.has(config_prefix + ".handler.user") || config.has(config_prefix + ".handler.password")) + { + credentials.emplace( + config.getString(config_prefix + ".handler.user", "default"), + config.getString(config_prefix + ".handler.password", "")); + } +} void HTTPHandler::pushDelayedResults(Output & used_output) { @@ -182,11 +190,12 @@ void HTTPHandler::pushDelayedResults(Output & used_output) } -HTTPHandler::HTTPHandler(IServer & server_, const std::string & name, const HTTPResponseHeaderSetup & http_response_headers_override_) +HTTPHandler::HTTPHandler(IServer & server_, const HTTPHandlerConnectionConfig & connection_config_, const std::string & name, const HTTPResponseHeaderSetup & http_response_headers_override_) : server(server_) , log(getLogger(name)) , default_settings(server.context()->getSettingsRef()) , http_response_headers_override(http_response_headers_override_) + , connection_config(connection_config_) { server_display_name = server.config().getString("display_name", getFQDNOrHostName()); } @@ -199,7 +208,7 @@ HTTPHandler::~HTTPHandler() = default; bool 
HTTPHandler::authenticateUser(HTTPServerRequest & request, HTMLForm & params, HTTPServerResponse & response) { - return authenticateUserByHTTP(request, params, response, *session, request_credentials, server.context(), log); + return authenticateUserByHTTP(request, params, response, *session, request_credentials, connection_config, server.context(), log); } @@ -768,8 +777,12 @@ void HTTPHandler::handleRequest(HTTPServerRequest & request, HTTPServerResponse } DynamicQueryHandler::DynamicQueryHandler( - IServer & server_, const std::string & param_name_, const HTTPResponseHeaderSetup & http_response_headers_override_) - : HTTPHandler(server_, "DynamicQueryHandler", http_response_headers_override_), param_name(param_name_) + IServer & server_, + const HTTPHandlerConnectionConfig & connection_config, + const std::string & param_name_, + const HTTPResponseHeaderSetup & http_response_headers_override_) + : HTTPHandler(server_, connection_config, "DynamicQueryHandler", http_response_headers_override_) + , param_name(param_name_) { } @@ -826,12 +839,13 @@ std::string DynamicQueryHandler::getQuery(HTTPServerRequest & request, HTMLForm PredefinedQueryHandler::PredefinedQueryHandler( IServer & server_, + const HTTPHandlerConnectionConfig & connection_config, const NameSet & receive_params_, const std::string & predefined_query_, const CompiledRegexPtr & url_regex_, const std::unordered_map & header_name_with_regex_, const HTTPResponseHeaderSetup & http_response_headers_override_) - : HTTPHandler(server_, "PredefinedQueryHandler", http_response_headers_override_) + : HTTPHandler(server_, connection_config, "PredefinedQueryHandler", http_response_headers_override_) , receive_params(receive_params_) , predefined_query(predefined_query_) , url_regex(url_regex_) @@ -923,10 +937,11 @@ HTTPRequestHandlerFactoryPtr createDynamicHandlerFactory(IServer & server, { auto query_param_name = config.getString(config_prefix + ".handler.query_param_name", "query"); + HTTPHandlerConnectionConfig connection_config(config, config_prefix); HTTPResponseHeaderSetup http_response_headers_override = parseHTTPResponseHeaders(config, config_prefix); - auto creator = [&server, query_param_name, http_response_headers_override]() -> std::unique_ptr - { return std::make_unique(server, query_param_name, http_response_headers_override); }; + auto creator = [&server, query_param_name, http_response_headers_override, connection_config]() -> std::unique_ptr + { return std::make_unique(server, connection_config, query_param_name, http_response_headers_override); }; auto factory = std::make_shared>(std::move(creator)); factory->addFiltersFromConfig(config, config_prefix); @@ -968,6 +983,8 @@ HTTPRequestHandlerFactoryPtr createPredefinedHandlerFactory(IServer & server, Poco::Util::AbstractConfiguration::Keys headers_name; config.keys(config_prefix + ".headers", headers_name); + HTTPHandlerConnectionConfig connection_config(config, config_prefix); + for (const auto & header_name : headers_name) { auto expression = config.getString(config_prefix + ".headers." 
+ header_name); @@ -1001,12 +1018,18 @@ HTTPRequestHandlerFactoryPtr createPredefinedHandlerFactory(IServer & server, predefined_query, regex, headers_name_with_regex, - http_response_headers_override] + http_response_headers_override, + connection_config] -> std::unique_ptr { return std::make_unique( - server, analyze_receive_params, predefined_query, regex, - headers_name_with_regex, http_response_headers_override); + server, + connection_config, + analyze_receive_params, + predefined_query, + regex, + headers_name_with_regex, + http_response_headers_override); }; factory = std::make_shared>(std::move(creator)); factory->addFiltersFromConfig(config, config_prefix); @@ -1019,18 +1042,21 @@ HTTPRequestHandlerFactoryPtr createPredefinedHandlerFactory(IServer & server, analyze_receive_params, predefined_query, headers_name_with_regex, - http_response_headers_override] + http_response_headers_override, + connection_config] -> std::unique_ptr { return std::make_unique( - server, analyze_receive_params, predefined_query, CompiledRegexPtr{}, - headers_name_with_regex, http_response_headers_override); + server, + connection_config, + analyze_receive_params, + predefined_query, + CompiledRegexPtr{}, + headers_name_with_regex, + http_response_headers_override); }; - factory = std::make_shared>(std::move(creator)); - factory->addFiltersFromConfig(config, config_prefix); - return factory; } diff --git a/src/Server/HTTPHandler.h b/src/Server/HTTPHandler.h index 6580b317f6e..2296fa70aeb 100644 --- a/src/Server/HTTPHandler.h +++ b/src/Server/HTTPHandler.h @@ -12,6 +12,7 @@ #include #include #include +#include #include "HTTPResponseHeaderWriter.h" @@ -26,17 +27,28 @@ namespace DB { class Session; -class Credentials; class IServer; struct Settings; class WriteBufferFromHTTPServerResponse; using CompiledRegexPtr = std::shared_ptr; +struct HTTPHandlerConnectionConfig +{ + std::optional credentials; + + /// TODO: + /// String quota; + /// String default_database; + + HTTPHandlerConnectionConfig() = default; + HTTPHandlerConnectionConfig(const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix); +}; + class HTTPHandler : public HTTPRequestHandler { public: - HTTPHandler(IServer & server_, const std::string & name, const HTTPResponseHeaderSetup & http_response_headers_override_); + HTTPHandler(IServer & server_, const HTTPHandlerConnectionConfig & connection_config_, const std::string & name, const HTTPResponseHeaderSetup & http_response_headers_override_); ~HTTPHandler() override; void handleRequest(HTTPServerRequest & request, HTTPServerResponse & response, const ProfileEvents::Event & write_event) override; @@ -146,16 +158,7 @@ private: // The request_credential instance may outlive a single request/response loop. // This happens only when the authentication mechanism requires more than a single request/response exchange (e.g., SPNEGO). std::unique_ptr request_credentials; - - // Returns true when the user successfully authenticated, - // the session instance will be configured accordingly, and the request_credentials instance will be dropped. - // Returns false when the user is not authenticated yet, and the 'Negotiate' response is sent, - // the session and request_credentials instances are preserved. - // Throws an exception if authentication failed. - bool authenticateUser( - HTTPServerRequest & request, - HTMLForm & params, - HTTPServerResponse & response); + HTTPHandlerConnectionConfig connection_config; /// Also initializes 'used_output'. 
void processQuery( @@ -174,6 +177,13 @@ private: Output & used_output); static void pushDelayedResults(Output & used_output); + +protected: + // @see authenticateUserByHTTP() + virtual bool authenticateUser( + HTTPServerRequest & request, + HTMLForm & params, + HTTPServerResponse & response); }; class DynamicQueryHandler : public HTTPHandler @@ -184,6 +194,7 @@ private: public: explicit DynamicQueryHandler( IServer & server_, + const HTTPHandlerConnectionConfig & connection_config, const std::string & param_name_ = "query", const HTTPResponseHeaderSetup & http_response_headers_override_ = std::nullopt); @@ -203,6 +214,7 @@ private: public: PredefinedQueryHandler( IServer & server_, + const HTTPHandlerConnectionConfig & connection_config, const NameSet & receive_params_, const std::string & predefined_query_, const CompiledRegexPtr & url_regex_, diff --git a/src/Server/HTTPHandlerFactory.cpp b/src/Server/HTTPHandlerFactory.cpp index 2d5ddd859fe..950cad4038a 100644 --- a/src/Server/HTTPHandlerFactory.cpp +++ b/src/Server/HTTPHandlerFactory.cpp @@ -275,7 +275,7 @@ void addDefaultHandlersFactory( auto dynamic_creator = [&server] () -> std::unique_ptr { - return std::make_unique(server, "query"); + return std::make_unique(server, HTTPHandlerConnectionConfig{}, "query"); }; auto query_handler = std::make_shared>(std::move(dynamic_creator)); query_handler->addFilter([](const auto & request) diff --git a/src/Server/PrometheusRequestHandler.cpp b/src/Server/PrometheusRequestHandler.cpp index cd18eac50a7..9c521e06667 100644 --- a/src/Server/PrometheusRequestHandler.cpp +++ b/src/Server/PrometheusRequestHandler.cpp @@ -7,6 +7,7 @@ #include #include #include +#include #include "config.h" #include @@ -137,7 +138,7 @@ protected: bool authenticateUser(HTTPServerRequest & request, HTTPServerResponse & response) { - return authenticateUserByHTTP(request, *params, response, *session, request_credentials, server().context(), log()); + return authenticateUserByHTTP(request, *params, response, *session, request_credentials, HTTPHandlerConnectionConfig{}, server().context(), log()); } void makeContext(HTTPServerRequest & request) diff --git a/src/Server/ProtocolServerAdapter.cpp b/src/Server/ProtocolServerAdapter.cpp index 6b723bc8d87..3abf5733c52 100644 --- a/src/Server/ProtocolServerAdapter.cpp +++ b/src/Server/ProtocolServerAdapter.cpp @@ -30,11 +30,13 @@ ProtocolServerAdapter::ProtocolServerAdapter( const std::string & listen_host_, const char * port_name_, const std::string & description_, - std::unique_ptr tcp_server_) + std::unique_ptr tcp_server_, + bool supports_runtime_reconfiguration_) : listen_host(listen_host_) , port_name(port_name_) , description(description_) , impl(std::make_unique(std::move(tcp_server_))) + , supports_runtime_reconfiguration(supports_runtime_reconfiguration_) { } @@ -66,11 +68,13 @@ ProtocolServerAdapter::ProtocolServerAdapter( const std::string & listen_host_, const char * port_name_, const std::string & description_, - std::unique_ptr grpc_server_) + std::unique_ptr grpc_server_, + bool supports_runtime_reconfiguration_) : listen_host(listen_host_) , port_name(port_name_) , description(description_) , impl(std::make_unique(std::move(grpc_server_))) + , supports_runtime_reconfiguration(supports_runtime_reconfiguration_) { } #endif diff --git a/src/Server/ProtocolServerAdapter.h b/src/Server/ProtocolServerAdapter.h index 4a0b0cae8e7..132a9b93c1b 100644 --- a/src/Server/ProtocolServerAdapter.h +++ b/src/Server/ProtocolServerAdapter.h @@ -21,10 +21,20 @@ class 
ProtocolServerAdapter public: ProtocolServerAdapter(ProtocolServerAdapter && src) = default; ProtocolServerAdapter & operator =(ProtocolServerAdapter && src) = default; - ProtocolServerAdapter(const std::string & listen_host_, const char * port_name_, const std::string & description_, std::unique_ptr tcp_server_); + ProtocolServerAdapter( + const std::string & listen_host_, + const char * port_name_, + const std::string & description_, + std::unique_ptr tcp_server_, + bool supports_runtime_reconfiguration_ = true); #if USE_GRPC - ProtocolServerAdapter(const std::string & listen_host_, const char * port_name_, const std::string & description_, std::unique_ptr grpc_server_); + ProtocolServerAdapter( + const std::string & listen_host_, + const char * port_name_, + const std::string & description_, + std::unique_ptr grpc_server_, + bool supports_runtime_reconfiguration_ = true); #endif /// Starts the server. A new thread will be created that waits for and accepts incoming connections. @@ -46,6 +56,8 @@ public: /// Returns the port this server is listening to. UInt16 portNumber() const { return impl->portNumber(); } + bool supportsRuntimeReconfiguration() const { return supports_runtime_reconfiguration; } + const std::string & getListenHost() const { return listen_host; } const std::string & getPortName() const { return port_name; } @@ -72,6 +84,7 @@ private: std::string port_name; std::string description; std::unique_ptr impl; + bool supports_runtime_reconfiguration = true; }; } diff --git a/src/Server/TCPHandler.cpp b/src/Server/TCPHandler.cpp index 921c53b6bcb..e7e4ae25a68 100644 --- a/src/Server/TCPHandler.cpp +++ b/src/Server/TCPHandler.cpp @@ -301,6 +301,9 @@ void TCPHandler::runImpl() { receiveHello(); + if (!default_database.empty()) + DatabaseCatalog::instance().assertDatabaseExists(default_database); + /// In interserver mode queries are executed without a session context. if (!is_interserver_mode) session->makeSessionContext(); diff --git a/src/Storages/FileLog/FileLogSource.cpp b/src/Storages/FileLog/FileLogSource.cpp index 7f7d21b62c8..3ae38434de2 100644 --- a/src/Storages/FileLog/FileLogSource.cpp +++ b/src/Storages/FileLog/FileLogSource.cpp @@ -86,21 +86,18 @@ Chunk FileLogSource::generate() std::optional exception_message; size_t total_rows = 0; - auto on_error = [&](const MutableColumns & result_columns, Exception & e) + auto on_error = [&](const MutableColumns & result_columns, const ColumnCheckpoints & checkpoints, Exception & e) { if (handle_error_mode == StreamingHandleErrorMode::STREAM) { exception_message = e.message(); - for (const auto & column : result_columns) + for (size_t i = 0; i < result_columns.size(); ++i) { - // We could already push some rows to result_columns - // before exception, we need to fix it. - auto cur_rows = column->size(); - if (cur_rows > total_rows) - column->popBack(cur_rows - total_rows); + // We could already push some rows to result_columns before exception, we need to fix it. + result_columns[i]->rollback(*checkpoints[i]); // All data columns will get default value in case of error. 
- column->insertDefault(); + result_columns[i]->insertDefault(); } return 1; diff --git a/src/Storages/Kafka/KafkaSource.cpp b/src/Storages/Kafka/KafkaSource.cpp index 9e654b9fd94..c03de4a37eb 100644 --- a/src/Storages/Kafka/KafkaSource.cpp +++ b/src/Storages/Kafka/KafkaSource.cpp @@ -114,23 +114,20 @@ Chunk KafkaSource::generateImpl() size_t total_rows = 0; size_t failed_poll_attempts = 0; - auto on_error = [&](const MutableColumns & result_columns, Exception & e) + auto on_error = [&](const MutableColumns & result_columns, const ColumnCheckpoints & checkpoints, Exception & e) { ProfileEvents::increment(ProfileEvents::KafkaMessagesFailed); if (put_error_to_stream) { exception_message = e.message(); - for (const auto & column : result_columns) + for (size_t i = 0; i < result_columns.size(); ++i) { - // read_kafka_message could already push some rows to result_columns - // before exception, we need to fix it. - auto cur_rows = column->size(); - if (cur_rows > total_rows) - column->popBack(cur_rows - total_rows); + // We could already push some rows to result_columns before exception, we need to fix it. + result_columns[i]->rollback(*checkpoints[i]); // all data columns will get default value in case of error - column->insertDefault(); + result_columns[i]->insertDefault(); } return 1; diff --git a/src/Storages/Kafka/StorageKafka2.cpp b/src/Storages/Kafka/StorageKafka2.cpp index f583e73f47d..8a1414604ac 100644 --- a/src/Storages/Kafka/StorageKafka2.cpp +++ b/src/Storages/Kafka/StorageKafka2.cpp @@ -854,23 +854,20 @@ StorageKafka2::PolledBatchInfo StorageKafka2::pollConsumer( size_t total_rows = 0; size_t failed_poll_attempts = 0; - auto on_error = [&](const MutableColumns & result_columns, Exception & e) + auto on_error = [&](const MutableColumns & result_columns, const ColumnCheckpoints & checkpoints, Exception & e) { ProfileEvents::increment(ProfileEvents::KafkaMessagesFailed); if (put_error_to_stream) { exception_message = e.message(); - for (const auto & column : result_columns) + for (size_t i = 0; i < result_columns.size(); ++i) { - // read_kafka_message could already push some rows to result_columns - // before exception, we need to fix it. - auto cur_rows = column->size(); - if (cur_rows > total_rows) - column->popBack(cur_rows - total_rows); + // We could already push some rows to result_columns before exception, we need to fix it. + result_columns[i]->rollback(*checkpoints[i]); // all data columns will get default value in case of error - column->insertDefault(); + result_columns[i]->insertDefault(); } return 1; diff --git a/src/Storages/Kafka/StorageKafkaUtils.cpp b/src/Storages/Kafka/StorageKafkaUtils.cpp index dd954d6a7c2..119aadd11d8 100644 --- a/src/Storages/Kafka/StorageKafkaUtils.cpp +++ b/src/Storages/Kafka/StorageKafkaUtils.cpp @@ -308,6 +308,7 @@ void registerStorageKafka(StorageFactory & factory) creator_fn, StorageFactory::StorageFeatures{ .supports_settings = true, + .source_access_type = AccessType::KAFKA, }); } diff --git a/src/Storages/MergeTree/IMergeTreeDataPart.h b/src/Storages/MergeTree/IMergeTreeDataPart.h index 378832d32a1..b41a1d840e1 100644 --- a/src/Storages/MergeTree/IMergeTreeDataPart.h +++ b/src/Storages/MergeTree/IMergeTreeDataPart.h @@ -180,6 +180,9 @@ public: void loadRowsCountFileForUnexpectedPart(); + /// Loads marks and saves them into mark cache for specified columns. 
+ virtual void loadMarksToCache(const Names & column_names, MarkCache * mark_cache) const = 0; + String getMarksFileExtension() const { return index_granularity_info.mark_type.getFileExtension(); } /// Generate the new name for this part according to `new_part_info` and min/max dates from the old name. diff --git a/src/Storages/MergeTree/IMergeTreeDataPartWriter.cpp b/src/Storages/MergeTree/IMergeTreeDataPartWriter.cpp index 3d6366f9217..dbfdbbdea88 100644 --- a/src/Storages/MergeTree/IMergeTreeDataPartWriter.cpp +++ b/src/Storages/MergeTree/IMergeTreeDataPartWriter.cpp @@ -91,6 +91,13 @@ Columns IMergeTreeDataPartWriter::releaseIndexColumns() return result; } +PlainMarksByName IMergeTreeDataPartWriter::releaseCachedMarks() +{ + PlainMarksByName res; + std::swap(cached_marks, res); + return res; +} + SerializationPtr IMergeTreeDataPartWriter::getSerialization(const String & column_name) const { auto it = serializations.find(column_name); diff --git a/src/Storages/MergeTree/IMergeTreeDataPartWriter.h b/src/Storages/MergeTree/IMergeTreeDataPartWriter.h index eb51a1b2922..b8ac14b1750 100644 --- a/src/Storages/MergeTree/IMergeTreeDataPartWriter.h +++ b/src/Storages/MergeTree/IMergeTreeDataPartWriter.h @@ -8,6 +8,7 @@ #include #include #include +#include namespace DB @@ -46,6 +47,9 @@ public: virtual void finish(bool sync) = 0; Columns releaseIndexColumns(); + + PlainMarksByName releaseCachedMarks(); + const MergeTreeIndexGranularity & getIndexGranularity() const { return index_granularity; } protected: @@ -69,6 +73,8 @@ protected: MutableDataPartStoragePtr data_part_storage; MutableColumns index_columns; MergeTreeIndexGranularity index_granularity; + /// Marks that will be saved to cache on finish. + PlainMarksByName cached_marks; }; using MergeTreeDataPartWriterPtr = std::unique_ptr; diff --git a/src/Storages/MergeTree/IMergedBlockOutputStream.h b/src/Storages/MergeTree/IMergedBlockOutputStream.h index cfcfb177e05..a901b03c115 100644 --- a/src/Storages/MergeTree/IMergedBlockOutputStream.h +++ b/src/Storages/MergeTree/IMergedBlockOutputStream.h @@ -34,6 +34,11 @@ public: return writer->getIndexGranularity(); } + PlainMarksByName releaseCachedMarks() + { + return writer->releaseCachedMarks(); + } + protected: /// Remove all columns marked expired in data_part. Also, clears checksums diff --git a/src/Storages/MergeTree/MergeFromLogEntryTask.cpp b/src/Storages/MergeTree/MergeFromLogEntryTask.cpp index fa6640409e5..d7e807c689f 100644 --- a/src/Storages/MergeTree/MergeFromLogEntryTask.cpp +++ b/src/Storages/MergeTree/MergeFromLogEntryTask.cpp @@ -371,6 +371,7 @@ ReplicatedMergeMutateTaskBase::PrepareResult MergeFromLogEntryTask::prepare() bool MergeFromLogEntryTask::finalize(ReplicatedMergeMutateTaskBase::PartLogWriter write_part_log) { part = merge_task->getFuture().get(); + auto cached_marks = merge_task->releaseCachedMarks(); storage.merger_mutator.renameMergedTemporaryPart(part, parts, NO_TRANSACTION_PTR, *transaction_ptr); /// Why we reset task here? 
Because it holds shared pointer to part and tryRemovePartImmediately will @@ -444,6 +445,9 @@ bool MergeFromLogEntryTask::finalize(ReplicatedMergeMutateTaskBase::PartLogWrite finish_callback = [storage_ptr = &storage]() { storage_ptr->merge_selecting_task->schedule(); }; ProfileEvents::increment(ProfileEvents::ReplicatedPartMerges); + if (auto * mark_cache = storage.getContext()->getMarkCache().get()) + addMarksToCache(*part, cached_marks, mark_cache); + write_part_log({}); StorageReplicatedMergeTree::incrementMergedPartsProfileEvent(part->getType()); diff --git a/src/Storages/MergeTree/MergePlainMergeTreeTask.cpp b/src/Storages/MergeTree/MergePlainMergeTreeTask.cpp index f7b52d2216d..6aca58faf47 100644 --- a/src/Storages/MergeTree/MergePlainMergeTreeTask.cpp +++ b/src/Storages/MergeTree/MergePlainMergeTreeTask.cpp @@ -152,6 +152,12 @@ void MergePlainMergeTreeTask::finish() ThreadFuzzer::maybeInjectSleep(); ThreadFuzzer::maybeInjectMemoryLimitException(); + if (auto * mark_cache = storage.getContext()->getMarkCache().get()) + { + auto marks = merge_task->releaseCachedMarks(); + addMarksToCache(*new_part, marks, mark_cache); + } + write_part_log({}); StorageMergeTree::incrementMergedPartsProfileEvent(new_part->getType()); transfer_profile_counters_to_initial_query(); @@ -163,7 +169,6 @@ void MergePlainMergeTreeTask::finish() ThreadFuzzer::maybeInjectSleep(); ThreadFuzzer::maybeInjectMemoryLimitException(); } - } ContextMutablePtr MergePlainMergeTreeTask::createTaskContext() const diff --git a/src/Storages/MergeTree/MergeTask.cpp b/src/Storages/MergeTree/MergeTask.cpp index e3ace824115..193622d7b87 100644 --- a/src/Storages/MergeTree/MergeTask.cpp +++ b/src/Storages/MergeTree/MergeTask.cpp @@ -93,6 +93,7 @@ namespace MergeTreeSetting extern const MergeTreeSettingsUInt64 vertical_merge_algorithm_min_columns_to_activate; extern const MergeTreeSettingsUInt64 vertical_merge_algorithm_min_rows_to_activate; extern const MergeTreeSettingsBool vertical_merge_remote_filesystem_prefetch; + extern const MergeTreeSettingsBool prewarm_mark_cache; } namespace ErrorCodes @@ -546,6 +547,8 @@ bool MergeTask::ExecuteAndFinalizeHorizontalPart::prepare() const } } + bool save_marks_in_cache = (*global_ctx->data->getSettings())[MergeTreeSetting::prewarm_mark_cache] && global_ctx->context->getMarkCache(); + global_ctx->to = std::make_shared( global_ctx->new_data_part, global_ctx->metadata_snapshot, @@ -555,6 +558,7 @@ bool MergeTask::ExecuteAndFinalizeHorizontalPart::prepare() const ctx->compression_codec, global_ctx->txn ? 
global_ctx->txn->tid : Tx::PrehistoricTID, /*reset_columns=*/ true, + save_marks_in_cache, ctx->blocks_are_granules_size, global_ctx->context->getWriteSettings()); @@ -1085,6 +1089,8 @@ void MergeTask::VerticalMergeStage::prepareVerticalMergeForOneColumn() const ctx->executor = std::make_unique(ctx->column_parts_pipeline); NamesAndTypesList columns_list = {*ctx->it_name_and_type}; + bool save_marks_in_cache = (*global_ctx->data->getSettings())[MergeTreeSetting::prewarm_mark_cache] && global_ctx->context->getMarkCache(); + ctx->column_to = std::make_unique( global_ctx->new_data_part, global_ctx->metadata_snapshot, @@ -1093,6 +1099,7 @@ void MergeTask::VerticalMergeStage::prepareVerticalMergeForOneColumn() const column_pipepline.indexes_to_recalc, getStatisticsForColumns(columns_list, global_ctx->metadata_snapshot), &global_ctx->written_offset_columns, + save_marks_in_cache, global_ctx->to->getIndexGranularity()); ctx->column_elems_written = 0; @@ -1130,6 +1137,10 @@ void MergeTask::VerticalMergeStage::finalizeVerticalMergeForOneColumn() const auto changed_checksums = ctx->column_to->fillChecksums(global_ctx->new_data_part, global_ctx->checksums_gathered_columns); global_ctx->checksums_gathered_columns.add(std::move(changed_checksums)); + auto cached_marks = ctx->column_to->releaseCachedMarks(); + for (auto & [name, marks] : cached_marks) + global_ctx->cached_marks.emplace(name, std::move(marks)); + ctx->delayed_streams.emplace_back(std::move(ctx->column_to)); while (ctx->delayed_streams.size() > ctx->max_delayed_streams) @@ -1276,6 +1287,10 @@ bool MergeTask::MergeProjectionsStage::finalizeProjectionsAndWholeMerge() const else global_ctx->to->finalizePart(global_ctx->new_data_part, ctx->need_sync, &global_ctx->storage_columns, &global_ctx->checksums_gathered_columns); + auto cached_marks = global_ctx->to->releaseCachedMarks(); + for (auto & [name, marks] : cached_marks) + global_ctx->cached_marks.emplace(name, std::move(marks)); + global_ctx->new_data_part->getDataPartStorage().precommitTransaction(); global_ctx->promise.set_value(global_ctx->new_data_part); diff --git a/src/Storages/MergeTree/MergeTask.h b/src/Storages/MergeTree/MergeTask.h index 5a4fb1ec0b8..53792165987 100644 --- a/src/Storages/MergeTree/MergeTask.h +++ b/src/Storages/MergeTree/MergeTask.h @@ -5,6 +5,7 @@ #include #include +#include #include #include @@ -132,6 +133,13 @@ public: return nullptr; } + PlainMarksByName releaseCachedMarks() const + { + PlainMarksByName res; + std::swap(global_ctx->cached_marks, res); + return res; + } + bool execute(); private: @@ -209,6 +217,7 @@ private: std::promise promise{}; IMergedBlockOutputStream::WrittenOffsetColumns written_offset_columns{}; + PlainMarksByName cached_marks; MergeTreeTransactionPtr txn; bool need_prefix; diff --git a/src/Storages/MergeTree/MergeTreeData.cpp b/src/Storages/MergeTree/MergeTreeData.cpp index 8611681a976..4ed8c67469d 100644 --- a/src/Storages/MergeTree/MergeTreeData.cpp +++ b/src/Storages/MergeTree/MergeTreeData.cpp @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -154,6 +155,7 @@ namespace namespace DB { + namespace Setting { extern const SettingsBool allow_drop_detached; @@ -229,6 +231,12 @@ namespace MergeTreeSetting extern const MergeTreeSettingsString storage_policy; extern const MergeTreeSettingsFloat zero_copy_concurrent_part_removal_max_postpone_ratio; extern const MergeTreeSettingsUInt64 zero_copy_concurrent_part_removal_max_split_times; + extern const MergeTreeSettingsBool prewarm_mark_cache; +} + +namespace 
ServerSetting +{ + extern const ServerSettingsDouble mark_cache_prewarm_ratio; } namespace ErrorCodes @@ -2335,6 +2343,55 @@ void MergeTreeData::stopOutdatedAndUnexpectedDataPartsLoadingTask() } } +void MergeTreeData::prewarmMarkCache(ThreadPool & pool) +{ + if (!(*getSettings())[MergeTreeSetting::prewarm_mark_cache]) + return; + + auto * mark_cache = getContext()->getMarkCache().get(); + if (!mark_cache) + return; + + auto metadata_snapshot = getInMemoryMetadataPtr(); + auto column_names = getColumnsToPrewarmMarks(*getSettings(), metadata_snapshot->getColumns().getAllPhysical()); + + if (column_names.empty()) + return; + + Stopwatch watch; + LOG_TRACE(log, "Prewarming mark cache"); + + auto data_parts = getDataPartsVectorForInternalUsage(); + + /// Prewarm the mark cache first for the most recent parts, according + /// to the time columns in the partition key (if present) and by modification time. + + auto to_tuple = [](const auto & part) + { + return std::make_tuple(part->getMinMaxDate().second, part->getMinMaxTime().second, part->modification_time); + }; + + std::sort(data_parts.begin(), data_parts.end(), [&to_tuple](const auto & lhs, const auto & rhs) + { + return to_tuple(lhs) > to_tuple(rhs); + }); + + ThreadPoolCallbackRunnerLocal runner(pool, "PrewarmMarks"); + double ratio_to_prewarm = getContext()->getServerSettings()[ServerSetting::mark_cache_prewarm_ratio]; + + for (const auto & part : data_parts) + { + if (mark_cache->sizeInBytes() >= mark_cache->maxSizeInBytes() * ratio_to_prewarm) + break; + + runner([&] { part->loadMarksToCache(column_names, mark_cache); }); + } + + runner.waitForAllToFinishAndRethrowFirstError(); + watch.stop(); + LOG_TRACE(log, "Prewarmed mark cache in {} seconds", watch.elapsedSeconds()); +} + /// Is the part directory old. /// True if its modification time and the modification time of all files inside it is less than the threshold. /// (Only files on the first level of nesting are considered). @@ -6371,6 +6428,12 @@ DetachedPartsInfo MergeTreeData::getDetachedParts() const for (const auto & disk : getDisks()) { + /// While it is possible to have detached parts on readonly/write-once disks + /// (if they were produced on another machine, where the disk wasn't readonly), + /// skip enumerating them to avoid wasting resources on slow disks. + if (disk->isReadOnly() || disk->isWriteOnce()) + continue; + String detached_path = fs::path(relative_data_path) / DETACHED_DIR_NAME; /// Note: we don't care about TOCTOU issue here. diff --git a/src/Storages/MergeTree/MergeTreeData.h b/src/Storages/MergeTree/MergeTreeData.h index 7a9730e8627..a32106f76bb 100644 --- a/src/Storages/MergeTree/MergeTreeData.h +++ b/src/Storages/MergeTree/MergeTreeData.h @@ -506,6 +506,9 @@ public: /// Load the set of data parts from disk. Call once - immediately after the object is created. void loadDataParts(bool skip_sanity_checks, std::optional> expected_parts); + /// Prewarm mark cache for the most recent data parts.
+ void prewarmMarkCache(ThreadPool & pool); + String getLogName() const { return log.loadName(); } Int64 getMaxBlockNumber() const; diff --git a/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp b/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp index 8b3c7bdf3fb..62ad9d4a52a 100644 --- a/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp +++ b/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp @@ -70,6 +70,7 @@ namespace MergeTreeSetting extern const MergeTreeSettingsBool ttl_only_drop_parts; extern const MergeTreeSettingsUInt64 parts_to_throw_insert; extern const MergeTreeSettingsMergeSelectorAlgorithm merge_selector_algorithm; + extern const MergeTreeSettingsBool merge_selector_enable_heuristic_to_remove_small_parts_at_right; } namespace ErrorCodes @@ -540,6 +541,8 @@ SelectPartsDecision MergeTreeDataMergerMutator::selectPartsToMergeFromRanges( /// Override value from table settings simple_merge_settings.window_size = (*data_settings)[MergeTreeSetting::merge_selector_window_size]; simple_merge_settings.max_parts_to_merge_at_once = (*data_settings)[MergeTreeSetting::max_parts_to_merge_at_once]; + simple_merge_settings.enable_heuristic_to_remove_small_parts_at_right = (*data_settings)[MergeTreeSetting::merge_selector_enable_heuristic_to_remove_small_parts_at_right]; + if (!(*data_settings)[MergeTreeSetting::min_age_to_force_merge_on_partition_only]) simple_merge_settings.min_age_to_force_merge = (*data_settings)[MergeTreeSetting::min_age_to_force_merge_seconds]; diff --git a/src/Storages/MergeTree/MergeTreeDataPartCompact.cpp b/src/Storages/MergeTree/MergeTreeDataPartCompact.cpp index fd46b3b9540..14c2da82de1 100644 --- a/src/Storages/MergeTree/MergeTreeDataPartCompact.cpp +++ b/src/Storages/MergeTree/MergeTreeDataPartCompact.cpp @@ -136,6 +136,32 @@ void MergeTreeDataPartCompact::loadIndexGranularity() loadIndexGranularityImpl(index_granularity, index_granularity_info, columns.size(), getDataPartStorage()); } +void MergeTreeDataPartCompact::loadMarksToCache(const Names & column_names, MarkCache * mark_cache) const +{ + if (column_names.empty() || !mark_cache) + return; + + auto context = storage.getContext(); + auto read_settings = context->getReadSettings(); + auto * load_marks_threadpool = read_settings.load_marks_asynchronously ? &context->getLoadMarksThreadpool() : nullptr; + auto info_for_read = std::make_shared(shared_from_this(), std::make_shared()); + + LOG_TEST(getLogger("MergeTreeDataPartCompact"), "Loading marks into mark cache for columns {} of part {}", toString(column_names), name); + + MergeTreeMarksLoader loader( + info_for_read, + mark_cache, + index_granularity_info.getMarksFilePath(DATA_FILE_NAME), + index_granularity.getMarksCount(), + index_granularity_info, + /*save_marks_in_cache=*/ true, + read_settings, + load_marks_threadpool, + columns.size()); + + loader.loadMarks(); +} + bool MergeTreeDataPartCompact::hasColumnFiles(const NameAndTypePair & column) const { if (!getColumnPosition(column.getNameInStorage())) @@ -230,7 +256,14 @@ bool MergeTreeDataPartCompact::isStoredOnRemoteDiskWithZeroCopySupport() const MergeTreeDataPartCompact::~MergeTreeDataPartCompact() { - removeIfNeeded(); + try + { + removeIfNeeded(); + } + catch (...) 
+ { + tryLogCurrentException(__PRETTY_FUNCTION__); + } } } diff --git a/src/Storages/MergeTree/MergeTreeDataPartCompact.h b/src/Storages/MergeTree/MergeTreeDataPartCompact.h index 9512485c54e..8e279571578 100644 --- a/src/Storages/MergeTree/MergeTreeDataPartCompact.h +++ b/src/Storages/MergeTree/MergeTreeDataPartCompact.h @@ -54,6 +54,8 @@ public: std::optional getFileNameForColumn(const NameAndTypePair & /* column */) const override { return DATA_FILE_NAME; } + void loadMarksToCache(const Names & column_names, MarkCache * mark_cache) const override; + ~MergeTreeDataPartCompact() override; protected: diff --git a/src/Storages/MergeTree/MergeTreeDataPartWide.cpp b/src/Storages/MergeTree/MergeTreeDataPartWide.cpp index 9bbf0ad9739..c515d645253 100644 --- a/src/Storages/MergeTree/MergeTreeDataPartWide.cpp +++ b/src/Storages/MergeTree/MergeTreeDataPartWide.cpp @@ -182,6 +182,47 @@ void MergeTreeDataPartWide::loadIndexGranularity() loadIndexGranularityImpl(index_granularity, index_granularity_info, getDataPartStorage(), *any_column_filename); } +void MergeTreeDataPartWide::loadMarksToCache(const Names & column_names, MarkCache * mark_cache) const +{ + if (column_names.empty() || !mark_cache) + return; + + std::vector> loaders; + + auto context = storage.getContext(); + auto read_settings = context->getReadSettings(); + auto * load_marks_threadpool = read_settings.load_marks_asynchronously ? &context->getLoadMarksThreadpool() : nullptr; + auto info_for_read = std::make_shared(shared_from_this(), std::make_shared()); + + LOG_TEST(getLogger("MergeTreeDataPartWide"), "Loading marks into mark cache for columns {} of part {}", toString(column_names), name); + + for (const auto & column_name : column_names) + { + auto serialization = getSerialization(column_name); + serialization->enumerateStreams([&](const auto & subpath) + { + auto stream_name = getStreamNameForColumn(column_name, subpath, checksums); + if (!stream_name) + return; + + loaders.emplace_back(std::make_unique( + info_for_read, + mark_cache, + index_granularity_info.getMarksFilePath(*stream_name), + index_granularity.getMarksCount(), + index_granularity_info, + /*save_marks_in_cache=*/ true, + read_settings, + load_marks_threadpool, + /*num_columns_in_mark=*/ 1)); + + loaders.back()->startAsyncLoad(); + }); + } + + for (auto & loader : loaders) + loader->loadMarks(); +} bool MergeTreeDataPartWide::isStoredOnRemoteDisk() const { @@ -200,7 +241,14 @@ bool MergeTreeDataPartWide::isStoredOnRemoteDiskWithZeroCopySupport() const MergeTreeDataPartWide::~MergeTreeDataPartWide() { - removeIfNeeded(); + try + { + removeIfNeeded(); + } + catch (...) 
+ { + tryLogCurrentException(__PRETTY_FUNCTION__); + } } void MergeTreeDataPartWide::doCheckConsistency(bool require_part_metadata) const diff --git a/src/Storages/MergeTree/MergeTreeDataPartWide.h b/src/Storages/MergeTree/MergeTreeDataPartWide.h index 42893f47573..022a5fb746c 100644 --- a/src/Storages/MergeTree/MergeTreeDataPartWide.h +++ b/src/Storages/MergeTree/MergeTreeDataPartWide.h @@ -51,6 +51,8 @@ public: std::optional getColumnModificationTime(const String & column_name) const override; + void loadMarksToCache(const Names & column_names, MarkCache * mark_cache) const override; + protected: static void loadIndexGranularityImpl( MergeTreeIndexGranularity & index_granularity_, MergeTreeIndexGranularityInfo & index_granularity_info_, diff --git a/src/Storages/MergeTree/MergeTreeDataPartWriterCompact.cpp b/src/Storages/MergeTree/MergeTreeDataPartWriterCompact.cpp index a859172023f..67a2c1ee9f1 100644 --- a/src/Storages/MergeTree/MergeTreeDataPartWriterCompact.cpp +++ b/src/Storages/MergeTree/MergeTreeDataPartWriterCompact.cpp @@ -1,5 +1,6 @@ #include #include +#include "Formats/MarkInCompressedFile.h" namespace DB { @@ -54,6 +55,11 @@ MergeTreeDataPartWriterCompact::MergeTreeDataPartWriterCompact( marks_source_hashing = std::make_unique(*marks_compressor); } + if (settings.save_marks_in_cache) + { + cached_marks[MergeTreeDataPartCompact::DATA_FILE_NAME] = std::make_unique(); + } + for (const auto & column : columns_list) { auto compression = getCodecDescOrDefault(column.name, default_codec); @@ -255,9 +261,12 @@ void MergeTreeDataPartWriterCompact::writeDataBlock(const Block & block, const G return &result_stream->hashing_buf; }; + MarkInCompressedFile mark{plain_hashing.count(), static_cast(0)}; + writeBinaryLittleEndian(mark.offset_in_compressed_file, marks_out); + writeBinaryLittleEndian(mark.offset_in_decompressed_block, marks_out); - writeBinaryLittleEndian(plain_hashing.count(), marks_out); - writeBinaryLittleEndian(static_cast(0), marks_out); + if (!cached_marks.empty()) + cached_marks.begin()->second->push_back(mark); writeColumnSingleGranule( block.getByName(name_and_type->name), getSerialization(name_and_type->name), @@ -296,11 +305,17 @@ void MergeTreeDataPartWriterCompact::fillDataChecksums(MergeTreeDataPartChecksum if (with_final_mark && data_written) { + MarkInCompressedFile mark{plain_hashing.count(), 0}; + for (size_t i = 0; i < columns_list.size(); ++i) { - writeBinaryLittleEndian(plain_hashing.count(), marks_out); - writeBinaryLittleEndian(static_cast(0), marks_out); + writeBinaryLittleEndian(mark.offset_in_compressed_file, marks_out); + writeBinaryLittleEndian(mark.offset_in_decompressed_block, marks_out); + + if (!cached_marks.empty()) + cached_marks.begin()->second->push_back(mark); } + writeBinaryLittleEndian(static_cast(0), marks_out); } diff --git a/src/Storages/MergeTree/MergeTreeDataPartWriterOnDisk.h b/src/Storages/MergeTree/MergeTreeDataPartWriterOnDisk.h index 8d84442981e..4a760c20b58 100644 --- a/src/Storages/MergeTree/MergeTreeDataPartWriterOnDisk.h +++ b/src/Storages/MergeTree/MergeTreeDataPartWriterOnDisk.h @@ -8,6 +8,7 @@ #include #include #include +#include namespace DB { diff --git a/src/Storages/MergeTree/MergeTreeDataPartWriterWide.cpp b/src/Storages/MergeTree/MergeTreeDataPartWriterWide.cpp index 459ddc1ca79..433c7c21613 100644 --- a/src/Storages/MergeTree/MergeTreeDataPartWriterWide.cpp +++ b/src/Storages/MergeTree/MergeTreeDataPartWriterWide.cpp @@ -6,6 +6,8 @@ #include #include #include +#include +#include #include #include @@ -105,6 +107,12 @@ 
MergeTreeDataPartWriterWide::MergeTreeDataPartWriterWide( indices_to_recalc_, stats_to_recalc_, marks_file_extension_, default_codec_, settings_, index_granularity_) { + if (settings.save_marks_in_cache) + { + auto columns_vec = getColumnsToPrewarmMarks(*storage_settings, columns_list); + columns_to_load_marks = NameSet(columns_vec.begin(), columns_vec.end()); + } + for (const auto & column : columns_list) { auto compression = getCodecDescOrDefault(column.name, default_codec); @@ -198,6 +206,9 @@ void MergeTreeDataPartWriterWide::addStreams( settings.marks_compress_block_size, query_write_settings); + if (columns_to_load_marks.contains(name_and_type.name)) + cached_marks.emplace(stream_name, std::make_unique()); + full_name_to_stream_name.emplace(full_stream_name, stream_name); stream_name_to_full_name.emplace(stream_name, full_stream_name); }; @@ -366,8 +377,12 @@ void MergeTreeDataPartWriterWide::flushMarkToFile(const StreamNameAndMark & stre writeBinaryLittleEndian(stream_with_mark.mark.offset_in_compressed_file, marks_out); writeBinaryLittleEndian(stream_with_mark.mark.offset_in_decompressed_block, marks_out); + if (settings.can_use_adaptive_granularity) writeBinaryLittleEndian(rows_in_mark, marks_out); + + if (auto it = cached_marks.find(stream_with_mark.stream_name); it != cached_marks.end()) + it->second->push_back(stream_with_mark.mark); } StreamsWithMarks MergeTreeDataPartWriterWide::getCurrentMarksForColumn( @@ -742,7 +757,6 @@ void MergeTreeDataPartWriterWide::fillChecksums(MergeTreeDataPartChecksums & che fillPrimaryIndexChecksums(checksums); fillSkipIndicesChecksums(checksums); - fillStatisticsChecksums(checksums); } @@ -756,7 +770,6 @@ void MergeTreeDataPartWriterWide::finish(bool sync) finishPrimaryIndexSerialization(sync); finishSkipIndicesSerialization(sync); - finishStatisticsSerialization(sync); } diff --git a/src/Storages/MergeTree/MergeTreeDataPartWriterWide.h b/src/Storages/MergeTree/MergeTreeDataPartWriterWide.h index ab86ed27c7e..68f016a7421 100644 --- a/src/Storages/MergeTree/MergeTreeDataPartWriterWide.h +++ b/src/Storages/MergeTree/MergeTreeDataPartWriterWide.h @@ -136,6 +136,9 @@ private: using MarksForColumns = std::unordered_map; MarksForColumns last_non_written_marks; + /// Set of columns to put marks in cache during write. + NameSet columns_to_load_marks; + /// How many rows we have already written in the current mark. /// More than zero when incoming blocks are smaller then their granularity. 
size_t rows_written_in_last_mark = 0; diff --git a/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp b/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp index d7305045a56..1b3c58000e7 100644 --- a/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp +++ b/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp @@ -71,10 +71,7 @@ namespace Setting extern const SettingsString force_data_skipping_indices; extern const SettingsBool force_index_by_date; extern const SettingsSeconds lock_acquire_timeout; - extern const SettingsUInt64 max_parser_backtracks; - extern const SettingsUInt64 max_parser_depth; extern const SettingsInt64 max_partitions_to_read; - extern const SettingsUInt64 max_query_size; extern const SettingsUInt64 max_threads_for_indexes; extern const SettingsNonZeroUInt64 max_parallel_replicas; extern const SettingsUInt64 merge_tree_coarse_index_granularity; @@ -640,20 +637,11 @@ RangesInDataParts MergeTreeDataSelectExecutor::filterPartsByPrimaryKeyAndSkipInd if (use_skip_indexes && settings[Setting::force_data_skipping_indices].changed) { - const auto & indices = settings[Setting::force_data_skipping_indices].toString(); - - Strings forced_indices; - { - Tokens tokens(indices.data(), indices.data() + indices.size(), settings[Setting::max_query_size]); - IParser::Pos pos( - tokens, static_cast(settings[Setting::max_parser_depth]), static_cast(settings[Setting::max_parser_backtracks])); - Expected expected; - if (!parseIdentifiersOrStringLiterals(pos, expected, forced_indices)) - throw Exception(ErrorCodes::CANNOT_PARSE_TEXT, "Cannot parse force_data_skipping_indices ('{}')", indices); - } + const auto & indices_str = settings[Setting::force_data_skipping_indices].toString(); + auto forced_indices = parseIdentifiersOrStringLiterals(indices_str, settings); if (forced_indices.empty()) - throw Exception(ErrorCodes::CANNOT_PARSE_TEXT, "No indices parsed from force_data_skipping_indices ('{}')", indices); + throw Exception(ErrorCodes::CANNOT_PARSE_TEXT, "No indices parsed from force_data_skipping_indices ('{}')", indices_str); std::unordered_set useful_indices_names; for (const auto & useful_index : skip_indexes.useful_indices) diff --git a/src/Storages/MergeTree/MergeTreeDataWriter.cpp b/src/Storages/MergeTree/MergeTreeDataWriter.cpp index 67fef759ed4..ac29a9244b0 100644 --- a/src/Storages/MergeTree/MergeTreeDataWriter.cpp +++ b/src/Storages/MergeTree/MergeTreeDataWriter.cpp @@ -73,6 +73,7 @@ namespace MergeTreeSetting extern const MergeTreeSettingsFloat min_free_disk_ratio_to_perform_insert; extern const MergeTreeSettingsBool optimize_row_order; extern const MergeTreeSettingsFloat ratio_of_defaults_for_sparse_serialization; + extern const MergeTreeSettingsBool prewarm_mark_cache; } namespace ErrorCodes @@ -684,6 +685,7 @@ MergeTreeDataWriter::TemporaryPart MergeTreeDataWriter::writeTempPartImpl( /// This effectively chooses minimal compression method: /// either default lz4 or compression method with zero thresholds on absolute and relative part size. auto compression_codec = data.getContext()->chooseCompressionCodec(0, 0); + bool save_marks_in_cache = (*data_settings)[MergeTreeSetting::prewarm_mark_cache] && data.getContext()->getMarkCache(); auto out = std::make_unique( new_data_part, @@ -693,8 +695,9 @@ MergeTreeDataWriter::TemporaryPart MergeTreeDataWriter::writeTempPartImpl( statistics, compression_codec, context->getCurrentTransaction() ? 
context->getCurrentTransaction()->tid : Tx::PrehistoricTID, - false, - false, + /*reset_columns=*/ false, + save_marks_in_cache, + /*blocks_are_granules_size=*/ false, context->getWriteSettings()); out->writeWithPermutation(block, perm_ptr); @@ -829,6 +832,7 @@ MergeTreeDataWriter::TemporaryPart MergeTreeDataWriter::writeProjectionPartImpl( /// This effectively chooses minimal compression method: /// either default lz4 or compression method with zero thresholds on absolute and relative part size. auto compression_codec = data.getContext()->chooseCompressionCodec(0, 0); + bool save_marks_in_cache = (*data.getSettings())[MergeTreeSetting::prewarm_mark_cache] && data.getContext()->getMarkCache(); auto out = std::make_unique( new_data_part, @@ -839,7 +843,10 @@ MergeTreeDataWriter::TemporaryPart MergeTreeDataWriter::writeProjectionPartImpl( ColumnsStatistics{}, compression_codec, Tx::PrehistoricTID, - false, false, data.getContext()->getWriteSettings()); + /*reset_columns=*/ false, + save_marks_in_cache, + /*blocks_are_granules_size=*/ false, + data.getContext()->getWriteSettings()); out->writeWithPermutation(block, perm_ptr); auto finalizer = out->finalizePartAsync(new_data_part, false); diff --git a/src/Storages/MergeTree/MergeTreeIOSettings.cpp b/src/Storages/MergeTree/MergeTreeIOSettings.cpp index 8b87c35b4e6..bacfbbd5720 100644 --- a/src/Storages/MergeTree/MergeTreeIOSettings.cpp +++ b/src/Storages/MergeTree/MergeTreeIOSettings.cpp @@ -34,6 +34,7 @@ MergeTreeWriterSettings::MergeTreeWriterSettings( const MergeTreeSettingsPtr & storage_settings, bool can_use_adaptive_granularity_, bool rewrite_primary_key_, + bool save_marks_in_cache_, bool blocks_are_granules_size_) : min_compress_block_size( (*storage_settings)[MergeTreeSetting::min_compress_block_size] ? 
(*storage_settings)[MergeTreeSetting::min_compress_block_size] : global_settings[Setting::min_compress_block_size]) @@ -46,6 +47,7 @@ MergeTreeWriterSettings::MergeTreeWriterSettings( , primary_key_compress_block_size((*storage_settings)[MergeTreeSetting::primary_key_compress_block_size]) , can_use_adaptive_granularity(can_use_adaptive_granularity_) , rewrite_primary_key(rewrite_primary_key_) + , save_marks_in_cache(save_marks_in_cache_) , blocks_are_granules_size(blocks_are_granules_size_) , query_write_settings(query_write_settings_) , low_cardinality_max_dictionary_size(global_settings[Setting::low_cardinality_max_dictionary_size]) diff --git a/src/Storages/MergeTree/MergeTreeIOSettings.h b/src/Storages/MergeTree/MergeTreeIOSettings.h index 66a648be6e0..4d1d2533729 100644 --- a/src/Storages/MergeTree/MergeTreeIOSettings.h +++ b/src/Storages/MergeTree/MergeTreeIOSettings.h @@ -61,7 +61,8 @@ struct MergeTreeWriterSettings const MergeTreeSettingsPtr & storage_settings, bool can_use_adaptive_granularity_, bool rewrite_primary_key_, - bool blocks_are_granules_size_ = false); + bool save_marks_in_cache_, + bool blocks_are_granules_size_); size_t min_compress_block_size; size_t max_compress_block_size; @@ -75,6 +76,7 @@ struct MergeTreeWriterSettings bool can_use_adaptive_granularity; bool rewrite_primary_key; + bool save_marks_in_cache; bool blocks_are_granules_size; WriteSettings query_write_settings; diff --git a/src/Storages/MergeTree/MergeTreeMarksLoader.cpp b/src/Storages/MergeTree/MergeTreeMarksLoader.cpp index 168134a329f..a271af578cc 100644 --- a/src/Storages/MergeTree/MergeTreeMarksLoader.cpp +++ b/src/Storages/MergeTree/MergeTreeMarksLoader.cpp @@ -3,10 +3,12 @@ #include #include #include +#include #include #include #include #include +#include #include @@ -21,6 +23,11 @@ namespace ProfileEvents namespace DB { +namespace MergeTreeSetting +{ + extern const MergeTreeSettingsString columns_to_prewarm_mark_cache; +} + namespace ErrorCodes { extern const int CANNOT_READ_ALL_DATA; @@ -211,6 +218,7 @@ MarkCache::MappedPtr MergeTreeMarksLoader::loadMarksSync() if (mark_cache) { auto key = MarkCache::hash(fs::path(data_part_storage->getFullPath()) / mrk_path); + if (save_marks_in_cache) { auto callback = [this] { return loadMarksImpl(); }; @@ -249,4 +257,25 @@ std::future MergeTreeMarksLoader::loadMarksAsync() "LoadMarksThread"); } +void addMarksToCache(const IMergeTreeDataPart & part, const PlainMarksByName & cached_marks, MarkCache * mark_cache) +{ + MemoryTrackerBlockerInThread temporarily_disable_memory_tracker; + + for (const auto & [stream_name, marks] : cached_marks) + { + auto mark_path = part.index_granularity_info.getMarksFilePath(stream_name); + auto key = MarkCache::hash(fs::path(part.getDataPartStorage().getFullPath()) / mark_path); + mark_cache->set(key, std::make_shared(*marks)); + } +} + +Names getColumnsToPrewarmMarks(const MergeTreeSettings & settings, const NamesAndTypesList & columns_list) +{ + auto columns_str = settings[MergeTreeSetting::columns_to_prewarm_mark_cache].toString(); + if (columns_str.empty()) + return columns_list.getNames(); + + return parseIdentifiersOrStringLiterals(columns_str, Context::getGlobalContextInstance()->getSettingsRef()); +} + } diff --git a/src/Storages/MergeTree/MergeTreeMarksLoader.h b/src/Storages/MergeTree/MergeTreeMarksLoader.h index 9c28cc65fdf..e031700d6a7 100644 --- a/src/Storages/MergeTree/MergeTreeMarksLoader.h +++ b/src/Storages/MergeTree/MergeTreeMarksLoader.h @@ -77,4 +77,13 @@ private: using MergeTreeMarksLoaderPtr = 
std::shared_ptr; +class IMergeTreeDataPart; +struct MergeTreeSettings; + +/// Adds computed marks for part to the marks cache. +void addMarksToCache(const IMergeTreeDataPart & part, const PlainMarksByName & cached_marks, MarkCache * mark_cache); + +/// Returns the list of columns suitable for prewarming of mark cache according to settings. +Names getColumnsToPrewarmMarks(const MergeTreeSettings & settings, const NamesAndTypesList & columns_list); + } diff --git a/src/Storages/MergeTree/MergeTreeReaderWide.cpp b/src/Storages/MergeTree/MergeTreeReaderWide.cpp index 898bf5a2933..77231d8d392 100644 --- a/src/Storages/MergeTree/MergeTreeReaderWide.cpp +++ b/src/Storages/MergeTree/MergeTreeReaderWide.cpp @@ -262,7 +262,7 @@ MergeTreeReaderWide::FileStreams::iterator MergeTreeReaderWide::addStream(const /*num_columns_in_mark=*/ 1); auto stream_settings = settings; - stream_settings.is_low_cardinality_dictionary = substream_path.size() > 1 && substream_path[substream_path.size() - 2].type == ISerialization::Substream::Type::DictionaryKeys; + stream_settings.is_low_cardinality_dictionary = ISerialization::isLowCardinalityDictionarySubcolumn(substream_path); auto create_stream = [&]() { diff --git a/src/Storages/MergeTree/MergeTreeSettings.cpp b/src/Storages/MergeTree/MergeTreeSettings.cpp index 8c6aafe48f2..883191d59ab 100644 --- a/src/Storages/MergeTree/MergeTreeSettings.cpp +++ b/src/Storages/MergeTree/MergeTreeSettings.cpp @@ -30,10 +30,11 @@ namespace ErrorCodes extern const int BAD_ARGUMENTS; } +// clang-format off + /** These settings represent fine tunes for internal details of MergeTree storages * and should not be changed by the user without a reason. */ - #define MERGE_TREE_SETTINGS(DECLARE, ALIAS) \ DECLARE(UInt64, min_compress_block_size, 0, "When granule is written, compress the data in buffer if the size of pending uncompressed data is larger or equal than the specified threshold. If this setting is not set, the corresponding global setting is used.", 0) \ DECLARE(UInt64, max_compress_block_size, 0, "Compress the pending uncompressed data in buffer if its size is larger or equal than the specified threshold. Block of data will be compressed even if the current granule is not finished. If this setting is not set, the corresponding global setting is used.", 0) \ @@ -88,7 +89,7 @@ namespace ErrorCodes DECLARE(UInt64, min_age_to_force_merge_seconds, 0, "If all parts in a certain range are older than this value, range will be always eligible for merging. Set to 0 to disable.", 0) \ DECLARE(Bool, min_age_to_force_merge_on_partition_only, false, "Whether min_age_to_force_merge_seconds should be applied only on the entire partition and not on subset.", false) \ DECLARE(UInt64, number_of_free_entries_in_pool_to_execute_optimize_entire_partition, 25, "When there is less than specified number of free entries in pool, do not try to execute optimize entire partition with a merge (this merge is created when set min_age_to_force_merge_seconds > 0 and min_age_to_force_merge_on_partition_only = true). 
This is to leave free threads for regular merges and avoid \"Too many parts\"", 0) \ - DECLARE(Bool, remove_rolled_back_parts_immediately, 1, "Setting for an incomplete experimental feature.", 0) \ + DECLARE(Bool, remove_rolled_back_parts_immediately, 1, "Setting for an incomplete experimental feature.", EXPERIMENTAL) \ DECLARE(UInt64, replicated_max_mutations_in_one_entry, 10000, "Max number of mutation commands that can be merged together and executed in one MUTATE_PART entry (0 means unlimited)", 0) \ DECLARE(UInt64, number_of_mutations_to_delay, 500, "If table has at least that many unfinished mutations, artificially slow down mutations of table. Disabled if set to 0", 0) \ DECLARE(UInt64, number_of_mutations_to_throw, 1000, "If table has at least that many unfinished mutations, throw 'Too many mutations' exception. Disabled if set to 0", 0) \ @@ -98,7 +99,8 @@ namespace ErrorCodes DECLARE(String, merge_workload, "", "Name of workload to be used to access resources for merges", 0) \ DECLARE(String, mutation_workload, "", "Name of workload to be used to access resources for mutations", 0) \ DECLARE(Milliseconds, background_task_preferred_step_execution_time_ms, 50, "Target time to execution of one step of merge or mutation. Can be exceeded if one step takes longer time", 0) \ - DECLARE(MergeSelectorAlgorithm, merge_selector_algorithm, MergeSelectorAlgorithm::SIMPLE, "The algorithm to select parts for merges assignment", 0) \ + DECLARE(MergeSelectorAlgorithm, merge_selector_algorithm, MergeSelectorAlgorithm::SIMPLE, "The algorithm to select parts for merges assignment", EXPERIMENTAL) \ + DECLARE(Bool, merge_selector_enable_heuristic_to_remove_small_parts_at_right, true, "Enable heuristic for selecting parts for merge which removes parts from right side of range, if their size is less than specified ratio (0.01) of sum_size. Works for Simple and StochasticSimple merge selectors", 0) \ \ /** Inserts settings. */ \ DECLARE(UInt64, parts_to_delay_insert, 1000, "If table contains at least that many active parts in single partition, artificially slow down insert into table. Disabled if set to 0", 0) \ @@ -214,14 +216,14 @@ namespace ErrorCodes DECLARE(Bool, enable_block_offset_column, false, "Enable persisting column _block_offset for each row.", 0) \ \ /** Experimental/work in progress feature. Unsafe for production. */ \ - DECLARE(UInt64, part_moves_between_shards_enable, 0, "Experimental/Incomplete feature to move parts between shards. Does not take into account sharding expressions.", 0) \ - DECLARE(UInt64, part_moves_between_shards_delay_seconds, 30, "Time to wait before/after moving parts between shards.", 0) \ - DECLARE(Bool, allow_remote_fs_zero_copy_replication, false, "Don't use this setting in production, because it is not ready.", 0) \ - DECLARE(String, remote_fs_zero_copy_zookeeper_path, "/clickhouse/zero_copy", "ZooKeeper path for zero-copy table-independent info.", 0) \ - DECLARE(Bool, remote_fs_zero_copy_path_compatible_mode, false, "Run zero-copy in compatible mode during conversion process.", 0) \ - DECLARE(Bool, cache_populated_by_fetch, false, "Only available in ClickHouse Cloud", 0) \ - DECLARE(Bool, force_read_through_cache_for_merges, false, "Force read-through filesystem cache for merges", 0) \ - DECLARE(Bool, allow_experimental_replacing_merge_with_cleanup, false, "Allow experimental CLEANUP merges for ReplacingMergeTree with is_deleted column.", 0) \ + DECLARE(UInt64, part_moves_between_shards_enable, 0, "Experimental/Incomplete feature to move parts between shards. 
Does not take into account sharding expressions.", EXPERIMENTAL) \ + DECLARE(UInt64, part_moves_between_shards_delay_seconds, 30, "Time to wait before/after moving parts between shards.", EXPERIMENTAL) \ + DECLARE(Bool, allow_remote_fs_zero_copy_replication, false, "Don't use this setting in production, because it is not ready.", BETA) \ + DECLARE(String, remote_fs_zero_copy_zookeeper_path, "/clickhouse/zero_copy", "ZooKeeper path for zero-copy table-independent info.", EXPERIMENTAL) \ + DECLARE(Bool, remote_fs_zero_copy_path_compatible_mode, false, "Run zero-copy in compatible mode during conversion process.", EXPERIMENTAL) \ + DECLARE(Bool, cache_populated_by_fetch, false, "Only available in ClickHouse Cloud", EXPERIMENTAL) \ + DECLARE(Bool, force_read_through_cache_for_merges, false, "Force read-through filesystem cache for merges", EXPERIMENTAL) \ + DECLARE(Bool, allow_experimental_replacing_merge_with_cleanup, false, "Allow experimental CLEANUP merges for ReplacingMergeTree with is_deleted column.", EXPERIMENTAL) \ \ /** Compress marks and primary key. */ \ DECLARE(Bool, compress_marks, true, "Marks support compression, reduce mark file size and speed up network transmission.", 0) \ @@ -232,13 +234,15 @@ namespace ErrorCodes DECLARE(UInt64, primary_key_compress_block_size, 65536, "Primary compress block size, the actual size of the block to compress.", 0) \ DECLARE(Bool, primary_key_lazy_load, true, "Load primary key in memory on first use instead of on table initialization. This can save memory in the presence of a large number of tables.", 0) \ DECLARE(Float, primary_key_ratio_of_unique_prefix_values_to_skip_suffix_columns, 0.9f, "If the value of a column of the primary key in data part changes at least in this ratio of times, skip loading next columns in memory. This allows to save memory usage by not loading useless columns of the primary key.", 0) \ + DECLARE(Bool, prewarm_mark_cache, false, "If true mark cache will be prewarmed by saving marks to mark cache on inserts, merges, fetches and on startup of server", 0) \ + DECLARE(String, columns_to_prewarm_mark_cache, "", "List of columns to prewarm mark cache for (if enabled). Empty means all columns", 0) \ /** Projection settings. */ \ DECLARE(UInt64, max_projections, 25, "The maximum number of merge tree projections.", 0) \ DECLARE(LightweightMutationProjectionMode, lightweight_mutation_projection_mode, LightweightMutationProjectionMode::THROW, "When lightweight delete happens on a table with projection(s), the possible operations include throw the exception as projection exists, or drop projections of this table's relevant parts, or rebuild the projections.", 0) \ DECLARE(DeduplicateMergeProjectionMode, deduplicate_merge_projection_mode, DeduplicateMergeProjectionMode::THROW, "Whether to allow create projection for the table with non-classic MergeTree. Ignore option is purely for compatibility which might result in incorrect answer. Otherwise, if allowed, what is the action when merge, drop or rebuild.", 0) \ #define MAKE_OBSOLETE_MERGE_TREE_SETTING(M, TYPE, NAME, DEFAULT) \ - M(TYPE, NAME, DEFAULT, "Obsolete setting, does nothing.", BaseSettingsHelpers::Flags::OBSOLETE) + M(TYPE, NAME, DEFAULT, "Obsolete setting, does nothing.", SettingsTierType::OBSOLETE) #define OBSOLETE_MERGE_TREE_SETTINGS(M, ALIAS) \ /** Obsolete settings that do nothing but left for compatibility reasons. 
*/ \ @@ -276,8 +280,9 @@ namespace ErrorCodes MERGE_TREE_SETTINGS(M, ALIAS) \ OBSOLETE_MERGE_TREE_SETTINGS(M, ALIAS) -DECLARE_SETTINGS_TRAITS(MergeTreeSettingsTraits, LIST_OF_MERGE_TREE_SETTINGS) +// clang-format on +DECLARE_SETTINGS_TRAITS(MergeTreeSettingsTraits, LIST_OF_MERGE_TREE_SETTINGS) /** Settings for the MergeTree family of engines. * Could be loaded from config or from a CREATE TABLE query (SETTINGS clause). @@ -648,7 +653,8 @@ void MergeTreeSettings::dumpToSystemMergeTreeSettingsColumns(MutableColumnsAndCo res_columns[5]->insert(max); res_columns[6]->insert(writability == SettingConstraintWritability::CONST); res_columns[7]->insert(setting.getTypeName()); - res_columns[8]->insert(setting.isObsolete()); + res_columns[8]->insert(setting.getTier() == SettingsTierType::OBSOLETE); + res_columns[9]->insert(setting.getTier()); } } diff --git a/src/Storages/MergeTree/MergeTreeSink.cpp b/src/Storages/MergeTree/MergeTreeSink.cpp index 1e42f16736d..604112c26ea 100644 --- a/src/Storages/MergeTree/MergeTreeSink.cpp +++ b/src/Storages/MergeTree/MergeTreeSink.cpp @@ -243,6 +243,15 @@ void MergeTreeSink::finishDelayedChunk() /// Part can be deduplicated, so increment counters and add to part log only if it's really added if (added) { + if (auto * mark_cache = storage.getContext()->getMarkCache().get()) + { + for (const auto & stream : partition.temp_part.streams) + { + auto marks = stream.stream->releaseCachedMarks(); + addMarksToCache(*part, marks, mark_cache); + } + } + auto counters_snapshot = std::make_shared(partition.part_counters.getPartiallyAtomicSnapshot()); PartLog::addNewPart(storage.getContext(), PartLog::PartLogEntry(part, partition.elapsed_ns, counters_snapshot)); StorageMergeTree::incrementInsertedPartsProfileEvent(part->getType()); diff --git a/src/Storages/MergeTree/MergedBlockOutputStream.cpp b/src/Storages/MergeTree/MergedBlockOutputStream.cpp index 4ee68580d3f..77c34aae30a 100644 --- a/src/Storages/MergeTree/MergedBlockOutputStream.cpp +++ b/src/Storages/MergeTree/MergedBlockOutputStream.cpp @@ -25,6 +25,7 @@ MergedBlockOutputStream::MergedBlockOutputStream( CompressionCodecPtr default_codec_, TransactionID tid, bool reset_columns_, + bool save_marks_in_cache, bool blocks_are_granules_size, const WriteSettings & write_settings_, const MergeTreeIndexGranularity & computed_index_granularity) @@ -39,6 +40,7 @@ MergedBlockOutputStream::MergedBlockOutputStream( storage_settings, data_part->index_granularity_info.mark_type.adaptive, /* rewrite_primary_key = */ true, + save_marks_in_cache, blocks_are_granules_size); /// TODO: looks like isStoredOnDisk() is always true for MergeTreeDataPart diff --git a/src/Storages/MergeTree/MergedBlockOutputStream.h b/src/Storages/MergeTree/MergedBlockOutputStream.h index e212fe5bb5a..060778866e0 100644 --- a/src/Storages/MergeTree/MergedBlockOutputStream.h +++ b/src/Storages/MergeTree/MergedBlockOutputStream.h @@ -24,6 +24,7 @@ public: CompressionCodecPtr default_codec_, TransactionID tid, bool reset_columns_ = false, + bool save_marks_in_cache = false, bool blocks_are_granules_size = false, const WriteSettings & write_settings = {}, const MergeTreeIndexGranularity & computed_index_granularity = {}); diff --git a/src/Storages/MergeTree/MergedColumnOnlyOutputStream.cpp b/src/Storages/MergeTree/MergedColumnOnlyOutputStream.cpp index 05cd77dcd40..bed539dfe02 100644 --- a/src/Storages/MergeTree/MergedColumnOnlyOutputStream.cpp +++ b/src/Storages/MergeTree/MergedColumnOnlyOutputStream.cpp @@ -19,6 +19,7 @@ 
MergedColumnOnlyOutputStream::MergedColumnOnlyOutputStream( const MergeTreeIndices & indices_to_recalc, const ColumnsStatistics & stats_to_recalc_, WrittenOffsetColumns * offset_columns_, + bool save_marks_in_cache, const MergeTreeIndexGranularity & index_granularity, const MergeTreeIndexGranularityInfo * index_granularity_info) : IMergedBlockOutputStream(data_part->storage.getSettings(), data_part->getDataPartStoragePtr(), metadata_snapshot_, columns_list_, /*reset_columns=*/ true) @@ -30,7 +31,9 @@ MergedColumnOnlyOutputStream::MergedColumnOnlyOutputStream( data_part->storage.getContext()->getWriteSettings(), storage_settings, index_granularity_info ? index_granularity_info->mark_type.adaptive : data_part->storage.canUseAdaptiveGranularity(), - /* rewrite_primary_key = */ false); + /* rewrite_primary_key = */ false, + save_marks_in_cache, + /* blocks_are_granules_size = */ false); writer = createMergeTreeDataPartWriter( data_part->getType(), diff --git a/src/Storages/MergeTree/MergedColumnOnlyOutputStream.h b/src/Storages/MergeTree/MergedColumnOnlyOutputStream.h index e837a62743e..f6bf9e37a58 100644 --- a/src/Storages/MergeTree/MergedColumnOnlyOutputStream.h +++ b/src/Storages/MergeTree/MergedColumnOnlyOutputStream.h @@ -22,6 +22,7 @@ public: const MergeTreeIndices & indices_to_recalc_, const ColumnsStatistics & stats_to_recalc_, WrittenOffsetColumns * offset_columns_ = nullptr, + bool save_marks_in_cache = false, const MergeTreeIndexGranularity & index_granularity = {}, const MergeTreeIndexGranularityInfo * index_granularity_info_ = nullptr); diff --git a/src/Storages/MergeTree/MutateTask.cpp b/src/Storages/MergeTree/MutateTask.cpp index ee87051371c..753b0c5d2fe 100644 --- a/src/Storages/MergeTree/MutateTask.cpp +++ b/src/Storages/MergeTree/MutateTask.cpp @@ -1623,6 +1623,7 @@ private: ctx->compression_codec, ctx->txn ? 
ctx->txn->tid : Tx::PrehistoricTID, /*reset_columns=*/ true, + /*save_marks_in_cache=*/ false, /*blocks_are_granules_size=*/ false, ctx->context->getWriteSettings(), computed_granularity); @@ -1851,6 +1852,7 @@ private: std::vector(ctx->indices_to_recalc.begin(), ctx->indices_to_recalc.end()), ColumnsStatistics(ctx->stats_to_recalc.begin(), ctx->stats_to_recalc.end()), nullptr, + /*save_marks_in_cache=*/ false, ctx->source_part->index_granularity, &ctx->source_part->index_granularity_info ); diff --git a/src/Storages/MergeTree/ReplicatedMergeTreeAttachThread.cpp b/src/Storages/MergeTree/ReplicatedMergeTreeAttachThread.cpp index 22b8ccca151..c258048354e 100644 --- a/src/Storages/MergeTree/ReplicatedMergeTreeAttachThread.cpp +++ b/src/Storages/MergeTree/ReplicatedMergeTreeAttachThread.cpp @@ -1,5 +1,6 @@ #include #include +#include #include #include @@ -20,7 +21,6 @@ namespace ErrorCodes { extern const int SUPPORT_IS_DISABLED; extern const int REPLICA_STATUS_CHANGED; - extern const int LOGICAL_ERROR; } ReplicatedMergeTreeAttachThread::ReplicatedMergeTreeAttachThread(StorageReplicatedMergeTree & storage_) @@ -123,67 +123,6 @@ void ReplicatedMergeTreeAttachThread::checkHasReplicaMetadataInZooKeeper(const z } } -Int32 ReplicatedMergeTreeAttachThread::fixReplicaMetadataVersionIfNeeded(zkutil::ZooKeeperPtr zookeeper) -{ - const String & zookeeper_path = storage.zookeeper_path; - const String & replica_path = storage.replica_path; - const bool replica_readonly = storage.is_readonly; - - for (size_t i = 0; i != 2; ++i) - { - String replica_metadata_version_str; - const bool replica_metadata_version_exists = zookeeper->tryGet(replica_path + "/metadata_version", replica_metadata_version_str); - if (!replica_metadata_version_exists) - return -1; - - const Int32 metadata_version = parse(replica_metadata_version_str); - - if (metadata_version != 0 || replica_readonly) - { - /// No need to fix anything - return metadata_version; - } - - Coordination::Stat stat; - zookeeper->get(fs::path(zookeeper_path) / "metadata", &stat); - if (stat.version == 0) - { - /// No need to fix anything - return metadata_version; - } - - ReplicatedMergeTreeQueue & queue = storage.queue; - queue.pullLogsToQueue(zookeeper); - if (queue.getStatus().metadata_alters_in_queue != 0) - { - LOG_DEBUG(log, "No need to update metadata_version as there are ALTER_METADATA entries in the queue"); - return metadata_version; - } - - const Coordination::Requests ops = { - zkutil::makeSetRequest(fs::path(replica_path) / "metadata_version", std::to_string(stat.version), 0), - zkutil::makeCheckRequest(fs::path(zookeeper_path) / "metadata", stat.version), - }; - Coordination::Responses ops_responses; - const auto code = zookeeper->tryMulti(ops, ops_responses); - if (code == Coordination::Error::ZOK) - { - LOG_DEBUG(log, "Successfully set metadata_version to {}", stat.version); - return stat.version; - } - if (code != Coordination::Error::ZBADVERSION) - { - throw zkutil::KeeperException(code); - } - } - - /// Second attempt is only possible if metadata_version != 0 or metadata.version changed during the first attempt. - /// If metadata_version != 0, on second attempt we will return the new metadata_version. - /// If metadata.version changed, on second attempt we will either get metadata_version != 0 and return the new metadata_version or we will get metadata_alters_in_queue != 0 and return 0. - /// Either way, on second attempt this method should return. 
- throw Exception(ErrorCodes::LOGICAL_ERROR, "Failed to fix replica metadata_version in ZooKeeper after two attempts"); -} - void ReplicatedMergeTreeAttachThread::runImpl() { storage.setZooKeeper(); @@ -227,33 +166,6 @@ void ReplicatedMergeTreeAttachThread::runImpl() /// Just in case it was not removed earlier due to connection loss zookeeper->tryRemove(replica_path + "/flags/force_restore_data"); - const Int32 replica_metadata_version = fixReplicaMetadataVersionIfNeeded(zookeeper); - const bool replica_metadata_version_exists = replica_metadata_version != -1; - if (replica_metadata_version_exists) - { - storage.setInMemoryMetadata(metadata_snapshot->withMetadataVersion(replica_metadata_version)); - } - else - { - /// Table was created before 20.4 and was never altered, - /// let's initialize replica metadata version from global metadata version. - Coordination::Stat table_metadata_version_stat; - zookeeper->get(zookeeper_path + "/metadata", &table_metadata_version_stat); - - Coordination::Requests ops; - ops.emplace_back(zkutil::makeCheckRequest(zookeeper_path + "/metadata", table_metadata_version_stat.version)); - ops.emplace_back(zkutil::makeCreateRequest(replica_path + "/metadata_version", toString(table_metadata_version_stat.version), zkutil::CreateMode::Persistent)); - - Coordination::Responses res; - auto code = zookeeper->tryMulti(ops, res); - - if (code == Coordination::Error::ZBADVERSION) - throw Exception(ErrorCodes::REPLICA_STATUS_CHANGED, "Failed to initialize metadata_version " - "because table was concurrently altered, will retry"); - - zkutil::KeeperMultiException::check(code, ops, res); - } - storage.checkTableStructure(replica_path, metadata_snapshot); storage.checkParts(skip_sanity_checks); diff --git a/src/Storages/MergeTree/ReplicatedMergeTreeAttachThread.h b/src/Storages/MergeTree/ReplicatedMergeTreeAttachThread.h index bfc97442598..250a5ed34d1 100644 --- a/src/Storages/MergeTree/ReplicatedMergeTreeAttachThread.h +++ b/src/Storages/MergeTree/ReplicatedMergeTreeAttachThread.h @@ -48,8 +48,6 @@ private: void runImpl(); void finalizeInitialization(); - - Int32 fixReplicaMetadataVersionIfNeeded(zkutil::ZooKeeperPtr zookeeper); }; } diff --git a/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp b/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp index 6b1581645f8..b1564b58a6c 100644 --- a/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp +++ b/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp @@ -615,7 +615,7 @@ std::pair ReplicatedMergeTreeQueue::pullLogsToQueue(zkutil::Zo { std::lock_guard lock(pull_logs_to_queue_mutex); - if (reason != LOAD) + if (reason != LOAD && reason != FIX_METADATA_VERSION) { /// It's totally ok to load queue on readonly replica (that's what RestartingThread does on initialization). /// It's ok if replica became readonly due to connection loss after we got current zookeeper (in this case zookeeper must be expired). 
diff --git a/src/Storages/MergeTree/ReplicatedMergeTreeQueue.h b/src/Storages/MergeTree/ReplicatedMergeTreeQueue.h index 9d3349663e2..6ec8818b0c6 100644 --- a/src/Storages/MergeTree/ReplicatedMergeTreeQueue.h +++ b/src/Storages/MergeTree/ReplicatedMergeTreeQueue.h @@ -334,6 +334,7 @@ public: UPDATE, MERGE_PREDICATE, SYNC, + FIX_METADATA_VERSION, OTHER, }; diff --git a/src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.cpp b/src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.cpp index 9d3e26cdc8d..93124e634bd 100644 --- a/src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.cpp +++ b/src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.cpp @@ -29,6 +29,8 @@ namespace MergeTreeSetting namespace ErrorCodes { extern const int REPLICA_IS_ALREADY_ACTIVE; + extern const int REPLICA_STATUS_CHANGED; + extern const int LOGICAL_ERROR; } namespace FailPoints @@ -207,6 +209,36 @@ bool ReplicatedMergeTreeRestartingThread::tryStartup() throw; } + const Int32 replica_metadata_version = fixReplicaMetadataVersionIfNeeded(zookeeper); + const bool replica_metadata_version_exists = replica_metadata_version != -1; + if (replica_metadata_version_exists) + { + storage.setInMemoryMetadata(storage.getInMemoryMetadataPtr()->withMetadataVersion(replica_metadata_version)); + } + else + { + /// Table was created before 20.4 and was never altered, + /// let's initialize replica metadata version from global metadata version. + + const String & zookeeper_path = storage.zookeeper_path, & replica_path = storage.replica_path; + + Coordination::Stat table_metadata_version_stat; + zookeeper->get(zookeeper_path + "/metadata", &table_metadata_version_stat); + + Coordination::Requests ops; + ops.emplace_back(zkutil::makeCheckRequest(zookeeper_path + "/metadata", table_metadata_version_stat.version)); + ops.emplace_back(zkutil::makeCreateRequest(replica_path + "/metadata_version", toString(table_metadata_version_stat.version), zkutil::CreateMode::Persistent)); + + Coordination::Responses res; + auto code = zookeeper->tryMulti(ops, res); + + if (code == Coordination::Error::ZBADVERSION) + throw Exception(ErrorCodes::REPLICA_STATUS_CHANGED, "Failed to initialize metadata_version " + "because table was concurrently altered, will retry"); + + zkutil::KeeperMultiException::check(code, ops, res); + } + storage.queue.removeCurrentPartsFromMutations(); storage.last_queue_update_finish_time.store(time(nullptr)); @@ -424,4 +456,64 @@ void ReplicatedMergeTreeRestartingThread::setNotReadonly() storage.readonly_start_time.store(0, std::memory_order_relaxed); } + +Int32 ReplicatedMergeTreeRestartingThread::fixReplicaMetadataVersionIfNeeded(zkutil::ZooKeeperPtr zookeeper) +{ + const String & zookeeper_path = storage.zookeeper_path; + const String & replica_path = storage.replica_path; + + const size_t num_attempts = 2; + for (size_t attempt = 0; attempt != num_attempts; ++attempt) + { + String replica_metadata_version_str; + Coordination::Stat replica_stat; + const bool replica_metadata_version_exists = zookeeper->tryGet(replica_path + "/metadata_version", replica_metadata_version_str, &replica_stat); + if (!replica_metadata_version_exists) + return -1; + + const Int32 metadata_version = parse(replica_metadata_version_str); + if (metadata_version != 0) + return metadata_version; + + Coordination::Stat table_stat; + zookeeper->get(fs::path(zookeeper_path) / "metadata", &table_stat); + if (table_stat.version == 0) + return metadata_version; + + ReplicatedMergeTreeQueue & queue = storage.queue; + 
queue.pullLogsToQueue(zookeeper, {}, ReplicatedMergeTreeQueue::FIX_METADATA_VERSION); + if (queue.getStatus().metadata_alters_in_queue != 0) + { + LOG_INFO(log, "Skipping updating metadata_version as there are ALTER_METADATA entries in the queue"); + return metadata_version; + } + + const Coordination::Requests ops = { + zkutil::makeSetRequest(fs::path(replica_path) / "metadata_version", std::to_string(table_stat.version), replica_stat.version), + zkutil::makeCheckRequest(fs::path(zookeeper_path) / "metadata", table_stat.version), + }; + Coordination::Responses ops_responses; + const Coordination::Error code = zookeeper->tryMulti(ops, ops_responses); + if (code == Coordination::Error::ZOK) + { + LOG_DEBUG(log, "Successfully set metadata_version to {}", table_stat.version); + return table_stat.version; + } + + if (code == Coordination::Error::ZBADVERSION) + { + LOG_WARNING(log, "Cannot fix metadata_version because either metadata.version or metadata_version.version changed, attempts left = {}", num_attempts - attempt - 1); + continue; + } + + throw zkutil::KeeperException(code); + } + + /// Second attempt is only possible if either metadata_version.version or metadata.version changed during the first attempt. + /// If metadata_version changed to non-zero value during the first attempt, on second attempt we will return the new metadata_version. + /// If metadata.version changed during first attempt, on second attempt we will either get metadata_version != 0 and return the new metadata_version or we will get metadata_alters_in_queue != 0 and return 0. + /// So either first or second attempt should return unless metadata_version was rewritten from 0 to 0 during the first attempt which is highly unlikely. + throw Exception(ErrorCodes::LOGICAL_ERROR, "Failed to fix replica metadata_version in ZooKeeper after two attempts"); +} + } diff --git a/src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.h b/src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.h index d719505ae5e..6f450dc1d40 100644 --- a/src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.h +++ b/src/Storages/MergeTree/ReplicatedMergeTreeRestartingThread.h @@ -6,6 +6,7 @@ #include #include #include +#include namespace DB @@ -68,6 +69,9 @@ private: /// Disable readonly mode for table void setNotReadonly(); + + /// Fix replica metadata_version if needed + Int32 fixReplicaMetadataVersionIfNeeded(zkutil::ZooKeeperPtr zookeeper); }; diff --git a/src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp b/src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp index 1ba04fc460d..f1b0e5ec385 100644 --- a/src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp +++ b/src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp @@ -481,6 +481,17 @@ void ReplicatedMergeTreeSinkImpl::finishDelayedChunk(const ZooKeeperWithF /// Set a special error code if the block is duplicate int error = (deduplicate && deduplicated) ? 
ErrorCodes::INSERT_WAS_DEDUPLICATED : 0; + auto * mark_cache = storage.getContext()->getMarkCache().get(); + + if (!error && mark_cache) + { + for (const auto & stream : partition.temp_part.streams) + { + auto marks = stream.stream->releaseCachedMarks(); + addMarksToCache(*part, marks, mark_cache); + } + } + auto counters_snapshot = std::make_shared(partition.part_counters.getPartiallyAtomicSnapshot()); PartLog::addNewPart(storage.getContext(), PartLog::PartLogEntry(part, partition.elapsed_ns, counters_snapshot), ExecutionStatus(error)); StorageReplicatedMergeTree::incrementInsertedPartsProfileEvent(part->getType()); @@ -521,8 +532,18 @@ void ReplicatedMergeTreeSinkImpl::finishDelayedChunk(const ZooKeeperWithFa { partition.temp_part.finalize(); auto conflict_block_ids = commitPart(zookeeper, partition.temp_part.part, partition.block_id, delayed_chunk->replicas_num).first; + if (conflict_block_ids.empty()) { + if (auto * mark_cache = storage.getContext()->getMarkCache().get()) + { + for (const auto & stream : partition.temp_part.streams) + { + auto marks = stream.stream->releaseCachedMarks(); + addMarksToCache(*partition.temp_part.part, marks, mark_cache); + } + } + auto counters_snapshot = std::make_shared(partition.part_counters.getPartiallyAtomicSnapshot()); PartLog::addNewPart( storage.getContext(), diff --git a/src/Storages/NATS/NATSSource.cpp b/src/Storages/NATS/NATSSource.cpp index 23ee0da0642..cfdb507c582 100644 --- a/src/Storages/NATS/NATSSource.cpp +++ b/src/Storages/NATS/NATSSource.cpp @@ -107,21 +107,18 @@ Chunk NATSSource::generate() storage.getFormatName(), empty_buf, non_virtual_header, context, max_block_size, std::nullopt, 1); std::optional exception_message; size_t total_rows = 0; - auto on_error = [&](const MutableColumns & result_columns, Exception & e) + auto on_error = [&](const MutableColumns & result_columns, const ColumnCheckpoints & checkpoints, Exception & e) { if (handle_error_mode == StreamingHandleErrorMode::STREAM) { exception_message = e.message(); - for (const auto & column : result_columns) + for (size_t i = 0; i < result_columns.size(); ++i) { - // We could already push some rows to result_columns - // before exception, we need to fix it. - auto cur_rows = column->size(); - if (cur_rows > total_rows) - column->popBack(cur_rows - total_rows); + // We could already push some rows to result_columns before exception, we need to fix it. + result_columns[i]->rollback(*checkpoints[i]); // All data columns will get default value in case of error. 
- column->insertDefault(); + result_columns[i]->insertDefault(); } return 1; diff --git a/src/Storages/NATS/StorageNATS.cpp b/src/Storages/NATS/StorageNATS.cpp index 123f5adc22d..5a51f078e7b 100644 --- a/src/Storages/NATS/StorageNATS.cpp +++ b/src/Storages/NATS/StorageNATS.cpp @@ -786,7 +786,13 @@ void registerStorageNATS(StorageFactory & factory) return std::make_shared(args.table_id, args.getContext(), args.columns, args.comment, std::move(nats_settings), args.mode); }; - factory.registerStorage("NATS", creator_fn, StorageFactory::StorageFeatures{ .supports_settings = true, }); + factory.registerStorage( + "NATS", + creator_fn, + StorageFactory::StorageFeatures{ + .supports_settings = true, + .source_access_type = AccessType::NATS, + }); } } diff --git a/src/Storages/ObjectStorage/HDFS/WriteBufferFromHDFS.cpp b/src/Storages/ObjectStorage/HDFS/WriteBufferFromHDFS.cpp index 4f6f8c782f2..4879dc41d53 100644 --- a/src/Storages/ObjectStorage/HDFS/WriteBufferFromHDFS.cpp +++ b/src/Storages/ObjectStorage/HDFS/WriteBufferFromHDFS.cpp @@ -29,6 +29,7 @@ extern const int CANNOT_FSYNC; struct WriteBufferFromHDFS::WriteBufferFromHDFSImpl { std::string hdfs_uri; + std::string hdfs_file_path; hdfsFile fout; HDFSBuilderWrapper builder; HDFSFSPtr fs; @@ -36,25 +37,24 @@ struct WriteBufferFromHDFS::WriteBufferFromHDFSImpl WriteBufferFromHDFSImpl( const std::string & hdfs_uri_, + const std::string & hdfs_file_path_, const Poco::Util::AbstractConfiguration & config_, int replication_, const WriteSettings & write_settings_, int flags) : hdfs_uri(hdfs_uri_) + , hdfs_file_path(hdfs_file_path_) , builder(createHDFSBuilder(hdfs_uri, config_)) , fs(createHDFSFS(builder.get())) , write_settings(write_settings_) { - const size_t begin_of_path = hdfs_uri.find('/', hdfs_uri.find("//") + 2); - const String path = hdfs_uri.substr(begin_of_path); - /// O_WRONLY meaning create or overwrite i.e., implies O_TRUNCAT here - fout = hdfsOpenFile(fs.get(), path.c_str(), flags, 0, replication_, 0); + fout = hdfsOpenFile(fs.get(), hdfs_file_path.c_str(), flags, 0, replication_, 0); if (fout == nullptr) { throw Exception(ErrorCodes::CANNOT_OPEN_FILE, "Unable to open HDFS file: {} ({}) error: {}", - path, hdfs_uri, std::string(hdfsGetLastError())); + hdfs_file_path, hdfs_uri, std::string(hdfsGetLastError())); } } @@ -71,7 +71,7 @@ struct WriteBufferFromHDFS::WriteBufferFromHDFSImpl rlock.unlock(std::max(0, bytes_written)); if (bytes_written < 0) - throw Exception(ErrorCodes::NETWORK_ERROR, "Fail to write HDFS file: {} {}", hdfs_uri, std::string(hdfsGetLastError())); + throw Exception(ErrorCodes::NETWORK_ERROR, "Fail to write HDFS file: {}, hdfs_uri: {}, {}", hdfs_file_path, hdfs_uri, std::string(hdfsGetLastError())); if (write_settings.remote_throttler) write_settings.remote_throttler->add(bytes_written, ProfileEvents::RemoteWriteThrottlerBytes, ProfileEvents::RemoteWriteThrottlerSleepMicroseconds); @@ -83,20 +83,21 @@ struct WriteBufferFromHDFS::WriteBufferFromHDFSImpl { int result = hdfsSync(fs.get(), fout); if (result < 0) - throw ErrnoException(ErrorCodes::CANNOT_FSYNC, "Cannot HDFS sync {} {}", hdfs_uri, std::string(hdfsGetLastError())); + throw ErrnoException(ErrorCodes::CANNOT_FSYNC, "Cannot HDFS sync {}, hdfs_url: {}, {}", hdfs_file_path, hdfs_uri, std::string(hdfsGetLastError())); } }; WriteBufferFromHDFS::WriteBufferFromHDFS( - const std::string & hdfs_name_, + const std::string & hdfs_uri_, + const std::string & hdfs_file_path_, const Poco::Util::AbstractConfiguration & config_, int replication_, const WriteSettings & 
write_settings_, size_t buf_size_, int flags_) : WriteBufferFromFileBase(buf_size_, nullptr, 0) - , impl(std::make_unique(hdfs_name_, config_, replication_, write_settings_, flags_)) - , filename(hdfs_name_) + , impl(std::make_unique(hdfs_uri_, hdfs_file_path_, config_, replication_, write_settings_, flags_)) + , filename(hdfs_file_path_) { } diff --git a/src/Storages/ObjectStorage/HDFS/WriteBufferFromHDFS.h b/src/Storages/ObjectStorage/HDFS/WriteBufferFromHDFS.h index e3f0ae96a8f..8166da92e16 100644 --- a/src/Storages/ObjectStorage/HDFS/WriteBufferFromHDFS.h +++ b/src/Storages/ObjectStorage/HDFS/WriteBufferFromHDFS.h @@ -22,7 +22,8 @@ class WriteBufferFromHDFS final : public WriteBufferFromFileBase public: WriteBufferFromHDFS( - const String & hdfs_name_, + const String & hdfs_uri_, + const String & hdfs_file_path_, const Poco::Util::AbstractConfiguration & config_, int replication_, const WriteSettings & write_settings_ = {}, diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.cpp b/src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.cpp index 8bab744459d..6aac853b011 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.cpp +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.cpp @@ -219,6 +219,108 @@ ObjectStorageQueueMetadata::tryAcquireBucket(const Bucket & bucket, const Proces return ObjectStorageQueueOrderedFileMetadata::tryAcquireBucket(zookeeper_path, bucket, processor, log); } +void ObjectStorageQueueMetadata::alterSettings(const SettingsChanges & changes) +{ + const fs::path alter_settings_lock_path = zookeeper_path / "alter_settings_lock"; + zkutil::EphemeralNodeHolder::Ptr alter_settings_lock; + auto zookeeper = getZooKeeper(); + + /// We will retry taking alter_settings_lock for the duration of 5 seconds. + /// Do we need to add a setting for this? + const size_t num_tries = 100; + for (size_t i = 0; i < num_tries; ++i) + { + alter_settings_lock = zkutil::EphemeralNodeHolder::tryCreate(alter_settings_lock_path, *zookeeper, toString(getCurrentTime())); + + if (alter_settings_lock) + break; + + if (i == num_tries - 1) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Failed to take alter setting lock"); + + sleepForMilliseconds(50); + } + + Coordination::Stat stat; + auto metadata_str = zookeeper->get(fs::path(zookeeper_path) / "metadata", &stat); + auto metadata_from_zk = ObjectStorageQueueTableMetadata::parse(metadata_str); + auto new_table_metadata{table_metadata}; + + for (const auto & change : changes) + { + if (!ObjectStorageQueueTableMetadata::isStoredInKeeper(change.name)) + continue; + + if (change.name == "processing_threads_num") + { + const auto value = change.value.safeGet(); + if (table_metadata.processing_threads_num == value) + { + LOG_TRACE(log, "Setting `processing_threads_num` already equals {}. " + "Will do nothing", value); + continue; + } + new_table_metadata.processing_threads_num = value; + } + else if (change.name == "loading_retries") + { + const auto value = change.value.safeGet(); + if (table_metadata.loading_retries == value) + { + LOG_TRACE(log, "Setting `loading_retries` already equals {}. " + "Will do nothing", value); + continue; + } + new_table_metadata.loading_retries = value; + } + else if (change.name == "after_processing") + { + const auto value = ObjectStorageQueueTableMetadata::actionFromString(change.value.safeGet()); + if (table_metadata.after_processing == value) + { + LOG_TRACE(log, "Setting `after_processing` already equals {}. 
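The ObjectStorageQueueMetadata::alterSettings() hunk above serializes concurrent ALTERs by taking an ephemeral `alter_settings_lock` node in ZooKeeper, retrying up to 100 times with a 50 ms pause (roughly 5 seconds) before giving up. The sketch below shows only that retry shape, with a std::function stand-in for the ZooKeeper tryCreate call; it is illustrative, not the actual coordination code.

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <thread>

// Stand-in for "try to create an ephemeral lock node":
// returns a holder name on success, nullopt if the lock is already taken.
using TryCreateLock = std::function<std::optional<std::string>()>;

// Retry for roughly num_tries * delay (100 * 50 ms ~= 5 s in the hunk above) before failing.
std::string acquireLockWithRetries(const TryCreateLock & try_create, size_t num_tries, std::chrono::milliseconds delay)
{
    for (size_t i = 0; i < num_tries; ++i)
    {
        if (auto lock = try_create())
            return *lock;

        if (i + 1 == num_tries)
            throw std::runtime_error("Failed to take alter settings lock");

        std::this_thread::sleep_for(delay);
    }
    // Only reachable when num_tries == 0.
    throw std::logic_error("no attempts were made");
}

int main()
{
    size_t attempts = 0;
    // Simulated lock that becomes free on the third attempt.
    auto try_create = [&]() -> std::optional<std::string>
    {
        return ++attempts < 3 ? std::nullopt : std::optional<std::string>("alter_settings_lock");
    };

    std::printf("acquired: %s after %zu attempts\n",
                acquireLockWithRetries(try_create, 100, std::chrono::milliseconds(50)).c_str(),
                attempts);
    return 0;
}
```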
" + "Will do nothing", value); + continue; + } + new_table_metadata.after_processing = value; + } + else if (change.name == "tracked_files_limit") + { + const auto value = change.value.safeGet(); + if (table_metadata.tracked_files_limit == value) + { + LOG_TRACE(log, "Setting `tracked_files_limit` already equals {}. " + "Will do nothing", value); + continue; + } + new_table_metadata.tracked_files_limit = value; + } + else if (change.name == "tracked_file_ttl_sec") + { + const auto value = change.value.safeGet(); + if (table_metadata.tracked_files_ttl_sec == value) + { + LOG_TRACE(log, "Setting `tracked_file_ttl_sec` already equals {}. " + "Will do nothing", value); + continue; + } + new_table_metadata.tracked_files_ttl_sec = value; + } + else + { + throw Exception(ErrorCodes::LOGICAL_ERROR, "Setting `{}` is not changeable", change.name); + } + } + + const auto new_metadata_str = new_table_metadata.toString(); + LOG_TRACE(log, "New metadata: {}", new_metadata_str); + + const fs::path table_metadata_path = zookeeper_path / "metadata"; + zookeeper->set(table_metadata_path, new_metadata_str, stat.version); + + table_metadata.syncChangeableSettings(new_table_metadata); +} + ObjectStorageQueueTableMetadata ObjectStorageQueueMetadata::syncWithKeeper( const fs::path & zookeeper_path, const ObjectStorageQueueSettings & settings, diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.h b/src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.h index 93e726b17ab..8da8a4c367d 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.h +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueMetadata.h @@ -9,6 +9,7 @@ #include #include #include +#include namespace fs = std::filesystem; namespace Poco { class Logger; } @@ -89,6 +90,8 @@ public: const ObjectStorageQueueTableMetadata & getTableMetadata() const { return table_metadata; } ObjectStorageQueueTableMetadata & getTableMetadata() { return table_metadata; } + void alterSettings(const SettingsChanges & changes); + private: void cleanupThreadFunc(); void cleanupThreadFuncImpl(); diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueOrderedFileMetadata.cpp b/src/Storages/ObjectStorageQueue/ObjectStorageQueueOrderedFileMetadata.cpp index 75a96328051..72e9e073f27 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueOrderedFileMetadata.cpp +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueOrderedFileMetadata.cpp @@ -381,10 +381,10 @@ void ObjectStorageQueueOrderedFileMetadata::setProcessedImpl() /// In one zookeeper transaction do the following: enum RequestType { - SET_MAX_PROCESSED_PATH = 0, - CHECK_PROCESSING_ID_PATH = 1, /// Optional. - REMOVE_PROCESSING_ID_PATH = 2, /// Optional. - REMOVE_PROCESSING_PATH = 3, /// Optional. + CHECK_PROCESSING_ID_PATH = 0, + REMOVE_PROCESSING_ID_PATH = 1, + REMOVE_PROCESSING_PATH = 2, + SET_MAX_PROCESSED_PATH = 3, }; const auto zk_client = getZooKeeper(); @@ -409,8 +409,18 @@ void ObjectStorageQueueOrderedFileMetadata::setProcessedImpl() return; } + bool unexpected_error = false; if (Coordination::isHardwareError(code)) failure_reason = "Lost connection to keeper"; + else if (is_request_failed(CHECK_PROCESSING_ID_PATH)) + failure_reason = "Version of processing id node changed"; + else if (is_request_failed(REMOVE_PROCESSING_PATH)) + { + /// Remove processing_id node should not actually fail + /// because we just checked in a previous keeper request that it exists and has a certain version. 
+ unexpected_error = true; + failure_reason = "Failed to remove processing id path"; + } else if (is_request_failed(SET_MAX_PROCESSED_PATH)) { LOG_TRACE(log, "Cannot set file {} as processed. " @@ -418,13 +428,12 @@ void ObjectStorageQueueOrderedFileMetadata::setProcessedImpl() "Will retry.", path, code); continue; } - else if (is_request_failed(CHECK_PROCESSING_ID_PATH)) - failure_reason = "Version of processing id node changed"; - else if (is_request_failed(REMOVE_PROCESSING_PATH)) - failure_reason = "Failed to remove processing path"; else throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected state of zookeeper transaction: {}", code); + if (unexpected_error) + throw Exception(ErrorCodes::LOGICAL_ERROR, "{}", failure_reason); + LOG_WARNING(log, "Cannot set file {} as processed: {}. Reason: {}", path, code, failure_reason); return; } diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueSettings.cpp b/src/Storages/ObjectStorageQueue/ObjectStorageQueueSettings.cpp index d47e7b97404..060f1cd2dd5 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueSettings.cpp +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueSettings.cpp @@ -23,15 +23,15 @@ namespace ErrorCodes 0) \ DECLARE(ObjectStorageQueueAction, after_processing, ObjectStorageQueueAction::KEEP, "Delete or keep file in after successful processing", 0) \ DECLARE(String, keeper_path, "", "Zookeeper node path", 0) \ - DECLARE(UInt32, loading_retries, 10, "Retry loading up to specified number of times", 0) \ - DECLARE(UInt32, processing_threads_num, 1, "Number of processing threads", 0) \ + DECLARE(UInt64, loading_retries, 10, "Retry loading up to specified number of times", 0) \ + DECLARE(UInt64, processing_threads_num, 1, "Number of processing threads", 0) \ DECLARE(UInt32, enable_logging_to_queue_log, 1, "Enable logging to system table system.(s3/azure_)queue_log", 0) \ DECLARE(String, last_processed_path, "", "For Ordered mode. Files that have lexicographically smaller file name are considered already processed", 0) \ - DECLARE(UInt32, tracked_file_ttl_sec, 0, "Maximum number of seconds to store processed files in ZooKeeper node (store forever by default)", 0) \ - DECLARE(UInt32, polling_min_timeout_ms, 1000, "Minimal timeout before next polling", 0) \ - DECLARE(UInt32, polling_max_timeout_ms, 10000, "Maximum timeout before next polling", 0) \ - DECLARE(UInt32, polling_backoff_ms, 1000, "Polling backoff", 0) \ - DECLARE(UInt32, tracked_files_limit, 1000, "For unordered mode. Max set size for tracking processed files in ZooKeeper", 0) \ + DECLARE(UInt64, tracked_files_limit, 1000, "For unordered mode. Max set size for tracking processed files in ZooKeeper", 0) \ + DECLARE(UInt64, tracked_file_ttl_sec, 0, "Maximum number of seconds to store processed files in ZooKeeper node (store forever by default)", 0) \ + DECLARE(UInt64, polling_min_timeout_ms, 1000, "Minimal timeout before next polling", 0) \ + DECLARE(UInt64, polling_max_timeout_ms, 10000, "Maximum timeout before next polling", 0) \ + DECLARE(UInt64, polling_backoff_ms, 1000, "Polling backoff", 0) \ DECLARE(UInt32, cleanup_interval_min_ms, 60000, "For unordered mode. Polling backoff min for cleanup", 0) \ DECLARE(UInt32, cleanup_interval_max_ms, 60000, "For unordered mode. 
Polling backoff max for cleanup", 0) \ DECLARE(UInt32, buckets, 0, "Number of buckets for Ordered mode parallel processing", 0) \ @@ -112,6 +112,11 @@ ObjectStorageQueueSettings::~ObjectStorageQueueSettings() = default; OBJECT_STORAGE_QUEUE_SETTINGS_SUPPORTED_TYPES(ObjectStorageQueueSettings, IMPLEMENT_SETTING_SUBSCRIPT_OPERATOR) +void ObjectStorageQueueSettings::applyChanges(const SettingsChanges & changes) +{ + impl->applyChanges(changes); +} + void ObjectStorageQueueSettings::loadFromQuery(ASTStorage & storage_def) { if (storage_def.settings) @@ -156,4 +161,9 @@ void ObjectStorageQueueSettings::loadFromQuery(ASTStorage & storage_def) } } +Field ObjectStorageQueueSettings::get(const std::string & name) +{ + return impl->get(name); +} + } diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueSettings.h b/src/Storages/ObjectStorageQueue/ObjectStorageQueueSettings.h index c2929ac27fb..06bb78a95a2 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueSettings.h +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueSettings.h @@ -12,6 +12,7 @@ class ASTStorage; struct ObjectStorageQueueSettingsImpl; struct MutableColumnsAndConstraints; class StorageObjectStorageQueue; +class SettingsChanges; /// List of available types supported in ObjectStorageQueueSettings object #define OBJECT_STORAGE_QUEUE_SETTINGS_SUPPORTED_TYPES(CLASS_NAME, M) \ @@ -61,6 +62,10 @@ struct ObjectStorageQueueSettings void loadFromQuery(ASTStorage & storage_def); + void applyChanges(const SettingsChanges & changes); + + Field get(const std::string & name); + private: std::unique_ptr impl; }; diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueSource.cpp b/src/Storages/ObjectStorageQueue/ObjectStorageQueueSource.cpp index c55287d2177..ba1a97bc2fb 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueSource.cpp +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueSource.cpp @@ -657,7 +657,7 @@ void ObjectStorageQueueSource::commit(bool success, const std::string & exceptio void ObjectStorageQueueSource::applyActionAfterProcessing(const String & path) { - if (files_metadata->getTableMetadata().after_processing == "delete") + if (files_metadata->getTableMetadata().after_processing == ObjectStorageQueueAction::DELETE) { object_storage->removeObject(StoredObject(path)); } diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueTableMetadata.cpp b/src/Storages/ObjectStorageQueue/ObjectStorageQueueTableMetadata.cpp index e67379c393e..1c024fa09b8 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueTableMetadata.cpp +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueTableMetadata.cpp @@ -17,11 +17,11 @@ namespace ObjectStorageQueueSetting extern const ObjectStorageQueueSettingsObjectStorageQueueAction after_processing; extern const ObjectStorageQueueSettingsUInt32 buckets; extern const ObjectStorageQueueSettingsString last_processed_path; - extern const ObjectStorageQueueSettingsUInt32 loading_retries; extern const ObjectStorageQueueSettingsObjectStorageQueueMode mode; - extern const ObjectStorageQueueSettingsUInt32 processing_threads_num; - extern const ObjectStorageQueueSettingsUInt32 tracked_files_limit; - extern const ObjectStorageQueueSettingsUInt32 tracked_file_ttl_sec; + extern const ObjectStorageQueueSettingsUInt64 loading_retries; + extern const ObjectStorageQueueSettingsUInt64 processing_threads_num; + extern const ObjectStorageQueueSettingsUInt64 tracked_files_limit; + extern const ObjectStorageQueueSettingsUInt64 tracked_file_ttl_sec; } @@ -56,13 +56,13 @@ 
ObjectStorageQueueTableMetadata::ObjectStorageQueueTableMetadata( const std::string & format_) : format_name(format_) , columns(columns_.toString()) - , after_processing(engine_settings[ObjectStorageQueueSetting::after_processing].toString()) , mode(engine_settings[ObjectStorageQueueSetting::mode].toString()) - , tracked_files_limit(engine_settings[ObjectStorageQueueSetting::tracked_files_limit]) - , tracked_files_ttl_sec(engine_settings[ObjectStorageQueueSetting::tracked_file_ttl_sec]) , buckets(engine_settings[ObjectStorageQueueSetting::buckets]) , last_processed_path(engine_settings[ObjectStorageQueueSetting::last_processed_path]) + , after_processing(engine_settings[ObjectStorageQueueSetting::after_processing]) , loading_retries(engine_settings[ObjectStorageQueueSetting::loading_retries]) + , tracked_files_limit(engine_settings[ObjectStorageQueueSetting::tracked_files_limit]) + , tracked_files_ttl_sec(engine_settings[ObjectStorageQueueSetting::tracked_file_ttl_sec]) { processing_threads_num_changed = engine_settings[ObjectStorageQueueSetting::processing_threads_num].changed; if (!processing_threads_num_changed && engine_settings[ObjectStorageQueueSetting::processing_threads_num] <= 1) @@ -74,16 +74,16 @@ ObjectStorageQueueTableMetadata::ObjectStorageQueueTableMetadata( String ObjectStorageQueueTableMetadata::toString() const { Poco::JSON::Object json; - json.set("after_processing", after_processing); + json.set("after_processing", actionToString(after_processing.load())); json.set("mode", mode); - json.set("tracked_files_limit", tracked_files_limit); - json.set("tracked_files_ttl_sec", tracked_files_ttl_sec); - json.set("processing_threads_num", processing_threads_num); + json.set("tracked_files_limit", tracked_files_limit.load()); + json.set("tracked_files_ttl_sec", tracked_files_ttl_sec.load()); + json.set("processing_threads_num", processing_threads_num.load()); json.set("buckets", buckets); json.set("format_name", format_name); json.set("columns", columns); json.set("last_processed_file", last_processed_path); - json.set("loading_retries", loading_retries); + json.set("loading_retries", loading_retries.load()); std::ostringstream oss; // STYLE_CHECK_ALLOW_STD_STRING_STREAM oss.exceptions(std::ios::failbit); @@ -91,6 +91,26 @@ String ObjectStorageQueueTableMetadata::toString() const return oss.str(); } +ObjectStorageQueueAction ObjectStorageQueueTableMetadata::actionFromString(const std::string & action) +{ + if (action == "keep") + return ObjectStorageQueueAction::KEEP; + if (action == "delete") + return ObjectStorageQueueAction::DELETE; + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Unexpected ObjectStorageQueue action: {}", action); +} + +std::string ObjectStorageQueueTableMetadata::actionToString(ObjectStorageQueueAction action) +{ + switch (action) + { + case ObjectStorageQueueAction::DELETE: + return "delete"; + case ObjectStorageQueueAction::KEEP: + return "keep"; + } +} + ObjectStorageQueueMode ObjectStorageQueueTableMetadata::getMode() const { return modeFromString(mode); @@ -115,14 +135,14 @@ static auto getOrDefault( ObjectStorageQueueTableMetadata::ObjectStorageQueueTableMetadata(const Poco::JSON::Object::Ptr & json) : format_name(json->getValue("format_name")) , columns(json->getValue("columns")) - , after_processing(json->getValue("after_processing")) , mode(json->getValue("mode")) - , tracked_files_limit(getOrDefault(json, "tracked_files_limit", "s3queue_", 0)) - , tracked_files_ttl_sec(getOrDefault(json, "tracked_files_ttl_sec", "", getOrDefault(json, 
"tracked_file_ttl_sec", "s3queue_", 0))) , buckets(getOrDefault(json, "buckets", "", 0)) , last_processed_path(getOrDefault(json, "last_processed_file", "s3queue_", "")) + , after_processing(actionFromString(json->getValue("after_processing"))) , loading_retries(getOrDefault(json, "loading_retries", "", 10)) , processing_threads_num(getOrDefault(json, "processing_threads_num", "s3queue_", 1)) + , tracked_files_limit(getOrDefault(json, "tracked_files_limit", "s3queue_", 0)) + , tracked_files_ttl_sec(getOrDefault(json, "tracked_files_ttl_sec", "", getOrDefault(json, "tracked_file_ttl_sec", "s3queue_", 0))) { validateMode(mode); } @@ -148,7 +168,7 @@ void ObjectStorageQueueTableMetadata::adjustFromKeeper(const ObjectStorageQueueT else LOG_TRACE(log, "{}", message); - processing_threads_num = from_zk.processing_threads_num; + processing_threads_num = from_zk.processing_threads_num.load(); } } @@ -164,8 +184,8 @@ void ObjectStorageQueueTableMetadata::checkImmutableFieldsEquals(const ObjectSto ErrorCodes::METADATA_MISMATCH, "Existing table metadata in ZooKeeper differs " "in action after processing. Stored in ZooKeeper: {}, local: {}", - DB::toString(from_zk.after_processing), - DB::toString(after_processing)); + DB::toString(from_zk.after_processing.load()), + DB::toString(after_processing.load())); if (mode != from_zk.mode) throw Exception( diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueTableMetadata.h b/src/Storages/ObjectStorageQueue/ObjectStorageQueueTableMetadata.h index a4edd8831c1..6dfc705a7b6 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueTableMetadata.h +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueTableMetadata.h @@ -19,17 +19,19 @@ class ReadBuffer; */ struct ObjectStorageQueueTableMetadata { + /// Non-changeable settings. const String format_name; const String columns; - const String after_processing; const String mode; - const UInt32 tracked_files_limit; - const UInt32 tracked_files_ttl_sec; const UInt32 buckets; const String last_processed_path; - const UInt32 loading_retries; + /// Changeable settings. + std::atomic after_processing; + std::atomic loading_retries; + std::atomic processing_threads_num; + std::atomic tracked_files_limit; + std::atomic tracked_files_ttl_sec; - UInt32 processing_threads_num; /// Can be changed from keeper. 
bool processing_threads_num_changed = false; ObjectStorageQueueTableMetadata( @@ -37,10 +39,36 @@ struct ObjectStorageQueueTableMetadata const ColumnsDescription & columns_, const std::string & format_); + ObjectStorageQueueTableMetadata(const ObjectStorageQueueTableMetadata & other) + : format_name(other.format_name) + , columns(other.columns) + , mode(other.mode) + , buckets(other.buckets) + , last_processed_path(other.last_processed_path) + , after_processing(other.after_processing.load()) + , loading_retries(other.loading_retries.load()) + , processing_threads_num(other.processing_threads_num.load()) + , tracked_files_limit(other.tracked_files_limit.load()) + , tracked_files_ttl_sec(other.tracked_files_ttl_sec.load()) + { + } + + void syncChangeableSettings(const ObjectStorageQueueTableMetadata & other) + { + after_processing = other.after_processing.load(); + loading_retries = other.loading_retries.load(); + processing_threads_num = other.processing_threads_num.load(); + tracked_files_limit = other.tracked_files_limit.load(); + tracked_files_ttl_sec = other.tracked_files_ttl_sec.load(); + } + explicit ObjectStorageQueueTableMetadata(const Poco::JSON::Object::Ptr & json); static ObjectStorageQueueTableMetadata parse(const String & metadata_str); + static ObjectStorageQueueAction actionFromString(const std::string & action); + static std::string actionToString(ObjectStorageQueueAction action); + String toString() const; ObjectStorageQueueMode getMode() const; @@ -49,6 +77,25 @@ struct ObjectStorageQueueTableMetadata void checkEquals(const ObjectStorageQueueTableMetadata & from_zk) const; + static bool isStoredInKeeper(const std::string & name) + { + static const std::unordered_set settings_names + { + "format_name", + "columns", + "mode", + "buckets", + "last_processed_path", + "after_processing", + "loading_retries", + "processing_threads_num", + "tracked_files_limit", + "tracked_file_ttl_sec", + "tracked_files_ttl_sec", + }; + return settings_names.contains(name); + } + private: void checkImmutableFieldsEquals(const ObjectStorageQueueTableMetadata & from_zk) const; }; diff --git a/src/Storages/ObjectStorageQueue/ObjectStorageQueueUnorderedFileMetadata.cpp b/src/Storages/ObjectStorageQueue/ObjectStorageQueueUnorderedFileMetadata.cpp index 40751d9c332..2050797a2ea 100644 --- a/src/Storages/ObjectStorageQueue/ObjectStorageQueueUnorderedFileMetadata.cpp +++ b/src/Storages/ObjectStorageQueue/ObjectStorageQueueUnorderedFileMetadata.cpp @@ -103,29 +103,46 @@ void ObjectStorageQueueUnorderedFileMetadata::setProcessedImpl() /// In one zookeeper transaction do the following: enum RequestType { - SET_MAX_PROCESSED_PATH = 0, - CHECK_PROCESSING_ID_PATH = 1, /// Optional. - REMOVE_PROCESSING_ID_PATH = 2, /// Optional. - REMOVE_PROCESSING_PATH = 3, /// Optional. 
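The ObjectStorageQueueTableMetadata hunks above split the metadata into immutable fields and changeable ones held as std::atomic so readers can observe concurrent ALTERs without extra locking. Since atomic members make the aggregate non-copyable by default, the patch adds an explicit copy constructor and a syncChangeableSettings() helper that load and store values one by one. A reduced sketch of that layout, with invented member names, is shown below.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <string>

struct QueueTableMetadataSketch
{
    /// Non-changeable part.
    const std::string mode;

    /// Changeable part, readable concurrently with ALTER.
    std::atomic<uint64_t> loading_retries;
    std::atomic<uint64_t> processing_threads_num;
    std::atomic<uint64_t> tracked_files_limit;

    QueueTableMetadataSketch(std::string mode_, uint64_t retries, uint64_t threads, uint64_t limit)
        : mode(std::move(mode_)), loading_retries(retries), processing_threads_num(threads), tracked_files_limit(limit)
    {
    }

    // std::atomic has no copy constructor, so copy by loading each value explicitly.
    QueueTableMetadataSketch(const QueueTableMetadataSketch & other)
        : mode(other.mode)
        , loading_retries(other.loading_retries.load())
        , processing_threads_num(other.processing_threads_num.load())
        , tracked_files_limit(other.tracked_files_limit.load())
    {
    }

    void syncChangeableSettings(const QueueTableMetadataSketch & other)
    {
        loading_retries = other.loading_retries.load();
        processing_threads_num = other.processing_threads_num.load();
        tracked_files_limit = other.tracked_files_limit.load();
    }
};

int main()
{
    QueueTableMetadataSketch current("unordered", 10, 1, 1000);

    QueueTableMetadataSketch altered(current);   // copy, then modify the copy
    altered.processing_threads_num = 8;

    current.syncChangeableSettings(altered);     // publish the accepted changes
    std::printf("threads: %llu\n", static_cast<unsigned long long>(current.processing_threads_num.load()));
    return 0;
}
```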
+ CHECK_PROCESSING_ID_PATH, + REMOVE_PROCESSING_ID_PATH, + REMOVE_PROCESSING_PATH, + SET_PROCESSED_PATH, }; const auto zk_client = getZooKeeper(); - std::string failure_reason; - Coordination::Requests requests; - requests.push_back( - zkutil::makeCreateRequest( - processed_node_path, node_metadata.toString(), zkutil::CreateMode::Persistent)); + std::map request_index; if (processing_id_version.has_value()) { requests.push_back(zkutil::makeCheckRequest(processing_node_id_path, processing_id_version.value())); requests.push_back(zkutil::makeRemoveRequest(processing_node_id_path, processing_id_version.value())); requests.push_back(zkutil::makeRemoveRequest(processing_node_path, -1)); + + /// The order is important: + /// we must first check processing nodes and set processed_path the last. + request_index[CHECK_PROCESSING_ID_PATH] = 0; + request_index[REMOVE_PROCESSING_ID_PATH] = 1; + request_index[REMOVE_PROCESSING_PATH] = 2; + request_index[SET_PROCESSED_PATH] = 3; + } + else + { + request_index[SET_PROCESSED_PATH] = 0; } + requests.push_back( + zkutil::makeCreateRequest( + processed_node_path, node_metadata.toString(), zkutil::CreateMode::Persistent)); + Coordination::Responses responses; - auto is_request_failed = [&](RequestType type) { return responses[type]->error != Coordination::Error::ZOK; }; + auto is_request_failed = [&](RequestType type) + { + if (!request_index.contains(type)) + return false; + chassert(request_index[type] < responses.size()); + return responses[request_index[type]]->error != Coordination::Error::ZOK; + }; const auto code = zk_client->tryMulti(requests, responses); if (code == Coordination::Error::ZOK) @@ -140,18 +157,41 @@ void ObjectStorageQueueUnorderedFileMetadata::setProcessedImpl() return; } + bool unexpected_error = false; + std::string failure_reason; + if (Coordination::isHardwareError(code)) + { failure_reason = "Lost connection to keeper"; - else if (is_request_failed(SET_MAX_PROCESSED_PATH)) - throw Exception(ErrorCodes::LOGICAL_ERROR, - "Cannot create a persistent node in /processed since it already exists"); + } else if (is_request_failed(CHECK_PROCESSING_ID_PATH)) + { + /// This is normal in case of expired session with keeper. failure_reason = "Version of processing id node changed"; + } + else if (is_request_failed(REMOVE_PROCESSING_ID_PATH)) + { + /// Remove processing_id node should not actually fail + /// because we just checked in a previous keeper request that it exists and has a certain version. + unexpected_error = true; + failure_reason = "Failed to remove processing id path"; + } else if (is_request_failed(REMOVE_PROCESSING_PATH)) + { + /// This is normal in case of expired session with keeper as this node is ephemeral. failure_reason = "Failed to remove processing path"; + } + else if (is_request_failed(SET_PROCESSED_PATH)) + { + unexpected_error = true; + failure_reason = "Cannot create a persistent node in /processed since it already exists"; + } else throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected state of zookeeper transaction: {}", code); + if (unexpected_error) + throw Exception(ErrorCodes::LOGICAL_ERROR, "{}", failure_reason); + LOG_WARNING(log, "Cannot set file {} as processed: {}. 
Reason: {}", path, code, failure_reason); } diff --git a/src/Storages/ObjectStorageQueue/StorageObjectStorageQueue.cpp b/src/Storages/ObjectStorageQueue/StorageObjectStorageQueue.cpp index 245b441513d..200872a2f4c 100644 --- a/src/Storages/ObjectStorageQueue/StorageObjectStorageQueue.cpp +++ b/src/Storages/ObjectStorageQueue/StorageObjectStorageQueue.cpp @@ -24,6 +24,7 @@ #include #include #include +#include #include #include @@ -51,15 +52,15 @@ namespace ObjectStorageQueueSetting extern const ObjectStorageQueueSettingsUInt64 max_processed_files_before_commit; extern const ObjectStorageQueueSettingsUInt64 max_processed_rows_before_commit; extern const ObjectStorageQueueSettingsUInt64 max_processing_time_sec_before_commit; - extern const ObjectStorageQueueSettingsUInt32 polling_min_timeout_ms; - extern const ObjectStorageQueueSettingsUInt32 polling_max_timeout_ms; - extern const ObjectStorageQueueSettingsUInt32 polling_backoff_ms; - extern const ObjectStorageQueueSettingsUInt32 processing_threads_num; + extern const ObjectStorageQueueSettingsUInt64 polling_min_timeout_ms; + extern const ObjectStorageQueueSettingsUInt64 polling_max_timeout_ms; + extern const ObjectStorageQueueSettingsUInt64 polling_backoff_ms; + extern const ObjectStorageQueueSettingsUInt64 processing_threads_num; extern const ObjectStorageQueueSettingsUInt32 buckets; - extern const ObjectStorageQueueSettingsUInt32 tracked_file_ttl_sec; - extern const ObjectStorageQueueSettingsUInt32 tracked_files_limit; + extern const ObjectStorageQueueSettingsUInt64 tracked_file_ttl_sec; + extern const ObjectStorageQueueSettingsUInt64 tracked_files_limit; extern const ObjectStorageQueueSettingsString last_processed_path; - extern const ObjectStorageQueueSettingsUInt32 loading_retries; + extern const ObjectStorageQueueSettingsUInt64 loading_retries; extern const ObjectStorageQueueSettingsObjectStorageQueueAction after_processing; } @@ -69,6 +70,7 @@ namespace ErrorCodes extern const int BAD_ARGUMENTS; extern const int BAD_QUERY_PARAMETER; extern const int QUERY_NOT_ALLOWED; + extern const int SUPPORT_IS_DISABLED; } namespace @@ -353,10 +355,11 @@ void StorageObjectStorageQueue::read( void ReadFromObjectStorageQueue::initializePipeline(QueryPipelineBuilder & pipeline, const BuildQueryPipelineSettings &) { Pipes pipes; - const size_t adjusted_num_streams = storage->getTableMetadata().processing_threads_num; + + size_t processing_threads_num = storage->getTableMetadata().processing_threads_num; createIterator(nullptr); - for (size_t i = 0; i < adjusted_num_streams; ++i) + for (size_t i = 0; i < processing_threads_num; ++i) pipes.emplace_back(storage->createSource( i/* processor_id */, info, @@ -555,6 +558,174 @@ bool StorageObjectStorageQueue::streamToViews() return total_rows > 0; } +static const std::unordered_set changeable_settings_unordered_mode +{ + "processing_threads_num", + "loading_retries", + "after_processing", + "tracked_files_limit", + "tracked_file_ttl_sec", + "polling_min_timeout_ms", + "polling_max_timeout_ms", + "polling_backoff_ms", + /// For compatibility. 
+ "s3queue_processing_threads_num", + "s3queue_loading_retries", + "s3queue_after_processing", + "s3queue_tracked_files_limit", + "s3queue_tracked_file_ttl_sec", + "s3queue_polling_min_timeout_ms", + "s3queue_polling_max_timeout_ms", + "s3queue_polling_backoff_ms", +}; + +static const std::unordered_set changeable_settings_ordered_mode +{ + "loading_retries", + "after_processing", + "polling_min_timeout_ms", + "polling_max_timeout_ms", + "polling_backoff_ms", + /// For compatibility. + "s3queue_loading_retries", + "s3queue_after_processing", + "s3queue_polling_min_timeout_ms", + "s3queue_polling_max_timeout_ms", + "s3queue_polling_backoff_ms", +}; + +static bool isSettingChangeable(const std::string & name, ObjectStorageQueueMode mode) +{ + if (mode == ObjectStorageQueueMode::UNORDERED) + return changeable_settings_unordered_mode.contains(name); + else + return changeable_settings_ordered_mode.contains(name); +} + +void StorageObjectStorageQueue::checkAlterIsPossible(const AlterCommands & commands, ContextPtr local_context) const +{ + for (const auto & command : commands) + { + if (command.type != AlterCommand::MODIFY_SETTING && command.type != AlterCommand::RESET_SETTING) + throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only MODIFY SETTING alter is allowed for {}", getName()); + } + + StorageInMemoryMetadata new_metadata = getInMemoryMetadata(); + commands.apply(new_metadata, local_context); + + StorageInMemoryMetadata old_metadata = getInMemoryMetadata(); + if (!new_metadata.hasSettingsChanges()) + throw Exception(ErrorCodes::LOGICAL_ERROR, "No settings changes"); + + const auto & new_changes = new_metadata.settings_changes->as().changes; + const auto & old_changes = old_metadata.settings_changes->as().changes; + const auto mode = getTableMetadata().getMode(); + for (const auto & changed_setting : new_changes) + { + auto it = std::find_if( + old_changes.begin(), old_changes.end(), + [&](const SettingChange & change) { return change.name == changed_setting.name; }); + + const bool setting_changed = it != old_changes.end() && it->value != changed_setting.value; + + if (setting_changed) + { + if (!isSettingChangeable(changed_setting.name, mode)) + { + throw Exception( + ErrorCodes::SUPPORT_IS_DISABLED, + "Changing setting {} is not allowed for {} mode of {}", + changed_setting.name, magic_enum::enum_name(mode), getName()); + } + } + } +} + +void StorageObjectStorageQueue::alter( + const AlterCommands & commands, + ContextPtr local_context, + AlterLockHolder &) +{ + if (commands.isSettingsAlter()) + { + auto table_id = getStorageID(); + + StorageInMemoryMetadata old_metadata = getInMemoryMetadata(); + const auto & old_settings = old_metadata.settings_changes->as().changes; + + StorageInMemoryMetadata new_metadata = getInMemoryMetadata(); + commands.apply(new_metadata, local_context); + auto new_settings = new_metadata.settings_changes->as().changes; + + ObjectStorageQueueSettings default_settings; + for (const auto & setting : old_settings) + { + auto it = std::find_if( + new_settings.begin(), new_settings.end(), + [&](const SettingChange & change) { return change.name == setting.name; }); + + if (it == new_settings.end()) + { + /// Setting was reset. 
+ new_settings.push_back(SettingChange(setting.name, default_settings.get(setting.name))); + } + } + + SettingsChanges changed_settings; + std::set changed_settings_set; + + const auto mode = getTableMetadata().getMode(); + for (const auto & setting : new_settings) + { + auto it = std::find_if( + old_settings.begin(), old_settings.end(), + [&](const SettingChange & change) { return change.name == setting.name; }); + + const bool setting_changed = it == old_settings.end() || it->value != setting.value; + if (!setting_changed) + continue; + + if (!isSettingChangeable(setting.name, mode)) + { + throw Exception( + ErrorCodes::SUPPORT_IS_DISABLED, + "Changing setting {} is not allowed for {} mode of {}", + setting.name, magic_enum::enum_name(mode), getName()); + } + + SettingChange result_setting(setting); + if (result_setting.name.starts_with("s3queue_")) + result_setting.name = result_setting.name.substr(std::strlen("s3queue_")); + + changed_settings.push_back(result_setting); + + auto inserted = changed_settings_set.emplace(result_setting.name).second; + if (!inserted) + throw Exception(ErrorCodes::BAD_ARGUMENTS, "Setting {} is duplicated", setting.name); + } + + /// Alter settings which are stored in keeper. + files_metadata->alterSettings(changed_settings); + + /// Alter settings which are not stored in keeper. + for (const auto & change : changed_settings) + { + if (change.name == "polling_min_timeout_ms") + polling_min_timeout_ms = change.value.safeGet(); + if (change.name == "polling_max_timeout_ms") + polling_max_timeout_ms = change.value.safeGet(); + if (change.name == "polling_backoff_ms") + polling_backoff_ms = change.value.safeGet(); + } + + StorageInMemoryMetadata metadata = getInMemoryMetadata(); + metadata.setSettingsChanges(new_metadata.settings_changes); + setInMemoryMetadata(metadata); + + DatabaseCatalog::instance().getDatabase(table_id.database_name)->alterTable(local_context, table_id, new_metadata); + } +} + zkutil::ZooKeeperPtr StorageObjectStorageQueue::getZooKeeper() const { return getContext()->getZooKeeper(); @@ -582,8 +753,8 @@ ObjectStorageQueueSettings StorageObjectStorageQueue::getSettings() const settings[ObjectStorageQueueSetting::processing_threads_num] = table_metadata.processing_threads_num; settings[ObjectStorageQueueSetting::enable_logging_to_queue_log] = enable_logging_to_queue_log; settings[ObjectStorageQueueSetting::last_processed_path] = table_metadata.last_processed_path; - settings[ObjectStorageQueueSetting::tracked_file_ttl_sec] = 0; - settings[ObjectStorageQueueSetting::tracked_files_limit] = 0; + settings[ObjectStorageQueueSetting::tracked_file_ttl_sec] = table_metadata.tracked_files_ttl_sec; + settings[ObjectStorageQueueSetting::tracked_files_limit] = table_metadata.tracked_files_limit; settings[ObjectStorageQueueSetting::polling_min_timeout_ms] = polling_min_timeout_ms; settings[ObjectStorageQueueSetting::polling_max_timeout_ms] = polling_max_timeout_ms; settings[ObjectStorageQueueSetting::polling_backoff_ms] = polling_backoff_ms; diff --git a/src/Storages/ObjectStorageQueue/StorageObjectStorageQueue.h b/src/Storages/ObjectStorageQueue/StorageObjectStorageQueue.h index 04b0a16834d..371e409825f 100644 --- a/src/Storages/ObjectStorageQueue/StorageObjectStorageQueue.h +++ b/src/Storages/ObjectStorageQueue/StorageObjectStorageQueue.h @@ -48,6 +48,13 @@ public: size_t max_block_size, size_t num_streams) override; + void checkAlterIsPossible(const AlterCommands & commands, ContextPtr local_context) const override; + + void alter( + const AlterCommands 
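The StorageObjectStorageQueue::alter() hunk above diffs the old and new settings_changes lists: a name missing from the new list is treated as RESET SETTING and falls back to its default, unchanged values are skipped, and anything that actually changed is validated against the per-mode set of changeable settings before being pushed to Keeper. The snippet below is a compact sketch of that decision logic only, using plain string maps in place of SettingsChanges and Field.

```cpp
#include <cstdio>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

using SettingsMap = std::map<std::string, std::string>; // name -> value, stand-in for SettingsChanges

// Reduced stand-in for the per-mode changeable-settings sets above.
static const std::set<std::string> changeable = {"loading_retries", "after_processing", "polling_backoff_ms"};

// Returns only the settings that effectively change; a name missing from new_changes
// is treated as a reset and takes its default value.
SettingsMap computeEffectiveChanges(const SettingsMap & old_changes, SettingsMap new_changes, const SettingsMap & defaults)
{
    for (const auto & [name, value] : old_changes)
        if (!new_changes.contains(name))
            new_changes[name] = defaults.at(name); // RESET SETTING

    SettingsMap result;
    for (const auto & [name, value] : new_changes)
    {
        auto it = old_changes.find(name);
        const bool changed = it == old_changes.end() || it->second != value;
        if (!changed)
            continue;

        if (!changeable.contains(name))
            throw std::runtime_error("Changing setting " + name + " is not allowed");

        result[name] = value;
    }
    return result;
}

int main()
{
    SettingsMap defaults = {{"loading_retries", "10"}, {"after_processing", "keep"}, {"polling_backoff_ms", "1000"}};
    SettingsMap old_changes = {{"loading_retries", "5"}, {"polling_backoff_ms", "2000"}};
    SettingsMap new_changes = {{"loading_retries", "7"}}; // polling_backoff_ms was reset

    for (const auto & [name, value] : computeEffectiveChanges(old_changes, new_changes, defaults))
        std::printf("%s -> %s\n", name.c_str(), value.c_str());
    return 0;
}
```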
& commands, + ContextPtr local_context, + AlterLockHolder & table_lock_holder) override; + const auto & getFormatName() const { return configuration->format; } const fs::path & getZooKeeperPath() const { return zk_path; } @@ -65,9 +72,9 @@ private: const std::string engine_name; const fs::path zk_path; const bool enable_logging_to_queue_log; - const UInt32 polling_min_timeout_ms; - const UInt32 polling_max_timeout_ms; - const UInt32 polling_backoff_ms; + UInt64 polling_min_timeout_ms; + UInt64 polling_max_timeout_ms; + UInt64 polling_backoff_ms; const CommitSettings commit_settings; std::shared_ptr files_metadata; diff --git a/src/Storages/ProjectionsDescription.cpp b/src/Storages/ProjectionsDescription.cpp index 9654b4ef37a..26c3238c940 100644 --- a/src/Storages/ProjectionsDescription.cpp +++ b/src/Storages/ProjectionsDescription.cpp @@ -294,8 +294,22 @@ Block ProjectionDescription::calculate(const Block & block, ContextPtr context) mut_context->setSetting("aggregate_functions_null_for_empty", Field(0)); mut_context->setSetting("transform_null_in", Field(0)); + ASTPtr query_ast_copy = nullptr; + /// Respect the _row_exists column. + if (block.findByName("_row_exists")) + { + query_ast_copy = query_ast->clone(); + auto * select_row_exists = query_ast_copy->as(); + if (!select_row_exists) + throw Exception(ErrorCodes::LOGICAL_ERROR, "Cannot get ASTSelectQuery when adding _row_exists = 1. It's a bug"); + + select_row_exists->setExpression( + ASTSelectQuery::Expression::WHERE, + makeASTFunction("equals", std::make_shared("_row_exists"), std::make_shared(1))); + } + auto builder = InterpreterSelectQuery( - query_ast, + query_ast_copy ? query_ast_copy : query_ast, mut_context, Pipe(std::make_shared(block)), SelectQueryOptions{ diff --git a/src/Storages/RabbitMQ/RabbitMQSource.cpp b/src/Storages/RabbitMQ/RabbitMQSource.cpp index a7b652d76dd..14999e7b7c2 100644 --- a/src/Storages/RabbitMQ/RabbitMQSource.cpp +++ b/src/Storages/RabbitMQ/RabbitMQSource.cpp @@ -165,21 +165,18 @@ Chunk RabbitMQSource::generateImpl() std::optional exception_message; size_t total_rows = 0; - auto on_error = [&](const MutableColumns & result_columns, Exception & e) + auto on_error = [&](const MutableColumns & result_columns, const ColumnCheckpoints & checkpoints, Exception & e) { if (handle_error_mode == StreamingHandleErrorMode::STREAM) { exception_message = e.message(); - for (const auto & column : result_columns) + for (size_t i = 0; i < result_columns.size(); ++i) { - // We could already push some rows to result_columns - // before exception, we need to fix it. - auto cur_rows = column->size(); - if (cur_rows > total_rows) - column->popBack(cur_rows - total_rows); + // We could already push some rows to result_columns before exception, we need to fix it. + result_columns[i]->rollback(*checkpoints[i]); // All data columns will get default value in case of error. 
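The ProjectionsDescription::calculate() hunk above clones the projection query and injects `WHERE _row_exists = 1` whenever the block carries the _row_exists column, so rows removed by lightweight DELETE do not leak into the recalculated projection. The toy illustration below shows the effect by aggregating only rows whose flag is set; the `Row` struct and the SUM-by-key "projection" are invented for the example.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// A source row as seen by a mutation: payload columns plus the _row_exists flag
// that lightweight DELETE uses to mark rows as removed.
struct Row
{
    std::string key;
    int64_t value = 0;
    uint8_t row_exists = 1;
};

// Compute a SUM-by-key "projection" block, skipping rows with _row_exists = 0,
// i.e. the equivalent of injecting `WHERE _row_exists = 1` before aggregation.
std::map<std::string, int64_t> calculateProjection(const std::vector<Row> & block)
{
    std::map<std::string, int64_t> sums;
    for (const auto & row : block)
        if (row.row_exists == 1)
            sums[row.key] += row.value;
    return sums;
}

int main()
{
    std::vector<Row> block = {{"a", 10, 1}, {"a", 5, 0 /* deleted */}, {"b", 7, 1}};
    for (const auto & [key, sum] : calculateProjection(block))
        std::printf("%s: %lld\n", key.c_str(), static_cast<long long>(sum));
    return 0;
}
```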
- column->insertDefault(); + result_columns[i]->insertDefault(); } return 1; diff --git a/src/Storages/RabbitMQ/StorageRabbitMQ.cpp b/src/Storages/RabbitMQ/StorageRabbitMQ.cpp index 0f3ac2d5289..3e922b541f7 100644 --- a/src/Storages/RabbitMQ/StorageRabbitMQ.cpp +++ b/src/Storages/RabbitMQ/StorageRabbitMQ.cpp @@ -1322,7 +1322,13 @@ void registerStorageRabbitMQ(StorageFactory & factory) return std::make_shared(args.table_id, args.getContext(), args.columns, args.comment, std::move(rabbitmq_settings), args.mode); }; - factory.registerStorage("RabbitMQ", creator_fn, StorageFactory::StorageFeatures{ .supports_settings = true, }); + factory.registerStorage( + "RabbitMQ", + creator_fn, + StorageFactory::StorageFeatures{ + .supports_settings = true, + .source_access_type = AccessType::RABBITMQ, + }); } } diff --git a/src/Storages/StorageKeeperMap.cpp b/src/Storages/StorageKeeperMap.cpp index 316eced1ed6..2a4a5f3370f 100644 --- a/src/Storages/StorageKeeperMap.cpp +++ b/src/Storages/StorageKeeperMap.cpp @@ -889,7 +889,7 @@ private: } }; - auto max_multiread_size = with_retries->getKeeperSettings().batch_size_for_keeper_multiread; + auto max_multiread_size = with_retries->getKeeperSettings().batch_size_for_multiread; auto keys_it = data_children.begin(); while (keys_it != data_children.end()) @@ -941,9 +941,8 @@ void StorageKeeperMap::backupData(BackupEntriesCollector & backup_entries_collec ( getLogger(fmt::format("StorageKeeperMapBackup ({})", getStorageID().getNameForLogs())), [&] { return getClient(); }, - WithRetries::KeeperSettings::fromContext(backup_entries_collector.getContext()), - backup_entries_collector.getContext()->getProcessListElement(), - [](WithRetries::FaultyKeeper &) {} + BackupKeeperSettings::fromContext(backup_entries_collector.getContext()), + backup_entries_collector.getContext()->getProcessListElement() ); backup_entries_collector.addBackupEntries( @@ -972,9 +971,8 @@ void StorageKeeperMap::restoreDataFromBackup(RestorerFromBackup & restorer, cons ( getLogger(fmt::format("StorageKeeperMapRestore ({})", getStorageID().getNameForLogs())), [&] { return getClient(); }, - WithRetries::KeeperSettings::fromContext(restorer.getContext()), - restorer.getContext()->getProcessListElement(), - [](WithRetries::FaultyKeeper &) {} + BackupKeeperSettings::fromContext(restorer.getContext()), + restorer.getContext()->getProcessListElement() ); bool allow_non_empty_tables = restorer.isNonEmptyTableAllowed(); @@ -1037,7 +1035,7 @@ void StorageKeeperMap::restoreDataImpl( CompressedReadBufferFromFile compressed_in{std::move(in_from_file)}; fs::path data_path_fs(zk_data_path); - auto max_multi_size = with_retries->getKeeperSettings().batch_size_for_keeper_multi; + auto max_multi_size = with_retries->getKeeperSettings().batch_size_for_multi; Coordination::Requests create_requests; const auto flush_create_requests = [&] diff --git a/src/Storages/StorageMergeTree.cpp b/src/Storages/StorageMergeTree.cpp index abc66df0d8b..40cd6e01dba 100644 --- a/src/Storages/StorageMergeTree.cpp +++ b/src/Storages/StorageMergeTree.cpp @@ -38,6 +38,7 @@ #include #include #include +#include namespace DB @@ -154,6 +155,7 @@ StorageMergeTree::StorageMergeTree( loadMutations(); loadDeduplicationLog(); + prewarmMarkCache(getActivePartsLoadingThreadPool().get()); } diff --git a/src/Storages/StorageReplicatedMergeTree.cpp b/src/Storages/StorageReplicatedMergeTree.cpp index 850623157a1..bbfedb2f355 100644 --- a/src/Storages/StorageReplicatedMergeTree.cpp +++ b/src/Storages/StorageReplicatedMergeTree.cpp @@ -103,6 +103,7 @@ 
#include #include +#include #include #include @@ -207,6 +208,7 @@ namespace MergeTreeSetting extern const MergeTreeSettingsBool use_minimalistic_checksums_in_zookeeper; extern const MergeTreeSettingsBool use_minimalistic_part_header_in_zookeeper; extern const MergeTreeSettingsMilliseconds wait_for_unique_parts_send_before_shutdown_ms; + extern const MergeTreeSettingsBool prewarm_mark_cache; } namespace FailPoints @@ -507,6 +509,7 @@ StorageReplicatedMergeTree::StorageReplicatedMergeTree( } loadDataParts(skip_sanity_checks, expected_parts_on_this_replica); + prewarmMarkCache(getActivePartsLoadingThreadPool().get()); if (LoadingStrictnessLevel::ATTACH <= mode) { @@ -5079,6 +5082,12 @@ bool StorageReplicatedMergeTree::fetchPart( ProfileEvents::increment(ProfileEvents::ObsoleteReplicatedParts); } + if ((*getSettings())[MergeTreeSetting::prewarm_mark_cache] && getContext()->getMarkCache()) + { + auto column_names = getColumnsToPrewarmMarks(*getSettings(), part->getColumns()); + part->loadMarksToCache(column_names, getContext()->getMarkCache().get()); + } + write_part_log({}); } else @@ -9891,7 +9900,14 @@ std::pair StorageReplicatedMergeTree::unlockSharedDataByID( } else if (error_code == Coordination::Error::ZNOTEMPTY) { - LOG_TRACE(logger, "Cannot remove last parent zookeeper lock {} for part {} with id {}, another replica locked part concurrently", zookeeper_part_uniq_node, part_name, part_id); + LOG_TRACE( + logger, + "Cannot remove last parent zookeeper lock {} for part {} with id {}, another replica locked part concurrently", + zookeeper_part_uniq_node, + part_name, + part_id); + part_has_no_more_locks = false; + continue; } else if (error_code == Coordination::Error::ZNONODE) { diff --git a/src/Storages/System/StorageSystemLicenses.sh b/src/Storages/System/StorageSystemLicenses.sh index 79f05d50d1d..becab852899 100755 --- a/src/Storages/System/StorageSystemLicenses.sh +++ b/src/Storages/System/StorageSystemLicenses.sh @@ -1,5 +1,7 @@ #!/usr/bin/env bash +set -e -o pipefail + ROOT_PATH="$(git rev-parse --show-toplevel)" IFS=$'\t' diff --git a/src/Storages/System/StorageSystemMergeTreeSettings.cpp b/src/Storages/System/StorageSystemMergeTreeSettings.cpp index 35d975216f6..1da4835dba5 100644 --- a/src/Storages/System/StorageSystemMergeTreeSettings.cpp +++ b/src/Storages/System/StorageSystemMergeTreeSettings.cpp @@ -1,4 +1,5 @@ -#include +#include +#include #include #include #include @@ -30,6 +31,14 @@ ColumnsDescription SystemMergeTreeSettings::getColumnsDescription() }, {"type", std::make_shared(), "Setting type (implementation specific string value)."}, {"is_obsolete", std::make_shared(), "Shows whether a setting is obsolete."}, + {"tier", getSettingsTierEnum(), R"( +Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their +development and the expectations one might have when using them: +* PRODUCTION: The feature is stable, safe to use and does not have issues interacting with other PRODUCTION features. +* BETA: The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome. +* EXPERIMENTAL: The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time. +* OBSOLETE: No longer supported. Either it is already removed or it will be removed in future releases. 
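The new `tier` column added to system.merge_tree_settings (and to system.settings further below) exposes the four support levels quoted in the description above. The following small sketch shows how such a tier might be modelled and rendered into a table row; the enum, toString() and the example setting name are illustrative only, not the actual tier implementation.

```cpp
#include <cstdio>
#include <string_view>

// The four support levels the new `tier` column documents.
enum class SettingTier
{
    PRODUCTION,
    BETA,
    EXPERIMENTAL,
    OBSOLETE,
};

constexpr std::string_view toString(SettingTier tier)
{
    switch (tier)
    {
        case SettingTier::PRODUCTION:   return "Production";
        case SettingTier::BETA:         return "Beta";
        case SettingTier::EXPERIMENTAL: return "Experimental";
        case SettingTier::OBSOLETE:     return "Obsolete";
    }
    return "Unknown";
}

int main()
{
    // A row of a hypothetical settings listing: name, value, tier.
    std::printf("%-40s %-10s %s\n", "some_experimental_setting", "0", toString(SettingTier::EXPERIMENTAL).data());
    return 0;
}
```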
+)"}, }; } diff --git a/src/Storages/System/StorageSystemResources.cpp b/src/Storages/System/StorageSystemResources.cpp new file mode 100644 index 00000000000..2f948b8e057 --- /dev/null +++ b/src/Storages/System/StorageSystemResources.cpp @@ -0,0 +1,71 @@ +#include +#include +#include +#include +#include +#include +#include + + +namespace DB +{ + +ColumnsDescription StorageSystemResources::getColumnsDescription() +{ + return ColumnsDescription + { + {"name", std::make_shared(), "The name of the resource."}, + {"read_disks", std::make_shared(std::make_shared()), "The list of disk names that uses this resource for read operations."}, + {"write_disks", std::make_shared(std::make_shared()), "The list of disk names that uses this resource for write operations."}, + {"create_query", std::make_shared(), "CREATE query of the resource."}, + }; +} + +void StorageSystemResources::fillData(MutableColumns & res_columns, ContextPtr context, const ActionsDAG::Node *, std::vector) const +{ + const auto & storage = context->getWorkloadEntityStorage(); + const auto & resource_names = storage.getAllEntityNames(WorkloadEntityType::Resource); + for (const auto & resource_name : resource_names) + { + auto ast = storage.get(resource_name); + auto & resource = typeid_cast(*ast); + res_columns[0]->insert(resource_name); + { + Array read_disks; + Array write_disks; + for (const auto & [mode, disk] : resource.operations) + { + switch (mode) + { + case DB::ASTCreateResourceQuery::AccessMode::Read: + { + read_disks.emplace_back(disk ? *disk : "ANY"); + break; + } + case DB::ASTCreateResourceQuery::AccessMode::Write: + { + write_disks.emplace_back(disk ? *disk : "ANY"); + break; + } + } + } + res_columns[1]->insert(read_disks); + res_columns[2]->insert(write_disks); + } + res_columns[3]->insert(queryToString(ast)); + } +} + +void StorageSystemResources::backupData(BackupEntriesCollector & /*backup_entries_collector*/, const String & /*data_path_in_backup*/, const std::optional & /* partitions */) +{ + // TODO(serxa): add backup for resources + // storage.backup(backup_entries_collector, data_path_in_backup); +} + +void StorageSystemResources::restoreDataFromBackup(RestorerFromBackup & /*restorer*/, const String & /*data_path_in_backup*/, const std::optional & /* partitions */) +{ + // TODO(serxa): add restore for resources + // storage.restore(restorer, data_path_in_backup); +} + +} diff --git a/src/Storages/System/StorageSystemResources.h b/src/Storages/System/StorageSystemResources.h new file mode 100644 index 00000000000..42bbcd09aa4 --- /dev/null +++ b/src/Storages/System/StorageSystemResources.h @@ -0,0 +1,29 @@ +#pragma once + +#include + + +namespace DB +{ + +class Context; + + +/// Implements `resources` system table, which allows you to get a list of all RESOURCEs +class StorageSystemResources final : public IStorageSystemOneBlock +{ +public: + std::string getName() const override { return "SystemResources"; } + + static ColumnsDescription getColumnsDescription(); + + void backupData(BackupEntriesCollector & backup_entries_collector, const String & data_path_in_backup, const std::optional & partitions) override; + void restoreDataFromBackup(RestorerFromBackup & restorer, const String & data_path_in_backup, const std::optional & partitions) override; + +protected: + using IStorageSystemOneBlock::IStorageSystemOneBlock; + + void fillData(MutableColumns & res_columns, ContextPtr context, const ActionsDAG::Node *, std::vector) const override; +}; + +} diff --git a/src/Storages/System/StorageSystemScheduler.cpp 
b/src/Storages/System/StorageSystemScheduler.cpp index b42c807d6fc..8784ba084ce 100644 --- a/src/Storages/System/StorageSystemScheduler.cpp +++ b/src/Storages/System/StorageSystemScheduler.cpp @@ -84,12 +84,12 @@ ColumnsDescription StorageSystemScheduler::getColumnsDescription() void StorageSystemScheduler::fillData(MutableColumns & res_columns, ContextPtr context, const ActionsDAG::Node *, std::vector) const { - context->getResourceManager()->forEachNode([&] (const String & resource, const String & path, const String & type, const SchedulerNodePtr & node) + context->getResourceManager()->forEachNode([&] (const String & resource, const String & path, ISchedulerNode * node) { size_t i = 0; res_columns[i++]->insert(resource); res_columns[i++]->insert(path); - res_columns[i++]->insert(type); + res_columns[i++]->insert(node->getTypeName()); res_columns[i++]->insert(node->info.weight); res_columns[i++]->insert(node->info.priority.value); res_columns[i++]->insert(node->isActive()); @@ -118,23 +118,23 @@ void StorageSystemScheduler::fillData(MutableColumns & res_columns, ContextPtr c if (auto * parent = dynamic_cast(node->parent)) { - if (auto value = parent->getChildVRuntime(node.get())) + if (auto value = parent->getChildVRuntime(node)) vruntime = *value; } - if (auto * ptr = dynamic_cast(node.get())) + if (auto * ptr = dynamic_cast(node)) system_vruntime = ptr->getSystemVRuntime(); - if (auto * ptr = dynamic_cast(node.get())) + if (auto * ptr = dynamic_cast(node)) std::tie(queue_length, queue_cost) = ptr->getQueueLengthAndCost(); - if (auto * ptr = dynamic_cast(node.get())) + if (auto * ptr = dynamic_cast(node)) budget = ptr->getBudget(); - if (auto * ptr = dynamic_cast(node.get())) + if (auto * ptr = dynamic_cast(node)) is_satisfied = ptr->isSatisfied(); - if (auto * ptr = dynamic_cast(node.get())) + if (auto * ptr = dynamic_cast(node)) { std::tie(inflight_requests, inflight_cost) = ptr->getInflights(); std::tie(max_requests, max_cost) = ptr->getLimits(); } - if (auto * ptr = dynamic_cast(node.get())) + if (auto * ptr = dynamic_cast(node)) { std::tie(max_speed, max_burst) = ptr->getParams(); throttling_us = ptr->getThrottlingDuration().count() / 1000; diff --git a/src/Storages/System/StorageSystemSettings.cpp b/src/Storages/System/StorageSystemSettings.cpp index 9309f10378e..debd40386a6 100644 --- a/src/Storages/System/StorageSystemSettings.cpp +++ b/src/Storages/System/StorageSystemSettings.cpp @@ -2,6 +2,8 @@ #include #include +#include +#include #include #include #include @@ -34,6 +36,14 @@ ColumnsDescription StorageSystemSettings::getColumnsDescription() {"default", std::make_shared(), "Setting default value."}, {"alias_for", std::make_shared(), "Flag that shows whether this name is an alias to another setting."}, {"is_obsolete", std::make_shared(), "Shows whether a setting is obsolete."}, + {"tier", getSettingsTierEnum(), R"( +Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their +development and the expectations one might have when using them: +* PRODUCTION: The feature is stable, safe to use and does not have issues interacting with other PRODUCTION features. +* BETA: The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome. +* EXPERIMENTAL: The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time. 
+* OBSOLETE: No longer supported. Either it is already removed or it will be removed in future releases. +)"}, }; } diff --git a/src/Storages/System/StorageSystemWorkloads.cpp b/src/Storages/System/StorageSystemWorkloads.cpp new file mode 100644 index 00000000000..ebb7e693e26 --- /dev/null +++ b/src/Storages/System/StorageSystemWorkloads.cpp @@ -0,0 +1,48 @@ +#include +#include +#include +#include +#include +#include + + +namespace DB +{ + +ColumnsDescription StorageSystemWorkloads::getColumnsDescription() +{ + return ColumnsDescription + { + {"name", std::make_shared(), "The name of the workload."}, + {"parent", std::make_shared(), "The name of the parent workload."}, + {"create_query", std::make_shared(), "CREATE query of the workload."}, + }; +} + +void StorageSystemWorkloads::fillData(MutableColumns & res_columns, ContextPtr context, const ActionsDAG::Node *, std::vector) const +{ + const auto & storage = context->getWorkloadEntityStorage(); + const auto & workload_names = storage.getAllEntityNames(WorkloadEntityType::Workload); + for (const auto & workload_name : workload_names) + { + auto ast = storage.get(workload_name); + auto & workload = typeid_cast(*ast); + res_columns[0]->insert(workload_name); + res_columns[1]->insert(workload.getWorkloadParent()); + res_columns[2]->insert(queryToString(ast)); + } +} + +void StorageSystemWorkloads::backupData(BackupEntriesCollector & /*backup_entries_collector*/, const String & /*data_path_in_backup*/, const std::optional & /* partitions */) +{ + // TODO(serxa): add backup for workloads + // storage.backup(backup_entries_collector, data_path_in_backup); +} + +void StorageSystemWorkloads::restoreDataFromBackup(RestorerFromBackup & /*restorer*/, const String & /*data_path_in_backup*/, const std::optional & /* partitions */) +{ + // TODO(serxa): add restore for workloads + // storage.restore(restorer, data_path_in_backup); +} + +} diff --git a/src/Storages/System/StorageSystemWorkloads.h b/src/Storages/System/StorageSystemWorkloads.h new file mode 100644 index 00000000000..9d4770a02b8 --- /dev/null +++ b/src/Storages/System/StorageSystemWorkloads.h @@ -0,0 +1,29 @@ +#pragma once + +#include + + +namespace DB +{ + +class Context; + + +/// Implements `workloads` system table, which allows you to get a list of all workloads +class StorageSystemWorkloads final : public IStorageSystemOneBlock +{ +public: + std::string getName() const override { return "SystemWorkloads"; } + + static ColumnsDescription getColumnsDescription(); + + void backupData(BackupEntriesCollector & backup_entries_collector, const String & data_path_in_backup, const std::optional & partitions) override; + void restoreDataFromBackup(RestorerFromBackup & restorer, const String & data_path_in_backup, const std::optional & partitions) override; + +protected: + using IStorageSystemOneBlock::IStorageSystemOneBlock; + + void fillData(MutableColumns & res_columns, ContextPtr context, const ActionsDAG::Node *, std::vector) const override; +}; + +} diff --git a/src/Storages/System/attachSystemTables.cpp b/src/Storages/System/attachSystemTables.cpp index 7c6dac7a608..0bd3369ff32 100644 --- a/src/Storages/System/attachSystemTables.cpp +++ b/src/Storages/System/attachSystemTables.cpp @@ -23,6 +23,8 @@ #include #include #include +#include +#include #include #include #include @@ -126,8 +128,6 @@ void attachSystemTablesServer(ContextPtr context, IDatabase & system_database, b attachNoDescription(context, system_database, "one", "This table contains a single row with a single dummy UInt8 column 
containing the value 0. Used when the table is not specified explicitly, for example in queries like `SELECT 1`."); attachNoDescription(context, system_database, "numbers", "Generates all natural numbers, starting from 0 (to 2^64 - 1, and then again) in sorted order.", false, "number"); attachNoDescription(context, system_database, "numbers_mt", "Multithreaded version of `system.numbers`. Numbers order is not guaranteed.", true, "number"); - attachNoDescription(context, system_database, "generate_series", "Generates arithmetic progression of natural numbers in sorted order in a given segment with a given step", false, "generate_series"); - attachNoDescription(context, system_database, "generateSeries", "Generates arithmetic progression of natural numbers in sorted order in a given segment with a given step", false, "generate_series"); attachNoDescription(context, system_database, "zeros", "Produces unlimited number of non-materialized zeros.", false); attachNoDescription(context, system_database, "zeros_mt", "Multithreaded version of system.zeros.", true); attach(context, system_database, "databases", "Lists all databases of the current server."); @@ -232,6 +232,8 @@ void attachSystemTablesServer(ContextPtr context, IDatabase & system_database, b attach>(context, system_database, "azure_queue_settings", "Contains a list of settings of AzureQueue tables."); attach(context, system_database, "dashboards", "Contains queries used by /dashboard page accessible though HTTP interface. This table can be useful for monitoring and troubleshooting. The table contains a row for every chart in a dashboard."); attach(context, system_database, "view_refreshes", "Lists all Refreshable Materialized Views of current server."); + attach(context, system_database, "workloads", "Contains a list of all currently existing workloads."); + attach(context, system_database, "resources", "Contains a list of all currently existing resources."); if (has_zookeeper) { diff --git a/tests/ci/.mypy.ini b/tests/ci/.mypy.ini index ecb4aef87dd..bb46a6d24b6 100644 --- a/tests/ci/.mypy.ini +++ b/tests/ci/.mypy.ini @@ -16,4 +16,3 @@ no_implicit_reexport = True strict_equality = True extra_checks = True ignore_missing_imports = True -logging-fstring-interpolation = False \ No newline at end of file diff --git a/tests/ci/artifactory.py b/tests/ci/artifactory.py index c66659d4e93..00a7eeebb35 100644 --- a/tests/ci/artifactory.py +++ b/tests/ci/artifactory.py @@ -200,6 +200,7 @@ class RpmArtifactory: ) _PROD_REPO_URL = "https://packages.clickhouse.com/rpm/clickhouse.repo" _SIGN_KEY = "885E2BDCF96B0B45ABF058453E4AD4719DDE9A38" + FEDORA_VERSION = 40 def __init__(self, release_info: ReleaseInfo, dry_run: bool): self.release_info = release_info @@ -249,16 +250,16 @@ class RpmArtifactory: Shell.check("sync") def test_packages(self): - Shell.check("docker pull fedora:latest", strict=True) + Shell.check(f"docker pull fedora:{self.FEDORA_VERSION}", strict=True) print(f"Test package installation, version [{self.version}]") rpm_command = f"dnf config-manager --add-repo={self.repo_url} && dnf makecache && dnf -y install clickhouse-client-{self.version}-1" - cmd = f'docker run --rm fedora:latest /bin/bash -c "dnf -y install dnf-plugins-core && dnf config-manager --add-repo={self.repo_url} && {rpm_command}"' + cmd = f'docker run --rm fedora:{self.FEDORA_VERSION} /bin/bash -c "dnf -y install dnf-plugins-core && dnf config-manager --add-repo={self.repo_url} && {rpm_command}"' print("Running test command:") print(f" {cmd}") assert Shell.check(cmd) 
print("Test package installation, version [latest]") rpm_command_2 = f"dnf config-manager --add-repo={self.repo_url} && dnf makecache && dnf -y install clickhouse-client" - cmd = f'docker run --rm fedora:latest /bin/bash -c "dnf -y install dnf-plugins-core && dnf config-manager --add-repo={self.repo_url} && {rpm_command_2}"' + cmd = f'docker run --rm fedora:{self.FEDORA_VERSION} /bin/bash -c "dnf -y install dnf-plugins-core && dnf config-manager --add-repo={self.repo_url} && {rpm_command_2}"' print("Running test command:") print(f" {cmd}") assert Shell.check(cmd) diff --git a/tests/ci/ci_config.py b/tests/ci/ci_config.py index b9885a89444..9f5d5f1983d 100644 --- a/tests/ci/ci_config.py +++ b/tests/ci/ci_config.py @@ -97,9 +97,9 @@ class CI: ), runner_type=Runners.BUILDER_ARM, ), - BuildNames.PACKAGE_AARCH64_ASAN: CommonJobConfigs.BUILD.with_properties( + BuildNames.PACKAGE_ARM_ASAN: CommonJobConfigs.BUILD.with_properties( build_config=BuildConfig( - name=BuildNames.PACKAGE_AARCH64_ASAN, + name=BuildNames.PACKAGE_ARM_ASAN, compiler="clang-18-aarch64", sanitizer="address", package_type="deb", @@ -283,6 +283,10 @@ class CI: JobNames.STATEFUL_TEST_ASAN: CommonJobConfigs.STATEFUL_TEST.with_properties( required_builds=[BuildNames.PACKAGE_ASAN] ), + JobNames.STATEFUL_TEST_ARM_ASAN: CommonJobConfigs.STATEFUL_TEST.with_properties( + required_builds=[BuildNames.PACKAGE_ARM_ASAN], + runner_type=Runners.FUNC_TESTER_ARM, + ), JobNames.STATEFUL_TEST_TSAN: CommonJobConfigs.STATEFUL_TEST.with_properties( required_builds=[BuildNames.PACKAGE_TSAN] ), @@ -331,6 +335,11 @@ class CI: JobNames.STATELESS_TEST_ASAN: CommonJobConfigs.STATELESS_TEST.with_properties( required_builds=[BuildNames.PACKAGE_ASAN], num_batches=2 ), + JobNames.STATELESS_TEST_ARM_ASAN: CommonJobConfigs.STATELESS_TEST.with_properties( + required_builds=[BuildNames.PACKAGE_ARM_ASAN], + num_batches=2, + runner_type=Runners.FUNC_TESTER_ARM, + ), JobNames.STATELESS_TEST_TSAN: CommonJobConfigs.STATELESS_TEST.with_properties( required_builds=[BuildNames.PACKAGE_TSAN], num_batches=4 ), diff --git a/tests/ci/ci_definitions.py b/tests/ci/ci_definitions.py index fc67959013b..dd86dc320c2 100644 --- a/tests/ci/ci_definitions.py +++ b/tests/ci/ci_definitions.py @@ -106,7 +106,7 @@ class BuildNames(metaclass=WithIter): PACKAGE_MSAN = "package_msan" PACKAGE_DEBUG = "package_debug" PACKAGE_AARCH64 = "package_aarch64" - PACKAGE_AARCH64_ASAN = "package_aarch64_asan" + PACKAGE_ARM_ASAN = "package_aarch64_asan" PACKAGE_RELEASE_COVERAGE = "package_release_coverage" BINARY_RELEASE = "binary_release" BINARY_TIDY = "binary_tidy" @@ -141,6 +141,7 @@ class JobNames(metaclass=WithIter): STATELESS_TEST_RELEASE_COVERAGE = "Stateless tests (coverage)" STATELESS_TEST_AARCH64 = "Stateless tests (aarch64)" STATELESS_TEST_ASAN = "Stateless tests (asan)" + STATELESS_TEST_ARM_ASAN = "Stateless tests (aarch64, asan)" STATELESS_TEST_TSAN = "Stateless tests (tsan)" STATELESS_TEST_MSAN = "Stateless tests (msan)" STATELESS_TEST_UBSAN = "Stateless tests (ubsan)" @@ -157,6 +158,7 @@ class JobNames(metaclass=WithIter): STATEFUL_TEST_RELEASE_COVERAGE = "Stateful tests (coverage)" STATEFUL_TEST_AARCH64 = "Stateful tests (aarch64)" STATEFUL_TEST_ASAN = "Stateful tests (asan)" + STATEFUL_TEST_ARM_ASAN = "Stateful tests (aarch64, asan)" STATEFUL_TEST_TSAN = "Stateful tests (tsan)" STATEFUL_TEST_MSAN = "Stateful tests (msan)" STATEFUL_TEST_UBSAN = "Stateful tests (ubsan)" @@ -241,7 +243,7 @@ class StatusNames(metaclass=WithIter): # mergeable status MERGEABLE = "Mergeable Check" # status of 
a sync pr - SYNC = "Cloud fork sync (only for ClickHouse Inc. employees)" + SYNC = "CH Inc sync" # PR formatting check status PR_CHECK = "PR Check" @@ -632,6 +634,8 @@ REQUIRED_CHECKS = [ JobNames.STATEFUL_TEST_RELEASE, JobNames.STATELESS_TEST_RELEASE, JobNames.STATELESS_TEST_ASAN, + JobNames.STATELESS_TEST_ARM_ASAN, + JobNames.STATEFUL_TEST_ARM_ASAN, JobNames.STATELESS_TEST_FLAKY_ASAN, JobNames.STATEFUL_TEST_ASAN, JobNames.STYLE_CHECK, diff --git a/tests/ci/test_ci_config.py b/tests/ci/test_ci_config.py index 29b184a4e61..0e396b827ea 100644 --- a/tests/ci/test_ci_config.py +++ b/tests/ci/test_ci_config.py @@ -36,7 +36,7 @@ class TestCIConfig(unittest.TestCase): elif "binary_" in job.lower() or "package_" in job.lower(): if job.lower() in ( CI.BuildNames.PACKAGE_AARCH64, - CI.BuildNames.PACKAGE_AARCH64_ASAN, + CI.BuildNames.PACKAGE_ARM_ASAN, ): self.assertTrue( CI.JOB_CONFIGS[job].runner_type in (CI.Runners.BUILDER_ARM,), @@ -95,69 +95,39 @@ class TestCIConfig(unittest.TestCase): self.assertTrue(CI.JOB_CONFIGS[job].required_builds is None) else: self.assertTrue(CI.JOB_CONFIGS[job].build_config is None) - if "asan" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_ASAN, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + if "asan" in job and "aarch" in job: + expected_builds = [CI.BuildNames.PACKAGE_ARM_ASAN] + elif "asan" in job: + expected_builds = [CI.BuildNames.PACKAGE_ASAN] elif "msan" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_MSAN, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_MSAN] elif "tsan" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_TSAN, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_TSAN] elif "ubsan" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_UBSAN, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_UBSAN] elif "debug" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_DEBUG, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_DEBUG] + elif job in ( + "Unit tests (release)", + "ClickHouse Keeper Jepsen", + "ClickHouse Server Jepsen", + ): + expected_builds = [CI.BuildNames.BINARY_RELEASE] elif "release" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - in ( - CI.BuildNames.PACKAGE_RELEASE, - CI.BuildNames.BINARY_RELEASE, - ), - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_RELEASE] elif "coverage" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_RELEASE_COVERAGE, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_RELEASE_COVERAGE] elif "aarch" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_AARCH64, - f"Job [{job}] probably has wrong required 
build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_AARCH64] elif "amd64" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_RELEASE, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_RELEASE] elif "uzzer" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] == CI.BuildNames.FUZZERS, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.FUZZERS] elif "Docker" in job: + expected_builds = [ + CI.BuildNames.PACKAGE_RELEASE, + CI.BuildNames.PACKAGE_AARCH64, + ] self.assertTrue( CI.JOB_CONFIGS[job].required_builds[0] in ( @@ -167,20 +137,12 @@ class TestCIConfig(unittest.TestCase): f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", ) elif "SQLTest" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - == CI.BuildNames.PACKAGE_RELEASE, - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [CI.BuildNames.PACKAGE_RELEASE] elif "Jepsen" in job: - self.assertTrue( - CI.JOB_CONFIGS[job].required_builds[0] - in ( - CI.BuildNames.PACKAGE_RELEASE, - CI.BuildNames.BINARY_RELEASE, - ), - f"Job [{job}] probably has wrong required build [{CI.JOB_CONFIGS[job].required_builds[0]}] in JobConfig", - ) + expected_builds = [ + CI.BuildNames.PACKAGE_RELEASE, + CI.BuildNames.BINARY_RELEASE, + ] elif job in ( CI.JobNames.STYLE_CHECK, CI.JobNames.FAST_TEST, @@ -188,9 +150,16 @@ class TestCIConfig(unittest.TestCase): CI.JobNames.DOCS_CHECK, CI.JobNames.BUGFIX_VALIDATE, ): - self.assertTrue(CI.JOB_CONFIGS[job].required_builds is None) + expected_builds = [] else: print(f"Job [{job}] required build not checked") + assert False + + self.assertCountEqual( + expected_builds, + CI.JOB_CONFIGS[job].required_builds or [], + f"Required builds are not valid for job [{job}]", + ) def test_job_stage_config(self): """ diff --git a/tests/clickhouse-test b/tests/clickhouse-test index 100a6358dcf..9c035b7cc35 100755 --- a/tests/clickhouse-test +++ b/tests/clickhouse-test @@ -920,6 +920,7 @@ class SettingsRandomizer: "optimize_functions_to_subcolumns": lambda: random.randint(0, 1), "parallel_replicas_local_plan": lambda: random.randint(0, 1), "output_format_native_write_json_as_string": lambda: random.randint(0, 1), + "enable_vertical_final": lambda: random.randint(0, 1), } @staticmethod diff --git a/tests/docker_scripts/stress_tests.lib b/tests/docker_scripts/stress_tests.lib index 3ab52c19dbd..5c346a2d17f 100644 --- a/tests/docker_scripts/stress_tests.lib +++ b/tests/docker_scripts/stress_tests.lib @@ -263,8 +263,12 @@ function check_logs_for_critical_errors() # Remove file logical_errors.txt if it's empty [ -s /test_output/logical_errors.txt ] || rm /test_output/logical_errors.txt - # No such key errors (ignore a.myext which is used in 02724_database_s3.sh and does not exist) - rg --text "Code: 499.*The specified key does not exist" /var/log/clickhouse-server/clickhouse-server*.log | grep -v "a.myext" > /test_output/no_such_key_errors.txt \ + # ignore: + # - a.myext which is used in 02724_database_s3.sh and does not exist + # - "DistributedCacheTCPHandler" and "caller id: None:DistribCache" because they happen inside distributed cache server + # - 
"ReadBufferFromDistributedCache", "AsynchronousBoundedReadBuffer", "ReadBufferFromS3", "ReadBufferFromAzureBlobStorage" + # exceptions printed internally by a buffer, exception will be rethrown and handled correctly + rg --text "Code: 499.*The specified key does not exist" /var/log/clickhouse-server/clickhouse-server*.log | grep -v -e "a.myext" -e "DistributedCacheTCPHandler" -e "ReadBufferFromDistributedCache" -e "ReadBufferFromS3" -e "ReadBufferFromAzureBlobStorage" -e "AsynchronousBoundedReadBuffer" -e "caller id: None:DistribCache" > /test_output/no_such_key_errors.txt \ && echo -e "S3_ERROR No such key thrown (see clickhouse-server.log or no_such_key_errors.txt)$FAIL$(trim_server_logs no_such_key_errors.txt)" >> /test_output/test_results.tsv \ || echo -e "No lost s3 keys$OK" >> /test_output/test_results.tsv diff --git a/tests/integration/helpers/cluster.py b/tests/integration/helpers/cluster.py index 3c92df51ac4..7c531cdd493 100644 --- a/tests/integration/helpers/cluster.py +++ b/tests/integration/helpers/cluster.py @@ -83,6 +83,8 @@ CLICKHOUSE_ERROR_LOG_FILE = "/var/log/clickhouse-server/clickhouse-server.err.lo # This means that this minimum need to be, at least, 1 year older than the current release CLICKHOUSE_CI_MIN_TESTED_VERSION = "23.3" +ZOOKEEPER_CONTAINERS = ("zoo1", "zoo2", "zoo3") + # to create docker-compose env file def _create_env_file(path, variables): @@ -2061,6 +2063,11 @@ class ClickHouseCluster: container_id = self.get_container_id(instance_name) return self.docker_client.api.logs(container_id).decode() + def query_zookeeper(self, query, node=ZOOKEEPER_CONTAINERS[0], nothrow=False): + cmd = f'clickhouse keeper-client -p {self.zookeeper_port} -q "{query}"' + container_id = self.get_container_id(node) + return self.exec_in_container(container_id, cmd, nothrow=nothrow, use_cli=False) + def exec_in_container( self, container_id: str, @@ -2125,6 +2132,16 @@ class ClickHouseCluster: ], ) + def remove_file_from_container(self, container_id, path): + self.exec_in_container( + container_id, + [ + "bash", + "-c", + "rm {}".format(path), + ], + ) + def wait_for_url( self, url="http://localhost:8123/ping", conn_timeout=2, interval=2, timeout=60 ): @@ -2352,7 +2369,7 @@ class ClickHouseCluster: time.sleep(0.5) raise Exception("Cannot wait PostgreSQL Java Client container") - def wait_rabbitmq_to_start(self, timeout=60): + def wait_rabbitmq_to_start(self, timeout=120): self.print_all_docker_pieces() self.rabbitmq_ip = self.get_instance_ip(self.rabbitmq_host) @@ -2391,16 +2408,16 @@ class ClickHouseCluster: def wait_zookeeper_secure_to_start(self, timeout=20): logging.debug("Wait ZooKeeper Secure to start") - nodes = ["zoo1", "zoo2", "zoo3"] - self.wait_zookeeper_nodes_to_start(nodes, timeout) + self.wait_zookeeper_nodes_to_start(ZOOKEEPER_CONTAINERS, timeout) def wait_zookeeper_to_start(self, timeout: float = 180) -> None: logging.debug("Wait ZooKeeper to start") - nodes = ["zoo1", "zoo2", "zoo3"] - self.wait_zookeeper_nodes_to_start(nodes, timeout) + self.wait_zookeeper_nodes_to_start(ZOOKEEPER_CONTAINERS, timeout) def wait_zookeeper_nodes_to_start( - self, nodes: List[str], timeout: float = 60 + self, + nodes: List[str], + timeout: float = 60, ) -> None: start = time.time() err = Exception("") @@ -3226,7 +3243,11 @@ class ClickHouseCluster: return zk def run_kazoo_commands_with_retries( - self, kazoo_callback, zoo_instance_name="zoo1", repeats=1, sleep_for=1 + self, + kazoo_callback, + zoo_instance_name=ZOOKEEPER_CONTAINERS[0], + repeats=1, + sleep_for=1, ): zk = 
self.get_kazoo_client(zoo_instance_name) logging.debug( @@ -3944,11 +3965,11 @@ class ClickHouseInstance: ) logging.info(f"PS RESULT:\n{ps_clickhouse}") pid = self.get_process_pid("clickhouse") - if pid is not None: - self.exec_in_container( - ["bash", "-c", f"gdb -batch -ex 'thread apply all bt full' -p {pid}"], - user="root", - ) + # if pid is not None: + # self.exec_in_container( + # ["bash", "-c", f"gdb -batch -ex 'thread apply all bt full' -p {pid}"], + # user="root", + # ) if last_err is not None: raise last_err @@ -4128,6 +4149,9 @@ class ClickHouseInstance: self.docker_id, local_path, dest_path ) + def remove_file_from_container(self, path): + return self.cluster.remove_file_from_container(self.docker_id, path) + def get_process_pid(self, process_name): output = self.exec_in_container( [ @@ -4648,9 +4672,7 @@ class ClickHouseInstance: depends_on.append("nats1") if self.with_zookeeper: - depends_on.append("zoo1") - depends_on.append("zoo2") - depends_on.append("zoo3") + depends_on += list(ZOOKEEPER_CONTAINERS) if self.with_minio: depends_on.append("minio1") diff --git a/tests/integration/helpers/config_manager.py b/tests/integration/helpers/config_manager.py new file mode 100644 index 00000000000..0a080a33477 --- /dev/null +++ b/tests/integration/helpers/config_manager.py @@ -0,0 +1,65 @@ +import os + + +class ConfigManager: + """Allows to temporarily add configuration files to the "config.d" or "users.d" directories. + + Can act as a context manager: + + with ConfigManager() as cm: + cm.add_main_config("configs/test_specific_config.xml") # copy "configs/test_specific_config.xml" to "/etc/clickhouse-server/config.d" + ... + # "/etc/clickhouse-server/config.d/test_specific_config.xml" is removed automatically + + """ + + def __init__(self): + self.__added_configs = [] + + def add_main_config(self, node_or_nodes, local_path, reload_config=True): + """Temporarily adds a configuration file to the "config.d" directory.""" + self.__add_config( + node_or_nodes, local_path, dest_dir="config.d", reload_config=reload_config + ) + + def add_user_config(self, node_or_nodes, local_path, reload_config=True): + """Temporarily adds a configuration file to the "users.d" directory.""" + self.__add_config( + node_or_nodes, local_path, dest_dir="users.d", reload_config=reload_config + ) + + def reset(self, reload_config=True): + """Removes all configuration files added by this ConfigManager.""" + if not self.__added_configs: + return + for node, dest_path in self.__added_configs: + node.remove_file_from_container(dest_path) + if reload_config: + for node, _ in self.__added_configs: + node.query("SYSTEM RELOAD CONFIG") + self.__added_configs = [] + + def __add_config(self, node_or_nodes, local_path, dest_dir, reload_config): + nodes_to_add_config = ( + node_or_nodes if (type(node_or_nodes) is list) else [node_or_nodes] + ) + for node in nodes_to_add_config: + src_path = os.path.join(node.cluster.base_dir, local_path) + dest_path = os.path.join( + "/etc/clickhouse-server", dest_dir, os.path.basename(local_path) + ) + node.copy_file_to_container(src_path, dest_path) + if reload_config: + for node in nodes_to_add_config: + node.query("SYSTEM RELOAD CONFIG") + for node in nodes_to_add_config: + dest_path = os.path.join( + "/etc/clickhouse-server", dest_dir, os.path.basename(local_path) + ) + self.__added_configs.append((node, dest_path)) + + def __enter__(self): + return self + + def __exit__(self, type, value, traceback): + self.reset() diff --git a/tests/integration/parallel_skip.json 
b/tests/integration/parallel_skip.json index 507894534d4..d293cae4dfd 100644 --- a/tests/integration/parallel_skip.json +++ b/tests/integration/parallel_skip.json @@ -170,6 +170,18 @@ "test_storage_kerberized_kafka/test.py::test_kafka_json_as_string", "test_storage_kerberized_kafka/test.py::test_kafka_json_as_string_request_new_ticket_after_expiration", "test_storage_kerberized_kafka/test.py::test_kafka_json_as_string_no_kdc", - "test_storage_kerberized_kafka/test.py::test_kafka_config_from_sql_named_collection" + "test_storage_kerberized_kafka/test.py::test_kafka_config_from_sql_named_collection", + "test_dns_cache/test.py::test_ip_change_drop_dns_cache", + "test_dns_cache/test.py::test_ip_change_update_dns_cache", + "test_dns_cache/test.py::test_dns_cache_update", + "test_dns_cache/test.py::test_user_access_ip_change", + "test_dns_cache/test.py::test_host_is_drop_from_cache_after_consecutive_failures", + "test_dns_cache/test.py::test_dns_resolver_filter", + + "test_https_replication/test_change_ip.py::test_replication_when_node_ip_changed", + + "test_host_regexp_multiple_ptr_records/test.py::test_host_regexp_multiple_ptr_v4_fails_with_wrong_resolution", + "test_host_regexp_multiple_ptr_records/test.py::test_host_regexp_multiple_ptr_v4", + "test_host_regexp_multiple_ptr_records/test.py::test_host_regexp_multiple_ptr_v6" ] diff --git a/tests/integration/test_async_load_databases/configs/async_load_system_database.xml b/tests/integration/test_async_load_databases/configs/async_load_system_database.xml new file mode 100644 index 00000000000..79823f5fbee --- /dev/null +++ b/tests/integration/test_async_load_databases/configs/async_load_system_database.xml @@ -0,0 +1,3 @@ + + true + diff --git a/tests/integration/test_async_load_databases/test.py b/tests/integration/test_async_load_databases/test.py index 7fc6fd222d1..acd3ef7455b 100644 --- a/tests/integration/test_async_load_databases/test.py +++ b/tests/integration/test_async_load_databases/test.py @@ -1,4 +1,5 @@ import random +import time import pytest @@ -13,25 +14,35 @@ DICTIONARY_FILES = [ ] cluster = ClickHouseCluster(__file__) -instance = cluster.add_instance( - "instance", +node1 = cluster.add_instance( + "node1", main_configs=["configs/config.xml"], dictionaries=DICTIONARY_FILES, stay_alive=True, ) +node2 = cluster.add_instance( + "node2", + main_configs=[ + "configs/async_load_system_database.xml", + ], + dictionaries=DICTIONARY_FILES, + stay_alive=True, +) + @pytest.fixture(scope="module") def started_cluster(): try: cluster.start() - instance.query( - """ - CREATE DATABASE IF NOT EXISTS dict ENGINE=Dictionary; - CREATE DATABASE IF NOT EXISTS test; - """ - ) + for node in [node1, node2]: + node.query( + """ + CREATE DATABASE IF NOT EXISTS dict ENGINE=Dictionary; + CREATE DATABASE IF NOT EXISTS test; + """ + ) yield cluster @@ -40,13 +51,13 @@ def started_cluster(): def get_status(dictionary_name): - return instance.query( + return node1.query( "SELECT status FROM system.dictionaries WHERE name='" + dictionary_name + "'" ).rstrip("\n") def test_dict_get_data(started_cluster): - query = instance.query + query = node1.query query( "CREATE TABLE test.elements (id UInt64, a String, b Int32, c Float64) ENGINE=Log;" @@ -80,7 +91,7 @@ def test_dict_get_data(started_cluster): # Wait for dictionaries to be reloaded. 
assert_eq_with_retry( - instance, + node1, "SELECT dictHas('dep_x', toUInt64(3))", "1", sleep_time=2, @@ -94,7 +105,7 @@ def test_dict_get_data(started_cluster): # so dep_x and dep_z are not going to be updated after the following INSERT. query("INSERT INTO test.elements VALUES (4, 'ether', 404, 0.001)") assert_eq_with_retry( - instance, + node1, "SELECT dictHas('dep_y', toUInt64(4))", "1", sleep_time=2, @@ -104,11 +115,11 @@ def test_dict_get_data(started_cluster): assert query("SELECT dictGetString('dep_y', 'a', toUInt64(4))") == "ether\n" assert query("SELECT dictGetString('dep_z', 'a', toUInt64(4))") == "ZZ\n" query("DROP TABLE IF EXISTS test.elements;") - instance.restart_clickhouse() + node1.restart_clickhouse() def dependent_tables_assert(): - res = instance.query("select database || '.' || name from system.tables") + res = node1.query("select database || '.' || name from system.tables") assert "system.join" in res assert "default.src" in res assert "dict.dep_y" in res @@ -119,7 +130,7 @@ def dependent_tables_assert(): def test_dependent_tables(started_cluster): - query = instance.query + query = node1.query query("create database lazy engine=Lazy(10)") query("create database a") query("create table lazy.src (n int, m int) engine=Log") @@ -157,7 +168,7 @@ def test_dependent_tables(started_cluster): ) dependent_tables_assert() - instance.restart_clickhouse() + node1.restart_clickhouse() dependent_tables_assert() query("drop table a.t") query("drop table lazy.log") @@ -170,14 +181,14 @@ def test_dependent_tables(started_cluster): def test_multiple_tables(started_cluster): - query = instance.query + query = node1.query tables_count = 20 for i in range(tables_count): query( f"create table test.table_{i} (n UInt64, s String) engine=MergeTree order by n as select number, randomString(100) from numbers(100)" ) - instance.restart_clickhouse() + node1.restart_clickhouse() order = [i for i in range(tables_count)] random.shuffle(order) @@ -185,3 +196,49 @@ def test_multiple_tables(started_cluster): assert query(f"select count() from test.table_{i}") == "100\n" for i in range(tables_count): query(f"drop table test.table_{i} sync") + + +def test_async_load_system_database(started_cluster): + id = 1 + for i in range(4): + # Access some system tables that might be still loading + if id > 1: + for j in range(3): + node2.query( + f"select count() from system.text_log_{random.randint(1, id - 1)}_test" + ) + node2.query( + f"select count() from system.query_log_{random.randint(1, id - 1)}_test" + ) + + assert ( + int( + node2.query( + f"select count() from system.asynchronous_loader where job ilike '%_log_%_test' and execution_pool = 'BackgroundLoad'" + ) + ) + > 0 + ) + + # Generate more system tables + for j in range(10): + while True: + node2.query("system flush logs") + count = int( + node2.query( + "select count() from system.tables where database = 'system' and name in ['query_log', 'text_log']" + ) + ) + if count == 2: + break + time.sleep(0.1) + node2.query(f"rename table system.text_log to system.text_log_{id}_test") + node2.query(f"rename table system.query_log to system.query_log_{id}_test") + id += 1 + + # Trigger async load of system database + node2.restart_clickhouse() + + for i in range(id - 1): + node2.query(f"drop table if exists system.text_log_{i + 1}_test") + node2.query(f"drop table if exists system.query_log_{i + 1}_test") diff --git a/tests/integration/test_backup_restore_on_cluster/configs/cluster_different_versions.xml 
b/tests/integration/test_backup_restore_on_cluster/configs/cluster_different_versions.xml new file mode 100644 index 00000000000..f70b255da18 --- /dev/null +++ b/tests/integration/test_backup_restore_on_cluster/configs/cluster_different_versions.xml @@ -0,0 +1,16 @@ + + + + + + new_node + 9000 + + + old_node + 9000 + + + + + diff --git a/tests/integration/test_backup_restore_on_cluster/configs/faster_zk_disconnect_detect.xml b/tests/integration/test_backup_restore_on_cluster/configs/faster_zk_disconnect_detect.xml new file mode 100644 index 00000000000..cfc6672ede4 --- /dev/null +++ b/tests/integration/test_backup_restore_on_cluster/configs/faster_zk_disconnect_detect.xml @@ -0,0 +1,12 @@ + + + + zoo1 + 2181 + + 500 + 0 + 1000 + 5000 + + diff --git a/tests/integration/test_backup_restore_on_cluster/configs/lesser_timeouts.xml b/tests/integration/test_backup_restore_on_cluster/configs/lesser_timeouts.xml index 0886f4bc722..38947be6a5d 100644 --- a/tests/integration/test_backup_restore_on_cluster/configs/lesser_timeouts.xml +++ b/tests/integration/test_backup_restore_on_cluster/configs/lesser_timeouts.xml @@ -1,6 +1,6 @@ - 1000 + 1000 10000 3000 diff --git a/tests/integration/test_backup_restore_on_cluster/configs/shutdown_cancel_backups.xml b/tests/integration/test_backup_restore_on_cluster/configs/shutdown_cancel_backups.xml new file mode 100644 index 00000000000..e0c0e0b32cd --- /dev/null +++ b/tests/integration/test_backup_restore_on_cluster/configs/shutdown_cancel_backups.xml @@ -0,0 +1,3 @@ + + false + diff --git a/tests/integration/test_backup_restore_on_cluster/configs/slow_backups.xml b/tests/integration/test_backup_restore_on_cluster/configs/slow_backups.xml new file mode 100644 index 00000000000..933c3250054 --- /dev/null +++ b/tests/integration/test_backup_restore_on_cluster/configs/slow_backups.xml @@ -0,0 +1,7 @@ + + + true + + 12 + 2 + diff --git a/tests/integration/test_backup_restore_on_cluster/configs/zookeeper_retries.xml b/tests/integration/test_backup_restore_on_cluster/configs/zookeeper_retries.xml index 1283f28a8cb..7af54d2dd95 100644 --- a/tests/integration/test_backup_restore_on_cluster/configs/zookeeper_retries.xml +++ b/tests/integration/test_backup_restore_on_cluster/configs/zookeeper_retries.xml @@ -1,9 +1,12 @@ - 1000 - 1 - 1 + 50 + 100 + 1000 + 10 + 2 + 3 42 0.002 diff --git a/tests/integration/test_backup_restore_on_cluster/test.py b/tests/integration/test_backup_restore_on_cluster/test.py index a1082c563d1..4d4fe0e665a 100644 --- a/tests/integration/test_backup_restore_on_cluster/test.py +++ b/tests/integration/test_backup_restore_on_cluster/test.py @@ -1153,7 +1153,7 @@ def test_get_error_from_other_host(): node1.query("INSERT INTO tbl VALUES (3)") backup_name = new_backup_name() - expected_error = "Got error from node2.*Table default.tbl was not found" + expected_error = "Got error from host node2.*Table default.tbl was not found" assert re.search( expected_error, node1.query_and_get_error( @@ -1162,8 +1162,7 @@ def test_get_error_from_other_host(): ) -@pytest.mark.parametrize("kill", [False, True]) -def test_stop_other_host_during_backup(kill): +def test_shutdown_waits_for_backup(): node1.query( "CREATE TABLE tbl ON CLUSTER 'cluster' (" "x UInt8" @@ -1182,7 +1181,7 @@ def test_stop_other_host_during_backup(kill): # If kill=False the pending backup must be completed # If kill=True the pending backup might be completed or failed - node2.stop_clickhouse(kill=kill) + node2.stop_clickhouse(kill=False) assert_eq_with_retry( node1, @@ -1192,22 +1191,11 @@ def 
test_stop_other_host_during_backup(kill): ) status = node1.query(f"SELECT status FROM system.backups WHERE id='{id}'").strip() - - if kill: - expected_statuses = ["BACKUP_CREATED", "BACKUP_FAILED"] - else: - expected_statuses = ["BACKUP_CREATED", "BACKUP_CANCELLED"] - - assert status in expected_statuses + assert status == "BACKUP_CREATED" node2.start_clickhouse() - if status == "BACKUP_CREATED": - node1.query("DROP TABLE tbl ON CLUSTER 'cluster' SYNC") - node1.query(f"RESTORE TABLE tbl ON CLUSTER 'cluster' FROM {backup_name}") - node1.query("SYSTEM SYNC REPLICA tbl") - assert node1.query("SELECT * FROM tbl ORDER BY x") == TSV([3, 5]) - elif status == "BACKUP_FAILED": - assert not os.path.exists( - os.path.join(get_path_to_backup(backup_name), ".backup") - ) + node1.query("DROP TABLE tbl ON CLUSTER 'cluster' SYNC") + node1.query(f"RESTORE TABLE tbl ON CLUSTER 'cluster' FROM {backup_name}") + node1.query("SYSTEM SYNC REPLICA tbl") + assert node1.query("SELECT * FROM tbl ORDER BY x") == TSV([3, 5]) diff --git a/tests/integration/test_backup_restore_on_cluster/test_cancel_backup.py b/tests/integration/test_backup_restore_on_cluster/test_cancel_backup.py new file mode 100644 index 00000000000..f63dc2aef3d --- /dev/null +++ b/tests/integration/test_backup_restore_on_cluster/test_cancel_backup.py @@ -0,0 +1,780 @@ +import os +import random +import time +import uuid + +import pytest + +from helpers.cluster import ClickHouseCluster +from helpers.config_manager import ConfigManager +from helpers.network import PartitionManager +from helpers.test_tools import TSV + +cluster = ClickHouseCluster(__file__) + +main_configs = [ + "configs/backups_disk.xml", + "configs/cluster.xml", + "configs/lesser_timeouts.xml", # Default timeouts are quite big (a few minutes), the tests don't need them to be that big. + "configs/slow_backups.xml", + "configs/shutdown_cancel_backups.xml", +] + +user_configs = [ + "configs/zookeeper_retries.xml", +] + +node1 = cluster.add_instance( + "node1", + main_configs=main_configs, + user_configs=user_configs, + external_dirs=["/backups/"], + macros={"replica": "node1", "shard": "shard1"}, + with_zookeeper=True, + stay_alive=True, # Necessary for "test_shutdown_cancel_backup" +) + +node2 = cluster.add_instance( + "node2", + main_configs=main_configs, + user_configs=user_configs, + external_dirs=["/backups/"], + macros={"replica": "node2", "shard": "shard1"}, + with_zookeeper=True, + stay_alive=True, # Necessary for "test_shutdown_cancel_backup" +) + +nodes = [node1, node2] + + +@pytest.fixture(scope="module", autouse=True) +def start_cluster(): + try: + cluster.start() + yield cluster + finally: + cluster.shutdown() + + +@pytest.fixture(autouse=True) +def cleanup_after_test(): + try: + yield + finally: + node1.query("DROP TABLE IF EXISTS tbl ON CLUSTER 'cluster' SYNC") + + +# Utilities + + +# Gets a printable version the name of a node. +def get_node_name(node): + return "node1" if (node == node1) else "node2" + + +# Choose a random instance. +def random_node(): + return random.choice(nodes) + + +# Makes table "tbl" and fill it with data. +def create_and_fill_table(node, num_parts=10, on_cluster=True): + # We use partitioning to make sure there will be more files in a backup. 
+ partition_by_clause = " PARTITION BY x%" + str(num_parts) if num_parts > 1 else "" + node.query( + "CREATE TABLE tbl " + + ("ON CLUSTER 'cluster' " if on_cluster else "") + + "(x UInt64) ENGINE=ReplicatedMergeTree('/clickhouse/tables/tbl/', '{replica}') " + + "ORDER BY tuple()" + + partition_by_clause + ) + if num_parts > 0: + node.query(f"INSERT INTO tbl SELECT number FROM numbers({num_parts})") + + +# Generates an ID suitable both as backup id or restore id. +def random_id(): + return uuid.uuid4().hex + + +# Generates a backup name prepared for using in BACKUP and RESTORE queries. +def get_backup_name(backup_id): + return f"Disk('backups', '{backup_id}')" + + +# Reads the status of a backup or a restore from system.backups. +def get_status(initiator, backup_id=None, restore_id=None): + id = backup_id if backup_id is not None else restore_id + return initiator.query(f"SELECT status FROM system.backups WHERE id='{id}'").rstrip( + "\n" + ) + + +# Reads the error message of a failed backup or a failed restore from system.backups. +def get_error(initiator, backup_id=None, restore_id=None): + id = backup_id if backup_id is not None else restore_id + return initiator.query(f"SELECT error FROM system.backups WHERE id='{id}'").rstrip( + "\n" + ) + + +# Waits until the status of a backup or a restore becomes a desired one. +# Returns how many seconds the function was waiting. +def wait_status( + initiator, + status="BACKUP_CREATED", + backup_id=None, + restore_id=None, + timeout=None, +): + print(f"Waiting for status {status}") + id = backup_id if backup_id is not None else restore_id + operation_name = "backup" if backup_id is not None else "restore" + current_status = get_status(initiator, backup_id=backup_id, restore_id=restore_id) + waited = 0 + while ( + (current_status != status) + and (current_status in ["CREATING_BACKUP", "RESTORING"]) + and ((timeout is None) or (waited < timeout)) + ): + sleep_time = 1 if (timeout is None) else min(1, timeout - waited) + time.sleep(sleep_time) + waited += sleep_time + current_status = get_status( + initiator, backup_id=backup_id, restore_id=restore_id + ) + start_time, end_time = ( + initiator.query( + f"SELECT start_time, end_time FROM system.backups WHERE id='{id}'" + ) + .splitlines()[0] + .split("\t") + ) + print( + f"{get_node_name(initiator)} : Got status {current_status} for {operation_name} {id} after waiting {waited} seconds " + f"(start_time = {start_time}, end_time = {end_time})" + ) + assert current_status == status + + +# Returns how many entries are in system.processes corresponding to a specified backup or restore. +def get_num_system_processes( + node_or_nodes, backup_id=None, restore_id=None, is_initial_query=None +): + id = backup_id if backup_id is not None else restore_id + query_kind = "Backup" if backup_id is not None else "Restore" + total = 0 + filter_for_is_initial_query = ( + f" AND (is_initial_query = {is_initial_query})" + if is_initial_query is not None + else "" + ) + nodes_to_consider = ( + node_or_nodes if (type(node_or_nodes) is list) else [node_or_nodes] + ) + for node in nodes_to_consider: + count = int( + node.query( + f"SELECT count() FROM system.processes WHERE (query_kind='{query_kind}') AND (query LIKE '%{id}%'){filter_for_is_initial_query}" + ) + ) + total += count + return total + + +# Waits until the number of entries in system.processes corresponding to a specified backup or restore becomes a desired one. +# Returns how many seconds the function was waiting. 
+def wait_num_system_processes( + node_or_nodes, + num_system_processes=0, + backup_id=None, + restore_id=None, + is_initial_query=None, + timeout=None, +): + print(f"Waiting for number of system processes = {num_system_processes}") + id = backup_id if backup_id is not None else restore_id + operation_name = "backup" if backup_id is not None else "restore" + current_count = get_num_system_processes( + node_or_nodes, + backup_id=backup_id, + restore_id=restore_id, + is_initial_query=is_initial_query, + ) + + def is_current_count_ok(): + return (current_count == num_system_processes) or ( + num_system_processes == "1+" and current_count >= 1 + ) + + waited = 0 + while not is_current_count_ok() and ((timeout is None) or (waited < timeout)): + sleep_time = 1 if (timeout is None) else min(1, timeout - waited) + time.sleep(sleep_time) + waited += sleep_time + current_count = get_num_system_processes( + node_or_nodes, + backup_id=backup_id, + restore_id=restore_id, + is_initial_query=is_initial_query, + ) + if is_current_count_ok(): + print( + f"Got {current_count} system processes for {operation_name} {id} after waiting {waited} seconds" + ) + else: + nodes_to_consider = ( + node_or_nodes if (type(node_or_nodes) is list) else [node_or_nodes] + ) + for node in nodes_to_consider: + count = get_num_system_processes( + node, backup_id=backup_id, restore_id=restore_id + ) + print( + f"{get_node_name(node)}: Got {count} system processes for {operation_name} {id} after waiting {waited} seconds" + ) + assert False + return waited + + +# Kills a BACKUP or RESTORE query. +# Returns how many seconds the KILL QUERY was executing. +def kill_query( + node, backup_id=None, restore_id=None, is_initial_query=None, timeout=None +): + id = backup_id if backup_id is not None else restore_id + query_kind = "Backup" if backup_id is not None else "Restore" + operation_name = "backup" if backup_id is not None else "restore" + print(f"{get_node_name(node)}: Cancelling {operation_name} {id}") + filter_for_is_initial_query = ( + f" AND (is_initial_query = {is_initial_query})" + if is_initial_query is not None + else "" + ) + node.query( + f"KILL QUERY WHERE (query_kind='{query_kind}') AND (query LIKE '%{id}%'){filter_for_is_initial_query} SYNC" + ) + node.query("SYSTEM FLUSH LOGS") + duration = ( + int( + node.query( + f"SELECT query_duration_ms FROM system.query_log WHERE query_kind='KillQuery' AND query LIKE '%{id}%' AND type='QueryFinish'" + ) + ) + / 1000 + ) + print( + f"{get_node_name(node)}: Cancelled {operation_name} {id} after {duration} seconds" + ) + if timeout is not None: + assert duration < timeout + + +# Stops all ZooKeeper servers. +def stop_zookeeper_servers(zoo_nodes): + print(f"Stopping ZooKeeper servers {zoo_nodes}") + old_time = time.monotonic() + cluster.stop_zookeeper_nodes(zoo_nodes) + print( + f"Stopped ZooKeeper servers {zoo_nodes} in {time.monotonic() - old_time} seconds" + ) + + +# Starts all ZooKeeper servers back. +def start_zookeeper_servers(zoo_nodes): + print(f"Starting ZooKeeper servers {zoo_nodes}") + old_time = time.monotonic() + cluster.start_zookeeper_nodes(zoo_nodes) + print( + f"Started ZooKeeper servers {zoo_nodes} in {time.monotonic() - old_time} seconds" + ) + + +# Sleeps for random amount of time. 
+def random_sleep(max_seconds): + if random.randint(0, 5) > 0: + sleep(random.uniform(0, max_seconds)) + + +def sleep(seconds): + print(f"Sleeping {seconds} seconds") + time.sleep(seconds) + + +# Checks that BACKUP and RESTORE cleaned up properly with no trash left in ZooKeeper, backups folder, and logs. +class NoTrashChecker: + def __init__(self): + self.expect_backups = [] + self.expect_unfinished_backups = [] + self.expect_errors = [] + self.allow_errors = [] + self.check_zookeeper = True + + # Sleep 1 second to ensure this NoTrashChecker won't collect errors from a possible previous NoTrashChecker. + time.sleep(1) + + self.__start_time_for_collecting_errors = time.gmtime() + self.__previous_list_of_backups = set( + os.listdir(os.path.join(node1.cluster.instances_dir, "backups")) + ) + + self.__previous_list_of_znodes = set( + node1.query( + "SELECT name FROM system.zookeeper WHERE path = '/clickhouse/backups' " + + "AND NOT (name == 'alive_tracker')" + ).splitlines() + ) + + def __enter__(self): + return self + + def __exit__(self, type, value, traceback): + list_of_znodes = set( + node1.query( + "SELECT name FROM system.zookeeper WHERE path = '/clickhouse/backups' " + + "AND NOT (name == 'alive_tracker')" + ).splitlines() + ) + new_znodes = list_of_znodes.difference(self.__previous_list_of_znodes) + if new_znodes: + print(f"Found nodes in ZooKeeper: {new_znodes}") + for node in new_znodes: + print( + f"Nodes in '/clickhouse/backups/{node}':\n" + + node1.query( + f"SELECT name FROM system.zookeeper WHERE path = '/clickhouse/backups/{node}'" + ) + ) + print( + f"Nodes in '/clickhouse/backups/{node}/stage':\n" + + node1.query( + f"SELECT name FROM system.zookeeper WHERE path = '/clickhouse/backups/{node}/stage'" + ) + ) + if self.check_zookeeper: + assert new_znodes == set() + + list_of_backups = set( + os.listdir(os.path.join(node1.cluster.instances_dir, "backups")) + ) + new_backups = list_of_backups.difference(self.__previous_list_of_backups) + unfinished_backups = set( + backup + for backup in new_backups + if not os.path.exists( + os.path.join(node1.cluster.instances_dir, "backups", backup, ".backup") + ) + ) + new_backups = set( + backup for backup in new_backups if backup not in unfinished_backups + ) + if new_backups: + print(f"Found new backups: {new_backups}") + if unfinished_backups: + print(f"Found unfinished backups: {unfinished_backups}") + assert new_backups == set(self.expect_backups) + assert unfinished_backups == set(self.expect_unfinished_backups) + + all_errors = set() + start_time = time.strftime( + "%Y-%m-%d %H:%M:%S", self.__start_time_for_collecting_errors + ) + for node in nodes: + errors_query_result = node.query( + "SELECT name FROM system.errors WHERE last_error_time >= toDateTime('" + + start_time + + "') " + + "AND NOT ((name == 'KEEPER_EXCEPTION') AND (last_error_message LIKE '%Fault injection%')) " + + "AND NOT (name == 'NO_ELEMENTS_IN_CONFIG')" + ) + errors = errors_query_result.splitlines() + if errors: + print(f"{get_node_name(node)}: Found errors: {errors}") + print( + node.query( + "SELECT name, last_error_message FROM system.errors WHERE last_error_time >= toDateTime('" + + start_time + + "')" + ) + ) + for error in errors: + assert (error in self.expect_errors) or (error in self.allow_errors) + all_errors.update(errors) + + not_found_expected_errors = set(self.expect_errors).difference(all_errors) + if not_found_expected_errors: + print(f"Not found expected errors: {not_found_expected_errors}") + assert False + + +__backup_id_of_successful_backup = 
None + + +# Generates a backup which will be used to test RESTORE. +def get_backup_id_of_successful_backup(): + global __backup_id_of_successful_backup + if __backup_id_of_successful_backup is None: + __backup_id_of_successful_backup = random_id() + with NoTrashChecker() as no_trash_checker: + print("Will make backup successfully") + backup_id = __backup_id_of_successful_backup + create_and_fill_table(random_node()) + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + initiator.query( + f"BACKUP TABLE tbl ON CLUSTER 'cluster' TO {get_backup_name(backup_id)} SETTINGS id='{backup_id}' ASYNC" + ) + wait_status(initiator, "BACKUP_CREATED", backup_id=backup_id) + assert get_num_system_processes(nodes, backup_id=backup_id) == 0 + no_trash_checker.expect_backups = [backup_id] + + # Dropping the table before restoring. + node1.query("DROP TABLE tbl ON CLUSTER 'cluster' SYNC") + + return __backup_id_of_successful_backup + + +# Actual tests + + +# Test that a BACKUP operation can be cancelled with KILL QUERY. +def test_cancel_backup(): + with NoTrashChecker() as no_trash_checker: + create_and_fill_table(random_node()) + + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + + backup_id = random_id() + initiator.query( + f"BACKUP TABLE tbl ON CLUSTER 'cluster' TO {get_backup_name(backup_id)} SETTINGS id='{backup_id}' ASYNC" + ) + + assert get_status(initiator, backup_id=backup_id) == "CREATING_BACKUP" + assert get_num_system_processes(initiator, backup_id=backup_id) >= 1 + + # We shouldn't wait too long here, because otherwise the backup might be completed before we cancel it. + random_sleep(3) + + node_to_cancel, cancel_as_initiator = random.choice( + [(node1, False), (node2, False), (initiator, True)] + ) + + wait_num_system_processes( + node_to_cancel, + "1+", + backup_id=backup_id, + is_initial_query=cancel_as_initiator, + ) + + print( + f"Cancelling on {'initiator' if cancel_as_initiator else 'node'} {get_node_name(node_to_cancel)}" + ) + + # The timeout is 2 seconds here because a backup must be cancelled quickly. + kill_query( + node_to_cancel, + backup_id=backup_id, + is_initial_query=cancel_as_initiator, + timeout=3, + ) + + if cancel_as_initiator: + assert get_status(initiator, backup_id=backup_id) == "BACKUP_CANCELLED" + wait_status(initiator, "BACKUP_CANCELLED", backup_id=backup_id, timeout=3) + + assert "QUERY_WAS_CANCELLED" in get_error(initiator, backup_id=backup_id) + assert get_num_system_processes(nodes, backup_id=backup_id) == 0 + no_trash_checker.expect_errors = ["QUERY_WAS_CANCELLED"] + + +# Test that a RESTORE operation can be cancelled with KILL QUERY. +def test_cancel_restore(): + # Make backup. + backup_id = get_backup_id_of_successful_backup() + + # Cancel restoring. + with NoTrashChecker() as no_trash_checker: + print("Will cancel restoring") + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + + restore_id = random_id() + initiator.query( + f"RESTORE TABLE tbl ON CLUSTER 'cluster' FROM {get_backup_name(backup_id)} SETTINGS id='{restore_id}' ASYNC" + ) + + assert get_status(initiator, restore_id=restore_id) == "RESTORING" + assert get_num_system_processes(initiator, restore_id=restore_id) >= 1 + + # We shouldn't wait too long here, because otherwise the restore might be completed before we cancel it. 
+ random_sleep(3) + + node_to_cancel, cancel_as_initiator = random.choice( + [(node1, False), (node2, False), (initiator, True)] + ) + + wait_num_system_processes( + node_to_cancel, + "1+", + restore_id=restore_id, + is_initial_query=cancel_as_initiator, + ) + + print( + f"Cancelling on {'initiator' if cancel_as_initiator else 'node'} {get_node_name(node_to_cancel)}" + ) + + # The timeout is 2 seconds here because a restore must be cancelled quickly. + kill_query( + node_to_cancel, + restore_id=restore_id, + is_initial_query=cancel_as_initiator, + timeout=3, + ) + + if cancel_as_initiator: + assert get_status(initiator, restore_id=restore_id) == "RESTORE_CANCELLED" + wait_status(initiator, "RESTORE_CANCELLED", restore_id=restore_id, timeout=3) + + assert "QUERY_WAS_CANCELLED" in get_error(initiator, restore_id=restore_id) + assert get_num_system_processes(nodes, restore_id=restore_id) == 0 + no_trash_checker.expect_errors = ["QUERY_WAS_CANCELLED"] + + # Restore successfully. + with NoTrashChecker() as no_trash_checker: + print("Will restore from backup successfully") + restore_id = random_id() + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + + initiator.query( + f"RESTORE TABLE tbl ON CLUSTER 'cluster' FROM {get_backup_name(backup_id)} SETTINGS id='{restore_id}' ASYNC" + ) + + wait_status(initiator, "RESTORED", restore_id=restore_id) + assert get_num_system_processes(nodes, restore_id=restore_id) == 0 + + +# Test that shutdown cancels a running backup and doesn't wait until it finishes. +def test_shutdown_cancels_backup(): + with NoTrashChecker() as no_trash_checker: + create_and_fill_table(random_node()) + + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + + backup_id = random_id() + initiator.query( + f"BACKUP TABLE tbl ON CLUSTER 'cluster' TO {get_backup_name(backup_id)} SETTINGS id='{backup_id}' ASYNC" + ) + + assert get_status(initiator, backup_id=backup_id) == "CREATING_BACKUP" + assert get_num_system_processes(initiator, backup_id=backup_id) >= 1 + + # We shouldn't wait too long here, because otherwise the backup might be completed before we cancel it. + random_sleep(3) + + node_to_restart = random.choice([node1, node2]) + wait_num_system_processes(node_to_restart, "1+", backup_id=backup_id) + + print(f"{get_node_name(node_to_restart)}: Restarting...") + node_to_restart.restart_clickhouse() # Must cancel the backup. + print(f"{get_node_name(node_to_restart)}: Restarted") + + wait_num_system_processes(nodes, 0, backup_id=backup_id) + + if initiator != node_to_restart: + assert get_status(initiator, backup_id=backup_id) == "BACKUP_CANCELLED" + assert "QUERY_WAS_CANCELLED" in get_error(initiator, backup_id=backup_id) + + # The information about this cancelled backup must be stored in system.backup_log + initiator.query("SYSTEM FLUSH LOGS") + assert initiator.query( + f"SELECT status FROM system.backup_log WHERE id='{backup_id}' ORDER BY status" + ) == TSV(["CREATING_BACKUP", "BACKUP_CANCELLED"]) + + no_trash_checker.expect_errors = ["QUERY_WAS_CANCELLED"] + + +# After an error backup should clean the destination folder and used nodes in ZooKeeper. +# No unexpected errors must be generated. +def test_error_leaves_no_trash(): + with NoTrashChecker() as no_trash_checker: + # We create table "tbl" on one node only in order to make "BACKUP TABLE tbl ON CLUSTER" fail + # (because of the non-existing table on another node). 
+ create_and_fill_table(random_node(), on_cluster=False) + + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + + backup_id = random_id() + initiator.query( + f"BACKUP TABLE tbl ON CLUSTER 'cluster' TO {get_backup_name(backup_id)} SETTINGS id='{backup_id}' ASYNC" + ) + + wait_status(initiator, "BACKUP_FAILED", backup_id=backup_id) + assert "UNKNOWN_TABLE" in get_error(initiator, backup_id=backup_id) + + assert get_num_system_processes(nodes, backup_id=backup_id) == 0 + no_trash_checker.expect_errors = ["UNKNOWN_TABLE"] + + +# A backup must be stopped if Zookeeper is disconnected longer than `failure_after_host_disconnected_for_seconds`. +def test_long_disconnection_stops_backup(): + with NoTrashChecker() as no_trash_checker, ConfigManager() as config_manager: + # Config "faster_zk_disconnect_detect.xml" is used in this test to decrease number of retries when reconnecting to ZooKeeper. + # Without this config this test can take several minutes (instead of seconds) to run. + config_manager.add_main_config(nodes, "configs/faster_zk_disconnect_detect.xml") + + create_and_fill_table(random_node(), num_parts=100) + + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + + backup_id = random_id() + initiator.query( + f"BACKUP TABLE tbl ON CLUSTER 'cluster' TO {get_backup_name(backup_id)} SETTINGS id='{backup_id}' ASYNC", + settings={"backup_restore_failure_after_host_disconnected_for_seconds": 3}, + ) + + assert get_status(initiator, backup_id=backup_id) == "CREATING_BACKUP" + assert get_num_system_processes(initiator, backup_id=backup_id) >= 1 + + no_trash_checker.expect_unfinished_backups = [backup_id] + no_trash_checker.allow_errors = [ + "FAILED_TO_SYNC_BACKUP_OR_RESTORE", + "KEEPER_EXCEPTION", + "SOCKET_TIMEOUT", + "CANNOT_READ_ALL_DATA", + "NETWORK_ERROR", + "TABLE_IS_READ_ONLY", + ] + no_trash_checker.check_zookeeper = False + + with PartitionManager() as pm: + random_sleep(3) + + time_before_disconnection = time.monotonic() + + node_to_drop_zk_connection = random_node() + print( + f"Dropping connection between {get_node_name(node_to_drop_zk_connection)} and ZooKeeper" + ) + pm.drop_instance_zk_connections(node_to_drop_zk_connection) + + # Being disconnected from ZooKeeper a backup is expected to fail. + wait_status(initiator, "BACKUP_FAILED", backup_id=backup_id) + + time_to_fail = time.monotonic() - time_before_disconnection + error = get_error(initiator, backup_id=backup_id) + print(f"error={error}") + assert "Lost connection" in error + + # A backup is expected to fail, but it isn't expected to fail too soon. + print(f"Backup failed after {time_to_fail} seconds disconnection") + assert time_to_fail > 3 + assert time_to_fail < 30 + + +# A backup must NOT be stopped if Zookeeper is disconnected shorter than `failure_after_host_disconnected_for_seconds`. 
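[Illustrative aside, not part of this diff] The test above asserts that, once ZooKeeper stays disconnected longer than backup_restore_failure_after_host_disconnected_for_seconds, the backup fails, but only inside a bounded window (time_to_fail > 3 and < 30 seconds); the test defined right after this aside checks the complementary short-disconnection case. A condensed sketch of that bounded-window assertion as a reusable helper, assuming a node object with the same .query() interface used throughout this file; the helper name and signature are hypothetical:

import time

def assert_status_within(initiator, target_status, backup_id, min_seconds, max_seconds, poll=0.5):
    """Poll system.backups until target_status is reached and check the elapsed time window."""
    started = time.monotonic()
    while True:
        status = initiator.query(
            f"SELECT status FROM system.backups WHERE id='{backup_id}'"
        ).strip()
        elapsed = time.monotonic() - started
        if status == target_status:
            break
        # Fail fast if the transition takes longer than the upper bound.
        assert elapsed < max_seconds, f"still {status} after {elapsed:.1f}s"
        time.sleep(poll)
    # The transition must not happen before the lower bound either.
    assert elapsed > min_seconds, f"reached {target_status} too early ({elapsed:.1f}s)"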
+def test_short_disconnection_doesnt_stop_backup(): + with NoTrashChecker() as no_trash_checker, ConfigManager() as config_manager: + use_faster_zk_disconnect_detect = random.choice([True, False]) + if use_faster_zk_disconnect_detect: + print("Using faster_zk_disconnect_detect.xml") + config_manager.add_main_config( + nodes, "configs/faster_zk_disconnect_detect.xml" + ) + + create_and_fill_table(random_node()) + + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + + backup_id = random_id() + initiator.query( + f"BACKUP TABLE tbl ON CLUSTER 'cluster' TO {get_backup_name(backup_id)} SETTINGS id='{backup_id}' ASYNC", + settings={"backup_restore_failure_after_host_disconnected_for_seconds": 6}, + ) + + assert get_status(initiator, backup_id=backup_id) == "CREATING_BACKUP" + assert get_num_system_processes(initiator, backup_id=backup_id) >= 1 + + # Dropping connection for less than `failure_after_host_disconnected_for_seconds` + with PartitionManager() as pm: + random_sleep(3) + node_to_drop_zk_connection = random_node() + print( + f"Dropping connection between {get_node_name(node_to_drop_zk_connection)} and ZooKeeper" + ) + pm.drop_instance_zk_connections(node_to_drop_zk_connection) + random_sleep(3) + print( + f"Restoring connection between {get_node_name(node_to_drop_zk_connection)} and ZooKeeper" + ) + + # Backup must be successful. + wait_status(initiator, "BACKUP_CREATED", backup_id=backup_id) + assert get_num_system_processes(nodes, backup_id=backup_id) == 0 + + no_trash_checker.expect_backups = [backup_id] + no_trash_checker.allow_errors = [ + "KEEPER_EXCEPTION", + "SOCKET_TIMEOUT", + "CANNOT_READ_ALL_DATA", + "NETWORK_ERROR", + "TABLE_IS_READ_ONLY", + ] + + +# A restore must NOT be stopped if Zookeeper is disconnected shorter than `failure_after_host_disconnected_for_seconds`. +def test_short_disconnection_doesnt_stop_restore(): + # Make a backup. + backup_id = get_backup_id_of_successful_backup() + + # Restore from the backup. + with NoTrashChecker() as no_trash_checker, ConfigManager() as config_manager: + use_faster_zk_disconnect_detect = random.choice([True, False]) + if use_faster_zk_disconnect_detect: + print("Using faster_zk_disconnect_detect.xml") + config_manager.add_main_config( + nodes, "configs/faster_zk_disconnect_detect.xml" + ) + + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator") + + restore_id = random_id() + initiator.query( + f"RESTORE TABLE tbl ON CLUSTER 'cluster' FROM {get_backup_name(backup_id)} SETTINGS id='{restore_id}' ASYNC", + settings={"backup_restore_failure_after_host_disconnected_for_seconds": 6}, + ) + + assert get_status(initiator, restore_id=restore_id) == "RESTORING" + assert get_num_system_processes(initiator, restore_id=restore_id) >= 1 + + # Dropping connection for less than `failure_after_host_disconnected_for_seconds` + with PartitionManager() as pm: + random_sleep(3) + node_to_drop_zk_connection = random_node() + print( + f"Dropping connection between {get_node_name(node_to_drop_zk_connection)} and ZooKeeper" + ) + pm.drop_instance_zk_connections(node_to_drop_zk_connection) + random_sleep(3) + print( + f"Restoring connection between {get_node_name(node_to_drop_zk_connection)} and ZooKeeper" + ) + + # Restore must be successful. 
+ wait_status(initiator, "RESTORED", restore_id=restore_id) + assert get_num_system_processes(nodes, restore_id=restore_id) == 0 + + no_trash_checker.allow_errors = [ + "KEEPER_EXCEPTION", + "SOCKET_TIMEOUT", + "CANNOT_READ_ALL_DATA", + "NETWORK_ERROR", + "TABLE_IS_READ_ONLY", + ] diff --git a/tests/integration/test_backup_restore_on_cluster/test_different_versions.py b/tests/integration/test_backup_restore_on_cluster/test_different_versions.py new file mode 100644 index 00000000000..b5eea7a1902 --- /dev/null +++ b/tests/integration/test_backup_restore_on_cluster/test_different_versions.py @@ -0,0 +1,125 @@ +import random + +import pytest + +from helpers.cluster import ClickHouseCluster +from helpers.test_tools import TSV + +cluster = ClickHouseCluster(__file__) + +main_configs = [ + "configs/backups_disk.xml", + "configs/cluster_different_versions.xml", +] + +user_configs = [] + +new_node = cluster.add_instance( + "new_node", + main_configs=main_configs, + user_configs=user_configs, + external_dirs=["/backups/"], + macros={"replica": "new_node", "shard": "shard1"}, + with_zookeeper=True, +) + +old_node = cluster.add_instance( + "old_node", + image="clickhouse/clickhouse-server", + tag="24.9.2.42", + with_installed_binary=True, + main_configs=main_configs, + user_configs=user_configs, + external_dirs=["/backups/"], + macros={"replica": "old_node", "shard": "shard1"}, + with_zookeeper=True, +) + +nodes = [new_node, old_node] + + +@pytest.fixture(scope="module", autouse=True) +def start_cluster(): + try: + cluster.start() + yield cluster + finally: + cluster.shutdown() + + +@pytest.fixture(autouse=True) +def cleanup_after_test(): + try: + yield + finally: + new_node.query("DROP TABLE IF EXISTS tbl ON CLUSTER 'cluster_ver' SYNC") + + +backup_id_counter = 0 + + +def new_backup_name(): + global backup_id_counter + backup_id_counter += 1 + return f"Disk('backups', '{backup_id_counter}')" + + +# Gets a printable version the name of a node. +def get_node_name(node): + return "new_node" if (node == new_node) else "old_node" + + +# Choose a random instance. +def random_node(): + return random.choice(nodes) + + +def test_different_versions(): + new_node.query( + "CREATE TABLE tbl" + " ON CLUSTER 'cluster_ver'" + " (x UInt64) ENGINE=ReplicatedMergeTree('/clickhouse/tables/tbl/', '{replica}')" + " ORDER BY tuple()" + ) + + new_node.query(f"INSERT INTO tbl VALUES (1)") + old_node.query(f"INSERT INTO tbl VALUES (2)") + + backup_name = new_backup_name() + + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator for BACKUP") + initiator.query(f"BACKUP TABLE tbl ON CLUSTER 'cluster_ver' TO {backup_name}") + + new_node.query("DROP TABLE tbl ON CLUSTER 'cluster_ver' SYNC") + + initiator = random_node() + print(f"Using {get_node_name(initiator)} as initiator for RESTORE") + initiator.query(f"RESTORE TABLE tbl ON CLUSTER 'cluster_ver' FROM {backup_name}") + + new_node.query("SYSTEM SYNC REPLICA ON CLUSTER 'cluster_ver' tbl") + assert new_node.query("SELECT * FROM tbl ORDER BY x") == TSV([1, 2]) + assert old_node.query("SELECT * FROM tbl ORDER BY x") == TSV([1, 2]) + + # Error NO_ELEMENTS_IN_CONFIG is unrelated. + assert ( + new_node.query( + "SELECT name, last_error_message FROM system.errors WHERE NOT (" + "(name == 'NO_ELEMENTS_IN_CONFIG')" + ")" + ) + == "" + ) + + # Error FAILED_TO_SYNC_BACKUP_OR_RESTORE: "No connection to host new_node:9000 yet, will retry" is generated by the old version + # when it fails to connect to other host because that other host hasn't started yet. 
+ # This is not an error actually, just an exception thrown and caught. The new version doesn't throw this exception. + assert ( + old_node.query( + "SELECT name, last_error_message FROM system.errors WHERE NOT (" + "(name == 'NO_ELEMENTS_IN_CONFIG') OR" + "((name == 'FAILED_TO_SYNC_BACKUP_OR_RESTORE') AND (last_error_message == 'No connection to host new_node:9000 yet, will retry'))" + ")" + ) + == "" + ) diff --git a/tests/integration/test_backup_restore_on_cluster/test_disallow_concurrency.py b/tests/integration/test_backup_restore_on_cluster/test_disallow_concurrency.py index 846c41592f7..3dea986e3d9 100644 --- a/tests/integration/test_backup_restore_on_cluster/test_disallow_concurrency.py +++ b/tests/integration/test_backup_restore_on_cluster/test_disallow_concurrency.py @@ -145,7 +145,7 @@ def wait_for_restore(node, restore_id): def check_backup_error(error): expected_errors = [ - "Concurrent backups not supported", + "Concurrent backups are not allowed", "BACKUP_ALREADY_EXISTS", ] assert any([expected_error in error for expected_error in expected_errors]) @@ -153,7 +153,7 @@ def check_backup_error(error): def check_restore_error(error): expected_errors = [ - "Concurrent restores not supported", + "Concurrent restores are not allowed", "Cannot restore the table default.tbl because it already contains some data", ] assert any([expected_error in error for expected_error in expected_errors]) diff --git a/tests/integration/test_config_corresponding_root/configs/config.xml b/tests/integration/test_config_corresponding_root/configs/config.xml index 9a38d02a036..001a98837c4 100644 --- a/tests/integration/test_config_corresponding_root/configs/config.xml +++ b/tests/integration/test_config_corresponding_root/configs/config.xml @@ -291,6 +291,8 @@ /clickhouse/task_queue/ddl + + /clickhouse/task_queue/replicas diff --git a/tests/integration/test_config_reload/__init__.py b/tests/integration/test_config_reload/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/integration/test_config_reload/configs/display_name.xml b/tests/integration/test_config_reload/configs/display_name.xml new file mode 100644 index 00000000000..ddb7f0be8be --- /dev/null +++ b/tests/integration/test_config_reload/configs/display_name.xml @@ -0,0 +1,3 @@ + + 424242 + diff --git a/tests/integration/test_config_reload/test.py b/tests/integration/test_config_reload/test.py new file mode 100644 index 00000000000..c2882b7f776 --- /dev/null +++ b/tests/integration/test_config_reload/test.py @@ -0,0 +1,41 @@ +import pytest + +from helpers.cluster import ClickHouseCluster + +cluster = ClickHouseCluster(__file__) +instance = cluster.add_instance( + "instance", + main_configs=["configs/display_name.xml"], + stay_alive=True, +) + + +@pytest.fixture(scope="module") +def start_cluster(): + try: + cluster.start() + yield cluster + finally: + cluster.shutdown() + + +DEFAULT_VALUE = "424242" +CHANGED_VALUE = "434343" + + +def test_system_reload_config_with_global_context(start_cluster): + # When running the this test multiple times, make sure failure of one test won't cause the failure of every subsequent tests + instance.replace_in_config( + "/etc/clickhouse-server/config.d/display_name.xml", CHANGED_VALUE, DEFAULT_VALUE + ) + instance.restart_clickhouse() + + assert DEFAULT_VALUE == instance.query("SELECT displayName()").strip() + + instance.replace_in_config( + "/etc/clickhouse-server/config.d/display_name.xml", DEFAULT_VALUE, CHANGED_VALUE + ) + + instance.query("SYSTEM RELOAD CONFIG") + + assert 
CHANGED_VALUE == instance.query("SELECT displayName()").strip() diff --git a/tests/integration/test_config_xml_full/configs/config.xml b/tests/integration/test_config_xml_full/configs/config.xml index 80b6a702032..81be2e48122 100644 --- a/tests/integration/test_config_xml_full/configs/config.xml +++ b/tests/integration/test_config_xml_full/configs/config.xml @@ -849,6 +849,8 @@ /clickhouse/task_queue/ddl + + /clickhouse/task_queue/replicas diff --git a/tests/integration/test_ddl_on_cluster_stop_waiting_for_offline_hosts/__init__.py b/tests/integration/test_ddl_on_cluster_stop_waiting_for_offline_hosts/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/integration/test_ddl_on_cluster_stop_waiting_for_offline_hosts/configs/remote_servers.xml b/tests/integration/test_ddl_on_cluster_stop_waiting_for_offline_hosts/configs/remote_servers.xml new file mode 100644 index 00000000000..c505345cf7f --- /dev/null +++ b/tests/integration/test_ddl_on_cluster_stop_waiting_for_offline_hosts/configs/remote_servers.xml @@ -0,0 +1,30 @@ + + + + + true + + node1 + 9000 + + + node2 + 9000 + + + + true + + node3 + 9000 + + + node4 + 9000 + + + + + + 1 + diff --git a/tests/integration/test_ddl_on_cluster_stop_waiting_for_offline_hosts/test.py b/tests/integration/test_ddl_on_cluster_stop_waiting_for_offline_hosts/test.py new file mode 100644 index 00000000000..06bdd6f2198 --- /dev/null +++ b/tests/integration/test_ddl_on_cluster_stop_waiting_for_offline_hosts/test.py @@ -0,0 +1,88 @@ +import time + +import pytest + +from helpers.cluster import ClickHouseCluster + +cluster = ClickHouseCluster(__file__) + +node1 = cluster.add_instance( + "node1", + main_configs=["configs/remote_servers.xml"], + with_zookeeper=True, + stay_alive=True, +) +node2 = cluster.add_instance( + "node2", main_configs=["configs/remote_servers.xml"], with_zookeeper=True +) +node3 = cluster.add_instance( + "node3", main_configs=["configs/remote_servers.xml"], with_zookeeper=True +) +node4 = cluster.add_instance( + "node4", + main_configs=["configs/remote_servers.xml"], + with_zookeeper=True, + stay_alive=True, +) + + +@pytest.fixture(scope="module") +def started_cluster(): + try: + cluster.start() + yield cluster + + finally: + cluster.shutdown() + + +def test_stop_waiting_for_offline_hosts(started_cluster): + timeout = 10 + settings = {"distributed_ddl_task_timeout": timeout} + + node1.query( + "DROP TABLE IF EXISTS test_table ON CLUSTER test_cluster SYNC", + settings=settings, + ) + + node1.query( + "CREATE TABLE test_table ON CLUSTER test_cluster (x Int) Engine=Memory", + settings=settings, + ) + + try: + node4.stop_clickhouse() + + start = time.time() + assert "Code: 159. DB::Exception" in node1.query_and_get_error( + "DROP TABLE IF EXISTS test_table ON CLUSTER test_cluster SYNC", + settings=settings, + ) + assert time.time() - start >= timeout + + start = time.time() + assert "Code: 159. 
DB::Exception" in node1.query_and_get_error( + "CREATE TABLE test_table ON CLUSTER test_cluster (x Int) Engine=Memory", + settings=settings, + ) + assert time.time() - start >= timeout + + # set `distributed_ddl_output_mode` = `throw_only_active`` + settings = { + "distributed_ddl_task_timeout": timeout, + "distributed_ddl_output_mode": "throw_only_active", + } + + start = time.time() + node1.query( + "DROP TABLE IF EXISTS test_table ON CLUSTER test_cluster SYNC", + settings=settings, + ) + + start = time.time() + node1.query( + "CREATE TABLE test_table ON CLUSTER test_cluster (x Int) Engine=Memory", + settings=settings, + ) + finally: + node4.start_clickhouse() diff --git a/tests/integration/test_ddl_worker_replicas/__init__.py b/tests/integration/test_ddl_worker_replicas/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/integration/test_ddl_worker_replicas/configs/remote_servers.xml b/tests/integration/test_ddl_worker_replicas/configs/remote_servers.xml new file mode 100644 index 00000000000..c505345cf7f --- /dev/null +++ b/tests/integration/test_ddl_worker_replicas/configs/remote_servers.xml @@ -0,0 +1,30 @@ + + + + + true + + node1 + 9000 + + + node2 + 9000 + + + + true + + node3 + 9000 + + + node4 + 9000 + + + + + + 1 + diff --git a/tests/integration/test_ddl_worker_replicas/test.py b/tests/integration/test_ddl_worker_replicas/test.py new file mode 100644 index 00000000000..5ba5f406e4f --- /dev/null +++ b/tests/integration/test_ddl_worker_replicas/test.py @@ -0,0 +1,77 @@ +import pytest + +from helpers.cluster import ClickHouseCluster + +cluster = ClickHouseCluster(__file__) + +node1 = cluster.add_instance( + "node1", + main_configs=["configs/remote_servers.xml"], + with_zookeeper=True, + stay_alive=True, +) +node2 = cluster.add_instance( + "node2", main_configs=["configs/remote_servers.xml"], with_zookeeper=True +) +node3 = cluster.add_instance( + "node3", main_configs=["configs/remote_servers.xml"], with_zookeeper=True +) +node4 = cluster.add_instance( + "node4", + main_configs=["configs/remote_servers.xml"], + with_zookeeper=True, + stay_alive=True, +) + + +@pytest.fixture(scope="module") +def started_cluster(): + try: + cluster.start() + yield cluster + + finally: + cluster.shutdown() + + +def test_ddl_worker_replicas(started_cluster): + for replica in ["node1:9000", "node2:9000", "node3:9000", "node4:9000"]: + # wait until the replicas path is created + node1.query_with_retry( + sql=f"SELECT count() FROM system.zookeeper WHERE path='/clickhouse/task_queue/replicas/{replica}'", + check_callback=lambda result: result == 1, + ) + + result = node1.query( + f"SELECT name, value, ephemeralOwner FROM system.zookeeper WHERE path='/clickhouse/task_queue/replicas/{replica}'" + ).strip() + print(f"result: {replica} {result}") + + lines = list(result.split("\n")) + assert len(lines) == 1 + parts = list(lines[0].split("\t")) + assert len(parts) == 3 + assert parts[0] == "active" + assert len(parts[1]) != 0 + assert len(parts[2]) != 0 + + try: + node4.stop_clickhouse() + + # wait for node4 active path is removed + node1.query_with_retry( + sql=f"SELECT count() FROM system.zookeeper WHERE path='/clickhouse/task_queue/replicas/node4:9000'", + check_callback=lambda result: result == 0, + ) + + result = node1.query_with_retry( + f"SELECT name, value, ephemeralOwner FROM system.zookeeper WHERE path='/clickhouse/task_queue/replicas/node4:9000'" + ).strip() + + print(f"result: {replica} {result}") + + lines = list(result.split("\n")) + assert len(lines) == 1 + assert 
len(lines[0]) == 0 + finally: + node4.start_clickhouse() diff --git a/tests/integration/test_detached_parts_metrics/configs/asynchronous_metrics_update_period_s.xml b/tests/integration/test_detached_parts_metrics/configs/asynchronous_metrics_update_period_s.xml index 0a56d734805..fe19b730059 100644 --- a/tests/integration/test_detached_parts_metrics/configs/asynchronous_metrics_update_period_s.xml +++ b/tests/integration/test_detached_parts_metrics/configs/asynchronous_metrics_update_period_s.xml @@ -1,4 +1,5 @@ 1 + 1 1 diff --git a/tests/integration/test_distributed_directory_monitor_split_batch_on_failure/test.py b/tests/integration/test_distributed_directory_monitor_split_batch_on_failure/test.py index 7a843a87ec2..74c35e7f4ea 100644 --- a/tests/integration/test_distributed_directory_monitor_split_batch_on_failure/test.py +++ b/tests/integration/test_distributed_directory_monitor_split_batch_on_failure/test.py @@ -78,7 +78,7 @@ def test_distributed_background_insert_split_batch_on_failure_OFF(started_cluste with pytest.raises( QueryRuntimeException, # no DOTALL in pytest.raises, use '(.|\n)' - match=r"DB::Exception: Received from.*Memory limit \(for query\) exceeded: (.|\n)*While sending a batch", + match=r"DB::Exception: Received from.*Query memory limit exceeded: (.|\n)*While sending a batch", ): node2.query("system flush distributed dist") assert int(node2.query("select count() from dist_data")) == 0 diff --git a/tests/integration/test_drop_replica/test.py b/tests/integration/test_drop_replica/test.py index fc6c225a0d0..eae997340e5 100644 --- a/tests/integration/test_drop_replica/test.py +++ b/tests/integration/test_drop_replica/test.py @@ -9,6 +9,7 @@ def fill_nodes(nodes, shard): for node in nodes: node.query( """ + DROP DATABASE IF EXISTS test SYNC; CREATE DATABASE test; CREATE TABLE test.test_table(date Date, id UInt32) @@ -21,6 +22,7 @@ def fill_nodes(nodes, shard): node.query( """ + DROP DATABASE IF EXISTS test1 SYNC; CREATE DATABASE test1; CREATE TABLE test1.test_table(date Date, id UInt32) @@ -33,6 +35,7 @@ def fill_nodes(nodes, shard): node.query( """ + DROP DATABASE IF EXISTS test2 SYNC; CREATE DATABASE test2; CREATE TABLE test2.test_table(date Date, id UInt32) @@ -45,7 +48,8 @@ def fill_nodes(nodes, shard): node.query( """ - CREATE DATABASE test3; + DROP DATABASE IF EXISTS test3 SYNC; + CREATE DATABASE test3; CREATE TABLE test3.test_table(date Date, id UInt32) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test3/{shard}/replicated/test_table', '{replica}') ORDER BY id PARTITION BY toYYYYMM(date) @@ -57,6 +61,7 @@ def fill_nodes(nodes, shard): node.query( """ + DROP DATABASE IF EXISTS test4 SYNC; CREATE DATABASE test4; CREATE TABLE test4.test_table(date Date, id UInt32) @@ -84,9 +89,6 @@ node_1_3 = cluster.add_instance( def start_cluster(): try: cluster.start() - - fill_nodes([node_1_1, node_1_2], 1) - yield cluster except Exception as ex: @@ -102,6 +104,8 @@ def check_exists(zk, path): def test_drop_replica(start_cluster): + fill_nodes([node_1_1, node_1_2], 1) + node_1_1.query( "INSERT INTO test.test_table SELECT number, toString(number) FROM numbers(100)" ) @@ -142,11 +146,7 @@ def test_drop_replica(start_cluster): shard=1 ) ) - assert "There is a local table" in node_1_2.query_and_get_error( - "SYSTEM DROP REPLICA 'node_1_1' FROM ZKPATH '/clickhouse/tables/test/{shard}/replicated/test_table'".format( - shard=1 - ) - ) + assert "There is a local table" in node_1_1.query_and_get_error( "SYSTEM DROP REPLICA 'node_1_1' FROM ZKPATH 
'/clickhouse/tables/test/{shard}/replicated/test_table'".format( shard=1 @@ -222,11 +222,22 @@ def test_drop_replica(start_cluster): ) assert exists_replica_1_1 == None - node_1_2.query("SYSTEM DROP REPLICA 'node_1_1'") - exists_replica_1_1 = check_exists( + node_1_1.query("ATTACH DATABASE test4") + + node_1_2.query("DETACH TABLE test4.test_table") + node_1_1.query( + "SYSTEM DROP REPLICA 'node_1_2' FROM ZKPATH '/clickhouse/tables/test4/{shard}/replicated/test_table'".format( + shard=1 + ) + ) + exists_replica_1_2 = check_exists( zk, "/clickhouse/tables/test4/{shard}/replicated/test_table/replicas/{replica}".format( - shard=1, replica="node_1_1" + shard=1, replica="node_1_2" ), ) - assert exists_replica_1_1 == None + assert exists_replica_1_2 == None + + node_1_1.query("ATTACH DATABASE test") + for i in range(1, 4): + node_1_1.query("ATTACH DATABASE test{}".format(i)) diff --git a/tests/integration/test_fix_metadata_version/__init__.py b/tests/integration/test_fix_metadata_version/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/integration/test_fix_metadata_version/configs/config.xml b/tests/integration/test_fix_metadata_version/configs/config.xml new file mode 100644 index 00000000000..4662e6794e3 --- /dev/null +++ b/tests/integration/test_fix_metadata_version/configs/config.xml @@ -0,0 +1,16 @@ + + 9000 + + + + + + + + + default + + + + + diff --git a/tests/integration/test_fix_metadata_version/test.py b/tests/integration/test_fix_metadata_version/test.py new file mode 100644 index 00000000000..085872bba05 --- /dev/null +++ b/tests/integration/test_fix_metadata_version/test.py @@ -0,0 +1,73 @@ +import pytest + +from helpers.cluster import ClickHouseCluster + +cluster = ClickHouseCluster(__file__) +node = cluster.add_instance( + "node", + main_configs=["configs/config.xml"], + stay_alive=True, + with_zookeeper=True, +) + + +@pytest.fixture(scope="module") +def start_cluster(): + try: + cluster.start() + yield cluster + finally: + cluster.shutdown() + + +def test_fix_metadata_version(start_cluster): + zookeeper_path = "/clickhouse/test_fix_metadata_version" + replica = "replica1" + replica_path = f"{zookeeper_path}/replicas/{replica}" + + def get_metadata_versions(): + table_metadata_version = int( + node.query( + f""" + SELECT version + FROM system.zookeeper + WHERE path = '{zookeeper_path}' AND name = 'metadata' + """ + ).strip() + ) + + replica_metadata_version = int( + node.query( + f""" + SELECT value + FROM system.zookeeper + WHERE path = '{replica_path}' AND name = 'metadata_version' + """ + ).strip() + ) + + return table_metadata_version, replica_metadata_version + + node.query( + f""" + DROP TABLE IF EXISTS t SYNC; + CREATE TABLE t + ( + `x` UInt32 + ) + ENGINE = ReplicatedMergeTree('{zookeeper_path}', '{replica}') + ORDER BY x + """ + ) + + node.query("ALTER TABLE t (ADD COLUMN `y` UInt32)") + + assert get_metadata_versions() == (1, 1) + + cluster.query_zookeeper(f"set '{replica_path}/metadata_version' '0'") + + assert get_metadata_versions() == (1, 0) + + node.restart_clickhouse() + + assert get_metadata_versions() == (1, 1) diff --git a/tests/integration/test_grpc_protocol/test.py b/tests/integration/test_grpc_protocol/test.py index 732907eed7a..561f5144aac 100644 --- a/tests/integration/test_grpc_protocol/test.py +++ b/tests/integration/test_grpc_protocol/test.py @@ -364,7 +364,7 @@ def test_logs(): ) assert query in logs assert "Read 1000000 rows" in logs - assert "Peak memory usage" in logs + assert "Query peak memory usage" in logs def 
test_progress(): diff --git a/tests/integration/test_http_handlers_config/test.py b/tests/integration/test_http_handlers_config/test.py index efba4f05748..cf291c6dedd 100644 --- a/tests/integration/test_http_handlers_config/test.py +++ b/tests/integration/test_http_handlers_config/test.py @@ -17,9 +17,10 @@ class SimpleCluster: cluster.start() def add_instance(self, name, config_dir): - script_path = os.path.dirname(os.path.realpath(__file__)) return self.cluster.add_instance( - name, main_configs=[os.path.join(script_path, config_dir, "config.xml")] + name, + main_configs=[os.path.join(config_dir, "config.xml")], + user_configs=["users.d/users.yaml"], ) @@ -96,6 +97,16 @@ def test_dynamic_query_handler(): == res_custom_ct.headers["X-Test-Http-Response-Headers-Even-Multiple"] ) + assert cluster.instance.http_request( + "test_dynamic_handler_auth_with_password?query=select+currentUser()" + ).content, "with_password" + assert cluster.instance.http_request( + "test_dynamic_handler_auth_with_password_fail?query=select+currentUser()" + ).status_code, 403 + assert cluster.instance.http_request( + "test_dynamic_handler_auth_without_password?query=select+currentUser()" + ).content, "without_password" + def test_predefined_query_handler(): with contextlib.closing( @@ -177,6 +188,16 @@ def test_predefined_query_handler(): ) assert b"max_threads\t1\n" == res1.content + assert cluster.instance.http_request( + "test_predefined_handler_auth_with_password" + ).content, "with_password" + assert cluster.instance.http_request( + "test_predefined_handler_auth_with_password_fail" + ).status_code, 403 + assert cluster.instance.http_request( + "test_predefined_handler_auth_without_password" + ).content, "without_password" + def test_fixed_static_handler(): with contextlib.closing( diff --git a/tests/integration/test_http_handlers_config/test_dynamic_handler/config.xml b/tests/integration/test_http_handlers_config/test_dynamic_handler/config.xml index 58fedbd9078..4900219f595 100644 --- a/tests/integration/test_http_handlers_config/test_dynamic_handler/config.xml +++ b/tests/integration/test_http_handlers_config/test_dynamic_handler/config.xml @@ -24,5 +24,32 @@ + + + GET + /test_dynamic_handler_auth_with_password + + dynamic_query_handler + with_password + password + + + + GET + /test_dynamic_handler_auth_with_password_fail + + dynamic_query_handler + with_password + + + + + GET + /test_dynamic_handler_auth_without_password + + dynamic_query_handler + without_password + + diff --git a/tests/integration/test_http_handlers_config/test_predefined_handler/config.xml b/tests/integration/test_http_handlers_config/test_predefined_handler/config.xml index a7804721f12..3c0ee3cd09a 100644 --- a/tests/integration/test_http_handlers_config/test_predefined_handler/config.xml +++ b/tests/integration/test_http_handlers_config/test_predefined_handler/config.xml @@ -33,5 +33,35 @@ INSERT INTO test_table(id, data) SELECT {id:UInt32}, {_request_body:String} + + + GET + /test_predefined_handler_auth_with_password + + predefined_query_handler + with_password + password + SELECT currentUser() + + + + GET + /test_predefined_handler_auth_with_password_fail + + predefined_query_handler + with_password + + SELECT currentUser() + + + + GET + /test_predefined_handler_auth_without_password + + predefined_query_handler + without_password + SELECT currentUser() + + diff --git a/tests/integration/test_http_handlers_config/users.d/users.yaml b/tests/integration/test_http_handlers_config/users.d/users.yaml new file mode 100644 index 
00000000000..9ab8a84ae5a --- /dev/null +++ b/tests/integration/test_http_handlers_config/users.d/users.yaml @@ -0,0 +1,7 @@ +users: + with_password: + profile: default + password: password + without_password: + profile: default + no_password: 1 diff --git a/tests/integration/test_https_replication/configs/config.xml b/tests/integration/test_https_replication/configs/config.xml index 9a7a542b16e..8c1cd9beeb2 100644 --- a/tests/integration/test_https_replication/configs/config.xml +++ b/tests/integration/test_https_replication/configs/config.xml @@ -256,6 +256,8 @@ /clickhouse/task_queue/ddl + + /clickhouse/task_queue/replicas diff --git a/tests/integration/test_peak_memory_usage/test.py b/tests/integration/test_peak_memory_usage/test.py index 877cf97bb18..51268dcf386 100644 --- a/tests/integration/test_peak_memory_usage/test.py +++ b/tests/integration/test_peak_memory_usage/test.py @@ -68,7 +68,8 @@ def get_memory_usage_from_client_output_and_close(client_output): for line in client_output: print(f"'{line}'\n") if not peek_memory_usage_str_found: - peek_memory_usage_str_found = "Peak memory usage" in line + # Can be both Peak/peak + peek_memory_usage_str_found = "eak memory usage" in line if peek_memory_usage_str_found: search_obj = re.search(r"[+-]?[0-9]+\.[0-9]+", line) @@ -97,9 +98,7 @@ def test_clickhouse_client_max_peak_memory_usage_distributed(started_cluster): peak_memory_usage = get_memory_usage_from_client_output_and_close(client_output) assert peak_memory_usage - assert shard_2.contains_in_log( - f"Peak memory usage (for query): {peak_memory_usage}" - ) + assert shard_2.contains_in_log(f"Query peak memory usage: {peak_memory_usage}") def test_clickhouse_client_max_peak_memory_single_node(started_cluster): @@ -118,6 +117,4 @@ def test_clickhouse_client_max_peak_memory_single_node(started_cluster): peak_memory_usage = get_memory_usage_from_client_output_and_close(client_output) assert peak_memory_usage - assert shard_1.contains_in_log( - f"Peak memory usage (for query): {peak_memory_usage}" - ) + assert shard_1.contains_in_log(f"Query peak memory usage: {peak_memory_usage}") diff --git a/tests/integration/test_reload_client_certificate/__init__.py b/tests/integration/test_reload_client_certificate/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/integration/test_reload_client_certificate/configs/remote_servers.xml b/tests/integration/test_reload_client_certificate/configs/remote_servers.xml new file mode 100644 index 00000000000..63fdcea5dab --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs/remote_servers.xml @@ -0,0 +1,23 @@ + + + + + + node1 + 9000 + + + + node2 + 9000 + + + + node3 + 9000 + + + + + + diff --git a/tests/integration/test_reload_client_certificate/configs/zookeeper_config_with_ssl.xml b/tests/integration/test_reload_client_certificate/configs/zookeeper_config_with_ssl.xml new file mode 100644 index 00000000000..dc0fe771426 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs/zookeeper_config_with_ssl.xml @@ -0,0 +1,20 @@ + + + + zoo1 + 2281 + 1 + + + zoo2 + 2281 + 1 + + + zoo3 + 2281 + 1 + + 3000 + + diff --git a/tests/integration/test_reload_client_certificate/configs_secure/conf.d/remote_servers.xml b/tests/integration/test_reload_client_certificate/configs_secure/conf.d/remote_servers.xml new file mode 100644 index 00000000000..548819a8c97 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs_secure/conf.d/remote_servers.xml @@ -0,0 +1,17 @@ + + + + + + 
node1 + 9000 + + + + node2 + 9000 + + + + + diff --git a/tests/integration/test_reload_client_certificate/configs_secure/conf.d/ssl_conf.xml b/tests/integration/test_reload_client_certificate/configs_secure/conf.d/ssl_conf.xml new file mode 100644 index 00000000000..d620bcee919 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs_secure/conf.d/ssl_conf.xml @@ -0,0 +1,16 @@ + + + + /etc/clickhouse-server/config.d/first_client.crt + /etc/clickhouse-server/config.d/first_client.key + true + true + sslv2,sslv3 + true + none + + RejectCertificateHandler + + + + diff --git a/tests/integration/test_reload_client_certificate/configs_secure/first_client.crt b/tests/integration/test_reload_client_certificate/configs_secure/first_client.crt new file mode 100644 index 00000000000..7ade2d96273 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs_secure/first_client.crt @@ -0,0 +1,19 @@ +-----BEGIN CERTIFICATE----- +MIIC/TCCAeWgAwIBAgIJANjx1QSR77HBMA0GCSqGSIb3DQEBCwUAMBQxEjAQBgNV +BAMMCWxvY2FsaG9zdDAgFw0xODA3MzAxODE2MDhaGA8yMjkyMDUxNDE4MTYwOFow +FDESMBAGA1UEAwwJbG9jYWxob3N0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIB +CgKCAQEAs9uSo6lJG8o8pw0fbVGVu0tPOljSWcVSXH9uiJBwlZLQnhN4SFSFohfI +4K8U1tBDTnxPLUo/V1K9yzoLiRDGMkwVj6+4+hE2udS2ePTQv5oaMeJ9wrs+5c9T +4pOtlq3pLAdm04ZMB1nbrEysceVudHRkQbGHzHp6VG29Fw7Ga6YpqyHQihRmEkTU +7UCYNA+Vk7aDPdMS/khweyTpXYZimaK9f0ECU3/VOeG3fH6Sp2X6FN4tUj/aFXEj +sRmU5G2TlYiSIUMF2JPdhSihfk1hJVALrHPTU38SOL+GyyBRWdNcrIwVwbpvsvPg +pryMSNxnpr0AK0dFhjwnupIv5hJIOQIDAQABo1AwTjAdBgNVHQ4EFgQUjPLb3uYC +kcamyZHK4/EV8jAP0wQwHwYDVR0jBBgwFoAUjPLb3uYCkcamyZHK4/EV8jAP0wQw +DAYDVR0TBAUwAwEB/zANBgkqhkiG9w0BAQsFAAOCAQEAM/ocuDvfPus/KpMVD51j +4IdlU8R0vmnYLQ+ygzOAo7+hUWP5j0yvq4ILWNmQX6HNvUggCgFv9bjwDFhb/5Vr +85ieWfTd9+LTjrOzTw4avdGwpX9G+6jJJSSq15tw5ElOIFb/qNA9O4dBiu8vn03C +L/zRSXrARhSqTW5w/tZkUcSTT+M5h28+Lgn9ysx4Ff5vi44LJ1NnrbJbEAIYsAAD ++UA+4MBFKx1r6hHINULev8+lCfkpwIaeS8RL+op4fr6kQPxnULw8wT8gkuc8I4+L +P9gg/xDHB44T3ADGZ5Ib6O0DJaNiToO6rnoaaxs0KkotbvDWvRoxEytSbXKoYjYp +0g== +-----END CERTIFICATE----- diff --git a/tests/integration/test_reload_client_certificate/configs_secure/first_client.key b/tests/integration/test_reload_client_certificate/configs_secure/first_client.key new file mode 100644 index 00000000000..f0fb61ac443 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs_secure/first_client.key @@ -0,0 +1,28 @@ +-----BEGIN PRIVATE KEY----- +MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQCz25KjqUkbyjyn +DR9tUZW7S086WNJZxVJcf26IkHCVktCeE3hIVIWiF8jgrxTW0ENOfE8tSj9XUr3L +OguJEMYyTBWPr7j6ETa51LZ49NC/mhox4n3Cuz7lz1Pik62WreksB2bThkwHWdus +TKxx5W50dGRBsYfMenpUbb0XDsZrpimrIdCKFGYSRNTtQJg0D5WTtoM90xL+SHB7 +JOldhmKZor1/QQJTf9U54bd8fpKnZfoU3i1SP9oVcSOxGZTkbZOViJIhQwXYk92F +KKF+TWElUAusc9NTfxI4v4bLIFFZ01ysjBXBum+y8+CmvIxI3GemvQArR0WGPCe6 +ki/mEkg5AgMBAAECggEATrbIBIxwDJOD2/BoUqWkDCY3dGevF8697vFuZKIiQ7PP +TX9j4vPq0DfsmDjHvAPFkTHiTQXzlroFik3LAp+uvhCCVzImmHq0IrwvZ9xtB43f +7Pkc5P6h1l3Ybo8HJ6zRIY3TuLtLxuPSuiOMTQSGRL0zq3SQ5DKuGwkz+kVjHXUN +MR2TECFwMHKQ5VLrC+7PMpsJYyOMlDAWhRfUalxC55xOXTpaN8TxNnwQ8K2ISVY5 +212Jz/a4hn4LdwxSz3Tiu95PN072K87HLWx3EdT6vW4Ge5P/A3y+smIuNAlanMnu +plHBRtpATLiTxZt/n6npyrfQVbYjSH7KWhB8hBHtaQKBgQDh9Cq1c/KtqDtE0Ccr +/r9tZNTUwBE6VP+3OJeKdEdtsfuxjOCkS1oAjgBJiSDOiWPh1DdoDeVZjPKq6pIu +Mq12OE3Doa8znfCXGbkSzEKOb2unKZMJxzrz99kXt40W5DtrqKPNb24CNqTiY8Aa +CjtcX+3weat82VRXvph6U8ltMwKBgQDLxjiQQzNoY7qvg7CwJCjf9qq8jmLK766g +1FHXopqS+dTxDLM8eJSRrpmxGWJvNeNc1uPhsKsKgotqAMdBUQTf7rSTbt4MyoH5 +bUcRLtr+0QTK9hDWMOOvleqNXha68vATkohWYfCueNsC60qD44o8RZAS6UNy3ENq 
+cM1cxqe84wKBgQDKkHutWnooJtajlTxY27O/nZKT/HA1bDgniMuKaz4R4Gr1PIez +on3YW3V0d0P7BP6PWRIm7bY79vkiMtLEKdiKUGWeyZdo3eHvhDb/3DCawtau8L2K +GZsHVp2//mS1Lfz7Qh8/L/NedqCQ+L4iWiPnZ3THjjwn3CoZ05ucpvrAMwKBgB54 +nay039MUVq44Owub3KDg+dcIU62U+cAC/9oG7qZbxYPmKkc4oL7IJSNecGHA5SbU +2268RFdl/gLz6tfRjbEOuOHzCjFPdvAdbysanpTMHLNc6FefJ+zxtgk9sJh0C4Jh +vxFrw9nTKKzfEl12gQ1SOaEaUIO0fEBGbe8ZpauRAoGAMAlGV+2/K4ebvAJKOVTa +dKAzQ+TD2SJmeR1HZmKDYddNqwtZlzg3v4ZhCk4eaUmGeC1Bdh8MDuB3QQvXz4Dr +vOIP4UVaOr+uM+7TgAgVnP4/K6IeJGzUDhX93pmpWhODfdu/oojEKVcpCojmEmS1 +KCBtmIrQLqzMpnBpLNuSY+Q= +-----END PRIVATE KEY----- diff --git a/tests/integration/test_reload_client_certificate/configs_secure/second_client.crt b/tests/integration/test_reload_client_certificate/configs_secure/second_client.crt new file mode 100644 index 00000000000..ff62438af62 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs_secure/second_client.crt @@ -0,0 +1,19 @@ +-----BEGIN CERTIFICATE----- +MIIDEDCCAfigAwIBAgIUEAdT/eB4tswNzGZg1V0rVP8WzJwwDQYJKoZIhvcNAQEL +BQAwGDEWMBQGA1UEAwwNbG9jYWxob3N0X25ldzAgFw0yNDEwMjQyMzE5MjJaGA8y +Mjk4MDgwOTIzMTkyMlowGDEWMBQGA1UEAwwNbG9jYWxob3N0X25ldzCCASIwDQYJ +KoZIhvcNAQEBBQADggEPADCCAQoCggEBALPbkqOpSRvKPKcNH21RlbtLTzpY0lnF +Ulx/boiQcJWS0J4TeEhUhaIXyOCvFNbQQ058Ty1KP1dSvcs6C4kQxjJMFY+vuPoR +NrnUtnj00L+aGjHifcK7PuXPU+KTrZat6SwHZtOGTAdZ26xMrHHlbnR0ZEGxh8x6 +elRtvRcOxmumKash0IoUZhJE1O1AmDQPlZO2gz3TEv5IcHsk6V2GYpmivX9BAlN/ +1Tnht3x+kqdl+hTeLVI/2hVxI7EZlORtk5WIkiFDBdiT3YUooX5NYSVQC6xz01N/ +Eji/hssgUVnTXKyMFcG6b7Lz4Ka8jEjcZ6a9ACtHRYY8J7qSL+YSSDkCAwEAAaNQ +ME4wHQYDVR0OBBYEFIzy297mApHGpsmRyuPxFfIwD9MEMB8GA1UdIwQYMBaAFIzy +297mApHGpsmRyuPxFfIwD9MEMAwGA1UdEwQFMAMBAf8wDQYJKoZIhvcNAQELBQAD +ggEBAD0z8mRBdk93+HxqJdW1qZBN2g+AUc/GUaTUa8oW9baHOOvdwUacfdVXpyDo +ffdeTKfdQNs7JYMP5tWupHCrvAGK3sIzPMt7Yr06tBD720IIyPTR3J7A5RmpQNKm +2RCqfO49Pg6U8kx+bDBKNjdCGWowt31cZTlJNXk7NPewtWaGYhuskbvH8gJDtbMd +d9fOepIbzl3u+us8JHFVglBRgjy9sYjUYUT9mnTzfbpebmkdtiicJZNP1j08VZFR +lXoHiESasyzlP8DLI/PQcpL6Lh8KnIifKGEkvXVaryPT2wlEo6Kti2cY8AIJKQgl +0U1jwiNcCwjYoKIXjunOO8T8mKg= +-----END CERTIFICATE----- \ No newline at end of file diff --git a/tests/integration/test_reload_client_certificate/configs_secure/second_client.key b/tests/integration/test_reload_client_certificate/configs_secure/second_client.key new file mode 100644 index 00000000000..f0fb61ac443 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs_secure/second_client.key @@ -0,0 +1,28 @@ +-----BEGIN PRIVATE KEY----- +MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQCz25KjqUkbyjyn +DR9tUZW7S086WNJZxVJcf26IkHCVktCeE3hIVIWiF8jgrxTW0ENOfE8tSj9XUr3L +OguJEMYyTBWPr7j6ETa51LZ49NC/mhox4n3Cuz7lz1Pik62WreksB2bThkwHWdus +TKxx5W50dGRBsYfMenpUbb0XDsZrpimrIdCKFGYSRNTtQJg0D5WTtoM90xL+SHB7 +JOldhmKZor1/QQJTf9U54bd8fpKnZfoU3i1SP9oVcSOxGZTkbZOViJIhQwXYk92F +KKF+TWElUAusc9NTfxI4v4bLIFFZ01ysjBXBum+y8+CmvIxI3GemvQArR0WGPCe6 +ki/mEkg5AgMBAAECggEATrbIBIxwDJOD2/BoUqWkDCY3dGevF8697vFuZKIiQ7PP +TX9j4vPq0DfsmDjHvAPFkTHiTQXzlroFik3LAp+uvhCCVzImmHq0IrwvZ9xtB43f +7Pkc5P6h1l3Ybo8HJ6zRIY3TuLtLxuPSuiOMTQSGRL0zq3SQ5DKuGwkz+kVjHXUN +MR2TECFwMHKQ5VLrC+7PMpsJYyOMlDAWhRfUalxC55xOXTpaN8TxNnwQ8K2ISVY5 +212Jz/a4hn4LdwxSz3Tiu95PN072K87HLWx3EdT6vW4Ge5P/A3y+smIuNAlanMnu +plHBRtpATLiTxZt/n6npyrfQVbYjSH7KWhB8hBHtaQKBgQDh9Cq1c/KtqDtE0Ccr +/r9tZNTUwBE6VP+3OJeKdEdtsfuxjOCkS1oAjgBJiSDOiWPh1DdoDeVZjPKq6pIu +Mq12OE3Doa8znfCXGbkSzEKOb2unKZMJxzrz99kXt40W5DtrqKPNb24CNqTiY8Aa +CjtcX+3weat82VRXvph6U8ltMwKBgQDLxjiQQzNoY7qvg7CwJCjf9qq8jmLK766g +1FHXopqS+dTxDLM8eJSRrpmxGWJvNeNc1uPhsKsKgotqAMdBUQTf7rSTbt4MyoH5 
+bUcRLtr+0QTK9hDWMOOvleqNXha68vATkohWYfCueNsC60qD44o8RZAS6UNy3ENq +cM1cxqe84wKBgQDKkHutWnooJtajlTxY27O/nZKT/HA1bDgniMuKaz4R4Gr1PIez +on3YW3V0d0P7BP6PWRIm7bY79vkiMtLEKdiKUGWeyZdo3eHvhDb/3DCawtau8L2K +GZsHVp2//mS1Lfz7Qh8/L/NedqCQ+L4iWiPnZ3THjjwn3CoZ05ucpvrAMwKBgB54 +nay039MUVq44Owub3KDg+dcIU62U+cAC/9oG7qZbxYPmKkc4oL7IJSNecGHA5SbU +2268RFdl/gLz6tfRjbEOuOHzCjFPdvAdbysanpTMHLNc6FefJ+zxtgk9sJh0C4Jh +vxFrw9nTKKzfEl12gQ1SOaEaUIO0fEBGbe8ZpauRAoGAMAlGV+2/K4ebvAJKOVTa +dKAzQ+TD2SJmeR1HZmKDYddNqwtZlzg3v4ZhCk4eaUmGeC1Bdh8MDuB3QQvXz4Dr +vOIP4UVaOr+uM+7TgAgVnP4/K6IeJGzUDhX93pmpWhODfdu/oojEKVcpCojmEmS1 +KCBtmIrQLqzMpnBpLNuSY+Q= +-----END PRIVATE KEY----- diff --git a/tests/integration/test_reload_client_certificate/configs_secure/third_client.crt b/tests/integration/test_reload_client_certificate/configs_secure/third_client.crt new file mode 100644 index 00000000000..4efb8f1b7b9 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs_secure/third_client.crt @@ -0,0 +1,19 @@ +-----BEGIN CERTIFICATE----- +MIIDCDCCAfCgAwIBAgIUC749qXQA+HcnMauXvrmGf+Yz7KswDQYJKoZIhvcNAQEL +BQAwFDESMBAGA1UEAwwJbG9jYWxob3N0MCAXDTI0MTAyNTA4NDg1N1oYDzIyOTgw +ODEwMDg0ODU3WjAUMRIwEAYDVQQDDAlsb2NhbGhvc3QwggEiMA0GCSqGSIb3DQEB +AQUAA4IBDwAwggEKAoIBAQCz25KjqUkbyjynDR9tUZW7S086WNJZxVJcf26IkHCV +ktCeE3hIVIWiF8jgrxTW0ENOfE8tSj9XUr3LOguJEMYyTBWPr7j6ETa51LZ49NC/ +mhox4n3Cuz7lz1Pik62WreksB2bThkwHWdusTKxx5W50dGRBsYfMenpUbb0XDsZr +pimrIdCKFGYSRNTtQJg0D5WTtoM90xL+SHB7JOldhmKZor1/QQJTf9U54bd8fpKn +ZfoU3i1SP9oVcSOxGZTkbZOViJIhQwXYk92FKKF+TWElUAusc9NTfxI4v4bLIFFZ +01ysjBXBum+y8+CmvIxI3GemvQArR0WGPCe6ki/mEkg5AgMBAAGjUDBOMB0GA1Ud +DgQWBBSM8tve5gKRxqbJkcrj8RXyMA/TBDAfBgNVHSMEGDAWgBSM8tve5gKRxqbJ +kcrj8RXyMA/TBDAMBgNVHRMEBTADAQH/MA0GCSqGSIb3DQEBCwUAA4IBAQB/QYNd +q8ub45u2tsCEr8xgON4CB2UGZD5RazY//W6kPWmLBf8fZjepF7yLjEWP6iQHWVWk +vIVmVsAnIyfOruUYQmxR4N770Tlit9PH7OqNtRzXHGV2el3Rp62mg8NneOx4SHX+ +HITyPF3Wcg7YyWCuwwGXXS2hZ20csQXZima1jVyTNRN0GDvp0xjX+o7gyANGxbxa +EnjXTc4IWbLJ/+k4I38suavXg8RToHt+1Ndp0sHoT7Fxj+mbxOcc3QVtYU/Ct1W7 +cirraodxjWkYX63zDeqteXU8JtNdJE43qFK4BVh3QTj7PhD3PFEAKcPbnJLbdTYC +ZU36rm75uOSdLXNB +-----END CERTIFICATE----- diff --git a/tests/integration/test_reload_client_certificate/configs_secure/third_client.key b/tests/integration/test_reload_client_certificate/configs_secure/third_client.key new file mode 100644 index 00000000000..f0fb61ac443 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/configs_secure/third_client.key @@ -0,0 +1,28 @@ +-----BEGIN PRIVATE KEY----- +MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQCz25KjqUkbyjyn +DR9tUZW7S086WNJZxVJcf26IkHCVktCeE3hIVIWiF8jgrxTW0ENOfE8tSj9XUr3L +OguJEMYyTBWPr7j6ETa51LZ49NC/mhox4n3Cuz7lz1Pik62WreksB2bThkwHWdus +TKxx5W50dGRBsYfMenpUbb0XDsZrpimrIdCKFGYSRNTtQJg0D5WTtoM90xL+SHB7 +JOldhmKZor1/QQJTf9U54bd8fpKnZfoU3i1SP9oVcSOxGZTkbZOViJIhQwXYk92F +KKF+TWElUAusc9NTfxI4v4bLIFFZ01ysjBXBum+y8+CmvIxI3GemvQArR0WGPCe6 +ki/mEkg5AgMBAAECggEATrbIBIxwDJOD2/BoUqWkDCY3dGevF8697vFuZKIiQ7PP +TX9j4vPq0DfsmDjHvAPFkTHiTQXzlroFik3LAp+uvhCCVzImmHq0IrwvZ9xtB43f +7Pkc5P6h1l3Ybo8HJ6zRIY3TuLtLxuPSuiOMTQSGRL0zq3SQ5DKuGwkz+kVjHXUN +MR2TECFwMHKQ5VLrC+7PMpsJYyOMlDAWhRfUalxC55xOXTpaN8TxNnwQ8K2ISVY5 +212Jz/a4hn4LdwxSz3Tiu95PN072K87HLWx3EdT6vW4Ge5P/A3y+smIuNAlanMnu +plHBRtpATLiTxZt/n6npyrfQVbYjSH7KWhB8hBHtaQKBgQDh9Cq1c/KtqDtE0Ccr +/r9tZNTUwBE6VP+3OJeKdEdtsfuxjOCkS1oAjgBJiSDOiWPh1DdoDeVZjPKq6pIu +Mq12OE3Doa8znfCXGbkSzEKOb2unKZMJxzrz99kXt40W5DtrqKPNb24CNqTiY8Aa +CjtcX+3weat82VRXvph6U8ltMwKBgQDLxjiQQzNoY7qvg7CwJCjf9qq8jmLK766g 
+1FHXopqS+dTxDLM8eJSRrpmxGWJvNeNc1uPhsKsKgotqAMdBUQTf7rSTbt4MyoH5 +bUcRLtr+0QTK9hDWMOOvleqNXha68vATkohWYfCueNsC60qD44o8RZAS6UNy3ENq +cM1cxqe84wKBgQDKkHutWnooJtajlTxY27O/nZKT/HA1bDgniMuKaz4R4Gr1PIez +on3YW3V0d0P7BP6PWRIm7bY79vkiMtLEKdiKUGWeyZdo3eHvhDb/3DCawtau8L2K +GZsHVp2//mS1Lfz7Qh8/L/NedqCQ+L4iWiPnZ3THjjwn3CoZ05ucpvrAMwKBgB54 +nay039MUVq44Owub3KDg+dcIU62U+cAC/9oG7qZbxYPmKkc4oL7IJSNecGHA5SbU +2268RFdl/gLz6tfRjbEOuOHzCjFPdvAdbysanpTMHLNc6FefJ+zxtgk9sJh0C4Jh +vxFrw9nTKKzfEl12gQ1SOaEaUIO0fEBGbe8ZpauRAoGAMAlGV+2/K4ebvAJKOVTa +dKAzQ+TD2SJmeR1HZmKDYddNqwtZlzg3v4ZhCk4eaUmGeC1Bdh8MDuB3QQvXz4Dr +vOIP4UVaOr+uM+7TgAgVnP4/K6IeJGzUDhX93pmpWhODfdu/oojEKVcpCojmEmS1 +KCBtmIrQLqzMpnBpLNuSY+Q= +-----END PRIVATE KEY----- diff --git a/tests/integration/test_reload_client_certificate/test.py b/tests/integration/test_reload_client_certificate/test.py new file mode 100644 index 00000000000..cb091d92ea6 --- /dev/null +++ b/tests/integration/test_reload_client_certificate/test.py @@ -0,0 +1,197 @@ +import os +import threading +import time + +import pytest + +from helpers.cluster import ClickHouseCluster + +TEST_DIR = os.path.dirname(__file__) + +cluster = ClickHouseCluster( + __file__, + zookeeper_certfile=os.path.join(TEST_DIR, "configs_secure", "first_client.crt"), + zookeeper_keyfile=os.path.join(TEST_DIR, "configs_secure", "first_client.key"), +) + +node1 = cluster.add_instance( + "node1", + main_configs=[ + "configs_secure/first_client.crt", + "configs_secure/first_client.key", + "configs_secure/second_client.crt", + "configs_secure/second_client.key", + "configs_secure/third_client.crt", + "configs_secure/third_client.key", + "configs_secure/conf.d/remote_servers.xml", + "configs_secure/conf.d/ssl_conf.xml", + "configs/zookeeper_config_with_ssl.xml", + ], + with_zookeeper_secure=True, +) + +node2 = cluster.add_instance( + "node2", + main_configs=[ + "configs_secure/first_client.crt", + "configs_secure/first_client.key", + "configs_secure/second_client.crt", + "configs_secure/second_client.key", + "configs_secure/third_client.crt", + "configs_secure/third_client.key", + "configs_secure/conf.d/remote_servers.xml", + "configs_secure/conf.d/ssl_conf.xml", + "configs/zookeeper_config_with_ssl.xml", + ], + with_zookeeper_secure=True, +) + +nodes = [node1, node2] + + +@pytest.fixture(scope="module", autouse=True) +def started_cluster(): + try: + cluster.start() + yield cluster + finally: + cluster.shutdown() + + +def secure_connection_test(started_cluster): + # No asserts, connection works + + node1.query("SELECT count() FROM system.zookeeper WHERE path = '/'") + node2.query("SELECT count() FROM system.zookeeper WHERE path = '/'") + + threads_number = 4 + iterations = 4 + threads = [] + + # Just checking for race conditions + + for _ in range(threads_number): + threads.append( + threading.Thread( + target=lambda: [ + node1.query("SELECT count() FROM system.zookeeper WHERE path = '/'") + for _ in range(iterations) + ] + ) + ) + for thread in threads: + thread.start() + for thread in threads: + thread.join() + + +def change_config_to_key(name): + """ + Generate config with certificate/key name from args. + Reload config. 
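+    The generated ssl_conf.xml points at
+    /etc/clickhouse-server/config.d/<name>_client.crt and .key; touching the
+    file afterwards lets the server's config reload pick up the new certificate.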
+ """ + for node in nodes: + node.exec_in_container( + [ + "bash", + "-c", + """cat > /etc/clickhouse-server/config.d/ssl_conf.xml << EOF + + + + /etc/clickhouse-server/config.d/{cur_name}_client.crt + /etc/clickhouse-server/config.d/{cur_name}_client.key + true + true + sslv2,sslv3 + true + none + + RejectCertificateHandler + + + + +EOF""".format( + cur_name=name + ), + ] + ) + + node.exec_in_container( + ["bash", "-c", "touch /etc/clickhouse-server/config.d/ssl_conf.xml"], + ) + + +def check_reload_successful(node, cert_name): + return node.grep_in_log( + f"Reloaded certificate (/etc/clickhouse-server/config.d/{cert_name}_client.crt)" + ) + + +def check_error_handshake(node): + return node.count_in_log("Code: 210.") + + +def clean_logs(): + for node in nodes: + node.exec_in_container( + [ + "bash", + "-c", + "echo -n > /var/log/clickhouse-server/clickhouse-server.log", + ] + ) + + +def check_certificate_switch(first, second): + # Set first certificate + + change_config_to_key(first) + + # Restart zookeeper to reload the session + + cluster.stop_zookeeper_nodes(["zoo1", "zoo2", "zoo3"]) + cluster.start_zookeeper_nodes(["zoo1", "zoo2", "zoo3"]) + cluster.wait_zookeeper_nodes_to_start(["zoo1", "zoo2", "zoo3"]) + clean_logs() + + # Change certificate + + change_config_to_key(second) + + # Time to log + + time.sleep(10) + + # Check information about client certificates reloading in log Clickhouse + + reload_successful = any(check_reload_successful(node, second) for node in nodes) + + # Restart zookeeper to reload the session and clean logs for new check + + cluster.stop_zookeeper_nodes(["zoo1", "zoo2", "zoo3"]) + cluster.start_zookeeper_nodes(["zoo1", "zoo2", "zoo3"]) + clean_logs() + cluster.wait_zookeeper_nodes_to_start(["zoo1", "zoo2", "zoo3"]) + + if second == "second": + try: + secure_connection_test(started_cluster) + assert False + except: + assert True + else: + secure_connection_test(started_cluster) + error_handshake = any(check_error_handshake(node) == "0\n" for node in nodes) + assert reload_successful and error_handshake + + +def test_wrong_cn_cert(): + """Checking the certificate reload with an incorrect CN, the expected behavior is Code: 210.""" + check_certificate_switch("first", "second") + + +def test_correct_cn_cert(): + """Replacement with a valid certificate, the expected behavior is to restore the connection with Zookeeper.""" + check_certificate_switch("second", "third") diff --git a/tests/integration/test_s3_access_headers/__init__.py b/tests/integration/test_s3_access_headers/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/integration/test_s3_access_headers/configs/config.d/named_collections.xml b/tests/integration/test_s3_access_headers/configs/config.d/named_collections.xml new file mode 100644 index 00000000000..d08d3401778 --- /dev/null +++ b/tests/integration/test_s3_access_headers/configs/config.d/named_collections.xml @@ -0,0 +1,9 @@ + + + + http://resolver:8081/root/test_named_colections.csv + minio + minio123 + + + diff --git a/tests/integration/test_s3_access_headers/configs/config.d/s3_headers.xml b/tests/integration/test_s3_access_headers/configs/config.d/s3_headers.xml new file mode 100644 index 00000000000..2d2eeb3c7b1 --- /dev/null +++ b/tests/integration/test_s3_access_headers/configs/config.d/s3_headers.xml @@ -0,0 +1,8 @@ + + + + http://resolver:8081/ + custom-auth-token: ValidToken1234 + + + diff --git a/tests/integration/test_s3_access_headers/configs/users.d/users.xml 
b/tests/integration/test_s3_access_headers/configs/users.d/users.xml new file mode 100644 index 00000000000..4b6ba057ecb --- /dev/null +++ b/tests/integration/test_s3_access_headers/configs/users.d/users.xml @@ -0,0 +1,9 @@ + + + + + default + 1 + + + diff --git a/tests/integration/test_s3_access_headers/s3_mocks/mocker_s3.py b/tests/integration/test_s3_access_headers/s3_mocks/mocker_s3.py new file mode 100644 index 00000000000..0bbcb2e60e8 --- /dev/null +++ b/tests/integration/test_s3_access_headers/s3_mocks/mocker_s3.py @@ -0,0 +1,97 @@ +import http.client +import http.server +import random +import socketserver +import sys +import urllib.parse + +UPSTREAM_HOST = "minio1:9001" +random.seed("No list objects/1.0") + + +def request(command, url, headers={}, data=None): + """Mini-requests.""" + + class Dummy: + pass + + parts = urllib.parse.urlparse(url) + c = http.client.HTTPConnection(parts.hostname, parts.port) + c.request( + command, + urllib.parse.urlunparse(parts._replace(scheme="", netloc="")), + headers=headers, + body=data, + ) + r = c.getresponse() + result = Dummy() + result.status_code = r.status + result.headers = r.headers + result.content = r.read() + return result + + +CUSTOM_AUTH_TOKEN_HEADER = "custom-auth-token" +CUSTOM_AUTH_TOKEN_VALID_VALUE = "ValidToken1234" + + +class RequestHandler(http.server.BaseHTTPRequestHandler): + def do_GET(self): + if self.path == "/": + self.send_response(200) + self.send_header("Content-Type", "text/plain") + self.end_headers() + self.wfile.write(b"OK") + return + self.do_HEAD() + + def do_PUT(self): + self.do_HEAD() + + def do_DELETE(self): + self.do_HEAD() + + def do_POST(self): + self.do_HEAD() + + def do_HEAD(self): + + custom_auth_token = self.headers.get(CUSTOM_AUTH_TOKEN_HEADER) + if custom_auth_token and custom_auth_token != CUSTOM_AUTH_TOKEN_VALID_VALUE: + self.send_response(403) + self.send_header("Content-Type", "application/xml") + self.end_headers() + + body = f""" + + AccessDenied + Access Denied. Custom token was {custom_auth_token}, the correct one: {CUSTOM_AUTH_TOKEN_VALID_VALUE}. 
+ RESOURCE + REQUEST_ID + +""" + self.wfile.write(body.encode()) + return + + content_length = self.headers.get("Content-Length") + data = self.rfile.read(int(content_length)) if content_length else None + r = request( + self.command, + f"http://{UPSTREAM_HOST}{self.path}", + headers=self.headers, + data=data, + ) + self.send_response(r.status_code) + for k, v in r.headers.items(): + self.send_header(k, v) + self.end_headers() + self.wfile.write(r.content) + self.wfile.close() + + +class ThreadedHTTPServer(socketserver.ThreadingMixIn, http.server.HTTPServer): + """Handle requests in a separate thread.""" + + +httpd = ThreadedHTTPServer(("0.0.0.0", int(sys.argv[1])), RequestHandler) +httpd.serve_forever() diff --git a/tests/integration/test_s3_access_headers/test.py b/tests/integration/test_s3_access_headers/test.py new file mode 100644 index 00000000000..4d4a5b81230 --- /dev/null +++ b/tests/integration/test_s3_access_headers/test.py @@ -0,0 +1,124 @@ +import logging +import os + +import pytest + +from helpers.cluster import ClickHouseCluster +from helpers.mock_servers import start_mock_servers +from helpers.s3_tools import prepare_s3_bucket + +SCRIPT_DIR = os.path.dirname(os.path.realpath(__file__)) + + +def run_s3_mocks(started_cluster): + script_dir = os.path.join(os.path.dirname(__file__), "s3_mocks") + start_mock_servers( + started_cluster, + script_dir, + [ + ("mocker_s3.py", "resolver", "8081"), + ], + ) + + +@pytest.fixture(scope="module") +def started_cluster(): + cluster = ClickHouseCluster(__file__, with_spark=True) + try: + cluster.add_instance( + "node1", + main_configs=[ + "configs/config.d/named_collections.xml", + "configs/config.d/s3_headers.xml", + ], + user_configs=["configs/users.d/users.xml"], + with_minio=True, + ) + + logging.info("Starting cluster...") + cluster.start() + + prepare_s3_bucket(cluster) + logging.info("S3 bucket created") + + run_s3_mocks(cluster) + yield cluster + + finally: + cluster.shutdown() + + +CUSTOM_AUTH_TOKEN = "custom-auth-token" +CORRECT_TOKEN = "ValidToken1234" +INCORRECT_TOKEN = "InvalidToken1234" + + +@pytest.mark.parametrize( + "table_name, engine, query_with_invalid_token_must_fail", + [ + pytest.param( + "test_access_header", + "S3('http://resolver:8081/root/test_access_header.csv', 'CSV')", + True, + id="test_access_over_custom_header", + ), + pytest.param( + "test_static_override", + "S3('http://resolver:8081/root/test_static_override.csv', 'minio', 'minio123', 'CSV')", + False, + id="test_access_key_id_overrides_access_header", + ), + pytest.param( + "test_named_colections", + "S3(s3_mock, format='CSV')", + False, + id="test_named_coll_overrides_access_header", + ), + ], +) +def test_custom_access_header( + started_cluster, table_name, engine, query_with_invalid_token_must_fail +): + instance = started_cluster.instances["node1"] + + instance.query( + f""" + SET s3_truncate_on_insert=1; + INSERT INTO FUNCTION s3('http://minio1:9001/root/{table_name}.csv', 'minio', 'minio123','CSV') + SELECT number as a, toString(number) as b FROM numbers(3); + """ + ) + instance.query( + f""" + DROP TABLE IF EXISTS {table_name}; + CREATE TABLE {table_name} (name String, value UInt32) + ENGINE={engine}; + """ + ) + instance.query("SYSTEM DROP QUERY CACHE") + + assert instance.query(f"SELECT count(*) FROM {table_name}") == "3\n" + + config_path = "/etc/clickhouse-server/config.d/s3_headers.xml" + + instance.replace_in_config( + config_path, + f"{CUSTOM_AUTH_TOKEN}: {CORRECT_TOKEN}", + f"{CUSTOM_AUTH_TOKEN}: {INCORRECT_TOKEN}", + ) + 
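+    # The config now advertises an invalid token, so after SYSTEM RELOAD CONFIG the
+    # mock S3 endpoint answers 403 for requests authenticated only by the custom
+    # header; engines with explicit or named-collection credentials keep working.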
instance.query("SYSTEM RELOAD CONFIG") + + if query_with_invalid_token_must_fail: + instance.query_and_get_error(f"SELECT count(*) FROM {table_name}") + + else: + assert instance.query(f"SELECT count(*) FROM {table_name}") == "3\n" + + instance.replace_in_config( + config_path, + f"{CUSTOM_AUTH_TOKEN}: {INCORRECT_TOKEN}", + f"{CUSTOM_AUTH_TOKEN}: {CORRECT_TOKEN}", + ) + + instance.query("SYSTEM RELOAD CONFIG") + assert instance.query(f"SELECT count(*) FROM {table_name}") == "3\n" diff --git a/tests/integration/test_scheduler/configs/storage_configuration.xml b/tests/integration/test_scheduler/configs/storage_configuration.xml index 823a00a05de..9498044c836 100644 --- a/tests/integration/test_scheduler/configs/storage_configuration.xml +++ b/tests/integration/test_scheduler/configs/storage_configuration.xml @@ -1,4 +1,5 @@ + /clickhouse/workload/definitions.sql @@ -12,6 +13,15 @@ network_read network_write + + s3 + http://minio1:9001/root/data/ + minio + minio123 + 33554432 + 10 + 10 + @@ -21,6 +31,13 @@ + + +
+ s3_no_resource +
+
+
diff --git a/tests/integration/test_scheduler/test.py b/tests/integration/test_scheduler/test.py index 050281b2e3a..c8f16c150e1 100644 --- a/tests/integration/test_scheduler/test.py +++ b/tests/integration/test_scheduler/test.py @@ -2,6 +2,7 @@ # pylint: disable=redefined-outer-name # pylint: disable=line-too-long +import random import threading import time @@ -9,6 +10,7 @@ import pytest from helpers.client import QueryRuntimeException from helpers.cluster import ClickHouseCluster +from helpers.network import PartitionManager cluster = ClickHouseCluster(__file__) @@ -23,6 +25,21 @@ node = cluster.add_instance( "configs/workloads.xml.default", ], with_minio=True, + with_zookeeper=True, +) + +node2 = cluster.add_instance( + "node2", + stay_alive=True, + main_configs=[ + "configs/storage_configuration.xml", + "configs/resources.xml", + "configs/resources.xml.default", + "configs/workloads.xml", + "configs/workloads.xml.default", + ], + with_minio=True, + with_zookeeper=True, ) @@ -55,6 +72,22 @@ def set_default_configs(): yield +@pytest.fixture(scope="function", autouse=True) +def clear_workloads_and_resources(): + node.query( + f""" + drop workload if exists production; + drop workload if exists development; + drop workload if exists admin; + drop workload if exists all; + drop resource if exists io_write; + drop resource if exists io_read; + drop resource if exists io; + """ + ) + yield + + def update_workloads_config(**settings): xml = "" for name in settings: @@ -570,3 +603,366 @@ def test_mutation_workload_change(): assert reads_before < reads_after assert writes_before < writes_after + + +def test_create_workload(): + node.query( + f""" + create resource io_write (write disk s3_no_resource); + create resource io_read (read disk s3_no_resource); + create workload all settings max_cost = 1000000 for io_write, max_cost = 2000000 for io_read; + create workload admin in all settings priority = 0; + create workload production in all settings priority = 1, weight = 9; + create workload development in all settings priority = 1, weight = 1; + """ + ) + + def do_checks(): + assert ( + node.query( + f"select count() from system.scheduler where path ilike '%/admin/%' and type='fifo'" + ) + == "2\n" + ) + assert ( + node.query( + f"select count() from system.scheduler where path ilike '%/admin' and type='unified' and priority=0" + ) + == "2\n" + ) + assert ( + node.query( + f"select count() from system.scheduler where path ilike '%/production/%' and type='fifo'" + ) + == "2\n" + ) + assert ( + node.query( + f"select count() from system.scheduler where path ilike '%/production' and type='unified' and weight=9" + ) + == "2\n" + ) + assert ( + node.query( + f"select count() from system.scheduler where path ilike '%/development/%' and type='fifo'" + ) + == "2\n" + ) + assert ( + node.query( + f"select count() from system.scheduler where path ilike '%/all/%' and type='inflight_limit' and resource='io_write' and max_cost=1000000" + ) + == "1\n" + ) + assert ( + node.query( + f"select count() from system.scheduler where path ilike '%/all/%' and type='inflight_limit' and resource='io_read' and max_cost=2000000" + ) + == "1\n" + ) + + do_checks() + node.restart_clickhouse() # Check that workloads persist + do_checks() + + +def test_workload_hierarchy_changes(): + node.query("create resource io_write (write disk s3_no_resource);") + node.query("create resource io_read (read disk s3_no_resource);") + queries = [ + "create workload all;", + "create workload X in all settings priority = 0;", + "create workload Y 
in all settings priority = 1;", + "create workload A1 in X settings priority = -1;", + "create workload B1 in X settings priority = 1;", + "create workload C1 in Y settings priority = -1;", + "create workload D1 in Y settings priority = 1;", + "create workload A2 in X settings priority = -1;", + "create workload B2 in X settings priority = 1;", + "create workload C2 in Y settings priority = -1;", + "create workload D2 in Y settings priority = 1;", + "drop workload A1;", + "drop workload A2;", + "drop workload B1;", + "drop workload B2;", + "drop workload C1;", + "drop workload C2;", + "drop workload D1;", + "drop workload D2;", + "create workload Z in all;", + "create workload A1 in Z settings priority = -1;", + "create workload A2 in Z settings priority = -1;", + "create workload A3 in Z settings priority = -1;", + "create workload B1 in Z settings priority = 1;", + "create workload B2 in Z settings priority = 1;", + "create workload B3 in Z settings priority = 1;", + "create workload C1 in X settings priority = -1;", + "create workload C2 in X settings priority = -1;", + "create workload C3 in X settings priority = -1;", + "create workload D1 in X settings priority = 1;", + "create workload D2 in X settings priority = 1;", + "create workload D3 in X settings priority = 1;", + "drop workload A1;", + "drop workload B1;", + "drop workload C1;", + "drop workload D1;", + "drop workload A2;", + "drop workload B2;", + "drop workload C2;", + "drop workload D2;", + "drop workload A3;", + "drop workload B3;", + "drop workload C3;", + "drop workload D3;", + "drop workload X;", + "drop workload Y;", + "drop workload Z;", + "drop workload all;", + ] + for iteration in range(3): + split_idx = random.randint(1, len(queries) - 2) + for query_idx in range(0, split_idx): + node.query(queries[query_idx]) + node.query( + "create resource io_test (write disk non_existent_disk, read disk non_existent_disk);" + ) + node.query("drop resource io_test;") + for query_idx in range(split_idx, len(queries)): + node.query(queries[query_idx]) + + +def test_resource_read_and_write(): + node.query( + f""" + drop table if exists data; + create table data (key UInt64 CODEC(NONE)) engine=MergeTree() order by tuple() settings min_bytes_for_wide_part=1e9, storage_policy='s3_no_resource'; + """ + ) + + node.query( + f""" + create resource io_write (write disk s3_no_resource); + create resource io_read (read disk s3_no_resource); + create workload all settings max_cost = 1000000; + create workload admin in all settings priority = 0; + create workload production in all settings priority = 1, weight = 9; + create workload development in all settings priority = 1, weight = 1; + """ + ) + + def write_query(workload): + try: + node.query( + f"insert into data select * from numbers(1e5) settings workload='{workload}'" + ) + except QueryRuntimeException: + pass + + thread1 = threading.Thread(target=write_query, args=["development"]) + thread2 = threading.Thread(target=write_query, args=["production"]) + thread3 = threading.Thread(target=write_query, args=["admin"]) + + thread1.start() + thread2.start() + thread3.start() + + thread3.join() + thread2.join() + thread1.join() + + assert ( + node.query( + f"select dequeued_requests>0 from system.scheduler where resource='io_write' and path ilike '%/admin/%' and type='fifo'" + ) + == "1\n" + ) + assert ( + node.query( + f"select dequeued_requests>0 from system.scheduler where resource='io_write' and path ilike '%/development/%' and type='fifo'" + ) + == "1\n" + ) + assert ( + node.query( + 
f"select dequeued_requests>0 from system.scheduler where resource='io_write' and path ilike '%/production/%' and type='fifo'" + ) + == "1\n" + ) + + def read_query(workload): + try: + node.query(f"select sum(key*key) from data settings workload='{workload}'") + except QueryRuntimeException: + pass + + thread1 = threading.Thread(target=read_query, args=["development"]) + thread2 = threading.Thread(target=read_query, args=["production"]) + thread3 = threading.Thread(target=read_query, args=["admin"]) + + thread1.start() + thread2.start() + thread3.start() + + thread3.join() + thread2.join() + thread1.join() + + assert ( + node.query( + f"select dequeued_requests>0 from system.scheduler where resource='io_read' and path ilike '%/admin/%' and type='fifo'" + ) + == "1\n" + ) + assert ( + node.query( + f"select dequeued_requests>0 from system.scheduler where resource='io_read' and path ilike '%/development/%' and type='fifo'" + ) + == "1\n" + ) + assert ( + node.query( + f"select dequeued_requests>0 from system.scheduler where resource='io_read' and path ilike '%/production/%' and type='fifo'" + ) + == "1\n" + ) + + +def test_resource_any_disk(): + node.query( + f""" + drop table if exists data; + create table data (key UInt64 CODEC(NONE)) engine=MergeTree() order by tuple() settings min_bytes_for_wide_part=1e9, storage_policy='s3_no_resource'; + """ + ) + + node.query( + f""" + create resource io (write any disk, read any disk); + create workload all settings max_cost = 1000000; + """ + ) + + node.query(f"insert into data select * from numbers(1e5) settings workload='all'") + + assert ( + node.query( + f"select dequeued_requests>0 from system.scheduler where resource='io' and path ilike '%/all/%' and type='fifo'" + ) + == "1\n" + ) + + node.query(f"select sum(key*key) from data settings workload='all'") + + assert ( + node.query( + f"select dequeued_requests>0 from system.scheduler where resource='io' and path ilike '%/all/%' and type='fifo'" + ) + == "1\n" + ) + + +def test_workload_entity_keeper_storage(): + node.query("create resource io_write (write disk s3_no_resource);") + node.query("create resource io_read (read disk s3_no_resource);") + queries = [ + "create workload all;", + "create workload X in all settings priority = 0;", + "create workload Y in all settings priority = 1;", + "create workload A1 in X settings priority = -1;", + "create workload B1 in X settings priority = 1;", + "create workload C1 in Y settings priority = -1;", + "create workload D1 in Y settings priority = 1;", + "create workload A2 in X settings priority = -1;", + "create workload B2 in X settings priority = 1;", + "create workload C2 in Y settings priority = -1;", + "create workload D2 in Y settings priority = 1;", + "drop workload A1;", + "drop workload A2;", + "drop workload B1;", + "drop workload B2;", + "drop workload C1;", + "drop workload C2;", + "drop workload D1;", + "drop workload D2;", + "create workload Z in all;", + "create workload A1 in Z settings priority = -1;", + "create workload A2 in Z settings priority = -1;", + "create workload A3 in Z settings priority = -1;", + "create workload B1 in Z settings priority = 1;", + "create workload B2 in Z settings priority = 1;", + "create workload B3 in Z settings priority = 1;", + "create workload C1 in X settings priority = -1;", + "create workload C2 in X settings priority = -1;", + "create workload C3 in X settings priority = -1;", + "create workload D1 in X settings priority = 1;", + "create workload D2 in X settings priority = 1;", + "create workload 
D3 in X settings priority = 1;", + "drop workload A1;", + "drop workload B1;", + "drop workload C1;", + "drop workload D1;", + "drop workload A2;", + "drop workload B2;", + "drop workload C2;", + "drop workload D2;", + "drop workload A3;", + "drop workload B3;", + "drop workload C3;", + "drop workload D3;", + "drop workload X;", + "drop workload Y;", + "drop workload Z;", + "drop workload all;", + ] + + def check_consistency(): + checks = [ + "select name, create_query from system.workloads order by all", + "select name, create_query from system.resources order by all", + "select resource, path, type, weight, priority, max_requests, max_cost, max_speed, max_burst from system.scheduler where resource not in ['network_read', 'network_write'] order by all", + ] + attempts = 30 + value1 = "" + value2 = "" + error_query = "" + retry_period = 0.1 + for attempt in range(attempts): + for query in checks: + value1 = node.query(query) + value2 = node2.query(query) + if value1 != value2: + error_query = query + break # error + else: + break # success + time.sleep(retry_period) + retry_period = min(3, retry_period * 1.5) + else: + raise Exception( + f"query '{error_query}' gives different results after {attempts} attempts:\n=== leader node ===\n{value1}\n=== follower node ===\n{value2}" + ) + + for iteration in range(3): + split_idx_1 = random.randint(1, len(queries) - 3) + split_idx_2 = random.randint(split_idx_1 + 1, len(queries) - 2) + + with PartitionManager() as pm: + pm.drop_instance_zk_connections(node2) + for query_idx in range(0, split_idx_1): + node.query(queries[query_idx]) + + check_consistency() + + with PartitionManager() as pm: + pm.drop_instance_zk_connections(node2) + for query_idx in range(split_idx_1, split_idx_2): + node.query(queries[query_idx]) + + check_consistency() + + with PartitionManager() as pm: + pm.drop_instance_zk_connections(node2) + for query_idx in range(split_idx_2, len(queries)): + node.query(queries[query_idx]) + + check_consistency() diff --git a/tests/integration/test_storage_azure_blob_storage/test_check_after_upload.py b/tests/integration/test_storage_azure_blob_storage/test_check_after_upload.py new file mode 100644 index 00000000000..c647c96809d --- /dev/null +++ b/tests/integration/test_storage_azure_blob_storage/test_check_after_upload.py @@ -0,0 +1,129 @@ +import logging +import os + +import pytest + +from helpers.cluster import ClickHouseCluster +from test_storage_azure_blob_storage.test import azure_query + +SCRIPT_DIR = os.path.dirname(os.path.realpath(__file__)) +NODE_NAME = "node" +TABLE_NAME = "blob_storage_table" +AZURE_BLOB_STORAGE_DISK = "blob_storage_disk" +LOCAL_DISK = "hdd" +CONTAINER_NAME = "cont" + + +def generate_cluster_def(port): + path = os.path.join( + os.path.dirname(os.path.realpath(__file__)), + "./_gen/disk_storage_conf.xml", + ) + os.makedirs(os.path.dirname(path), exist_ok=True) + with open(path, "w") as f: + f.write( + f""" + + + + azure_blob_storage + http://azurite1:{port}/devstoreaccount1 + cont + false + devstoreaccount1 + Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw== + 100000 + 100000 + 10 + 10 + true + + + local + / + + + + + +
+                                <disk>blob_storage_disk</disk>
+                            </main>
+                            <external>
+                                <disk>hdd</disk>
+                            </external>
+                        </volumes>
+                    </blob_storage_policy>
+                </policies>
+            </storage_configuration>
+        </clickhouse>
+""" + ) + return path + + +@pytest.fixture(scope="module") +def cluster(): + try: + cluster = ClickHouseCluster(__file__) + port = cluster.azurite_port + path = generate_cluster_def(port) + cluster.add_instance( + NODE_NAME, + main_configs=[ + path, + ], + with_azurite=True, + ) + logging.info("Starting cluster...") + cluster.start() + logging.info("Cluster started") + + yield cluster + finally: + cluster.shutdown() + + +# Note: use azure_query for selects and inserts and create table queries. +# For inserts there is no guarantee that retries will not result in duplicates. +# But it is better to retry anyway because connection related errors +# happens in fact only for inserts because reads already have build-in retries in code. + + +def create_table(node, table_name, **additional_settings): + settings = { + "storage_policy": "blob_storage_policy", + "old_parts_lifetime": 1, + "index_granularity": 512, + "temporary_directories_lifetime": 1, + } + settings.update(additional_settings) + + create_table_statement = f""" + CREATE TABLE {table_name} ( + dt Date, + id Int64, + data String, + INDEX min_max (id) TYPE minmax GRANULARITY 3 + ) ENGINE=MergeTree() + PARTITION BY dt + ORDER BY (dt, id) + SETTINGS {",".join((k+"="+repr(v) for k, v in settings.items()))}""" + + azure_query(node, f"DROP TABLE IF EXISTS {table_name}") + azure_query(node, create_table_statement) + assert ( + azure_query(node, f"SELECT COUNT(*) FROM {table_name} FORMAT Values") == "(0)" + ) + + +def test_simple(cluster): + node = cluster.instances[NODE_NAME] + create_table(node, TABLE_NAME) + + values = "('2021-11-13',3,'hello')" + azure_query(node, f"INSERT INTO {TABLE_NAME} VALUES {values}") + assert ( + azure_query(node, f"SELECT dt, id, data FROM {TABLE_NAME} FORMAT Values") + == values + ) diff --git a/tests/integration/test_storage_hdfs/test.py b/tests/integration/test_storage_hdfs/test.py index 362ea7d5bda..366bc28d2c9 100644 --- a/tests/integration/test_storage_hdfs/test.py +++ b/tests/integration/test_storage_hdfs/test.py @@ -396,6 +396,21 @@ def test_read_files_with_spaces(started_cluster): node1.query(f"drop table test") +def test_write_files_with_spaces(started_cluster): + fs = HdfsClient(hosts=started_cluster.hdfs_ip) + dir = "/itime=2024-10-24 10%3A02%3A04" + fs.mkdirs(dir) + + node1.query( + f"insert into function hdfs('hdfs://hdfs1:9000{dir}/test.csv', TSVRaw) select 123 settings hdfs_truncate_on_insert=1" + ) + result = node1.query( + f"select * from hdfs('hdfs://hdfs1:9000{dir}/test.csv', TSVRaw)" + ) + assert int(result) == 123 + fs.delete(dir, recursive=True) + + def test_truncate_table(started_cluster): hdfs_api = started_cluster.hdfs_api node1.query( diff --git a/tests/integration/test_storage_kafka/test.py b/tests/integration/test_storage_kafka/test.py index 0bade55415f..336ca824a2d 100644 --- a/tests/integration/test_storage_kafka/test.py +++ b/tests/integration/test_storage_kafka/test.py @@ -4193,7 +4193,7 @@ def test_kafka_formats_with_broken_message(kafka_cluster, create_query_generator ], "expected": { "raw_message": "050102696405496E743634000000000000000007626C6F636B4E6F06537472696E67034241440476616C3106537472696E6702414D0476616C3207466C6F617433320000003F0476616C330555496E743801", - "error": "Cannot convert: String to UInt16", + "error": "Cannot parse string 'BAD' as UInt16", }, "printable": False, }, diff --git a/tests/integration/test_storage_s3/test.py b/tests/integration/test_storage_s3/test.py index ad1842f4509..d8326711d84 100644 --- a/tests/integration/test_storage_s3/test.py +++ 
b/tests/integration/test_storage_s3/test.py @@ -1592,7 +1592,7 @@ def test_parallel_reading_with_memory_limit(started_cluster): f"select * from url('http://{started_cluster.minio_host}:{started_cluster.minio_port}/{bucket}/test_memory_limit.native') settings max_memory_usage=1000" ) - assert "Memory limit (for query) exceeded" in result + assert "Query memory limit exceeded" in result time.sleep(5) diff --git a/tests/integration/test_storage_s3_queue/test.py b/tests/integration/test_storage_s3_queue/test.py index a1fbf0882b6..c495fc1d44f 100644 --- a/tests/integration/test_storage_s3_queue/test.py +++ b/tests/integration/test_storage_s3_queue/test.py @@ -1721,6 +1721,7 @@ def test_upgrade(started_cluster): files_path, additional_settings={ "keeper_path": keeper_path, + "after_processing": "keep", }, ) total_values = generate_random_files( @@ -2099,3 +2100,166 @@ def test_processing_threads(started_cluster): assert node.contains_in_log( f"StorageS3Queue (default.{table_name}): Using 16 processing threads" ) + + +def test_alter_settings(started_cluster): + node1 = started_cluster.instances["node1"] + node2 = started_cluster.instances["node2"] + + table_name = f"test_alter_settings_{uuid.uuid4().hex[:8]}" + dst_table_name = f"{table_name}_dst" + keeper_path = f"/clickhouse/test_{table_name}" + files_path = f"{table_name}_data" + files_to_generate = 1000 + + node1.query("DROP DATABASE IF EXISTS r") + node2.query("DROP DATABASE IF EXISTS r") + + node1.query( + f"CREATE DATABASE r ENGINE=Replicated('/clickhouse/databases/{table_name}', 'shard1', 'node1')" + ) + node2.query( + f"CREATE DATABASE r ENGINE=Replicated('/clickhouse/databases/{table_name}', 'shard1', 'node2')" + ) + + create_table( + started_cluster, + node1, + table_name, + "unordered", + files_path, + additional_settings={ + "keeper_path": keeper_path, + "processing_threads_num": 10, + "loading_retries": 20, + }, + database_name="r", + ) + + assert '"processing_threads_num":10' in node1.query( + f"SELECT * FROM system.zookeeper WHERE path = '{keeper_path}'" + ) + + assert '"loading_retries":20' in node1.query( + f"SELECT * FROM system.zookeeper WHERE path = '{keeper_path}'" + ) + + assert '"after_processing":"keep"' in node1.query( + f"SELECT * FROM system.zookeeper WHERE path = '{keeper_path}'" + ) + + total_values = generate_random_files( + started_cluster, files_path, files_to_generate, start_ind=0, row_num=1 + ) + + create_mv(node1, f"r.{table_name}", dst_table_name) + create_mv(node2, f"r.{table_name}", dst_table_name) + + def get_count(): + return int( + node1.query( + f"SELECT count() FROM clusterAllReplicas(cluster, default.{dst_table_name})" + ) + ) + + expected_rows = files_to_generate + for _ in range(20): + if expected_rows == get_count(): + break + time.sleep(1) + assert expected_rows == get_count() + + node1.query( + f""" + ALTER TABLE r.{table_name} + MODIFY SETTING processing_threads_num=5, + loading_retries=10, + after_processing='delete', + tracked_files_limit=50, + tracked_file_ttl_sec=10000, + polling_min_timeout_ms=222, + polling_max_timeout_ms=333, + polling_backoff_ms=111 + """ + ) + + int_settings = { + "processing_threads_num": 5, + "loading_retries": 10, + "tracked_files_ttl_sec": 10000, + "tracked_files_limit": 50, + "polling_min_timeout_ms": 222, + "polling_max_timeout_ms": 333, + "polling_backoff_ms": 111, + } + string_settings = {"after_processing": "delete"} + + def with_keeper(setting): + return setting in { + "after_processing", + "loading_retries", + "processing_threads_num", + "tracked_files_limit", 
+ "tracked_files_ttl_sec", + } + + def check_int_settings(node, settings): + for setting, value in settings.items(): + if with_keeper(setting): + assert f'"{setting}":{value}' in node.query( + f"SELECT * FROM system.zookeeper WHERE path = '{keeper_path}'" + ) + if setting == "tracked_files_ttl_sec": + setting = "tracked_file_ttl_sec" + assert ( + str(value) + == node.query( + f"SELECT value FROM system.s3_queue_settings WHERE name = '{setting}' and table = '{table_name}'" + ).strip() + ) + + def check_string_settings(node, settings): + for setting, value in settings.items(): + if with_keeper(setting): + assert f'"{setting}":"{value}"' in node.query( + f"SELECT * FROM system.zookeeper WHERE path = '{keeper_path}'" + ) + assert ( + str(value) + == node.query( + f"SELECT value FROM system.s3_queue_settings WHERE name = '{setting}' and table = '{table_name}'" + ).strip() + ) + + for node in [node1, node2]: + check_int_settings(node, int_settings) + check_string_settings(node, string_settings) + + node.restart_clickhouse() + + check_int_settings(node, int_settings) + check_string_settings(node, string_settings) + + node1.query( + f""" + ALTER TABLE r.{table_name} RESET SETTING after_processing, tracked_file_ttl_sec + """ + ) + + int_settings = { + "processing_threads_num": 5, + "loading_retries": 10, + "tracked_files_ttl_sec": 0, + "tracked_files_limit": 50, + } + string_settings = {"after_processing": "keep"} + + for node in [node1, node2]: + check_int_settings(node, int_settings) + check_string_settings(node, string_settings) + + node.restart_clickhouse() + assert expected_rows == get_count() + + check_int_settings(node, int_settings) + check_string_settings(node, string_settings) diff --git a/tests/integration/test_user_valid_until/test.py b/tests/integration/test_user_valid_until/test.py index eea05af9e45..828432f223e 100644 --- a/tests/integration/test_user_valid_until/test.py +++ b/tests/integration/test_user_valid_until/test.py @@ -76,6 +76,18 @@ def test_basic(started_cluster): node.query("DROP USER IF EXISTS user_basic") + # NOT IDENTIFIED test to make sure valid until is also parsed on its short-circuit + node.query("CREATE USER user_basic NOT IDENTIFIED VALID UNTIL '01/01/2010'") + + assert ( + node.query("SHOW CREATE USER user_basic") + == "CREATE USER user_basic IDENTIFIED WITH no_password VALID UNTIL \\'2010-01-01 00:00:00\\'\n" + ) + + assert error in node.query_and_get_error("SELECT 1", user="user_basic") + + node.query("DROP USER IF EXISTS user_basic") + def test_details(started_cluster): node.query("DROP USER IF EXISTS user_details_infinity, user_details_time_only") @@ -124,3 +136,51 @@ def test_restart(started_cluster): assert error in node.query_and_get_error("SELECT 1", user="user_restart") node.query("DROP USER IF EXISTS user_restart") + + +def test_multiple_authentication_methods(started_cluster): + node.query("DROP USER IF EXISTS user_basic") + + node.query( + "CREATE USER user_basic IDENTIFIED WITH plaintext_password BY 'no_expiration'," + "plaintext_password by 'not_expired' VALID UNTIL '06/11/2040', plaintext_password by 'expired' VALID UNTIL '06/11/2010'," + "plaintext_password by 'infinity' VALID UNTIL 'infinity'" + ) + + assert ( + node.query("SHOW CREATE USER user_basic") + == "CREATE USER user_basic IDENTIFIED WITH plaintext_password, plaintext_password VALID UNTIL \\'2040-11-06 00:00:00\\', " + "plaintext_password VALID UNTIL \\'2010-11-06 00:00:00\\', plaintext_password\n" + ) + assert node.query("SELECT 1", user="user_basic", password="no_expiration") == "1\n" 
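+    # The remaining non-expired passwords ('not_expired', 'infinity') must also authenticate; only the 'expired' one is rejected below.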
+ assert node.query("SELECT 1", user="user_basic", password="not_expired") == "1\n" + assert node.query("SELECT 1", user="user_basic", password="infinity") == "1\n" + + error = "Authentication failed" + assert error in node.query_and_get_error( + "SELECT 1", user="user_basic", password="expired" + ) + + # Expire them all + node.query("ALTER USER user_basic VALID UNTIL '06/11/2010 08:03:20'") + + assert ( + node.query("SHOW CREATE USER user_basic") + == "CREATE USER user_basic IDENTIFIED WITH plaintext_password VALID UNTIL \\'2010-11-06 08:03:20\\'," + " plaintext_password VALID UNTIL \\'2010-11-06 08:03:20\\'," + " plaintext_password VALID UNTIL \\'2010-11-06 08:03:20\\'," + " plaintext_password VALID UNTIL \\'2010-11-06 08:03:20\\'\n" + ) + + assert error in node.query_and_get_error( + "SELECT 1", user="user_basic", password="no_expiration" + ) + assert error in node.query_and_get_error( + "SELECT 1", user="user_basic", password="not_expired" + ) + assert error in node.query_and_get_error( + "SELECT 1", user="user_basic", password="infinity" + ) + assert error in node.query_and_get_error( + "SELECT 1", user="user_basic", password="expired" + ) diff --git a/tests/performance/replacing_final_non_intersecting.xml b/tests/performance/replacing_final_non_intersecting.xml new file mode 100644 index 00000000000..b3d32f1ca2e --- /dev/null +++ b/tests/performance/replacing_final_non_intersecting.xml @@ -0,0 +1,26 @@ + + + + 0 + 0 + + + + CREATE TABLE replacing_final_non_intersecting (d DateTime, c1 UInt64, c2 String, c3 LowCardinality(String)) + ENGINE = ReplacingMergeTree() + ORDER BY d + + + INSERT INTO replacing_final_non_intersecting SELECT toDateTime('2020-10-10 00:00:00') - number, number, toString(number), toString(number % 1000) FROM numbers(0, 5000000) + OPTIMIZE TABLE replacing_final_non_intersecting FINAL + SYSTEM STOP MERGES replacing_final_non_intersecting + INSERT INTO replacing_final_non_intersecting SELECT toDateTime('2020-10-10 00:00:00') - number, number, toString(number), toString(number % 1000) FROM numbers(5000000, 500000) + + SELECT * FROM replacing_final_non_intersecting FINAL FORMAT Null SETTINGS enable_vertical_final = 0 + SELECT * FROM replacing_final_non_intersecting FINAL FORMAT Null SETTINGS enable_vertical_final = 1 + + DROP TABLE IF EXISTS replacing_final_non_intersecting + diff --git a/tests/queries/0_stateless/01079_bad_alters_zookeeper_long.sh b/tests/queries/0_stateless/01079_bad_alters_zookeeper_long.sh index 22f8e5269bd..a619bcdbce2 100755 --- a/tests/queries/0_stateless/01079_bad_alters_zookeeper_long.sh +++ b/tests/queries/0_stateless/01079_bad_alters_zookeeper_long.sh @@ -22,7 +22,13 @@ $CLICKHOUSE_CLIENT --query "ALTER TABLE table_for_bad_alters MODIFY COLUMN value sleep 2 -while [[ $($CLICKHOUSE_CLIENT --query "KILL MUTATION WHERE mutation_id='0000000000' and database = '$CLICKHOUSE_DATABASE'" 2>&1) ]]; do +counter=0 retries=60 +while [[ $counter -lt $retries ]]; do + output=$($CLICKHOUSE_CLIENT --query "KILL MUTATION WHERE mutation_id='0000000000' and database = '$CLICKHOUSE_DATABASE'" 2>&1) + if [[ "$output" == *"finished"* ]]; then + break + fi + ((++counter)) sleep 1 done diff --git a/tests/queries/0_stateless/01162_strange_mutations.sh b/tests/queries/0_stateless/01162_strange_mutations.sh index f2428141264..db7ec8e0755 100755 --- a/tests/queries/0_stateless/01162_strange_mutations.sh +++ b/tests/queries/0_stateless/01162_strange_mutations.sh @@ -1,6 +1,7 @@ #!/usr/bin/env bash -# Tags: no-replicated-database +# Tags: no-replicated-database, 
no-shared-merge-tree # Tag no-replicated-database: CREATE AS SELECT is disabled +# Tag no-shared-merge-tree -- implemented separate test, just bad substituion here CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/01164_alter_memory_database.sql b/tests/queries/0_stateless/01164_alter_memory_database.sql index f46fc8f9853..0beddbfaa88 100644 --- a/tests/queries/0_stateless/01164_alter_memory_database.sql +++ b/tests/queries/0_stateless/01164_alter_memory_database.sql @@ -1,4 +1,5 @@ --- Tags: zookeeper, no-parallel +-- Tags: zookeeper, no-parallel, no-shared-merge-tree +-- no-shared-merge-tree: doesn't support databases without UUID drop database if exists test_1164_memory; create database test_1164_memory engine=Memory; diff --git a/tests/queries/0_stateless/01165_lost_part_empty_partition.sql b/tests/queries/0_stateless/01165_lost_part_empty_partition.sql index 787d4567218..2ed46a05823 100644 --- a/tests/queries/0_stateless/01165_lost_part_empty_partition.sql +++ b/tests/queries/0_stateless/01165_lost_part_empty_partition.sql @@ -1,4 +1,5 @@ --- Tags: zookeeper +-- Tags: zookeeper, no-shared-merge-tree +-- no-shared-merge-tree: shared merge tree doesn't loose data parts SET max_rows_to_read = 0; -- system.text_log can be really big diff --git a/tests/queries/0_stateless/01166_truncate_multiple_partitions.sql b/tests/queries/0_stateless/01166_truncate_multiple_partitions.sql index 1a7e3ed3bc4..8f5d3ccc1fe 100644 --- a/tests/queries/0_stateless/01166_truncate_multiple_partitions.sql +++ b/tests/queries/0_stateless/01166_truncate_multiple_partitions.sql @@ -1,3 +1,6 @@ +-- Tags: no-shared-catalog +-- no-shared-catalog: standard MergeTree is not supported + drop table if exists trunc; set default_table_engine='ReplicatedMergeTree'; diff --git a/tests/queries/0_stateless/01192_rename_database_zookeeper.sh b/tests/queries/0_stateless/01192_rename_database_zookeeper.sh index 1ac01fe6abc..e48fc428265 100755 --- a/tests/queries/0_stateless/01192_rename_database_zookeeper.sh +++ b/tests/queries/0_stateless/01192_rename_database_zookeeper.sh @@ -1,5 +1,9 @@ #!/usr/bin/env bash -# Tags: zookeeper, no-parallel, no-fasttest +# Tags: zookeeper, no-parallel, no-fasttest, no-shared-merge-tree +# no-shared-merge-tree: database ordinary not supported + +# Creation of a database with Ordinary engine emits a warning. +CLICKHOUSE_CLIENT_SERVER_LOGS_LEVEL=fatal # Creation of a database with Ordinary engine emits a warning. CLICKHOUSE_CLIENT_SERVER_LOGS_LEVEL=fatal diff --git a/tests/queries/0_stateless/01221_system_settings.reference b/tests/queries/0_stateless/01221_system_settings.reference index 32a0ed11b6c..821d2e386a9 100644 --- a/tests/queries/0_stateless/01221_system_settings.reference +++ b/tests/queries/0_stateless/01221_system_settings.reference @@ -1,4 +1,4 @@ -send_timeout 300 0 Timeout for sending data to the network, in seconds. If a client needs to send some data but is not able to send any bytes in this interval, the exception is thrown. If you set this setting on the client, the \'receive_timeout\' for the socket will also be set on the corresponding connection end on the server. \N \N 0 Seconds 300 0 -storage_policy default 0 Name of storage disk policy \N \N 0 String 0 +send_timeout 300 0 Timeout for sending data to the network, in seconds. If a client needs to send some data but is not able to send any bytes in this interval, the exception is thrown. 
If you set this setting on the client, the \'receive_timeout\' for the socket will also be set on the corresponding connection end on the server. \N \N 0 Seconds 300 0 Production +storage_policy default 0 Name of storage disk policy \N \N 0 String 0 Production 1 1 diff --git a/tests/queries/0_stateless/01271_show_privileges.reference b/tests/queries/0_stateless/01271_show_privileges.reference index 0e839ac6fc1..de6df7ac021 100644 --- a/tests/queries/0_stateless/01271_show_privileges.reference +++ b/tests/queries/0_stateless/01271_show_privileges.reference @@ -59,6 +59,8 @@ CREATE DICTIONARY [] DICTIONARY CREATE CREATE TEMPORARY TABLE [] GLOBAL CREATE ARBITRARY TEMPORARY TABLE CREATE ARBITRARY TEMPORARY TABLE [] GLOBAL CREATE CREATE FUNCTION [] GLOBAL CREATE +CREATE WORKLOAD [] GLOBAL CREATE +CREATE RESOURCE [] GLOBAL CREATE CREATE NAMED COLLECTION [] NAMED_COLLECTION NAMED COLLECTION ADMIN CREATE [] \N ALL DROP DATABASE [] DATABASE DROP @@ -66,6 +68,8 @@ DROP TABLE [] TABLE DROP DROP VIEW [] VIEW DROP DROP DICTIONARY [] DICTIONARY DROP DROP FUNCTION [] GLOBAL DROP +DROP WORKLOAD [] GLOBAL DROP +DROP RESOURCE [] GLOBAL DROP DROP NAMED COLLECTION [] NAMED_COLLECTION NAMED COLLECTION ADMIN DROP [] \N ALL UNDROP TABLE [] TABLE ALL @@ -108,6 +112,7 @@ TABLE ENGINE ['TABLE ENGINE'] TABLE_ENGINE ALL SYSTEM SHUTDOWN ['SYSTEM KILL','SHUTDOWN'] GLOBAL SYSTEM SYSTEM DROP DNS CACHE ['SYSTEM DROP DNS','DROP DNS CACHE','DROP DNS'] GLOBAL SYSTEM DROP CACHE SYSTEM DROP CONNECTIONS CACHE ['SYSTEM DROP CONNECTIONS CACHE','DROP CONNECTIONS CACHE'] GLOBAL SYSTEM DROP CACHE +SYSTEM PREWARM MARK CACHE ['SYSTEM PREWARM MARK','PREWARM MARK CACHE','PREWARM MARKS'] GLOBAL SYSTEM DROP CACHE SYSTEM DROP MARK CACHE ['SYSTEM DROP MARK','DROP MARK CACHE','DROP MARKS'] GLOBAL SYSTEM DROP CACHE SYSTEM DROP UNCOMPRESSED CACHE ['SYSTEM DROP UNCOMPRESSED','DROP UNCOMPRESSED CACHE','DROP UNCOMPRESSED'] GLOBAL SYSTEM DROP CACHE SYSTEM DROP MMAP CACHE ['SYSTEM DROP MMAP','DROP MMAP CACHE','DROP MMAP'] GLOBAL SYSTEM DROP CACHE @@ -184,6 +189,9 @@ HDFS [] GLOBAL SOURCES S3 [] GLOBAL SOURCES HIVE [] GLOBAL SOURCES AZURE [] GLOBAL SOURCES +KAFKA [] GLOBAL SOURCES +NATS [] GLOBAL SOURCES +RABBITMQ [] GLOBAL SOURCES SOURCES [] \N ALL CLUSTER [] GLOBAL ALL ALL ['ALL PRIVILEGES'] \N \N diff --git a/tests/queries/0_stateless/01301_aggregate_state_exception_memory_leak.reference b/tests/queries/0_stateless/01301_aggregate_state_exception_memory_leak.reference index 6282bf366d0..76c31901df7 100644 --- a/tests/queries/0_stateless/01301_aggregate_state_exception_memory_leak.reference +++ b/tests/queries/0_stateless/01301_aggregate_state_exception_memory_leak.reference @@ -1,2 +1,2 @@ -Memory limit exceeded +Query memory limit exceeded Ok diff --git a/tests/queries/0_stateless/01301_aggregate_state_exception_memory_leak.sh b/tests/queries/0_stateless/01301_aggregate_state_exception_memory_leak.sh index 5b7cba77432..ceb7b60be0f 100755 --- a/tests/queries/0_stateless/01301_aggregate_state_exception_memory_leak.sh +++ b/tests/queries/0_stateless/01301_aggregate_state_exception_memory_leak.sh @@ -16,5 +16,5 @@ for _ in {1..1000}; do if [[ $elapsed -gt 30 ]]; then break fi -done 2>&1 | grep -o -P 'Memory limit .+ exceeded' | sed -r -e 's/(Memory limit)(.+)( exceeded)/\1\3/' | uniq +done 2>&1 | grep -o -P 'Query memory limit exceeded' | sed -r -e 's/(.*):([a-Z ]*)([mM]emory limit exceeded)(.*)/\2\3/' | uniq echo 'Ok' diff --git a/tests/queries/0_stateless/01383_log_broken_table.sh b/tests/queries/0_stateless/01383_log_broken_table.sh index 
997daf1bf2f..d3c5a2e9aad 100755 --- a/tests/queries/0_stateless/01383_log_broken_table.sh +++ b/tests/queries/0_stateless/01383_log_broken_table.sh @@ -24,7 +24,7 @@ function test_func() $CLICKHOUSE_CLIENT --query "INSERT INTO log SELECT number, number, number FROM numbers(1000000)" --max_memory_usage $MAX_MEM > "${CLICKHOUSE_TMP}"/insert_result 2>&1 RES=$? - grep -o -F 'Memory limit' "${CLICKHOUSE_TMP}"/insert_result || cat "${CLICKHOUSE_TMP}"/insert_result + grep -o -F 'emory limit' "${CLICKHOUSE_TMP}"/insert_result || cat "${CLICKHOUSE_TMP}"/insert_result $CLICKHOUSE_CLIENT --query "SELECT count(), sum(x + y + z) FROM log" > "${CLICKHOUSE_TMP}"/select_result 2>&1; @@ -36,9 +36,9 @@ function test_func() $CLICKHOUSE_CLIENT --query "DROP TABLE log"; } -test_func TinyLog | grep -v -P '^(Memory limit|0\t0|[1-9]000000\t)' -test_func StripeLog | grep -v -P '^(Memory limit|0\t0|[1-9]000000\t)' -test_func Log | grep -v -P '^(Memory limit|0\t0|[1-9]000000\t)' +test_func TinyLog | grep -v -P '^(emory limit|0\t0|[1-9]000000\t)' +test_func StripeLog | grep -v -P '^(emory limit|0\t0|[1-9]000000\t)' +test_func Log | grep -v -P '^(emory limit|0\t0|[1-9]000000\t)' rm "${CLICKHOUSE_TMP}/insert_result" rm "${CLICKHOUSE_TMP}/select_result" diff --git a/tests/queries/0_stateless/01513_optimize_aggregation_in_order_memory_long.sql b/tests/queries/0_stateless/01513_optimize_aggregation_in_order_memory_long.sql index d9430018469..5e7cd2f7da7 100644 --- a/tests/queries/0_stateless/01513_optimize_aggregation_in_order_memory_long.sql +++ b/tests/queries/0_stateless/01513_optimize_aggregation_in_order_memory_long.sql @@ -1,5 +1,5 @@ -- Tags: long, no-random-merge-tree-settings --- FIXME no-random-merge-tree-settings requires investigation +--- FIXME no-random-merge-tree-settings requires investigation drop table if exists data_01513; create table data_01513 (key String) engine=MergeTree() order by key; diff --git a/tests/queries/0_stateless/01514_distributed_cancel_query_on_error.sh b/tests/queries/0_stateless/01514_distributed_cancel_query_on_error.sh index edf3683ccba..245aa3ceb99 100755 --- a/tests/queries/0_stateless/01514_distributed_cancel_query_on_error.sh +++ b/tests/queries/0_stateless/01514_distributed_cancel_query_on_error.sh @@ -19,6 +19,6 @@ opts=( ) ${CLICKHOUSE_CLIENT} "${opts[@]}" -q "SELECT groupArray(repeat('a', if(_shard_num == 2, 100000, 1))), number%100000 k from remote('127.{2,3}', system.numbers) GROUP BY k LIMIT 10e6" |& { # the query should fail earlier on 127.3 and 127.2 should not even go to the memory limit exceeded error. - grep -F -q 'DB::Exception: Received from 127.3:9000. DB::Exception: Memory limit (for query) exceeded:' + grep -F -q "DB::Exception: Received from 127.3:${CLICKHOUSE_PORT_TCP}. 
DB::Exception: Query memory limit exceeded:" # while if this will not correctly then it will got the exception from the 127.2:9000 and fail } diff --git a/tests/queries/0_stateless/01650_fetch_patition_with_macro_in_zk_path_long.sql b/tests/queries/0_stateless/01650_fetch_patition_with_macro_in_zk_path_long.sql index 029a17f87dc..9c6bef4b6b4 100644 --- a/tests/queries/0_stateless/01650_fetch_patition_with_macro_in_zk_path_long.sql +++ b/tests/queries/0_stateless/01650_fetch_patition_with_macro_in_zk_path_long.sql @@ -1,4 +1,4 @@ --- Tags: long +-- Tags: long, no-shared-merge-tree DROP TABLE IF EXISTS test_01640; DROP TABLE IF EXISTS restore_01640; @@ -32,5 +32,3 @@ SELECT _part, * FROM restore_01640; DROP TABLE test_01640; DROP TABLE restore_01640; - - diff --git a/tests/queries/0_stateless/01655_plan_optimizations_optimize_read_in_window_order_long.sh b/tests/queries/0_stateless/01655_plan_optimizations_optimize_read_in_window_order_long.sh index e5e57ddb78a..fde0fb8a8de 100755 --- a/tests/queries/0_stateless/01655_plan_optimizations_optimize_read_in_window_order_long.sh +++ b/tests/queries/0_stateless/01655_plan_optimizations_optimize_read_in_window_order_long.sh @@ -1,5 +1,6 @@ #!/usr/bin/env bash -# Tags: long, no-random-merge-tree-settings +# Tags: long, no-random-merge-tree-settings, no-random-settings +# no sanitizers -- bad idea to check memory usage with sanitizers CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/01700_system_zookeeper_path_in.sql b/tests/queries/0_stateless/01700_system_zookeeper_path_in.sql index 3b321d3cea5..0c9f8c3293c 100644 --- a/tests/queries/0_stateless/01700_system_zookeeper_path_in.sql +++ b/tests/queries/0_stateless/01700_system_zookeeper_path_in.sql @@ -1,4 +1,5 @@ --- Tags: zookeeper +-- Tags: zookeeper, no-shared-merge-tree +-- no-shared-merge-tree: depend on replicated merge tree zookeeper structure DROP TABLE IF EXISTS sample_table; diff --git a/tests/queries/0_stateless/01710_projection_pk_trivial_count.reference b/tests/queries/0_stateless/01710_projection_pk_trivial_count.reference index 43316772467..546c26a232b 100644 --- a/tests/queries/0_stateless/01710_projection_pk_trivial_count.reference +++ b/tests/queries/0_stateless/01710_projection_pk_trivial_count.reference @@ -1,3 +1,3 @@ ReadFromMergeTree (default.x) - ReadFromPreparedSource (Optimized trivial count) + ReadFromPreparedSource (_minmax_count_projection) 5 diff --git a/tests/queries/0_stateless/01715_background_checker_blather_zookeeper_long.sql b/tests/queries/0_stateless/01715_background_checker_blather_zookeeper_long.sql index 32481be1bcd..3de17f8c30b 100644 --- a/tests/queries/0_stateless/01715_background_checker_blather_zookeeper_long.sql +++ b/tests/queries/0_stateless/01715_background_checker_blather_zookeeper_long.sql @@ -1,4 +1,5 @@ --- Tags: long, zookeeper +-- Tags: long, zookeeper, no-shared-merge-tree +-- no-shared-merge-tree: no replication queue DROP TABLE IF EXISTS i20203_1 SYNC; DROP TABLE IF EXISTS i20203_2 SYNC; diff --git a/tests/queries/0_stateless/01754_direct_dictionary_complex_key.sql b/tests/queries/0_stateless/01754_direct_dictionary_complex_key.sql index a695302161d..8b34eb87fb2 100644 --- a/tests/queries/0_stateless/01754_direct_dictionary_complex_key.sql +++ b/tests/queries/0_stateless/01754_direct_dictionary_complex_key.sql @@ -89,7 +89,7 @@ SELECT dictGetOrDefault('01754_dictionary_db.direct_dictionary_complex_key_compl SELECT 'dictHas'; SELECT 
dictHas('01754_dictionary_db.direct_dictionary_complex_key_complex_attributes', (number, concat('id_key_', toString(number)))) FROM system.numbers LIMIT 4; SELECT 'select all values as input stream'; -SELECT * FROM 01754_dictionary_db.direct_dictionary_complex_key_complex_attributes; +SELECT * FROM 01754_dictionary_db.direct_dictionary_complex_key_complex_attributes ORDER BY ALL; DROP DICTIONARY 01754_dictionary_db.direct_dictionary_complex_key_complex_attributes; DROP TABLE 01754_dictionary_db.complex_key_complex_attributes_source_table; diff --git a/tests/queries/0_stateless/02117_show_create_table_system.reference b/tests/queries/0_stateless/02117_show_create_table_system.reference index b260e2dce6c..2ea62444cff 100644 --- a/tests/queries/0_stateless/02117_show_create_table_system.reference +++ b/tests/queries/0_stateless/02117_show_create_table_system.reference @@ -342,7 +342,8 @@ CREATE TABLE system.merge_tree_settings `max` Nullable(String), `readonly` UInt8, `type` String, - `is_obsolete` UInt8 + `is_obsolete` UInt8, + `tier` Enum8('Production' = 0, 'Obsolete' = 4, 'Experimental' = 8, 'Beta' = 12) ) ENGINE = SystemMergeTreeSettings COMMENT 'Contains a list of all MergeTree engine specific settings, their current and default values along with descriptions. You may change any of them in SETTINGS section in CREATE query.' @@ -932,7 +933,8 @@ CREATE TABLE system.replicated_merge_tree_settings `max` Nullable(String), `readonly` UInt8, `type` String, - `is_obsolete` UInt8 + `is_obsolete` UInt8, + `tier` Enum8('Production' = 0, 'Obsolete' = 4, 'Experimental' = 8, 'Beta' = 12) ) ENGINE = SystemReplicatedMergeTreeSettings COMMENT 'Contains a list of all ReplicatedMergeTree engine specific settings, their current and default values along with descriptions. You may change any of them in SETTINGS section in CREATE query. ' @@ -1009,7 +1011,8 @@ CREATE TABLE system.settings `type` String, `default` String, `alias_for` String, - `is_obsolete` UInt8 + `is_obsolete` UInt8, + `tier` Enum8('Production' = 0, 'Obsolete' = 4, 'Experimental' = 8, 'Beta' = 12) ) ENGINE = SystemSettings COMMENT 'Contains a list of all user-level settings (which can be modified in a scope of query or session), their current and default values along with descriptions.' 
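The `tier` column added above to `system.settings`, `system.merge_tree_settings`, and `system.replicated_merge_tree_settings` tags every setting as Production, Obsolete, Experimental, or Beta. A minimal query sketch (illustration only, not part of the patch) for listing non-production settings with the new column:
SELECT name, value, tier
FROM system.settings
WHERE tier != 'Production'
ORDER BY tier, name;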
diff --git a/tests/queries/0_stateless/02125_many_mutations.sh b/tests/queries/0_stateless/02125_many_mutations.sh index 363253371cc..4dd9c6d9648 100755 --- a/tests/queries/0_stateless/02125_many_mutations.sh +++ b/tests/queries/0_stateless/02125_many_mutations.sh @@ -1,5 +1,6 @@ #!/usr/bin/env bash -# Tags: long, no-tsan, no-debug, no-asan, no-msan, no-ubsan +# Tags: long, no-tsan, no-debug, no-asan, no-msan, no-ubsan, no-shared-merge-tree +# no-shared-merge-tree -- this test is too slow CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/02125_many_mutations_2.sh b/tests/queries/0_stateless/02125_many_mutations_2.sh index d0025e6d8cc..e63bd296ca3 100755 --- a/tests/queries/0_stateless/02125_many_mutations_2.sh +++ b/tests/queries/0_stateless/02125_many_mutations_2.sh @@ -1,5 +1,6 @@ #!/usr/bin/env bash -# Tags: long, no-tsan, no-debug, no-asan, no-msan, no-ubsan, no-parallel +# Tags: long, no-tsan, no-debug, no-asan, no-msan, no-ubsan, no-parallel, no-shared-merge-tree +# no-shared-merge-tree -- this test is too slow CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/02205_ephemeral_1.sql b/tests/queries/0_stateless/02205_ephemeral_1.sql index 7a996ee3a8f..fd1d2f5fa18 100644 --- a/tests/queries/0_stateless/02205_ephemeral_1.sql +++ b/tests/queries/0_stateless/02205_ephemeral_1.sql @@ -1,3 +1,5 @@ +SET mutations_sync=2; + DROP TABLE IF EXISTS t_ephemeral_02205_1; CREATE TABLE t_ephemeral_02205_1 (x UInt32 DEFAULT y, y UInt32 EPHEMERAL 17, z UInt32 DEFAULT 5) ENGINE = Memory; diff --git a/tests/queries/0_stateless/02221_system_zookeeper_unrestricted.sh b/tests/queries/0_stateless/02221_system_zookeeper_unrestricted.sh index e23a272a4e8..deb45e20b7c 100755 --- a/tests/queries/0_stateless/02221_system_zookeeper_unrestricted.sh +++ b/tests/queries/0_stateless/02221_system_zookeeper_unrestricted.sh @@ -1,5 +1,6 @@ #!/usr/bin/env bash -# Tags: no-replicated-database, zookeeper +# Tags: no-replicated-database, zookeeper, no-shared-merge-tree +# no-shared-merge-tree: depend on specific paths created by replicated tables CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/02240_get_type_serialization_streams.reference b/tests/queries/0_stateless/02240_get_type_serialization_streams.reference index 15e9bf87562..eb16198e877 100644 --- a/tests/queries/0_stateless/02240_get_type_serialization_streams.reference +++ b/tests/queries/0_stateless/02240_get_type_serialization_streams.reference @@ -1,7 +1,7 @@ ['{ArraySizes}','{ArrayElements, Regular}'] ['{ArraySizes}','{ArrayElements, TupleElement(keys), Regular}','{ArrayElements, TupleElement(values), Regular}'] ['{TupleElement(1), Regular}','{TupleElement(2), Regular}','{TupleElement(3), Regular}'] -['{DictionaryKeys, Regular}','{DictionaryIndexes}'] +['{DictionaryKeys}','{DictionaryIndexes}'] ['{NullMap}','{NullableElements, Regular}'] ['{ArraySizes}','{ArrayElements, Regular}'] ['{ArraySizes}','{ArrayElements, TupleElement(keys), Regular}','{ArrayElements, TupleElement(values), Regular}'] diff --git a/tests/queries/0_stateless/02253_empty_part_checksums.sh b/tests/queries/0_stateless/02253_empty_part_checksums.sh index 371c0768e3d..66a4434576b 100755 --- a/tests/queries/0_stateless/02253_empty_part_checksums.sh +++ b/tests/queries/0_stateless/02253_empty_part_checksums.sh @@ -1,6 +1,7 @@ #!/usr/bin/env bash -# Tags: 
zookeeper, no-replicated-database +# Tags: zookeeper, no-replicated-database, no-shared-merge-tree # no-replicated-database because it adds extra replicas +# no-shared-merge-tree do something with parts on local fs CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/02254_projection_broken_part.sh b/tests/queries/0_stateless/02254_projection_broken_part.sh index 3521d1d9d16..04a0c4fb0a1 100755 --- a/tests/queries/0_stateless/02254_projection_broken_part.sh +++ b/tests/queries/0_stateless/02254_projection_broken_part.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# Tags: long, zookeeper +# Tags: long, zookeeper, no-shared-merge-tree CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/02255_broken_parts_chain_on_start.sh b/tests/queries/0_stateless/02255_broken_parts_chain_on_start.sh index de260937b9c..888ac73e4ab 100755 --- a/tests/queries/0_stateless/02255_broken_parts_chain_on_start.sh +++ b/tests/queries/0_stateless/02255_broken_parts_chain_on_start.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# Tags: long, zookeeper +# Tags: long, zookeeper, no-shared-merge-tree CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/02354_vector_search_queries.reference b/tests/queries/0_stateless/02354_vector_search_queries.reference index 223a18b57bf..cf80f46f53c 100644 --- a/tests/queries/0_stateless/02354_vector_search_queries.reference +++ b/tests/queries/0_stateless/02354_vector_search_queries.reference @@ -67,7 +67,7 @@ Expression (Projection) Condition: true Parts: 1/1 Granules: 4/4 --- Non-default quantization +-- Test all distance metrics x all quantization 1 [2,3.2] 2.3323807824711897 4 [2.4,5.2] 3.9999999046325727 2 [4.2,3.4] 4.427188573446585 @@ -75,7 +75,7 @@ Expression (Projection) Limit (preliminary LIMIT (without OFFSET)) Sorting (Sorting for ORDER BY) Expression (Before ORDER BY) - ReadFromMergeTree (default.tab_f64) + ReadFromMergeTree (default.tab_l2_f64) Indexes: PrimaryKey Condition: true @@ -93,7 +93,7 @@ Expression (Projection) Limit (preliminary LIMIT (without OFFSET)) Sorting (Sorting for ORDER BY) Expression (Before ORDER BY) - ReadFromMergeTree (default.tab_f32) + ReadFromMergeTree (default.tab_l2_f32) Indexes: PrimaryKey Condition: true @@ -111,7 +111,7 @@ Expression (Projection) Limit (preliminary LIMIT (without OFFSET)) Sorting (Sorting for ORDER BY) Expression (Before ORDER BY) - ReadFromMergeTree (default.tab_f16) + ReadFromMergeTree (default.tab_l2_f16) Indexes: PrimaryKey Condition: true @@ -129,7 +129,7 @@ Expression (Projection) Limit (preliminary LIMIT (without OFFSET)) Sorting (Sorting for ORDER BY) Expression (Before ORDER BY) - ReadFromMergeTree (default.tab_bf16) + ReadFromMergeTree (default.tab_l2_bf16) Indexes: PrimaryKey Condition: true @@ -147,7 +147,97 @@ Expression (Projection) Limit (preliminary LIMIT (without OFFSET)) Sorting (Sorting for ORDER BY) Expression (Before ORDER BY) - ReadFromMergeTree (default.tab_i8) + ReadFromMergeTree (default.tab_l2_i8) + Indexes: + PrimaryKey + Condition: true + Parts: 1/1 + Granules: 4/4 + Skip + Name: idx + Description: vector_similarity GRANULARITY 2 + Parts: 1/1 + Granules: 3/4 +6 [1,9.3] 0.005731362878640178 +4 [2.4,5.2] 0.09204062768384846 +1 [2,3.2] 0.15200169244542905 +Expression (Projection) + Limit (preliminary LIMIT (without OFFSET)) + Sorting (Sorting for ORDER BY) + Expression (Before 
ORDER BY) + ReadFromMergeTree (default.tab_cos_f64) + Indexes: + PrimaryKey + Condition: true + Parts: 1/1 + Granules: 4/4 + Skip + Name: idx + Description: vector_similarity GRANULARITY 2 + Parts: 1/1 + Granules: 3/4 +6 [1,9.3] 0.005731362878640178 +4 [2.4,5.2] 0.09204062768384846 +1 [2,3.2] 0.15200169244542905 +Expression (Projection) + Limit (preliminary LIMIT (without OFFSET)) + Sorting (Sorting for ORDER BY) + Expression (Before ORDER BY) + ReadFromMergeTree (default.tab_cos_f32) + Indexes: + PrimaryKey + Condition: true + Parts: 1/1 + Granules: 4/4 + Skip + Name: idx + Description: vector_similarity GRANULARITY 2 + Parts: 1/1 + Granules: 3/4 +6 [1,9.3] 0.005731362878640178 +4 [2.4,5.2] 0.09204062768384846 +1 [2,3.2] 0.15200169244542905 +Expression (Projection) + Limit (preliminary LIMIT (without OFFSET)) + Sorting (Sorting for ORDER BY) + Expression (Before ORDER BY) + ReadFromMergeTree (default.tab_cos_f16) + Indexes: + PrimaryKey + Condition: true + Parts: 1/1 + Granules: 4/4 + Skip + Name: idx + Description: vector_similarity GRANULARITY 2 + Parts: 1/1 + Granules: 3/4 +6 [1,9.3] 0.005731362878640178 +4 [2.4,5.2] 0.09204062768384846 +1 [2,3.2] 0.15200169244542905 +Expression (Projection) + Limit (preliminary LIMIT (without OFFSET)) + Sorting (Sorting for ORDER BY) + Expression (Before ORDER BY) + ReadFromMergeTree (default.tab_cos_bf16) + Indexes: + PrimaryKey + Condition: true + Parts: 1/1 + Granules: 4/4 + Skip + Name: idx + Description: vector_similarity GRANULARITY 2 + Parts: 1/1 + Granules: 3/4 +6 [1,9.3] 0.005731362878640178 +4 [2.4,5.2] 0.09204062768384846 +1 [2,3.2] 0.15200169244542905 +Expression (Projection) + Limit (preliminary LIMIT (without OFFSET)) + Sorting (Sorting for ORDER BY) + Expression (Before ORDER BY) + ReadFromMergeTree (default.tab_cos_i8) Indexes: PrimaryKey Condition: true diff --git a/tests/queries/0_stateless/02354_vector_search_queries.sql b/tests/queries/0_stateless/02354_vector_search_queries.sql index 71b8a1e520a..0941f9a43d6 100644 --- a/tests/queries/0_stateless/02354_vector_search_queries.sql +++ b/tests/queries/0_stateless/02354_vector_search_queries.sql @@ -81,88 +81,181 @@ SETTINGS max_limit_for_ann_queries = 2; -- LIMIT 3 > 2 --> don't use the ann ind DROP TABLE tab; -SELECT '-- Non-default quantization'; -CREATE TABLE tab_f64(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'f64', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; -CREATE TABLE tab_f32(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'f32', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; -CREATE TABLE tab_f16(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'f16', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; -CREATE TABLE tab_bf16(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'bf16', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; -CREATE TABLE tab_i8(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'i8', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; -INSERT INTO tab_f64 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); -INSERT INTO 
tab_f32 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); -INSERT INTO tab_f16 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); -INSERT INTO tab_bf16 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); -INSERT INTO tab_i8 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +SELECT '-- Test all distance metrics x all quantization'; + +DROP TABLE IF EXISTS tab_l2_f64; +DROP TABLE IF EXISTS tab_l2_f32; +DROP TABLE IF EXISTS tab_l2_f16; +DROP TABLE IF EXISTS tab_l2_bf16; +DROP TABLE IF EXISTS tab_l2_i8; +DROP TABLE IF EXISTS tab_cos_f64; +DROP TABLE IF EXISTS tab_cos_f32; +DROP TABLE IF EXISTS tab_cos_f16; +DROP TABLE IF EXISTS tab_cos_bf16; +DROP TABLE IF EXISTS tab_cos_i8; + +CREATE TABLE tab_l2_f64(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'f64', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_l2_f32(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'f32', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_l2_f16(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'f16', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_l2_bf16(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'bf16', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_l2_i8(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'i8', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_cos_f64(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'cosineDistance', 'f64', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_cos_f32(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'cosineDistance', 'f32', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_cos_f16(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'cosineDistance', 'f16', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_cos_bf16(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'cosineDistance', 'bf16', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; +CREATE TABLE tab_cos_i8(id Int32, vec Array(Float32), INDEX idx vec TYPE vector_similarity('hnsw', 'cosineDistance', 'i8', 0, 0) GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; + +INSERT INTO tab_l2_f64 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 
9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_l2_f32 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_l2_f16 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_l2_bf16 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_l2_i8 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_cos_f64 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_cos_f32 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_cos_f16 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_cos_bf16 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); +INSERT INTO tab_cos_i8 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]); WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_f64 +FROM tab_l2_f64 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; EXPLAIN indexes = 1 WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_f64 +FROM tab_l2_f64 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_f32 +FROM tab_l2_f32 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; EXPLAIN indexes = 1 WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_f32 +FROM tab_l2_f32 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_f16 +FROM tab_l2_f16 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; EXPLAIN indexes = 1 WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_f16 +FROM tab_l2_f16 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_bf16 +FROM tab_l2_bf16 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; EXPLAIN indexes = 1 WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_bf16 +FROM tab_l2_bf16 ORDER 
BY L2Distance(vec, reference_vec) LIMIT 3; WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_i8 +FROM tab_l2_i8 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; EXPLAIN indexes = 1 WITH [0.0, 2.0] AS reference_vec SELECT id, vec, L2Distance(vec, reference_vec) -FROM tab_i8 +FROM tab_l2_i8 ORDER BY L2Distance(vec, reference_vec) LIMIT 3; -DROP TABLE tab_f64; -DROP TABLE tab_f32; -DROP TABLE tab_f16; -DROP TABLE tab_bf16; -DROP TABLE tab_i8; +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_f64 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +EXPLAIN indexes = 1 +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_f64 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_f32 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +EXPLAIN indexes = 1 +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_f32 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_f16 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +EXPLAIN indexes = 1 +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_f16 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_bf16 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +EXPLAIN indexes = 1 +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_bf16 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_i8 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +EXPLAIN indexes = 1 +WITH [0.0, 2.0] AS reference_vec +SELECT id, vec, cosineDistance(vec, reference_vec) +FROM tab_cos_i8 +ORDER BY cosineDistance(vec, reference_vec) +LIMIT 3; + +DROP TABLE tab_l2_f64; +DROP TABLE tab_l2_f32; +DROP TABLE tab_l2_f16; +DROP TABLE tab_l2_bf16; +DROP TABLE tab_l2_i8; +DROP TABLE tab_cos_f64; +DROP TABLE tab_cos_f32; +DROP TABLE tab_cos_f16; +DROP TABLE tab_cos_bf16; +DROP TABLE tab_cos_i8; SELECT '-- Index on Array(Float64) column'; CREATE TABLE tab(id Int32, vec Array(Float64), INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance') GRANULARITY 2) ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 3; diff --git a/tests/queries/0_stateless/02369_lost_part_intersecting_merges.sh b/tests/queries/0_stateless/02369_lost_part_intersecting_merges.sh index 357c089e040..8853d75a86b 100755 --- a/tests/queries/0_stateless/02369_lost_part_intersecting_merges.sh +++ b/tests/queries/0_stateless/02369_lost_part_intersecting_merges.sh @@ -1,5 +1,6 @@ #!/usr/bin/env bash -# Tags: long, zookeeper +# Tags: zookeeper, no-shared-merge-tree, long +# no-shared-merge-tree: depend on local fs CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/02370_lost_part_intersecting_merges.sh b/tests/queries/0_stateless/02370_lost_part_intersecting_merges.sh index e34163d0502..de61f5cc23e 100755 --- a/tests/queries/0_stateless/02370_lost_part_intersecting_merges.sh +++ 
b/tests/queries/0_stateless/02370_lost_part_intersecting_merges.sh @@ -1,5 +1,6 @@ #!/usr/bin/env bash -# Tags: long, zookeeper +# Tags: long, zookeeper, no-shared-merge-tree +# no-shared-merge-tree: depend on local fs (remove parts) CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/02421_truncate_isolation_with_mutations.sh b/tests/queries/0_stateless/02421_truncate_isolation_with_mutations.sh index fabc9eab140..da0b132bcbc 100755 --- a/tests/queries/0_stateless/02421_truncate_isolation_with_mutations.sh +++ b/tests/queries/0_stateless/02421_truncate_isolation_with_mutations.sh @@ -11,6 +11,8 @@ CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=./parts.lib . "$CURDIR"/parts.lib +CLICKHOUSE_CLIENT="$CLICKHOUSE_CLIENT --apply_mutations_on_fly 0" + function reset_table() { table=${1:-"tt"} diff --git a/tests/queries/0_stateless/02432_s3_parallel_parts_cleanup.sql b/tests/queries/0_stateless/02432_s3_parallel_parts_cleanup.sql index 0e7a14ddf99..c25f4e13023 100644 --- a/tests/queries/0_stateless/02432_s3_parallel_parts_cleanup.sql +++ b/tests/queries/0_stateless/02432_s3_parallel_parts_cleanup.sql @@ -1,10 +1,13 @@ --- Tags: no-fasttest +-- Tags: no-fasttest, no-shared-merge-tree +-- no-shared-merge-tree: depend on custom storage policy SET send_logs_level = 'fatal'; drop table if exists rmt; drop table if exists rmt2; +set apply_mutations_on_fly = 0; + -- Disable compact parts, because we need hardlinks in mutations. create table rmt (n int, m int, k int) engine=ReplicatedMergeTree('/test/02432/{database}', '1') order by tuple() settings storage_policy = 's3_cache', allow_remote_fs_zero_copy_replication=1, diff --git a/tests/queries/0_stateless/02448_clone_replica_lost_part.sql b/tests/queries/0_stateless/02448_clone_replica_lost_part.sql index ec669ace620..9aea6aeaa94 100644 --- a/tests/queries/0_stateless/02448_clone_replica_lost_part.sql +++ b/tests/queries/0_stateless/02448_clone_replica_lost_part.sql @@ -1,4 +1,5 @@ --- Tags: long +-- Tags: long, no-shared-merge-tree +-- no-shared-merge-tree: depend on replication queue/fetches SET insert_keeper_fault_injection_probability=0; -- disable fault injection; part ids are non-deterministic in case of insert retries diff --git a/tests/queries/0_stateless/02454_create_table_with_custom_disk.sql b/tests/queries/0_stateless/02454_create_table_with_custom_disk.sql index a2d46cf6d1b..73f4e166ea2 100644 --- a/tests/queries/0_stateless/02454_create_table_with_custom_disk.sql +++ b/tests/queries/0_stateless/02454_create_table_with_custom_disk.sql @@ -1,4 +1,5 @@ --- Tags: no-object-storage, no-replicated-database +-- Tags: no-object-storage, no-replicated-database, no-shared-merge-tree +-- no-shared-merge-tree: custom disk DROP TABLE IF EXISTS test; diff --git a/tests/queries/0_stateless/02494_query_cache_system_tables.sql b/tests/queries/0_stateless/02494_query_cache_system_tables.sql index 7c9f01c4e91..12eaec0f8bc 100644 --- a/tests/queries/0_stateless/02494_query_cache_system_tables.sql +++ b/tests/queries/0_stateless/02494_query_cache_system_tables.sql @@ -44,9 +44,16 @@ SELECT * SETTINGS use_query_cache = 1; SELECT * FROM information_schema.tables SETTINGS use_query_cache = 1; -- { serverError QUERY_CACHE_USED_WITH_SYSTEM_TABLE } SELECT * FROM INFORMATION_SCHEMA.TABLES SETTINGS use_query_cache = 1; -- { serverError QUERY_CACHE_USED_WITH_SYSTEM_TABLE } +-- Issue #69010: A system table name appears as a literal. That's okay and must not throw. 
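+-- The literal 'system.one' below is only a String value compared in the WHERE clause, not a table reference, so the query cache is allowed.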
+DROP TABLE IF EXISTS tab; +CREATE TABLE tab (uid Int16, name String) ENGINE = Memory; +SELECT * FROM tab WHERE name = 'system.one' SETTINGS use_query_cache = true; +DROP TABLE tab; + -- System tables can be "hidden" inside e.g. table functions SELECT * FROM clusterAllReplicas('test_shard_localhost', system.one) SETTINGS use_query_cache = 1; -- {serverError QUERY_CACHE_USED_WITH_SYSTEM_TABLE } SELECT * FROM clusterAllReplicas('test_shard_localhost', 'system.one') SETTINGS use_query_cache = 1; -- {serverError QUERY_CACHE_USED_WITH_SYSTEM_TABLE } +-- Note how in the previous query ^^ 'system.one' is also a literal. ClusterAllReplicas gets special handling. -- Criminal edge case that a user creates a table named "system". The query cache must not reject queries against it. DROP TABLE IF EXISTS system; @@ -60,5 +67,4 @@ CREATE TABLE system.system (c UInt64) ENGINE = Memory; SElECT * FROM system.system SETTINGS use_query_cache = 1; -- { serverError QUERY_CACHE_USED_WITH_SYSTEM_TABLE } DROP TABLE system.system; --- Cleanup SYSTEM DROP QUERY CACHE; diff --git a/tests/queries/0_stateless/02555_davengers_rename_chain.sh b/tests/queries/0_stateless/02555_davengers_rename_chain.sh index b770eaba087..eaa455f181d 100755 --- a/tests/queries/0_stateless/02555_davengers_rename_chain.sh +++ b/tests/queries/0_stateless/02555_davengers_rename_chain.sh @@ -1,7 +1,7 @@ #!/usr/bin/env bash -# Tags: replica, no-fasttest +# Tags: replica, no-fasttest, no-shared-merge-tree # no-fasttest: Mutation load can be slow - +# no-shared-merge-tree -- have separate test for it CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh . "$CUR_DIR"/../shell_config.sh diff --git a/tests/queries/0_stateless/02561_sorting_constants_and_distinct_crash.sql b/tests/queries/0_stateless/02561_sorting_constants_and_distinct_crash.sql index 93a47c6736a..93c10dce52c 100644 --- a/tests/queries/0_stateless/02561_sorting_constants_and_distinct_crash.sql +++ b/tests/queries/0_stateless/02561_sorting_constants_and_distinct_crash.sql @@ -16,7 +16,8 @@ select distinct from ( select string_value from test_table -); +) +order by all; select distinct 'constant_1' as constant_value, * diff --git a/tests/queries/0_stateless/02725_memory-for-merges.sql b/tests/queries/0_stateless/02725_memory-for-merges.sql index 8e4d4f5b3e0..a0adddb1aff 100644 --- a/tests/queries/0_stateless/02725_memory-for-merges.sql +++ b/tests/queries/0_stateless/02725_memory-for-merges.sql @@ -1,4 +1,4 @@ --- Tags: no-object-storage, no-random-merge-tree-settings +-- Tags: no-object-storage, no-random-merge-tree-settings, no-fasttest -- We allocate a lot of memory for buffers when reading or writing to S3 DROP TABLE IF EXISTS 02725_memory_for_merges SYNC; diff --git a/tests/queries/0_stateless/02735_system_zookeeper_connection.sql b/tests/queries/0_stateless/02735_system_zookeeper_connection.sql index 48ada633225..2ea40edddf9 100644 --- a/tests/queries/0_stateless/02735_system_zookeeper_connection.sql +++ b/tests/queries/0_stateless/02735_system_zookeeper_connection.sql @@ -1,4 +1,5 @@ --- Tags: no-fasttest, no-replicated-database +-- Tags: no-fasttest, no-replicated-database, no-shared-merge-tree +-- no-shared-merge-tree -- smt doesn't support aux zookeepers DROP TABLE IF EXISTS test_zk_connection_table; diff --git a/tests/queries/0_stateless/02882_replicated_fetch_checksums_doesnt_match.sql b/tests/queries/0_stateless/02882_replicated_fetch_checksums_doesnt_match.sql index a745625f17a..45027e0454f 100644 --- 
a/tests/queries/0_stateless/02882_replicated_fetch_checksums_doesnt_match.sql +++ b/tests/queries/0_stateless/02882_replicated_fetch_checksums_doesnt_match.sql @@ -1,3 +1,5 @@ +-- Tags: no-shared-merge-tree + DROP TABLE IF EXISTS checksums_r3; DROP TABLE IF EXISTS checksums_r2; DROP TABLE IF EXISTS checksums_r1; diff --git a/tests/queries/0_stateless/02910_replicated_merge_parameters_must_consistent.sql b/tests/queries/0_stateless/02910_replicated_merge_parameters_must_consistent.sql index 0f452105e6d..ec19e54e9b6 100644 --- a/tests/queries/0_stateless/02910_replicated_merge_parameters_must_consistent.sql +++ b/tests/queries/0_stateless/02910_replicated_merge_parameters_must_consistent.sql @@ -1,4 +1,5 @@ --- Tags: zookeeper, no-replicated-database +-- Tags: zookeeper, no-replicated-database, no-shared-merge-tree + CREATE TABLE t ( `id` UInt64, diff --git a/tests/queries/0_stateless/02916_replication_protocol_wait_for_part.sql b/tests/queries/0_stateless/02916_replication_protocol_wait_for_part.sql index 010e29a34e8..63c7120a61a 100644 --- a/tests/queries/0_stateless/02916_replication_protocol_wait_for_part.sql +++ b/tests/queries/0_stateless/02916_replication_protocol_wait_for_part.sql @@ -1,4 +1,4 @@ --- Tags: no-replicated-database, no-fasttest +-- Tags: no-replicated-database, no-fasttest, no-shared-merge-tree -- Tag no-replicated-database: different number of replicas create table tableIn (n int) diff --git a/tests/queries/0_stateless/02919_insert_meet_eternal_hardware_error.sql b/tests/queries/0_stateless/02919_insert_meet_eternal_hardware_error.sql index 05602b42c6a..b04b22ac9cd 100644 --- a/tests/queries/0_stateless/02919_insert_meet_eternal_hardware_error.sql +++ b/tests/queries/0_stateless/02919_insert_meet_eternal_hardware_error.sql @@ -1,4 +1,5 @@ --- Tags: zookeeper, no-parallel +-- Tags: zookeeper, no-parallel, no-shared-merge-tree +-- no-shared-merge-tree: This failure injection is only RMT specific DROP TABLE IF EXISTS t_hardware_error NO DELAY; diff --git a/tests/queries/0_stateless/02922_deduplication_with_zero_copy.sh b/tests/queries/0_stateless/02922_deduplication_with_zero_copy.sh index 2eccade5c81..67a1e6f5e8d 100755 --- a/tests/queries/0_stateless/02922_deduplication_with_zero_copy.sh +++ b/tests/queries/0_stateless/02922_deduplication_with_zero_copy.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# Tags: long, no-replicated-database, no-fasttest +# Tags: long, no-replicated-database, no-fasttest, no-shared-merge-tree set -e diff --git a/tests/queries/0_stateless/02932_refreshable_materialized_views_1.reference b/tests/queries/0_stateless/02932_refreshable_materialized_views_1.reference index b21356db24e..3ec0d3b9ee2 100644 --- a/tests/queries/0_stateless/02932_refreshable_materialized_views_1.reference +++ b/tests/queries/0_stateless/02932_refreshable_materialized_views_1.reference @@ -2,7 +2,6 @@ CREATE MATERIALIZED VIEW default.a\nREFRESH EVERY 2 SECOND\n(\n `x` UInt64\n)\nENGINE = Memory\nAS SELECT number AS x\nFROM numbers(2)\nUNION ALL\nSELECT rand64() AS x <2: refreshed> 3 1 1 <3: time difference at least> 1000 -<4: next refresh in> 2 Scheduled <4.1: fake clock> Scheduled 2050-01-01 00:00:01 2050-01-01 00:00:02 1 3 3 3 0 <4.5: altered> Scheduled 2050-01-01 00:00:01 2052-01-01 00:00:00 CREATE MATERIALIZED VIEW default.a\nREFRESH EVERY 2 YEAR\n(\n `x` UInt64\n)\nENGINE = Memory\nAS SELECT x * 2 AS x\nFROM default.src diff --git a/tests/queries/0_stateless/02932_refreshable_materialized_views_1.sh b/tests/queries/0_stateless/02932_refreshable_materialized_views_1.sh 
index 739617a2986..e28d88310c6 100755 --- a/tests/queries/0_stateless/02932_refreshable_materialized_views_1.sh +++ b/tests/queries/0_stateless/02932_refreshable_materialized_views_1.sh @@ -50,14 +50,6 @@ done # to make sure the clock+timer code works at all. If it turns out flaky, increase refresh period above. $CLICKHOUSE_CLIENT -q " select '<3: time difference at least>', min2(reinterpret(now64(), 'Int64') - $start_time, 1000);" -while : -do - # Wait for status to change to Scheduled. If status = Scheduling, next_refresh_time is stale. - res="`$CLICKHOUSE_CLIENT -q "select '<4: next refresh in>', next_refresh_time-last_success_time, status from refreshes -- $LINENO"`" - echo "$res" | grep -q 'Scheduled' && break - sleep 0.5 -done -echo "$res" # Create a source table from which views will read. $CLICKHOUSE_CLIENT -q " diff --git a/tests/queries/0_stateless/02943_rmt_alter_metadata_merge_checksum_mismatch.sh b/tests/queries/0_stateless/02943_rmt_alter_metadata_merge_checksum_mismatch.sh index 44af2dbf26f..2a5f957b97b 100755 --- a/tests/queries/0_stateless/02943_rmt_alter_metadata_merge_checksum_mismatch.sh +++ b/tests/queries/0_stateless/02943_rmt_alter_metadata_merge_checksum_mismatch.sh @@ -1,6 +1,7 @@ #!/usr/bin/env bash -# Tags: no-parallel +# Tags: no-parallel, no-shared-merge-tree # Tag no-parallel: failpoint is in use +# Tag no-shared-merge-tree: looks like it tests a specific behaviour of ReplicatedMergeTree with failpoints CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) # shellcheck source=../shell_config.sh diff --git a/tests/queries/0_stateless/03015_optimize_final_rmt.sh b/tests/queries/0_stateless/03015_optimize_final_rmt.sh index e8bd7466503..187c8d54842 100755 --- a/tests/queries/0_stateless/03015_optimize_final_rmt.sh +++ b/tests/queries/0_stateless/03015_optimize_final_rmt.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# Tags: long, no-random-settings, no-random-merge-tree-settings, no-tsan, no-msan, no-ubsan, no-asan +# Tags: long, no-random-settings, no-random-merge-tree-settings, no-tsan, no-msan, no-ubsan, no-asan, no-debug # no sanitizers: too slow CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) diff --git a/tests/queries/0_stateless/03096_text_log_format_string_args_not_empty.sql b/tests/queries/0_stateless/03096_text_log_format_string_args_not_empty.sql index a08f35cfc1d..a4eef59f442 100644 --- a/tests/queries/0_stateless/03096_text_log_format_string_args_not_empty.sql +++ b/tests/queries/0_stateless/03096_text_log_format_string_args_not_empty.sql @@ -7,7 +7,7 @@ select conut(); -- { serverError UNKNOWN_FUNCTION } system flush logs; SET max_rows_to_read = 0; -- system.text_log can be really big -select count() > 0 from system.text_log where message_format_string = 'Peak memory usage{}: {}.' and value1 is not null and value2 like '% MiB'; +select count() > 0 from system.text_log where message_format_string = '{}{} memory usage: {}.' 
and not empty(value1) and value3 like '% MiB'; select count() > 0 from system.text_log where level = 'Error' and message_format_string = 'Unknown {}{} identifier {} in scope {}{}' and value1 = 'expression' and value3 = '`count`' and value4 = 'SELECT count'; diff --git a/tests/queries/0_stateless/03141_wildcard_grants.sql b/tests/queries/0_stateless/03141_wildcard_grants.sql index 45962d9b929..e71fa531134 100644 --- a/tests/queries/0_stateless/03141_wildcard_grants.sql +++ b/tests/queries/0_stateless/03141_wildcard_grants.sql @@ -19,4 +19,6 @@ REVOKE SELECT ON team*.* FROM user_03141; SHOW GRANTS FOR user_03141; SELECT '---'; +GRANT SELECT(bar) ON foo.test* TO user_03141; -- { clientError SYNTAX_ERROR } + DROP USER user_03141; diff --git a/tests/queries/0_stateless/03155_in_nested_subselects.reference b/tests/queries/0_stateless/03155_in_nested_subselects.reference index 5565ed6787f..0463db26710 100644 --- a/tests/queries/0_stateless/03155_in_nested_subselects.reference +++ b/tests/queries/0_stateless/03155_in_nested_subselects.reference @@ -1,4 +1,4 @@ 0 +0 1 -0 1 diff --git a/tests/queries/0_stateless/03155_in_nested_subselects.sql b/tests/queries/0_stateless/03155_in_nested_subselects.sql index faecb73040d..62a25165162 100644 --- a/tests/queries/0_stateless/03155_in_nested_subselects.sql +++ b/tests/queries/0_stateless/03155_in_nested_subselects.sql @@ -16,4 +16,4 @@ using id; INSERT INTO Null SELECT number AS id FROM numbers(2); -select * from Example; -- should return 4 rows +select * from Example order by all; -- should return 4 rows diff --git a/tests/queries/0_stateless/03203_system_query_metric_log.reference b/tests/queries/0_stateless/03203_system_query_metric_log.reference index d761659fce2..940b0c4e178 100644 --- a/tests/queries/0_stateless/03203_system_query_metric_log.reference +++ b/tests/queries/0_stateless/03203_system_query_metric_log.reference @@ -1,12 +1,30 @@ -number_of_metrics_1000_ok timestamp_diff_in_metrics_1000_ok -initial_data_1000_ok -data_1000_ok -number_of_metrics_1234_ok timestamp_diff_in_metrics_1234_ok -initial_data_1234_ok -data_1234_ok -number_of_metrics_123_ok timestamp_diff_in_metrics_123_ok -initial_data_123_ok -data_123_ok +--Interval 1000: check that amount of events is correct +1 +--Interval 1000: check that the delta/diff between the events is correct +1 +--Interval 1000: check that the Query, SelectQuery and InitialQuery values are correct for the first event +1 +--Interval 1000: check that the SleepFunctionCalls, SleepFunctionMilliseconds and ProfileEvent_SleepFunctionElapsedMicroseconds are correct +1 +--Interval 400: check that amount of events is correct +1 +--Interval 400: check that the delta/diff between the events is correct +1 +--Interval 400: check that the Query, SelectQuery and InitialQuery values are correct for the first event +1 +--Interval 400: check that the SleepFunctionCalls, SleepFunctionMilliseconds and ProfileEvent_SleepFunctionElapsedMicroseconds are correct +1 +--Interval 123: check that amount of events is correct +1 +--Interval 123: check that the delta/diff between the events is correct +1 +--Interval 123: check that the Query, SelectQuery and InitialQuery values are correct for the first event +1 +--Interval 123: check that the SleepFunctionCalls, SleepFunctionMilliseconds and ProfileEvent_SleepFunctionElapsedMicroseconds are correct +1 +--Check that a query_metric_log_interval=0 disables the collection 0 +-Check that a query which execution time is less than query_metric_log_interval is never collected 0 +--Check that 
there is a final event when queries finish 3 diff --git a/tests/queries/0_stateless/03203_system_query_metric_log.sh b/tests/queries/0_stateless/03203_system_query_metric_log.sh index 1c189c6ce41..bf94be79d7c 100755 --- a/tests/queries/0_stateless/03203_system_query_metric_log.sh +++ b/tests/queries/0_stateless/03203_system_query_metric_log.sh @@ -7,7 +7,7 @@ CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) readonly query_prefix=$CLICKHOUSE_DATABASE $CLICKHOUSE_CLIENT --query-id="${query_prefix}_1000" -q "SELECT sleep(2.5) FORMAT Null" & -$CLICKHOUSE_CLIENT --query-id="${query_prefix}_1234" -q "SELECT sleep(2.5) SETTINGS query_metric_log_interval=1234 FORMAT Null" & +$CLICKHOUSE_CLIENT --query-id="${query_prefix}_400" -q "SELECT sleep(2.5) SETTINGS query_metric_log_interval=400 FORMAT Null" & $CLICKHOUSE_CLIENT --query-id="${query_prefix}_123" -q "SELECT sleep(2.5) SETTINGS query_metric_log_interval=123 FORMAT Null" & $CLICKHOUSE_CLIENT --query-id="${query_prefix}_0" -q "SELECT sleep(2.5) SETTINGS query_metric_log_interval=0 FORMAT Null" & $CLICKHOUSE_CLIENT --query-id="${query_prefix}_fast" -q "SELECT sleep(0.1) FORMAT Null" & @@ -20,32 +20,42 @@ function check_log() { interval=$1 + # Check that the amount of events collected is correct, leaving a 20% of margin. + $CLICKHOUSE_CLIENT -m -q """ + SELECT '--Interval $interval: check that amount of events is correct'; + SELECT + count() BETWEEN ((ceil(2500 / $interval) - 1) * 0.8) AND ((ceil(2500 / $interval) + 1) * 1.2) + FROM system.query_metric_log + WHERE event_date >= yesterday() AND query_id = '${query_prefix}_${interval}' + """ + # We calculate the diff of each row with its previous row to check whether the intervals at # which data is collected is right. The first row is always skipped because the diff with the # preceding one (itself) is 0. The last row is also skipped, because it doesn't contain a full # interval. $CLICKHOUSE_CLIENT --max_threads=1 -m -q """ - WITH diff AS ( - SELECT - row_number() OVER () AS row, - count() OVER () as total_rows, - event_time_microseconds, - first_value(event_time_microseconds) OVER (ORDER BY event_time_microseconds ROWS BETWEEN 1 PRECEDING AND 0 FOLLOWING) as prev, - dateDiff('ms', prev, event_time_microseconds) AS diff - FROM system.query_metric_log - WHERE event_date >= yesterday() AND query_id = '${query_prefix}_${interval}' - ORDER BY event_time_microseconds - OFFSET 1 - ) - SELECT if(count() BETWEEN ((ceil(2500 / $interval) - 2) * 0.8) AND ((ceil(2500 / $interval) - 2) * 1.2), 'number_of_metrics_${interval}_ok', 'number_of_metrics_${interval}_error'), - if(avg(diff) BETWEEN $interval * 0.8 AND $interval * 1.2, 'timestamp_diff_in_metrics_${interval}_ok', 'timestamp_diff_in_metrics_${interval}_error') - FROM diff WHERE row < total_rows + SELECT '--Interval $interval: check that the delta/diff between the events is correct'; + WITH diff AS ( + SELECT + row_number() OVER () AS row, + count() OVER () as total_rows, + event_time_microseconds, + first_value(event_time_microseconds) OVER (ORDER BY event_time_microseconds ROWS BETWEEN 1 PRECEDING AND 0 FOLLOWING) as prev, + dateDiff('ms', prev, event_time_microseconds) AS diff + FROM system.query_metric_log + WHERE event_date >= yesterday() AND query_id = '${query_prefix}_${interval}' + ORDER BY event_time_microseconds + OFFSET 1 + ) + SELECT avg(diff) BETWEEN $interval * 0.8 AND $interval * 1.2 + FROM diff WHERE row < total_rows """ # Check that the first event contains information from the beginning of the query. 
# Notice the rest of the events won't contain these because the diff will be 0. $CLICKHOUSE_CLIENT -m -q """ - SELECT if(ProfileEvent_Query = 1 AND ProfileEvent_SelectQuery = 1 AND ProfileEvent_InitialQuery = 1, 'initial_data_${interval}_ok', 'initial_data_${interval}_error') + SELECT '--Interval $interval: check that the Query, SelectQuery and InitialQuery values are correct for the first event'; + SELECT ProfileEvent_Query = 1 AND ProfileEvent_SelectQuery = 1 AND ProfileEvent_InitialQuery = 1 FROM system.query_metric_log WHERE event_date >= yesterday() AND query_id = '${query_prefix}_${interval}' ORDER BY event_time_microseconds @@ -55,27 +65,36 @@ function check_log() # Also check that it contains some data that we know it's going to be there. # Notice the Sleep events can be in any of the rows, not only in the first one. $CLICKHOUSE_CLIENT -m -q """ - SELECT if(sum(ProfileEvent_SleepFunctionCalls) = 1 AND - sum(ProfileEvent_SleepFunctionMicroseconds) = 2500000 AND - sum(ProfileEvent_SleepFunctionElapsedMicroseconds) = 2500000 AND - sum(ProfileEvent_Query) = 1 AND - sum(ProfileEvent_SelectQuery) = 1 AND - sum(ProfileEvent_InitialQuery) = 1, - 'data_${interval}_ok', 'data_${interval}_error') + SELECT '--Interval $interval: check that the SleepFunctionCalls, SleepFunctionMilliseconds and ProfileEvent_SleepFunctionElapsedMicroseconds are correct'; + SELECT sum(ProfileEvent_SleepFunctionCalls) = 1 AND + sum(ProfileEvent_SleepFunctionMicroseconds) = 2500000 AND + sum(ProfileEvent_SleepFunctionElapsedMicroseconds) = 2500000 AND + sum(ProfileEvent_Query) = 1 AND + sum(ProfileEvent_SelectQuery) = 1 AND + sum(ProfileEvent_InitialQuery) = 1 FROM system.query_metric_log WHERE event_date >= yesterday() AND query_id = '${query_prefix}_${interval}' """ } check_log 1000 -check_log 1234 +check_log 400 check_log 123 # query_metric_log_interval=0 disables the collection altogether -$CLICKHOUSE_CLIENT -m -q """SELECT count() FROM system.query_metric_log WHERE event_date >= yesterday() AND query_id = '${query_prefix}_0'""" +$CLICKHOUSE_CLIENT -m -q """ + SELECT '--Check that a query_metric_log_interval=0 disables the collection'; + SELECT count() FROM system.query_metric_log WHERE event_date >= yesterday() AND query_id = '${query_prefix}_0' +""" # a quick query that takes less than query_metric_log_interval is never collected -$CLICKHOUSE_CLIENT -m -q """SELECT count() FROM system.query_metric_log WHERE event_date >= yesterday() AND query_id = '${query_prefix}_fast'""" +$CLICKHOUSE_CLIENT -m -q """ + SELECT '-Check that a query which execution time is less than query_metric_log_interval is never collected'; + SELECT count() FROM system.query_metric_log WHERE event_date >= yesterday() AND query_id = '${query_prefix}_fast' +""" # a query that takes more than query_metric_log_interval is collected including the final row -$CLICKHOUSE_CLIENT -m -q """SELECT count() FROM system.query_metric_log WHERE event_date >= yesterday() AND query_id = '${query_prefix}_1000'""" +$CLICKHOUSE_CLIENT -m -q """ + SELECT '--Check that there is a final event when queries finish'; + SELECT count() FROM system.query_metric_log WHERE event_date >= yesterday() AND query_id = '${query_prefix}_1000' +""" diff --git a/tests/queries/0_stateless/03208_datetime_cast_losing_precision.reference b/tests/queries/0_stateless/03208_datetime_cast_losing_precision.reference new file mode 100644 index 00000000000..664ea35f7f6 --- /dev/null +++ b/tests/queries/0_stateless/03208_datetime_cast_losing_precision.reference @@ -0,0 +1,10 @@ +0 +0 +0 +0 
+\N +0 +1 +0 +0 +0 diff --git a/tests/queries/0_stateless/03208_datetime_cast_losing_precision.sql b/tests/queries/0_stateless/03208_datetime_cast_losing_precision.sql new file mode 100644 index 00000000000..2e2c7009c2e --- /dev/null +++ b/tests/queries/0_stateless/03208_datetime_cast_losing_precision.sql @@ -0,0 +1,43 @@ +WITH toDateTime('2024-10-16 18:00:30') as t +SELECT toDateTime64(t, 3) + interval 100 milliseconds IN (SELECT t) settings transform_null_in=0; + +WITH toDateTime('2024-10-16 18:00:30') as t +SELECT toDateTime64(t, 3) + interval 100 milliseconds IN (SELECT t) settings transform_null_in=1; + +WITH toDateTime('1970-01-01 00:00:01') as t +SELECT toDateTime64(t, 3) + interval 100 milliseconds IN (now(), Null) settings transform_null_in=1; + +WITH toDateTime('1970-01-01 00:00:01') as t +SELECT toDateTime64(t, 3) + interval 100 milliseconds IN (now(), Null) settings transform_null_in=0; + +WITH toDateTime('1970-01-01 00:00:01') as t, + arrayJoin([Null, toDateTime64(t, 3) + interval 100 milliseconds]) as x +SELECT x IN (now(), Null) settings transform_null_in=0; + +WITH toDateTime('1970-01-01 00:00:01') as t, + arrayJoin([Null, toDateTime64(t, 3) + interval 100 milliseconds]) as x +SELECT x IN (now(), Null) settings transform_null_in=1; + +WITH toDateTime('2024-10-16 18:00:30') as t +SELECT ( + SELECT + toDateTime64(t, 3) + interval 100 milliseconds, + toDateTime64(t, 3) + interval 101 milliseconds +) +IN ( + SELECT + t, + t +) SETTINGS transform_null_in=0; + +WITH toDateTime('2024-10-16 18:00:30') as t +SELECT ( + SELECT + toDateTime64(t, 3) + interval 100 milliseconds, + toDateTime64(t, 3) + interval 101 milliseconds +) +IN ( + SELECT + t, + t +) SETTINGS transform_null_in=1; diff --git a/tests/queries/0_stateless/03223_system_tables_set_not_ready.sql b/tests/queries/0_stateless/03223_system_tables_set_not_ready.sql index 907fa47143c..2cbc0286f4c 100644 --- a/tests/queries/0_stateless/03223_system_tables_set_not_ready.sql +++ b/tests/queries/0_stateless/03223_system_tables_set_not_ready.sql @@ -1,5 +1,6 @@ --- Tags: no-fasttest +-- Tags: no-fasttest, no-shared-merge-tree -- Tag no-fasttest -- due to EmbeddedRocksDB +-- Tag no-shared-merge-tree -- due to system.replication_queue drop table if exists null; drop table if exists dist; diff --git a/tests/queries/0_stateless/03230_async_insert_native.reference b/tests/queries/0_stateless/03230_async_insert_native.reference new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/queries/0_stateless/03230_async_insert_native.sh b/tests/queries/0_stateless/03230_async_insert_native.sh new file mode 100755 index 00000000000..5ac3e40fa31 --- /dev/null +++ b/tests/queries/0_stateless/03230_async_insert_native.sh @@ -0,0 +1,23 @@ +#!/usr/bin/env bash + +CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +# shellcheck source=../shell_config.sh +. "$CUR_DIR"/../shell_config.sh + +${CLICKHOUSE_CLIENT} -q " + DROP TABLE IF EXISTS async_inserts_native; + CREATE TABLE async_inserts_native (m Map(UInt64, UInt64), v UInt64 MATERIALIZED m[4]) ENGINE = Memory; +" + +url="${CLICKHOUSE_URL}&async_insert=1&async_insert_busy_timeout_max_ms=1000&async_insert_busy_timeout_min_ms=1000&wait_for_async_insert=1" + +# This test runs inserts with memory_tracker_fault_probability > 0 to trigger memory limit during insertion. +# If rollback of columns is wrong in that case, it may produce a LOGICAL_ERROR and it will be caught by termination of the server in debug mode. 
+for _ in {1..10}; do + ${CLICKHOUSE_CLIENT} -q "SELECT (range(number), range(number))::Map(UInt64, UInt64) AS m FROM numbers(1000) FORMAT Native" | \ + ${CLICKHOUSE_CURL} -sS -X POST "${url}&max_block_size=100&memory_tracker_fault_probability=0.01&query=INSERT+INTO+async_inserts_native+FORMAT+Native" --data-binary @- >/dev/null 2>&1 & +done + +wait + +${CLICKHOUSE_CLIENT} -q "DROP TABLE async_inserts_native;" diff --git a/tests/queries/0_stateless/03231_bson_tuple_array_map.reference b/tests/queries/0_stateless/03231_bson_tuple_array_map.reference new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/queries/0_stateless/03231_bson_tuple_array_map.sh b/tests/queries/0_stateless/03231_bson_tuple_array_map.sh new file mode 100755 index 00000000000..600b15fb70a --- /dev/null +++ b/tests/queries/0_stateless/03231_bson_tuple_array_map.sh @@ -0,0 +1,18 @@ +#!/usr/bin/env bash + +CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +# shellcheck source=../shell_config.sh +. "$CURDIR"/../shell_config.sh + +DATA_FILE=$CLICKHOUSE_TEST_UNIQUE_NAME.data + +$CLICKHOUSE_LOCAL -q "select tuple(1, x'00000000000000000000FFFF0000000000') as x format BSONEachRow" > $DATA_FILE +$CLICKHOUSE_LOCAL -q "select * from file('$DATA_FILE', BSONEachRow, 'x Tuple(UInt32, IPv6)') settings input_format_allow_errors_num=1" + +$CLICKHOUSE_LOCAL -q "select [x'00000000000000000000FFFF00000000', x'00000000000000000000FFFF0000000000'] as x format BSONEachRow" > $DATA_FILE +$CLICKHOUSE_LOCAL -q "select * from file('$DATA_FILE', BSONEachRow, 'x Array(IPv6)') settings input_format_allow_errors_num=1" + +$CLICKHOUSE_LOCAL -q "select map('key1', x'00000000000000000000FFFF00000000', 'key2', x'00000000000000000000FFFF0000000000') as x format BSONEachRow" > $DATA_FILE +$CLICKHOUSE_LOCAL -q "select * from file('$DATA_FILE', BSONEachRow, 'x Map(String, IPv6)') settings input_format_allow_errors_num=1" + +rm $DATA_FILE diff --git a/tests/queries/0_stateless/03232_resource_create_and_drop.reference b/tests/queries/0_stateless/03232_resource_create_and_drop.reference new file mode 100644 index 00000000000..2a1045d314c --- /dev/null +++ b/tests/queries/0_stateless/03232_resource_create_and_drop.reference @@ -0,0 +1,5 @@ +03232_resource_1 ['03232_disk_1'] ['03232_disk_1'] CREATE RESOURCE `03232_resource_1` (WRITE DISK `03232_disk_1`, READ DISK `03232_disk_1`) +03232_resource_1 ['03232_disk_1'] ['03232_disk_1'] CREATE RESOURCE `03232_resource_1` (WRITE DISK `03232_disk_1`, READ DISK `03232_disk_1`) +03232_resource_2 ['03232_disk_2'] [] CREATE RESOURCE `03232_resource_2` (READ DISK `03232_disk_2`) +03232_resource_3 [] ['03232_disk_2'] CREATE RESOURCE `03232_resource_3` (WRITE DISK `03232_disk_2`) +03232_resource_1 ['03232_disk_1'] ['03232_disk_1'] CREATE RESOURCE `03232_resource_1` (WRITE DISK `03232_disk_1`, READ DISK `03232_disk_1`) diff --git a/tests/queries/0_stateless/03232_resource_create_and_drop.sql b/tests/queries/0_stateless/03232_resource_create_and_drop.sql new file mode 100644 index 00000000000..ceebd557a51 --- /dev/null +++ b/tests/queries/0_stateless/03232_resource_create_and_drop.sql @@ -0,0 +1,11 @@ +-- Tags: no-parallel +-- Do not run this test in parallel because creating the same resource twice will fail +CREATE OR REPLACE RESOURCE 03232_resource_1 (WRITE DISK 03232_disk_1, READ DISK 03232_disk_1); +SELECT name, read_disks, write_disks, create_query FROM system.resources WHERE name ILIKE '03232_%' ORDER BY name; +CREATE RESOURCE IF NOT EXISTS 03232_resource_2 (READ DISK 03232_disk_2); +CREATE RESOURCE 
03232_resource_3 (WRITE DISK 03232_disk_2); +SELECT name, read_disks, write_disks, create_query FROM system.resources WHERE name ILIKE '03232_%' ORDER BY name; +DROP RESOURCE IF EXISTS 03232_resource_2; +DROP RESOURCE 03232_resource_3; +SELECT name, read_disks, write_disks, create_query FROM system.resources WHERE name ILIKE '03232_%' ORDER BY name; +DROP RESOURCE 03232_resource_1; diff --git a/tests/queries/0_stateless/03232_workload_create_and_drop.reference b/tests/queries/0_stateless/03232_workload_create_and_drop.reference new file mode 100644 index 00000000000..923e8652a35 --- /dev/null +++ b/tests/queries/0_stateless/03232_workload_create_and_drop.reference @@ -0,0 +1,5 @@ +all CREATE WORKLOAD `all` +all CREATE WORKLOAD `all` +development all CREATE WORKLOAD development IN `all` +production all CREATE WORKLOAD production IN `all` +all CREATE WORKLOAD `all` diff --git a/tests/queries/0_stateless/03232_workload_create_and_drop.sql b/tests/queries/0_stateless/03232_workload_create_and_drop.sql new file mode 100644 index 00000000000..1d8f97baf4c --- /dev/null +++ b/tests/queries/0_stateless/03232_workload_create_and_drop.sql @@ -0,0 +1,11 @@ +-- Tags: no-parallel +-- Do not run this test in parallel because `all` workload might affect other queries execution process +CREATE OR REPLACE WORKLOAD all; +SELECT name, parent, create_query FROM system.workloads ORDER BY name; +CREATE WORKLOAD IF NOT EXISTS production IN all; +CREATE WORKLOAD development IN all; +SELECT name, parent, create_query FROM system.workloads ORDER BY name; +DROP WORKLOAD IF EXISTS production; +DROP WORKLOAD development; +SELECT name, parent, create_query FROM system.workloads ORDER BY name; +DROP WORKLOAD all; diff --git a/tests/queries/0_stateless/03232_workloads_and_resources.reference b/tests/queries/0_stateless/03232_workloads_and_resources.reference new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/queries/0_stateless/03232_workloads_and_resources.sql b/tests/queries/0_stateless/03232_workloads_and_resources.sql new file mode 100644 index 00000000000..a3e46166396 --- /dev/null +++ b/tests/queries/0_stateless/03232_workloads_and_resources.sql @@ -0,0 +1,68 @@ +-- Tags: no-parallel +-- Do not run this test in parallel because `all` workload might affect other queries execution process + +-- Test simple resource and workload hierarchy creation +create resource 03232_write (write disk 03232_fake_disk); +create resource 03232_read (read disk 03232_fake_disk); +create workload all settings max_requests = 100 for 03232_write, max_requests = 200 for 03232_read; +create workload admin in all settings priority = 0; +create workload production in all settings priority = 1, weight = 9; +create workload development in all settings priority = 1, weight = 1; + +-- Test that illegal actions are not allowed +create workload another_root; -- {serverError BAD_ARGUMENTS} +create workload self_ref in self_ref; -- {serverError BAD_ARGUMENTS} +drop workload all; -- {serverError BAD_ARGUMENTS} +create workload invalid in 03232_write; -- {serverError BAD_ARGUMENTS} +create workload invalid in all settings priority = 0 for all; -- {serverError BAD_ARGUMENTS} +create workload invalid in all settings priority = 'invalid_value'; -- {serverError BAD_GET} +create workload invalid in all settings weight = 0; -- {serverError INVALID_SCHEDULER_NODE} +create workload invalid in all settings weight = -1; -- {serverError BAD_ARGUMENTS} +create workload invalid in all settings max_speed = -1; -- {serverError BAD_ARGUMENTS} 
+create workload invalid in all settings max_cost = -1; -- {serverError BAD_ARGUMENTS} +create workload invalid in all settings max_requests = -1; -- {serverError BAD_ARGUMENTS} +create workload invalid in all settings max_requests = 1.5; -- {serverError BAD_GET} +create or replace workload all in production; -- {serverError BAD_ARGUMENTS} + +-- Test CREATE OR REPLACE WORKLOAD +create or replace workload all settings max_requests = 200 for 03232_write, max_requests = 100 for 03232_read; +create or replace workload admin in all settings priority = 1; +create or replace workload admin in all settings priority = 2; +create or replace workload admin in all settings priority = 0; +create or replace workload production in all settings priority = 1, weight = 90; +create or replace workload production in all settings priority = 0, weight = 9; +create or replace workload production in all settings priority = 2, weight = 9; +create or replace workload development in all settings priority = 1; +create or replace workload development in all settings priority = 0; +create or replace workload development in all settings priority = 2; + +-- Test CREATE OR REPLACE RESOURCE +create or replace resource 03232_write (write disk 03232_fake_disk_2); +create or replace resource 03232_read (read disk 03232_fake_disk_2); + +-- Test update settings with CREATE OR REPLACE WORKLOAD +create or replace workload production in all settings priority = 1, weight = 9, max_requests = 100; +create or replace workload development in all settings priority = 1, weight = 1, max_requests = 10; +create or replace workload production in all settings priority = 1, weight = 9, max_cost = 100000; +create or replace workload development in all settings priority = 1, weight = 1, max_cost = 10000; +create or replace workload production in all settings priority = 1, weight = 9, max_speed = 1000000; +create or replace workload development in all settings priority = 1, weight = 1, max_speed = 100000; +create or replace workload production in all settings priority = 1, weight = 9, max_speed = 1000000, max_burst = 10000000; +create or replace workload development in all settings priority = 1, weight = 1, max_speed = 100000, max_burst = 1000000; +create or replace workload all settings max_cost = 1000000, max_speed = 100000 for 03232_write, max_speed = 200000 for 03232_read; +create or replace workload all settings max_requests = 100 for 03232_write, max_requests = 200 for 03232_read; +create or replace workload production in all settings priority = 1, weight = 9; +create or replace workload development in all settings priority = 1, weight = 1; + +-- Test change parent with CREATE OR REPLACE WORKLOAD +create or replace workload development in production settings priority = 1, weight = 1; +create or replace workload development in admin settings priority = 1, weight = 1; +create or replace workload development in all settings priority = 1, weight = 1; + +-- Clean up +drop workload if exists production; +drop workload if exists development; +drop workload if exists admin; +drop workload if exists all; +drop resource if exists 03232_write; +drop resource if exists 03232_read; diff --git a/tests/queries/0_stateless/03247_json_extract_lc_nullable.reference b/tests/queries/0_stateless/03247_json_extract_lc_nullable.reference new file mode 100644 index 00000000000..a949a93dfcc --- /dev/null +++ b/tests/queries/0_stateless/03247_json_extract_lc_nullable.reference @@ -0,0 +1 @@ +128 diff --git 
a/tests/queries/0_stateless/03247_json_extract_lc_nullable.sql b/tests/queries/0_stateless/03247_json_extract_lc_nullable.sql new file mode 100644 index 00000000000..bac1e34c1ab --- /dev/null +++ b/tests/queries/0_stateless/03247_json_extract_lc_nullable.sql @@ -0,0 +1,2 @@ +select JSONExtract('{"a" : 128}', 'a', 'LowCardinality(Nullable(Int128))'); + diff --git a/tests/queries/0_stateless/03252_check_number_of_arguments_for_dynamic.reference b/tests/queries/0_stateless/03252_check_number_of_arguments_for_dynamic.reference new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/queries/0_stateless/03252_check_number_of_arguments_for_dynamic.sql b/tests/queries/0_stateless/03252_check_number_of_arguments_for_dynamic.sql new file mode 100644 index 00000000000..79a8617930e --- /dev/null +++ b/tests/queries/0_stateless/03252_check_number_of_arguments_for_dynamic.sql @@ -0,0 +1,18 @@ +set enable_analyzer=1; +set allow_experimental_json_type=1; + +CREATE TABLE t +( + `a` JSON +) +ENGINE = MergeTree() +ORDER BY tuple(); + +insert into t values ('{"a":1}'), ('{"a":2.0}'); + +SELECT 1 +FROM +( + SELECT 1 AS c0 +) AS tx +FULL OUTER JOIN t AS t2 ON equals(t2.a.Float32); -- { serverError NUMBER_OF_ARGUMENTS_DOESNT_MATCH } diff --git a/tests/queries/0_stateless/03252_optimize_functions_to_subcolumns_map.reference b/tests/queries/0_stateless/03252_optimize_functions_to_subcolumns_map.reference deleted file mode 100644 index 3bc835eaeac..00000000000 --- a/tests/queries/0_stateless/03252_optimize_functions_to_subcolumns_map.reference +++ /dev/null @@ -1 +0,0 @@ -['foo'] ['bar'] diff --git a/tests/queries/0_stateless/03252_optimize_functions_to_subcolumns_map.sql b/tests/queries/0_stateless/03252_optimize_functions_to_subcolumns_map.sql deleted file mode 100644 index e0cc932783d..00000000000 --- a/tests/queries/0_stateless/03252_optimize_functions_to_subcolumns_map.sql +++ /dev/null @@ -1,9 +0,0 @@ -drop table if exists x; -create table x -( - kv Map(LowCardinality(String), LowCardinality(String)), - k Array(LowCardinality(String)) alias mapKeys(kv), - v Array(LowCardinality(String)) alias mapValues(kv) -) engine=Memory(); -insert into x values (map('foo', 'bar')); -select k, v from x settings optimize_functions_to_subcolumns=1; diff --git a/tests/queries/0_stateless/03254_parquet_bool_native_reader.reference b/tests/queries/0_stateless/03254_parquet_bool_native_reader.reference new file mode 100644 index 00000000000..0c7e55ad234 --- /dev/null +++ b/tests/queries/0_stateless/03254_parquet_bool_native_reader.reference @@ -0,0 +1,20 @@ +0 false +1 \N +2 false +3 \N +4 false +5 \N +6 false +7 \N +8 true +9 \N +0 false +1 \N +2 false +3 \N +4 false +5 \N +6 false +7 \N +8 true +9 \N diff --git a/tests/queries/0_stateless/03254_parquet_bool_native_reader.sh b/tests/queries/0_stateless/03254_parquet_bool_native_reader.sh new file mode 100755 index 00000000000..c28523b3c54 --- /dev/null +++ b/tests/queries/0_stateless/03254_parquet_bool_native_reader.sh @@ -0,0 +1,21 @@ +#!/usr/bin/env bash +# Tags: no-ubsan, no-fasttest + +CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +# shellcheck source=../shell_config.sh +. 
"$CUR_DIR"/../shell_config.sh + +USER_FILES_PATH=$($CLICKHOUSE_CLIENT_BINARY --query "select _path,_file from file('nonexist.txt', 'CSV', 'val1 char')" 2>&1 | grep Exception | awk '{gsub("/nonexist.txt","",$9); print $9}') + +WORKING_DIR="${USER_FILES_PATH}/${CLICKHOUSE_TEST_UNIQUE_NAME}" + +mkdir -p "${WORKING_DIR}" + +DATA_FILE="${CUR_DIR}/data_parquet/nullbool.parquet" + +DATA_FILE_USER_PATH="${WORKING_DIR}/nullbool.parquet" + +cp ${DATA_FILE} ${DATA_FILE_USER_PATH} + +${CLICKHOUSE_CLIENT} --query="select id, bool from file('${DATA_FILE_USER_PATH}', Parquet) order by id SETTINGS input_format_parquet_use_native_reader=false;" +${CLICKHOUSE_CLIENT} --query="select id, bool from file('${DATA_FILE_USER_PATH}', Parquet) order by id SETTINGS input_format_parquet_use_native_reader=true;" diff --git a/tests/queries/0_stateless/03254_prewarm_mark_cache_columns.reference b/tests/queries/0_stateless/03254_prewarm_mark_cache_columns.reference new file mode 100644 index 00000000000..e3b4928b2f4 --- /dev/null +++ b/tests/queries/0_stateless/03254_prewarm_mark_cache_columns.reference @@ -0,0 +1,6 @@ +1 +1 +1 +4 +4 +4 diff --git a/tests/queries/0_stateless/03254_prewarm_mark_cache_columns.sql b/tests/queries/0_stateless/03254_prewarm_mark_cache_columns.sql new file mode 100644 index 00000000000..4d04cee55d0 --- /dev/null +++ b/tests/queries/0_stateless/03254_prewarm_mark_cache_columns.sql @@ -0,0 +1,30 @@ +-- Tags: no-parallel, no-random-settings, no-random-merge-tree-settings + +DROP TABLE IF EXISTS t_prewarm_columns; + +CREATE TABLE t_prewarm_columns (a UInt64, b UInt64, c UInt64, d UInt64) +ENGINE = MergeTree ORDER BY a +SETTINGS min_bytes_for_wide_part = 0, prewarm_mark_cache = 1, columns_to_prewarm_mark_cache = 'a,c'; + +INSERT INTO t_prewarm_columns VALUES (1, 1, 1, 1); + +SELECT count() FROM t_prewarm_columns WHERE NOT ignore(*); + +SYSTEM DROP MARK CACHE; +DETACH TABLE t_prewarm_columns; +ATTACH TABLE t_prewarm_columns; + +SELECT count() FROM t_prewarm_columns WHERE NOT ignore(*); + +SYSTEM DROP MARK CACHE; +SYSTEM PREWARM MARK CACHE t_prewarm_columns; + +SELECT count() FROM t_prewarm_columns WHERE NOT ignore(*); + +SYSTEM FLUSH LOGS; + +SELECT ProfileEvents['LoadedMarksCount'] FROM system.query_log +WHERE current_database = currentDatabase() AND type = 'QueryFinish' AND query LIKE 'SELECT count() FROM t_prewarm_columns%' +ORDER BY event_time_microseconds; + +DROP TABLE t_prewarm_columns; diff --git a/tests/queries/0_stateless/03254_prewarm_mark_cache_rmt.reference b/tests/queries/0_stateless/03254_prewarm_mark_cache_rmt.reference new file mode 100644 index 00000000000..f1bdbd462be --- /dev/null +++ b/tests/queries/0_stateless/03254_prewarm_mark_cache_rmt.reference @@ -0,0 +1,16 @@ +20000 +20000 +40000 +40000 +40000 +40000 +40000 +40000 +0 +0 +0 +0 +0 +0 +1 +0 diff --git a/tests/queries/0_stateless/03254_prewarm_mark_cache_rmt.sql b/tests/queries/0_stateless/03254_prewarm_mark_cache_rmt.sql new file mode 100644 index 00000000000..97d18185115 --- /dev/null +++ b/tests/queries/0_stateless/03254_prewarm_mark_cache_rmt.sql @@ -0,0 +1,65 @@ +-- Tags: no-parallel, no-shared-merge-tree + +DROP TABLE IF EXISTS t_prewarm_cache_rmt_1; +DROP TABLE IF EXISTS t_prewarm_cache_rmt_2; + +CREATE TABLE t_prewarm_cache_rmt_1 (a UInt64, b UInt64, c UInt64) +ENGINE = ReplicatedMergeTree('/clickhouse/tables/{database}/03254_prewarm_mark_cache_smt/t_prewarm_cache', '1') +ORDER BY a SETTINGS prewarm_mark_cache = 1; + +CREATE TABLE t_prewarm_cache_rmt_2 (a UInt64, b UInt64, c UInt64) +ENGINE = 
ReplicatedMergeTree('/clickhouse/tables/{database}/03254_prewarm_mark_cache_smt/t_prewarm_cache', '2') +ORDER BY a SETTINGS prewarm_mark_cache = 1; + +SYSTEM DROP MARK CACHE; + +SYSTEM STOP FETCHES t_prewarm_cache_rmt_2; + +-- Check that prewarm works on insert. +INSERT INTO t_prewarm_cache_rmt_1 SELECT number, rand(), rand() FROM numbers(20000); +SELECT count() FROM t_prewarm_cache_rmt_1 WHERE NOT ignore(*); + +-- Check that prewarm works on fetch. +SYSTEM DROP MARK CACHE; +SYSTEM START FETCHES t_prewarm_cache_rmt_2; +SYSTEM SYNC REPLICA t_prewarm_cache_rmt_2; +SELECT count() FROM t_prewarm_cache_rmt_2 WHERE NOT ignore(*); + +-- Check that prewarm works on merge. +INSERT INTO t_prewarm_cache_rmt_1 SELECT number, rand(), rand() FROM numbers(20000); +OPTIMIZE TABLE t_prewarm_cache_rmt_1 FINAL; + +SYSTEM SYNC REPLICA t_prewarm_cache_rmt_2; + +SELECT count() FROM t_prewarm_cache_rmt_1 WHERE NOT ignore(*); +SELECT count() FROM t_prewarm_cache_rmt_2 WHERE NOT ignore(*); + +-- Check that prewarm works on restart. +SYSTEM DROP MARK CACHE; + +DETACH TABLE t_prewarm_cache_rmt_1; +DETACH TABLE t_prewarm_cache_rmt_2; + +ATTACH TABLE t_prewarm_cache_rmt_1; +ATTACH TABLE t_prewarm_cache_rmt_2; + +SELECT count() FROM t_prewarm_cache_rmt_1 WHERE NOT ignore(*); +SELECT count() FROM t_prewarm_cache_rmt_2 WHERE NOT ignore(*); + +SYSTEM DROP MARK CACHE; + +SELECT count() FROM t_prewarm_cache_rmt_1 WHERE NOT ignore(*); + +--- Check that system query works. +SYSTEM PREWARM MARK CACHE t_prewarm_cache_rmt_1; + +SELECT count() FROM t_prewarm_cache_rmt_1 WHERE NOT ignore(*); + +SYSTEM FLUSH LOGS; + +SELECT ProfileEvents['LoadedMarksCount'] > 0 FROM system.query_log +WHERE current_database = currentDatabase() AND type = 'QueryFinish' AND query LIKE 'SELECT count() FROM t_prewarm_cache%' +ORDER BY event_time_microseconds; + +DROP TABLE IF EXISTS t_prewarm_cache_rmt_1; +DROP TABLE IF EXISTS t_prewarm_cache_rmt_2; diff --git a/tests/queries/0_stateless/03254_project_lwd_respects_row_exists.reference b/tests/queries/0_stateless/03254_project_lwd_respects_row_exists.reference new file mode 100644 index 00000000000..ecc1f6c0911 --- /dev/null +++ b/tests/queries/0_stateless/03254_project_lwd_respects_row_exists.reference @@ -0,0 +1 @@ +34 1 diff --git a/tests/queries/0_stateless/03254_project_lwd_respects_row_exists.sql b/tests/queries/0_stateless/03254_project_lwd_respects_row_exists.sql new file mode 100644 index 00000000000..794f74ad15f --- /dev/null +++ b/tests/queries/0_stateless/03254_project_lwd_respects_row_exists.sql @@ -0,0 +1,20 @@ +DROP TABLE IF EXISTS users; + +CREATE TABLE users ( + uid Int16, + name String, + age Int16, + projection p1 (select age, count() group by age), +) ENGINE = MergeTree order by uid +SETTINGS lightweight_mutation_projection_mode = 'rebuild'; + +INSERT INTO users VALUES (1231, 'John', 33), (1232, 'Mary', 34); + +DELETE FROM users WHERE uid = 1231; + +SELECT + age, + count() +FROM users +GROUP BY age +SETTINGS optimize_use_projections = 1, force_optimize_projection = 1; diff --git a/tests/queries/0_stateless/03254_session_expire_in_use_in_http_interface.reference b/tests/queries/0_stateless/03254_session_expire_in_use_in_http_interface.reference new file mode 100644 index 00000000000..02a9f40656d --- /dev/null +++ b/tests/queries/0_stateless/03254_session_expire_in_use_in_http_interface.reference @@ -0,0 +1,3 @@ +A session successfully closes when timeout first expires with refcount != 1 +45 +1 diff --git a/tests/queries/0_stateless/03254_session_expire_in_use_in_http_interface.sh 
b/tests/queries/0_stateless/03254_session_expire_in_use_in_http_interface.sh new file mode 100755 index 00000000000..f1782cd645b --- /dev/null +++ b/tests/queries/0_stateless/03254_session_expire_in_use_in_http_interface.sh @@ -0,0 +1,15 @@ +#!/usr/bin/env bash +# Tags: long, no-parallel +# shellcheck disable=SC2015 + +CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +# shellcheck source=../shell_config.sh +. "$CURDIR"/../shell_config.sh + +echo "A session successfully closes when timeout first expires with refcount != 1" +# Here we do not want an infinite loop - because we want this mechanism to be reliable in all cases +# So it's better to give it enough time to complete even in constrained environments +${CLICKHOUSE_CURL} -sS "${CLICKHOUSE_URL}&session_id=${CLICKHOUSE_DATABASE}_10&session_timeout=1" --data-binary "CREATE TEMPORARY TABLE x (n UInt64) AS SELECT number FROM numbers(10)" +${CLICKHOUSE_CURL} -sS "${CLICKHOUSE_URL}&session_id=${CLICKHOUSE_DATABASE}_10&session_timeout=1" --data-binary "SELECT sum(n + sleep(3)) FROM x" # This query ensures timeout expires with refcount > 1 +sleep 15 +${CLICKHOUSE_CURL} -sS "${CLICKHOUSE_URL}&session_id=${CLICKHOUSE_DATABASE}_10&session_check=1" --data-binary "SELECT 1" | grep -c -F 'SESSION_NOT_FOUND' diff --git a/tests/queries/0_stateless/03257_async_insert_native_empty_block.reference b/tests/queries/0_stateless/03257_async_insert_native_empty_block.reference new file mode 100644 index 00000000000..6df2a541bff --- /dev/null +++ b/tests/queries/0_stateless/03257_async_insert_native_empty_block.reference @@ -0,0 +1,9 @@ +1 name1 +2 name2 +3 +4 +5 +Ok Preprocessed 2 +Ok Preprocessed 3 +Ok Preprocessed 0 +Ok Preprocessed 0 diff --git a/tests/queries/0_stateless/03257_async_insert_native_empty_block.sh b/tests/queries/0_stateless/03257_async_insert_native_empty_block.sh new file mode 100755 index 00000000000..43a5472914d --- /dev/null +++ b/tests/queries/0_stateless/03257_async_insert_native_empty_block.sh @@ -0,0 +1,27 @@ +#!/usr/bin/env bash + +CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +# shellcheck source=../shell_config.sh +. 
"$CUR_DIR"/../shell_config.sh + +$CLICKHOUSE_CLIENT --query " + DROP TABLE IF EXISTS json_square_brackets; + CREATE TABLE json_square_brackets (id UInt32, name String) ENGINE = MergeTree ORDER BY tuple() +" + +MY_CLICKHOUSE_CLIENT="$CLICKHOUSE_CLIENT --async_insert 1 --wait_for_async_insert 1" + +echo '[{"id": 1, "name": "name1"}, {"id": 2, "name": "name2"}]' | $MY_CLICKHOUSE_CLIENT -q "INSERT INTO json_square_brackets FORMAT JSONEachRow" + +echo '[{"id": 3}, {"id": 4}, {"id": 5}]' | $MY_CLICKHOUSE_CLIENT -q "INSERT INTO json_square_brackets FORMAT JSONEachRow" + +echo '[]' | $MY_CLICKHOUSE_CLIENT -q "INSERT INTO json_square_brackets FORMAT JSONEachRow" + +echo '' | $MY_CLICKHOUSE_CLIENT -q "INSERT INTO json_square_brackets FORMAT JSONEachRow" + +$CLICKHOUSE_CLIENT --query " + SYSTEM FLUSH LOGS; + SELECT * FROM json_square_brackets ORDER BY id; + SELECT status, data_kind, rows FROM system.asynchronous_insert_log WHERE database = currentDatabase() AND table = 'json_square_brackets' ORDER BY event_time_microseconds; + DROP TABLE json_square_brackets; +" diff --git a/tests/queries/0_stateless/03257_json_escape_file_names.reference b/tests/queries/0_stateless/03257_json_escape_file_names.reference new file mode 100644 index 00000000000..f44e7d62cc1 --- /dev/null +++ b/tests/queries/0_stateless/03257_json_escape_file_names.reference @@ -0,0 +1,3 @@ +{"a-b-c":"43","a-b\\/c-d\\/e":"44","a\\/b\\/c":"42"} +42 43 44 +42 43 44 diff --git a/tests/queries/0_stateless/03257_json_escape_file_names.sql b/tests/queries/0_stateless/03257_json_escape_file_names.sql new file mode 100644 index 00000000000..9cc150170fd --- /dev/null +++ b/tests/queries/0_stateless/03257_json_escape_file_names.sql @@ -0,0 +1,10 @@ +set allow_experimental_json_type = 1; +drop table if exists test; +create table test (json JSON) engine=MergeTree order by tuple() settings min_rows_for_wide_part=0, min_bytes_for_wide_part=0; +insert into test format JSONAsObject {"a/b/c" : 42, "a-b-c" : 43, "a-b/c-d/e" : 44}; + +select * from test; +select json.`a/b/c`, json.`a-b-c`, json.`a-b/c-d/e` from test; +select json.`a/b/c`.:Int64, json.`a-b-c`.:Int64, json.`a-b/c-d/e`.:Int64 from test; +drop table test; + diff --git a/tests/queries/0_stateless/03257_scalar_in_format_table_expression.reference b/tests/queries/0_stateless/03257_scalar_in_format_table_expression.reference new file mode 100644 index 00000000000..5d60960bee9 --- /dev/null +++ b/tests/queries/0_stateless/03257_scalar_in_format_table_expression.reference @@ -0,0 +1,5 @@ +Hello 111 +World 123 +Hello 111 +World 123 +6 6 diff --git a/tests/queries/0_stateless/03257_scalar_in_format_table_expression.sql b/tests/queries/0_stateless/03257_scalar_in_format_table_expression.sql new file mode 100644 index 00000000000..ec89c9874e9 --- /dev/null +++ b/tests/queries/0_stateless/03257_scalar_in_format_table_expression.sql @@ -0,0 +1,84 @@ +SELECT * FROM format( + JSONEachRow, +$$ +{"a": "Hello", "b": 111} +{"a": "World", "b": 123} +$$ + ); + +-- Should be equivalent to the previous one +SELECT * FROM format( + JSONEachRow, + ( + SELECT $$ +{"a": "Hello", "b": 111} +{"a": "World", "b": 123} +$$ + ) + ); + +-- The scalar subquery is incorrect so it should throw the proper error +SELECT * FROM format( + JSONEachRow, + ( + SELECT $$ +{"a": "Hello", "b": 111} +{"a": "World", "b": 123} +$$ + WHERE column_does_not_exists = 4 + ) + ); -- { serverError UNKNOWN_IDENTIFIER } + +-- https://github.com/ClickHouse/ClickHouse/issues/70177 + +-- Resolution of the scalar subquery should work ok (already did, 
adding a test just for safety) +-- Disabled for the old analyzer since it incorrectly passes 's' to format, instead of resolving s and passing that +WITH (SELECT sum(number)::String as s FROM numbers(4)) as s +SELECT *, s +FROM format(TSVRaw, s) +SETTINGS enable_analyzer=1; + +SELECT count() +FROM format(TSVRaw, ( + SELECT where_qualified__fuzz_19 + FROM numbers(10000) +)); -- { serverError UNKNOWN_IDENTIFIER } + +SELECT count() +FROM format(TSVRaw, ( + SELECT where_qualified__fuzz_19 + FROM numbers(10000) + UNION ALL + SELECT where_qualified__fuzz_35 + FROM numbers(10000) +)); -- { serverError UNKNOWN_IDENTIFIER } + +WITH ( + SELECT where_qualified__fuzz_19 + FROM numbers(10000) +) as s SELECT count() +FROM format(TSVRaw, s); -- { serverError UNKNOWN_IDENTIFIER } + +-- https://github.com/ClickHouse/ClickHouse/issues/70675 +SELECT count() +FROM format(TSVRaw, ( + SELECT CAST(arrayStringConcat(groupArray(format(TSVRaw, ( + SELECT CAST(arrayStringConcat(1 GLOBAL IN ( + SELECT 1 + WHERE 1 GLOBAL IN ( + SELECT toUInt128(1) + GROUP BY + GROUPING SETS ((1)) + WITH ROLLUP + ) + GROUP BY 1 + WITH CUBE + ), groupArray('some long string')), 'LowCardinality(String)') + FROM numbers(10000) + )), toLowCardinality('some long string')) RESPECT NULLS, '\n'), 'LowCardinality(String)') + FROM numbers(10000) +)) +FORMAT TSVRaw; -- { serverError UNKNOWN_IDENTIFIER, ILLEGAL_TYPE_OF_ARGUMENT } + +-- Same but for table function numbers +SELECT 1 FROM numbers((SELECT DEFAULT)); -- { serverError UNKNOWN_IDENTIFIER } diff --git a/tests/queries/0_stateless/03257_setting_tiers.reference b/tests/queries/0_stateless/03257_setting_tiers.reference new file mode 100644 index 00000000000..d3d171221e8 --- /dev/null +++ b/tests/queries/0_stateless/03257_setting_tiers.reference @@ -0,0 +1,10 @@ +1 +1 +1 +1 +1 +1 +1 +1 +1 +1 diff --git a/tests/queries/0_stateless/03257_setting_tiers.sql b/tests/queries/0_stateless/03257_setting_tiers.sql new file mode 100644 index 00000000000..c7ffe87a80b --- /dev/null +++ b/tests/queries/0_stateless/03257_setting_tiers.sql @@ -0,0 +1,11 @@ +SELECT count() > 0 FROM system.settings WHERE tier = 'Production'; +SELECT count() > 0 FROM system.settings WHERE tier = 'Beta'; +SELECT count() > 0 FROM system.settings WHERE tier = 'Experimental'; +SELECT count() > 0 FROM system.settings WHERE tier = 'Obsolete'; +SELECT count() == countIf(tier IN ['Production', 'Beta', 'Experimental', 'Obsolete']) FROM system.settings; + +SELECT count() > 0 FROM system.merge_tree_settings WHERE tier = 'Production'; +SELECT count() > 0 FROM system.merge_tree_settings WHERE tier = 'Beta'; +SELECT count() > 0 FROM system.merge_tree_settings WHERE tier = 'Experimental'; +SELECT count() > 0 FROM system.merge_tree_settings WHERE tier = 'Obsolete'; +SELECT count() == countIf(tier IN ['Production', 'Beta', 'Experimental', 'Obsolete']) FROM system.merge_tree_settings; diff --git a/tests/queries/0_stateless/03258_dynamic_in_functions_weak_ptr_exception.reference b/tests/queries/0_stateless/03258_dynamic_in_functions_weak_ptr_exception.reference new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/queries/0_stateless/03258_dynamic_in_functions_weak_ptr_exception.sql b/tests/queries/0_stateless/03258_dynamic_in_functions_weak_ptr_exception.sql new file mode 100644 index 00000000000..f825353c135 --- /dev/null +++ b/tests/queries/0_stateless/03258_dynamic_in_functions_weak_ptr_exception.sql @@ -0,0 +1,6 @@ +SET allow_experimental_dynamic_type = 1; +DROP TABLE IF EXISTS t0; +CREATE TABLE t0 (c0 Tuple(c1 Int,c2 
Dynamic)) ENGINE = Memory(); +SELECT 1 FROM t0 tx JOIN t0 ty ON tx.c0 = ty.c0; +DROP TABLE t0; + diff --git a/tests/queries/0_stateless/03258_nonexistent_db.reference b/tests/queries/0_stateless/03258_nonexistent_db.reference new file mode 100644 index 00000000000..825bae3beaa --- /dev/null +++ b/tests/queries/0_stateless/03258_nonexistent_db.reference @@ -0,0 +1,2 @@ +UNKNOWN_DATABASE +OK diff --git a/tests/queries/0_stateless/03258_nonexistent_db.sh b/tests/queries/0_stateless/03258_nonexistent_db.sh new file mode 100755 index 00000000000..847d692c440 --- /dev/null +++ b/tests/queries/0_stateless/03258_nonexistent_db.sh @@ -0,0 +1,7 @@ +#!/usr/bin/env bash + +CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +# shellcheck source=../shell_config.sh +. "$CURDIR"/../shell_config.sh + +timeout 5 ${CLICKHOUSE_CLIENT_BINARY} --database "nonexistent" 2>&1 | grep -o "UNKNOWN_DATABASE" && echo "OK" || echo "FAIL" diff --git a/tests/queries/0_stateless/03259_native_http_async_insert_settings.reference b/tests/queries/0_stateless/03259_native_http_async_insert_settings.reference new file mode 100644 index 00000000000..573541ac970 --- /dev/null +++ b/tests/queries/0_stateless/03259_native_http_async_insert_settings.reference @@ -0,0 +1 @@ +0 diff --git a/tests/queries/0_stateless/03259_native_http_async_insert_settings.sh b/tests/queries/0_stateless/03259_native_http_async_insert_settings.sh new file mode 100755 index 00000000000..c0934b06cc7 --- /dev/null +++ b/tests/queries/0_stateless/03259_native_http_async_insert_settings.sh @@ -0,0 +1,17 @@ +#!/usr/bin/env bash + +CUR_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +# shellcheck source=../shell_config.sh +. "$CUR_DIR"/../shell_config.sh + + +$CLICKHOUSE_CLIENT -q "drop table if exists test" +$CLICKHOUSE_CLIENT -q "create table test (x UInt32) engine=Memory"; + +url="${CLICKHOUSE_URL}&async_insert=1&wait_for_async_insert=1" + +$CLICKHOUSE_LOCAL -q "select NULL::Nullable(UInt32) as x format Native" | ${CLICKHOUSE_CURL} -sS "$url&query=INSERT%20INTO%20test%20FORMAT%20Native" --data-binary @- + +$CLICKHOUSE_CLIENT -q "select * from test"; +$CLICKHOUSE_CLIENT -q "drop table test" + diff --git a/tests/queries/0_stateless/03260_dynamic_low_cardinality_dict_bug.reference b/tests/queries/0_stateless/03260_dynamic_low_cardinality_dict_bug.reference new file mode 100644 index 00000000000..8ae0f8e9f14 --- /dev/null +++ b/tests/queries/0_stateless/03260_dynamic_low_cardinality_dict_bug.reference @@ -0,0 +1,20 @@ +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 +12345678 diff --git a/tests/queries/0_stateless/03260_dynamic_low_cardinality_dict_bug.sql b/tests/queries/0_stateless/03260_dynamic_low_cardinality_dict_bug.sql new file mode 100644 index 00000000000..c5b981d5965 --- /dev/null +++ b/tests/queries/0_stateless/03260_dynamic_low_cardinality_dict_bug.sql @@ -0,0 +1,12 @@ +set allow_experimental_dynamic_type = 1; +set min_bytes_to_use_direct_io = 0; + +drop table if exists test; +create table test (id UInt64, d Dynamic) engine=MergeTree order by id settings min_rows_for_wide_part=1, min_bytes_for_wide_part=1, index_granularity=1, use_adaptive_write_buffer_for_dynamic_subcolumns=0, max_compress_block_size=8, min_compress_block_size=8, use_compact_variant_discriminators_serialization=0; + +insert into test select number, '12345678'::LowCardinality(String) from numbers(20); + +select d.`LowCardinality(String)` 
from test settings max_threads=1; + +drop table test; + diff --git a/tests/queries/0_stateless/data_parquet/nullbool.parquet b/tests/queries/0_stateless/data_parquet/nullbool.parquet new file mode 100644 index 00000000000..d9b365bbe75 Binary files /dev/null and b/tests/queries/0_stateless/data_parquet/nullbool.parquet differ diff --git a/tests/queries/0_stateless/transactions.lib b/tests/queries/0_stateless/transactions.lib index 12345ac2799..94125004849 100755 --- a/tests/queries/0_stateless/transactions.lib +++ b/tests/queries/0_stateless/transactions.lib @@ -11,7 +11,7 @@ function tx() session="${CLICKHOUSE_TEST_ZOOKEEPER_PREFIX}_tx$tx_num" query_id="${session}_${RANDOM}" url_without_session="https://${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT_HTTPS}/?" - url="${url_without_session}session_id=$session&query_id=$query_id&database=$CLICKHOUSE_DATABASE" + url="${url_without_session}session_id=$session&query_id=$query_id&database=$CLICKHOUSE_DATABASE&apply_mutations_on_fly=0" ${CLICKHOUSE_CURL} -m 90 -sSk "$url" --data "$query" | sed "s/^/tx$tx_num\t/" } @@ -56,7 +56,7 @@ function tx_async() session="${CLICKHOUSE_TEST_ZOOKEEPER_PREFIX}_tx$tx_num" query_id="${session}_${RANDOM}" url_without_session="https://${CLICKHOUSE_HOST}:${CLICKHOUSE_PORT_HTTPS}/?" - url="${url_without_session}session_id=$session&query_id=$query_id&database=$CLICKHOUSE_DATABASE" + url="${url_without_session}session_id=$session&query_id=$query_id&database=$CLICKHOUSE_DATABASE&apply_mutations_on_fly=0" # We cannot be sure that query will actually start execution and appear in system.processes before the next call to tx_wait # Also we cannot use global map in bash to store last query_id for each tx_num, so we use tmp file... diff --git a/tests/queries/1_stateful/00170_s3_cache.reference b/tests/queries/1_stateful/00170_s3_cache.reference index 293fbd7f8cb..79c780b0729 100644 --- a/tests/queries/1_stateful/00170_s3_cache.reference +++ b/tests/queries/1_stateful/00170_s3_cache.reference @@ -3,6 +3,7 @@ SET allow_prefetched_read_pool_for_remote_filesystem=0; SET enable_filesystem_cache_on_write_operations=0; SET max_memory_usage='20G'; +SET read_through_distributed_cache = 1; SYSTEM DROP FILESYSTEM CACHE; SELECT count() FROM test.hits_s3; 8873898 diff --git a/tests/queries/1_stateful/00170_s3_cache.sql b/tests/queries/1_stateful/00170_s3_cache.sql index 23663a1844d..8709d7d14f1 100644 --- a/tests/queries/1_stateful/00170_s3_cache.sql +++ b/tests/queries/1_stateful/00170_s3_cache.sql @@ -5,6 +5,7 @@ SET allow_prefetched_read_pool_for_remote_filesystem=0; SET enable_filesystem_cache_on_write_operations=0; SET max_memory_usage='20G'; +SET read_through_distributed_cache = 1; SYSTEM DROP FILESYSTEM CACHE; SELECT count() FROM test.hits_s3; SELECT count() FROM test.hits_s3 WHERE AdvEngineID != 0; diff --git a/tests/queries/bugs/03254_attach_part_order.reference b/tests/queries/bugs/03254_attach_part_order.reference new file mode 100644 index 00000000000..d19922d01d6 --- /dev/null +++ b/tests/queries/bugs/03254_attach_part_order.reference @@ -0,0 +1,4 @@ +Row 1: +────── +id: 1 +visits: 115 diff --git a/tests/queries/bugs/03254_attach_part_order.sql b/tests/queries/bugs/03254_attach_part_order.sql new file mode 100644 index 00000000000..81439dca030 --- /dev/null +++ b/tests/queries/bugs/03254_attach_part_order.sql @@ -0,0 +1,34 @@ +CREATE TABLE test_table +( + dt DateTime, + id UInt32, + url String, + visits UInt32 +) +ENGINE ReplacingMergeTree +ORDER BY (dt, id) +PARTITION BY toYYYYMM(dt); + +SYSTEM STOP merges test_table; + +INSERT INTO 
test_table VALUES (toDate('2024-10-24'), 1, '/index', 100); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 101); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 102); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 103); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 104); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 105); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 106); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 107); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 108); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 109); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 110); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 111); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 112); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 113); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 114); +INSERT INTO test_table VALUES (toDate('2024-10-24'), 1, '/index', 115); + +ALTER TABLE test_table DETACH PARTITION 202410; +ALTER TABLE test_table ATTACH PARTITION 202410; + +SELECT id, visits FROM test_table FINAL ORDER BY id FORMAT Vertical; \ No newline at end of file diff --git a/utils/check-style/check-doc-aspell b/utils/check-style/check-doc-aspell index b5a3958e6cf..0406b337575 100755 --- a/utils/check-style/check-doc-aspell +++ b/utils/check-style/check-doc-aspell @@ -53,7 +53,7 @@ done if (( STATUS != 0 )); then echo "====== Errors found ======" echo "To exclude some words add them to the dictionary file \"${ASPELL_IGNORE_PATH}/aspell-dict.txt\"" - echo "You can also run ${0} -i to see the errors interactively and fix them or add to the dictionary file" + echo "You can also run '$(realpath --relative-base=${ROOT_PATH} ${0}) -i' to see the errors interactively and fix them or add to the dictionary file" fi exit ${STATUS} diff --git a/utils/list-licenses/list-licenses.sh b/utils/list-licenses/list-licenses.sh index cc730464e8e..c06d61a9c43 100755 --- a/utils/list-licenses/list-licenses.sh +++ b/utils/list-licenses/list-licenses.sh @@ -1,6 +1,7 @@ #!/usr/bin/env bash -if [[ "$OSTYPE" == "darwin"* ]]; then +if [[ "$OSTYPE" == "darwin"* ]] +then # use GNU versions, their presence is ensured in cmake/tools.cmake GREP_CMD=ggrep FIND_CMD=gfind @@ -12,31 +13,35 @@ fi ROOT_PATH="$(git rev-parse --show-toplevel)" LIBS_PATH="${ROOT_PATH}/contrib" -mapfile -t libs < <(echo "${ROOT_PATH}/base/poco"; find "${LIBS_PATH}" -type d -maxdepth 1 ! 
-name '*-cmake' | LC_ALL=C sort) -for LIB in "${libs[@]}"; do +mapfile -t libs < <(echo "${ROOT_PATH}/base/poco"; find "${LIBS_PATH}" -maxdepth 1 -type d -not -name '*-cmake' -not -name 'rust_vendor' | LC_ALL=C sort) +for LIB in "${libs[@]}" +do LIB_NAME=$(basename "$LIB") LIB_LICENSE=$( - LC_ALL=C ${FIND_CMD} "$LIB" -type f -and '(' -iname 'LICENSE*' -or -iname 'COPYING*' -or -iname 'COPYRIGHT*' ')' -and -not '(' -iname '*.html' -or -iname '*.htm' -or -iname '*.rtf' -or -name '*.cpp' -or -name '*.h' -or -iname '*.json' ')' -printf "%d\t%p\n" | + LC_ALL=C ${FIND_CMD} "$LIB" -type f -and '(' -iname 'LICENSE*' -or -iname 'COPYING*' -or -iname 'COPYRIGHT*' -or -iname 'NOTICE' ')' -and -not '(' -iname '*.html' -or -iname '*.htm' -or -iname '*.rtf' -or -name '*.cpp' -or -name '*.h' -or -iname '*.json' ')' -printf "%d\t%p\n" | LC_ALL=C sort | LC_ALL=C awk ' BEGIN { IGNORECASE=1; min_depth = 0 } /LICENSE/ { if (!min_depth || $1 <= min_depth) { min_depth = $1; license = $2 } } /COPY/ { if (!min_depth || $1 <= min_depth) { min_depth = $1; copying = $2 } } - END { if (license) { print license } else { print copying } }') - - if [ -n "$LIB_LICENSE" ]; then + /NOTICE/ { if (!min_depth || $1 <= min_depth) { min_depth = $1; notice = $2 } } + END { if (license) { print license } else if (copying) { print copying } else { print notice } }') + if [ -n "$LIB_LICENSE" ] + then LICENSE_TYPE=$( (${GREP_CMD} -q -F 'Apache' "$LIB_LICENSE" && echo "Apache") || (${GREP_CMD} -q -F 'Boost' "$LIB_LICENSE" && echo "Boost") || - (${GREP_CMD} -q -i -P 'public\s*domain' "$LIB_LICENSE" && + (${GREP_CMD} -q -i -P 'public\s*domain|CC0 1\.0 Universal' "$LIB_LICENSE" && echo "Public Domain") || (${GREP_CMD} -q -F 'BSD' "$LIB_LICENSE" && echo "BSD") || (${GREP_CMD} -q -F 'Lesser General Public License' "$LIB_LICENSE" && echo "LGPL") || + (${GREP_CMD} -q -F 'General Public License' "$LIB_LICENSE" && + echo "GPL") || (${GREP_CMD} -q -i -F 'The origin of this software must not be misrepresented' "$LIB_LICENSE" && ${GREP_CMD} -q -i -F 'Altered source versions must be plainly marked as such' "$LIB_LICENSE" && ${GREP_CMD} -q -i -F 'This notice may not be removed or altered' "$LIB_LICENSE" && @@ -73,8 +78,23 @@ for LIB in "${libs[@]}"; do echo "HPND") || echo "Unknown") + if [ "$LICENSE_TYPE" == "GPL" ] + then + echo "Fatal error: General Public License found in ${LIB_NAME}." >&2 + exit 1 + fi + + if [ "$LICENSE_TYPE" == "Unknown" ] + then + echo "Fatal error: sources with unknown license found in ${LIB_NAME}." >&2 + exit 1 + fi + RELATIVE_PATH=$(echo "$LIB_LICENSE" | sed -r -e 's!^.+/(contrib|base)/!/\1/!') echo -e "$LIB_NAME\t$LICENSE_TYPE\t$RELATIVE_PATH" fi done + +# Special care for Rust +find "${LIBS_PATH}/rust_vendor/" -name 'Cargo.toml' | xargs grep 'license = ' | (grep -v -P 'MIT|Apache|MPL' && echo "Fatal error: unrecognized licenses in the Rust code" >&2 && exit 1 || true)
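Note on the allow-list guard appended to utils/list-licenses/list-licenses.sh above: the inverted grep is easy to misread, so the standalone sketch below restates the same pattern with comments. The rust_vendor path and the MIT/Apache/MPL allow-list are taken from the diff; the surrounding script skeleton is illustrative only and not part of the change.

#!/usr/bin/env bash
# Sketch (not the actual change): fail when any vendored Rust crate declares a
# license outside the allow-list, mirroring the check added in list-licenses.sh.
ROOT_PATH="$(git rev-parse --show-toplevel)"
LIBS_PATH="${ROOT_PATH}/contrib"

find "${LIBS_PATH}/rust_vendor/" -name 'Cargo.toml' \
  | xargs grep 'license = ' \
  | (
      # grep -v prints only the lines that do NOT match the allow-list and then
      # exits 0, which takes the error branch; when every declared license
      # matches, grep -v exits non-zero and `|| true` keeps the pipeline green.
      grep -v -P 'MIT|Apache|MPL' \
        && echo "Fatal error: unrecognized licenses in the Rust code" >&2 \
        && exit 1 \
        || true
    )
# `exit 1` only leaves the subshell, but since this pipeline is the last command
# in the script, its status (1 on a disallowed license, 0 otherwise) becomes the
# script's exit code.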