Merge remote-tracking branch 'origin/master' into fix_docs_create_materialized_view_on_cluster

This commit is contained in:
Dmitry Novik 2024-11-01 14:23:57 +01:00
commit a6e2a09843
273 changed files with 10690 additions and 1192 deletions

1
.gitignore vendored

@@ -159,6 +159,7 @@ website/package-lock.json
/programs/server/store
/programs/server/uuid
/programs/server/coordination
/programs/server/workload
# temporary test files
tests/queries/0_stateless/test_*


@@ -1,4 +1,5 @@
### Table of Contents
**[ClickHouse release v24.10, 2024-10-31](#2410)**<br/>
**[ClickHouse release v24.9, 2024-09-26](#249)**<br/>
**[ClickHouse release v24.8 LTS, 2024-08-20](#248)**<br/>
**[ClickHouse release v24.7, 2024-07-30](#247)**<br/>
@@ -12,6 +13,165 @@
# 2024 Changelog
### <a id="2410"></a> ClickHouse release 24.10, 2024-10-31
#### Backward Incompatible Change
* Allow to write `SETTINGS` before `FORMAT` in a chain of queries with `UNION` when subqueries are inside parentheses. This closes [#39712](https://github.com/ClickHouse/ClickHouse/issues/39712). Change the behavior when a query has the SETTINGS clause specified twice in a sequence. The closest SETTINGS clause will have a preference for the corresponding subquery. In the previous versions, the outermost SETTINGS clause could take a preference over the inner one. [#68614](https://github.com/ClickHouse/ClickHouse/pull/68614) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Reordering of filter conditions from `[PRE]WHERE` clause is now allowed by default. It could be disabled by setting `allow_reorder_prewhere_conditions` to `false`. [#70657](https://github.com/ClickHouse/ClickHouse/pull/70657) ([Nikita Taranov](https://github.com/nickitat)).
* Remove the `idxd-config` library, which has an incompatible license. This also removes the experimental Intel DeflateQPL codec. [#70987](https://github.com/ClickHouse/ClickHouse/pull/70987) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
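The `SETTINGS` precedence rule in the first item above can be pictured as scope resolution. The following is a minimal Python sketch, not ClickHouse code: the function name `effective_settings` is hypothetical, and it assumes that outer settings still apply wherever the inner clause is silent.

```python
def effective_settings(scopes):
    """Merge SETTINGS clauses listed from outermost to innermost.

    Later (inner, closer) clauses override earlier (outer) ones,
    which models the rule that the closest clause takes preference.
    """
    merged = {}
    for clause in scopes:
        merged.update(clause)
    return merged


# Hypothetical example: an outer clause and a clause attached to a subquery.
outer = {"max_threads": 8, "max_block_size": 1000}
inner = {"max_threads": 1}
print(effective_settings([outer, inner]))
```

In this sketch the subquery sees `max_threads = 1` from its own clause and inherits `max_block_size` from the outer one. Under the previous behavior, the outer `max_threads` could have taken preference.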
#### New Feature
* Allow granting access by wildcard prefix, e.g. `GRANT SELECT ON db.table_prefix_* TO user`. [#65311](https://github.com/ClickHouse/ClickHouse/pull/65311) ([pufit](https://github.com/pufit)).
* If you press space bar during query runtime, the client will display a real-time table with detailed metrics. You can enable it globally with the new `--progress-table` option in clickhouse-client; a new `--enable-progress-table-toggle` is associated with the `--progress-table` option, and toggles the rendering of the progress table by pressing the control key (Space). [#63689](https://github.com/ClickHouse/ClickHouse/pull/63689) ([Maria Khristenko](https://github.com/mariaKhr)), [#70423](https://github.com/ClickHouse/ClickHouse/pull/70423) ([Julia Kartseva](https://github.com/jkartseva)).
* Allow to cache read files for object storage table engines and data lakes using hash from ETag + file path as cache key. [#70135](https://github.com/ClickHouse/ClickHouse/pull/70135) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Support creating a table with a query: `CREATE TABLE ... CLONE AS ...`. It clones the source table's schema and then attaches all partitions to the newly created table. This feature is only supported with tables of the `MergeTree` family. Closes [#65015](https://github.com/ClickHouse/ClickHouse/issues/65015). [#69091](https://github.com/ClickHouse/ClickHouse/pull/69091) ([tuanpach](https://github.com/tuanpach)).
* Add a new system table, `system.query_metric_log`, which contains the history of memory and metric values from the `system.events` table for individual queries, periodically flushed to disk. [#66532](https://github.com/ClickHouse/ClickHouse/pull/66532) ([Pablo Marcos](https://github.com/pamarcos)).
* A simple SELECT query can be written with implicit SELECT to enable calculator-style expressions, e.g., `ch "1 + 2"`. This is controlled by a new setting, `implicit_select`. [#68502](https://github.com/ClickHouse/ClickHouse/pull/68502) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Support the `--copy` mode for clickhouse local as a shortcut for format conversion [#68503](https://github.com/ClickHouse/ClickHouse/issues/68503). [#68583](https://github.com/ClickHouse/ClickHouse/pull/68583) ([Denis Hananein](https://github.com/denis-hananein)).
* Add a builtin HTML page for visualizing merges which is available at the `/merges` path. [#70821](https://github.com/ClickHouse/ClickHouse/pull/70821) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add support for `arrayUnion` function. [#68989](https://github.com/ClickHouse/ClickHouse/pull/68989) ([Peter Nguyen](https://github.com/petern48)).
* Allow parametrised SQL aliases. [#50665](https://github.com/ClickHouse/ClickHouse/pull/50665) ([Anton Kozlov](https://github.com/tonickkozlov)).
* A new aggregate function `quantileExactWeightedInterpolated`, which is an interpolated version based on `quantileExactWeighted`. Some people may wonder why we need a new `quantileExactWeightedInterpolated` since we already have `quantileExactInterpolatedWeighted`. The reason is that the new one is more accurate than the old one. This is for Spark compatibility. [#69619](https://github.com/ClickHouse/ClickHouse/pull/69619) ([李扬](https://github.com/taiyang-li)).
* A new function `arrayElementOrNull`. It returns `NULL` if the array index is out of range or a Map key is not found. [#69646](https://github.com/ClickHouse/ClickHouse/pull/69646) ([李扬](https://github.com/taiyang-li)).
* Allows users to specify regular expressions through new `message_regexp` and `message_regexp_negative` fields in the `config.xml` file to filter out logging. The logging is applied to the formatted un-colored text for the most intuitive developer experience. [#69657](https://github.com/ClickHouse/ClickHouse/pull/69657) ([Peter Nguyen](https://github.com/petern48)).
* Added `RIPEMD160` function, which computes the RIPEMD-160 cryptographic hash of a string. Example: `SELECT HEX(RIPEMD160('The quick brown fox jumps over the lazy dog'))` returns `37F332F68DB77BD9D7EDD4969571AD671CF9DD3B`. [#70087](https://github.com/ClickHouse/ClickHouse/pull/70087) ([Dergousov Maxim](https://github.com/m7kss1)).
* Support reading `Iceberg` tables on `HDFS`. [#70268](https://github.com/ClickHouse/ClickHouse/pull/70268) ([flynn](https://github.com/ucasfl)).
* Support for CTE in the form of `WITH ... INSERT`, as previously we only supported `INSERT ... WITH ...`. [#70593](https://github.com/ClickHouse/ClickHouse/pull/70593) ([Shichao Jin](https://github.com/jsc0218)).
* MongoDB integration: support for all MongoDB types, support for WHERE and ORDER BY statements on MongoDB side, restriction for expressions unsupported by MongoDB. Note that the new integration is disabled by default. To use it, set `<use_legacy_mongodb_integration>` to `false` in the server config. [#63279](https://github.com/ClickHouse/ClickHouse/pull/63279) ([Kirill Nikiforov](https://github.com/allmazz)).
* A new function `getSettingOrDefault` was added to return the default value and avoid an exception if a custom setting is not found in the current profile. [#69917](https://github.com/ClickHouse/ClickHouse/pull/69917) ([Shankar](https://github.com/shiyer7474)).
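The `arrayElementOrNull` behavior described above (return `NULL` instead of failing) can be illustrated with a small Python model. This is a sketch under stated assumptions, not the ClickHouse implementation: the helper name `array_element_or_null` is hypothetical, `None` stands in for `NULL`, and the index rules (1-based, negative indexes counting from the end, index 0 treated as out of range) are assumptions made for this example.

```python
def array_element_or_null(container, key):
    """Return the requested element, or None (standing in for NULL) if it is missing."""
    if isinstance(container, dict):
        # Map case: a missing key yields None instead of an error.
        return container.get(key)
    # Array case: 1-based indexing is assumed; negative indexes count from the end.
    n = len(container)
    if 1 <= key <= n:
        return container[key - 1]
    if -n <= key <= -1:
        return container[key]
    # Anything else (including 0 in this sketch) is out of range.
    return None


print(array_element_or_null([10, 20, 30], 2))
print(array_element_or_null([10, 20, 30], 4))
print(array_element_or_null({"a": 1}, "b"))
```

The point of the sketch is the fallback: out-of-range indexes and missing keys produce a null value that the caller can test for, rather than an error.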
#### Experimental feature
* Refreshable materialized views are production ready. [#70550](https://github.com/ClickHouse/ClickHouse/pull/70550) ([Michael Kolupaev](https://github.com/al13n321)). Refreshable materialized views are now supported in Replicated databases. [#60669](https://github.com/ClickHouse/ClickHouse/pull/60669) ([Michael Kolupaev](https://github.com/al13n321)).
* Parallel replicas are moved from experimental to beta. Reworked settings that control the behavior of parallel replicas algorithms. A quick recap: ClickHouse has four different algorithms for parallel reading involving multiple replicas, which is reflected in the setting `parallel_replicas_mode`, the default value for it is `read_tasks`. Additionally, the toggle-switch setting `enable_parallel_replicas` has been added. [#63151](https://github.com/ClickHouse/ClickHouse/pull/63151) ([Alexey Milovidov](https://github.com/alexey-milovidov)), ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)).
* Support for the `Dynamic` type in most functions by executing them on internal types inside `Dynamic`. [#69691](https://github.com/ClickHouse/ClickHouse/pull/69691) ([Pavel Kruglov](https://github.com/Avogar)).
* Allow to read/write the `JSON` type as a binary string in `RowBinary` format under settings `input_format_binary_read_json_as_string/output_format_binary_write_json_as_string`. [#70288](https://github.com/ClickHouse/ClickHouse/pull/70288) ([Pavel Kruglov](https://github.com/Avogar)).
* Allow to serialize/deserialize `JSON` column as single String column in the Native format. For output use setting `output_format_native_write_json_as_string`. For input, use serialization version `1` before the column data. [#70312](https://github.com/ClickHouse/ClickHouse/pull/70312) ([Pavel Kruglov](https://github.com/Avogar)).
* Introduced a special (experimental) mode of a merge selector for MergeTree tables which makes it more aggressive for the partitions that are close to the limit by the number of parts. It is controlled by the `merge_selector_use_blurry_base` MergeTree-level setting. [#70645](https://github.com/ClickHouse/ClickHouse/pull/70645) ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)).
* Implement generic ser/de between Avro's `Union` and ClickHouse's `Variant` types. Resolves [#69713](https://github.com/ClickHouse/ClickHouse/issues/69713). [#69712](https://github.com/ClickHouse/ClickHouse/pull/69712) ([Jiří Kozlovský](https://github.com/jirislav)).
#### Performance Improvement
* Refactor `IDisk` and `IObjectStorage` for better performance. Tables from `plain` and `plain_rewritable` object storages will initialize faster. [#68146](https://github.com/ClickHouse/ClickHouse/pull/68146) ([Alexey Milovidov](https://github.com/alexey-milovidov), [Julia Kartseva](https://github.com/jkartseva)). Do not call the LIST object storage API when determining if a file or directory exists on the plain rewritable disk, as it can be cost-inefficient. [#70852](https://github.com/ClickHouse/ClickHouse/pull/70852) ([Julia Kartseva](https://github.com/jkartseva)). Reduce the number of object storage HEAD API requests in the plain_rewritable disk. [#70915](https://github.com/ClickHouse/ClickHouse/pull/70915) ([Julia Kartseva](https://github.com/jkartseva)).
* Added an ability to parse data directly into sparse columns. [#69828](https://github.com/ClickHouse/ClickHouse/pull/69828) ([Anton Popov](https://github.com/CurtizJ)).
* Improved performance of parsing formats with high number of missed values (e.g. `JSONEachRow`). [#69875](https://github.com/ClickHouse/ClickHouse/pull/69875) ([Anton Popov](https://github.com/CurtizJ)).
* Supports parallel reading of parquet row groups and prefetching of row groups in single-threaded mode. [#69862](https://github.com/ClickHouse/ClickHouse/pull/69862) ([LiuNeng](https://github.com/liuneng1994)).
* Support minmax index for `pointInPolygon`. [#62085](https://github.com/ClickHouse/ClickHouse/pull/62085) ([JackyWoo](https://github.com/JackyWoo)).
* Use bloom filters when reading Parquet files. [#62966](https://github.com/ClickHouse/ClickHouse/pull/62966) ([Arthur Passos](https://github.com/arthurpassos)).
* Lock-free parts rename, so that INSERT does not affect SELECT through the parts lock. Under normal circumstances with `fsync_part_directory`, the QPS of SELECT running in parallel with INSERT increased 2x; under heavy load the effect is even bigger. Note that this only includes `ReplicatedMergeTree` for now. [#64955](https://github.com/ClickHouse/ClickHouse/pull/64955) ([Azat Khuzhin](https://github.com/azat)).
* Respect `ttl_only_drop_parts` on `materialize ttl`; only read necessary columns to recalculate TTL and drop parts by replacing them with an empty one. [#65488](https://github.com/ClickHouse/ClickHouse/pull/65488) ([Andrey Zvonov](https://github.com/zvonand)).
* Optimized thread creation in the ThreadPool to minimize lock contention. Thread creation is now performed outside of the critical section to avoid delays in job scheduling and thread management under high load conditions. This leads to a much more responsive ClickHouse under heavy concurrent load. [#68694](https://github.com/ClickHouse/ClickHouse/pull/68694) ([filimonov](https://github.com/filimonov)).
* Enable reading `LowCardinality` string columns from `ORC`. [#69481](https://github.com/ClickHouse/ClickHouse/pull/69481) ([李扬](https://github.com/taiyang-li)).
* Use `LowCardinality` for `ProfileEvents` in system logs such as `part_log`, `query_views_log`, `filesystem_cache_log`. [#70152](https://github.com/ClickHouse/ClickHouse/pull/70152) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Improve performance of `fromUnixTimestamp`/`toUnixTimestamp` functions. [#71042](https://github.com/ClickHouse/ClickHouse/pull/71042) ([kevinyhzou](https://github.com/KevinyhZou)).
* Don't disable nonblocking read from page cache for the entire server when reading from a blocking I/O. This was leading to a poorer performance when a single filesystem (e.g., tmpfs) didn't support the `preadv2` syscall while others do. [#70299](https://github.com/ClickHouse/ClickHouse/pull/70299) ([Antonio Andelic](https://github.com/antonio2368)).
* `ALTER TABLE .. REPLACE PARTITION` doesn't wait anymore for mutations/merges that happen in other partitions. [#59138](https://github.com/ClickHouse/ClickHouse/pull/59138) ([Vasily Nemkov](https://github.com/Enmk)).
* Don't do validation when synchronizing ACL from Keeper, since it is already validated during creation. It shouldn't matter that much, but there are installations with tens of thousands of users or even more, and the unnecessary hash validation can take a long time to finish during server startup (it synchronizes everything from Keeper). [#70644](https://github.com/ClickHouse/ClickHouse/pull/70644) ([Raúl Marín](https://github.com/Algunenano)).
#### Improvement
* `CREATE TABLE AS` will copy `PRIMARY KEY`, `ORDER BY`, and similar clauses (of `MergeTree` tables). [#69739](https://github.com/ClickHouse/ClickHouse/pull/69739) ([sakulali](https://github.com/sakulali)).
* Support 64-bit XID in Keeper. It can be enabled with the `use_xid_64` configuration value. [#69908](https://github.com/ClickHouse/ClickHouse/pull/69908) ([Antonio Andelic](https://github.com/antonio2368)).
* Command-line arguments for Bool settings are set to true when no value is provided for the argument (e.g. `clickhouse-client --optimize_aggregation_in_order --query "SELECT 1"`). [#70459](https://github.com/ClickHouse/ClickHouse/pull/70459) ([davidtsuk](https://github.com/davidtsuk)).
* Added user-level settings `min_free_disk_bytes_to_throw_insert` and `min_free_disk_ratio_to_throw_insert` to prevent insertions on disks that are almost full. [#69755](https://github.com/ClickHouse/ClickHouse/pull/69755) ([Marco Vilas Boas](https://github.com/marco-vb)).
* Embedded documentation for settings will be strictly more detailed and complete than the documentation on the website. This is the first step before making the website documentation always auto-generated from the source code. This has long-standing implications: - it will be guaranteed to have every setting; - there is no chance of having default values obsolete; - we can generate this documentation for each ClickHouse version; - the documentation can be displayed by the server itself even without Internet access. Generate the docs on the website from the source code. [#70289](https://github.com/ClickHouse/ClickHouse/pull/70289) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Allow empty needle in the function `replace`, the same behavior as in PostgreSQL. [#69918](https://github.com/ClickHouse/ClickHouse/pull/69918) ([zhanglistar](https://github.com/zhanglistar)).
* Allow empty needle in functions `replaceRegexp*`. [#70053](https://github.com/ClickHouse/ClickHouse/pull/70053) ([zhanglistar](https://github.com/zhanglistar)).
* Symbolic links for tables in the `data/database_name/` directory are created for the actual paths to the table's data, depending on the storage policy, instead of the `store/...` directory on the default disk. [#61777](https://github.com/ClickHouse/ClickHouse/pull/61777) ([Kirill](https://github.com/kirillgarbar)).
* While parsing an `Enum` field from `JSON`, a string containing an integer will be interpreted as the corresponding `Enum` element. This closes [#65119](https://github.com/ClickHouse/ClickHouse/issues/65119). [#66801](https://github.com/ClickHouse/ClickHouse/pull/66801) ([scanhex12](https://github.com/scanhex12)).
* Allow `TRIM`-ing a `LEADING` or `TRAILING` empty string as a no-op. Closes [#67792](https://github.com/ClickHouse/ClickHouse/issues/67792). [#68455](https://github.com/ClickHouse/ClickHouse/pull/68455) ([Peter Nguyen](https://github.com/petern48)).
* Improve compatibility of `cast(timestamp as String)` with Spark. [#69179](https://github.com/ClickHouse/ClickHouse/pull/69179) ([Wenzheng Liu](https://github.com/lwz9103)).
* Always use the new analyzer to calculate constant expressions when `enable_analyzer` is set to `true`. Support calculation of `executable` table function arguments without using `SELECT` query for constant expressions. [#69292](https://github.com/ClickHouse/ClickHouse/pull/69292) ([Dmitry Novik](https://github.com/novikd)).
* Add a setting `enable_secure_identifiers` to disallow identifiers with special characters. [#69411](https://github.com/ClickHouse/ClickHouse/pull/69411) ([tuanpach](https://github.com/tuanpach)).
* Add `show_create_query_identifier_quoting_rule` to define identifier quoting behavior in the `SHOW CREATE TABLE` query result. Possible values: - `user_display`: When the identifier is a keyword. - `when_necessary`: When the identifier is one of `{"distinct", "all", "table"}` and when it could lead to ambiguity: column names, dictionary attribute names. - `always`: Always quote identifiers. [#69448](https://github.com/ClickHouse/ClickHouse/pull/69448) ([tuanpach](https://github.com/tuanpach)).
* Improve restoring of access entities' dependencies. [#69563](https://github.com/ClickHouse/ClickHouse/pull/69563) ([Vitaly Baranov](https://github.com/vitlibar)).
* If you run `clickhouse-client` or other CLI application, and it starts up slowly due to an overloaded server, and you start typing your query, such as `SELECT`, the previous versions will display the remainder of the terminal echo contents before printing the greetings message, such as `SELECTClickHouse local version 24.10.1.1.` instead of `ClickHouse local version 24.10.1.1.`. Now it is fixed. This closes [#31696](https://github.com/ClickHouse/ClickHouse/issues/31696). [#69856](https://github.com/ClickHouse/ClickHouse/pull/69856) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add new column `readonly_duration` to the `system.replicas` table. Needed to be able to distinguish actual readonly replicas from sentinel ones in alerts. [#69871](https://github.com/ClickHouse/ClickHouse/pull/69871) ([Miсhael Stetsyuk](https://github.com/mstetsyuk)).
* Change the type of the `join_output_by_rowlist_perkey_rows_threshold` setting to unsigned integer. [#69886](https://github.com/ClickHouse/ClickHouse/pull/69886) ([kevinyhzou](https://github.com/KevinyhZou)).
* Enhance OpenTelemetry span logging to include query settings. [#70011](https://github.com/ClickHouse/ClickHouse/pull/70011) ([sharathks118](https://github.com/sharathks118)).
* Add diagnostic info about higher-order array functions if lambda result type is unexpected. [#70093](https://github.com/ClickHouse/ClickHouse/pull/70093) ([ttanay](https://github.com/ttanay)).
* Keeper improvement: less locking during cluster changes. [#70275](https://github.com/ClickHouse/ClickHouse/pull/70275) ([Antonio Andelic](https://github.com/antonio2368)).
* Add `WITH IMPLICIT` and `FINAL` keywords to the `SHOW GRANTS` command. Fix a minor bug with implicit grants: [#70094](https://github.com/ClickHouse/ClickHouse/issues/70094). [#70293](https://github.com/ClickHouse/ClickHouse/pull/70293) ([pufit](https://github.com/pufit)).
* Respect `compatibility` for MergeTree settings. The `compatibility` value is taken from the `default` profile on server startup, and default MergeTree settings are changed accordingly. Further changes of the `compatibility` setting do not affect MergeTree settings. [#70322](https://github.com/ClickHouse/ClickHouse/pull/70322) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Avoid spamming the logs with large HTTP response bodies in case of errors during inter-server communication. [#70487](https://github.com/ClickHouse/ClickHouse/pull/70487) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Added a new setting `max_parts_to_move` to control the maximum number of parts that can be moved at once. [#70520](https://github.com/ClickHouse/ClickHouse/pull/70520) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Limit the frequency of certain log messages. [#70601](https://github.com/ClickHouse/ClickHouse/pull/70601) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* `CHECK TABLE` with `PART` qualifier was incorrectly formatted in the client. [#70660](https://github.com/ClickHouse/ClickHouse/pull/70660) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Support writing the column index and the offset index using parquet native writer. [#70669](https://github.com/ClickHouse/ClickHouse/pull/70669) ([LiuNeng](https://github.com/liuneng1994)).
* Support parsing `DateTime64` for microsecond and timezone in joda syntax ("joda" is a popular Java library for date and time, and the "joda syntax" is that library's style). [#70737](https://github.com/ClickHouse/ClickHouse/pull/70737) ([kevinyhzou](https://github.com/KevinyhZou)).
* Changed an approach to figure out if a cloud storage supports [batch delete](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html) or not. [#70786](https://github.com/ClickHouse/ClickHouse/pull/70786) ([Vitaly Baranov](https://github.com/vitlibar)).
* Support for Parquet page v2 in the native reader. [#70807](https://github.com/ClickHouse/ClickHouse/pull/70807) ([Arthur Passos](https://github.com/arthurpassos)).
* Added a check for whether a table has both `storage_policy` and `disk` set, and a check for whether a new storage policy is compatible with the old one when the `disk` setting is used. [#70839](https://github.com/ClickHouse/ClickHouse/pull/70839) ([Kirill](https://github.com/kirillgarbar)).
* Add `system.s3_queue_settings` and `system.azure_queue_settings`. [#70841](https://github.com/ClickHouse/ClickHouse/pull/70841) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Functions `base58Encode` and `base58Decode` now accept arguments of type `FixedString`. Example: `SELECT base58Encode(toFixedString('plaintext', 9));`. [#70846](https://github.com/ClickHouse/ClickHouse/pull/70846) ([Faizan Patel](https://github.com/faizan2786)).
* Add the `partition` column to every entry type of the part log. Previously, it was set only for some entries. This closes [#70819](https://github.com/ClickHouse/ClickHouse/issues/70819). [#70848](https://github.com/ClickHouse/ClickHouse/pull/70848) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add `MergeStart` and `MutateStart` events into `system.part_log` which helps with merges analysis and visualization. [#70850](https://github.com/ClickHouse/ClickHouse/pull/70850) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add a profile event about the number of merged source parts. It allows the monitoring of the fanout of the merge tree in production. [#70908](https://github.com/ClickHouse/ClickHouse/pull/70908) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Background downloads to the filesystem cache were re-enabled. [#70929](https://github.com/ClickHouse/ClickHouse/pull/70929) ([Nikita Taranov](https://github.com/nickitat)).
* Add a new merge selector algorithm, named `Trivial`, for professional usage only. It is worse than the `Simple` merge selector. [#70969](https://github.com/ClickHouse/ClickHouse/pull/70969) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Support for atomic `CREATE OR REPLACE VIEW`. [#70536](https://github.com/ClickHouse/ClickHouse/pull/70536) ([tuanpach](https://github.com/tuanpach)).
* Added `strict_once` mode to aggregate function `windowFunnel` to avoid counting one event several times in case it matches multiple conditions. Closes [#21835](https://github.com/ClickHouse/ClickHouse/issues/21835). [#69738](https://github.com/ClickHouse/ClickHouse/pull/69738) ([Vladimir Cherkasov](https://github.com/vdimir)).
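The Bool command-line behavior listed above (a Bool setting passed without a value is treated as true) resembles an optional-value flag. The sketch below uses Python's `argparse` purely as an analogy; it is not how `clickhouse-client` parses its arguments, and the helper names `parse_bool` and `parse_args` are hypothetical.

```python
import argparse


def parse_bool(text):
    """Convert an explicit command-line value to a boolean."""
    lowered = text.lower()
    if lowered in ("1", "true"):
        return True
    if lowered in ("0", "false"):
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {text!r}")


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # nargs="?" makes the value optional; const=True is used when it is omitted.
    parser.add_argument(
        "--optimize_aggregation_in_order",
        nargs="?",
        const=True,
        default=False,
        type=parse_bool,
    )
    parser.add_argument("--query")
    return parser.parse_args(argv)


args = parse_args(["--optimize_aggregation_in_order", "--query", "SELECT 1"])
print(args.optimize_aggregation_in_order, args.query)  # prints: True SELECT 1
```

An explicit value such as `--optimize_aggregation_in_order 0` still goes through `parse_bool`, so omitting the value is only a shortcut for the true case.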
#### Bug Fix (user-visible misbehavior in an official stable release)
* Apply configuration updates in global context object. It fixes issues like [#62308](https://github.com/ClickHouse/ClickHouse/issues/62308). [#62944](https://github.com/ClickHouse/ClickHouse/pull/62944) ([Amos Bird](https://github.com/amosbird)).
* Fix `ReadSettings` not using user-set values, because only defaults were used. [#65625](https://github.com/ClickHouse/ClickHouse/pull/65625) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Fix type mismatch issue in `sumMapFiltered` when using signed arguments. [#58408](https://github.com/ClickHouse/ClickHouse/pull/58408) ([Chen768959](https://github.com/Chen768959)).
* Fix toHour-like conversion functions' monotonicity when optional time zone argument is passed. [#60264](https://github.com/ClickHouse/ClickHouse/pull/60264) ([Amos Bird](https://github.com/amosbird)).
* Relax `supportsPrewhere` check for `Merge` tables. This fixes [#61064](https://github.com/ClickHouse/ClickHouse/issues/61064). It was hardened unnecessarily in [#60082](https://github.com/ClickHouse/ClickHouse/issues/60082). [#61091](https://github.com/ClickHouse/ClickHouse/pull/61091) ([Amos Bird](https://github.com/amosbird)).
* Fix `use_concurrency_control` setting handling for proper `concurrent_threads_soft_limit_num` limit enforcing. This enables concurrency control by default because previously it was broken. [#61473](https://github.com/ClickHouse/ClickHouse/pull/61473) ([Sergei Trifonov](https://github.com/serxa)).
* Fix incorrect `JOIN ON` section optimization in case of `IS NULL` check under any other function (like `NOT`) that may lead to wrong results. Closes [#67915](https://github.com/ClickHouse/ClickHouse/issues/67915). [#68049](https://github.com/ClickHouse/ClickHouse/pull/68049) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Prevent `ALTER` queries that would make the `CREATE` query of tables invalid. [#68574](https://github.com/ClickHouse/ClickHouse/pull/68574) ([János Benjamin Antal](https://github.com/antaljanosbenjamin)).
* Fix inconsistent AST formatting for `negate` (`-`) and `NOT` functions with tuples and arrays. [#68600](https://github.com/ClickHouse/ClickHouse/pull/68600) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Fix insertion of incomplete type into `Dynamic` during deserialization. It could lead to `Parameter out of bound` errors. [#69291](https://github.com/ClickHouse/ClickHouse/pull/69291) ([Pavel Kruglov](https://github.com/Avogar)).
* Zero-copy replication, which is experimental and should not be used in production: fix infinite loop after `restore replica` in the replicated merge tree with zero copy. [#69293](https://github.com/ClickHouse/ClickHouse/pull/69293) ([MikhailBurdukov](https://github.com/MikhailBurdukov)).
* Restore the default value of `processing_threads_num` to the number of CPU cores in storage `S3Queue`. [#69384](https://github.com/ClickHouse/ClickHouse/pull/69384) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Bypass try/catch flow when de/serializing nested repeated protobuf to nested columns (fixes [#41971](https://github.com/ClickHouse/ClickHouse/issues/41971)). [#69556](https://github.com/ClickHouse/ClickHouse/pull/69556) ([Eliot Hautefeuille](https://github.com/hileef)).
* Fix crash during insertion into FixedString column in PostgreSQL engine. [#69584](https://github.com/ClickHouse/ClickHouse/pull/69584) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix crash when executing `create view t as (with recursive 42 as ttt select ttt);`. [#69676](https://github.com/ClickHouse/ClickHouse/pull/69676) ([Han Fei](https://github.com/hanfei1991)).
* Fixed `maxMapState` throwing 'Bad get' if value type is DateTime64. [#69787](https://github.com/ClickHouse/ClickHouse/pull/69787) ([Michael Kolupaev](https://github.com/al13n321)).
* Fix `getSubcolumn` with `LowCardinality` columns by overriding `useDefaultImplementationForLowCardinalityColumns` to return `true`. [#69831](https://github.com/ClickHouse/ClickHouse/pull/69831) ([Miсhael Stetsyuk](https://github.com/mstetsyuk)).
* Fix permanently blocked distributed sends if a DROP of distributed table failed. [#69843](https://github.com/ClickHouse/ClickHouse/pull/69843) ([Azat Khuzhin](https://github.com/azat)).
* Fix non-cancellable queries containing WITH FILL with NaN keys. This closes [#69261](https://github.com/ClickHouse/ClickHouse/issues/69261). [#69845](https://github.com/ClickHouse/ClickHouse/pull/69845) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix analyzer default with old compatibility value. [#69895](https://github.com/ClickHouse/ClickHouse/pull/69895) ([Raúl Marín](https://github.com/Algunenano)).
* Don't check dependencies during CREATE OR REPLACE VIEW during DROP of old table. Previously CREATE OR REPLACE query failed when there are dependent tables of the recreated view. [#69907](https://github.com/ClickHouse/ClickHouse/pull/69907) ([Pavel Kruglov](https://github.com/Avogar)).
* Something for Decimal. Fixes [#69730](https://github.com/ClickHouse/ClickHouse/issues/69730). [#69978](https://github.com/ClickHouse/ClickHouse/pull/69978) ([Arthur Passos](https://github.com/arthurpassos)).
* Now DEFINER/INVOKER will work with parameterized views. [#69984](https://github.com/ClickHouse/ClickHouse/pull/69984) ([pufit](https://github.com/pufit)).
* Fix parsing for view's definers. [#69985](https://github.com/ClickHouse/ClickHouse/pull/69985) ([pufit](https://github.com/pufit)).
* Fixed a bug when the timezone could change the result of the query with `Date` or `Date32` arguments. [#70036](https://github.com/ClickHouse/ClickHouse/pull/70036) ([Yarik Briukhovetskyi](https://github.com/yariks5s)).
* Fixes `Block structure mismatch` for queries with nested views and `WHERE` condition. Fixes [#66209](https://github.com/ClickHouse/ClickHouse/issues/66209). [#70054](https://github.com/ClickHouse/ClickHouse/pull/70054) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Avoid reusing columns among different named tuples when evaluating `tuple` functions. This fixes [#70022](https://github.com/ClickHouse/ClickHouse/issues/70022). [#70103](https://github.com/ClickHouse/ClickHouse/pull/70103) ([Amos Bird](https://github.com/amosbird)).
* Fix wrong LOGICAL_ERROR when replacing literals in ranges. [#70122](https://github.com/ClickHouse/ClickHouse/pull/70122) ([Pablo Marcos](https://github.com/pamarcos)).
* Check for Nullable(Nothing) type during ALTER TABLE MODIFY COLUMN/QUERY to prevent tables with such data type. [#70123](https://github.com/ClickHouse/ClickHouse/pull/70123) ([Pavel Kruglov](https://github.com/Avogar)).
* Proper error message for illegal query `JOIN ... ON *`. Closes [#68650](https://github.com/ClickHouse/ClickHouse/issues/68650). [#70124](https://github.com/ClickHouse/ClickHouse/pull/70124) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Fix wrong result with skipping index. [#70127](https://github.com/ClickHouse/ClickHouse/pull/70127) ([Raúl Marín](https://github.com/Algunenano)).
* Fix data race in ColumnObject/ColumnTuple decompress method that could lead to heap use after free. [#70137](https://github.com/ClickHouse/ClickHouse/pull/70137) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix a possible hang in ALTER COLUMN with the Dynamic type. [#70144](https://github.com/ClickHouse/ClickHouse/pull/70144) ([Pavel Kruglov](https://github.com/Avogar)).
* Now ClickHouse will consider more errors as retriable and will not mark data parts as broken in case of such errors. [#70145](https://github.com/ClickHouse/ClickHouse/pull/70145) ([alesapin](https://github.com/alesapin)).
* Use correct `max_types` parameter during Dynamic type creation for JSON subcolumn. [#70147](https://github.com/ClickHouse/ClickHouse/pull/70147) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix the password being displayed in `system.query_log` for users with bcrypt password authentication method. [#70148](https://github.com/ClickHouse/ClickHouse/pull/70148) ([Nikolay Degterinsky](https://github.com/evillique)).
* Fix event counter for the native interface (InterfaceNativeSendBytes). [#70153](https://github.com/ClickHouse/ClickHouse/pull/70153) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Fix possible crash related to JSON columns. [#70172](https://github.com/ClickHouse/ClickHouse/pull/70172) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix multiple issues with arrayMin and arrayMax. [#70207](https://github.com/ClickHouse/ClickHouse/pull/70207) ([Raúl Marín](https://github.com/Algunenano)).
* Respect setting allow_simdjson in the JSON type parser. [#70218](https://github.com/ClickHouse/ClickHouse/pull/70218) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix a null pointer dereference on creating a materialized view with two selects and an `INTERSECT`, e.g. `CREATE MATERIALIZED VIEW v0 AS (SELECT 1) INTERSECT (SELECT 1);`. [#70264](https://github.com/ClickHouse/ClickHouse/pull/70264) ([Konstantin Bogdanov](https://github.com/thevar1able)).
* Don't modify global settings with startup scripts. Previously, changing a setting in a startup script would change it globally. [#70310](https://github.com/ClickHouse/ClickHouse/pull/70310) ([Antonio Andelic](https://github.com/antonio2368)).
* Fix ALTER of `Dynamic` type with reducing max_types parameter that could lead to server crash. [#70328](https://github.com/ClickHouse/ClickHouse/pull/70328) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix crash when using WITH FILL incorrectly. [#70338](https://github.com/ClickHouse/ClickHouse/pull/70338) ([Raúl Marín](https://github.com/Algunenano)).
* Fix possible use-after-free in `SYSTEM DROP FORMAT SCHEMA CACHE FOR Protobuf`. [#70358](https://github.com/ClickHouse/ClickHouse/pull/70358) ([Azat Khuzhin](https://github.com/azat)).
* Fix crash during GROUP BY JSON sub-object subcolumn. [#70374](https://github.com/ClickHouse/ClickHouse/pull/70374) ([Pavel Kruglov](https://github.com/Avogar)).
* Don't prefetch parts for vertical merges if part has no rows. [#70452](https://github.com/ClickHouse/ClickHouse/pull/70452) ([Antonio Andelic](https://github.com/antonio2368)).
* Fix crash in WHERE with lambda functions. [#70464](https://github.com/ClickHouse/ClickHouse/pull/70464) ([Raúl Marín](https://github.com/Algunenano)).
* Fix table creation with `CREATE ... AS table_function(...)` with database `Replicated` and unavailable table function source on secondary replica. [#70511](https://github.com/ClickHouse/ClickHouse/pull/70511) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Ignore all output on async insert with `wait_for_async_insert=1`. Closes [#62644](https://github.com/ClickHouse/ClickHouse/issues/62644). [#70530](https://github.com/ClickHouse/ClickHouse/pull/70530) ([Konstantin Bogdanov](https://github.com/thevar1able)).
* Ignore frozen_metadata.txt while traversing shadow directory from system.remote_data_paths. [#70590](https://github.com/ClickHouse/ClickHouse/pull/70590) ([Aleksei Filatov](https://github.com/aalexfvk)).
* Fix creation of stateful window functions on misaligned memory. [#70631](https://github.com/ClickHouse/ClickHouse/pull/70631) ([Raúl Marín](https://github.com/Algunenano)).
* Fixed rare crashes in `SELECT`-s and merges after adding a column of `Array` type with non-empty default expression. [#70695](https://github.com/ClickHouse/ClickHouse/pull/70695) ([Anton Popov](https://github.com/CurtizJ)).
* Insert into table function s3 will respect query settings. [#70696](https://github.com/ClickHouse/ClickHouse/pull/70696) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Fix infinite recursion when inferring a protobuf schema when skipping unsupported fields is enabled. [#70697](https://github.com/ClickHouse/ClickHouse/pull/70697) ([Raúl Marín](https://github.com/Algunenano)).
* Disable enable_named_columns_in_function_tuple by default. [#70833](https://github.com/ClickHouse/ClickHouse/pull/70833) ([Raúl Marín](https://github.com/Algunenano)).
* Fix S3Queue table engine setting processing_threads_num not being effective in case it was deduced from the number of cpu cores on the server. [#70837](https://github.com/ClickHouse/ClickHouse/pull/70837) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Normalize named tuple arguments in aggregation states. This fixes [#69732](https://github.com/ClickHouse/ClickHouse/issues/69732). [#70853](https://github.com/ClickHouse/ClickHouse/pull/70853) ([Amos Bird](https://github.com/amosbird)).
* Fix a logical error due to negative zeros in the two-level hash table. This closes [#70973](https://github.com/ClickHouse/ClickHouse/issues/70973). [#70979](https://github.com/ClickHouse/ClickHouse/pull/70979) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix `limit by`, `limit with ties` for distributed and parallel replicas. [#70880](https://github.com/ClickHouse/ClickHouse/pull/70880) ([Nikita Taranov](https://github.com/nickitat)).
### <a id="249"></a> ClickHouse release 24.9, 2024-09-26
#### Backward Incompatible Change

View File

@ -48,6 +48,7 @@ Upcoming meetups
* [Paris Meetup](https://www.meetup.com/clickhouse-france-user-group/events/303096434) - November 26
* [Amsterdam Meetup](https://www.meetup.com/clickhouse-netherlands-user-group/events/303638814) - December 3
* [New York Meetup](https://www.meetup.com/clickhouse-new-york-user-group/events/304268174) - December 9
* [San Francisco Meetup](https://www.meetup.com/clickhouse-silicon-valley-meetup-group/events/304286951/) - December 12
Recently completed meetups

View File

@ -28,7 +28,7 @@ COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && locale-gen en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LC_ALL=en_US.UTF-8
# Architecture of the image when BuildKit/buildx is used
ARG TARGETARCH

View File

@ -12,6 +12,7 @@ charset-normalizer==3.3.2
click==8.1.7
codespell==2.2.1
cryptography==43.0.1
datacompy==0.7.3
Deprecated==1.2.14
dill==0.3.8
flake8==4.0.1
@ -23,6 +24,7 @@ mccabe==0.6.1
multidict==6.0.5
mypy==1.8.0
mypy-extensions==1.0.0
pandas==2.2.3
packaging==24.1
pathspec==0.9.0
pip==24.1.1

View File

@ -290,6 +290,7 @@ The following settings can be specified in configuration file for given endpoint
- `expiration_window_seconds` — Grace period for checking if expiration-based credentials have expired. Optional, default value is `120`.
- `no_sign_request` - Ignore all the credentials so requests are not signed. Useful for accessing public buckets.
- `header` — Adds specified HTTP header to a request to given endpoint. Optional, can be specified multiple times.
- `access_header` - Adds specified HTTP header to a request to given endpoint, in cases where there are no other credentials from another source.
- `server_side_encryption_customer_key_base64` — If specified, required headers for accessing S3 objects with SSE-C encryption will be set. Optional.
- `server_side_encryption_kms_key_id` - If specified, required headers for accessing S3 objects with [SSE-KMS encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) will be set. If an empty string is specified, the AWS managed S3 key will be used. Optional.
- `server_side_encryption_kms_encryption_context` - If specified alongside `server_side_encryption_kms_key_id`, the given encryption context header for SSE-KMS will be set. Optional.
@ -320,6 +321,32 @@ The following settings can be specified in configuration file for given endpoint
</s3>
```
## Working with archives
Suppose that we have several archive files with the following URIs on S3:
- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-10.csv.zip'
- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-11.csv.zip'
- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-12.csv.zip'
Data can be extracted from these archives using `::`. Globs can be used both in the URL part and in the part after `::` (which names a file inside the archive).
``` sql
SELECT *
FROM s3(
'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-1{0..2}.csv.zip :: *.csv'
);
```
:::note
ClickHouse supports three archive formats:
- ZIP
- TAR
- 7Z
While ZIP and TAR archives can be accessed from any supported storage location, 7Z archives can only be read from the local filesystem where ClickHouse is installed.
:::
## Accessing public buckets
ClickHouse tries to fetch credentials from many different types of sources.

View File

@ -3224,6 +3224,34 @@ Default value: "default"
**See Also**
- [Workload Scheduling](/docs/en/operations/workload-scheduling.md)
## workload_path {#workload_path}
The directory used as storage for all `CREATE WORKLOAD` and `CREATE RESOURCE` queries. By default, the `/workload/` folder under the server working directory is used.
**Example**
``` xml
<workload_path>/var/lib/clickhouse/workload/</workload_path>
```
**See Also**
- [Workload Hierarchy](/docs/en/operations/workload-scheduling.md#workloads)
- [workload_zookeeper_path](#workload_zookeeper_path)
## workload_zookeeper_path {#workload_zookeeper_path}
The path to a ZooKeeper node, which is used as a storage for all `CREATE WORKLOAD` and `CREATE RESOURCE` queries. For consistency all SQL definitions are stored as a value of this single znode. By default ZooKeeper is not used and definitions are stored on [disk](#workload_path).
**Example**
``` xml
<workload_zookeeper_path>/clickhouse/workload/definitions.sql</workload_zookeeper_path>
```
**See Also**
- [Workload Hierarchy](/docs/en/operations/workload-scheduling.md#workloads)
- [workload_path](#workload_path)
## max_authentication_methods_per_user {#max_authentication_methods_per_user}
The maximum number of authentication methods a user can be created with or altered to.

View File

@ -18,6 +18,11 @@ Columns:
- `1` — Current user can't change the setting.
- `type` ([String](../../sql-reference/data-types/string.md)) — Setting type (implementation specific string value).
- `is_obsolete` ([UInt8](../../sql-reference/data-types/int-uint.md#uint-ranges)) - Shows whether a setting is obsolete.
- `tier` ([Enum8](../../sql-reference/data-types/enum.md)) — Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their development and the expectations one might have when using them. Values:
- `'Production'` — The feature is stable, safe to use and does not have issues interacting with other **production** features.
- `'Beta'` — The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome.
- `'Experimental'` — The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time.
- `'Obsolete'` — No longer supported. Either it is already removed or it will be removed in future releases.
**Example**
```sql

View File

@ -0,0 +1,37 @@
---
slug: /en/operations/system-tables/resources
---
# resources
Contains information for [resources](/docs/en/operations/workload-scheduling.md#workload_entity_storage) residing on the local server. The table contains a row for every resource.
Example:
``` sql
SELECT *
FROM system.resources
FORMAT Vertical
```
``` text
Row 1:
──────
name: io_read
read_disks: ['s3']
write_disks: []
create_query: CREATE RESOURCE io_read (READ DISK s3)
Row 2:
──────
name: io_write
read_disks: []
write_disks: ['s3']
create_query: CREATE RESOURCE io_write (WRITE DISK s3)
```
Columns:
- `name` (`String`) - Resource name.
- `read_disks` (`Array(String)`) - The array of disk names that use this resource for read operations.
- `write_disks` (`Array(String)`) - The array of disk names that use this resource for write operations.
- `create_query` (`String`) - The definition of the resource.
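
Because `read_disks` and `write_disks` are arrays, one way to see which resources govern a particular disk is to filter with the `has` array function. This is a minimal sketch, assuming a disk named `s3` as in the sample output above. Resources declared with `ANY DISK` may be reported differently, so treat it as a starting point.

```sql
-- List resources that apply to reads or writes on the disk 's3'.
-- The disk name comes from the sample output; substitute your own.
SELECT
    name,
    has(read_disks, 's3')  AS used_for_read,
    has(write_disks, 's3') AS used_for_write
FROM system.resources
WHERE has(read_disks, 's3') OR has(write_disks, 's3')
```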

View File

@ -18,6 +18,11 @@ Columns:
- `1` — Current user can't change the setting.
- `default` ([String](../../sql-reference/data-types/string.md)) — Setting default value.
- `is_obsolete` ([UInt8](../../sql-reference/data-types/int-uint.md#uint-ranges)) - Shows whether a setting is obsolete.
- `tier` ([Enum8](../../sql-reference/data-types/enum.md)) — Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their development and the expectations one might have when using them. Values:
- `'Production'` — The feature is stable, safe to use and does not have issues interacting with other **production** features.
- `'Beta'` — The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome.
- `'Experimental'` — The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time.
- `'Obsolete'` — No longer supported. Either it is already removed or it will be removed in future releases.
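
The `tier` column can also be used as a filter, for instance to audit which non-production settings are currently in effect. This is a minimal sketch that uses only columns described on this page:

```sql
-- Settings changed from their defaults whose tier is not Production.
SELECT name, value, `default`, tier
FROM system.settings
WHERE changed AND tier != 'Production'
ORDER BY tier, name
```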
**Example**
@ -26,19 +31,99 @@ The following example shows how to get information about settings which name con
``` sql
SELECT *
FROM system.settings
WHERE name LIKE '%min_i%'
WHERE name LIKE '%min_insert_block_size_%'
FORMAT Vertical
```
``` text
┌─name───────────────────────────────────────────────_─value─────_─changed─_─description───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────_─min──_─max──_─readonly─_─type─────────_─default───_─alias_for─_─is_obsolete─┐
│ min_insert_block_size_rows │ 1048449 │ 0 │ Squash blocks passed to INSERT query to specified size in rows, if blocks are not big enough. │ ________ │ 0 │ UInt64 │ 1048449 │ │ 0 │
│ min_insert_block_size_bytes │ 268402944 │ 0 │ Squash blocks passed to INSERT query to specified size in bytes, if blocks are not big enough. │ ________ │ 0 │ UInt64 │ 268402944 │ │ 0 │
│ min_insert_block_size_rows_for_materialized_views │ 0 │ 0 │ Like min_insert_block_size_rows, but applied only during pushing to MATERIALIZED VIEW (default: min_insert_block_size_rows) │ ________ │ 0 │ UInt64 │ 0 │ │ 0 │
│ min_insert_block_size_bytes_for_materialized_views │ 0 │ 0 │ Like min_insert_block_size_bytes, but applied only during pushing to MATERIALIZED VIEW (default: min_insert_block_size_bytes) │ ________ │ 0 │ UInt64 │ 0 │ │ 0 │
│ read_backoff_min_interval_between_events_ms │ 1000 │ 0 │ Settings to reduce the number of threads in case of slow reads. Do not pay attention to the event, if the previous one has passed less than a certain amount of time. │ ________ │ 0 │ Milliseconds │ 1000 │ │ 0 │
└────────────────────────────────────────────────────┴───────────┴─────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
──────────────────────────────────────────────────────┴──────┴──────┴──────────┴──────────────┴───────────┴───────────┴─────────────┘
```
Row 1:
──────
name: min_insert_block_size_rows
value: 1048449
changed: 0
description: Sets the minimum number of rows in the block that can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones.
Possible values:
- Positive integer.
- 0 — Squashing disabled.
min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: UInt64
default: 1048449
alias_for:
is_obsolete: 0
tier: Production
Row 2:
──────
name: min_insert_block_size_bytes
value: 268402944
changed: 0
description: Sets the minimum number of bytes in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones.
Possible values:
- Positive integer.
- 0 — Squashing disabled.
min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: UInt64
default: 268402944
alias_for:
is_obsolete: 0
tier: Production
Row 3:
──────
name: min_insert_block_size_rows_for_materialized_views
value: 0
changed: 0
description: Sets the minimum number of rows in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. This setting is applied only for blocks inserted into [materialized view](../../sql-reference/statements/create/view.md). By adjusting this setting, you control blocks squashing while pushing to materialized view and avoid excessive memory usage.
Possible values:
- Any positive integer.
- 0 — Squashing disabled.
**See Also**
- [min_insert_block_size_rows](#min-insert-block-size-rows)
min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: UInt64
default: 0
alias_for:
is_obsolete: 0
tier: Production
Row 4:
──────
name: min_insert_block_size_bytes_for_materialized_views
value: 0
changed: 0
description: Sets the minimum number of bytes in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. This setting is applied only for blocks inserted into [materialized view](../../sql-reference/statements/create/view.md). By adjusting this setting, you control blocks squashing while pushing to materialized view and avoid excessive memory usage.
Possible values:
- Any positive integer.
- 0 — Squashing disabled.
**See also**
- [min_insert_block_size_bytes](#min-insert-block-size-bytes)
min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: UInt64
default: 0
alias_for:
is_obsolete: 0
tier: Production
```
Using `WHERE changed` can be useful, for example, when you want to check:

View File

@ -0,0 +1,40 @@
---
slug: /en/operations/system-tables/workloads
---
# workloads
Contains information for [workloads](/docs/en/operations/workload-scheduling.md#workload_entity_storage) residing on the local server. The table contains a row for every workload.
Example:
``` sql
SELECT *
FROM system.workloads
FORMAT Vertical
```
``` text
Row 1:
──────
name: production
parent: all
create_query: CREATE WORKLOAD production IN `all` SETTINGS weight = 9
Row 2:
──────
name: development
parent: all
create_query: CREATE WORKLOAD development IN `all`
Row 3:
──────
name: all
parent:
create_query: CREATE WORKLOAD `all`
```
Columns:
- `name` (`String`) - Workload name.
- `parent` (`String`) - Parent workload name.
- `create_query` (`String`) - The definition of the workload.
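
Since each row records its `parent`, the direct children of a workload can be listed with a simple filter. This is a sketch, assuming the `all` root workload from the sample output above:

```sql
-- Direct children of the root workload 'all'.
SELECT name, create_query
FROM system.workloads
WHERE parent = 'all'
ORDER BY name
```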

View File

@ -43,6 +43,20 @@ Example:
</clickhouse>
```
An alternative way to express which disks are used by a resource is SQL syntax:
```sql
CREATE RESOURCE resource_name (WRITE DISK disk1, READ DISK disk2)
```
A resource can be used for any number of disks, for READ, for WRITE, or for both. There is also a syntax that allows a resource to be used for all disks:
```sql
CREATE RESOURCE all_io (READ ANY DISK, WRITE ANY DISK);
```
Note that server configuration options take priority over resources defined with SQL.
## Workload markup {#workload_markup}
Queries can be marked with the `workload` setting to distinguish different workloads. If `workload` is not set, then the value "default" is used. Note that you can specify another value using settings profiles. Setting constraints can be used to make `workload` constant if you want all queries from a user to be marked with a fixed value of the `workload` setting.
@ -153,9 +167,48 @@ Example:
</clickhouse>
```
## Workload hierarchy (SQL only) {#workloads}
Defining resources and classifiers in XML can be challenging. ClickHouse provides SQL syntax that is much more convenient. All resources created with `CREATE RESOURCE` share the same hierarchy structure, but can differ in some aspects. Every workload created with `CREATE WORKLOAD` maintains a few automatically created scheduling nodes for every resource. A child workload can be created inside another parent workload. Here is an example that defines exactly the same hierarchy as the XML configuration above:
```sql
CREATE RESOURCE network_write (WRITE DISK s3)
CREATE RESOURCE network_read (READ DISK s3)
CREATE WORKLOAD all SETTINGS max_requests = 100
CREATE WORKLOAD development IN all
CREATE WORKLOAD production IN all SETTINGS weight = 3
```
The name of a leaf workload without children can be used in query settings: `SETTINGS workload = 'name'`. Note that workload classifiers are also created automatically when using SQL syntax.
To customize a workload, the following settings can be used:
* `priority` - sibling workloads are served according to static priority values (lower value means higher priority).
* `weight` - sibling workloads having the same static priority share resources according to weights.
* `max_requests` - the limit on the number of concurrent resource requests in this workload.
* `max_cost` - the limit on the total inflight bytes count of concurrent resource requests in this workload.
* `max_speed` - the limit on byte processing rate of this workload (the limit is independent for every resource).
* `max_burst` - maximum number of bytes that could be processed by the workload without being throttled (for every resource independently).
Note that workload settings are translated into a proper set of scheduling nodes. For more details, see the description of the scheduling node [types and options](#hierarchy).
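
As an illustration of how these settings combine, the sketch below gives an interactive workload precedence over a throttled batch workload. The workload names `interactive` and `batch` and all numeric values are hypothetical. Only the setting names come from the list above.

```sql
-- Root workload: at most 100 concurrent resource requests overall.
CREATE WORKLOAD all SETTINGS max_requests = 100;

-- A lower priority value means this workload is served first among its siblings.
CREATE WORKLOAD interactive IN all SETTINGS priority = 0;

-- Served after `interactive`. max_speed caps the byte processing rate and
-- max_burst is the number of bytes allowed before throttling starts.
CREATE WORKLOAD batch IN all SETTINGS priority = 1, max_speed = 10000000, max_burst = 50000000;
```

A query would then opt into one of the leaf workloads with, for example, `SETTINGS workload = 'batch'`.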
There is no way to specify different hierarchies of workloads for different resources. However, a different value of a workload setting can be specified for a specific resource:
```sql
CREATE OR REPLACE WORKLOAD all SETTINGS max_requests = 100, max_speed = 1000000 FOR network_read, max_speed = 2000000 FOR network_write
```
Also note that a workload or resource cannot be dropped if it is referenced from another workload. To update the definition of a workload, use the `CREATE OR REPLACE WORKLOAD` query.
## Workloads and resources storage {#workload_entity_storage}
Definitions of all workloads and resources, in the form of `CREATE WORKLOAD` and `CREATE RESOURCE` queries, are stored persistently either on disk at `workload_path` or in ZooKeeper at `workload_zookeeper_path`. ZooKeeper storage is recommended to achieve consistency between nodes. Alternatively, the `ON CLUSTER` clause can be used along with disk storage.
## See also
- [system.scheduler](/docs/en/operations/system-tables/scheduler.md)
- [system.workloads](/docs/en/operations/system-tables/workloads.md)
- [system.resources](/docs/en/operations/system-tables/resources.md)
- [merge_workload](/docs/en/operations/settings/merge-tree-settings.md#merge_workload) merge tree setting
- [merge_workload](/docs/en/operations/server-configuration-parameters/settings.md#merge_workload) global server setting
- [mutation_workload](/docs/en/operations/settings/merge-tree-settings.md#mutation_workload) merge tree setting
- [mutation_workload](/docs/en/operations/server-configuration-parameters/settings.md#mutation_workload) global server setting
- [workload_path](/docs/en/operations/server-configuration-parameters/settings.md#workload_path) global server setting
- [workload_zookeeper_path](/docs/en/operations/server-configuration-parameters/settings.md#workload_zookeeper_path) global server setting

View File

@ -12,7 +12,7 @@ Syntax:
``` sql
ALTER USER [IF EXISTS] name1 [RENAME TO new_name |, name2 [,...]]
[ON CLUSTER cluster_name]
[NOT IDENTIFIED | RESET AUTHENTICATION METHODS TO NEW | {IDENTIFIED | ADD IDENTIFIED} {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']}
[NOT IDENTIFIED | RESET AUTHENTICATION METHODS TO NEW | {IDENTIFIED | ADD IDENTIFIED} {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} [VALID UNTIL datetime]
[, {[{plaintext_password | sha256_password | sha256_hash | ...}] BY {'password' | 'hash'}} | {ldap SERVER 'server_name'} | {...} | ... [,...]]]
[[ADD | DROP] HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE]
[VALID UNTIL datetime]
@ -91,3 +91,15 @@ Reset authentication methods and keep the most recent added one:
``` sql
ALTER USER user1 RESET AUTHENTICATION METHODS TO NEW
```
## VALID UNTIL Clause
Allows you to specify the expiration date and, optionally, the time for an authentication method. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
The `VALID UNTIL` clause can only be specified along with an authentication method, except for the case where no authentication method has been specified in the query. In this scenario, the `VALID UNTIL` clause will be applied to all existing authentication methods.
Examples:
- `ALTER USER name1 VALID UNTIL '2025-01-01'`
- `ALTER USER name1 VALID UNTIL '2025-01-01 12:00:00 UTC'`
- `ALTER USER name1 VALID UNTIL 'infinity'`
- `ALTER USER name1 IDENTIFIED WITH plaintext_password BY 'no_expiration', bcrypt_password BY 'expiration_set' VALID UNTIL '2025-01-01'`
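
To give a user a temporary additional credential, `VALID UNTIL` can be attached to a method added with `ADD IDENTIFIED`, following the syntax shown at the top of this page. This is a minimal sketch with a hypothetical password and expiration date:

```sql
-- Add a second, short-lived authentication method.
-- Methods the user already has are kept as they are.
ALTER USER name1 ADD IDENTIFIED WITH sha256_password BY 'temporary_password' VALID UNTIL '2025-06-01';

-- Inspect the resulting user definition.
SHOW CREATE USER name1;
```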

View File

@ -11,7 +11,7 @@ Syntax:
``` sql
CREATE USER [IF NOT EXISTS | OR REPLACE] name1 [, name2 [,...]] [ON CLUSTER cluster_name]
[NOT IDENTIFIED | IDENTIFIED {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']}
[NOT IDENTIFIED | IDENTIFIED {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} [VALID UNTIL datetime]
[, {[{plaintext_password | sha256_password | sha256_hash | ...}] BY {'password' | 'hash'}} | {ldap SERVER 'server_name'} | {...} | ... [,...]]]
[HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE]
[VALID UNTIL datetime]
@ -178,7 +178,8 @@ ClickHouse treats `user_name@'address'` as a username as a whole. Thus, technica
## VALID UNTIL Clause
Allows you to specify the expiration date and, optionally, the time for user credentials. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
Allows you to specify the expiration date and, optionally, the time for an authentication method. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
The `VALID UNTIL` clause can only be specified along with an authentication method, except for the case where no authentication method has been specified in the query. In this scenario, the `VALID UNTIL` clause will be applied to all existing authentication methods.
Examples:
@ -186,6 +187,7 @@ Examples:
- `CREATE USER name1 VALID UNTIL '2025-01-01 12:00:00 UTC'`
- `CREATE USER name1 VALID UNTIL 'infinity'`
- ```CREATE USER name1 VALID UNTIL '2025-01-01 12:00:00 `Asia/Tokyo`'```
- `CREATE USER name1 IDENTIFIED WITH plaintext_password BY 'no_expiration', bcrypt_password BY 'expiration_set' VALID UNTIL '2025-01-01'`
## GRANTEES Clause

View File

@ -83,7 +83,7 @@ The presence of long-running or incomplete mutations often indicates that a Clic
- Or manually kill some of these mutations by sending a `KILL` command.
``` sql
KILL MUTATION [ON CLUSTER cluster]
KILL MUTATION
WHERE <where expression to SELECT FROM system.mutations query>
[TEST]
[FORMAT format]
@ -135,7 +135,6 @@ KILL MUTATION WHERE database = 'default' AND table = 'table'
-- Cancel the specific mutation:
KILL MUTATION WHERE database = 'default' AND table = 'table' AND mutation_id = 'mutation_3.txt'
```
:::tip If you are killing a mutation in ClickHouse Cloud or in a self-managed cluster, then be sure to use the ```ON CLUSTER [cluster-name]``` option, in order to ensure the mutation is killed on all replicas:::
The query is useful when a mutation is stuck and cannot finish (e.g. if some function in the mutation query throws an exception when applied to the data contained in the table).

View File

@ -284,6 +284,14 @@ FROM s3(
);
```
:::note
ClickHouse supports three archive formats:
- ZIP
- TAR
- 7Z
While ZIP and TAR archives can be accessed from any supported storage location, 7Z archives can only be read from the local filesystem where ClickHouse is installed.
:::
## Virtual Columns {#virtual-columns}

View File

@ -138,6 +138,7 @@ CREATE TABLE table_with_asterisk (name String, value UInt32)
- `use_insecure_imds_request` — whether to use a less secure connection when making a request to IMDS while obtaining credentials from Amazon EC2 metadata. Default value is `false`.
- `region` — S3 region name.
- `header` — adds the specified HTTP header to a request to the given endpoint. Can be specified multiple times.
- `access_header` - adds the specified HTTP header to a request to the given endpoint, in cases where no other authorization methods are specified.
- `server_side_encryption_customer_key_base64` — sets the headers required for accessing S3 objects with SSE-C encryption.
- `single_read_retries` — maximum number of request attempts during a single read. Default value is `4`.

View File

@ -590,6 +590,7 @@ try
#if USE_SSL
CertificateReloader::instance().tryLoad(*config);
CertificateReloader::instance().tryLoadClient(*config);
#endif
});

View File

@ -86,7 +86,7 @@
#include <Dictionaries/registerDictionaries.h>
#include <Disks/registerDisks.h>
#include <Common/Scheduler/Nodes/registerSchedulerNodes.h>
#include <Common/Scheduler/Nodes/registerResourceManagers.h>
#include <Common/Scheduler/Workload/IWorkloadEntityStorage.h>
#include <Common/Config/ConfigReloader.h>
#include <Server/HTTPHandlerFactory.h>
#include "MetricsTransmitter.h"
@ -920,7 +920,6 @@ try
registerFormats();
registerRemoteFileMetadatas();
registerSchedulerNodes();
registerResourceManagers();
CurrentMetrics::set(CurrentMetrics::Revision, ClickHouseRevision::getVersionRevision());
CurrentMetrics::set(CurrentMetrics::VersionInteger, ClickHouseRevision::getVersionInteger());
@ -2253,6 +2252,8 @@ try
database_catalog.assertDatabaseExists(default_database);
/// Load user-defined SQL functions.
global_context->getUserDefinedSQLObjectsStorage().loadObjects();
/// Load WORKLOADs and RESOURCEs.
global_context->getWorkloadEntityStorage().loadEntities();
global_context->getRefreshSet().setRefreshesStopped(false);
}
@ -2340,6 +2341,7 @@ try
#if USE_SSL
CertificateReloader::instance().tryLoad(config());
CertificateReloader::instance().tryLoadClient(config());
#endif
/// Must be done after initialization of `servers`, because async_metrics will access `servers` variable from its thread.

View File

@ -1399,6 +1399,10 @@
If not specified they will be stored locally. -->
<!-- <user_defined_zookeeper_path>/clickhouse/user_defined</user_defined_zookeeper_path> -->
<!-- Path in ZooKeeper to store workloads and resources created by the commands CREATE WORKLOAD and CREATE RESOURCE.
If not specified they will be stored locally. -->
<!-- <workload_zookeeper_path>/clickhouse/workload/definitions.sql</workload_zookeeper_path> -->
<!-- Uncomment if you want data to be compressed 30-100% better.
Don't do that if you just started using ClickHouse.
-->

View File

@ -1,12 +1,16 @@
#include <Access/AccessControl.h>
#include <Access/AuthenticationData.h>
#include <Common/Exception.h>
#include <Interpreters/Access/getValidUntilFromAST.h>
#include <Interpreters/Context.h>
#include <Interpreters/evaluateConstantExpression.h>
#include <Parsers/ASTExpressionList.h>
#include <Parsers/ASTLiteral.h>
#include <Parsers/Access/ASTPublicSSHKey.h>
#include <Storages/checkAndGetLiteralArgument.h>
#include <IO/parseDateTimeBestEffort.h>
#include <IO/ReadHelpers.h>
#include <IO/ReadBufferFromString.h>
#include <Common/OpenSSLHelpers.h>
#include <Poco/SHA1Engine.h>
@ -113,7 +117,8 @@ bool operator ==(const AuthenticationData & lhs, const AuthenticationData & rhs)
&& (lhs.ssh_keys == rhs.ssh_keys)
#endif
&& (lhs.http_auth_scheme == rhs.http_auth_scheme)
&& (lhs.http_auth_server_name == rhs.http_auth_server_name);
&& (lhs.http_auth_server_name == rhs.http_auth_server_name)
&& (lhs.valid_until == rhs.valid_until);
}
@ -384,14 +389,34 @@ std::shared_ptr<ASTAuthenticationData> AuthenticationData::toAST() const
throw Exception(ErrorCodes::LOGICAL_ERROR, "AST: Unexpected authentication type {}", toString(auth_type));
}
if (valid_until)
{
WriteBufferFromOwnString out;
writeDateTimeText(valid_until, out);
node->valid_until = std::make_shared<ASTLiteral>(out.str());
}
return node;
}
AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & query, ContextPtr context, bool validate)
{
time_t valid_until = 0;
if (query.valid_until)
{
valid_until = getValidUntilFromAST(query.valid_until, context);
}
if (query.type && query.type == AuthenticationType::NO_PASSWORD)
return AuthenticationData();
{
AuthenticationData auth_data;
auth_data.setValidUntil(valid_until);
return auth_data;
}
/// For this type of authentication we have ASTPublicSSHKey as children for ASTAuthenticationData
if (query.type && query.type == AuthenticationType::SSH_KEY)
@ -418,6 +443,7 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que
}
auth_data.setSSHKeys(std::move(keys));
auth_data.setValidUntil(valid_until);
return auth_data;
#else
throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "SSH is disabled, because ClickHouse is built without libssh");
@ -451,6 +477,8 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que
AuthenticationData auth_data(current_type);
auth_data.setValidUntil(valid_until);
if (validate)
context->getAccessControl().checkPasswordComplexityRules(value);
@ -494,6 +522,7 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que
}
AuthenticationData auth_data(*query.type);
auth_data.setValidUntil(valid_until);
if (query.contains_hash)
{

View File

@ -74,6 +74,9 @@ public:
const String & getHTTPAuthenticationServerName() const { return http_auth_server_name; }
void setHTTPAuthenticationServerName(const String & name) { http_auth_server_name = name; }
time_t getValidUntil() const { return valid_until; }
void setValidUntil(time_t valid_until_) { valid_until = valid_until_; }
friend bool operator ==(const AuthenticationData & lhs, const AuthenticationData & rhs);
friend bool operator !=(const AuthenticationData & lhs, const AuthenticationData & rhs) { return !(lhs == rhs); }
@ -106,6 +109,7 @@ private:
/// HTTP authentication properties
String http_auth_server_name;
HTTPAuthenticationScheme http_auth_scheme = HTTPAuthenticationScheme::BASIC;
time_t valid_until = 0;
};
}

View File

@ -99,6 +99,8 @@ enum class AccessType : uint8_t
M(CREATE_ARBITRARY_TEMPORARY_TABLE, "", GLOBAL, CREATE) /* allows to create and manipulate temporary tables
with arbitrary table engine */\
M(CREATE_FUNCTION, "", GLOBAL, CREATE) /* allows to execute CREATE FUNCTION */ \
M(CREATE_WORKLOAD, "", GLOBAL, CREATE) /* allows to execute CREATE WORKLOAD */ \
M(CREATE_RESOURCE, "", GLOBAL, CREATE) /* allows to execute CREATE RESOURCE */ \
M(CREATE_NAMED_COLLECTION, "", NAMED_COLLECTION, NAMED_COLLECTION_ADMIN) /* allows to execute CREATE NAMED COLLECTION */ \
M(CREATE, "", GROUP, ALL) /* allows to execute {CREATE|ATTACH} */ \
\
@ -108,6 +110,8 @@ enum class AccessType : uint8_t
implicitly enabled by the grant DROP_TABLE */\
M(DROP_DICTIONARY, "", DICTIONARY, DROP) /* allows to execute {DROP|DETACH} DICTIONARY */\
M(DROP_FUNCTION, "", GLOBAL, DROP) /* allows to execute DROP FUNCTION */\
M(DROP_WORKLOAD, "", GLOBAL, DROP) /* allows to execute DROP WORKLOAD */\
M(DROP_RESOURCE, "", GLOBAL, DROP) /* allows to execute DROP RESOURCE */\
M(DROP_NAMED_COLLECTION, "", NAMED_COLLECTION, NAMED_COLLECTION_ADMIN) /* allows to execute DROP NAMED COLLECTION */\
M(DROP, "", GROUP, ALL) /* allows to execute {DROP|DETACH} */\
\

View File

@ -701,15 +701,17 @@ bool ContextAccess::checkAccessImplHelper(const ContextPtr & context, AccessFlag
const AccessFlags dictionary_ddl = AccessType::CREATE_DICTIONARY | AccessType::DROP_DICTIONARY;
const AccessFlags function_ddl = AccessType::CREATE_FUNCTION | AccessType::DROP_FUNCTION;
const AccessFlags workload_ddl = AccessType::CREATE_WORKLOAD | AccessType::DROP_WORKLOAD;
const AccessFlags resource_ddl = AccessType::CREATE_RESOURCE | AccessType::DROP_RESOURCE;
const AccessFlags table_and_dictionary_ddl = table_ddl | dictionary_ddl;
const AccessFlags table_and_dictionary_and_function_ddl = table_ddl | dictionary_ddl | function_ddl;
const AccessFlags write_table_access = AccessType::INSERT | AccessType::OPTIMIZE;
const AccessFlags write_dcl_access = AccessType::ACCESS_MANAGEMENT - AccessType::SHOW_ACCESS;
const AccessFlags not_readonly_flags = write_table_access | table_and_dictionary_and_function_ddl | write_dcl_access | AccessType::SYSTEM | AccessType::KILL_QUERY;
const AccessFlags not_readonly_flags = write_table_access | table_and_dictionary_and_function_ddl | workload_ddl | resource_ddl | write_dcl_access | AccessType::SYSTEM | AccessType::KILL_QUERY;
const AccessFlags not_readonly_1_flags = AccessType::CREATE_TEMPORARY_TABLE;
const AccessFlags ddl_flags = table_ddl | dictionary_ddl | function_ddl;
const AccessFlags ddl_flags = table_ddl | dictionary_ddl | function_ddl | workload_ddl | resource_ddl;
const AccessFlags introspection_flags = AccessType::INTROSPECTION;
};
static const PrecalculatedFlags precalc;

View File

@ -554,7 +554,7 @@ std::optional<AuthResult> IAccessStorage::authenticateImpl(
continue;
}
if (areCredentialsValid(user->getName(), user->valid_until, auth_method, credentials, external_authenticators, auth_result.settings))
if (areCredentialsValid(user->getName(), auth_method, credentials, external_authenticators, auth_result.settings))
{
auth_result.authentication_data = auth_method;
return auth_result;
@ -579,7 +579,6 @@ std::optional<AuthResult> IAccessStorage::authenticateImpl(
bool IAccessStorage::areCredentialsValid(
const std::string & user_name,
time_t valid_until,
const AuthenticationData & authentication_method,
const Credentials & credentials,
const ExternalAuthenticators & external_authenticators,
@ -591,6 +590,7 @@ bool IAccessStorage::areCredentialsValid(
if (credentials.getUserName() != user_name)
return false;
auto valid_until = authentication_method.getValidUntil();
if (valid_until)
{
const time_t now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
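The hunks above move `valid_until` from `User` to each `AuthenticationData`, so expiry is now evaluated per authentication method. The hunk is cut off before the comparison itself, so the following is only a minimal sketch of such a check: `AuthMethodSketch` and `isNotExpired` are hypothetical names, and the `now > valid_until` comparison is an assumption rather than something shown in this diff.

```cpp
#include <ctime>

// Hypothetical stand-in for DB::AuthenticationData; only the expiry field is modelled.
struct AuthMethodSketch
{
    std::time_t valid_until = 0; // 0 means "no expiry", matching the default in the diff
};

// Returns true if the method may still be used at time `now`.
// Assumption: a method is rejected only once `now` is strictly past `valid_until`.
inline bool isNotExpired(const AuthMethodSketch & method, std::time_t now)
{
    if (method.valid_until == 0)
        return true;
    return now <= method.valid_until;
}
```

With this shape, a user holding two authentication methods can have one expire while the other keeps working, which a single `User::valid_until` could not express.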

View File

@ -236,7 +236,6 @@ protected:
bool allow_plaintext_password) const;
virtual bool areCredentialsValid(
const std::string & user_name,
time_t valid_until,
const AuthenticationData & authentication_method,
const Credentials & credentials,
const ExternalAuthenticators & external_authenticators,

View File

@ -19,8 +19,7 @@ bool User::equal(const IAccessEntity & other) const
return (authentication_methods == other_user.authentication_methods)
&& (allowed_client_hosts == other_user.allowed_client_hosts)
&& (access == other_user.access) && (granted_roles == other_user.granted_roles) && (default_roles == other_user.default_roles)
&& (settings == other_user.settings) && (grantees == other_user.grantees) && (default_database == other_user.default_database)
&& (valid_until == other_user.valid_until);
&& (settings == other_user.settings) && (grantees == other_user.grantees) && (default_database == other_user.default_database);
}
void User::setName(const String & name_)
@ -88,7 +87,6 @@ void User::clearAllExceptDependencies()
access = {};
settings.removeSettingsKeepProfiles();
default_database = {};
valid_until = 0;
}
}

View File

@ -23,7 +23,6 @@ struct User : public IAccessEntity
SettingsProfileElements settings;
RolesOrUsersSet grantees = RolesOrUsersSet::AllTag{};
String default_database;
time_t valid_until = 0;
bool equal(const IAccessEntity & other) const override;
std::shared_ptr<IAccessEntity> clone() const override { return cloneImpl<User>(); }

View File

@ -227,8 +227,13 @@ void QueryAnalyzer::resolveConstantExpression(QueryTreeNodePtr & node, const Que
scope.context = context;
auto node_type = node->getNodeType();
if (node_type == QueryTreeNodeType::QUERY || node_type == QueryTreeNodeType::UNION)
{
evaluateScalarSubqueryIfNeeded(node, scope);
return;
}
if (table_expression && node_type != QueryTreeNodeType::QUERY && node_type != QueryTreeNodeType::UNION)
if (table_expression)
{
scope.expression_join_tree_node = table_expression;
validateTableExpressionModifiers(scope.expression_join_tree_node, scope);

View File

@ -136,6 +136,7 @@ add_headers_and_sources(dbms Storages/ObjectStorage/HDFS)
add_headers_and_sources(dbms Storages/ObjectStorage/Local)
add_headers_and_sources(dbms Storages/ObjectStorage/DataLakes)
add_headers_and_sources(dbms Common/NamedCollections)
add_headers_and_sources(dbms Common/Scheduler/Workload)
if (TARGET ch_contrib::amqp_cpp)
add_headers_and_sources(dbms Storages/RabbitMQ)

View File

@ -418,7 +418,7 @@ void ClientApplicationBase::init(int argc, char ** argv)
UInt64 max_client_memory_usage_int = parseWithSizeSuffix<UInt64>(max_client_memory_usage.c_str(), max_client_memory_usage.length());
total_memory_tracker.setHardLimit(max_client_memory_usage_int);
total_memory_tracker.setDescription("(total)");
total_memory_tracker.setDescription("Global");
total_memory_tracker.setMetric(CurrentMetrics::MemoryTracking);
}

View File

@ -1454,8 +1454,22 @@ void ClientBase::resetOutput()
/// Order is important: format, compression, file
if (output_format)
output_format->finalize();
try
{
if (output_format)
output_format->finalize();
}
catch (...)
{
/// We need to make sure we continue resetting output_format (will stop threads on parallel output)
/// as well as cleaning other output related setup
if (!have_error)
{
client_exception
= std::make_unique<Exception>(getCurrentExceptionMessageAndPattern(print_stack_trace), getCurrentExceptionCode());
have_error = true;
}
}
output_format.reset();
logs_out_stream.reset();

View File

@ -183,8 +183,14 @@
M(BuildVectorSimilarityIndexThreadsScheduled, "Number of queued or active jobs in the build vector similarity index thread pool.") \
\
M(DiskPlainRewritableAzureDirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \
M(DiskPlainRewritableAzureFileCount, "Number of file entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \
M(DiskPlainRewritableAzureUniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \
M(DiskPlainRewritableLocalDirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \
M(DiskPlainRewritableLocalFileCount, "Number of file entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \
M(DiskPlainRewritableLocalUniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \
M(DiskPlainRewritableS3DirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \
M(DiskPlainRewritableS3FileCount, "Number of file entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \
M(DiskPlainRewritableS3UniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \
\
M(MergeTreePartsLoaderThreads, "Number of threads in the MergeTree parts loader thread pool.") \
M(MergeTreePartsLoaderThreadsActive, "Number of threads in the MergeTree parts loader thread pool running a task.") \

View File

@ -68,15 +68,15 @@ inline std::string_view toDescription(OvercommitResult result)
case OvercommitResult::NONE:
return "";
case OvercommitResult::DISABLED:
return "Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero.";
return "Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero";
case OvercommitResult::MEMORY_FREED:
throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "OvercommitResult::MEMORY_FREED shouldn't be asked for description");
case OvercommitResult::SELECTED:
return "Query was selected to stop by OvercommitTracker.";
return "Query was selected to stop by OvercommitTracker";
case OvercommitResult::TIMEOUTED:
return "Waiting timeout for memory to be freed is reached.";
return "Waiting timeout for memory to be freed is reached";
case OvercommitResult::NOT_ENOUGH_FREED:
return "Memory overcommit has freed not enough memory.";
return "Memory overcommit has not freed enough memory";
}
}
@ -150,15 +150,23 @@ void MemoryTracker::logPeakMemoryUsage()
auto peak_bytes = peak.load(std::memory_order::relaxed);
if (peak_bytes < 128 * 1024)
return;
LOG_DEBUG(getLogger("MemoryTracker"),
"Peak memory usage{}: {}.", (description ? " " + std::string(description) : ""), ReadableSize(peak_bytes));
LOG_DEBUG(
getLogger("MemoryTracker"),
"{}{} memory usage: {}.",
description ? std::string(description) : "",
description ? " peak" : "Peak",
ReadableSize(peak_bytes));
}
void MemoryTracker::logMemoryUsage(Int64 current) const
{
const auto * description = description_ptr.load(std::memory_order_relaxed);
LOG_DEBUG(getLogger("MemoryTracker"),
"Current memory usage{}: {}.", (description ? " " + std::string(description) : ""), ReadableSize(current));
LOG_DEBUG(
getLogger("MemoryTracker"),
"{}{} memory usage: {}.",
description ? std::string(description) : "",
description ? " current" : "Current",
ReadableSize(current));
}
void MemoryTracker::injectFault() const
@ -178,9 +186,9 @@ void MemoryTracker::injectFault() const
const auto * description = description_ptr.load(std::memory_order_relaxed);
throw DB::Exception(
DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
"Memory tracker{}{}: fault injected (at specific point)",
description ? " " : "",
description ? description : "");
"{}{}: fault injected (at specific point)",
description ? description : "",
description ? " memory tracker" : "Memory tracker");
}
void MemoryTracker::debugLogBigAllocationWithoutCheck(Int64 size [[maybe_unused]])
@ -282,9 +290,9 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed
const auto * description = description_ptr.load(std::memory_order_relaxed);
throw DB::Exception(
DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
"Memory tracker{}{}: fault injected. Would use {} (attempt to allocate chunk of {} bytes), maximum: {}",
description ? " " : "",
"{}{}: fault injected. Would use {} (attempt to allocate chunk of {} bytes), maximum: {}",
description ? description : "",
description ? " memory tracker" : "Memory tracker",
formatReadableSizeWithBinarySuffix(will_be),
size,
formatReadableSizeWithBinarySuffix(current_hard_limit));
@ -305,6 +313,8 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed
if (overcommit_result != OvercommitResult::MEMORY_FREED)
{
bool overcommit_result_ignore
= overcommit_result == OvercommitResult::NONE || overcommit_result == OvercommitResult::DISABLED;
/// Revert
amount.fetch_sub(size, std::memory_order_relaxed);
rss.fetch_sub(size, std::memory_order_relaxed);
@ -314,18 +324,18 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed
ProfileEvents::increment(ProfileEvents::QueryMemoryLimitExceeded);
const auto * description = description_ptr.load(std::memory_order_relaxed);
throw DB::Exception(
DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
"Memory limit{}{} exceeded: "
"would use {} (attempt to allocate chunk of {} bytes), current RSS {}, maximum: {}."
"{}{}",
description ? " " : "",
description ? description : "",
formatReadableSizeWithBinarySuffix(will_be),
size,
formatReadableSizeWithBinarySuffix(rss.load(std::memory_order_relaxed)),
formatReadableSizeWithBinarySuffix(current_hard_limit),
overcommit_result == OvercommitResult::NONE ? "" : " OvercommitTracker decision: ",
toDescription(overcommit_result));
DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
"{}{} exceeded: "
"would use {} (attempt to allocate chunk of {} bytes), current RSS {}, maximum: {}."
"{}{}",
description ? description : "",
description ? " memory limit" : "Memory limit",
formatReadableSizeWithBinarySuffix(will_be),
size,
formatReadableSizeWithBinarySuffix(rss.load(std::memory_order_relaxed)),
formatReadableSizeWithBinarySuffix(current_hard_limit),
overcommit_result_ignore ? "" : " OvercommitTracker decision: ",
overcommit_result_ignore ? "" : toDescription(overcommit_result));
}
// If OvercommitTracker::needToStopQuery returned false, it guarantees that enough memory is freed.
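The reworded `MemoryTracker` messages put the tracker description first and change the capitalisation of the following word depending on whether a description is present. A minimal sketch of that composition, using plain string concatenation instead of the `{}`-style formatting in the diff; `peakUsageMessage` is a hypothetical helper, not a ClickHouse function.

```cpp
#include <string>

// Hypothetical helper mirroring the new format string "{}{} memory usage: {}."
// from logPeakMemoryUsage(); `readable_size` stands in for ReadableSize(peak_bytes).
inline std::string peakUsageMessage(const char * description, const std::string & readable_size)
{
    // With a description: "<description> peak memory usage: ...".
    // Without one: "Peak memory usage: ...".
    const std::string prefix = description ? std::string(description) : std::string();
    const std::string word = description ? " peak" : "Peak";
    return prefix + word + " memory usage: " + readable_size + ".";
}
```

For example, a tracker described as `Query` would yield `Query peak memory usage: 1.00 MiB.`, while a tracker without a description yields `Peak memory usage: 1.00 MiB.`.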

View File

@ -6,6 +6,7 @@
/// Separate type (rather than `Int64`) is used just to avoid implicit conversion errors and to default-initialize
struct Priority
{
Int64 value = 0; /// Note that lower value means higher priority.
constexpr operator Int64() const { return value; } /// NOLINT
using Value = Int64;
Value value = 0; /// Note that lower value means higher priority.
constexpr operator Value() const { return value; } /// NOLINT
};
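The new `Priority::Value` alias lets callers name the underlying integer type without hard-coding `Int64`. A small sketch of how that might be used; `PrioritySketch` and `isHigherPriority` are hypothetical names, and `Int64` is assumed here to be `std::int64_t`.

```cpp
#include <cstdint>

using Int64 = std::int64_t; // assumption: stand-in for ClickHouse's Int64

// Hypothetical copy of the struct after the change.
struct PrioritySketch
{
    using Value = Int64;
    Value value = 0; // lower value means higher priority
    constexpr operator Value() const { return value; }
};

// Compares two priorities following the "lower value wins" convention.
constexpr bool isHigherPriority(PrioritySketch lhs, PrioritySketch rhs)
{
    return lhs.value < rhs.value;
}
```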

View File

@ -26,6 +26,9 @@ class IClassifier : private boost::noncopyable
public:
virtual ~IClassifier() = default;
/// Returns true iff resource access is allowed by this classifier
virtual bool has(const String & resource_name) = 0;
/// Returns ResourceLink that should be used to access resource.
/// Returned link is valid until classifier destruction.
virtual ResourceLink get(const String & resource_name) = 0;
@ -46,12 +49,15 @@ public:
/// Initialize or reconfigure manager.
virtual void updateConfiguration(const Poco::Util::AbstractConfiguration & config) = 0;
/// Returns true iff given resource is controlled through this manager.
virtual bool hasResource(const String & resource_name) const = 0;
/// Obtain a classifier instance required to get access to resources.
/// Note that it holds resource configuration, so should be destructed when query is done.
virtual ClassifierPtr acquire(const String & classifier_name) = 0;
/// For introspection, see `system.scheduler` table
using VisitorFunc = std::function<void(const String & resource, const String & path, const String & type, const SchedulerNodePtr & node)>;
using VisitorFunc = std::function<void(const String & resource, const String & path, ISchedulerNode * node)>;
virtual void forEachNode(VisitorFunc visitor) = 0;
};

View File

@ -15,8 +15,7 @@ namespace DB
* When constraint is again satisfied, scheduleActivation() is called from finishRequest().
*
* Derived class behaviour requirements:
* - dequeueRequest() must fill `request->constraint` iff it is nullptr;
* - finishRequest() must be recursive: call to `parent_constraint->finishRequest()`.
* - dequeueRequest() must call `request->addConstraint()`.
*/
class ISchedulerConstraint : public ISchedulerNode
{
@ -25,34 +24,16 @@ public:
: ISchedulerNode(event_queue_, config, config_prefix)
{}
ISchedulerConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
: ISchedulerNode(event_queue_, info_)
{}
/// Resource consumption by `request` is finished.
/// Should be called outside of scheduling subsystem, implementation must be thread-safe.
virtual void finishRequest(ResourceRequest * request) = 0;
void setParent(ISchedulerNode * parent_) override
{
ISchedulerNode::setParent(parent_);
// Assign `parent_constraint` to the nearest parent derived from ISchedulerConstraint
for (ISchedulerNode * node = parent_; node != nullptr; node = node->parent)
{
if (auto * constraint = dynamic_cast<ISchedulerConstraint *>(node))
{
parent_constraint = constraint;
break;
}
}
}
/// For introspection of current state (true = satisfied, false = violated)
virtual bool isSatisfied() = 0;
protected:
// Reference to nearest parent that is also derived from ISchedulerConstraint.
// Request can traverse through multiple constraints while being dequeue from hierarchy,
// while finishing request should traverse the same chain in reverse order.
// NOTE: it must be immutable after initialization, because it is accessed in not thread-safe way from finishRequest()
ISchedulerConstraint * parent_constraint = nullptr;
};
}

View File

@ -57,7 +57,13 @@ struct SchedulerNodeInfo
SchedulerNodeInfo() = default;
explicit SchedulerNodeInfo(const Poco::Util::AbstractConfiguration & config = emptyConfig(), const String & config_prefix = {})
explicit SchedulerNodeInfo(double weight_, Priority priority_ = {})
{
setWeight(weight_);
setPriority(priority_);
}
explicit SchedulerNodeInfo(const Poco::Util::AbstractConfiguration & config, const String & config_prefix = {})
{
setWeight(config.getDouble(config_prefix + ".weight", weight));
setPriority(config.getInt64(config_prefix + ".priority", priority));
@ -68,7 +74,7 @@ struct SchedulerNodeInfo
if (value <= 0 || !isfinite(value))
throw Exception(
ErrorCodes::INVALID_SCHEDULER_NODE,
"Negative and non-finite node weights are not allowed: {}",
"Zero, negative and non-finite node weights are not allowed: {}",
value);
weight = value;
}
@ -78,6 +84,11 @@ struct SchedulerNodeInfo
priority.value = value;
}
void setPriority(Priority value)
{
priority = value;
}
// To check if configuration update required
bool equals(const SchedulerNodeInfo & o) const
{
@ -123,7 +134,14 @@ public:
, info(config, config_prefix)
{}
virtual ~ISchedulerNode() = default;
ISchedulerNode(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
: event_queue(event_queue_)
, info(info_)
{}
virtual ~ISchedulerNode();
virtual const String & getTypeName() const = 0;
/// Checks if two nodes configuration is equal
virtual bool equals(ISchedulerNode * other)
@ -134,10 +152,11 @@ public:
/// Attach new child
virtual void attachChild(const std::shared_ptr<ISchedulerNode> & child) = 0;
/// Detach and destroy child
/// Detach child
/// NOTE: child might be destroyed if the only reference was stored in parent
virtual void removeChild(ISchedulerNode * child) = 0;
/// Get attached child by name
/// Get attached child by name (for tests only)
virtual ISchedulerNode * getChild(const String & child_name) = 0;
/// Activation of child due to the first pending request
@ -147,7 +166,7 @@ public:
/// Returns true iff node is active
virtual bool isActive() = 0;
/// Returns number of active children
/// Returns number of active children (for introspection only).
virtual size_t activeChildren() = 0;
/// Returns the first request to be executed as the first component of resulting pair.
@ -155,10 +174,10 @@ public:
virtual std::pair<ResourceRequest *, bool> dequeueRequest() = 0;
/// Returns full path string using names of every parent
String getPath()
String getPath() const
{
String result;
ISchedulerNode * ptr = this;
const ISchedulerNode * ptr = this;
while (ptr->parent)
{
result = "/" + ptr->basename + result;
@ -168,10 +187,7 @@ public:
}
/// Attach to a parent (used by attachChild)
virtual void setParent(ISchedulerNode * parent_)
{
parent = parent_;
}
void setParent(ISchedulerNode * parent_);
protected:
/// Notify parents about the first pending request or constraint becoming satisfied.
@ -307,6 +323,15 @@ public:
pending.notify_one();
}
/// Removes an activation from queue
void cancelActivation(ISchedulerNode * node)
{
std::unique_lock lock{mutex};
if (node->is_linked())
activations.erase(activations.iterator_to(*node));
node->activation_event_id = 0;
}
/// Process single event if it exists
/// Note that postponing constraints are ignored; use it to empty the queue, including postponed events, on shutdown
/// Returns `true` iff event has been processed
@ -471,6 +496,20 @@ private:
std::atomic<TimePoint> manual_time{TimePoint()}; // for tests only
};
inline ISchedulerNode::~ISchedulerNode()
{
// Make sure there is no dangling reference in activations queue
event_queue->cancelActivation(this);
}
inline void ISchedulerNode::setParent(ISchedulerNode * parent_)
{
parent = parent_;
// Avoid activation of a detached node
if (parent == nullptr)
event_queue->cancelActivation(this);
}
inline void ISchedulerNode::scheduleActivation()
{
if (likely(parent))
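The `cancelActivation()` additions above make sure a node never stays in the activation queue after it is detached or destroyed. The idea can be sketched with a deliberately simplified model: `EventQueueSketch` and `NodeSketch` are hypothetical types, the queue is a plain set instead of an intrusive list, and the mutex taken by the real `EventQueue` is omitted.

```cpp
#include <cstddef>
#include <unordered_set>

struct NodeSketch;

// Hypothetical, much simplified stand-in for EventQueue:
// it only remembers which nodes have a pending activation.
class EventQueueSketch
{
public:
    void enqueueActivation(NodeSketch * node) { activations.insert(node); }
    void cancelActivation(NodeSketch * node) { activations.erase(node); }
    std::size_t pending() const { return activations.size(); }

private:
    std::unordered_set<NodeSketch *> activations;
};

struct NodeSketch
{
    explicit NodeSketch(EventQueueSketch * queue_) : queue(queue_) {}

    // Make sure no dangling pointer is left in the activation queue.
    ~NodeSketch() { queue->cancelActivation(this); }

    void setParent(NodeSketch * parent_)
    {
        parent = parent_;
        // A detached node must not be activated later.
        if (parent == nullptr)
            queue->cancelActivation(this);
    }

    EventQueueSketch * queue;
    NodeSketch * parent = nullptr;
};
```

The design choice is that cancellation is idempotent: calling it for a node with no pending activation is a no-op, so both the destructor and `setParent(nullptr)` can call it unconditionally.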

View File

@ -21,6 +21,10 @@ public:
: ISchedulerNode(event_queue_, config, config_prefix)
{}
ISchedulerQueue(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
: ISchedulerNode(event_queue_, info_)
{}
// Wrapper for `enqueueRequest()` that should be used to account for available resource budget
// Returns `estimated_cost` that should be passed later to `adjustBudget()`
[[ nodiscard ]] ResourceCost enqueueRequestUsingBudget(ResourceRequest * request)
@ -47,6 +51,11 @@ public:
/// Should be called outside of scheduling subsystem, implementation must be thread-safe.
virtual bool cancelRequest(ResourceRequest * request) = 0;
/// Fails all the resource requests in queue and marks this queue as not usable.
/// Afterwards any new request will be failed on `enqueueRequest()`.
/// NOTE: This is done for queues that are about to be destructed.
virtual void purgeQueue() = 0;
/// For introspection
ResourceCost getBudget() const
{

View File

@ -5,11 +5,6 @@
namespace DB
{
namespace ErrorCodes
{
extern const int RESOURCE_NOT_FOUND;
}
ClassifierDescription::ClassifierDescription(const Poco::Util::AbstractConfiguration & config, const String & config_prefix)
{
Poco::Util::AbstractConfiguration::Keys keys;
@ -31,9 +26,11 @@ ClassifiersConfig::ClassifiersConfig(const Poco::Util::AbstractConfiguration & c
const ClassifierDescription & ClassifiersConfig::get(const String & classifier_name)
{
static ClassifierDescription empty;
if (auto it = classifiers.find(classifier_name); it != classifiers.end())
return it->second;
throw Exception(ErrorCodes::RESOURCE_NOT_FOUND, "Unknown workload classifier '{}' to access resources", classifier_name);
else
return empty;
}
}
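`ClassifiersConfig::get()` now returns a shared empty description for an unknown classifier instead of throwing `RESOURCE_NOT_FOUND`. A minimal sketch of that lookup pattern, using hypothetical `*Sketch` types and a hypothetical `add()` helper in place of the configuration-driven constructor:

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for ClassifierDescription: resource name -> path string.
using ClassifierDescriptionSketch = std::unordered_map<std::string, std::string>;

class ClassifiersConfigSketch
{
public:
    // Hypothetical helper; the real class is filled from server configuration.
    void add(const std::string & name, ClassifierDescriptionSketch description)
    {
        classifiers[name] = std::move(description);
    }

    // Unknown classifiers map to an empty description rather than an exception.
    const ClassifierDescriptionSketch & get(const std::string & classifier_name) const
    {
        static const ClassifierDescriptionSketch empty;
        if (auto it = classifiers.find(classifier_name); it != classifiers.end())
            return it->second;
        return empty;
    }

private:
    std::unordered_map<std::string, ClassifierDescriptionSketch> classifiers;
};
```

Returning a reference to a function-local `static` keeps the signature unchanged (`const &`) while avoiding a dangling reference for the "not found" case.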

View File

@ -10,6 +10,7 @@ namespace DB
/// Mapping of resource name into path string (e.g. "disk1" -> "/path/to/class")
struct ClassifierDescription : std::unordered_map<String, String>
{
ClassifierDescription() = default;
ClassifierDescription(const Poco::Util::AbstractConfiguration & config, const String & config_prefix);
};

View File

@ -1,7 +1,6 @@
#include <Common/Scheduler/Nodes/DynamicResourceManager.h>
#include <Common/Scheduler/Nodes/CustomResourceManager.h>
#include <Common/Scheduler/Nodes/SchedulerNodeFactory.h>
#include <Common/Scheduler/ResourceManagerFactory.h>
#include <Common/Scheduler/ISchedulerQueue.h>
#include <Common/Exception.h>
@ -21,7 +20,7 @@ namespace ErrorCodes
extern const int INVALID_SCHEDULER_NODE;
}
DynamicResourceManager::State::State(EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config)
CustomResourceManager::State::State(EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config)
: classifiers(config)
{
Poco::Util::AbstractConfiguration::Keys keys;
@ -35,7 +34,7 @@ DynamicResourceManager::State::State(EventQueue * event_queue, const Poco::Util:
}
}
DynamicResourceManager::State::Resource::Resource(
CustomResourceManager::State::Resource::Resource(
const String & name,
EventQueue * event_queue,
const Poco::Util::AbstractConfiguration & config,
@ -92,7 +91,7 @@ DynamicResourceManager::State::Resource::Resource(
throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "undefined root node path '/' for resource '{}'", name);
}
DynamicResourceManager::State::Resource::~Resource()
CustomResourceManager::State::Resource::~Resource()
{
// NOTE: we should rely on `attached_to` and cannot use `parent`,
// NOTE: because `parent` can be `nullptr` in case attachment is still in event queue
@ -106,14 +105,14 @@ DynamicResourceManager::State::Resource::~Resource()
}
}
DynamicResourceManager::State::Node::Node(const String & name, EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix)
CustomResourceManager::State::Node::Node(const String & name, EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix)
: type(config.getString(config_prefix + ".type", "fifo"))
, ptr(SchedulerNodeFactory::instance().get(type, event_queue, config, config_prefix))
{
ptr->basename = name;
}
bool DynamicResourceManager::State::Resource::equals(const DynamicResourceManager::State::Resource & o) const
bool CustomResourceManager::State::Resource::equals(const CustomResourceManager::State::Resource & o) const
{
if (nodes.size() != o.nodes.size())
return false;
@ -130,14 +129,14 @@ bool DynamicResourceManager::State::Resource::equals(const DynamicResourceManage
return true;
}
bool DynamicResourceManager::State::Node::equals(const DynamicResourceManager::State::Node & o) const
bool CustomResourceManager::State::Node::equals(const CustomResourceManager::State::Node & o) const
{
if (type != o.type)
return false;
return ptr->equals(o.ptr.get());
}
DynamicResourceManager::Classifier::Classifier(const DynamicResourceManager::StatePtr & state_, const String & classifier_name)
CustomResourceManager::Classifier::Classifier(const CustomResourceManager::StatePtr & state_, const String & classifier_name)
: state(state_)
{
// State is immutable, but nodes are mutable and thread-safe
@ -162,20 +161,25 @@ DynamicResourceManager::Classifier::Classifier(const DynamicResourceManager::Sta
}
}
ResourceLink DynamicResourceManager::Classifier::get(const String & resource_name)
bool CustomResourceManager::Classifier::has(const String & resource_name)
{
return resources.contains(resource_name);
}
ResourceLink CustomResourceManager::Classifier::get(const String & resource_name)
{
if (auto iter = resources.find(resource_name); iter != resources.end())
return iter->second;
throw Exception(ErrorCodes::RESOURCE_ACCESS_DENIED, "Access denied to resource '{}'", resource_name);
}
DynamicResourceManager::DynamicResourceManager()
CustomResourceManager::CustomResourceManager()
: state(new State())
{
scheduler.start();
}
void DynamicResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration & config)
void CustomResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration & config)
{
StatePtr new_state = std::make_shared<State>(scheduler.event_queue, config);
@ -217,7 +221,13 @@ void DynamicResourceManager::updateConfiguration(const Poco::Util::AbstractConfi
// NOTE: after mutex unlock `state` became available for Classifier(s) and must be immutable
}
ClassifierPtr DynamicResourceManager::acquire(const String & classifier_name)
bool CustomResourceManager::hasResource(const String & resource_name) const
{
std::lock_guard lock{mutex};
return state->resources.contains(resource_name);
}
ClassifierPtr CustomResourceManager::acquire(const String & classifier_name)
{
// Acquire a reference to the current state
StatePtr state_ref;
@ -229,7 +239,7 @@ ClassifierPtr DynamicResourceManager::acquire(const String & classifier_name)
return std::make_shared<Classifier>(state_ref, classifier_name);
}
void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
void CustomResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
{
// Acquire a reference to the current state
StatePtr state_ref;
@ -244,7 +254,7 @@ void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
{
for (auto & [name, resource] : state_ref->resources)
for (auto & [path, node] : resource->nodes)
visitor(name, path, node.type, node.ptr);
visitor(name, path, node.ptr.get());
promise.set_value();
});
@ -252,9 +262,4 @@ void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
future.get();
}
void registerDynamicResourceManager(ResourceManagerFactory & factory)
{
factory.registerMethod<DynamicResourceManager>("dynamic");
}
}

View File

@ -10,7 +10,9 @@ namespace DB
{
/*
* Implementation of `IResourceManager` supporting arbitrary dynamic hierarchy of scheduler nodes.
* Implementation of `IResourceManager` supporting arbitrary hierarchy of scheduler nodes.
* Scheduling hierarchies for every resource are described through server xml or yaml configuration.
* Configuration could be changed dynamically without server restart.
* All resources are controlled by single root `SchedulerRoot`.
*
* State of manager is set of resources attached to the scheduler. States are referenced by classifiers.
@ -24,11 +26,12 @@ namespace DB
* violation will apply to fairness. Old version exists as long as there is at least one classifier
* instance referencing it. Classifiers are typically attached to queries and will be destructed with them.
*/
class DynamicResourceManager : public IResourceManager
class CustomResourceManager : public IResourceManager
{
public:
DynamicResourceManager();
CustomResourceManager();
void updateConfiguration(const Poco::Util::AbstractConfiguration & config) override;
bool hasResource(const String & resource_name) const override;
ClassifierPtr acquire(const String & classifier_name) override;
void forEachNode(VisitorFunc visitor) override;
@ -79,6 +82,7 @@ private:
{
public:
Classifier(const StatePtr & state_, const String & classifier_name);
bool has(const String & resource_name) override;
ResourceLink get(const String & resource_name) override;
private:
std::unordered_map<String, ResourceLink> resources; // accessible resources by names
@ -86,7 +90,7 @@ private:
};
SchedulerRoot scheduler;
std::mutex mutex;
mutable std::mutex mutex;
StatePtr state;
};

View File

@ -28,7 +28,7 @@ namespace ErrorCodes
* of a child is set to vruntime of "start" of the last request. This guarantees immediate processing
* of at least single request of newly activated children and thus best isolation and scheduling latency.
*/
class FairPolicy : public ISchedulerNode
class FairPolicy final : public ISchedulerNode
{
/// Scheduling state of a child
struct Item
@ -48,6 +48,23 @@ public:
: ISchedulerNode(event_queue_, config, config_prefix)
{}
FairPolicy(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
: ISchedulerNode(event_queue_, info_)
{}
~FairPolicy() override
{
// We need to clear `parent` in all children to avoid dangling references
while (!children.empty())
removeChild(children.begin()->second.get());
}
const String & getTypeName() const override
{
static String type_name("fair");
return type_name;
}
bool equals(ISchedulerNode * other) override
{
if (!ISchedulerNode::equals(other))

View File

@ -23,13 +23,28 @@ namespace ErrorCodes
/*
* FIFO queue to hold pending resource requests
*/
class FifoQueue : public ISchedulerQueue
class FifoQueue final : public ISchedulerQueue
{
public:
FifoQueue(EventQueue * event_queue_, const Poco::Util::AbstractConfiguration & config, const String & config_prefix)
: ISchedulerQueue(event_queue_, config, config_prefix)
{}
FifoQueue(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
: ISchedulerQueue(event_queue_, info_)
{}
~FifoQueue() override
{
purgeQueue();
}
const String & getTypeName() const override
{
static String type_name("fifo");
return type_name;
}
bool equals(ISchedulerNode * other) override
{
if (!ISchedulerNode::equals(other))
@ -42,6 +57,8 @@ public:
void enqueueRequest(ResourceRequest * request) override
{
std::lock_guard lock(mutex);
if (is_not_usable)
throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Scheduler queue is about to be destructed");
queue_cost += request->cost;
bool was_empty = requests.empty();
requests.push_back(*request);
@ -66,6 +83,8 @@ public:
bool cancelRequest(ResourceRequest * request) override
{
std::lock_guard lock(mutex);
if (is_not_usable)
return false; // Any request should already be failed or executed
if (request->is_linked())
{
// It's impossible to check that `request` is indeed inserted to this queue and not another queue.
@ -88,6 +107,19 @@ public:
return false;
}
void purgeQueue() override
{
std::lock_guard lock(mutex);
is_not_usable = true;
while (!requests.empty())
{
ResourceRequest * request = &requests.front();
requests.pop_front();
request->failed(std::make_exception_ptr(
Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Scheduler queue with resource request is about to be destructed")));
}
}
bool isActive() override
{
std::lock_guard lock(mutex);
@ -131,6 +163,7 @@ private:
std::mutex mutex;
Int64 queue_cost = 0;
boost::intrusive::list<ResourceRequest> requests;
bool is_not_usable = false;
};
}
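
The `purgeQueue()` change above marks the queue unusable and fails every pending request, so nothing keeps waiting on a node that is about to be destructed. Below is a minimal standalone sketch of that pattern. `Request` and `PurgeableQueue` are hypothetical stand-ins (not the ClickHouse `ResourceRequest` / `FifoQueue` types), and a `std::deque` replaces the intrusive list.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <exception>
#include <mutex>
#include <stdexcept>
#include <utility>

// Hypothetical request type: it only records the failure it was completed with.
struct Request
{
    std::exception_ptr error;
    void failed(std::exception_ptr e) { error = std::move(e); }
};

// Hypothetical queue illustrating the "purge, then reject" behaviour.
class PurgeableQueue
{
public:
    void enqueue(Request * request)
    {
        std::lock_guard<std::mutex> lock(mutex);
        if (is_not_usable)
            throw std::runtime_error("queue is about to be destructed");
        requests.push_back(request);
    }

    // Fails all pending requests and rejects any later enqueue() call.
    void purge()
    {
        std::lock_guard<std::mutex> lock(mutex);
        is_not_usable = true;
        while (!requests.empty())
        {
            Request * request = requests.front();
            requests.pop_front();
            request->failed(std::make_exception_ptr(std::runtime_error("queue purged")));
        }
    }

    std::size_t size()
    {
        std::lock_guard<std::mutex> lock(mutex);
        return requests.size();
    }

private:
    std::mutex mutex;
    std::deque<Request *> requests;
    bool is_not_usable = false;
};
```

Setting the flag under the same mutex that guards the list means a concurrent `enqueue()` either lands before the purge (and is failed by it) or observes the flag and is rejected.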

View File

@ -0,0 +1,532 @@
#include <Common/Scheduler/Nodes/IOResourceManager.h>
#include <Common/Scheduler/Nodes/FifoQueue.h>
#include <Common/Scheduler/Nodes/FairPolicy.h>
#include <Common/logger_useful.h>
#include <Common/Exception.h>
#include <Common/StringUtils.h>
#include <Common/assert_cast.h>
#include <Common/typeid_cast.h>
#include <Common/Priority.h>
#include <Parsers/ASTCreateWorkloadQuery.h>
#include <Parsers/ASTCreateResourceQuery.h>
#include <memory>
#include <mutex>
#include <map>
namespace DB
{
namespace ErrorCodes
{
extern const int RESOURCE_NOT_FOUND;
extern const int INVALID_SCHEDULER_NODE;
extern const int LOGICAL_ERROR;
}
namespace
{
String getEntityName(const ASTPtr & ast)
{
if (auto * create = typeid_cast<ASTCreateWorkloadQuery *>(ast.get()))
return create->getWorkloadName();
if (auto * create = typeid_cast<ASTCreateResourceQuery *>(ast.get()))
return create->getResourceName();
return "unknown-workload-entity";
}
}
IOResourceManager::NodeInfo::NodeInfo(const ASTPtr & ast, const String & resource_name)
{
auto * create = assert_cast<ASTCreateWorkloadQuery *>(ast.get());
name = create->getWorkloadName();
parent = create->getWorkloadParent();
settings.updateFromChanges(create->changes, resource_name);
}
IOResourceManager::Resource::Resource(const ASTPtr & resource_entity_)
: resource_entity(resource_entity_)
, resource_name(getEntityName(resource_entity))
{
scheduler.start();
}
IOResourceManager::Resource::~Resource()
{
scheduler.stop();
}
void IOResourceManager::Resource::createNode(const NodeInfo & info)
{
if (info.name.empty())
throw Exception(ErrorCodes::LOGICAL_ERROR, "Workload must have a name in resource '{}'",
resource_name);
if (info.name == info.parent)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Self-referencing workload '{}' is not allowed in resource '{}'",
info.name, resource_name);
if (node_for_workload.contains(info.name))
throw Exception(ErrorCodes::LOGICAL_ERROR, "Node for creating workload '{}' already exists in resource '{}'",
info.name, resource_name);
if (!info.parent.empty() && !node_for_workload.contains(info.parent))
throw Exception(ErrorCodes::LOGICAL_ERROR, "Parent node '{}' for creating workload '{}' does not exist in resource '{}'",
info.parent, info.name, resource_name);
if (info.parent.empty() && root_node)
throw Exception(ErrorCodes::LOGICAL_ERROR, "The second root workload '{}' is not allowed (current root '{}') in resource '{}'",
info.name, root_node->basename, resource_name);
executeInSchedulerThread([&, this]
{
auto node = std::make_shared<UnifiedSchedulerNode>(scheduler.event_queue, info.settings);
node->basename = info.name;
if (!info.parent.empty())
node_for_workload[info.parent]->attachUnifiedChild(node);
else
{
root_node = node;
scheduler.attachChild(root_node);
}
node_for_workload[info.name] = node;
updateCurrentVersion();
});
}
void IOResourceManager::Resource::deleteNode(const NodeInfo & info)
{
if (!node_for_workload.contains(info.name))
throw Exception(ErrorCodes::LOGICAL_ERROR, "Node for removing workload '{}' does not exist in resource '{}'",
info.name, resource_name);
if (!info.parent.empty() && !node_for_workload.contains(info.parent))
throw Exception(ErrorCodes::LOGICAL_ERROR, "Parent node '{}' for removing workload '{}' does not exist in resource '{}'",
info.parent, info.name, resource_name);
auto node = node_for_workload[info.name];
if (node->hasUnifiedChildren())
throw Exception(ErrorCodes::LOGICAL_ERROR, "Removing workload '{}' with children in resource '{}'",
info.name, resource_name);
executeInSchedulerThread([&]
{
if (!info.parent.empty())
node_for_workload[info.parent]->detachUnifiedChild(node);
else
{
chassert(node == root_node);
scheduler.removeChild(root_node.get());
root_node.reset();
}
node_for_workload.erase(info.name);
updateCurrentVersion();
});
}
void IOResourceManager::Resource::updateNode(const NodeInfo & old_info, const NodeInfo & new_info)
{
if (old_info.name != new_info.name)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Updating a name of workload '{}' to '{}' is not allowed in resource '{}'",
old_info.name, new_info.name, resource_name);
if (old_info.parent != new_info.parent && (old_info.parent.empty() || new_info.parent.empty()))
throw Exception(ErrorCodes::LOGICAL_ERROR, "Workload '{}' invalid update of parent from '{}' to '{}' in resource '{}'",
old_info.name, old_info.parent, new_info.parent, resource_name);
if (!node_for_workload.contains(old_info.name))
throw Exception(ErrorCodes::LOGICAL_ERROR, "Node for updating workload '{}' does not exist in resource '{}'",
old_info.name, resource_name);
if (!old_info.parent.empty() && !node_for_workload.contains(old_info.parent))
throw Exception(ErrorCodes::LOGICAL_ERROR, "Old parent node '{}' for updating workload '{}' does not exist in resource '{}'",
old_info.parent, old_info.name, resource_name);
if (!new_info.parent.empty() && !node_for_workload.contains(new_info.parent))
throw Exception(ErrorCodes::LOGICAL_ERROR, "New parent node '{}' for updating workload '{}' does not exist in resource '{}'",
new_info.parent, new_info.name, resource_name);
executeInSchedulerThread([&, this]
{
auto node = node_for_workload[old_info.name];
bool detached = false;
if (UnifiedSchedulerNode::updateRequiresDetach(old_info.parent, new_info.parent, old_info.settings, new_info.settings))
{
if (!old_info.parent.empty())
node_for_workload[old_info.parent]->detachUnifiedChild(node);
detached = true;
}
node->updateSchedulingSettings(new_info.settings);
if (detached)
{
if (!new_info.parent.empty())
node_for_workload[new_info.parent]->attachUnifiedChild(node);
}
updateCurrentVersion();
});
}
void IOResourceManager::Resource::updateCurrentVersion()
{
auto previous_version = current_version;
// Create a full list of constraints and queues in the current hierarchy
current_version = std::make_shared<Version>();
if (root_node)
root_node->addRawPointerNodes(current_version->nodes);
// See details in version control section of description in IOResourceManager.h
if (previous_version)
{
previous_version->newer_version = current_version;
previous_version.reset(); // Destroys previous version nodes if there are no classifiers referencing it
}
}
IOResourceManager::Workload::Workload(IOResourceManager * resource_manager_, const ASTPtr & workload_entity_)
: resource_manager(resource_manager_)
, workload_entity(workload_entity_)
{
try
{
for (auto & [resource_name, resource] : resource_manager->resources)
resource->createNode(NodeInfo(workload_entity, resource_name));
}
catch (...)
{
throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}",
getCurrentExceptionMessage(/* with_stacktrace = */ true));
}
}
IOResourceManager::Workload::~Workload()
{
try
{
for (auto & [resource_name, resource] : resource_manager->resources)
resource->deleteNode(NodeInfo(workload_entity, resource_name));
}
catch (...)
{
throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}",
getCurrentExceptionMessage(/* with_stacktrace = */ true));
}
}
void IOResourceManager::Workload::updateWorkload(const ASTPtr & new_entity)
{
try
{
for (auto & [resource_name, resource] : resource_manager->resources)
resource->updateNode(NodeInfo(workload_entity, resource_name), NodeInfo(new_entity, resource_name));
workload_entity = new_entity;
}
catch (...)
{
throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}",
getCurrentExceptionMessage(/* with_stacktrace = */ true));
}
}
String IOResourceManager::Workload::getParent() const
{
return assert_cast<ASTCreateWorkloadQuery *>(workload_entity.get())->getWorkloadParent();
}
IOResourceManager::IOResourceManager(IWorkloadEntityStorage & storage_)
: storage(storage_)
, log{getLogger("IOResourceManager")}
{
subscription = storage.getAllEntitiesAndSubscribe(
[this] (const std::vector<IWorkloadEntityStorage::Event> & events)
{
for (const auto & [entity_type, entity_name, entity] : events)
{
switch (entity_type)
{
case WorkloadEntityType::Workload:
{
if (entity)
createOrUpdateWorkload(entity_name, entity);
else
deleteWorkload(entity_name);
break;
}
case WorkloadEntityType::Resource:
{
if (entity)
createOrUpdateResource(entity_name, entity);
else
deleteResource(entity_name);
break;
}
case WorkloadEntityType::MAX: break;
}
}
});
}
IOResourceManager::~IOResourceManager()
{
subscription.reset();
resources.clear();
workloads.clear();
}
void IOResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration &)
{
// No-op
}
void IOResourceManager::createOrUpdateWorkload(const String & workload_name, const ASTPtr & ast)
{
std::unique_lock lock{mutex};
if (auto workload_iter = workloads.find(workload_name); workload_iter != workloads.end())
workload_iter->second->updateWorkload(ast);
else
workloads.emplace(workload_name, std::make_shared<Workload>(this, ast));
}
void IOResourceManager::deleteWorkload(const String & workload_name)
{
std::unique_lock lock{mutex};
if (auto workload_iter = workloads.find(workload_name); workload_iter != workloads.end())
{
// Note that we rely on the fact that workload entity storage will not drop workload that is used as a parent
workloads.erase(workload_iter);
}
else // Workload to be deleted does not exist -- do nothing, throwing exceptions from a subscription is pointless
LOG_ERROR(log, "Delete workload that doesn't exist: {}", workload_name);
}
void IOResourceManager::createOrUpdateResource(const String & resource_name, const ASTPtr & ast)
{
std::unique_lock lock{mutex};
if (auto resource_iter = resources.find(resource_name); resource_iter != resources.end())
resource_iter->second->updateResource(ast);
else
{
// Add all workloads into the new resource
auto resource = std::make_shared<Resource>(ast);
for (Workload * workload : topologicallySortedWorkloads())
resource->createNode(NodeInfo(workload->workload_entity, resource_name));
// Attach the resource
resources.emplace(resource_name, resource);
}
}
void IOResourceManager::deleteResource(const String & resource_name)
{
std::unique_lock lock{mutex};
if (auto resource_iter = resources.find(resource_name); resource_iter != resources.end())
{
resources.erase(resource_iter);
}
else // Resource to be deleted does not exist -- do nothing, throwing exceptions from a subscription is pointless
LOG_ERROR(log, "Delete resource that doesn't exist: {}", resource_name);
}
IOResourceManager::Classifier::~Classifier()
{
// Detach classifier from all resources in parallel (executed in every scheduler thread)
std::vector<std::future<void>> futures;
{
std::unique_lock lock{mutex};
futures.reserve(attachments.size());
for (auto & [resource_name, attachment] : attachments)
{
futures.emplace_back(attachment.resource->detachClassifier(std::move(attachment.version)));
attachment.link.reset(); // Just in case because it is not valid any longer
}
}
// Wait for all tasks to finish (to avoid races in case of exceptions)
for (auto & future : futures)
future.wait();
// There should not be any exceptions because it just destructs a few objects, but let's rethrow just in case
for (auto & future : futures)
future.get();
// This unreferences and probably destroys `Resource` objects.
// NOTE: We cannot do it in the scheduler threads (because thread cannot join itself).
attachments.clear();
}
std::future<void> IOResourceManager::Resource::detachClassifier(VersionPtr && version)
{
auto detach_promise = std::make_shared<std::promise<void>>(); // event queue task is std::function, which requires copy semantics
auto future = detach_promise->get_future();
scheduler.event_queue->enqueue([detached_version = std::move(version), promise = std::move(detach_promise)] mutable
{
try
{
// Unreferences and probably destroys the version and scheduler nodes it owns.
// The main reason for moving destruction into the scheduler thread is to
// free memory in the same thread it was allocated to avoid memtrackers drift.
detached_version.reset();
promise->set_value();
}
catch (...)
{
promise->set_exception(std::current_exception());
}
});
return future;
}
bool IOResourceManager::Classifier::has(const String & resource_name)
{
std::unique_lock lock{mutex};
return attachments.contains(resource_name);
}
ResourceLink IOResourceManager::Classifier::get(const String & resource_name)
{
std::unique_lock lock{mutex};
if (auto iter = attachments.find(resource_name); iter != attachments.end())
return iter->second.link;
else
throw Exception(ErrorCodes::RESOURCE_NOT_FOUND, "Access denied to resource '{}'", resource_name);
}
void IOResourceManager::Classifier::attach(const ResourcePtr & resource, const VersionPtr & version, ResourceLink link)
{
std::unique_lock lock{mutex};
chassert(!attachments.contains(resource->getName()));
attachments[resource->getName()] = Attachment{.resource = resource, .version = version, .link = link};
}
void IOResourceManager::Resource::updateResource(const ASTPtr & new_resource_entity)
{
chassert(getEntityName(new_resource_entity) == resource_name);
resource_entity = new_resource_entity;
}
std::future<void> IOResourceManager::Resource::attachClassifier(Classifier & classifier, const String & workload_name)
{
auto attach_promise = std::make_shared<std::promise<void>>(); // event queue task is std::function, which requires copy semantics
auto future = attach_promise->get_future();
scheduler.event_queue->enqueue([&, this, promise = std::move(attach_promise)]
{
try
{
if (auto iter = node_for_workload.find(workload_name); iter != node_for_workload.end())
{
auto queue = iter->second->getQueue();
if (!queue)
throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Unable to use workload '{}' that has children for resource '{}'",
workload_name, resource_name);
classifier.attach(shared_from_this(), current_version, ResourceLink{.queue = queue.get()});
}
else
{
// This resource does not have specified workload. It is either unknown or managed by another resource manager.
// We leave this resource not attached to the classifier. Access denied will be thrown later on `classifier->get(resource_name)`
}
promise->set_value();
}
catch (...)
{
promise->set_exception(std::current_exception());
}
});
return future;
}
bool IOResourceManager::hasResource(const String & resource_name) const
{
std::unique_lock lock{mutex};
return resources.contains(resource_name);
}
ClassifierPtr IOResourceManager::acquire(const String & workload_name)
{
auto classifier = std::make_shared<Classifier>();
// Attach classifier to all resources in parallel (executed in every scheduler thread)
std::vector<std::future<void>> futures;
{
std::unique_lock lock{mutex};
futures.reserve(resources.size());
for (auto & [resource_name, resource] : resources)
futures.emplace_back(resource->attachClassifier(*classifier, workload_name));
}
// Wait for all tasks to finish (to avoid races in case of exceptions)
for (auto & future : futures)
future.wait();
// Rethrow exceptions if any
for (auto & future : futures)
future.get();
return classifier;
}
void IOResourceManager::Resource::forEachResourceNode(IResourceManager::VisitorFunc & visitor)
{
executeInSchedulerThread([&, this]
{
for (auto & [path, node] : node_for_workload)
{
node->forEachSchedulerNode([&] (ISchedulerNode * scheduler_node)
{
visitor(resource_name, scheduler_node->getPath(), scheduler_node);
});
}
});
}
void IOResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
{
// Copy resources to avoid holding mutex for a long time
std::unordered_map<String, ResourcePtr> resources_copy;
{
std::unique_lock lock{mutex};
resources_copy = resources;
}
/// Run tasks one by one to avoid concurrent calls to visitor
for (auto & [resource_name, resource] : resources_copy)
resource->forEachResourceNode(visitor);
}
void IOResourceManager::topologicallySortedWorkloadsImpl(Workload * workload, std::unordered_set<Workload *> & visited, std::vector<Workload *> & sorted_workloads)
{
if (visited.contains(workload))
return;
visited.insert(workload);
// Recurse into parent (if any)
String parent = workload->getParent();
if (!parent.empty())
{
auto parent_iter = workloads.find(parent);
chassert(parent_iter != workloads.end()); // validations check that all parents exist
topologicallySortedWorkloadsImpl(parent_iter->second.get(), visited, sorted_workloads);
}
sorted_workloads.push_back(workload);
}
std::vector<IOResourceManager::Workload *> IOResourceManager::topologicallySortedWorkloads()
{
std::vector<Workload *> sorted_workloads;
std::unordered_set<Workload *> visited;
for (auto & [workload_name, workload] : workloads)
topologicallySortedWorkloadsImpl(workload.get(), visited, sorted_workloads);
return sorted_workloads;
}
}
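
`topologicallySortedWorkloads()` above orders workloads so that every parent is created before its children when a new resource is attached. The same parents-first traversal can be sketched standalone as follows. `ParentMap`, `visitWorkload` and `parentsFirstOrder` are hypothetical names that work on plain strings instead of `Workload` objects, and, as in the original, every non-empty parent is assumed to exist.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the workload map: workload name -> parent name ("" for the root).
using ParentMap = std::unordered_map<std::string, std::string>;

// Emits `name` only after its parent, so parents always precede children in `sorted`.
void visitWorkload(const ParentMap & parents, const std::string & name,
                   std::unordered_set<std::string> & visited, std::vector<std::string> & sorted)
{
    if (!visited.insert(name).second)
        return; // already visited

    const std::string & parent = parents.at(name);
    if (!parent.empty())
    {
        assert(parents.count(parent) == 1); // assumption: all parents exist
        visitWorkload(parents, parent, visited, sorted);
    }
    sorted.push_back(name);
}

std::vector<std::string> parentsFirstOrder(const ParentMap & parents)
{
    std::vector<std::string> sorted;
    std::unordered_set<std::string> visited;
    for (const auto & [name, parent] : parents)
        visitWorkload(parents, name, visited, sorted);
    return sorted;
}
```

The iteration order of the unordered map does not matter: whichever workload is visited first, the recursion reaches its ancestors before appending it.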

View File

@ -0,0 +1,281 @@
#pragma once
#include <base/defines.h>
#include <base/scope_guard.h>
#include <Common/Logger.h>
#include <Common/Scheduler/SchedulingSettings.h>
#include <Common/Scheduler/IResourceManager.h>
#include <Common/Scheduler/SchedulerRoot.h>
#include <Common/Scheduler/Nodes/UnifiedSchedulerNode.h>
#include <Common/Scheduler/Workload/IWorkloadEntityStorage.h>
#include <Parsers/IAST_fwd.h>
#include <boost/core/noncopyable.hpp>
#include <exception>
#include <memory>
#include <mutex>
#include <future>
#include <unordered_set>
namespace DB
{
/*
* Implementation of `IResourceManager` that creates hierarchy of scheduler nodes according to
* workload entities (WORKLOADs and RESOURCEs). It subscribes for updates in IWorkloadEntityStorage and
* creates hierarchy of UnifiedSchedulerNode identical to the hierarchy of WORKLOADs.
* For every RESOURCE an independent hierarchy of scheduler nodes is created.
*
* Manager processes updates of WORKLOADs and RESOURCEs: CREATE/DROP/ALTER.
* When a RESOURCE is created (dropped) a corresponding scheduler nodes hierarchy is created (destroyed).
* After DROP RESOURCE parts of hierarchy might be kept alive while at least one query uses it.
*
* Manager is specific to IO only because it creates scheduler node hierarchies for RESOURCEs having
* WRITE DISK and/or READ DISK definitions. CPU and memory resources are managed separately.
*
* Classifiers are used (1) to access IO resources and (2) to keep shared ownership of scheduling nodes.
* This allows `ResourceRequest` and `ResourceLink` to hold raw pointers as long as
* `ClassifierPtr` is acquired and held.
*
* === RESOURCE ARCHITECTURE ===
* Let's consider how a single resource is implemented. Every workload is represented by corresponding UnifiedSchedulerNode.
* Every UnifiedSchedulerNode manages its own subtree of ISchedulerNode objects (see details in UnifiedSchedulerNode.h)
* UnifiedSchedulerNode for workload w/o children has a queue, which provides a ResourceLink for consumption.
* Parent of the root workload for a resource is SchedulerRoot with its own scheduler thread.
* So every resource has its dedicated thread for processing of resource requests and other events (see EventQueue).
*
* Here is an example of SQL and corresponding hierarchy of scheduler nodes:
* CREATE RESOURCE my_io_resource (...)
* CREATE WORKLOAD all
* CREATE WORKLOAD production PARENT all
* CREATE WORKLOAD development PARENT all
*
* root - SchedulerRoot (with scheduler thread and EventQueue)
* |
* all - UnifiedSchedulerNode
* |
* p0_fair - FairPolicy (part of parent UnifiedSchedulerNode internal structure)
* / \
* production development - UnifiedSchedulerNode
* | |
* queue queue - FifoQueue (part of parent UnifiedSchedulerNode internal structure)
*
* === UPDATING WORKLOADS ===
* Workload may be created, updated or deleted.
* Updating a child of a workload might lead to updating other workloads:
* 1. Workload itself: its structure depends on settings of children workloads
* (e.g. fifo node of a leaf workload is removed when the first child is added;
* and a fair node is inserted after the first two children are added).
* 2. Other children: for them path to root might be changed (e.g. intermediate priority node is inserted)
*
* === VERSION CONTROL ===
* Versions are created on hierarchy updates and hold ownership of nodes that are used through raw pointers.
* Classifier references the version of every resource it uses. Older versions reference newer versions.
* Here is a diagram explaining version control based on Version objects (for 1 resource):
*
* [nodes] [nodes] [nodes]
* ^ ^ ^
* | | |
* version1 --> version2 -...-> versionN
* ^ ^ ^
* | | |
* old_classifier new_classifier current_version
*
* Previous version should hold reference to a newer version. It is required for proper handling of updates.
* Classifiers that were created for any of the old versions may use nodes of a newer version due to updateNode().
* It may move a queue to a new position in the hierarchy or create/destroy constraints, thus resource requests
* created by old classifier may reference constraints of newer versions through `request->constraints` which
* is filled during dequeueRequest().
*
* === THREADS ===
* scheduler thread:
* - one thread per resource
* - uses event_queue (per resource) for processing w/o holding mutex for every scheduler node
* - handles resource requests
* - node activations
* - scheduler hierarchy updates
* query thread:
* - multiple independent threads
* - send resource requests
* - acquire and release classifiers (via scheduler event queues)
* control thread:
* - modifies workloads and resources through subscription
*
* === SYNCHRONIZATION ===
* List of related sync primitives and their roles:
* IOResourceManager::mutex
* - protects resource manager data structures - resources and workloads
* - serializes control thread actions
* IOResourceManager::Resource::scheduler->event_queue
* - serializes scheduler hierarchy events
* - events are created in control and query threads
* - all events are processed by specific scheduler thread
* - hierarchy-wide actions: requests dequeueing, activations propagation and nodes updates.
* - resource version control management
* FifoQueue::mutex and SemaphoreConstraint::mutex
* - serializes query and scheduler threads on specific node accesses
* - resource request processing: enqueueRequest(), dequeueRequest() and finishRequest()
*/
class IOResourceManager : public IResourceManager
{
public:
explicit IOResourceManager(IWorkloadEntityStorage & storage_);
~IOResourceManager() override;
void updateConfiguration(const Poco::Util::AbstractConfiguration & config) override;
bool hasResource(const String & resource_name) const override;
ClassifierPtr acquire(const String & workload_name) override;
void forEachNode(VisitorFunc visitor) override;
private:
// Forward declarations
struct NodeInfo;
struct Version;
class Resource;
struct Workload;
class Classifier;
friend struct Workload;
using VersionPtr = std::shared_ptr<Version>;
using ResourcePtr = std::shared_ptr<Resource>;
using WorkloadPtr = std::shared_ptr<Workload>;
/// Helper for parsing workload AST for a specific resource
struct NodeInfo
{
String name; // Workload name
String parent; // Name of parent workload
SchedulingSettings settings; // Settings specific for a given resource
NodeInfo(const ASTPtr & ast, const String & resource_name);
};
/// Ownership control for scheduler nodes, which could be referenced by raw pointers
struct Version
{
std::vector<SchedulerNodePtr> nodes;
VersionPtr newer_version;
};
/// Holds a thread and hierarchy of unified scheduler nodes for specific RESOURCE
class Resource : public std::enable_shared_from_this<Resource>, boost::noncopyable
{
public:
explicit Resource(const ASTPtr & resource_entity_);
~Resource();
const String & getName() const { return resource_name; }
/// Hierarchy management
void createNode(const NodeInfo & info);
void deleteNode(const NodeInfo & info);
void updateNode(const NodeInfo & old_info, const NodeInfo & new_info);
/// Updates resource entity
void updateResource(const ASTPtr & new_resource_entity);
/// Updates a classifier to contain a reference for specified workload
std::future<void> attachClassifier(Classifier & classifier, const String & workload_name);
/// Remove classifier reference. This destroys scheduler nodes in proper scheduler thread
std::future<void> detachClassifier(VersionPtr && version);
/// Introspection
void forEachResourceNode(IOResourceManager::VisitorFunc & visitor);
private:
void updateCurrentVersion();
template <class Task>
void executeInSchedulerThread(Task && task)
{
std::promise<void> promise;
auto future = promise.get_future();
scheduler.event_queue->enqueue([&]
{
try
{
task();
promise.set_value();
}
catch (...)
{
promise.set_exception(std::current_exception());
}
});
future.get(); // Blocks until execution is done in the scheduler thread
}
ASTPtr resource_entity;
const String resource_name;
SchedulerRoot scheduler;
// TODO(serxa): consider using resource_manager->mutex + scheduler thread for updates and mutex only for reading to avoid slow acquire/release of classifier
/// These fields should be accessed only by the scheduler thread
std::unordered_map<String, UnifiedSchedulerNodePtr> node_for_workload;
UnifiedSchedulerNodePtr root_node;
VersionPtr current_version;
};
struct Workload : boost::noncopyable
{
IOResourceManager * resource_manager;
ASTPtr workload_entity;
Workload(IOResourceManager * resource_manager_, const ASTPtr & workload_entity_);
~Workload();
void updateWorkload(const ASTPtr & new_entity);
String getParent() const;
};
class Classifier : public IClassifier
{
public:
~Classifier() override;
/// Implements IClassifier interface
/// NOTE: It is called from query threads (possibly multiple)
bool has(const String & resource_name) override;
ResourceLink get(const String & resource_name) override;
/// Attaches/detaches a specific resource
/// NOTE: It is called from scheduler threads (possibly multiple)
void attach(const ResourcePtr & resource, const VersionPtr & version, ResourceLink link);
void detach(const ResourcePtr & resource);
private:
IOResourceManager * resource_manager;
std::mutex mutex;
struct Attachment
{
ResourcePtr resource;
VersionPtr version;
ResourceLink link;
};
std::unordered_map<String, Attachment> attachments; // TSA_GUARDED_BY(mutex);
};
void createOrUpdateWorkload(const String & workload_name, const ASTPtr & ast);
void deleteWorkload(const String & workload_name);
void createOrUpdateResource(const String & resource_name, const ASTPtr & ast);
void deleteResource(const String & resource_name);
// Topological sorting of workloads
void topologicallySortedWorkloadsImpl(Workload * workload, std::unordered_set<Workload *> & visited, std::vector<Workload *> & sorted_workloads);
std::vector<Workload *> topologicallySortedWorkloads();
IWorkloadEntityStorage & storage;
scope_guard subscription;
mutable std::mutex mutex;
std::unordered_map<String, WorkloadPtr> workloads; // TSA_GUARDED_BY(mutex);
std::unordered_map<String, ResourcePtr> resources; // TSA_GUARDED_BY(mutex);
LoggerPtr log;
};
}
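A note on `topologicallySortedWorkloads()` declared above: its implementation is not part of this hunk, so the following is only a minimal sketch of the usual depth-first approach, assuming the goal is that every parent workload comes before its children. The `ParentMapSketch` type and both function names are hypothetical stand-ins, not ClickHouse identifiers, and the sketch assumes the parent relation is acyclic.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the workload registry: workload name -> parent name.
// An empty parent name marks a root workload.
using ParentMapSketch = std::unordered_map<std::string, std::string>;

// Depth-first visit: emit the parent (if it is known) before the workload itself.
void visitWorkloadSketch(
    const std::string & name,
    const ParentMapSketch & parents,
    std::unordered_set<std::string> & visited,
    std::vector<std::string> & sorted)
{
    assert(parents.count(name) == 1);
    if (!visited.insert(name).second)
        return; // already emitted (or currently being visited)

    const std::string & parent = parents.at(name);
    if (!parent.empty() && parents.count(parent) == 1)
        visitWorkloadSketch(parent, parents, visited, sorted);

    sorted.push_back(name);
}

// Returns all workload names ordered so that a parent always precedes its children.
std::vector<std::string> topologicallySortedSketch(const ParentMapSketch & parents)
{
    std::unordered_set<std::string> visited;
    std::vector<std::string> sorted;
    sorted.reserve(parents.size());
    for (const auto & entry : parents)
        visitWorkloadSketch(entry.first, parents, visited, sorted);
    return sorted;
}
```

With such an order, a hierarchy can be rebuilt by attaching nodes one by one, because the parent of each node already exists when the node is processed.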

View File

@ -19,7 +19,7 @@ namespace ErrorCodes
* Scheduler node that implements priority scheduling policy.
* Requests are scheduled in order of priorities.
*/
class PriorityPolicy : public ISchedulerNode
class PriorityPolicy final : public ISchedulerNode
{
/// Scheduling state of a child
struct Item
@ -39,6 +39,23 @@ public:
: ISchedulerNode(event_queue_, config, config_prefix)
{}
explicit PriorityPolicy(EventQueue * event_queue_, const SchedulerNodeInfo & node_info)
: ISchedulerNode(event_queue_, node_info)
{}
~PriorityPolicy() override
{
// We need to clear `parent` in all children to avoid dangling references
while (!children.empty())
removeChild(children.begin()->second.get());
}
const String & getTypeName() const override
{
static String type_name("priority");
return type_name;
}
bool equals(ISchedulerNode * other) override
{
if (!ISchedulerNode::equals(other))

View File

@ -1,5 +1,6 @@
#pragma once
#include "Common/Scheduler/ISchedulerNode.h"
#include <Common/Scheduler/ISchedulerConstraint.h>
#include <mutex>
@ -13,7 +14,7 @@ namespace DB
* Limited concurrency constraint.
* Blocks if either number of concurrent in-flight requests exceeds `max_requests`, or their total cost exceeds `max_cost`
*/
class SemaphoreConstraint : public ISchedulerConstraint
class SemaphoreConstraint final : public ISchedulerConstraint
{
static constexpr Int64 default_max_requests = std::numeric_limits<Int64>::max();
static constexpr Int64 default_max_cost = std::numeric_limits<Int64>::max();
@ -24,6 +25,25 @@ public:
, max_cost(config.getInt64(config_prefix + ".max_cost", config.getInt64(config_prefix + ".max_bytes", default_max_cost)))
{}
SemaphoreConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_, Int64 max_requests_, Int64 max_cost_)
: ISchedulerConstraint(event_queue_, info_)
, max_requests(max_requests_)
, max_cost(max_cost_)
{}
~SemaphoreConstraint() override
{
// We need to clear `parent` in child to avoid dangling references
if (child)
removeChild(child.get());
}
const String & getTypeName() const override
{
static String type_name("inflight_limit");
return type_name;
}
bool equals(ISchedulerNode * other) override
{
if (!ISchedulerNode::equals(other))
@ -68,15 +88,14 @@ public:
if (!request)
return {nullptr, false};
// Request has reference to the first (closest to leaf) `constraint`, which can have `parent_constraint`.
// The former is initialized here dynamically and the latter is initialized once during hierarchy construction.
if (!request->constraint)
request->constraint = this;
// Update state on request arrival
std::unique_lock lock(mutex);
requests++;
cost += request->cost;
if (request->addConstraint(this))
{
// Update state on request arrival
requests++;
cost += request->cost;
}
child_active = child_now_active;
if (!active())
busy_periods++;
@ -86,10 +105,6 @@ public:
void finishRequest(ResourceRequest * request) override
{
// Recursive traverse of parent flow controls in reverse order
if (parent_constraint)
parent_constraint->finishRequest(request);
// Update state on request departure
std::unique_lock lock(mutex);
bool was_active = active();
@ -109,6 +124,32 @@ public:
parent->activateChild(this);
}
/// Update limits.
/// Should be called from the scheduler thread because it could lead to activation or deactivation
void updateConstraints(const SchedulerNodePtr & self, Int64 new_max_requests, UInt64 new_max_cost)
{
std::unique_lock lock(mutex);
bool was_active = active();
max_requests = new_max_requests;
max_cost = new_max_cost;
if (parent)
{
// Activate on transition from inactive state
if (!was_active && active())
parent->activateChild(this);
// Deactivate on transition into inactive state
else if (was_active && !active())
{
// Node deactivation is usually done in dequeueRequest(), but we do not want to
// make an extra call to active() on every request just to make sure there was no update().
// There is no interface method to do deactivation, so we do the following trick.
parent->removeChild(this);
parent->attachChild(self); // This call is the only reason we have `recursive_mutex`
}
}
}
bool isActive() override
{
std::unique_lock lock(mutex);
@ -150,10 +191,10 @@ private:
return satisfied() && child_active;
}
const Int64 max_requests = default_max_requests;
const Int64 max_cost = default_max_cost;
Int64 max_requests = default_max_requests;
Int64 max_cost = default_max_cost;
std::mutex mutex;
std::recursive_mutex mutex;
Int64 requests = 0;
Int64 cost = 0;
bool child_active = false;
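The new `updateConstraints()` above has to detect whether changing the limits flips the node between active and inactive. A minimal standalone sketch of that transition check follows. `InflightLimitSketch` is a hypothetical type, and its `satisfied()` condition (strictly below both limits) is an assumption, since the real predicate is not shown in this hunk. The real node also reattaches itself to its parent on deactivation, which the sketch only reports as a result value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified in-flight limit with limits that can change at runtime.
struct InflightLimitSketch
{
    enum class Transition { None, Activated, Deactivated };

    int64_t max_requests;
    int64_t max_cost;
    int64_t requests = 0; // number of in-flight requests
    int64_t cost = 0;     // total cost of in-flight requests

    InflightLimitSketch(int64_t max_requests_, int64_t max_cost_)
        : max_requests(max_requests_), max_cost(max_cost_)
    {}

    // Assumption: a new request may start only while both counters are below their limits.
    bool satisfied() const { return requests < max_requests && cost < max_cost; }

    void start(int64_t request_cost)
    {
        ++requests;
        cost += request_cost;
    }

    void finish(int64_t request_cost)
    {
        assert(requests > 0);
        --requests;
        cost -= request_cost;
    }

    // Apply new limits and report whether the state flipped.
    Transition updateLimits(int64_t new_max_requests, int64_t new_max_cost)
    {
        bool was_satisfied = satisfied();
        max_requests = new_max_requests;
        max_cost = new_max_cost;
        bool now_satisfied = satisfied();
        if (!was_satisfied && now_satisfied)
            return Transition::Activated;
        if (was_satisfied && !now_satisfied)
            return Transition::Deactivated;
        return Transition::None;
    }
};
```

Comparing the state before and after the assignment keeps the per-request path free of extra checks, which matches the motivation given in the comment inside `updateConstraints()`.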

View File

@ -3,8 +3,6 @@
#include <Common/Scheduler/ISchedulerConstraint.h>
#include <chrono>
#include <mutex>
#include <limits>
#include <utility>
@ -15,7 +13,7 @@ namespace DB
* Limited throughput constraint. Blocks if token-bucket constraint is violated:
* i.e. more than `max_burst + duration * max_speed` cost units (aka tokens) dequeued from this node in last `duration` seconds.
*/
class ThrottlerConstraint : public ISchedulerConstraint
class ThrottlerConstraint final : public ISchedulerConstraint
{
public:
static constexpr double default_burst_seconds = 1.0;
@ -28,10 +26,28 @@ public:
, tokens(max_burst)
{}
ThrottlerConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_, double max_speed_, double max_burst_)
: ISchedulerConstraint(event_queue_, info_)
, max_speed(max_speed_)
, max_burst(max_burst_)
, last_update(event_queue_->now())
, tokens(max_burst)
{}
~ThrottlerConstraint() override
{
// We should cancel event on destruction to avoid dangling references from event queue
event_queue->cancelPostponed(postponed);
// We need to clear `parent` in child to avoid dangling reference
if (child)
removeChild(child.get());
}
const String & getTypeName() const override
{
static String type_name("bandwidth_limit");
return type_name;
}
bool equals(ISchedulerNode * other) override
@ -78,10 +94,7 @@ public:
if (!request)
return {nullptr, false};
// Request has reference to the first (closest to leaf) `constraint`, which can have `parent_constraint`.
// The former is initialized here dynamically and the latter is initialized once during hierarchy construction.
if (!request->constraint)
request->constraint = this;
// We don't do `request->addConstraint(this)` because `finishRequest()` is no-op
updateBucket(request->cost);
@ -92,12 +105,8 @@ public:
return {request, active()};
}
void finishRequest(ResourceRequest * request) override
void finishRequest(ResourceRequest *) override
{
// Recursive traverse of parent flow controls in reverse order
if (parent_constraint)
parent_constraint->finishRequest(request);
// NOTE: Token-bucket constraint does not require any action when consumption ends
}
@ -108,6 +117,21 @@ public:
parent->activateChild(this);
}
/// Update limits.
/// Should be called from the scheduler thread because it could lead to activation
void updateConstraints(double new_max_speed, double new_max_burst)
{
event_queue->cancelPostponed(postponed);
postponed = EventQueue::not_postponed;
bool was_active = active();
updateBucket(0, true); // To apply previous params for duration since `last_update`
max_speed = new_max_speed;
max_burst = new_max_burst;
updateBucket(0, false); // To postpone (if needed) using new params
if (!was_active && active() && parent)
parent->activateChild(this);
}
bool isActive() override
{
return active();
@ -150,7 +174,7 @@ private:
parent->activateChild(this);
}
void updateBucket(ResourceCost use = 0)
void updateBucket(ResourceCost use = 0, bool do_not_postpone = false)
{
auto now = event_queue->now();
if (max_speed > 0.0)
@ -160,7 +184,7 @@ private:
tokens -= use; // This is done outside min() to avoid passing large requests w/o token consumption after long idle period
// Postpone activation until there is positive amount of tokens
if (tokens < 0.0)
if (!do_not_postpone && tokens < 0.0)
{
auto delay_ns = std::chrono::nanoseconds(static_cast<Int64>(-tokens / max_speed * 1e9));
if (postponed == EventQueue::not_postponed)
@ -184,8 +208,8 @@ private:
return satisfied() && child_active;
}
const double max_speed{0}; /// in tokens per second
const double max_burst{0}; /// in tokens
double max_speed{0}; /// in tokens per second
double max_burst{0}; /// in tokens
EventQueue::TimePoint last_update;
UInt64 postponed = EventQueue::not_postponed;
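The token-bucket arithmetic used by `updateBucket()` and the two-step update in `updateConstraints()` can be illustrated in isolation. This is a minimal sketch under stated assumptions: `TokenBucketSketch` is a hypothetical type, time is passed in explicitly as seconds instead of coming from an event queue, and postponing is reduced to returning the required delay.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified token bucket. Time is given in seconds by the caller.
struct TokenBucketSketch
{
    double max_speed;   // tokens per second
    double max_burst;   // bucket capacity in tokens
    double tokens;      // may become negative after a large request
    double last_update; // time of the last refill, in seconds

    TokenBucketSketch(double max_speed_, double max_burst_, double now)
        : max_speed(max_speed_), max_burst(max_burst_), tokens(max_burst_), last_update(now)
    {}

    // Refill for the elapsed time, then consume `use` tokens.
    // Returns the delay in seconds until the balance is non-negative again (0 if none).
    double update(double now, double use = 0.0)
    {
        assert(now >= last_update);
        if (max_speed > 0.0)
        {
            double elapsed = now - last_update;
            tokens = std::min(tokens + max_speed * elapsed, max_burst);
            tokens -= use; // outside min(), so a large request is always charged in full
            last_update = now;
            if (tokens < 0.0)
                return -tokens / max_speed;
        }
        return 0.0;
    }

    // Account for the time since `last_update` with the old parameters,
    // then switch to the new ones and recompute the delay.
    double updateLimits(double now, double new_max_speed, double new_max_burst)
    {
        update(now);
        max_speed = new_max_speed;
        max_burst = new_max_burst;
        return update(now);
    }
};
```

Refilling with the old parameters first means that a limit change affects only the time after the change, which mirrors the two `updateBucket()` calls in `updateConstraints()`.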

View File

@ -0,0 +1,606 @@
#pragma once
#include <Common/Priority.h>
#include <Common/Scheduler/Nodes/PriorityPolicy.h>
#include <Common/Scheduler/Nodes/FairPolicy.h>
#include <Common/Scheduler/Nodes/ThrottlerConstraint.h>
#include <Common/Scheduler/Nodes/SemaphoreConstraint.h>
#include <Common/Scheduler/ISchedulerQueue.h>
#include <Common/Scheduler/Nodes/FifoQueue.h>
#include <Common/Scheduler/ISchedulerNode.h>
#include <Common/Scheduler/SchedulingSettings.h>
#include <Common/Exception.h>
#include <memory>
#include <unordered_map>
namespace DB
{
namespace ErrorCodes
{
extern const int INVALID_SCHEDULER_NODE;
extern const int LOGICAL_ERROR;
}
class UnifiedSchedulerNode;
using UnifiedSchedulerNodePtr = std::shared_ptr<UnifiedSchedulerNode>;
/*
* Unified scheduler node combines multiple nodes internally to provide all available scheduling policies and constraints.
* Whole scheduling hierarchy could "logically" consist of unified nodes only. Physically intermediate "internal" nodes
* are also present. This approach makes runtime manipulation easier than using multiple types of nodes.
*
* Unified node is capable of updating its internal structure based on:
* 1. Number of children (fifo if =0 or fairness/priority if >0).
* 2. Priorities of its children (for subtree structure).
* 3. `SchedulingSettings` associated with unified node (for throttler and semaphore constraints).
*
* In general, unified node has "internal" subtree with the following structure:
*
* THIS <-- UnifiedSchedulerNode object
* |
* THROTTLER <-- [Optional] Throttling scheduling constraint
* |
* [If no children]------ SEMAPHORE <-- [Optional] Semaphore constraint
* | |
* FIFO PRIORITY <-- [Optional] Scheduling policy distinguishing priorities
* .-------' '-------.
* FAIRNESS[p1] ... FAIRNESS[pN] <-- [Optional] Policies for fairness if priorities are equal
* / \ / \
* CHILD[p1,w1] ... CHILD[p1,wM] CHILD[pN,w1] ... CHILD[pN,wM] <-- Unified children (UnifiedSchedulerNode objects)
*
* NOTE: to distinguish different kinds of children we use the following terms:
* - immediate child: child of unified object (THROTTLER);
* - unified child: leaf of this "internal" subtree (CHILD[p,w]);
* - intermediate node: any node of this "internal" subtree that is not a UnifiedSchedulerNode (i.e. neither a unified child nor `this`)
*/
class UnifiedSchedulerNode final : public ISchedulerNode
{
private:
/// Helper function for managing a parent of a node
static void reparent(const SchedulerNodePtr & node, const SchedulerNodePtr & new_parent)
{
reparent(node, new_parent.get());
}
/// Helper function for managing a parent of a node
static void reparent(const SchedulerNodePtr & node, ISchedulerNode * new_parent)
{
chassert(node);
chassert(new_parent);
if (new_parent == node->parent)
return;
if (node->parent)
node->parent->removeChild(node.get());
new_parent->attachChild(node);
}
/// Helper function for managing a parent of a node
static void detach(const SchedulerNodePtr & node)
{
if (node->parent)
node->parent->removeChild(node.get());
}
/// A branch of the tree for a specific priority value
struct FairnessBranch
{
SchedulerNodePtr root; /// FairPolicy node is used if multiple children with the same priority are attached
std::unordered_map<String, UnifiedSchedulerNodePtr> children; // basename -> child
bool empty() const { return children.empty(); }
SchedulerNodePtr getRoot()
{
chassert(!children.empty());
if (root)
return root;
chassert(children.size() == 1);
return children.begin()->second;
}
/// Attaches a new child.
/// Returns root node if it has been changed to a different node, otherwise returns null.
[[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
{
if (auto [it, inserted] = children.emplace(child->basename, child); !inserted)
throw Exception(
ErrorCodes::INVALID_SCHEDULER_NODE,
"Can't add another child with the same path: {}",
it->second->getPath());
if (children.size() == 2)
{
// Insert fair node if we have just added the second child
chassert(!root);
root = std::make_shared<FairPolicy>(event_queue_, SchedulerNodeInfo{});
root->info.setPriority(child->info.priority);
root->basename = fmt::format("p{}_fair", child->info.priority.value);
for (auto & [_, node] : children)
reparent(node, root);
return root; // New root has been created
}
else if (children.size() == 1)
return child; // We have added single child so far and it is the new root
else
reparent(child, root);
return {}; // Root is the same
}
/// Detaches a child.
/// Returns root node if it has been changed to a different node, otherwise returns null.
/// NOTE: It could also return null if `empty()` after detaching
[[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue *, const UnifiedSchedulerNodePtr & child)
{
auto it = children.find(child->basename);
if (it == children.end())
return {}; // unknown child
detach(child);
children.erase(it);
if (children.size() == 1)
{
// Remove the fair node if only one child is left
chassert(root);
detach(root);
root.reset();
return children.begin()->second; // The last child is a new root now
}
else if (children.empty())
return {}; // We have detached the last child
else
return {}; // Root is the same (two or more children are left)
}
};
/// Handles all the children nodes with intermediate fair and/or priority nodes
struct ChildrenBranch
{
SchedulerNodePtr root; /// PriorityPolicy node is used if multiple children with different priority are attached
std::unordered_map<Priority::Value, FairnessBranch> branches; /// Branches for different priority values
// Returns true iff there are no unified children attached
bool empty() const { return branches.empty(); }
SchedulerNodePtr getRoot()
{
chassert(!branches.empty());
if (root)
return root;
return branches.begin()->second.getRoot(); // There should be exactly one child-branch
}
/// Attaches a new child.
/// Returns root node if it has been changed to a different node, otherwise returns null.
[[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
{
auto [it, new_branch] = branches.try_emplace(child->info.priority);
auto & child_branch = it->second;
auto branch_root = child_branch.attachUnifiedChild(event_queue_, child);
if (!new_branch)
{
if (branch_root)
{
if (root)
reparent(branch_root, root);
else
return branch_root;
}
return {};
}
else
{
chassert(branch_root);
if (branches.size() == 2)
{
// Insert priority node if we have just added the second branch
chassert(!root);
root = std::make_shared<PriorityPolicy>(event_queue_, SchedulerNodeInfo{});
root->basename = "prio";
for (auto & [_, branch] : branches)
reparent(branch.getRoot(), root);
return root; // New root has been created
}
else if (branches.size() == 1)
return child; // We have added single child so far and it is the new root
else
reparent(child, root);
return {}; // Root is the same
}
}
/// Detaches a child.
/// Returns root node if it has been changed to a different node, otherwise returns null.
/// NOTE: It could also return null if `empty()` after detaching
[[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
{
auto it = branches.find(child->info.priority);
if (it == branches.end())
return {}; // unknown child
auto & child_branch = it->second;
auto branch_root = child_branch.detachUnifiedChild(event_queue_, child);
if (child_branch.empty())
{
branches.erase(it);
if (branches.size() == 1)
{
// Remove the priority node if only one child-branch is left
chassert(root);
detach(root);
root.reset();
return branches.begin()->second.getRoot(); // The last child-branch is a new root now
}
else if (branches.empty())
return {}; // We have detached the last child
else
return {}; // Root is the same (two or more child-branches are left)
}
if (branch_root)
{
if (root)
reparent(branch_root, root);
else
return branch_root;
}
return {}; // Root is the same
}
};
/// Handles the degenerate case of zero children (a fifo queue) or delegates to `ChildrenBranch`.
struct QueueOrChildrenBranch
{
SchedulerNodePtr queue; /// FifoQueue node is used if there are no children
ChildrenBranch branch; /// Used if there is at least one child
SchedulerNodePtr getRoot()
{
if (queue)
return queue;
else
return branch.getRoot();
}
// Should be called after constructor, before any other methods
[[nodiscard]] SchedulerNodePtr initialize(EventQueue * event_queue_)
{
createQueue(event_queue_);
return queue;
}
/// Attaches a new child.
/// Returns root node if it has been changed to a different node, otherwise returns null.
[[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
{
if (queue)
removeQueue();
return branch.attachUnifiedChild(event_queue_, child);
}
/// Detaches a child.
/// Returns root node if it has been changed to a different node, otherwise returns null.
[[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
{
if (queue)
return {}; // No-op, it already has no children
auto branch_root = branch.detachUnifiedChild(event_queue_, child);
if (branch.empty())
{
createQueue(event_queue_);
return queue;
}
return branch_root;
}
private:
void createQueue(EventQueue * event_queue_)
{
queue = std::make_shared<FifoQueue>(event_queue_, SchedulerNodeInfo{});
queue->basename = "fifo";
}
void removeQueue()
{
// This unified node will not be able to process resource requests any longer
// All remaining resource requests will be aborted on queue destruction
detach(queue);
std::static_pointer_cast<ISchedulerQueue>(queue)->purgeQueue();
queue.reset();
}
};
/// Handles all the nodes under this unified node
/// Specifically handles constraints with `QueueOrChildrenBranch` under it
struct ConstraintsBranch
{
SchedulerNodePtr throttler;
SchedulerNodePtr semaphore;
QueueOrChildrenBranch branch;
SchedulingSettings settings;
// Should be called after constructor, before any other methods
[[nodiscard]] SchedulerNodePtr initialize(EventQueue * event_queue_, const SchedulingSettings & settings_)
{
settings = settings_;
SchedulerNodePtr node = branch.initialize(event_queue_);
if (settings.hasSemaphore())
{
semaphore = std::make_shared<SemaphoreConstraint>(event_queue_, SchedulerNodeInfo{}, settings.max_requests, settings.max_cost);
semaphore->basename = "semaphore";
reparent(node, semaphore);
node = semaphore;
}
if (settings.hasThrottler())
{
throttler = std::make_shared<ThrottlerConstraint>(event_queue_, SchedulerNodeInfo{}, settings.max_speed, settings.max_burst);
throttler->basename = "throttler";
reparent(node, throttler);
node = throttler;
}
return node;
}
/// Attaches a new child.
/// Returns root node if it has been changed to a different node, otherwise returns null.
[[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
{
if (auto branch_root = branch.attachUnifiedChild(event_queue_, child))
{
// If both semaphore and throttler exist we should reparent to the farthest from the root
if (semaphore)
reparent(branch_root, semaphore);
else if (throttler)
reparent(branch_root, throttler);
else
return branch_root;
}
return {};
}
/// Detaches a child.
/// Returns root node if it has been changed to a different node, otherwise returns null.
[[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
{
if (auto branch_root = branch.detachUnifiedChild(event_queue_, child))
{
if (semaphore)
reparent(branch_root, semaphore);
else if (throttler)
reparent(branch_root, throttler);
else
return branch_root;
}
return {};
}
/// Updates constraint-related nodes.
/// Returns root node if it has been changed to a different node, otherwise returns null.
[[nodiscard]] SchedulerNodePtr updateSchedulingSettings(EventQueue * event_queue_, const SchedulingSettings & new_settings)
{
SchedulerNodePtr node = branch.getRoot();
if (!settings.hasSemaphore() && new_settings.hasSemaphore()) // Add semaphore
{
semaphore = std::make_shared<SemaphoreConstraint>(event_queue_, SchedulerNodeInfo{}, new_settings.max_requests, new_settings.max_cost);
semaphore->basename = "semaphore";
reparent(node, semaphore);
node = semaphore;
}
else if (settings.hasSemaphore() && !new_settings.hasSemaphore()) // Remove semaphore
{
detach(semaphore);
semaphore.reset();
}
else if (settings.hasSemaphore() && new_settings.hasSemaphore()) // Update semaphore
{
static_cast<SemaphoreConstraint&>(*semaphore).updateConstraints(semaphore, new_settings.max_requests, new_settings.max_cost);
node = semaphore;
}
if (!settings.hasThrottler() && new_settings.hasThrottler()) // Add throttler
{
throttler = std::make_shared<ThrottlerConstraint>(event_queue_, SchedulerNodeInfo{}, new_settings.max_speed, new_settings.max_burst);
throttler->basename = "throttler";
reparent(node, throttler);
node = throttler;
}
else if (settings.hasThrottler() && !new_settings.hasThrottler()) // Remove throttler
{
detach(throttler);
throttler.reset();
}
else if (settings.hasThrottler() && new_settings.hasThrottler()) // Update throttler
{
static_cast<ThrottlerConstraint&>(*throttler).updateConstraints(new_settings.max_speed, new_settings.max_burst);
node = throttler;
}
settings = new_settings;
return node;
}
};
public:
explicit UnifiedSchedulerNode(EventQueue * event_queue_, const SchedulingSettings & settings)
: ISchedulerNode(event_queue_, SchedulerNodeInfo(settings.weight, settings.priority))
{
immediate_child = impl.initialize(event_queue, settings);
reparent(immediate_child, this);
}
~UnifiedSchedulerNode() override
{
// We need to clear `parent` in child to avoid dangling references
if (immediate_child)
removeChild(immediate_child.get());
}
/// Attaches a unified child as a leaf of the internal subtree and inserts or updates all the intermediate nodes
/// NOTE: Do not confuse with `attachChild()` which is used only for immediate children
void attachUnifiedChild(const UnifiedSchedulerNodePtr & child)
{
if (auto new_child = impl.attachUnifiedChild(event_queue, child))
reparent(new_child, this);
}
/// Detaches a unified child and updates all the intermediate nodes.
/// Detached child could be safely attached to another parent.
/// NOTE: Do not confuse with `removeChild()` which is used only for immediate children
void detachUnifiedChild(const UnifiedSchedulerNodePtr & child)
{
if (auto new_child = impl.detachUnifiedChild(event_queue, child))
reparent(new_child, this);
}
static bool updateRequiresDetach(const String & old_parent, const String & new_parent, const SchedulingSettings & old_settings, const SchedulingSettings & new_settings)
{
return old_parent != new_parent || old_settings.priority != new_settings.priority;
}
/// Updates scheduling settings. Set of constraints might change.
/// NOTE: Caller is responsible for detaching and attaching if `updateRequiresDetach` returns true
void updateSchedulingSettings(const SchedulingSettings & new_settings)
{
info.setPriority(new_settings.priority);
info.setWeight(new_settings.weight);
if (auto new_child = impl.updateSchedulingSettings(event_queue, new_settings))
reparent(new_child, this);
}
const SchedulingSettings & getSettings() const
{
return impl.settings;
}
/// Returns the queue to be used for resource requests or `nullptr` if it has unified children
std::shared_ptr<ISchedulerQueue> getQueue() const
{
return static_pointer_cast<ISchedulerQueue>(impl.branch.queue);
}
/// Collects nodes that could be accessed with raw pointers by resource requests (queue and constraints)
/// NOTE: This is a building block for classifier. Note that due to possible movement of a queue, set of constraints
/// for that queue might change in future, and `request->constraints` might reference nodes not in
/// the initial set of nodes returned by `addRawPointerNodes()`. To avoid destruction of such additional nodes
/// classifier must (indirectly) hold nodes returned by `addRawPointerNodes()` for all future versions of
/// all unified nodes. Such a version control is done by `IOResourceManager`.
void addRawPointerNodes(std::vector<SchedulerNodePtr> & nodes)
{
// NOTE: `impl.throttler` could be skipped, because ThrottlerConstraint does not call `request->addConstraint()`
if (impl.semaphore)
nodes.push_back(impl.semaphore);
if (impl.branch.queue)
nodes.push_back(impl.branch.queue);
for (auto & [_, branch] : impl.branch.branch.branches)
{
for (auto & [_, child] : branch.children)
child->addRawPointerNodes(nodes);
}
}
bool hasUnifiedChildren() const
{
return impl.branch.queue == nullptr;
}
/// Introspection. Calls a visitor for self and every internal node. Does not recurse into unified children.
void forEachSchedulerNode(std::function<void(ISchedulerNode *)> visitor)
{
visitor(this);
if (impl.throttler)
visitor(impl.throttler.get());
if (impl.semaphore)
visitor(impl.semaphore.get());
if (impl.branch.queue)
visitor(impl.branch.queue.get());
if (impl.branch.branch.root) // priority
visitor(impl.branch.branch.root.get());
for (auto & [_, branch] : impl.branch.branch.branches)
{
if (branch.root) // fairness
visitor(branch.root.get());
}
}
protected: // Hide all the ISchedulerNode interface methods as implementation details
const String & getTypeName() const override
{
static String type_name("unified");
return type_name;
}
bool equals(ISchedulerNode *) override
{
throw Exception(ErrorCodes::LOGICAL_ERROR, "UnifiedSchedulerNode should not be used with CustomResourceManager");
}
/// Attaches an immediate child (used through `reparent()`)
void attachChild(const SchedulerNodePtr & child_) override
{
immediate_child = child_;
immediate_child->setParent(this);
// Activate if required
if (immediate_child->isActive())
activateChild(immediate_child.get());
}
/// Removes an immediate child (used through `reparent()`)
void removeChild(ISchedulerNode * child) override
{
if (immediate_child.get() == child)
{
child_active = false; // deactivate
immediate_child->setParent(nullptr); // detach
immediate_child.reset();
}
}
ISchedulerNode * getChild(const String & child_name) override
{
if (immediate_child->basename == child_name)
return immediate_child.get();
else
return nullptr;
}
std::pair<ResourceRequest *, bool> dequeueRequest() override
{
auto [request, child_now_active] = immediate_child->dequeueRequest();
if (!request)
return {nullptr, false};
child_active = child_now_active;
if (!child_active)
busy_periods++;
incrementDequeued(request->cost);
return {request, child_active};
}
bool isActive() override
{
return child_active;
}
/// Shows number of immediate active children (for introspection)
size_t activeChildren() override
{
return child_active;
}
/// Activate an immediate child
void activateChild(ISchedulerNode * child) override
{
if (child == immediate_child.get())
if (!std::exchange(child_active, true) && parent)
parent->activateChild(this);
}
private:
ConstraintsBranch impl;
SchedulerNodePtr immediate_child; // An immediate child (actually the root of the whole subtree)
bool child_active = false;
};
}
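The attach/detach helpers in `FairnessBranch` follow one convention: they return the new branch root only when the root changes, and null otherwise. The following minimal sketch shows that convention in isolation. `NodeSketch` and `FairnessBranchSketch` are hypothetical simplified types that track only a name and a raw parent pointer; they are not the scheduler classes from this file.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a scheduler node: only a name and a parent link.
struct NodeSketch
{
    std::string name;
    NodeSketch * parent = nullptr;
    explicit NodeSketch(std::string name_) : name(std::move(name_)) {}
};
using NodeSketchPtr = std::shared_ptr<NodeSketch>;

struct FairnessBranchSketch
{
    NodeSketchPtr root;                            // intermediate "fair" node, present only with 2+ children
    std::map<std::string, NodeSketchPtr> children; // name -> child

    // Returns the new root if it changed, otherwise null.
    NodeSketchPtr attach(const NodeSketchPtr & child)
    {
        if (!children.emplace(child->name, child).second)
            throw std::invalid_argument("duplicate child: " + child->name);
        if (children.size() == 1)
            return child; // a single child is its own root
        if (children.size() == 2)
        {
            // The second child requires an intermediate node above both children
            root = std::make_shared<NodeSketch>("fair");
            for (auto & entry : children)
                entry.second->parent = root.get();
            return root;
        }
        child->parent = root.get();
        return nullptr; // root is unchanged
    }

    // Returns the new root if it changed, otherwise null (also when the branch becomes empty).
    NodeSketchPtr detach(const NodeSketchPtr & child)
    {
        auto it = children.find(child->name);
        if (it == children.end())
            return nullptr; // unknown child
        child->parent = nullptr;
        children.erase(it);
        if (children.size() == 1)
        {
            // Only one child is left: drop the intermediate node and promote that child
            NodeSketchPtr last = children.begin()->second;
            last->parent = nullptr;
            root.reset();
            return last;
        }
        return nullptr;
    }
};
```

Returning the root only on change lets each outer layer decide whether it has to reparent anything, which is how `ChildrenBranch`, `QueueOrChildrenBranch` and `ConstraintsBranch` chain these calls above.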

View File

@ -1,15 +0,0 @@
#include <Common/Scheduler/Nodes/registerResourceManagers.h>
#include <Common/Scheduler/ResourceManagerFactory.h>
namespace DB
{
void registerDynamicResourceManager(ResourceManagerFactory &);
void registerResourceManagers()
{
auto & factory = ResourceManagerFactory::instance();
registerDynamicResourceManager(factory);
}
}

View File

@ -1,8 +0,0 @@
#pragma once
namespace DB
{
void registerResourceManagers();
}

View File

@ -1,5 +1,8 @@
#pragma once
#include <gtest/gtest.h>
#include <Common/Scheduler/SchedulingSettings.h>
#include <Common/Scheduler/IResourceManager.h>
#include <Common/Scheduler/SchedulerRoot.h>
#include <Common/Scheduler/ResourceGuard.h>
@ -7,26 +10,35 @@
#include <Common/Scheduler/Nodes/PriorityPolicy.h>
#include <Common/Scheduler/Nodes/FifoQueue.h>
#include <Common/Scheduler/Nodes/SemaphoreConstraint.h>
#include <Common/Scheduler/Nodes/UnifiedSchedulerNode.h>
#include <Common/Scheduler/Nodes/registerSchedulerNodes.h>
#include <Common/Scheduler/Nodes/registerResourceManagers.h>
#include <Poco/Util/XMLConfiguration.h>
#include <atomic>
#include <barrier>
#include <exception>
#include <functional>
#include <memory>
#include <unordered_map>
#include <mutex>
#include <set>
#include <sstream>
#include <utility>
namespace DB
{
namespace ErrorCodes
{
extern const int RESOURCE_ACCESS_DENIED;
}
struct ResourceTestBase
{
ResourceTestBase()
{
[[maybe_unused]] static bool typesRegistered = [] { registerSchedulerNodes(); registerResourceManagers(); return true; }();
[[maybe_unused]] static bool typesRegistered = [] { registerSchedulerNodes(); return true; }();
}
template <class TClass>
@ -37,10 +49,16 @@ struct ResourceTestBase
Poco::AutoPtr config{new Poco::Util::XMLConfiguration(stream)};
String config_prefix = "node";
return add<TClass>(event_queue, root_node, path, std::ref(*config), config_prefix);
}
template <class TClass, class... Args>
static TClass * add(EventQueue * event_queue, SchedulerNodePtr & root_node, const String & path, Args... args)
{
if (path == "/")
{
EXPECT_TRUE(root_node.get() == nullptr);
root_node.reset(new TClass(event_queue, *config, config_prefix));
root_node.reset(new TClass(event_queue, std::forward<Args>(args)...));
return static_cast<TClass *>(root_node.get());
}
@ -65,73 +83,114 @@ struct ResourceTestBase
}
EXPECT_TRUE(!child_name.empty()); // wrong path
SchedulerNodePtr node = std::make_shared<TClass>(event_queue, *config, config_prefix);
SchedulerNodePtr node = std::make_shared<TClass>(event_queue, std::forward<Args>(args)...);
node->basename = child_name;
parent->attachChild(node);
return static_cast<TClass *>(node.get());
}
};
struct ConstraintTest : public SemaphoreConstraint
{
explicit ConstraintTest(EventQueue * event_queue_, const Poco::Util::AbstractConfiguration & config = emptyConfig(), const String & config_prefix = {})
: SemaphoreConstraint(event_queue_, config, config_prefix)
{}
std::pair<ResourceRequest *, bool> dequeueRequest() override
{
auto [request, active] = SemaphoreConstraint::dequeueRequest();
if (request)
{
std::unique_lock lock(mutex);
requests.insert(request);
}
return {request, active};
}
void finishRequest(ResourceRequest * request) override
{
{
std::unique_lock lock(mutex);
requests.erase(request);
}
SemaphoreConstraint::finishRequest(request);
}
std::mutex mutex;
std::set<ResourceRequest *> requests;
};
class ResourceTestClass : public ResourceTestBase
{
struct Request : public ResourceRequest
{
ResourceTestClass * test;
String name;
Request(ResourceCost cost_, const String & name_)
Request(ResourceTestClass * test_, ResourceCost cost_, const String & name_)
: ResourceRequest(cost_)
, test(test_)
, name(name_)
{}
void execute() override
{
}
void failed(const std::exception_ptr &) override
{
test->failed_cost += cost;
delete this;
}
};
public:
~ResourceTestClass()
{
if (root_node)
dequeue(); // Just to avoid any leaks of `Request` object
}
template <class TClass>
void add(const String & path, const String & xml = {})
{
ResourceTestBase::add<TClass>(&event_queue, root_node, path, xml);
}
template <class TClass, class... Args>
void addCustom(const String & path, Args... args)
{
ResourceTestBase::add<TClass>(&event_queue, root_node, path, std::forward<Args>(args)...);
}
UnifiedSchedulerNodePtr createUnifiedNode(const String & basename, const SchedulingSettings & settings = {})
{
return createUnifiedNode(basename, {}, settings);
}
UnifiedSchedulerNodePtr createUnifiedNode(const String & basename, const UnifiedSchedulerNodePtr & parent, const SchedulingSettings & settings = {})
{
auto node = std::make_shared<UnifiedSchedulerNode>(&event_queue, settings);
node->basename = basename;
if (parent)
{
parent->attachUnifiedChild(node);
}
else
{
EXPECT_TRUE(root_node.get() == nullptr);
root_node = node;
}
return node;
}
// Updates the parent and/or scheduling settings for a specified `node`.
// Unit test implementation must make sure that all needed queues and constraints are not going to be destroyed.
// Normally it is the responsibility of IOResourceManager, but we do not use it here, so manual version control is required.
// (see IOResourceManager::Resource::updateCurrentVersion() for details)
void updateUnifiedNode(const UnifiedSchedulerNodePtr & node, const UnifiedSchedulerNodePtr & old_parent, const UnifiedSchedulerNodePtr & new_parent, const SchedulingSettings & new_settings)
{
EXPECT_TRUE((old_parent && new_parent) || (!old_parent && !new_parent)); // changing root node is not supported
bool detached = false;
if (UnifiedSchedulerNode::updateRequiresDetach(
old_parent ? old_parent->basename : "",
new_parent ? new_parent->basename : "",
node->getSettings(),
new_settings))
{
if (old_parent)
old_parent->detachUnifiedChild(node);
detached = true;
}
node->updateSchedulingSettings(new_settings);
if (detached && new_parent)
new_parent->attachUnifiedChild(node);
}
void enqueue(const UnifiedSchedulerNodePtr & node, const std::vector<ResourceCost> & costs)
{
enqueueImpl(node->getQueue().get(), costs, node->basename);
}
void enqueue(const String & path, const std::vector<ResourceCost> & costs)
{
ASSERT_TRUE(root_node.get() != nullptr); // root should be initialized first
ISchedulerNode * node = root_node.get();
size_t pos = 1;
while (pos < path.length())
while (node && pos < path.length())
{
size_t slash = path.find('/', pos);
if (slash != String::npos)
@ -146,13 +205,17 @@ public:
pos = String::npos;
}
}
ISchedulerQueue * queue = dynamic_cast<ISchedulerQueue *>(node);
ASSERT_TRUE(queue != nullptr); // not a queue
if (node)
enqueueImpl(dynamic_cast<ISchedulerQueue *>(node), costs);
}
void enqueueImpl(ISchedulerQueue * queue, const std::vector<ResourceCost> & costs, const String & name = {})
{
ASSERT_TRUE(queue != nullptr); // not a queue
if (!queue)
return; // to make clang-analyzer-core.NonNullParamChecker happy
for (ResourceCost cost : costs)
{
queue->enqueueRequest(new Request(cost, queue->basename));
}
queue->enqueueRequest(new Request(this, cost, name.empty() ? queue->basename : name));
processEvents(); // to activate queues
}
@ -208,6 +271,12 @@ public:
consumed_cost[name] -= value;
}
void failed(ResourceCost value)
{
EXPECT_EQ(failed_cost, value);
failed_cost -= value;
}
void processEvents()
{
while (event_queue.tryProcess()) {}
@ -217,8 +286,11 @@ private:
EventQueue event_queue;
SchedulerNodePtr root_node;
std::unordered_map<String, ResourceCost> consumed_cost;
ResourceCost failed_cost = 0;
};
enum EnqueueOnlyEnum { EnqueueOnly };
template <class TManager>
struct ResourceTestManager : public ResourceTestBase
{
@ -230,16 +302,49 @@ struct ResourceTestManager : public ResourceTestBase
struct Guard : public ResourceGuard
{
ResourceTestManager & t;
ResourceCost cost;
Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost)
: ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost, Lock::Defer)
/// Works like regular ResourceGuard, ready for consumption after constructor
Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost_)
: ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost_, Lock::Defer)
, t(t_)
, cost(cost_)
{
t.onEnqueue(link);
waitExecute();
}
/// Just enqueue resource request, do not block (needed for tests to sync). Call `waitExecute()` or `waitFailed()` afterwards
Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost_, EnqueueOnlyEnum)
: ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost_, Lock::Defer)
, t(t_)
, cost(cost_)
{
t.onEnqueue(link);
}
/// Waits for ResourceRequest::execute() to be called for enqueued request
void waitExecute()
{
lock();
t.onExecute(link);
consume(cost);
}
/// Waits for ResourceRequest::failed() to be called for enqueued request
void waitFailed(const String & pattern)
{
try
{
lock();
FAIL();
}
catch (Exception & e)
{
ASSERT_EQ(e.code(), ErrorCodes::RESOURCE_ACCESS_DENIED);
ASSERT_TRUE(e.message().contains(pattern));
}
}
};
struct TItem
@ -264,10 +369,24 @@ struct ResourceTestManager : public ResourceTestBase
, busy_period(thread_count)
{}
enum DoNotInitManagerEnum { DoNotInitManager };
explicit ResourceTestManager(size_t thread_count, DoNotInitManagerEnum)
: busy_period(thread_count)
{}
~ResourceTestManager()
{
wait();
}
void wait()
{
for (auto & thread : threads)
thread.join();
{
if (thread.joinable())
thread.join();
}
}
void update(const String & xml)

View File

@ -2,15 +2,15 @@
#include <Common/Scheduler/Nodes/tests/ResourceTest.h>
#include <Common/Scheduler/Nodes/DynamicResourceManager.h>
#include <Common/Scheduler/Nodes/CustomResourceManager.h>
#include <Poco/Util/XMLConfiguration.h>
using namespace DB;
using ResourceTest = ResourceTestManager<DynamicResourceManager>;
using ResourceTest = ResourceTestManager<CustomResourceManager>;
using TestGuard = ResourceTest::Guard;
TEST(SchedulerDynamicResourceManager, Smoke)
TEST(SchedulerCustomResourceManager, Smoke)
{
ResourceTest t;
@ -31,25 +31,25 @@ TEST(SchedulerDynamicResourceManager, Smoke)
</clickhouse>
)CONFIG");
ClassifierPtr cA = t.manager->acquire("A");
ClassifierPtr cB = t.manager->acquire("B");
ClassifierPtr c_a = t.manager->acquire("A");
ClassifierPtr c_b = t.manager->acquire("B");
for (int i = 0; i < 10; i++)
{
ResourceGuard gA(ResourceGuard::Metrics::getIOWrite(), cA->get("res1"), 1, ResourceGuard::Lock::Defer);
gA.lock();
gA.consume(1);
gA.unlock();
ResourceGuard g_a(ResourceGuard::Metrics::getIOWrite(), c_a->get("res1"), 1, ResourceGuard::Lock::Defer);
g_a.lock();
g_a.consume(1);
g_a.unlock();
ResourceGuard gB(ResourceGuard::Metrics::getIOWrite(), cB->get("res1"));
gB.unlock();
ResourceGuard g_b(ResourceGuard::Metrics::getIOWrite(), c_b->get("res1"));
g_b.unlock();
ResourceGuard gC(ResourceGuard::Metrics::getIORead(), cB->get("res1"));
gB.consume(2);
ResourceGuard g_c(ResourceGuard::Metrics::getIORead(), c_b->get("res1"));
g_b.consume(2);
}
}
TEST(SchedulerDynamicResourceManager, Fairness)
TEST(SchedulerCustomResourceManager, Fairness)
{
// Total cost for A and B cannot differ by more than 1 (every request has cost equal to 1).
// Requests from A use `value = 1`; requests from B use `value = -1`.

View File

@ -13,6 +13,12 @@ public:
, log(log_)
{}
const String & getTypeName() const override
{
static String type_name("fake");
return type_name;
}
void attachChild(const SchedulerNodePtr & child) override
{
log += " +" + child->basename;

View File

@ -0,0 +1,335 @@
#include <gtest/gtest.h>
#include <Core/Defines.h>
#include <Core/Settings.h>
#include <Common/Scheduler/Nodes/tests/ResourceTest.h>
#include <Common/Scheduler/Workload/WorkloadEntityStorageBase.h>
#include <Common/Scheduler/Nodes/IOResourceManager.h>
#include <Interpreters/Context.h>
#include <Parsers/parseQuery.h>
#include <Parsers/ASTCreateWorkloadQuery.h>
#include <Parsers/ASTCreateResourceQuery.h>
#include <Parsers/ASTDropWorkloadQuery.h>
#include <Parsers/ASTDropResourceQuery.h>
#include <Parsers/ParserCreateWorkloadQuery.h>
#include <Parsers/ParserCreateResourceQuery.h>
#include <Parsers/ParserDropWorkloadQuery.h>
#include <Parsers/ParserDropResourceQuery.h>
using namespace DB;
class WorkloadEntityTestStorage : public WorkloadEntityStorageBase
{
public:
WorkloadEntityTestStorage()
: WorkloadEntityStorageBase(Context::getGlobalContextInstance())
{}
void loadEntities() override {}
void executeQuery(const String & query)
{
ParserCreateWorkloadQuery create_workload_p;
ParserDropWorkloadQuery drop_workload_p;
ParserCreateResourceQuery create_resource_p;
ParserDropResourceQuery drop_resource_p;
auto parse = [&] (IParser & parser)
{
String error;
const char * end = query.data();
return tryParseQuery(
parser,
end,
query.data() + query.size(),
error,
false,
"",
false,
0,
DBMS_DEFAULT_MAX_PARSER_DEPTH,
DBMS_DEFAULT_MAX_PARSER_BACKTRACKS,
true);
};
if (ASTPtr create_workload = parse(create_workload_p))
{
auto & parsed = create_workload->as<ASTCreateWorkloadQuery &>();
auto workload_name = parsed.getWorkloadName();
bool throw_if_exists = !parsed.if_not_exists && !parsed.or_replace;
bool replace_if_exists = parsed.or_replace;
storeEntity(
nullptr,
WorkloadEntityType::Workload,
workload_name,
create_workload,
throw_if_exists,
replace_if_exists,
{});
}
else if (ASTPtr create_resource = parse(create_resource_p))
{
auto & parsed = create_resource->as<ASTCreateResourceQuery &>();
auto resource_name = parsed.getResourceName();
bool throw_if_exists = !parsed.if_not_exists && !parsed.or_replace;
bool replace_if_exists = parsed.or_replace;
storeEntity(
nullptr,
WorkloadEntityType::Resource,
resource_name,
create_resource,
throw_if_exists,
replace_if_exists,
{});
}
else if (ASTPtr drop_workload = parse(drop_workload_p))
{
auto & parsed = drop_workload->as<ASTDropWorkloadQuery &>();
bool throw_if_not_exists = !parsed.if_exists;
removeEntity(
nullptr,
WorkloadEntityType::Workload,
parsed.workload_name,
throw_if_not_exists);
}
else if (ASTPtr drop_resource = parse(drop_resource_p))
{
auto & parsed = drop_resource->as<ASTDropResourceQuery &>();
bool throw_if_not_exists = !parsed.if_exists;
removeEntity(
nullptr,
WorkloadEntityType::Resource,
parsed.resource_name,
throw_if_not_exists);
}
else
throw Exception(ErrorCodes::LOGICAL_ERROR, "Invalid query in WorkloadEntityTestStorage: {}", query);
}
private:
WorkloadEntityStorageBase::OperationResult storeEntityImpl(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool throw_if_exists,
bool replace_if_exists,
const Settings & settings) override
{
UNUSED(current_context, entity_type, entity_name, create_entity_query, throw_if_exists, replace_if_exists, settings);
return OperationResult::Ok;
}
WorkloadEntityStorageBase::OperationResult removeEntityImpl(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
bool throw_if_not_exists) override
{
UNUSED(current_context, entity_type, entity_name, throw_if_not_exists);
return OperationResult::Ok;
}
};
struct ResourceTest : ResourceTestManager<IOResourceManager>
{
WorkloadEntityTestStorage storage;
explicit ResourceTest(size_t thread_count = 1)
: ResourceTestManager(thread_count, DoNotInitManager)
{
manager = std::make_shared<IOResourceManager>(storage);
}
void query(const String & query_str)
{
storage.executeQuery(query_str);
}
template <class Func>
void async(const String & workload, Func func)
{
threads.emplace_back([=, this, func2 = std::move(func)]
{
ClassifierPtr classifier = manager->acquire(workload);
func2(classifier);
});
}
template <class Func>
void async(const String & workload, const String & resource, Func func)
{
threads.emplace_back([=, this, func2 = std::move(func)]
{
ClassifierPtr classifier = manager->acquire(workload);
ResourceLink link = classifier->get(resource);
func2(link);
});
}
};
using TestGuard = ResourceTest::Guard;
TEST(SchedulerIOResourceManager, Smoke)
{
ResourceTest t;
t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)");
t.query("CREATE WORKLOAD all SETTINGS max_requests = 10");
t.query("CREATE WORKLOAD A in all");
t.query("CREATE WORKLOAD B in all SETTINGS weight = 3");
ClassifierPtr c_a = t.manager->acquire("A");
ClassifierPtr c_b = t.manager->acquire("B");
for (int i = 0; i < 10; i++)
{
ResourceGuard g_a(ResourceGuard::Metrics::getIOWrite(), c_a->get("res1"), 1, ResourceGuard::Lock::Defer);
g_a.lock();
g_a.consume(1);
g_a.unlock();
ResourceGuard g_b(ResourceGuard::Metrics::getIOWrite(), c_b->get("res1"));
g_b.unlock();
ResourceGuard g_c(ResourceGuard::Metrics::getIORead(), c_b->get("res1"));
g_b.consume(2);
}
}
TEST(SchedulerIOResourceManager, Fairness)
{
// Total cost for A and B cannot differ by more than 1 (every request has cost equal to 1).
// Requests from A use `value = 1`; requests from B use `value = -1`.
std::atomic<Int64> unfairness = 0;
auto fairness_diff = [&] (Int64 value)
{
Int64 cur_unfairness = unfairness.fetch_add(value, std::memory_order_relaxed) + value;
EXPECT_NEAR(cur_unfairness, 0, 1);
};
constexpr size_t threads_per_queue = 2;
int requests_per_thread = 100;
ResourceTest t(2 * threads_per_queue + 1);
t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)");
t.query("CREATE WORKLOAD all SETTINGS max_requests = 1");
t.query("CREATE WORKLOAD A IN all");
t.query("CREATE WORKLOAD B IN all");
t.query("CREATE WORKLOAD leader IN all");
for (int thread = 0; thread < threads_per_queue; thread++)
{
t.threads.emplace_back([&]
{
ClassifierPtr c = t.manager->acquire("A");
ResourceLink link = c->get("res1");
t.startBusyPeriod(link, 1, requests_per_thread);
for (int request = 0; request < requests_per_thread; request++)
{
TestGuard g(t, link, 1);
fairness_diff(1);
}
});
}
for (int thread = 0; thread < threads_per_queue; thread++)
{
t.threads.emplace_back([&]
{
ClassifierPtr c = t.manager->acquire("B");
ResourceLink link = c->get("res1");
t.startBusyPeriod(link, 1, requests_per_thread);
for (int request = 0; request < requests_per_thread; request++)
{
TestGuard g(t, link, 1);
fairness_diff(-1);
}
});
}
ClassifierPtr c = t.manager->acquire("leader");
ResourceLink link = c->get("res1");
t.blockResource(link);
t.wait(); // Wait for threads to finish before destructing locals
}
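The fairness check in the test above relies on a single signed counter: completions from A add 1, completions from B subtract 1, and the running value must stay within 1 of zero. The following is a minimal single-threaded sketch of that bookkeeping, not code from this commit. The helper name `maxUnfairness` and the idea of replaying a recorded completion order are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not part of the test suite): replays a completion order,
// where +1 means "a request from A finished" and -1 means "a request from B finished",
// and returns the largest absolute imbalance observed along the way.
int64_t maxUnfairness(const std::vector<int> & completion_order)
{
    std::atomic<int64_t> unfairness{0};
    int64_t worst = 0;
    for (int value : completion_order)
    {
        // fetch_add returns the previous value, so add `value` to obtain the current one
        int64_t current = unfairness.fetch_add(value, std::memory_order_relaxed) + value;
        worst = std::max<int64_t>(worst, std::llabs(current));
    }
    return worst;
}
```

A strictly alternating order keeps the imbalance at 1, which is what `EXPECT_NEAR(cur_unfairness, 0, 1)` tolerates, while two consecutive completions from the same workload would push it to 2.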
TEST(SchedulerIOResourceManager, DropNotEmptyQueue)
{
ResourceTest t;
t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)");
t.query("CREATE WORKLOAD all SETTINGS max_requests = 1");
t.query("CREATE WORKLOAD intermediate IN all");
std::barrier sync_before_enqueue(2);
std::barrier sync_before_drop(3);
std::barrier sync_after_drop(2);
t.async("intermediate", "res1", [&] (ResourceLink link)
{
TestGuard g(t, link, 1);
sync_before_enqueue.arrive_and_wait();
sync_before_drop.arrive_and_wait(); // 1st resource request is consuming
sync_after_drop.arrive_and_wait(); // 1st resource request is still consuming
});
sync_before_enqueue.arrive_and_wait(); // to maintain correct order of resource requests
t.async("intermediate", "res1", [&] (ResourceLink link)
{
TestGuard g(t, link, 1, EnqueueOnly);
sync_before_drop.arrive_and_wait(); // 2nd resource request is enqueued
g.waitFailed("is about to be destructed");
});
sync_before_drop.arrive_and_wait(); // main thread triggers FifoQueue destruction by adding a unified child
t.query("CREATE WORKLOAD leaf IN intermediate");
sync_after_drop.arrive_and_wait();
t.wait(); // Wait for threads to finish before destructing locals
}
TEST(SchedulerIOResourceManager, DropNotEmptyQueueLong)
{
ResourceTest t;
t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)");
t.query("CREATE WORKLOAD all SETTINGS max_requests = 1");
t.query("CREATE WORKLOAD intermediate IN all");
static constexpr int queue_size = 100;
std::barrier sync_before_enqueue(2);
std::barrier sync_before_drop(2 + queue_size);
std::barrier sync_after_drop(2);
t.async("intermediate", "res1", [&] (ResourceLink link)
{
TestGuard g(t, link, 1);
sync_before_enqueue.arrive_and_wait();
sync_before_drop.arrive_and_wait(); // 1st resource request is consuming
sync_after_drop.arrive_and_wait(); // 1st resource request is still consuming
});
sync_before_enqueue.arrive_and_wait(); // to maintain correct order of resource requests
for (int i = 0; i < queue_size; i++)
{
t.async("intermediate", "res1", [&] (ResourceLink link)
{
TestGuard g(t, link, 1, EnqueueOnly);
sync_before_drop.arrive_and_wait(); // many resource requests are enqueued
g.waitFailed("is about to be destructed");
});
}
sync_before_drop.arrive_and_wait(); // main thread triggers FifoQueue destruction by adding a unified child
t.query("CREATE WORKLOAD leaf IN intermediate");
sync_after_drop.arrive_and_wait();
t.wait(); // Wait for threads to finish before destructing locals
}

View File

@ -8,18 +8,17 @@ using namespace DB;
using ResourceTest = ResourceTestClass;
/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678
TEST(DISABLED_SchedulerFairPolicy, Factory)
TEST(SchedulerFairPolicy, Factory)
{
ResourceTest t;
Poco::AutoPtr cfg = new Poco::Util::XMLConfiguration();
SchedulerNodePtr fair = SchedulerNodeFactory::instance().get("fair", /* event_queue = */ nullptr, *cfg, "");
EventQueue event_queue;
SchedulerNodePtr fair = SchedulerNodeFactory::instance().get("fair", &event_queue, *cfg, "");
EXPECT_TRUE(dynamic_cast<FairPolicy *>(fair.get()) != nullptr);
}
TEST(DISABLED_SchedulerFairPolicy, FairnessWeights)
TEST(SchedulerFairPolicy, FairnessWeights)
{
ResourceTest t;
@ -43,7 +42,7 @@ TEST(DISABLED_SchedulerFairPolicy, FairnessWeights)
t.consumed("B", 20);
}
TEST(DISABLED_SchedulerFairPolicy, Activation)
TEST(SchedulerFairPolicy, Activation)
{
ResourceTest t;
@ -79,7 +78,7 @@ TEST(DISABLED_SchedulerFairPolicy, Activation)
t.consumed("B", 10);
}
TEST(DISABLED_SchedulerFairPolicy, FairnessMaxMin)
TEST(SchedulerFairPolicy, FairnessMaxMin)
{
ResourceTest t;
@ -103,7 +102,7 @@ TEST(DISABLED_SchedulerFairPolicy, FairnessMaxMin)
t.consumed("A", 20);
}
TEST(DISABLED_SchedulerFairPolicy, HierarchicalFairness)
TEST(SchedulerFairPolicy, HierarchicalFairness)
{
ResourceTest t;

View File

@ -8,18 +8,17 @@ using namespace DB;
using ResourceTest = ResourceTestClass;
/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678
TEST(DISABLED_SchedulerPriorityPolicy, Factory)
TEST(SchedulerPriorityPolicy, Factory)
{
ResourceTest t;
Poco::AutoPtr cfg = new Poco::Util::XMLConfiguration();
SchedulerNodePtr prio = SchedulerNodeFactory::instance().get("priority", /* event_queue = */ nullptr, *cfg, "");
EventQueue event_queue;
SchedulerNodePtr prio = SchedulerNodeFactory::instance().get("priority", &event_queue, *cfg, "");
EXPECT_TRUE(dynamic_cast<PriorityPolicy *>(prio.get()) != nullptr);
}
TEST(DISABLED_SchedulerPriorityPolicy, Priorities)
TEST(SchedulerPriorityPolicy, Priorities)
{
ResourceTest t;
@ -53,7 +52,7 @@ TEST(DISABLED_SchedulerPriorityPolicy, Priorities)
t.consumed("C", 0);
}
TEST(DISABLED_SchedulerPriorityPolicy, Activation)
TEST(SchedulerPriorityPolicy, Activation)
{
ResourceTest t;
@ -94,7 +93,7 @@ TEST(DISABLED_SchedulerPriorityPolicy, Activation)
t.consumed("C", 0);
}
TEST(DISABLED_SchedulerPriorityPolicy, SinglePriority)
TEST(SchedulerPriorityPolicy, SinglePriority)
{
ResourceTest t;

View File

@ -1,5 +1,6 @@
#include <gtest/gtest.h>
#include <Common/Scheduler/Nodes/SemaphoreConstraint.h>
#include <Common/Scheduler/Nodes/tests/ResourceTest.h>
#include <Common/Scheduler/SchedulerRoot.h>
@ -101,6 +102,11 @@ struct MyRequest : public ResourceRequest
if (on_execute)
on_execute();
}
void failed(const std::exception_ptr &) override
{
FAIL();
}
};
TEST(SchedulerRoot, Smoke)
@ -108,14 +114,14 @@ TEST(SchedulerRoot, Smoke)
ResourceTest t;
ResourceHolder r1(t);
auto * fc1 = r1.add<ConstraintTest>("/", "<max_requests>1</max_requests>");
auto * fc1 = r1.add<SemaphoreConstraint>("/", "<max_requests>1</max_requests>");
r1.add<PriorityPolicy>("/prio");
auto a = r1.addQueue("/prio/A", "<priority>1</priority>");
auto b = r1.addQueue("/prio/B", "<priority>2</priority>");
r1.registerResource();
ResourceHolder r2(t);
auto * fc2 = r2.add<ConstraintTest>("/", "<max_requests>1</max_requests>");
auto * fc2 = r2.add<SemaphoreConstraint>("/", "<max_requests>1</max_requests>");
r2.add<PriorityPolicy>("/prio");
auto c = r2.addQueue("/prio/C", "<priority>-1</priority>");
auto d = r2.addQueue("/prio/D", "<priority>-2</priority>");
@ -123,25 +129,25 @@ TEST(SchedulerRoot, Smoke)
{
ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), a);
EXPECT_TRUE(fc1->requests.contains(&rg.request));
EXPECT_TRUE(fc1->getInflights().first == 1);
rg.consume(1);
}
{
ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), b);
EXPECT_TRUE(fc1->requests.contains(&rg.request));
EXPECT_TRUE(fc1->getInflights().first == 1);
rg.consume(1);
}
{
ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), c);
EXPECT_TRUE(fc2->requests.contains(&rg.request));
EXPECT_TRUE(fc2->getInflights().first == 1);
rg.consume(1);
}
{
ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), d);
EXPECT_TRUE(fc2->requests.contains(&rg.request));
EXPECT_TRUE(fc2->getInflights().first == 1);
rg.consume(1);
}
}
@ -151,7 +157,7 @@ TEST(SchedulerRoot, Budget)
ResourceTest t;
ResourceHolder r1(t);
r1.add<ConstraintTest>("/", "<max_requests>1</max_requests>");
r1.add<SemaphoreConstraint>("/", "<max_requests>1</max_requests>");
r1.add<PriorityPolicy>("/prio");
auto a = r1.addQueue("/prio/A", "");
r1.registerResource();
@ -176,7 +182,7 @@ TEST(SchedulerRoot, Cancel)
ResourceTest t;
ResourceHolder r1(t);
auto * fc1 = r1.add<ConstraintTest>("/", "<max_requests>1</max_requests>");
auto * fc1 = r1.add<SemaphoreConstraint>("/", "<max_requests>1</max_requests>");
r1.add<PriorityPolicy>("/prio");
auto a = r1.addQueue("/prio/A", "<priority>1</priority>");
auto b = r1.addQueue("/prio/B", "<priority>2</priority>");
@ -189,7 +195,7 @@ TEST(SchedulerRoot, Cancel)
MyRequest request(1,[&]
{
sync.arrive_and_wait(); // (A)
EXPECT_TRUE(fc1->requests.contains(&request));
EXPECT_TRUE(fc1->getInflights().first == 1);
sync.arrive_and_wait(); // (B)
request.finish();
destruct_sync.arrive_and_wait(); // (C)
@ -214,5 +220,5 @@ TEST(SchedulerRoot, Cancel)
consumer1.join();
consumer2.join();
EXPECT_TRUE(fc1->requests.empty());
EXPECT_TRUE(fc1->getInflights().first == 0);
}

View File

@ -10,9 +10,7 @@ using namespace DB;
using ResourceTest = ResourceTestClass;
/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678
TEST(DISABLED_SchedulerThrottlerConstraint, LeakyBucketConstraint)
TEST(SchedulerThrottlerConstraint, LeakyBucketConstraint)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
@ -42,7 +40,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, LeakyBucketConstraint)
t.consumed("A", 10);
}
TEST(DISABLED_SchedulerThrottlerConstraint, Unlimited)
TEST(SchedulerThrottlerConstraint, Unlimited)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
@ -59,7 +57,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, Unlimited)
}
}
TEST(DISABLED_SchedulerThrottlerConstraint, Pacing)
TEST(SchedulerThrottlerConstraint, Pacing)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
@ -79,7 +77,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, Pacing)
}
}
TEST(DISABLED_SchedulerThrottlerConstraint, BucketFilling)
TEST(SchedulerThrottlerConstraint, BucketFilling)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
@ -113,7 +111,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, BucketFilling)
t.consumed("A", 3);
}
TEST(DISABLED_SchedulerThrottlerConstraint, PeekAndAvgLimits)
TEST(SchedulerThrottlerConstraint, PeekAndAvgLimits)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
@ -141,7 +139,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, PeekAndAvgLimits)
}
}
TEST(DISABLED_SchedulerThrottlerConstraint, ThrottlerAndFairness)
TEST(SchedulerThrottlerConstraint, ThrottlerAndFairness)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
@ -160,22 +158,22 @@ TEST(DISABLED_SchedulerThrottlerConstraint, ThrottlerAndFairness)
t.enqueue("/fair/B", {req_cost});
}
double shareA = 0.1;
double shareB = 0.9;
double share_a = 0.1;
double share_b = 0.9;
// Bandwidth-latency coupling due to fairness: worst latency is inversely proportional to share
auto max_latencyA = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / shareA));
auto max_latencyB = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / shareB));
auto max_latency_a = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share_a));
auto max_latency_b = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share_b));
double consumedA = 0;
double consumedB = 0;
double consumed_a = 0;
double consumed_b = 0;
for (int seconds = 0; seconds < 100; seconds++)
{
t.process(start + std::chrono::seconds(seconds));
double arrival_curve = 100.0 + 10.0 * seconds + req_cost;
t.consumed("A", static_cast<ResourceCost>(arrival_curve * shareA - consumedA), max_latencyA);
t.consumed("B", static_cast<ResourceCost>(arrival_curve * shareB - consumedB), max_latencyB);
consumedA = arrival_curve * shareA;
consumedB = arrival_curve * shareB;
t.consumed("A", static_cast<ResourceCost>(arrival_curve * share_a - consumed_a), max_latency_a);
t.consumed("B", static_cast<ResourceCost>(arrival_curve * share_b - consumed_b), max_latency_b);
consumed_a = arrival_curve * share_a;
consumed_b = arrival_curve * share_b;
}
}
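The latency bound used in `ThrottlerAndFairness` is plain arithmetic: `req_cost * (1 + 1 / share)`, truncated to an integer cost. The sketch below isolates that formula so the effect of the share is easy to see. The function name `maxLatency` and the `ResourceCost` alias (assumed here to be a signed 64-bit integer) are illustrative stand-ins, not definitions taken from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a signed 64-bit integer standing in for the test's `ResourceCost` type.
using ResourceCost = int64_t;

// Hypothetical helper mirroring the expression used in the test:
// the worst-case latency grows as the share shrinks.
ResourceCost maxLatency(ResourceCost req_cost, double share)
{
    assert(share > 0.0 && share <= 1.0);
    return static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share));
}
```

For example, with a request cost of 100 a share of 0.5 gives a bound of 300, while a share of 0.9 gives 211 after truncation, so the smaller share pays with a looser latency bound.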

View File

@ -0,0 +1,748 @@
#include <chrono>
#include <gtest/gtest.h>
#include <Common/Scheduler/ResourceGuard.h>
#include <Common/Scheduler/ResourceLink.h>
#include <Common/Scheduler/Nodes/tests/ResourceTest.h>
#include <Common/Priority.h>
#include <Common/Scheduler/Nodes/FairPolicy.h>
#include <Common/Scheduler/Nodes/UnifiedSchedulerNode.h>
using namespace DB;
using ResourceTest = ResourceTestClass;
TEST(SchedulerUnifiedNode, Smoke)
{
ResourceTest t;
t.addCustom<UnifiedSchedulerNode>("/", SchedulingSettings{});
t.enqueue("/fifo", {10, 10});
t.dequeue(2);
t.consumed("fifo", 20);
}
TEST(SchedulerUnifiedNode, FairnessWeight)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}});
auto b = t.createUnifiedNode("B", all, {.weight = 3.0, .priority = Priority{}});
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
t.dequeue(4);
t.consumed("A", 10);
t.consumed("B", 30);
t.dequeue(4);
t.consumed("A", 10);
t.consumed("B", 30);
t.dequeue();
t.consumed("A", 60);
t.consumed("B", 20);
}
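The expectations in `FairnessWeight` (10 for A and 30 for B per four requests) follow from weight-proportional sharing. As a rough illustration, the sketch below serves the child with the smallest consumed-cost-to-weight ratio. It is a simplified model written for this note, not the `FairPolicy` or `UnifiedSchedulerNode` implementation: the names `Child` and `dequeueFair` are hypothetical, ties go to the first child, and re-activation of an emptied queue is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model of one child queue (illustration only).
struct Child
{
    int64_t weight;            // relative share, must be positive
    std::deque<int64_t> queue; // pending request costs, FIFO
    int64_t consumed = 0;      // total cost already served
};

// Serves up to `count` requests and returns the cost served per child in this batch.
std::vector<int64_t> dequeueFair(std::vector<Child> & children, size_t count)
{
    std::vector<int64_t> served(children.size(), 0);
    for (size_t step = 0; step < count; ++step)
    {
        Child * best = nullptr;
        size_t best_index = 0;
        for (size_t i = 0; i < children.size(); ++i)
        {
            Child & c = children[i];
            assert(c.weight > 0);
            if (c.queue.empty())
                continue;
            // Compare consumed/weight without division: a/wa < b/wb  <=>  a*wb < b*wa
            if (!best || c.consumed * best->weight < best->consumed * c.weight)
            {
                best = &c;
                best_index = i;
            }
        }
        if (!best)
            break; // nothing left to serve
        int64_t cost = best->queue.front();
        best->queue.pop_front();
        best->consumed += cost;
        served[best_index] += cost;
    }
    return served;
}
```

With weights 1 and 3 and eight requests of cost 10 in each queue, two batches of four requests each split 10/30 in this model, and the final drain serves the remaining 60 from A and 20 from B, matching the numbers the test expects.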
TEST(SchedulerUnifiedNode, FairnessActivation)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all);
auto b = t.createUnifiedNode("B", all);
auto c = t.createUnifiedNode("C", all);
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(b, {10});
t.enqueue(c, {10, 10});
t.dequeue(3);
t.consumed("A", 10);
t.consumed("B", 10);
t.consumed("C", 10);
t.dequeue(4);
t.consumed("A", 30);
t.consumed("B", 0);
t.consumed("C", 10);
t.enqueue(b, {10, 10});
t.dequeue(1);
t.consumed("B", 10);
t.enqueue(c, {10, 10});
t.dequeue(1);
t.consumed("C", 10);
t.dequeue(2); // A B or B A
t.consumed("A", 10);
t.consumed("B", 10);
}
TEST(SchedulerUnifiedNode, FairnessMaxMin)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all);
auto b = t.createUnifiedNode("B", all);
t.enqueue(a, {10, 10}); // make sure A is never empty
for (int i = 0; i < 10; i++)
{
t.enqueue(a, {10, 10, 10, 10});
t.enqueue(b, {10, 10});
t.dequeue(6);
t.consumed("A", 40);
t.consumed("B", 20);
}
t.dequeue(2);
t.consumed("A", 20);
}
TEST(SchedulerUnifiedNode, FairnessHierarchical)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto x = t.createUnifiedNode("X", all);
auto y = t.createUnifiedNode("Y", all);
auto a = t.createUnifiedNode("A", x);
auto b = t.createUnifiedNode("B", x);
auto c = t.createUnifiedNode("C", y);
auto d = t.createUnifiedNode("D", y);
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
for (int i = 0; i < 4; i++)
{
t.dequeue(8);
t.consumed("A", 20);
t.consumed("B", 20);
t.consumed("C", 20);
t.consumed("D", 20);
}
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
for (int i = 0; i < 4; i++)
{
t.dequeue(8);
t.consumed("A", 40);
t.consumed("C", 20);
t.consumed("D", 20);
}
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
for (int i = 0; i < 4; i++)
{
t.dequeue(8);
t.consumed("B", 40);
t.consumed("C", 20);
t.consumed("D", 20);
}
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
for (int i = 0; i < 4; i++)
{
t.dequeue(8);
t.consumed("A", 20);
t.consumed("B", 20);
t.consumed("C", 40);
}
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
for (int i = 0; i < 4; i++)
{
t.dequeue(8);
t.consumed("A", 20);
t.consumed("B", 20);
t.consumed("D", 40);
}
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
for (int i = 0; i < 4; i++)
{
t.dequeue(8);
t.consumed("A", 40);
t.consumed("D", 40);
}
}
TEST(SchedulerUnifiedNode, Priority)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all, {.priority = Priority{3}});
auto b = t.createUnifiedNode("B", all, {.priority = Priority{2}});
auto c = t.createUnifiedNode("C", all, {.priority = Priority{1}});
t.enqueue(a, {10, 10, 10});
t.enqueue(b, {10, 10, 10});
t.enqueue(c, {10, 10, 10});
t.dequeue(2);
t.consumed("A", 0);
t.consumed("B", 0);
t.consumed("C", 20);
t.dequeue(2);
t.consumed("A", 0);
t.consumed("B", 10);
t.consumed("C", 10);
t.dequeue(2);
t.consumed("A", 0);
t.consumed("B", 20);
t.consumed("C", 0);
t.dequeue();
t.consumed("A", 30);
t.consumed("B", 0);
t.consumed("C", 0);
}
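In the `Priority` test the child with the smallest priority value (C, with `Priority{1}`) is drained first, and A (with `Priority{3}`) is served only once B and C are empty. A minimal sketch of that strict-priority selection is shown below. It is an illustrative model under that reading of the test, and the names `PriorityChild` and `dequeueByPriority` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model of one child queue (illustration only).
struct PriorityChild
{
    int64_t priority;          // smaller value is served first, as in the test above
    std::deque<int64_t> queue; // pending request costs, FIFO
};

// Serves up to `count` requests and returns the cost served per child in this batch.
std::vector<int64_t> dequeueByPriority(std::vector<PriorityChild> & children, size_t count)
{
    std::vector<int64_t> served(children.size(), 0);
    for (size_t step = 0; step < count; ++step)
    {
        bool found = false;
        size_t best = 0;
        for (size_t i = 0; i < children.size(); ++i)
        {
            if (children[i].queue.empty())
                continue;
            // Pick the non-empty child with the smallest priority value
            if (!found || children[i].priority < children[best].priority)
            {
                found = true;
                best = i;
            }
        }
        if (!found)
            break; // all queues are empty
        served[best] += children[best].queue.front();
        children[best].queue.pop_front();
    }
    return served;
}
```

Replaying the test's input (three requests of cost 10 per child) through this model yields the same per-batch totals as the `t.consumed(...)` calls: C first, then B, then A.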
TEST(SchedulerUnifiedNode, PriorityActivation)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all, {.priority = Priority{3}});
auto b = t.createUnifiedNode("B", all, {.priority = Priority{2}});
auto c = t.createUnifiedNode("C", all, {.priority = Priority{1}});
t.enqueue(a, {10, 10, 10, 10, 10, 10});
t.enqueue(b, {10});
t.enqueue(c, {10, 10});
t.dequeue(3);
t.consumed("A", 0);
t.consumed("B", 10);
t.consumed("C", 20);
t.dequeue(2);
t.consumed("A", 20);
t.consumed("B", 0);
t.consumed("C", 0);
t.enqueue(b, {10, 10, 10});
t.dequeue(2);
t.consumed("A", 0);
t.consumed("B", 20);
t.consumed("C", 0);
t.enqueue(c, {10, 10});
t.dequeue(3);
t.consumed("A", 0);
t.consumed("B", 10);
t.consumed("C", 20);
t.dequeue(2);
t.consumed("A", 20);
t.consumed("B", 0);
t.consumed("C", 0);
}
TEST(SchedulerUnifiedNode, List)
{
ResourceTest t;
std::list<UnifiedSchedulerNodePtr> list;
list.push_back(t.createUnifiedNode("all"));
for (int length = 1; length < 5; length++)
{
String name = fmt::format("L{}", length);
list.push_back(t.createUnifiedNode(name, list.back()));
for (int i = 0; i < 3; i++)
{
t.enqueue(list.back(), {10, 10});
t.dequeue(1);
t.consumed(name, 10);
for (int j = 0; j < 3; j++)
{
t.enqueue(list.back(), {10, 10, 10});
t.dequeue(1);
t.consumed(name, 10);
t.dequeue(1);
t.consumed(name, 10);
t.dequeue(1);
t.consumed(name, 10);
}
t.dequeue(1);
t.consumed(name, 10);
}
}
}
TEST(SchedulerUnifiedNode, ThrottlerLeakyBucket)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
t.process(start, 0);
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 20.0});
t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10});
t.process(start + std::chrono::seconds(0));
t.consumed("all", 30); // It is allowed to go below zero for exactly one resource request
t.process(start + std::chrono::seconds(1));
t.consumed("all", 10);
t.process(start + std::chrono::seconds(2));
t.consumed("all", 10);
t.process(start + std::chrono::seconds(3));
t.consumed("all", 10);
t.process(start + std::chrono::seconds(4));
t.consumed("all", 10);
t.process(start + std::chrono::seconds(100500));
t.consumed("all", 10);
}
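The numbers in `ThrottlerLeakyBucket` (30 served immediately, then 10 per second) can be reproduced with a small token-bucket model in which requests are served while the token count is non-negative, so exactly one request may push it below zero. The class below is a simplified sketch written for this note, assuming the bucket starts full and time is passed in as seconds. `ThrottlerModel` and its methods are hypothetical names, and this is not the throttler implementation used by the scheduler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical token-bucket model (illustration only).
class ThrottlerModel
{
public:
    ThrottlerModel(double max_speed_, double max_burst_)
        : max_speed(max_speed_), max_burst(max_burst_), tokens(max_burst_) // assumption: bucket starts full
    {
        assert(max_speed > 0.0 && max_burst >= 0.0);
    }

    void enqueue(int64_t cost) { queue.push_back(cost); }

    // Advances the model clock to `now_seconds` and returns the total cost served.
    int64_t process(double now_seconds)
    {
        assert(now_seconds >= last_update);
        // Refill proportionally to elapsed time, capped by the burst size
        tokens = std::min(max_burst, tokens + (now_seconds - last_update) * max_speed);
        last_update = now_seconds;

        int64_t served = 0;
        // Serve while the bucket is not negative: the last request may overdraw it
        while (!queue.empty() && tokens >= 0.0)
        {
            int64_t cost = queue.front();
            queue.pop_front();
            tokens -= static_cast<double>(cost);
            served += cost;
        }
        return served;
    }

private:
    double max_speed;
    double max_burst;
    double tokens;
    double last_update = 0.0;
    std::deque<int64_t> queue;
};
```

With `max_speed = 10` and `max_burst = 20`, eight requests of cost 10 are served as 30 at time 0 and then 10 at each of the next four seconds in this model, leaving one request for any later call, which is the same schedule the test asserts.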
TEST(SchedulerUnifiedNode, ThrottlerPacing)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
t.process(start, 0);
// Zero burst allows you to send one request of any `size` and then throttle for `size/max_speed` seconds.
// Useful if outgoing traffic should be "paced", i.e. have the least possible burstiness.
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 1.0, .max_burst = 0.0});
t.enqueue(all, {1, 2, 3, 1, 2, 1});
int output[] = {1, 2, 0, 3, 0, 0, 1, 2, 0, 1, 0};
for (int i = 0; i < std::size(output); i++)
{
t.process(start + std::chrono::seconds(i));
t.consumed("all", output[i]);
}
}
TEST(SchedulerUnifiedNode, ThrottlerBucketFilling)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
t.process(start, 0);
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0});
t.enqueue(all, {100});
t.process(start + std::chrono::seconds(0));
t.consumed("all", 100); // consume all tokens, but it is still active (not negative)
t.process(start + std::chrono::seconds(5));
t.consumed("all", 0); // There was nothing to consume
t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10, 10, 10});
t.process(start + std::chrono::seconds(5));
t.consumed("all", 60); // 5 sec * 10 tokens/sec = 50 tokens + 1 extra request to go below zero
t.process(start + std::chrono::seconds(100));
t.consumed("all", 40); // Consume rest
t.process(start + std::chrono::seconds(200));
t.enqueue(all, {95, 1, 1, 1, 1, 1, 1, 1, 1, 1});
t.process(start + std::chrono::seconds(200));
t.consumed("all", 101); // check we cannot consume more than max_burst + 1 request
t.process(start + std::chrono::seconds(100500));
t.consumed("all", 3);
}
TEST(SchedulerUnifiedNode, ThrottlerAndFairness)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
t.process(start, 0);
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0});
auto a = t.createUnifiedNode("A", all, {.weight = 10.0, .priority = Priority{}});
auto b = t.createUnifiedNode("B", all, {.weight = 90.0, .priority = Priority{}});
ResourceCost req_cost = 1;
ResourceCost total_cost = 2000;
for (int i = 0; i < total_cost / req_cost; i++)
{
t.enqueue(a, {req_cost});
t.enqueue(b, {req_cost});
}
double share_a = 0.1;
double share_b = 0.9;
// Bandwidth-latency coupling due to fairness: worst latency is inversely proportional to share
auto max_latency_a = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share_a));
auto max_latency_b = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share_b));
double consumed_a = 0;
double consumed_b = 0;
for (int seconds = 0; seconds < 100; seconds++)
{
t.process(start + std::chrono::seconds(seconds));
double arrival_curve = 100.0 + 10.0 * seconds + req_cost;
t.consumed("A", static_cast<ResourceCost>(arrival_curve * share_a - consumed_a), max_latency_a);
t.consumed("B", static_cast<ResourceCost>(arrival_curve * share_b - consumed_b), max_latency_b);
consumed_a = arrival_curve * share_a;
consumed_b = arrival_curve * share_b;
}
}
TEST(SchedulerUnifiedNode, QueueWithRequestsDestruction)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
t.enqueue(all, {10, 10}); // enqueue requests to be canceled
// This will destroy the queue and fail both requests
auto a = t.createUnifiedNode("A", all);
t.failed(20);
// Check that everything works fine after destruction
auto b = t.createUnifiedNode("B", all);
t.enqueue(a, {10, 10}); // make sure A is never empty
for (int i = 0; i < 10; i++)
{
t.enqueue(a, {10, 10, 10, 10});
t.enqueue(b, {10, 10});
t.dequeue(6);
t.consumed("A", 40);
t.consumed("B", 20);
}
t.dequeue(2);
t.consumed("A", 20);
}
TEST(SchedulerUnifiedNode, ResourceGuardException)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
t.enqueue(all, {10, 10}); // enqueue requests to be canceled
std::thread consumer([queue = all->getQueue()]
{
ResourceLink link{.queue = queue.get()};
bool caught = false;
try
{
ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), link);
}
catch (...)
{
caught = true;
}
ASSERT_TRUE(caught);
});
// This will destroy the queue and fail both requests
auto a = t.createUnifiedNode("A", all);
t.failed(20);
consumer.join();
// Check that everything works fine after destruction
auto b = t.createUnifiedNode("B", all);
t.enqueue(a, {10, 10}); // make sure A is never empty
for (int i = 0; i < 10; i++)
{
t.enqueue(a, {10, 10, 10, 10});
t.enqueue(b, {10, 10});
t.dequeue(6);
t.consumed("A", 40);
t.consumed("B", 20);
}
t.dequeue(2);
t.consumed("A", 20);
}
TEST(SchedulerUnifiedNode, UpdateWeight)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}});
auto b = t.createUnifiedNode("B", all, {.weight = 3.0, .priority = Priority{}});
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
t.dequeue(4);
t.consumed("A", 10);
t.consumed("B", 30);
t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{}});
t.dequeue(4);
t.consumed("A", 20);
t.consumed("B", 20);
t.dequeue(4);
t.consumed("A", 20);
t.consumed("B", 20);
}
TEST(SchedulerUnifiedNode, UpdatePriority)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}});
auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{}});
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
t.dequeue(2);
t.consumed("A", 10);
t.consumed("B", 10);
t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{-1}});
t.dequeue(2);
t.consumed("A", 20);
t.consumed("B", 0);
t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{-2}});
t.dequeue(2);
t.consumed("A", 0);
t.consumed("B", 20);
t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{-2}});
t.dequeue(2);
t.consumed("A", 10);
t.consumed("B", 10);
}
TEST(SchedulerUnifiedNode, UpdateParentOfLeafNode)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}});
auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}});
auto x = t.createUnifiedNode("X", a, {});
auto y = t.createUnifiedNode("Y", b, {});
t.enqueue(x, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(y, {10, 10, 10, 10, 10, 10, 10, 10});
t.dequeue(2);
t.consumed("X", 20);
t.consumed("Y", 0);
t.updateUnifiedNode(x, a, b, {});
t.dequeue(2);
t.consumed("X", 10);
t.consumed("Y", 10);
t.updateUnifiedNode(y, b, a, {});
t.dequeue(2);
t.consumed("X", 0);
t.consumed("Y", 20);
t.updateUnifiedNode(y, a, all, {});
t.updateUnifiedNode(x, b, all, {});
t.dequeue(4);
t.consumed("X", 20);
t.consumed("Y", 20);
}
TEST(SchedulerUnifiedNode, UpdatePriorityOfIntermediateNode)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}});
auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}});
auto x1 = t.createUnifiedNode("X1", a, {});
auto y1 = t.createUnifiedNode("Y1", b, {});
auto x2 = t.createUnifiedNode("X2", a, {});
auto y2 = t.createUnifiedNode("Y2", b, {});
t.enqueue(x1, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(y1, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(x2, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(y2, {10, 10, 10, 10, 10, 10, 10, 10});
t.dequeue(4);
t.consumed("X1", 20);
t.consumed("Y1", 0);
t.consumed("X2", 20);
t.consumed("Y2", 0);
t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{2}});
t.dequeue(4);
t.consumed("X1", 10);
t.consumed("Y1", 10);
t.consumed("X2", 10);
t.consumed("Y2", 10);
t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{1}});
t.dequeue(4);
t.consumed("X1", 0);
t.consumed("Y1", 20);
t.consumed("X2", 0);
t.consumed("Y2", 20);
}
TEST(SchedulerUnifiedNode, UpdateParentOfIntermediateNode)
{
ResourceTest t;
auto all = t.createUnifiedNode("all");
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}});
auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}});
auto c = t.createUnifiedNode("C", a, {});
auto d = t.createUnifiedNode("D", b, {});
auto x1 = t.createUnifiedNode("X1", c, {});
auto y1 = t.createUnifiedNode("Y1", d, {});
auto x2 = t.createUnifiedNode("X2", c, {});
auto y2 = t.createUnifiedNode("Y2", d, {});
t.enqueue(x1, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(y1, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(x2, {10, 10, 10, 10, 10, 10, 10, 10});
t.enqueue(y2, {10, 10, 10, 10, 10, 10, 10, 10});
t.dequeue(4);
t.consumed("X1", 20);
t.consumed("Y1", 0);
t.consumed("X2", 20);
t.consumed("Y2", 0);
t.updateUnifiedNode(c, a, b, {});
t.dequeue(4);
t.consumed("X1", 10);
t.consumed("Y1", 10);
t.consumed("X2", 10);
t.consumed("Y2", 10);
t.updateUnifiedNode(d, b, a, {});
t.dequeue(4);
t.consumed("X1", 0);
t.consumed("Y1", 20);
t.consumed("X2", 0);
t.consumed("Y2", 20);
}
TEST(SchedulerUnifiedNode, UpdateThrottlerMaxSpeed)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
t.process(start, 0);
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 20.0});
t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10});
t.process(start + std::chrono::seconds(0));
t.consumed("all", 30); // It is allowed to go below zero for exactly one resource request
t.process(start + std::chrono::seconds(1));
t.consumed("all", 10);
t.process(start + std::chrono::seconds(2));
t.consumed("all", 10);
t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 1.0, .max_burst = 20.0});
t.process(start + std::chrono::seconds(12));
t.consumed("all", 10);
t.process(start + std::chrono::seconds(22));
t.consumed("all", 10);
t.process(start + std::chrono::seconds(100500));
t.consumed("all", 10);
}
TEST(SchedulerUnifiedNode, UpdateThrottlerMaxBurst)
{
ResourceTest t;
EventQueue::TimePoint start = std::chrono::system_clock::now();
t.process(start, 0);
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0});
t.enqueue(all, {100});
t.process(start + std::chrono::seconds(0));
t.consumed("all", 100); // consume all tokens, but it is still active (not negative)
t.process(start + std::chrono::seconds(2));
t.consumed("all", 0); // There was nothing to consume
t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 10.0, .max_burst = 30.0});
t.process(start + std::chrono::seconds(5));
t.consumed("all", 0); // There was nothing to consume
t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10, 10, 10});
t.process(start + std::chrono::seconds(5));
t.consumed("all", 40); // min(30 tokens, 5 sec * 10 tokens/sec) = 30 tokens + 1 extra request to go below zero
t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0});
t.process(start + std::chrono::seconds(100));
t.consumed("all", 60); // Consume rest
t.process(start + std::chrono::seconds(150));
t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 100.0, .max_burst = 200.0});
t.process(start + std::chrono::seconds(200));
t.enqueue(all, {195, 1, 1, 1, 1, 1, 1, 1, 1, 1});
t.process(start + std::chrono::seconds(200));
t.consumed("all", 201); // check we cannot consume more than max_burst + 1 request
t.process(start + std::chrono::seconds(100500));
t.consumed("all", 3);
}
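
The throttler tests above (ThrottlerLeakyBucket, ThrottlerPacing, ThrottlerBucketFilling) rely on one rule stated in their comments: the token balance may go below zero for exactly one request. The following is a minimal sketch of that rule, not the scheduler's actual throttler. `ToyThrottler`, `enqueue` and `process` are hypothetical names, and two assumptions are inferred from the expected values in the tests rather than from the implementation: the bucket starts full, and a request is admitted whenever the balance is not negative.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <initializer_list>

// Simplified token-bucket model (illustrative only, not the ClickHouse throttler).
class ToyThrottler
{
public:
    // Assumption: the bucket starts full, i.e. with `max_burst` tokens.
    ToyThrottler(double max_speed_, double max_burst_)
        : max_speed(max_speed_), max_burst(max_burst_), tokens(max_burst_) {}

    void enqueue(std::initializer_list<std::int64_t> costs)
    {
        queue.insert(queue.end(), costs.begin(), costs.end());
    }

    // Advance the clock to `now_seconds` and return the total cost admitted at that moment.
    std::int64_t process(double now_seconds)
    {
        // Refill in proportion to elapsed time, but never above the burst limit.
        tokens = std::min(max_burst, tokens + max_speed * (now_seconds - last_seconds));
        last_seconds = now_seconds;

        std::int64_t consumed = 0;
        // A request is admitted while the balance is not negative,
        // so the last admitted request may push the balance below zero.
        while (!queue.empty() && tokens >= 0)
        {
            tokens -= static_cast<double>(queue.front());
            consumed += queue.front();
            queue.pop_front();
        }
        return consumed;
    }

private:
    double max_speed;
    double max_burst;
    double tokens;
    double last_seconds = 0;
    std::deque<std::int64_t> queue;
};
```

With `max_speed = 10`, `max_burst = 20` and eight requests of cost 10, this model admits 30 at second 0 and 10 at each of the next four seconds, which matches the expectations in ThrottlerLeakyBucket. With `max_burst = 0` it reproduces the paced output sequence of ThrottlerPacing.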


@ -12,6 +12,7 @@
#include <Common/CurrentMetrics.h>
#include <condition_variable>
#include <exception>
#include <mutex>
@ -34,6 +35,11 @@ namespace CurrentMetrics
namespace DB
{
namespace ErrorCodes
{
extern const int RESOURCE_ACCESS_DENIED;
}
/*
* Scoped resource guard.
* Waits for resource to be available in constructor and releases resource in destructor
@ -109,12 +115,25 @@ public:
dequeued_cv.notify_one();
}
// This function is executed inside scheduler thread and wakes thread that issued this `request`.
// That thread will throw an exception.
void failed(const std::exception_ptr & ptr) override
{
std::unique_lock lock(mutex);
chassert(state == Enqueued);
state = Dequeued;
exception = ptr;
dequeued_cv.notify_one();
}
void wait()
{
CurrentMetrics::Increment scheduled(metrics->scheduled_count);
auto timer = CurrentThread::getProfileEvents().timer(metrics->wait_microseconds);
std::unique_lock lock(mutex);
dequeued_cv.wait(lock, [this] { return state == Dequeued; });
if (exception)
throw Exception(ErrorCodes::RESOURCE_ACCESS_DENIED, "Resource request failed: {}", getExceptionMessage(exception, /* with_stacktrace = */ false));
}
void finish(ResourceCost real_cost_, ResourceLink link_)
@ -151,6 +170,7 @@ public:
std::mutex mutex;
std::condition_variable dequeued_cv;
RequestState state = Finished;
std::exception_ptr exception;
};
/// Creates pending request for resource; blocks while resource is not available (unless `Lock::Defer`)


@ -1,55 +0,0 @@
#pragma once
#include <Common/ErrorCodes.h>
#include <Common/Exception.h>
#include <Common/Scheduler/IResourceManager.h>
#include <boost/noncopyable.hpp>
#include <memory>
#include <mutex>
#include <unordered_map>
namespace DB
{
namespace ErrorCodes
{
extern const int INVALID_SCHEDULER_NODE;
}
class ResourceManagerFactory : private boost::noncopyable
{
public:
static ResourceManagerFactory & instance()
{
static ResourceManagerFactory ret;
return ret;
}
ResourceManagerPtr get(const String & name)
{
std::lock_guard lock{mutex};
if (auto iter = methods.find(name); iter != methods.end())
return iter->second();
throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Unknown scheduler node type: {}", name);
}
template <class TDerived>
void registerMethod(const String & name)
{
std::lock_guard lock{mutex};
methods[name] = [] ()
{
return std::make_shared<TDerived>();
};
}
private:
std::mutex mutex;
using Method = std::function<ResourceManagerPtr()>;
std::unordered_map<String, Method> methods;
};
}


@ -1,13 +1,34 @@
#include <Common/Scheduler/ResourceRequest.h>
#include <Common/Scheduler/ISchedulerConstraint.h>
#include <Common/Exception.h>
#include <ranges>
namespace DB
{
void ResourceRequest::finish()
{
if (constraint)
constraint->finishRequest(this);
// Iterate over constraints in reverse order
for (ISchedulerConstraint * constraint : std::ranges::reverse_view(constraints))
{
if (constraint)
constraint->finishRequest(this);
}
}
bool ResourceRequest::addConstraint(ISchedulerConstraint * new_constraint)
{
for (auto & constraint : constraints)
{
if (!constraint)
{
constraint = new_constraint;
return true;
}
}
return false;
}
}
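
The change above replaces a single `constraint` pointer with a fixed-size chain that is filled slot by slot and released in reverse order. A minimal standalone sketch of that pattern follows. `ToyConstraint`, `ToyRequest` and `toy_max_constraints` are hypothetical stand-ins, and the sketch uses reverse iterators instead of `std::ranges::reverse_view` so that it also builds as C++17.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a scheduler constraint: records the order of release notifications.
struct ToyConstraint
{
    std::string name;
    std::vector<std::string> * log;
    void finishRequest() { log->push_back(name); }
};

constexpr std::size_t toy_max_constraints = 8; // same depth limit as ResourceMaxConstraints

struct ToyRequest
{
    // Value-initialized, so every slot starts as nullptr; no heap allocation is needed.
    std::array<ToyConstraint *, toy_max_constraints> constraints{};

    // Put the constraint into the first free slot; report failure if the chain is full.
    bool addConstraint(ToyConstraint * new_constraint)
    {
        for (auto & slot : constraints)
        {
            if (!slot)
            {
                slot = new_constraint;
                return true;
            }
        }
        return false;
    }

    // Notify constraints in reverse order of registration, skipping unused slots.
    void finish()
    {
        for (auto it = constraints.rbegin(); it != constraints.rend(); ++it)
        {
            if (*it)
                (*it)->finishRequest();
        }
    }
};
```

A request that registered `first` and then `second` notifies `second` before `first` on `finish()`. A ninth `addConstraint` call on the same request returns `false`, which the caller has to handle.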


@ -2,7 +2,9 @@
#include <boost/intrusive/list.hpp>
#include <base/types.h>
#include <array>
#include <limits>
#include <exception>
namespace DB
{
@ -15,6 +17,9 @@ class ISchedulerConstraint;
using ResourceCost = Int64;
constexpr ResourceCost ResourceCostMax = std::numeric_limits<int>::max();
/// Max number of constraints for a request to pass through (depth of constraints chain)
constexpr size_t ResourceMaxConstraints = 8;
/*
* Request for a resource consumption. The main moving part of the scheduling subsystem.
* Resource requests processing workflow:
@ -39,8 +44,7 @@ constexpr ResourceCost ResourceCostMax = std::numeric_limits<int>::max();
*
* Request can also be canceled before (3) using ISchedulerQueue::cancelRequest().
* Returning false means it is too late for request to be canceled. It should be processed in a regular way.
* Returning true means successful cancel and therefore steps (4) and (5) are not going to happen
* and step (6) MUST be omitted.
* Returning true means successful cancel and therefore steps (4) and (5) are not going to happen.
*/
class ResourceRequest : public boost::intrusive::list_base_hook<>
{
@ -49,9 +53,10 @@ public:
/// NOTE: If cost is not known in advance, ResourceBudget should be used (note that every ISchedulerQueue has it)
ResourceCost cost;
/// Scheduler node to be notified on consumption finish
/// Auto-filled during request enqueue/dequeue
ISchedulerConstraint * constraint;
/// Scheduler nodes to be notified on consumption finish
/// Auto-filled during request dequeue
/// Vector is not used to avoid allocations in the scheduler thread
std::array<ISchedulerConstraint *, ResourceMaxConstraints> constraints;
explicit ResourceRequest(ResourceCost cost_ = 1)
{
@ -62,7 +67,8 @@ public:
void reset(ResourceCost cost_)
{
cost = cost_;
constraint = nullptr;
for (auto & constraint : constraints)
constraint = nullptr;
// Note that list_base_hook should be reset independently (by intrusive list)
}
@ -74,11 +80,18 @@ public:
/// (e.g. setting an std::promise or creating a job in a thread pool)
virtual void execute() = 0;
/// Callback to trigger an error in case the resource is unavailable.
virtual void failed(const std::exception_ptr & ptr) = 0;
/// Stop resource consumption and notify resource scheduler.
/// Should be called when resource consumption is finished by consumer.
/// ResourceRequest should not be destructed or reset before calling to `finish()`.
/// WARNING: this function MUST not be called if request was canceled.
/// It is okay to call finish() even for failed and canceled requests (it will be no-op)
void finish();
/// Is called from the scheduler thread to fill `constraints` chain
/// Returns `true` iff constraint was added successfully
bool addConstraint(ISchedulerConstraint * new_constraint);
};
}


@ -28,27 +28,27 @@ namespace ErrorCodes
* Resource scheduler root node with a dedicated thread.
* Immediate children correspond to different resources.
*/
class SchedulerRoot : public ISchedulerNode
class SchedulerRoot final : public ISchedulerNode
{
private:
struct TResource
struct Resource
{
SchedulerNodePtr root;
// Intrusive cyclic list of active resources
TResource * next = nullptr;
TResource * prev = nullptr;
Resource * next = nullptr;
Resource * prev = nullptr;
explicit TResource(const SchedulerNodePtr & root_)
explicit Resource(const SchedulerNodePtr & root_)
: root(root_)
{
root->info.parent.ptr = this;
}
// Get pointer stored by ctor in info
static TResource * get(SchedulerNodeInfo & info)
static Resource * get(SchedulerNodeInfo & info)
{
return reinterpret_cast<TResource *>(info.parent.ptr);
return reinterpret_cast<Resource *>(info.parent.ptr);
}
};
@ -60,6 +60,8 @@ public:
~SchedulerRoot() override
{
stop();
while (!children.empty())
removeChild(children.begin()->first);
}
/// Runs separate scheduler thread
@ -95,6 +97,12 @@ public:
}
}
const String & getTypeName() const override
{
static String type_name("scheduler");
return type_name;
}
bool equals(ISchedulerNode * other) override
{
if (!ISchedulerNode::equals(other))
@ -179,16 +187,11 @@ public:
void activateChild(ISchedulerNode * child) override
{
activate(TResource::get(child->info));
}
void setParent(ISchedulerNode *) override
{
abort(); // scheduler must be the root and this function should not be called
activate(Resource::get(child->info));
}
private:
void activate(TResource * value)
void activate(Resource * value)
{
assert(value->next == nullptr && value->prev == nullptr);
if (current == nullptr) // No active children
@ -206,7 +209,7 @@ private:
}
}
void deactivate(TResource * value)
void deactivate(Resource * value)
{
if (value->next == nullptr)
return; // Already deactivated
@ -251,8 +254,8 @@ private:
request->execute();
}
TResource * current = nullptr; // round-robin pointer
std::unordered_map<ISchedulerNode *, TResource> children; // resources by pointer
Resource * current = nullptr; // round-robin pointer
std::unordered_map<ISchedulerNode *, Resource> children; // resources by pointer
std::atomic<bool> stop_flag = false;
EventQueue events;
ThreadFromGlobalPool scheduler;


@ -0,0 +1,130 @@
#include <limits>
#include <Common/Scheduler/SchedulingSettings.h>
#include <Common/Scheduler/ISchedulerNode.h>
#include <Parsers/ASTSetQuery.h>
namespace DB
{
namespace ErrorCodes
{
extern const int BAD_ARGUMENTS;
}
void SchedulingSettings::updateFromChanges(const ASTCreateWorkloadQuery::SettingsChanges & changes, const String & resource_name)
{
struct {
std::optional<Float64> new_weight;
std::optional<Priority> new_priority;
std::optional<Float64> new_max_speed;
std::optional<Float64> new_max_burst;
std::optional<Int64> new_max_requests;
std::optional<Int64> new_max_cost;
static Float64 getNotNegativeFloat64(const String & name, const Field & field)
{
{
UInt64 val;
if (field.tryGet(val))
return static_cast<Float64>(val); // We don't mind slight loss of precision
}
{
Int64 val;
if (field.tryGet(val))
{
if (val < 0)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Unexpected negative Int64 value for workload setting '{}'", name);
return static_cast<Float64>(val); // We don't mind slight loss of precision
}
}
return field.safeGet<Float64>();
}
static Int64 getNotNegativeInt64(const String & name, const Field & field)
{
{
UInt64 val;
if (field.tryGet(val))
{
// Saturate on overflow
if (val > static_cast<UInt64>(std::numeric_limits<Int64>::max()))
val = std::numeric_limits<Int64>::max();
return static_cast<Int64>(val);
}
}
{
Int64 val;
if (field.tryGet(val))
{
if (val < 0)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Unexpected negative Int64 value for workload setting '{}'", name);
return val;
}
}
return field.safeGet<Int64>();
}
void read(const String & name, const Field & value)
{
if (name == "weight")
new_weight = getNotNegativeFloat64(name, value);
else if (name == "priority")
new_priority = Priority{value.safeGet<Priority::Value>()};
else if (name == "max_speed")
new_max_speed = getNotNegativeFloat64(name, value);
else if (name == "max_burst")
new_max_burst = getNotNegativeFloat64(name, value);
else if (name == "max_requests")
new_max_requests = getNotNegativeInt64(name, value);
else if (name == "max_cost")
new_max_cost = getNotNegativeInt64(name, value);
}
} regular, specific;
// Read changed setting values
for (const auto & [name, value, resource] : changes)
{
if (resource.empty())
regular.read(name, value);
else if (resource == resource_name)
specific.read(name, value);
}
auto get_value = [] <typename T> (const std::optional<T> & specific_new, const std::optional<T> & regular_new, T & old)
{
if (specific_new)
return *specific_new;
if (regular_new)
return *regular_new;
return old;
};
// Validate that we could use values read in a scheduler node
{
SchedulerNodeInfo validating_node(
get_value(specific.new_weight, regular.new_weight, weight),
get_value(specific.new_priority, regular.new_priority, priority));
}
// Commit new values.
// Previous values are intentionally kept so that an ALTER query can omit settings it does not change
weight = get_value(specific.new_weight, regular.new_weight, weight);
priority = get_value(specific.new_priority, regular.new_priority, priority);
if (specific.new_max_speed || regular.new_max_speed)
{
max_speed = get_value(specific.new_max_speed, regular.new_max_speed, max_speed);
// We always set max_burst if max_speed is changed.
// This is done for users to be able to ignore more advanced max_burst setting and rely only on max_speed
max_burst = default_burst_seconds * max_speed;
}
max_burst = get_value(specific.new_max_burst, regular.new_max_burst, max_burst);
max_requests = get_value(specific.new_max_requests, regular.new_max_requests, max_requests);
max_cost = get_value(specific.new_max_cost, regular.new_max_cost, max_cost);
}
}
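
The commit logic in `updateFromChanges` applies a fixed precedence (resource-specific value, then regular value, then the previous value) and resets `max_burst` whenever `max_speed` changes. A minimal standalone sketch of just that part follows. `ToySettings`, `pickValue` and `applyThrottling` are hypothetical names, and the sketch covers only the two throttling fields.

```cpp
#include <optional>

// Hypothetical reduced settings: only the two throttling fields.
struct ToySettings
{
    static constexpr double default_burst_seconds = 1.0;
    double max_speed = 0; // zero means unlimited
    double max_burst = 0;
};

// Precedence: resource-specific value, then regular value, then the previous value.
template <typename T>
T pickValue(const std::optional<T> & specific_new, const std::optional<T> & regular_new, const T & old)
{
    if (specific_new)
        return *specific_new;
    if (regular_new)
        return *regular_new;
    return old;
}

void applyThrottling(
    ToySettings & settings,
    std::optional<double> specific_speed,
    std::optional<double> regular_speed,
    std::optional<double> specific_burst,
    std::optional<double> regular_burst)
{
    if (specific_speed || regular_speed)
    {
        settings.max_speed = pickValue(specific_speed, regular_speed, settings.max_speed);
        // A changed max_speed always resets max_burst to its default first ...
        settings.max_burst = ToySettings::default_burst_seconds * settings.max_speed;
    }
    // ... and an explicitly given max_burst then overrides that default.
    settings.max_burst = pickValue(specific_burst, regular_burst, settings.max_burst);
}
```

One consequence of this order is that changing only `max_speed` later discards a previously customized `max_burst`, so the burst has to be repeated in the same statement if it should be kept.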


@ -0,0 +1,39 @@
#pragma once
#include <base/types.h>
#include <Common/Priority.h>
#include <Parsers/ASTCreateWorkloadQuery.h>
#include <limits>
namespace DB
{
struct SchedulingSettings
{
/// Priority and weight among siblings
Float64 weight = 1.0;
Priority priority;
/// Throttling constraints.
/// Up to 2 independent throttlers: one for average speed and one for peak speed.
static constexpr Float64 default_burst_seconds = 1.0;
Float64 max_speed = 0; // Zero means unlimited
Float64 max_burst = 0; // default is `default_burst_seconds * max_speed`
/// Limits total number of concurrent resource requests that are allowed to consume
static constexpr Int64 default_max_requests = std::numeric_limits<Int64>::max();
Int64 max_requests = default_max_requests;
/// Limits total cost of concurrent resource requests that are allowed to consume
static constexpr Int64 default_max_cost = std::numeric_limits<Int64>::max();
Int64 max_cost = default_max_cost;
bool hasThrottler() const { return max_speed != 0; }
bool hasSemaphore() const { return max_requests != default_max_requests || max_cost != default_max_cost; }
void updateFromChanges(const ASTCreateWorkloadQuery::SettingsChanges & changes, const String & resource_name = {});
};
}


@ -0,0 +1,91 @@
#pragma once
#include <base/types.h>
#include <base/scope_guard.h>
#include <Interpreters/Context_fwd.h>
#include <Parsers/IAST_fwd.h>
namespace DB
{
class IAST;
struct Settings;
enum class WorkloadEntityType : uint8_t
{
Workload,
Resource,
MAX
};
/// Interface for a storage of workload entities (WORKLOAD and RESOURCE).
class IWorkloadEntityStorage
{
public:
virtual ~IWorkloadEntityStorage() = default;
/// Whether this storage can replicate entities to another node.
virtual bool isReplicated() const { return false; }
virtual String getReplicationID() const { return ""; }
/// Loads all entities. Can be called once - if entities are already loaded the function does nothing.
virtual void loadEntities() = 0;
/// Get entity by name. If no entity is stored with entity_name, throws an exception.
virtual ASTPtr get(const String & entity_name) const = 0;
/// Get entity by name. If no entity is stored with entity_name, returns nullptr.
virtual ASTPtr tryGet(const String & entity_name) const = 0;
/// Check if entity with entity_name is stored.
virtual bool has(const String & entity_name) const = 0;
/// Get all entity names.
virtual std::vector<String> getAllEntityNames() const = 0;
/// Get all entity names of specified type.
virtual std::vector<String> getAllEntityNames(WorkloadEntityType entity_type) const = 0;
/// Get all entities.
virtual std::vector<std::pair<String, ASTPtr>> getAllEntities() const = 0;
/// Check whether any entity has been stored.
virtual bool empty() const = 0;
/// Stops watching.
virtual void stopWatching() {}
/// Stores an entity.
virtual bool storeEntity(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool throw_if_exists,
bool replace_if_exists,
const Settings & settings) = 0;
/// Removes an entity.
virtual bool removeEntity(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
bool throw_if_not_exists) = 0;
struct Event
{
WorkloadEntityType type;
String name;
ASTPtr entity; /// new or changed entity, null if removed
};
using OnChangedHandler = std::function<void(const std::vector<Event> &)>;
/// Gets all current entries, passes them through `handler` and subscribes for all later changes.
virtual scope_guard getAllEntitiesAndSubscribe(const OnChangedHandler & handler) = 0;
};
}


@ -0,0 +1,287 @@
#include <Common/Scheduler/Workload/WorkloadEntityDiskStorage.h>
#include <Common/StringUtils.h>
#include <Common/atomicRename.h>
#include <Common/escapeForFileName.h>
#include <Common/logger_useful.h>
#include <Common/quoteString.h>
#include <Core/Settings.h>
#include <IO/ReadBufferFromFile.h>
#include <IO/ReadHelpers.h>
#include <IO/WriteBufferFromFile.h>
#include <IO/WriteHelpers.h>
#include <Interpreters/Context.h>
#include <Parsers/parseQuery.h>
#include <Parsers/formatAST.h>
#include <Parsers/ParserCreateWorkloadQuery.h>
#include <Parsers/ParserCreateResourceQuery.h>
#include <Poco/DirectoryIterator.h>
#include <Poco/Logger.h>
#include <filesystem>
namespace fs = std::filesystem;
namespace DB
{
namespace Setting
{
extern const SettingsUInt64 max_parser_backtracks;
extern const SettingsUInt64 max_parser_depth;
extern const SettingsBool fsync_metadata;
}
namespace ErrorCodes
{
extern const int DIRECTORY_DOESNT_EXIST;
extern const int BAD_ARGUMENTS;
}
namespace
{
constexpr std::string_view workload_prefix = "workload_";
constexpr std::string_view resource_prefix = "resource_";
constexpr std::string_view sql_suffix = ".sql";
/// Converts a path to an absolute path and appends a separator to it.
String makeDirectoryPathCanonical(const String & directory_path)
{
auto canonical_directory_path = std::filesystem::weakly_canonical(directory_path);
if (canonical_directory_path.has_filename())
canonical_directory_path += std::filesystem::path::preferred_separator;
return canonical_directory_path;
}
}
WorkloadEntityDiskStorage::WorkloadEntityDiskStorage(const ContextPtr & global_context_, const String & dir_path_)
: WorkloadEntityStorageBase(global_context_)
, dir_path{makeDirectoryPathCanonical(dir_path_)}
{
log = getLogger("WorkloadEntityDiskStorage");
}
ASTPtr WorkloadEntityDiskStorage::tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name)
{
return tryLoadEntity(entity_type, entity_name, getFilePath(entity_type, entity_name), /* check_file_exists= */ true);
}
ASTPtr WorkloadEntityDiskStorage::tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name, const String & path, bool check_file_exists)
{
LOG_DEBUG(log, "Loading workload entity {} from file {}", backQuote(entity_name), path);
try
{
if (check_file_exists && !fs::exists(path))
return nullptr;
/// There is .sql file with workload entity creation statement.
ReadBufferFromFile in(path);
String entity_create_query;
readStringUntilEOF(entity_create_query, in);
auto parse = [&] (auto parser)
{
return parseQuery(
parser,
entity_create_query.data(),
entity_create_query.data() + entity_create_query.size(),
"",
0,
global_context->getSettingsRef()[Setting::max_parser_depth],
global_context->getSettingsRef()[Setting::max_parser_backtracks]);
};
switch (entity_type)
{
case WorkloadEntityType::Workload: return parse(ParserCreateWorkloadQuery());
case WorkloadEntityType::Resource: return parse(ParserCreateResourceQuery());
case WorkloadEntityType::MAX: return nullptr;
}
}
catch (...)
{
tryLogCurrentException(log, fmt::format("while loading workload entity {} from path {}", backQuote(entity_name), path));
return nullptr; /// Failed to load this entity, will ignore it
}
}
void WorkloadEntityDiskStorage::loadEntities()
{
if (!entities_loaded)
loadEntitiesImpl();
}
void WorkloadEntityDiskStorage::loadEntitiesImpl()
{
LOG_INFO(log, "Loading workload entities from {}", dir_path);
if (!std::filesystem::exists(dir_path))
{
LOG_DEBUG(log, "The directory for workload entities ({}) does not exist: nothing to load", dir_path);
return;
}
std::vector<std::pair<String, ASTPtr>> entities_name_and_queries;
Poco::DirectoryIterator dir_end;
for (Poco::DirectoryIterator it(dir_path); it != dir_end; ++it)
{
if (it->isDirectory())
continue;
const String & file_name = it.name();
if (file_name.starts_with(workload_prefix) && file_name.ends_with(sql_suffix))
{
String name = unescapeForFileName(file_name.substr(
workload_prefix.size(),
file_name.size() - workload_prefix.size() - sql_suffix.size()));
if (name.empty())
continue;
ASTPtr ast = tryLoadEntity(WorkloadEntityType::Workload, name, dir_path + it.name(), /* check_file_exists= */ false);
if (ast)
entities_name_and_queries.emplace_back(name, ast);
}
if (file_name.starts_with(resource_prefix) && file_name.ends_with(sql_suffix))
{
String name = unescapeForFileName(file_name.substr(
resource_prefix.size(),
file_name.size() - resource_prefix.size() - sql_suffix.size()));
if (name.empty())
continue;
ASTPtr ast = tryLoadEntity(WorkloadEntityType::Resource, name, dir_path + it.name(), /* check_file_exists= */ false);
if (ast)
entities_name_and_queries.emplace_back(name, ast);
}
}
setAllEntities(entities_name_and_queries);
entities_loaded = true;
LOG_DEBUG(log, "Workload entities loaded");
}
void WorkloadEntityDiskStorage::createDirectory()
{
std::error_code create_dir_error_code;
fs::create_directories(dir_path, create_dir_error_code);
if (!fs::exists(dir_path) || !fs::is_directory(dir_path) || create_dir_error_code)
throw Exception(ErrorCodes::DIRECTORY_DOESNT_EXIST, "Couldn't create directory {} reason: '{}'",
dir_path, create_dir_error_code.message());
}
WorkloadEntityStorageBase::OperationResult WorkloadEntityDiskStorage::storeEntityImpl(
const ContextPtr & /*current_context*/,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool throw_if_exists,
bool replace_if_exists,
const Settings & settings)
{
createDirectory();
String file_path = getFilePath(entity_type, entity_name);
LOG_DEBUG(log, "Storing workload entity {} to file {}", backQuote(entity_name), file_path);
if (fs::exists(file_path))
{
if (throw_if_exists)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists", entity_name);
else if (!replace_if_exists)
return OperationResult::Failed;
}
String temp_file_path = file_path + ".tmp";
try
{
WriteBufferFromFile out(temp_file_path);
formatAST(*create_entity_query, out, false);
writeChar('\n', out);
out.next();
if (settings[Setting::fsync_metadata])
out.sync();
out.close();
if (replace_if_exists)
fs::rename(temp_file_path, file_path);
else
renameNoReplace(temp_file_path, file_path);
}
catch (...)
{
fs::remove(temp_file_path);
throw;
}
LOG_TRACE(log, "Entity {} stored", backQuote(entity_name));
return OperationResult::Ok;
}
WorkloadEntityStorageBase::OperationResult WorkloadEntityDiskStorage::removeEntityImpl(
const ContextPtr & /*current_context*/,
WorkloadEntityType entity_type,
const String & entity_name,
bool throw_if_not_exists)
{
String file_path = getFilePath(entity_type, entity_name);
LOG_DEBUG(log, "Removing workload entity {} stored in file {}", backQuote(entity_name), file_path);
bool existed = fs::remove(file_path);
if (!existed)
{
if (throw_if_not_exists)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' doesn't exist", entity_name);
else
return OperationResult::Failed;
}
LOG_TRACE(log, "Entity {} removed", backQuote(entity_name));
return OperationResult::Ok;
}
String WorkloadEntityDiskStorage::getFilePath(WorkloadEntityType entity_type, const String & entity_name) const
{
String file_path;
switch (entity_type)
{
case WorkloadEntityType::Workload:
{
file_path = dir_path + "workload_" + escapeForFileName(entity_name) + ".sql";
break;
}
case WorkloadEntityType::Resource:
{
file_path = dir_path + "resource_" + escapeForFileName(entity_name) + ".sql";
break;
}
case WorkloadEntityType::MAX: break;
}
return file_path;
}
}


@ -0,0 +1,44 @@
#pragma once
#include <Common/Scheduler/Workload/WorkloadEntityStorageBase.h>
#include <Interpreters/Context_fwd.h>
#include <Parsers/IAST_fwd.h>
namespace DB
{
/// Loads workload entities from a specified folder.
class WorkloadEntityDiskStorage : public WorkloadEntityStorageBase
{
public:
WorkloadEntityDiskStorage(const ContextPtr & global_context_, const String & dir_path_);
void loadEntities() override;
private:
OperationResult storeEntityImpl(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool throw_if_exists,
bool replace_if_exists,
const Settings & settings) override;
OperationResult removeEntityImpl(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
bool throw_if_not_exists) override;
void createDirectory();
void loadEntitiesImpl();
ASTPtr tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name);
ASTPtr tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name, const String & file_path, bool check_file_exists);
String getFilePath(WorkloadEntityType entity_type, const String & entity_name) const;
String dir_path;
std::atomic<bool> entities_loaded = false;
};
}


@ -0,0 +1,273 @@
#include <Common/Scheduler/Workload/WorkloadEntityKeeperStorage.h>
#include <Interpreters/Context.h>
#include <Parsers/ASTCreateWorkloadQuery.h>
#include <Parsers/ASTCreateResourceQuery.h>
#include <Parsers/ParserCreateWorkloadEntity.h>
#include <Parsers/formatAST.h>
#include <Parsers/parseQuery.h>
#include <base/sleep.h>
#include <Common/Exception.h>
#include <Common/ZooKeeper/KeeperException.h>
#include <Common/escapeForFileName.h>
#include <Common/logger_useful.h>
#include <Common/quoteString.h>
#include <Common/scope_guard_safe.h>
#include <Common/setThreadName.h>
#include <Core/Settings.h>
namespace DB
{
namespace Setting
{
extern const SettingsUInt64 max_parser_backtracks;
extern const SettingsUInt64 max_parser_depth;
}
namespace ErrorCodes
{
extern const int BAD_ARGUMENTS;
extern const int LOGICAL_ERROR;
}
WorkloadEntityKeeperStorage::WorkloadEntityKeeperStorage(
const ContextPtr & global_context_, const String & zookeeper_path_)
: WorkloadEntityStorageBase(global_context_)
, zookeeper_getter{[global_context_]() { return global_context_->getZooKeeper(); }}
, zookeeper_path{zookeeper_path_}
, watch{std::make_shared<WatchEvent>()}
{
log = getLogger("WorkloadEntityKeeperStorage");
if (zookeeper_path.empty())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "ZooKeeper path must be non-empty");
if (zookeeper_path.back() == '/')
zookeeper_path.pop_back();
/// If zookeeper chroot prefix is used, path should start with '/', because chroot concatenates without it.
if (zookeeper_path.front() != '/')
zookeeper_path = "/" + zookeeper_path;
}
WorkloadEntityKeeperStorage::~WorkloadEntityKeeperStorage()
{
SCOPE_EXIT_SAFE(stopWatchingThread());
}
void WorkloadEntityKeeperStorage::startWatchingThread()
{
if (!watching_flag.exchange(true))
watching_thread = ThreadFromGlobalPool(&WorkloadEntityKeeperStorage::processWatchQueue, this);
}
void WorkloadEntityKeeperStorage::stopWatchingThread()
{
if (watching_flag.exchange(false))
{
watch->cv.notify_one();
if (watching_thread.joinable())
watching_thread.join();
}
}
zkutil::ZooKeeperPtr WorkloadEntityKeeperStorage::getZooKeeper()
{
auto [zookeeper, session_status] = zookeeper_getter.getZooKeeper();
if (session_status == zkutil::ZooKeeperCachingGetter::SessionStatus::New)
{
/// It's possible that we connected to a different [Zoo]Keeper instance,
/// so we may read slightly stale state.
zookeeper->sync(zookeeper_path);
createRootNodes(zookeeper);
auto lock = getLock();
refreshEntities(zookeeper);
}
return zookeeper;
}
void WorkloadEntityKeeperStorage::loadEntities()
{
/// loadEntities() is called at startup from Server::main(), so it's better not to stop here if there is no connection to ZooKeeper or on any other error.
/// However, the watching thread must be started anyway in case the connection is established later.
try
{
auto lock = getLock();
refreshEntities(getZooKeeper());
}
catch (...)
{
tryLogCurrentException(log, "Failed to load workload entities");
}
startWatchingThread();
}
void WorkloadEntityKeeperStorage::processWatchQueue()
{
LOG_DEBUG(log, "Started watching thread");
setThreadName("WrkldEntWatch");
UInt64 handled = 0;
while (watching_flag)
{
try
{
/// Re-initialize ZooKeeper session if expired
getZooKeeper();
{
std::unique_lock lock{watch->mutex};
if (!watch->cv.wait_for(lock, std::chrono::seconds(10), [&] { return !watching_flag || handled != watch->triggered; }))
continue;
handled = watch->triggered;
}
auto lock = getLock();
refreshEntities(getZooKeeper());
}
catch (...)
{
tryLogCurrentException(log, "Will try to restart watching thread after error");
zookeeper_getter.resetCache();
sleepForSeconds(5);
}
}
LOG_DEBUG(log, "Stopped watching thread");
}
void WorkloadEntityKeeperStorage::stopWatching()
{
stopWatchingThread();
}
void WorkloadEntityKeeperStorage::createRootNodes(const zkutil::ZooKeeperPtr & zookeeper)
{
zookeeper->createAncestors(zookeeper_path);
// If the node does not exist, we treat it as an empty node: no workload entities
zookeeper->createIfNotExists(zookeeper_path, "");
}
WorkloadEntityStorageBase::OperationResult WorkloadEntityKeeperStorage::storeEntityImpl(
const ContextPtr & /*current_context*/,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool /*throw_if_exists*/,
bool /*replace_if_exists*/,
const Settings &)
{
LOG_DEBUG(log, "Storing workload entity {}", backQuote(entity_name));
String new_data = serializeAllEntities(Event{entity_type, entity_name, create_entity_query});
auto zookeeper = getZooKeeper();
Coordination::Stat stat;
auto code = zookeeper->trySet(zookeeper_path, new_data, current_version, &stat);
if (code != Coordination::Error::ZOK)
{
refreshEntities(zookeeper);
return OperationResult::Retry;
}
current_version = stat.version;
LOG_DEBUG(log, "Workload entity {} stored", backQuote(entity_name));
return OperationResult::Ok;
}
WorkloadEntityStorageBase::OperationResult WorkloadEntityKeeperStorage::removeEntityImpl(
const ContextPtr & /*current_context*/,
WorkloadEntityType entity_type,
const String & entity_name,
bool /*throw_if_not_exists*/)
{
LOG_DEBUG(log, "Removing workload entity {}", backQuote(entity_name));
String new_data = serializeAllEntities(Event{entity_type, entity_name, {}});
auto zookeeper = getZooKeeper();
Coordination::Stat stat;
auto code = zookeeper->trySet(zookeeper_path, new_data, current_version, &stat);
if (code != Coordination::Error::ZOK)
{
refreshEntities(zookeeper);
return OperationResult::Retry;
}
current_version = stat.version;
LOG_DEBUG(log, "Workload entity {} removed", backQuote(entity_name));
return OperationResult::Ok;
}
std::pair<String, Int32> WorkloadEntityKeeperStorage::getDataAndSetWatch(const zkutil::ZooKeeperPtr & zookeeper)
{
const auto data_watcher = [my_watch = watch](const Coordination::WatchResponse & response)
{
if (response.type == Coordination::Event::CHANGED)
{
std::unique_lock lock{my_watch->mutex};
my_watch->triggered++;
my_watch->cv.notify_one();
}
};
Coordination::Stat stat;
String data;
bool exists = zookeeper->tryGetWatch(zookeeper_path, data, &stat, data_watcher);
if (!exists)
{
createRootNodes(zookeeper);
data = zookeeper->getWatch(zookeeper_path, &stat, data_watcher);
}
return {data, stat.version};
}
void WorkloadEntityKeeperStorage::refreshEntities(const zkutil::ZooKeeperPtr & zookeeper)
{
auto [data, version] = getDataAndSetWatch(zookeeper);
if (version == current_version)
return;
LOG_DEBUG(log, "Refreshing workload entities from keeper");
ASTs queries;
ParserCreateWorkloadEntity parser;
const char * begin = data.data(); /// begin of current query
const char * pos = begin; /// parser moves pos from begin to the end of current query
const char * end = begin + data.size();
while (pos < end)
{
queries.emplace_back(parseQueryAndMovePosition(parser, pos, end, "", true, 0, DBMS_DEFAULT_MAX_PARSER_DEPTH, DBMS_DEFAULT_MAX_PARSER_BACKTRACKS));
while (pos < end && (isWhitespaceASCII(*pos) || *pos == ';')) /// Stay within [begin, end) instead of relying on the string's null terminator
++pos;
}
/// Read and parse all SQL entities from data we just read from ZooKeeper
std::vector<std::pair<String, ASTPtr>> new_entities;
for (const auto & query : queries)
{
LOG_TRACE(log, "Read keeper entity definition: {}", serializeAST(*query));
if (auto * create_workload_query = query->as<ASTCreateWorkloadQuery>())
new_entities.emplace_back(create_workload_query->getWorkloadName(), query);
else if (auto * create_resource_query = query->as<ASTCreateResourceQuery>())
new_entities.emplace_back(create_resource_query->getResourceName(), query);
else
throw Exception(ErrorCodes::LOGICAL_ERROR, "Invalid workload entity query in keeper storage: {}", query->getID());
}
setAllEntities(new_entities);
current_version = version;
LOG_DEBUG(log, "Workload entities refreshing is done");
}
}


@ -0,0 +1,71 @@
#pragma once
#include <Common/Scheduler/Workload/WorkloadEntityStorageBase.h>
#include <Interpreters/Context_fwd.h>
#include <Parsers/IAST_fwd.h>
#include <Common/ThreadPool.h>
#include <Common/ZooKeeper/ZooKeeperCachingGetter.h>
#include <condition_variable>
#include <mutex>
namespace DB
{
/// Loads RESOURCE and WORKLOAD SQL objects from Keeper.
class WorkloadEntityKeeperStorage : public WorkloadEntityStorageBase
{
public:
WorkloadEntityKeeperStorage(const ContextPtr & global_context_, const String & zookeeper_path_);
~WorkloadEntityKeeperStorage() override;
bool isReplicated() const override { return true; }
String getReplicationID() const override { return zookeeper_path; }
void loadEntities() override;
void stopWatching() override;
private:
OperationResult storeEntityImpl(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool throw_if_exists,
bool replace_if_exists,
const Settings & settings) override;
OperationResult removeEntityImpl(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
bool throw_if_not_exists) override;
void processWatchQueue();
zkutil::ZooKeeperPtr getZooKeeper();
void startWatchingThread();
void stopWatchingThread();
void createRootNodes(const zkutil::ZooKeeperPtr & zookeeper);
std::pair<String, Int32> getDataAndSetWatch(const zkutil::ZooKeeperPtr & zookeeper);
void refreshEntities(const zkutil::ZooKeeperPtr & zookeeper);
zkutil::ZooKeeperCachingGetter zookeeper_getter;
String zookeeper_path;
Int32 current_version = 0;
ThreadFromGlobalPool watching_thread;
std::atomic<bool> watching_flag = false;
struct WatchEvent
{
std::mutex mutex;
std::condition_variable cv;
UInt64 triggered = 0;
};
std::shared_ptr<WatchEvent> watch;
};
}


@ -0,0 +1,773 @@
#include <Common/Scheduler/Workload/WorkloadEntityStorageBase.h>
#include <Common/Scheduler/SchedulingSettings.h>
#include <Common/logger_useful.h>
#include <Core/Settings.h>
#include <Interpreters/Context.h>
#include <Parsers/ASTCreateWorkloadQuery.h>
#include <Parsers/ASTCreateResourceQuery.h>
#include <Parsers/formatAST.h>
#include <IO/WriteBufferFromString.h>
#include <boost/container/flat_set.hpp>
#include <boost/range/algorithm/copy.hpp>
#include <mutex>
#include <queue>
#include <unordered_set>
namespace DB
{
namespace ErrorCodes
{
extern const int BAD_ARGUMENTS;
extern const int LOGICAL_ERROR;
}
namespace
{
/// Removes details from a CREATE query to be used as workload entity definition
ASTPtr normalizeCreateWorkloadEntityQuery(const IAST & create_query)
{
auto ptr = create_query.clone();
if (auto * res = typeid_cast<ASTCreateWorkloadQuery *>(ptr.get()))
{
res->if_not_exists = false;
res->or_replace = false;
}
if (auto * res = typeid_cast<ASTCreateResourceQuery *>(ptr.get()))
{
res->if_not_exists = false;
res->or_replace = false;
}
return ptr;
}
/// Returns a type of a workload entity `ptr`
WorkloadEntityType getEntityType(const ASTPtr & ptr)
{
if (auto * res = typeid_cast<ASTCreateWorkloadQuery *>(ptr.get()))
return WorkloadEntityType::Workload;
if (auto * res = typeid_cast<ASTCreateResourceQuery *>(ptr.get()))
return WorkloadEntityType::Resource;
chassert(false);
return WorkloadEntityType::MAX;
}
bool entityEquals(const ASTPtr & lhs, const ASTPtr & rhs)
{
if (auto * a = typeid_cast<ASTCreateWorkloadQuery *>(lhs.get()))
{
if (auto * b = typeid_cast<ASTCreateWorkloadQuery *>(rhs.get()))
{
return std::forward_as_tuple(a->getWorkloadName(), a->getWorkloadParent(), a->changes)
== std::forward_as_tuple(b->getWorkloadName(), b->getWorkloadParent(), b->changes);
}
}
if (auto * a = typeid_cast<ASTCreateResourceQuery *>(lhs.get()))
{
if (auto * b = typeid_cast<ASTCreateResourceQuery *>(rhs.get()))
return std::forward_as_tuple(a->getResourceName(), a->operations)
== std::forward_as_tuple(b->getResourceName(), b->operations);
}
return false;
}
/// Workload entities can reference each other.
/// This enum defines all possible reference types.
enum class ReferenceType
{
Parent, // Source workload references target workload as a parent
ForResource // Source workload references target resource in its `SETTINGS x = y FOR resource` clause
};
/// Runs a `func` callback for every reference from `source` to `target`.
/// This function is the source of truth defining what `target` references are stored in a workload `source_entity`
void forEachReference(
const ASTPtr & source_entity,
std::function<void(const String & target, const String & source, ReferenceType type)> func)
{
if (auto * res = typeid_cast<ASTCreateWorkloadQuery *>(source_entity.get()))
{
// Parent reference
String parent = res->getWorkloadParent();
if (!parent.empty())
func(parent, res->getWorkloadName(), ReferenceType::Parent);
// References to RESOURCEs mentioned in SETTINGS clause after FOR keyword
std::unordered_set<String> resources;
for (const auto & [name, value, resource] : res->changes)
{
if (!resource.empty())
resources.insert(resource);
}
for (const String & resource : resources)
func(resource, res->getWorkloadName(), ReferenceType::ForResource);
}
if (auto * res = typeid_cast<ASTCreateResourceQuery *>(source_entity.get()))
{
// RESOURCE has no references to be validated; the disks it mentions are allowed to be created later
}
}
/// Helper for recursive DFS
void topologicallySortedWorkloadsImpl(const String & name, const ASTPtr & ast, const std::unordered_map<String, ASTPtr> & workloads, std::unordered_set<String> & visited, std::vector<std::pair<String, ASTPtr>> & sorted_workloads)
{
if (visited.contains(name))
return;
visited.insert(name);
// Recurse into parent (if any)
String parent = typeid_cast<ASTCreateWorkloadQuery *>(ast.get())->getWorkloadParent();
if (!parent.empty())
{
auto parent_iter = workloads.find(parent);
if (parent_iter == workloads.end())
throw Exception(ErrorCodes::LOGICAL_ERROR, "Workload metadata inconsistency: Workload '{}' parent '{}' does not exist. This must be fixed manually.", name, parent);
topologicallySortedWorkloadsImpl(parent, parent_iter->second, workloads, visited, sorted_workloads);
}
sorted_workloads.emplace_back(name, ast);
}
/// Returns pairs {workload_name, create_workload_ast} in an order that respects the child-parent relation (parent first, then children)
std::vector<std::pair<String, ASTPtr>> topologicallySortedWorkloads(const std::unordered_map<String, ASTPtr> & workloads)
{
std::vector<std::pair<String, ASTPtr>> sorted_workloads;
std::unordered_set<String> visited;
for (const auto & [name, ast] : workloads)
topologicallySortedWorkloadsImpl(name, ast, workloads, visited, sorted_workloads);
return sorted_workloads;
}
/// Helper for recursive DFS
void topologicallySortedDependenciesImpl(
const String & name,
const std::unordered_map<String, std::unordered_set<String>> & dependencies,
std::unordered_set<String> & visited,
std::vector<String> & result)
{
if (visited.contains(name))
return;
visited.insert(name);
if (auto it = dependencies.find(name); it != dependencies.end())
{
for (const String & dep : it->second)
topologicallySortedDependenciesImpl(dep, dependencies, visited, result);
}
result.emplace_back(name);
}
/// Returns nodes in a topological order that respects `dependencies` (key is node name, value is the set of its dependencies)
std::vector<String> topologicallySortedDependencies(const std::unordered_map<String, std::unordered_set<String>> & dependencies)
{
std::unordered_set<String> visited; // Set to track visited nodes
std::vector<String> result; // Result to store nodes in topologically sorted order
// Perform DFS for each node in the graph
for (const auto & [name, _] : dependencies)
topologicallySortedDependenciesImpl(name, dependencies, visited, result);
return result;
}
/// Represents a change of a workload entity (WORKLOAD or RESOURCE)
struct EntityChange
{
String name; /// Name of entity
ASTPtr before; /// Entity before change (CREATE if not set)
ASTPtr after; /// Entity after change (DROP if not set)
std::vector<IWorkloadEntityStorage::Event> toEvents() const
{
if (!after)
return {{getEntityType(before), name, {}}};
else if (!before)
return {{getEntityType(after), name, after}};
else
{
auto type_before = getEntityType(before);
auto type_after = getEntityType(after);
// If type changed, we have to remove an old entity and add a new one
if (type_before != type_after)
return {{type_before, name, {}}, {type_after, name, after}};
else
return {{type_after, name, after}};
}
}
};
/// Returns `changes` ordered for execution.
/// Every intermediate state during execution will be consistent (i.e. all references will be valid).
/// NOTE: It does not validate changes; any problem will be detected during execution.
/// NOTE: There will be no error if a valid order does not exist.
std::vector<EntityChange> topologicallySortedChanges(const std::vector<EntityChange> & changes)
{
// Construct map from entity name into entity change
std::unordered_map<String, const EntityChange *> change_by_name;
for (const auto & change : changes)
change_by_name[change.name] = &change;
// Construct references maps (before changes and after changes)
std::unordered_map<String, std::unordered_set<String>> old_sources; // Key is target. Value is set of names of source entities.
std::unordered_map<String, std::unordered_set<String>> new_targets; // Key is source. Value is set of names of target entities.
for (const auto & change : changes)
{
if (change.before)
{
forEachReference(change.before,
[&] (const String & target, const String & source, ReferenceType)
{
old_sources[target].insert(source);
});
}
if (change.after)
{
forEachReference(change.after,
[&] (const String & target, const String & source, ReferenceType)
{
new_targets[source].insert(target);
});
}
}
// There are consistency rules that regulate order in which changes must be applied (see below).
// Construct DAG of dependencies between changes.
std::unordered_map<String, std::unordered_set<String>> dependencies; // Key is entity name. Value is set of names of entities that should be changed first.
for (const auto & change : changes)
{
dependencies.emplace(change.name, std::unordered_set<String>{}); // Make sure we create nodes that have no dependencies
for (const auto & event : change.toEvents())
{
if (!event.entity) // DROP
{
// Rule 1: Entity can only be removed after all existing references to it are removed as well.
for (const String & source : old_sources[event.name])
{
if (change_by_name.contains(source))
dependencies[event.name].insert(source);
}
}
else // CREATE || CREATE OR REPLACE
{
// Rule 2: Entity can only be created after all entities it references are created as well.
for (const String & target : new_targets[event.name])
{
if (auto it = change_by_name.find(target); it != change_by_name.end())
{
const EntityChange & target_change = *it->second;
// If target is creating, it should be created first.
// (But if target is updating, there is no dependency).
if (!target_change.before)
dependencies[event.name].insert(target);
}
}
}
}
}
// Topological sort of changes to respect consistency rules
std::vector<EntityChange> result;
for (const String & name : topologicallySortedDependencies(dependencies))
result.push_back(*change_by_name[name]);
return result;
}
}
WorkloadEntityStorageBase::WorkloadEntityStorageBase(ContextPtr global_context_)
: handlers(std::make_shared<Handlers>())
, global_context(std::move(global_context_))
, log{getLogger("WorkloadEntityStorage")} // could be overridden in derived class
{}
ASTPtr WorkloadEntityStorageBase::get(const String & entity_name) const
{
if (auto result = tryGet(entity_name))
return result;
throw Exception(ErrorCodes::BAD_ARGUMENTS,
"The workload entity name '{}' is not saved",
entity_name);
}
ASTPtr WorkloadEntityStorageBase::tryGet(const String & entity_name) const
{
std::lock_guard lock(mutex);
auto it = entities.find(entity_name);
if (it == entities.end())
return nullptr;
return it->second;
}
bool WorkloadEntityStorageBase::has(const String & entity_name) const
{
return tryGet(entity_name) != nullptr;
}
std::vector<String> WorkloadEntityStorageBase::getAllEntityNames() const
{
std::vector<String> entity_names;
std::lock_guard lock(mutex);
entity_names.reserve(entities.size());
for (const auto & [name, _] : entities)
entity_names.emplace_back(name);
return entity_names;
}
std::vector<String> WorkloadEntityStorageBase::getAllEntityNames(WorkloadEntityType entity_type) const
{
std::vector<String> entity_names;
std::lock_guard lock(mutex);
for (const auto & [name, entity] : entities)
{
if (getEntityType(entity) == entity_type)
entity_names.emplace_back(name);
}
return entity_names;
}
bool WorkloadEntityStorageBase::empty() const
{
std::lock_guard lock(mutex);
return entities.empty();
}
bool WorkloadEntityStorageBase::storeEntity(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool throw_if_exists,
bool replace_if_exists,
const Settings & settings)
{
if (entity_name.empty())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity name should not be empty.");
create_entity_query = normalizeCreateWorkloadEntityQuery(*create_entity_query);
auto * workload = typeid_cast<ASTCreateWorkloadQuery *>(create_entity_query.get());
auto * resource = typeid_cast<ASTCreateResourceQuery *>(create_entity_query.get());
while (true)
{
std::unique_lock lock{mutex};
ASTPtr old_entity; // entity to be REPLACED
if (auto it = entities.find(entity_name); it != entities.end())
{
if (throw_if_exists)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists", entity_name);
else if (!replace_if_exists)
return false;
else
old_entity = it->second;
}
// Validate CREATE OR REPLACE
if (old_entity)
{
auto * old_workload = typeid_cast<ASTCreateWorkloadQuery *>(old_entity.get());
auto * old_resource = typeid_cast<ASTCreateResourceQuery *>(old_entity.get());
if (workload && !old_workload)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists, but it is not a workload", entity_name);
if (resource && !old_resource)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists, but it is not a resource", entity_name);
if (workload && !old_workload->hasParent() && workload->hasParent())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "It is not allowed to remove root workload");
}
// Validate workload
if (workload)
{
if (!workload->hasParent())
{
if (!root_name.empty() && root_name != workload->getWorkloadName())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "The second root is not allowed. You should probably add 'PARENT {}' clause.", root_name);
}
SchedulingSettings validator;
validator.updateFromChanges(workload->changes);
}
forEachReference(create_entity_query,
[this, workload] (const String & target, const String & source, ReferenceType type)
{
if (auto it = entities.find(target); it == entities.end())
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' references another workload entity '{}' that doesn't exist", source, target);
switch (type)
{
case ReferenceType::Parent:
{
if (typeid_cast<ASTCreateWorkloadQuery *>(entities[target].get()) == nullptr)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload parent should reference another workload, not '{}'.", target);
break;
}
case ReferenceType::ForResource:
{
if (typeid_cast<ASTCreateResourceQuery *>(entities[target].get()) == nullptr)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload settings should reference resource in FOR clause, not '{}'.", target);
// Validate that we could parse the settings for specific resource
SchedulingSettings validator;
validator.updateFromChanges(workload->changes, target);
break;
}
}
// Detect reference cycles.
// The only way to create a cycle is to add an edge that will be a part of a new cycle.
// We are going to add an edge: `source` -> `target`, so we ensure there is no path back `target` -> `source`.
if (isIndirectlyReferenced(source, target))
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity cycles are not allowed");
});
auto result = storeEntityImpl(
current_context,
entity_type,
entity_name,
create_entity_query,
throw_if_exists,
replace_if_exists,
settings);
if (result == OperationResult::Retry)
continue; // Entities were updated, we need to rerun all the validations
if (result == OperationResult::Ok)
{
Event event{entity_type, entity_name, create_entity_query};
applyEvent(lock, event);
unlockAndNotify(lock, {std::move(event)});
}
return result == OperationResult::Ok;
}
}
bool WorkloadEntityStorageBase::removeEntity(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
bool throw_if_not_exists)
{
while (true)
{
std::unique_lock lock(mutex);
auto it = entities.find(entity_name);
if (it == entities.end())
{
if (throw_if_not_exists)
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' doesn't exist", entity_name);
else
return false;
}
if (auto reference_it = references.find(entity_name); reference_it != references.end())
{
String names;
for (const String & name : reference_it->second)
names += " " + name;
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' cannot be dropped. It is referenced by:{}", entity_name, names);
}
auto result = removeEntityImpl(
current_context,
entity_type,
entity_name,
throw_if_not_exists);
if (result == OperationResult::Retry)
continue; // Entities were updated, we need to rerun all the validations
if (result == OperationResult::Ok)
{
Event event{entity_type, entity_name, {}};
applyEvent(lock, event);
unlockAndNotify(lock, {std::move(event)});
}
return result == OperationResult::Ok;
}
}
scope_guard WorkloadEntityStorageBase::getAllEntitiesAndSubscribe(const OnChangedHandler & handler)
{
scope_guard result;
std::vector<Event> current_state;
{
std::lock_guard lock{mutex};
current_state = orderEntities(entities);
std::lock_guard lock2{handlers->mutex};
handlers->list.push_back(handler);
auto handler_it = std::prev(handlers->list.end());
result = [my_handlers = handlers, handler_it]
{
std::lock_guard lock3{my_handlers->mutex};
my_handlers->list.erase(handler_it);
};
}
// When you subscribe you get all the entities back to your handler immediately if already loaded, or later when loaded
handler(current_state);
return result;
}
void WorkloadEntityStorageBase::unlockAndNotify(
std::unique_lock<std::recursive_mutex> & lock,
std::vector<Event> tx)
{
if (tx.empty())
return;
std::vector<OnChangedHandler> current_handlers;
{
std::lock_guard handlers_lock{handlers->mutex};
boost::range::copy(handlers->list, std::back_inserter(current_handlers));
}
lock.unlock();
for (const auto & handler : current_handlers)
{
try
{
handler(tx);
}
catch (...)
{
tryLogCurrentException(__PRETTY_FUNCTION__);
}
}
}
std::unique_lock<std::recursive_mutex> WorkloadEntityStorageBase::getLock() const
{
return std::unique_lock{mutex};
}
void WorkloadEntityStorageBase::setAllEntities(const std::vector<std::pair<String, ASTPtr>> & raw_new_entities)
{
std::unordered_map<String, ASTPtr> new_entities;
for (const auto & [entity_name, create_query] : raw_new_entities)
new_entities[entity_name] = normalizeCreateWorkloadEntityQuery(*create_query);
std::unique_lock lock(mutex);
// Fill vector of `changes` based on difference between current `entities` and `new_entities`
std::vector<EntityChange> changes;
for (const auto & [entity_name, entity] : entities)
{
if (auto it = new_entities.find(entity_name); it != new_entities.end())
{
if (!entityEquals(entity, it->second))
{
changes.emplace_back(entity_name, entity, it->second); // Update entities that are present in both `new_entities` and `entities`
LOG_TRACE(log, "Workload entity {} was updated", entity_name);
}
else
LOG_TRACE(log, "Workload entity {} is the same", entity_name);
}
else
{
changes.emplace_back(entity_name, entity, ASTPtr{}); // Remove entities that are not present in `new_entities`
LOG_TRACE(log, "Workload entity {} was dropped", entity_name);
}
}
for (const auto & [entity_name, entity] : new_entities)
{
if (!entities.contains(entity_name))
{
changes.emplace_back(entity_name, ASTPtr{}, entity); // Create entities that are only present in `new_entities`
LOG_TRACE(log, "Workload entity {} was created", entity_name);
}
}
// Sort `changes` to respect consistency of references and apply them one by one.
std::vector<Event> tx;
for (const auto & change : topologicallySortedChanges(changes))
{
for (const auto & event : change.toEvents())
{
// TODO(serxa): do validation and throw LOGICAL_ERROR if failed
applyEvent(lock, event);
tx.push_back(event);
}
}
// Notify subscribers
unlockAndNotify(lock, tx);
}
void WorkloadEntityStorageBase::applyEvent(
std::unique_lock<std::recursive_mutex> &,
const Event & event)
{
if (event.entity) // CREATE || CREATE OR REPLACE
{
LOG_DEBUG(log, "Create or replace workload entity: {}", serializeAST(*event.entity));
auto * workload = typeid_cast<ASTCreateWorkloadQuery *>(event.entity.get());
// Validate workload
if (workload && !workload->hasParent())
root_name = workload->getWorkloadName();
// Remove references of a replaced entity (only for CREATE OR REPLACE)
if (auto it = entities.find(event.name); it != entities.end())
removeReferences(it->second);
// Insert references of created entity
insertReferences(event.entity);
// Store in memory
entities[event.name] = event.entity;
}
else // DROP
{
auto it = entities.find(event.name);
chassert(it != entities.end());
LOG_DEBUG(log, "Drop workload entity: {}", event.name);
if (event.name == root_name)
root_name.clear();
// Clean up references
removeReferences(it->second);
// Remove from memory
entities.erase(it);
}
}
std::vector<std::pair<String, ASTPtr>> WorkloadEntityStorageBase::getAllEntities() const
{
std::lock_guard lock{mutex};
std::vector<std::pair<String, ASTPtr>> all_entities;
all_entities.reserve(entities.size());
std::copy(entities.begin(), entities.end(), std::back_inserter(all_entities));
return all_entities;
}
bool WorkloadEntityStorageBase::isIndirectlyReferenced(const String & target, const String & source)
{
std::queue<String> bfs;
std::unordered_set<String> visited;
visited.insert(target);
bfs.push(target);
while (!bfs.empty())
{
String current = bfs.front();
bfs.pop();
if (current == source)
return true;
if (auto it = references.find(current); it != references.end())
{
for (const String & node : it->second)
{
if (visited.contains(node))
continue;
visited.insert(node);
bfs.push(node);
}
}
}
return false;
}
void WorkloadEntityStorageBase::insertReferences(const ASTPtr & entity)
{
if (!entity)
return;
forEachReference(entity,
[this] (const String & target, const String & source, ReferenceType)
{
references[target].insert(source);
});
}
void WorkloadEntityStorageBase::removeReferences(const ASTPtr & entity)
{
if (!entity)
return;
forEachReference(entity,
[this] (const String & target, const String & source, ReferenceType)
{
references[target].erase(source);
if (references[target].empty())
references.erase(target);
});
}
std::vector<WorkloadEntityStorageBase::Event> WorkloadEntityStorageBase::orderEntities(
const std::unordered_map<String, ASTPtr> & all_entities,
std::optional<Event> change)
{
std::vector<Event> result;
std::unordered_map<String, ASTPtr> workloads;
for (const auto & [entity_name, ast] : all_entities)
{
if (typeid_cast<ASTCreateWorkloadQuery *>(ast.get()))
{
if (change && change->name == entity_name)
continue; // Skip this workload if it is removed or updated
workloads.emplace(entity_name, ast);
}
else if (typeid_cast<ASTCreateResourceQuery *>(ast.get()))
{
if (change && change->name == entity_name)
continue; // Skip this resource if it is removed or updated
// Resources should go first because workloads could reference them
result.emplace_back(WorkloadEntityType::Resource, entity_name, ast);
}
else
throw Exception(ErrorCodes::LOGICAL_ERROR, "Invalid workload entity type '{}'", ast->getID());
}
// Introduce new entity described by `change`
if (change && change->entity)
{
if (change->type == WorkloadEntityType::Workload)
workloads.emplace(change->name, change->entity);
else if (change->type == WorkloadEntityType::Resource)
result.emplace_back(WorkloadEntityType::Resource, change->name, change->entity);
}
// Workloads should go in an order such that a child is listed only after its parent
for (auto & [entity_name, ast] : topologicallySortedWorkloads(workloads))
result.emplace_back(WorkloadEntityType::Workload, entity_name, ast);
return result;
}
String WorkloadEntityStorageBase::serializeAllEntities(std::optional<Event> change)
{
std::unique_lock lock{mutex}; // Actually acquire `mutex`; a default-constructed unique_lock owns no mutex and protects nothing
auto ordered_entities = orderEntities(entities, change);
WriteBufferFromOwnString buf;
for (const auto & event : ordered_entities)
{
formatAST(*event.entity, buf, false, true);
buf.write(";\n", 2);
}
return buf.str();
}
}
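
The breadth-first search in `isIndirectlyReferenced` above can be sketched in isolation. This is a minimal illustration, not the class's actual code: `ReferenceMap` and `isReachable` are hypothetical names, and the map mirrors the layout documented for `references` (key is the target, value is the set of sources that reference it).

```cpp
#include <queue>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the `references` member: key is target, value is the set of sources.
using ReferenceMap = std::unordered_map<std::string, std::unordered_set<std::string>>;

// Returns true if `source` is reachable from `target` by repeatedly stepping
// from an entity to the entities that reference it.
bool isReachable(const ReferenceMap & references, const std::string & target, const std::string & source)
{
    std::queue<std::string> bfs;
    std::unordered_set<std::string> visited{target};
    bfs.push(target);
    while (!bfs.empty())
    {
        std::string current = bfs.front();
        bfs.pop();
        if (current == source)
            return true;
        auto it = references.find(current);
        if (it == references.end())
            continue; // nothing references `current`
        for (const auto & node : it->second)
        {
            // insert() reports whether the node is new, so each node is queued at most once
            if (visited.insert(node).second)
                bfs.push(node);
        }
    }
    return false;
}
```

With `references = {"all": {"production", "development"}, "production": {"etl"}}`, the call `isReachable(references, "all", "etl")` follows `all`, then `production`, then `etl`. The `visited` set keeps the search finite even if the graph were to contain a cycle.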

View File

@ -0,0 +1,126 @@
#pragma once
#include <unordered_map>
#include <list>
#include <mutex>
#include <unordered_set>
#include <Common/Scheduler/Workload/IWorkloadEntityStorage.h>
#include <Interpreters/Context_fwd.h>
#include <Parsers/IAST.h>
namespace DB
{
class WorkloadEntityStorageBase : public IWorkloadEntityStorage
{
public:
explicit WorkloadEntityStorageBase(ContextPtr global_context_);
ASTPtr get(const String & entity_name) const override;
ASTPtr tryGet(const String & entity_name) const override;
bool has(const String & entity_name) const override;
std::vector<String> getAllEntityNames() const override;
std::vector<String> getAllEntityNames(WorkloadEntityType entity_type) const override;
std::vector<std::pair<String, ASTPtr>> getAllEntities() const override;
bool empty() const override;
bool storeEntity(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool throw_if_exists,
bool replace_if_exists,
const Settings & settings) override;
bool removeEntity(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
bool throw_if_not_exists) override;
scope_guard getAllEntitiesAndSubscribe(
const OnChangedHandler & handler) override;
protected:
enum class OperationResult
{
Ok,
Failed,
Retry
};
virtual OperationResult storeEntityImpl(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
ASTPtr create_entity_query,
bool throw_if_exists,
bool replace_if_exists,
const Settings & settings) = 0;
virtual OperationResult removeEntityImpl(
const ContextPtr & current_context,
WorkloadEntityType entity_type,
const String & entity_name,
bool throw_if_not_exists) = 0;
std::unique_lock<std::recursive_mutex> getLock() const;
/// Replaces current `entities` with `new_entities` and notifies subscribers.
/// Note that subscribers will be notified with a sequence of events.
/// It is guaranteed that all intermediate states (between every pair of consecutive events)
/// will be consistent (all references between entities will be valid).
void setAllEntities(const std::vector<std::pair<String, ASTPtr>> & new_entities);
/// Serialize `entities` stored in memory plus one optional `change` into multiline string
String serializeAllEntities(std::optional<Event> change = {});
private:
/// Change state in memory
void applyEvent(std::unique_lock<std::recursive_mutex> & lock, const Event & event);
/// Notify subscribers about changes described by the vector of events `tx`
void unlockAndNotify(std::unique_lock<std::recursive_mutex> & lock, std::vector<Event> tx);
/// Return true iff `references` has a path from `source` to `target`
bool isIndirectlyReferenced(const String & target, const String & source);
/// Adds references that are described by `entity` to `references`
void insertReferences(const ASTPtr & entity);
/// Removes references that are described by `entity` from `references`
void removeReferences(const ASTPtr & entity);
/// Returns an ordered vector of `entities`
std::vector<Event> orderEntities(
const std::unordered_map<String, ASTPtr> & all_entities,
std::optional<Event> change = {});
struct Handlers
{
std::mutex mutex;
std::list<OnChangedHandler> list;
};
/// shared_ptr is here for safety because WorkloadEntityStorageBase can be destroyed before all subscriptions are removed.
std::shared_ptr<Handlers> handlers;
mutable std::recursive_mutex mutex;
std::unordered_map<String, ASTPtr> entities; /// Maps entity name into CREATE entity query
// Validation
std::unordered_map<String, std::unordered_set<String>> references; /// Keep track of references between entities. Key is target. Value is set of sources
String root_name; /// current root workload name
protected:
ContextPtr global_context;
LoggerPtr log;
};
}

View File

@ -0,0 +1,45 @@
#include <Common/Scheduler/Workload/createWorkloadEntityStorage.h>
#include <Common/Scheduler/Workload/WorkloadEntityDiskStorage.h>
#include <Common/Scheduler/Workload/WorkloadEntityKeeperStorage.h>
#include <Interpreters/Context.h>
#include <Poco/Util/AbstractConfiguration.h>
#include <filesystem>
#include <memory>
namespace fs = std::filesystem;
namespace DB
{
namespace ErrorCodes
{
extern const int INVALID_CONFIG_PARAMETER;
}
std::unique_ptr<IWorkloadEntityStorage> createWorkloadEntityStorage(const ContextMutablePtr & global_context)
{
const String zookeeper_path_key = "workload_zookeeper_path";
const String disk_path_key = "workload_path";
const auto & config = global_context->getConfigRef();
if (config.has(zookeeper_path_key))
{
if (config.has(disk_path_key))
{
throw Exception(
ErrorCodes::INVALID_CONFIG_PARAMETER,
"'{}' and '{}' must not be both specified in the config",
zookeeper_path_key,
disk_path_key);
}
return std::make_unique<WorkloadEntityKeeperStorage>(global_context, config.getString(zookeeper_path_key));
}
String default_path = fs::path{global_context->getPath()} / "workload" / "";
String path = config.getString(disk_path_key, default_path);
return std::make_unique<WorkloadEntityDiskStorage>(global_context, path);
}
}
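
The selection rule in `createWorkloadEntityStorage` above (Keeper storage if `workload_zookeeper_path` is set, disk storage otherwise, and an error if both keys are present) can be sketched without ClickHouse dependencies. Everything below is an assumption made for illustration: `StorageKind`, `StorageChoice` and `chooseWorkloadStorage` are hypothetical names, a plain `std::map` stands in for the Poco configuration, and the default path is built with simple string handling rather than `std::filesystem`.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical illustration types; the real factory returns IWorkloadEntityStorage implementations.
enum class StorageKind { Keeper, Disk };

struct StorageChoice
{
    StorageKind kind;
    std::string path;
};

// `config` is a plain key/value map standing in for Poco::Util::AbstractConfiguration.
StorageChoice chooseWorkloadStorage(const std::map<std::string, std::string> & config, const std::string & server_path)
{
    const std::string zookeeper_path_key = "workload_zookeeper_path";
    const std::string disk_path_key = "workload_path";

    const auto zookeeper_it = config.find(zookeeper_path_key);
    const auto disk_it = config.find(disk_path_key);

    if (zookeeper_it != config.end())
    {
        // Both keys together are ambiguous, so the configuration is rejected.
        if (disk_it != config.end())
            throw std::invalid_argument(
                "'" + zookeeper_path_key + "' and '" + disk_path_key + "' must not be both specified in the config");
        return {StorageKind::Keeper, zookeeper_it->second};
    }

    if (disk_it != config.end())
        return {StorageKind::Disk, disk_it->second};

    // Default: "<server_path>/workload/" (simplified; the original uses std::filesystem).
    std::string default_path = server_path;
    if (!default_path.empty() && default_path.back() != '/')
        default_path += '/';
    return {StorageKind::Disk, default_path + "workload/"};
}
```

Checking for the conflicting key before constructing anything means a misconfigured server fails at startup instead of silently preferring one storage.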

View File

@ -0,0 +1,11 @@
#pragma once
#include <Interpreters/Context_fwd.h>
#include <Common/Scheduler/Workload/IWorkloadEntityStorage.h>
namespace DB
{
std::unique_ptr<IWorkloadEntityStorage> createWorkloadEntityStorage(const ContextMutablePtr & global_context);
}

View File

@ -0,0 +1,104 @@
#include <Common/Scheduler/createResourceManager.h>
#include <Common/Scheduler/Nodes/CustomResourceManager.h>
#include <Common/Scheduler/Nodes/IOResourceManager.h>
#include <Interpreters/Context.h>
#include <Poco/Util/AbstractConfiguration.h>
#include <memory>
#include <vector>
namespace DB
{
namespace ErrorCodes
{
extern const int RESOURCE_ACCESS_DENIED;
}
class ResourceManagerDispatcher : public IResourceManager
{
private:
class Classifier : public IClassifier
{
public:
void addClassifier(const ClassifierPtr & classifier)
{
classifiers.push_back(classifier);
}
bool has(const String & resource_name) override
{
for (const auto & classifier : classifiers)
{
if (classifier->has(resource_name))
return true;
}
return false;
}
ResourceLink get(const String & resource_name) override
{
for (auto & classifier : classifiers)
{
if (classifier->has(resource_name))
return classifier->get(resource_name);
}
throw Exception(ErrorCodes::RESOURCE_ACCESS_DENIED, "Access denied to resource '{}'", resource_name);
}
private:
std::vector<ClassifierPtr> classifiers; // should be constant after initialization to avoid races
};
public:
void addManager(const ResourceManagerPtr & manager)
{
managers.push_back(manager);
}
void updateConfiguration(const Poco::Util::AbstractConfiguration & config) override
{
for (auto & manager : managers)
manager->updateConfiguration(config);
}
bool hasResource(const String & resource_name) const override
{
for (const auto & manager : managers)
{
if (manager->hasResource(resource_name))
return true;
}
return false;
}
ClassifierPtr acquire(const String & workload_name) override
{
auto classifier = std::make_shared<Classifier>();
for (const auto & manager : managers)
classifier->addClassifier(manager->acquire(workload_name));
return classifier;
}
void forEachNode(VisitorFunc visitor) override
{
for (const auto & manager : managers)
manager->forEachNode(visitor);
}
private:
std::vector<ResourceManagerPtr> managers; // Should be constant after initialization to avoid races
};
ResourceManagerPtr createResourceManager(const ContextMutablePtr & global_context)
{
auto dispatcher = std::make_shared<ResourceManagerDispatcher>();
// NOTE: if the same resource is described by both managers, then manager added earlier will be used.
dispatcher->addManager(std::make_shared<CustomResourceManager>());
dispatcher->addManager(std::make_shared<IOResourceManager>(global_context->getWorkloadEntityStorage()));
return dispatcher;
}
}
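
The NOTE in `createResourceManager` above says that when both managers describe the same resource, the manager added earlier wins. A minimal sketch of that first-match dispatch follows; `ToyClassifier` and `ToyDispatchingClassifier` are hypothetical stand-ins, and an `int` replaces `ResourceLink` purely for illustration.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for IClassifier: maps a resource name to an integer "link".
struct ToyClassifier
{
    std::unordered_map<std::string, int> links;

    bool has(const std::string & resource_name) const { return links.count(resource_name) > 0; }
    int get(const std::string & resource_name) const { return links.at(resource_name); }
};

// Forwards each lookup to the first underlying classifier that knows the resource.
class ToyDispatchingClassifier
{
public:
    void addClassifier(std::shared_ptr<const ToyClassifier> classifier)
    {
        classifiers.push_back(std::move(classifier));
    }

    bool has(const std::string & resource_name) const
    {
        for (const auto & classifier : classifiers)
            if (classifier->has(resource_name))
                return true;
        return false;
    }

    int get(const std::string & resource_name) const
    {
        // Insertion order defines priority: the earliest classifier that has the resource wins.
        for (const auto & classifier : classifiers)
            if (classifier->has(resource_name))
                return classifier->get(resource_name);
        throw std::runtime_error("Access denied to resource '" + resource_name + "'");
    }

private:
    std::vector<std::shared_ptr<const ToyClassifier>> classifiers; // filled once, then read-only
};
```

Filling the vector once during setup and only reading it afterwards is what lets lookups run without a lock, which matches the "constant after initialization" comments in the original.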

View File

@ -0,0 +1,11 @@
#pragma once
#include <Interpreters/Context_fwd.h>
#include <Common/Scheduler/IResourceManager.h>
namespace DB
{
ResourceManagerPtr createResourceManager(const ContextMutablePtr & global_context);
}

View File

@ -78,7 +78,7 @@ ThreadStatus::ThreadStatus(bool check_current_thread_on_destruction_)
last_rusage = std::make_unique<RUsageCounters>();
memory_tracker.setDescription("(for thread)");
memory_tracker.setDescription("Thread");
log = getLogger("ThreadStatus");
current_thread = this;

View File

@ -8,6 +8,7 @@ namespace DB
{
namespace ErrorCodes
{
extern const int INCORRECT_DATA;
extern const int UNKNOWN_SETTING;
}
@ -31,11 +32,19 @@ void BaseSettingsHelpers::writeFlags(Flags flags, WriteBuffer & out)
}
BaseSettingsHelpers::Flags BaseSettingsHelpers::readFlags(ReadBuffer & in)
UInt64 BaseSettingsHelpers::readFlags(ReadBuffer & in)
{
UInt64 res;
readVarUInt(res, in);
return static_cast<Flags>(res);
return res;
}
SettingsTierType BaseSettingsHelpers::getTier(UInt64 flags)
{
int8_t tier = static_cast<int8_t>(flags & Flags::TIER);
if (tier > SettingsTierType::BETA)
throw Exception(ErrorCodes::INCORRECT_DATA, "Unknown tier value: '{}'", tier);
return SettingsTierType{tier};
}
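
The flag layout used by `getTier` above packs a two-bit tier into bits 2 and 3 (mask `0x0c`) next to the `IMPORTANT` and `CUSTOM` bits. A minimal sketch of that packing follows. The numeric tier values are an assumption (the diff does not show `SettingsTierType`), and `TierSketch`, `tierFromFlags` and `isImportant` are hypothetical names.

```cpp
#include <cstdint>

// Assumed tier encoding: each value occupies only the two bits covered by TIER_MASK.
enum TierSketch : uint8_t
{
    PRODUCTION = 0x00,
    OBSOLETE = 0x04,
    EXPERIMENTAL = 0x08,
    BETA = 0x0c
};

constexpr uint64_t IMPORTANT = 0x01;
constexpr uint64_t CUSTOM = 0x02;
constexpr uint64_t TIER_MASK = 0x0c; // 0b1100 == 2 bits

// Extract the tier by masking away every non-tier bit.
constexpr TierSketch tierFromFlags(uint64_t flags)
{
    return static_cast<TierSketch>(flags & TIER_MASK);
}

constexpr bool isImportant(uint64_t flags)
{
    return (flags & IMPORTANT) != 0;
}
```

A setting declared with `BETA | IMPORTANT` would then report `tierFromFlags(flags) == BETA` and `isImportant(flags) == true`, while a setting declared with `0` falls into `PRODUCTION`. Flags are combined with bitwise OR and tested with bitwise AND.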

View File

@ -2,6 +2,7 @@
#include <unordered_map>
#include <Core/SettingsFields.h>
#include <Core/SettingsTierType.h>
#include <Core/SettingsWriteFormat.h>
#include <IO/Operators.h>
#include <base/range.h>
@ -21,6 +22,27 @@ namespace DB
class ReadBuffer;
class WriteBuffer;
struct BaseSettingsHelpers
{
[[noreturn]] static void throwSettingNotFound(std::string_view name);
static void warningSettingNotFound(std::string_view name);
static void writeString(std::string_view str, WriteBuffer & out);
static String readString(ReadBuffer & in);
enum Flags : UInt64
{
IMPORTANT = 0x01,
CUSTOM = 0x02,
TIER = 0x0c, /// 0b1100 == 2 bits
/// If adding new flags, consider first if Tier might need more bits
};
static SettingsTierType getTier(UInt64 flags);
static void writeFlags(Flags flags, WriteBuffer & out);
static UInt64 readFlags(ReadBuffer & in);
};
/** Template class to define collections of settings.
* If you create a new setting, please also add it to ./utils/check-style/check-settings-style
* for validation
@ -138,7 +160,7 @@ public:
const char * getTypeName() const;
const char * getDescription() const;
bool isCustom() const;
bool isObsolete() const;
SettingsTierType getTier() const;
bool operator==(const SettingFieldRef & other) const { return (getName() == other.getName()) && (getValue() == other.getValue()); }
bool operator!=(const SettingFieldRef & other) const { return !(*this == other); }
@ -225,24 +247,6 @@ private:
std::conditional_t<Traits::allow_custom_settings, CustomSettingMap, boost::blank> custom_settings_map;
};
struct BaseSettingsHelpers
{
[[noreturn]] static void throwSettingNotFound(std::string_view name);
static void warningSettingNotFound(std::string_view name);
static void writeString(std::string_view str, WriteBuffer & out);
static String readString(ReadBuffer & in);
enum Flags : UInt64
{
IMPORTANT = 0x01,
CUSTOM = 0x02,
OBSOLETE = 0x04,
};
static void writeFlags(Flags flags, WriteBuffer & out);
static Flags readFlags(ReadBuffer & in);
};
template <typename TTraits>
void BaseSettings<TTraits>::set(std::string_view name, const Field & value)
{
@ -477,7 +481,7 @@ void BaseSettings<TTraits>::read(ReadBuffer & in, SettingsWriteFormat format)
size_t index = accessor.find(name);
using Flags = BaseSettingsHelpers::Flags;
Flags flags{0};
UInt64 flags{0};
if (format >= SettingsWriteFormat::STRINGS_WITH_FLAGS)
flags = BaseSettingsHelpers::readFlags(in);
bool is_important = (flags & Flags::IMPORTANT);
@ -797,14 +801,14 @@ bool BaseSettings<TTraits>::SettingFieldRef::isCustom() const
}
template <typename TTraits>
bool BaseSettings<TTraits>::SettingFieldRef::isObsolete() const
SettingsTierType BaseSettings<TTraits>::SettingFieldRef::getTier() const
{
if constexpr (Traits::allow_custom_settings)
{
if (custom_setting)
return false;
return SettingsTierType::PRODUCTION;
}
return accessor->isObsolete(index);
return accessor->getTier(index);
}
using AliasMap = std::unordered_map<std::string_view, std::string_view>;
@ -835,8 +839,8 @@ using AliasMap = std::unordered_map<std::string_view, std::string_view>;
const String & getName(size_t index) const { return field_infos[index].name; } \
const char * getTypeName(size_t index) const { return field_infos[index].type; } \
const char * getDescription(size_t index) const { return field_infos[index].description; } \
bool isImportant(size_t index) const { return field_infos[index].is_important; } \
bool isObsolete(size_t index) const { return field_infos[index].is_obsolete; } \
bool isImportant(size_t index) const { return field_infos[index].flags & BaseSettingsHelpers::Flags::IMPORTANT; } \
SettingsTierType getTier(size_t index) const { return BaseSettingsHelpers::getTier(field_infos[index].flags); } \
Field castValueUtil(size_t index, const Field & value) const { return field_infos[index].cast_value_util_function(value); } \
String valueToStringUtil(size_t index, const Field & value) const { return field_infos[index].value_to_string_util_function(value); } \
Field stringToValueUtil(size_t index, const String & str) const { return field_infos[index].string_to_value_util_function(str); } \
@ -856,8 +860,7 @@ using AliasMap = std::unordered_map<std::string_view, std::string_view>;
String name; \
const char * type; \
const char * description; \
bool is_important; \
bool is_obsolete; \
UInt64 flags; \
Field (*cast_value_util_function)(const Field &); \
String (*value_to_string_util_function)(const Field &); \
Field (*string_to_value_util_function)(const String &); \
@ -968,8 +971,8 @@ struct DefineAliases
/// NOLINTNEXTLINE
#define IMPLEMENT_SETTINGS_TRAITS_(TYPE, NAME, DEFAULT, DESCRIPTION, FLAGS) \
res.field_infos.emplace_back( \
FieldInfo{#NAME, #TYPE, DESCRIPTION, (FLAGS) & IMPORTANT, \
static_cast<bool>((FLAGS) & BaseSettingsHelpers::Flags::OBSOLETE), \
FieldInfo{#NAME, #TYPE, DESCRIPTION, \
static_cast<UInt64>(FLAGS), \
[](const Field & value) -> Field { return static_cast<Field>(SettingField##TYPE{value}); }, \
[](const Field & value) -> String { return SettingField##TYPE{value}.toString(); }, \
[](const String & str) -> Field { SettingField##TYPE temp; temp.parseFromString(str); return static_cast<Field>(temp); }, \

View File

@ -192,6 +192,13 @@ namespace DB
DECLARE(UInt64, parts_killer_pool_size, 128, "Threads for cleanup of shared merge tree outdated parts. Only available in ClickHouse Cloud", 0) \
DECLARE(UInt64, keeper_multiread_batch_size, 10'000, "Maximum size of batch for MultiRead request to [Zoo]Keeper that support batching. If set to 0, batching is disabled. Available only in ClickHouse Cloud.", 0) \
DECLARE(Bool, use_legacy_mongodb_integration, true, "Use the legacy MongoDB integration implementation. Note: it's highly recommended to set this option to false, since legacy implementation will be removed in the future. Please submit any issues you encounter with the new implementation.", 0) \
\
DECLARE(UInt64, prefetch_threadpool_pool_size, 100, "Size of background pool for prefetches for remote object storages", 0) \
DECLARE(UInt64, prefetch_threadpool_queue_size, 1000000, "Number of tasks that can be pushed into the prefetches pool", 0) \
DECLARE(UInt64, load_marks_threadpool_pool_size, 50, "Size of background pool for marks loading", 0) \
DECLARE(UInt64, load_marks_threadpool_queue_size, 1000000, "Number of tasks that can be pushed into the marks loading pool", 0) \
DECLARE(UInt64, threadpool_writer_pool_size, 100, "Size of background pool for write requests to object storages", 0) \
DECLARE(UInt64, threadpool_writer_queue_size, 1000000, "Number of tasks that can be pushed into the background pool for write requests to object storages", 0)
/// If you add a setting which can be updated at runtime, please update 'changeable_settings' map in dumpToSystemServerSettingsColumns below
@ -339,7 +346,7 @@ void ServerSettings::dumpToSystemServerSettingsColumns(ServerSettingColumnsParam
res_columns[4]->insert(setting.getDescription());
res_columns[5]->insert(setting.getTypeName());
res_columns[6]->insert(is_changeable ? changeable_settings_it->second.second : ChangeableWithoutRestart::No);
res_columns[7]->insert(setting.isObsolete());
res_columns[7]->insert(setting.getTier() == SettingsTierType::OBSOLETE);
}
}
}

View File

@ -1,7 +1,5 @@
#include <Columns/ColumnArray.h>
#include <Columns/ColumnMap.h>
#include <Core/BaseSettings.h>
#include <Core/BaseSettingsFwdMacros.h>
#include <Core/BaseSettingsFwdMacrosImpl.h>
#include <Core/BaseSettingsProgramOptions.h>
#include <Core/DistributedCacheProtocol.h>
@ -40,10 +38,17 @@ namespace ErrorCodes
* Note: as an alternative, we could implement settings to be completely dynamic in the form of the map: String -> Field,
* but we are not going to do it, because settings are used everywhere as static struct fields.
*
* `flags` can be either 0 or IMPORTANT.
* A setting is "IMPORTANT" if it affects the results of queries and can't be ignored by older versions.
* `flags` can include a Tier (BETA | EXPERIMENTAL), optionally combined with IMPORTANT using bitwise OR.
* The default (0) means a PRODUCTION-ready setting.
*
* When adding new or changing existing settings add them to the settings changes history in SettingsChangesHistory.h
* A setting is "IMPORTANT" if it affects the results of queries and can't be ignored by older versions.
* Tiers:
* EXPERIMENTAL: The feature is in active development. Mostly for developers or for ClickHouse enthusiasts.
* BETA: There are no known bugs in the functionality, but the outcome of using it together with other
* features/components is unknown and correctness is not guaranteed.
* PRODUCTION (Default): The feature is safe to use along with other features from the PRODUCTION tier.
*
* When adding new or changing existing settings add them to the settings changes history in SettingsChangesHistory.cpp
* for tracking settings changes in different versions and for special `compatibility` settings to work correctly.
*/
@ -437,6 +442,9 @@ Enables or disables creating a new file on each insert in azure engine tables
)", 0) \
DECLARE(Bool, s3_check_objects_after_upload, false, R"(
Check each uploaded object to s3 with head request to be sure that upload was successful
)", 0) \
DECLARE(Bool, azure_check_objects_after_upload, false, R"(
Check each uploaded object in azure blob storage to be sure that upload was successful
)", 0) \
DECLARE(Bool, s3_allow_parallel_part_upload, true, R"(
Use multiple threads for s3 multipart upload. It may lead to slightly higher memory usage
@ -4451,9 +4459,8 @@ Optimize GROUP BY when all keys in block are constant
DECLARE(Bool, legacy_column_name_of_tuple_literal, false, R"(
List all names of elements of large tuple literals in their column names instead of a hash. This setting exists only for compatibility reasons. It makes sense to set it to 'true' while doing a rolling update of a cluster from a version lower than 21.7 to a higher one.
)", 0) \
DECLARE(Bool, enable_named_columns_in_function_tuple, true, R"(
DECLARE(Bool, enable_named_columns_in_function_tuple, false, R"(
Generate named tuples in function tuple() when all names are unique and can be treated as unquoted identifiers.
Beware that this setting might currently result in broken queries. It's not recommended to use in production
)", 0) \
\
DECLARE(Bool, query_plan_enable_optimizations, true, R"(
@ -5104,6 +5111,9 @@ Only in ClickHouse Cloud. A maximum number of unacknowledged in-flight packets i
)", 0) \
DECLARE(UInt64, distributed_cache_data_packet_ack_window, DistributedCache::ACK_DATA_PACKET_WINDOW, R"(
Only in ClickHouse Cloud. A window for sending ACK for DataPacket sequence in a single distributed cache read request
)", 0) \
DECLARE(Bool, distributed_cache_discard_connection_if_unread_data, true, R"(
Only in ClickHouse Cloud. Discard connection if some data is unread.
)", 0) \
\
DECLARE(Bool, parallelize_output_from_storages, true, R"(
@ -5503,90 +5513,102 @@ For testing purposes. Replaces all external table functions to Null to not initi
DECLARE(Bool, restore_replace_external_dictionary_source_to_null, false, R"(
Replace external dictionary sources to Null on restore. Useful for testing purposes
)", 0) \
DECLARE(Bool, create_if_not_exists, false, R"(
Enable `IF NOT EXISTS` for `CREATE` statement by default. If either this setting or `IF NOT EXISTS` is specified and a table with the provided name already exists, no exception will be thrown.
)", 0) \
DECLARE(Bool, enforce_strict_identifier_format, false, R"(
If enabled, only allow identifiers containing alphanumeric characters and underscores.
)", 0) \
DECLARE(Bool, mongodb_throw_on_unsupported_query, true, R"(
If enabled, MongoDB tables will return an error when a MongoDB query cannot be built. Otherwise, ClickHouse reads the full table and processes it locally. This option does not apply to the legacy implementation or when 'allow_experimental_analyzer=0'.
)", 0) \
\
/* ###################################### */ \
/* ######## EXPERIMENTAL FEATURES ####### */ \
/* ###################################### */ \
DECLARE(Bool, allow_experimental_materialized_postgresql_table, false, R"(
Allows to use the MaterializedPostgreSQL table engine. Disabled by default, because this feature is experimental
)", 0) \
DECLARE(Bool, allow_experimental_funnel_functions, false, R"(
Enable experimental functions for funnel analysis.
)", 0) \
DECLARE(Bool, allow_experimental_nlp_functions, false, R"(
Enable experimental functions for natural language processing.
)", 0) \
DECLARE(Bool, allow_experimental_hash_functions, false, R"(
Enable experimental hash functions
)", 0) \
DECLARE(Bool, allow_experimental_object_type, false, R"(
Allow Object and JSON data types
)", 0) \
DECLARE(Bool, allow_experimental_time_series_table, false, R"(
Allows creation of tables with the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine.
/* Parallel replicas */ \
DECLARE(UInt64, allow_experimental_parallel_reading_from_replicas, 0, R"(
Use up to `max_parallel_replicas` replicas from each shard for SELECT query execution. Reading is parallelized and coordinated dynamically. 0 - disabled, 1 - enabled, silently disabled in case of failure, 2 - enabled, throws an exception in case of failure
)", BETA) ALIAS(enable_parallel_replicas) \
DECLARE(NonZeroUInt64, max_parallel_replicas, 1, R"(
The maximum number of replicas for each shard when executing a query.
Possible values:
- 0 the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is disabled.
- 1 the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is enabled.
)", 0) \
DECLARE(Bool, allow_experimental_vector_similarity_index, false, R"(
Allow experimental vector similarity index
)", 0) \
DECLARE(Bool, allow_experimental_variant_type, false, R"(
Allows creation of experimental [Variant](../../sql-reference/data-types/variant.md).
)", 0) \
DECLARE(Bool, allow_experimental_dynamic_type, false, R"(
Allow Dynamic data type
)", 0) \
DECLARE(Bool, allow_experimental_json_type, false, R"(
Allow JSON data type
)", 0) \
DECLARE(Bool, allow_experimental_codecs, false, R"(
If it is set to true, allow to specify experimental compression codecs (but we don't have those yet and this option does nothing).
)", 0) \
DECLARE(Bool, allow_experimental_shared_set_join, true, R"(
Only in ClickHouse Cloud. Allow to create SharedSet and SharedJoin
)", 0) \
DECLARE(UInt64, max_limit_for_ann_queries, 1'000'000, R"(
SELECT queries with LIMIT bigger than this setting cannot use vector similarity indexes. Helps to prevent memory overflows in vector similarity indexes.
)", 0) \
DECLARE(UInt64, hnsw_candidate_list_size_for_search, 256, R"(
The size of the dynamic candidate list when searching the vector similarity index, also known as 'ef_search'.
)", 0) \
DECLARE(Bool, throw_on_unsupported_query_inside_transaction, true, R"(
Throw exception if unsupported query is used inside transaction
)", 0) \
DECLARE(TransactionsWaitCSNMode, wait_changes_become_visible_after_commit_mode, TransactionsWaitCSNMode::WAIT_UNKNOWN, R"(
Wait for committed changes to become actually visible in the latest snapshot
)", 0) \
DECLARE(Bool, implicit_transaction, false, R"(
If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)
)", 0) \
DECLARE(UInt64, grace_hash_join_initial_buckets, 1, R"(
Initial number of grace hash join buckets
)", 0) \
DECLARE(UInt64, grace_hash_join_max_buckets, 1024, R"(
Limit on the number of grace hash join buckets
)", 0) \
DECLARE(UInt64, join_to_sort_minimum_perkey_rows, 40, R"(
The lower limit of per-key average rows in the right table to determine whether to rerange the right table by key in left or inner join. This setting ensures that the optimization is not applied for sparse table keys
)", 0) \
DECLARE(UInt64, join_to_sort_maximum_table_rows, 10000, R"(
The maximum number of rows in the right table to determine whether to rerange the right table by key in left or inner join.
)", 0) \
DECLARE(Bool, allow_experimental_join_right_table_sorting, false, R"(
If it is set to true, and the conditions of `join_to_sort_minimum_perkey_rows` and `join_to_sort_maximum_table_rows` are met, rerange the right table by key to improve the performance in left or inner hash join.
- Positive integer.
**Additional Info**
This option will produce different results depending on the settings used.
:::note
This setting will produce incorrect results when joins or subqueries are involved, and all tables don't meet certain requirements. See [Distributed Subqueries and max_parallel_replicas](../../sql-reference/operators/in.md/#max_parallel_replica-subqueries) for more details.
:::
### Parallel processing using `SAMPLE` key
A query may be processed faster if it is executed on several servers in parallel. But the query performance may degrade in the following cases:
- The position of the sampling key in the partitioning key does not allow efficient range scans.
- Adding a sampling key to the table makes filtering by other columns less efficient.
- The sampling key is an expression that is expensive to calculate.
- The cluster latency distribution has a long tail, so that querying more servers increases the query overall latency.
### Parallel processing using [parallel_replicas_custom_key](#parallel_replicas_custom_key)
This setting is useful for any replicated table.
)", 0) \
DECLARE(ParallelReplicasMode, parallel_replicas_mode, ParallelReplicasMode::READ_TASKS, R"(
Type of filter to use with custom key for parallel replicas. default - use modulo operation on the custom key, range - use range filter on custom key using all possible values for the value type of custom key.
)", BETA) \
DECLARE(UInt64, parallel_replicas_count, 0, R"(
This is an internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. This setting will be automatically set up by the initiator server for distributed queries to the number of parallel replicas participating in query processing.
)", BETA) \
DECLARE(UInt64, parallel_replica_offset, 0, R"(
This is an internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. This setting will be automatically set up by the initiator server for distributed queries to the index of the replica participating in query processing among parallel replicas.
)", BETA) \
DECLARE(String, parallel_replicas_custom_key, "", R"(
An arbitrary integer expression that can be used to split work between replicas for a specific table.
The value can be any integer expression.
Simple expressions using primary keys are preferred.
If the setting is used on a cluster that consists of a single shard with multiple replicas, those replicas will be converted into virtual shards.
Otherwise, it will behave the same as for the `SAMPLE` key: it will use multiple replicas of each shard.
)", BETA) \
DECLARE(UInt64, parallel_replicas_custom_key_range_lower, 0, R"(
Allows the filter type `range` to split the work evenly between replicas based on the custom range `[parallel_replicas_custom_key_range_lower, INT_MAX]`.
When used in conjunction with [parallel_replicas_custom_key_range_upper](#parallel_replicas_custom_key_range_upper), it lets the filter evenly split the work over replicas for the range `[parallel_replicas_custom_key_range_lower, parallel_replicas_custom_key_range_upper]`.
Note: This setting will not cause any additional data to be filtered during query processing, rather it changes the points at which the range filter breaks up the range `[0, INT_MAX]` for parallel processing.
)", BETA) \
DECLARE(UInt64, parallel_replicas_custom_key_range_upper, 0, R"(
Allows the filter type `range` to split the work evenly between replicas based on the custom range `[0, parallel_replicas_custom_key_range_upper]`. A value of 0 disables the upper bound, setting it to the max value of the custom key expression.
When used in conjunction with [parallel_replicas_custom_key_range_lower](#parallel_replicas_custom_key_range_lower), it lets the filter evenly split the work over replicas for the range `[parallel_replicas_custom_key_range_lower, parallel_replicas_custom_key_range_upper]`.
Note: This setting will not cause any additional data to be filtered during query processing, rather it changes the points at which the range filter breaks up the range `[0, INT_MAX]` for parallel processing
)", BETA) \
DECLARE(String, cluster_for_parallel_replicas, "", R"(
Cluster for a shard in which current server is located
)", BETA) \
DECLARE(Bool, parallel_replicas_allow_in_with_subquery, true, R"(
If true, subquery for IN will be executed on every follower replica.
)", BETA) \
DECLARE(Float, parallel_replicas_single_task_marks_count_multiplier, 2, R"(
A multiplier which will be added during calculation for minimal number of marks to retrieve from coordinator. This will be applied only for remote replicas.
)", BETA) \
DECLARE(Bool, parallel_replicas_for_non_replicated_merge_tree, false, R"(
If true, ClickHouse will use parallel replicas algorithm also for non-replicated MergeTree tables
)", BETA) \
DECLARE(UInt64, parallel_replicas_min_number_of_rows_per_replica, 0, R"(
Limit the number of replicas used in a query to (estimated rows to read / min_number_of_rows_per_replica). The max is still limited by 'max_parallel_replicas'
)", BETA) \
DECLARE(Bool, parallel_replicas_prefer_local_join, true, R"(
If true, and JOIN can be executed with parallel replicas algorithm, and all storages of right JOIN part are *MergeTree, local JOIN will be used instead of GLOBAL JOIN.
)", BETA) \
DECLARE(UInt64, parallel_replicas_mark_segment_size, 0, R"(
Parts virtually divided into segments to be distributed between replicas for parallel reading. This setting controls the size of these segments. Not recommended to change until you're absolutely sure in what you're doing. Value should be in range [128; 16384]
)", BETA) \
DECLARE(Bool, parallel_replicas_local_plan, false, R"(
Build local plan for local replica
)", BETA) \
\
DECLARE(Bool, allow_experimental_analyzer, true, R"(
Allow new query analyzer.
)", IMPORTANT | BETA) ALIAS(enable_analyzer) \
DECLARE(Bool, analyzer_compatibility_join_using_top_level_identifier, false, R"(
Force to resolve identifier in JOIN USING from projection (for example, in `SELECT a + 1 AS b FROM t1 JOIN t2 USING (b)` join will be performed by `t1.a + 1 = t2.b`, rather than `t1.b = t2.b`).
)", BETA) \
\
DECLARE(Timezone, session_timezone, "", R"(
Sets the implicit time zone of the current session or query.
The implicit time zone is the time zone applied to values of type DateTime/DateTime64 which have no explicitly specified time zone.
@ -5646,126 +5668,121 @@ This happens due to different parsing pipelines:
**See also**
- [timezone](../server-configuration-parameters/settings.md#timezone)
)", BETA) \
DECLARE(Bool, create_if_not_exists, false, R"(
Enable `IF NOT EXISTS` for `CREATE` statement by default. If either this setting or `IF NOT EXISTS` is specified and a table with the provided name already exists, no exception will be thrown.
)", 0) \
DECLARE(Bool, enforce_strict_identifier_format, false, R"(
If enabled, only allow identifiers containing alphanumeric characters and underscores.
)", 0) \
DECLARE(Bool, mongodb_throw_on_unsupported_query, true, R"(
If enabled, MongoDB tables will return an error when a MongoDB query cannot be built. Otherwise, ClickHouse reads the full table and processes it locally. This option does not apply to the legacy implementation or when 'allow_experimental_analyzer=0'.
)", 0) \
DECLARE(Bool, implicit_select, false, R"(
Allow writing simple SELECT queries without the leading SELECT keyword, which makes it simple for calculator-style usage, e.g. `1 + 2` becomes a valid query.
)", 0) \
DECLARE(Bool, use_hive_partitioning, false, R"(
When enabled, ClickHouse will detect Hive-style partitioning in path (`/name=value/`) in file-like table engines [File](../../engines/table-engines/special/file.md#hive-style-partitioning)/[S3](../../engines/table-engines/integrations/s3.md#hive-style-partitioning)/[URL](../../engines/table-engines/special/url.md#hive-style-partitioning)/[HDFS](../../engines/table-engines/integrations/hdfs.md#hive-style-partitioning)/[AzureBlobStorage](../../engines/table-engines/integrations/azureBlobStorage.md#hive-style-partitioning) and will allow to use partition columns as virtual columns in the query. These virtual columns will have the same names as in the partitioned path, but starting with `_`.
)", 0)\
\
DECLARE(Bool, allow_statistics_optimize, false, R"(
Allows using statistics to optimize queries
)", 0) ALIAS(allow_statistic_optimize) \
DECLARE(Bool, allow_experimental_statistics, false, R"(
Allows defining columns with [statistics](../../engines/table-engines/mergetree-family/mergetree.md#table_engine-mergetree-creating-a-table) and [manipulate statistics](../../engines/table-engines/mergetree-family/mergetree.md#column-statistics).
)", 0) ALIAS(allow_experimental_statistic) \
\
/* Parallel replicas */ \
DECLARE(UInt64, allow_experimental_parallel_reading_from_replicas, 0, R"(
Use up to `max_parallel_replicas` the number of replicas from each shard for SELECT query execution. Reading is parallelized and coordinated dynamically. 0 - disabled, 1 - enabled, silently disable them in case of failure, 2 - enabled, throw an exception in case of failure
)", 0) ALIAS(enable_parallel_replicas) \
DECLARE(NonZeroUInt64, max_parallel_replicas, 1, R"(
The maximum number of replicas for each shard when executing a query.
/* ####################################################### */ \
/* ########### START OF EXPERIMENTAL FEATURES ############ */ \
/* ## ADD PRODUCTION / BETA FEATURES BEFORE THIS BLOCK ## */ \
/* ####################################################### */ \
\
DECLARE(Bool, allow_experimental_materialized_postgresql_table, false, R"(
Allows to use the MaterializedPostgreSQL table engine. Disabled by default, because this feature is experimental
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_funnel_functions, false, R"(
Enable experimental functions for funnel analysis.
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_nlp_functions, false, R"(
Enable experimental functions for natural language processing.
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_hash_functions, false, R"(
Enable experimental hash functions
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_object_type, false, R"(
Allow Object and JSON data types
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_time_series_table, false, R"(
Allows creation of tables with the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine.
Possible values:
- Positive integer.
**Additional Info**
This option will produce different results depending on the settings used.
:::note
This setting will produce incorrect results when joins or subqueries are involved, and all tables don't meet certain requirements. See [Distributed Subqueries and max_parallel_replicas](../../sql-reference/operators/in.md/#max_parallel_replica-subqueries) for more details.
:::
### Parallel processing using `SAMPLE` key
A query may be processed faster if it is executed on several servers in parallel. But the query performance may degrade in the following cases:
- The position of the sampling key in the partitioning key does not allow efficient range scans.
- Adding a sampling key to the table makes filtering by other columns less efficient.
- The sampling key is an expression that is expensive to calculate.
- The cluster latency distribution has a long tail, so that querying more servers increases the query overall latency.
### Parallel processing using [parallel_replicas_custom_key](#parallel_replicas_custom_key)
This setting is useful for any replicated table.
)", 0) \
DECLARE(ParallelReplicasMode, parallel_replicas_mode, ParallelReplicasMode::READ_TASKS, R"(
Type of filter to use with custom key for parallel replicas. default - use modulo operation on the custom key, range - use range filter on custom key using all possible values for the value type of custom key.
)", 0) \
DECLARE(UInt64, parallel_replicas_count, 0, R"(
This is internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. This setting will be automatically set up by the initiator server for distributed queries to the number of parallel replicas participating in query processing.
)", 0) \
DECLARE(UInt64, parallel_replica_offset, 0, R"(
This is internal setting that should not be used directly and represents an implementation detail of the 'parallel replicas' mode. This setting will be automatically set up by the initiator server for distributed queries to the index of the replica participating in query processing among parallel replicas.
)", 0) \
DECLARE(String, parallel_replicas_custom_key, "", R"(
An arbitrary integer expression that can be used to split work between replicas for a specific table.
The value can be any integer expression.
Simple expressions using primary keys are preferred.
If the setting is used on a cluster that consists of a single shard with multiple replicas, those replicas will be converted into virtual shards.
Otherwise, it will behave same as for `SAMPLE` key, it will use multiple replicas of each shard.
)", 0) \
DECLARE(UInt64, parallel_replicas_custom_key_range_lower, 0, R"(
Allows the filter type `range` to split the work evenly between replicas based on the custom range `[parallel_replicas_custom_key_range_lower, INT_MAX]`.
When used in conjunction with [parallel_replicas_custom_key_range_upper](#parallel_replicas_custom_key_range_upper), it lets the filter evenly split the work over replicas for the range `[parallel_replicas_custom_key_range_lower, parallel_replicas_custom_key_range_upper]`.
Note: This setting will not cause any additional data to be filtered during query processing, rather it changes the points at which the range filter breaks up the range `[0, INT_MAX]` for parallel processing.
)", 0) \
DECLARE(UInt64, parallel_replicas_custom_key_range_upper, 0, R"(
Allows the filter type `range` to split the work evenly between replicas based on the custom range `[0, parallel_replicas_custom_key_range_upper]`. A value of 0 disables the upper bound, setting it to the max value of the custom key expression.
When used in conjunction with [parallel_replicas_custom_key_range_lower](#parallel_replicas_custom_key_range_lower), it lets the filter evenly split the work over replicas for the range `[parallel_replicas_custom_key_range_lower, parallel_replicas_custom_key_range_upper]`.
Note: This setting will not cause any additional data to be filtered during query processing, rather it changes the points at which the range filter breaks up the range `[0, INT_MAX]` for parallel processing
)", 0) \
DECLARE(String, cluster_for_parallel_replicas, "", R"(
Cluster for a shard in which current server is located
)", 0) \
DECLARE(Bool, parallel_replicas_allow_in_with_subquery, true, R"(
If true, subquery for IN will be executed on every follower replica.
)", 0) \
DECLARE(Float, parallel_replicas_single_task_marks_count_multiplier, 2, R"(
A multiplier which will be added during calculation for minimal number of marks to retrieve from coordinator. This will be applied only for remote replicas.
)", 0) \
DECLARE(Bool, parallel_replicas_for_non_replicated_merge_tree, false, R"(
If true, ClickHouse will use parallel replicas algorithm also for non-replicated MergeTree tables
)", 0) \
DECLARE(UInt64, parallel_replicas_min_number_of_rows_per_replica, 0, R"(
Limit the number of replicas used in a query to (estimated rows to read / min_number_of_rows_per_replica). The max is still limited by 'max_parallel_replicas'
)", 0) \
DECLARE(Bool, parallel_replicas_prefer_local_join, true, R"(
If true, and JOIN can be executed with parallel replicas algorithm, and all storages of right JOIN part are *MergeTree, local JOIN will be used instead of GLOBAL JOIN.
)", 0) \
DECLARE(UInt64, parallel_replicas_mark_segment_size, 0, R"(
Parts virtually divided into segments to be distributed between replicas for parallel reading. This setting controls the size of these segments. Not recommended to change until you're absolutely sure in what you're doing. Value should be in range [128; 16384]
- 0 the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is disabled.
- 1 the [TimeSeries](../../engines/table-engines/integrations/time-series.md) table engine is enabled.
)", 0) \
DECLARE(Bool, allow_experimental_vector_similarity_index, false, R"(
Allow experimental vector similarity index
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_variant_type, false, R"(
Allows creation of experimental [Variant](../../sql-reference/data-types/variant.md).
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_dynamic_type, false, R"(
Allow Dynamic data type
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_json_type, false, R"(
Allow JSON data type
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_codecs, false, R"(
If it is set to true, allow to specify experimental compression codecs (but we don't have those yet and this option does nothing).
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_shared_set_join, true, R"(
Only in ClickHouse Cloud. Allow to create SharedSet and SharedJoin
)", EXPERIMENTAL) \
DECLARE(UInt64, max_limit_for_ann_queries, 1'000'000, R"(
SELECT queries with LIMIT bigger than this setting cannot use vector similarity indexes. Helps to prevent memory overflows in vector similarity indexes.
)", EXPERIMENTAL) \
DECLARE(UInt64, hnsw_candidate_list_size_for_search, 256, R"(
The size of the dynamic candidate list when searching the vector similarity index, also known as 'ef_search'.
)", EXPERIMENTAL) \
DECLARE(Bool, throw_on_unsupported_query_inside_transaction, true, R"(
Throw exception if unsupported query is used inside transaction
)", EXPERIMENTAL) \
DECLARE(TransactionsWaitCSNMode, wait_changes_become_visible_after_commit_mode, TransactionsWaitCSNMode::WAIT_UNKNOWN, R"(
Wait for committed changes to become actually visible in the latest snapshot
)", EXPERIMENTAL) \
DECLARE(Bool, implicit_transaction, false, R"(
If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)
)", EXPERIMENTAL) \
DECLARE(UInt64, grace_hash_join_initial_buckets, 1, R"(
Initial number of grace hash join buckets
)", EXPERIMENTAL) \
DECLARE(UInt64, grace_hash_join_max_buckets, 1024, R"(
Limit on the number of grace hash join buckets
)", EXPERIMENTAL) \
DECLARE(UInt64, join_to_sort_minimum_perkey_rows, 40, R"(
The lower limit of per-key average rows in the right table to determine whether to rerange the right table by key in left or inner join. This setting ensures that the optimization is not applied for sparse table keys
)", EXPERIMENTAL) \
DECLARE(UInt64, join_to_sort_maximum_table_rows, 10000, R"(
The maximum number of rows in the right table to determine whether to rerange the right table by key in left or inner join.
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_join_right_table_sorting, false, R"(
If it is set to true, and the conditions of `join_to_sort_minimum_perkey_rows` and `join_to_sort_maximum_table_rows` are met, rerange the right table by key to improve the performance in left or inner hash join.
)", EXPERIMENTAL) \
DECLARE(Bool, use_hive_partitioning, false, R"(
When enabled, ClickHouse will detect Hive-style partitioning in path (`/name=value/`) in file-like table engines [File](../../engines/table-engines/special/file.md#hive-style-partitioning)/[S3](../../engines/table-engines/integrations/s3.md#hive-style-partitioning)/[URL](../../engines/table-engines/special/url.md#hive-style-partitioning)/[HDFS](../../engines/table-engines/integrations/hdfs.md#hive-style-partitioning)/[AzureBlobStorage](../../engines/table-engines/integrations/azureBlobStorage.md#hive-style-partitioning) and will allow to use partition columns as virtual columns in the query. These virtual columns will have the same names as in the partitioned path, but starting with `_`.
)", EXPERIMENTAL)\
\
DECLARE(Bool, allow_statistics_optimize, false, R"(
Allows using statistics to optimize queries
)", EXPERIMENTAL) ALIAS(allow_statistic_optimize) \
DECLARE(Bool, allow_experimental_statistics, false, R"(
Allows defining columns with [statistics](../../engines/table-engines/mergetree-family/mergetree.md#table_engine-mergetree-creating-a-table) and [manipulate statistics](../../engines/table-engines/mergetree-family/mergetree.md#column-statistics).
)", EXPERIMENTAL) ALIAS(allow_experimental_statistic) \
\
DECLARE(Bool, allow_archive_path_syntax, true, R"(
File/S3 engines/table function will parse paths with '::' as '\\<archive\\> :: \\<file\\>' if archive has correct extension
)", 0) \
DECLARE(Bool, parallel_replicas_local_plan, false, R"(
Build local plan for local replica
)", 0) \
)", EXPERIMENTAL) \
\
DECLARE(Bool, allow_experimental_inverted_index, false, R"(
If it is set to true, allow to use experimental inverted index.
)", 0) \
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_full_text_index, false, R"(
If it is set to true, allow to use experimental full-text index.
)", 0) \
)", EXPERIMENTAL) \
\
DECLARE(Bool, allow_experimental_join_condition, false, R"(
Support join with inequal conditions which involve columns from both left and right table. e.g. t1.y < t2.y.
)", 0) \
\
DECLARE(Bool, allow_experimental_analyzer, true, R"(
Allow new query analyzer.
)", IMPORTANT) ALIAS(enable_analyzer) \
DECLARE(Bool, analyzer_compatibility_join_using_top_level_identifier, false, R"(
Force to resolve identifier in JOIN USING from projection (for example, in `SELECT a + 1 AS b FROM t1 JOIN t2 USING (b)` join will be performed by `t1.a + 1 = t2.b`, rather than `t1.b = t2.b`).
)", 0) \
\
DECLARE(Bool, allow_experimental_live_view, false, R"(
@ -5778,43 +5795,43 @@ Possible values:
)", 0) \
DECLARE(Seconds, live_view_heartbeat_interval, 15, R"(
The heartbeat interval in seconds to indicate live query is alive.
)", 0) \
)", EXPERIMENTAL) \
DECLARE(UInt64, max_live_view_insert_blocks_before_refresh, 64, R"(
Limit maximum number of inserted blocks after which mergeable blocks are dropped and query is re-executed.
)", 0) \
)", EXPERIMENTAL) \
\
DECLARE(Bool, allow_experimental_window_view, false, R"(
Enable WINDOW VIEW. Not mature enough.
)", 0) \
)", EXPERIMENTAL) \
DECLARE(Seconds, window_view_clean_interval, 60, R"(
The clean interval of window view in seconds to free outdated data.
)", 0) \
)", EXPERIMENTAL) \
DECLARE(Seconds, window_view_heartbeat_interval, 15, R"(
The heartbeat interval in seconds to indicate watch query is alive.
)", 0) \
)", EXPERIMENTAL) \
DECLARE(Seconds, wait_for_window_view_fire_signal_timeout, 10, R"(
Timeout for waiting for window view fire signal in event time processing
)", 0) \
)", EXPERIMENTAL) \
\
DECLARE(Bool, stop_refreshable_materialized_views_on_startup, false, R"(
On server startup, prevent scheduling of refreshable materialized views, as if with SYSTEM STOP VIEWS. You can manually start them with SYSTEM START VIEWS or SYSTEM START VIEW \\<name\\> afterwards. Also applies to newly created views. Has no effect on non-refreshable materialized views.
)", 0) \
)", EXPERIMENTAL) \
\
DECLARE(Bool, allow_experimental_database_materialized_mysql, false, R"(
Allow to create database with Engine=MaterializedMySQL(...).
)", 0) \
)", EXPERIMENTAL) \
DECLARE(Bool, allow_experimental_database_materialized_postgresql, false, R"(
Allow to create database with Engine=MaterializedPostgreSQL(...).
)", 0) \
)", EXPERIMENTAL) \
\
/** Experimental feature for moving data between shards. */ \
DECLARE(Bool, allow_experimental_query_deduplication, false, R"(
Experimental data deduplication for SELECT queries based on part UUIDs
)", 0) \
DECLARE(Bool, implicit_select, false, R"(
Allow writing simple SELECT queries without the leading SELECT keyword, which makes it simple for calculator-style usage, e.g. `1 + 2` becomes a valid query.
)", 0)
)", EXPERIMENTAL) \
\
/* ####################################################### */ \
/* ############ END OF EXPERIMENTAL FEATURES ############# */ \
/* ####################################################### */ \
// End of COMMON_SETTINGS
// Please add settings related to formats in Core/FormatFactorySettings.h, move obsolete settings to OBSOLETE_SETTINGS and obsolete format settings to OBSOLETE_FORMAT_SETTINGS.
@ -5893,13 +5910,14 @@ Allow writing simple SELECT queries without the leading SELECT keyword, which ma
/** The section above is for obsolete settings. Do not add anything there. */
#endif /// __CLION_IDE__
#define LIST_OF_SETTINGS(M, ALIAS) \
COMMON_SETTINGS(M, ALIAS) \
OBSOLETE_SETTINGS(M, ALIAS) \
FORMAT_FACTORY_SETTINGS(M, ALIAS) \
OBSOLETE_FORMAT_SETTINGS(M, ALIAS) \
// clang-format on
DECLARE_SETTINGS_TRAITS_ALLOW_CUSTOM_SETTINGS(SettingsTraits, LIST_OF_SETTINGS)
IMPLEMENT_SETTINGS_TRAITS(SettingsTraits, LIST_OF_SETTINGS)
@ -6007,7 +6025,7 @@ void SettingsImpl::checkNoSettingNamesAtTopLevel(const Poco::Util::AbstractConfi
{
const auto & name = setting.getName();
bool should_skip_check = name == "max_table_size_to_drop" || name == "max_partition_size_to_drop";
if (config.has(name) && !setting.isObsolete() && !should_skip_check)
if (config.has(name) && (setting.getTier() != SettingsTierType::OBSOLETE) && !should_skip_check)
{
throw Exception(ErrorCodes::UNKNOWN_ELEMENT_IN_CONFIG, "A setting '{}' appeared at top level in config {}."
" But it is user-level setting that should be located in users.xml inside <profiles> section for specific profile."
@ -6183,7 +6201,7 @@ std::vector<std::string_view> Settings::getChangedAndObsoleteNames() const
std::vector<std::string_view> setting_names;
for (const auto & setting : impl->allChanged())
{
if (setting.isObsolete())
if (setting.getTier() == SettingsTierType::OBSOLETE)
setting_names.emplace_back(setting.getName());
}
return setting_names;
@ -6232,7 +6250,8 @@ void Settings::dumpToSystemSettingsColumns(MutableColumnsAndConstraints & params
res_columns[6]->insert(writability == SettingConstraintWritability::CONST);
res_columns[7]->insert(setting.getTypeName());
res_columns[8]->insert(setting.getDefaultValueString());
res_columns[10]->insert(setting.isObsolete());
res_columns[10]->insert(setting.getTier() == SettingsTierType::OBSOLETE);
res_columns[11]->insert(setting.getTier());
};
const auto & settings_to_aliases = SettingsImpl::Traits::settingsToAliases();

View File

@ -64,6 +64,7 @@ static std::initializer_list<std::pair<ClickHouseVersion, SettingsChangesHistory
},
{"24.11",
{
{"distributed_cache_discard_connection_if_unread_data", true, true, "New setting"},
}
},
{"24.10",
@ -87,7 +88,7 @@ static std::initializer_list<std::pair<ClickHouseVersion, SettingsChangesHistory
{"input_format_binary_read_json_as_string", false, false, "Add new setting to read values of JSON type as JSON string in RowBinary input format"},
{"min_free_disk_bytes_to_perform_insert", 0, 0, "New setting."},
{"min_free_disk_ratio_to_perform_insert", 0.0, 0.0, "New setting."},
{"enable_named_columns_in_function_tuple", false, true, "Re-enable the setting since all known bugs are fixed"},
{"enable_named_columns_in_function_tuple", false, false, "Disabled pending usability improvements"},
{"cloud_mode_database_engine", 1, 1, "A setting for ClickHouse Cloud"},
{"allow_experimental_shared_set_join", 1, 1, "A setting for ClickHouse Cloud"},
{"read_through_distributed_cache", 0, 0, "A setting for ClickHouse Cloud"},
@ -111,7 +112,8 @@ static std::initializer_list<std::pair<ClickHouseVersion, SettingsChangesHistory
{"hnsw_candidate_list_size_for_search", 64, 256, "New setting. Previously, the value was optionally specified in CREATE INDEX and 64 by default."},
{"allow_reorder_prewhere_conditions", false, true, "New setting"},
{"input_format_parquet_bloom_filter_push_down", false, true, "When reading Parquet files, skip whole row groups based on the WHERE/PREWHERE expressions and bloom filter in the Parquet metadata."},
{"date_time_64_output_format_cut_trailing_zeros_align_to_groups_of_thousands", false, false, "Dynamically trim the trailing zeros of datetime64 values to adjust the output scale to (0, 3, 6), corresponding to 'seconds', 'milliseconds', and 'microseconds'."}
{"date_time_64_output_format_cut_trailing_zeros_align_to_groups_of_thousands", false, false, "Dynamically trim the trailing zeros of datetime64 values to adjust the output scale to (0, 3, 6), corresponding to 'seconds', 'milliseconds', and 'microseconds'."},
{"azure_check_objects_after_upload", false, false, "Check each uploaded object in azure blob storage to be sure that upload was successful"},
}
},
{"24.9",

View File

@ -2,8 +2,8 @@
// clang-format off
#define MAKE_OBSOLETE(M, TYPE, NAME, DEFAULT) \
M(TYPE, NAME, DEFAULT, "Obsolete setting, does nothing.", BaseSettingsHelpers::Flags::OBSOLETE)
M(TYPE, NAME, DEFAULT, "Obsolete setting, does nothing.", SettingsTierType::OBSOLETE)
/// NOTE: ServerSettings::loadSettingsFromConfig() should be updated to include this settings
#define MAKE_DEPRECATED_BY_SERVER_CONFIG(M, TYPE, NAME, DEFAULT) \
M(TYPE, NAME, DEFAULT, "User-level setting is deprecated, and it must be defined in the server configuration instead.", BaseSettingsHelpers::Flags::OBSOLETE)
M(TYPE, NAME, DEFAULT, "User-level setting is deprecated, and it must be defined in the server configuration instead.", SettingsTierType::OBSOLETE)

View File

@ -0,0 +1,19 @@
#include <Core/SettingsTierType.h>
#include <DataTypes/DataTypeEnum.h>
namespace DB
{
std::shared_ptr<DataTypeEnum8> getSettingsTierEnum()
{
return std::make_shared<DataTypeEnum8>(
DataTypeEnum8::Values
{
{"Production", static_cast<Int8>(SettingsTierType::PRODUCTION)},
{"Obsolete", static_cast<Int8>(SettingsTierType::OBSOLETE)},
{"Experimental", static_cast<Int8>(SettingsTierType::EXPERIMENTAL)},
{"Beta", static_cast<Int8>(SettingsTierType::BETA)}
});
}
}

View File

@ -0,0 +1,26 @@
#pragma once
#include <Core/Types.h>
#include <cstdint>
#include <memory>
namespace DB
{
template <typename Type>
class DataTypeEnum;
using DataTypeEnum8 = DataTypeEnum<Int8>;
// Make it signed for compatibility with DataTypeEnum8
enum SettingsTierType : int8_t
{
PRODUCTION = 0b0000,
OBSOLETE = 0b0100,
EXPERIMENTAL = 0b1000,
BETA = 0b1100
};
std::shared_ptr<DataTypeEnum8> getSettingsTierEnum();
}
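
The tier values in this header occupy only two bits (`0b0100` and `0b1000`), which is what lets a declaration such as `IMPORTANT | BETA` carry both a flag and a tier in one integer. The sketch below illustrates how a tier might be recovered from such a combined value. The names `IMPORTANT_FLAG`, `TIER_MASK`, `extractTier`, and `isImportant` are hypothetical and chosen for illustration; the real flag constants and accessors live elsewhere in the codebase and may be laid out differently.

```cpp
#include <cassert>
#include <cstdint>

// Same values as the enum declared in the header above.
enum SettingsTierType : int8_t
{
    PRODUCTION = 0b0000,
    OBSOLETE = 0b0100,
    EXPERIMENTAL = 0b1000,
    BETA = 0b1100
};

// Hypothetical flag layout (an assumption, not taken from the diff):
// bit 0 marks an "important" setting, bits 2-3 hold the tier.
constexpr uint64_t IMPORTANT_FLAG = 0b0001;
constexpr uint64_t TIER_MASK = 0b1100;

// Mask out everything except the tier bits and convert back to the enum.
constexpr SettingsTierType extractTier(uint64_t flags)
{
    return static_cast<SettingsTierType>(flags & TIER_MASK);
}

// The non-tier flag is read independently of the tier bits.
constexpr bool isImportant(uint64_t flags)
{
    return (flags & IMPORTANT_FLAG) != 0;
}

// Compile-time check: combining a flag with a tier does not lose the tier.
static_assert(extractTier(IMPORTANT_FLAG | BETA) == BETA);
```

Because `PRODUCTION` is zero, a setting declared with plain `0` falls into the production tier under this layout without any extra handling.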

View File

@ -195,6 +195,15 @@ struct SortCursorHelper
/// The last row of this cursor is no larger than the first row of the another cursor.
return !derived().greaterAt(rhs.derived(), impl->rows - 1, 0);
}
bool ALWAYS_INLINE totallyLess(const SortCursorHelper & rhs) const
{
if (impl->rows == 0 || rhs.impl->rows == 0)
return false;
/// The last row of this cursor is less than the first row of the another cursor.
return rhs.derived().template greaterAt<false>(derived(), 0, impl->rows - 1);
}
};
@ -203,6 +212,7 @@ struct SortCursor : SortCursorHelper<SortCursor>
using SortCursorHelper<SortCursor>::SortCursorHelper;
/// The specified row of this cursor is greater than the specified row of another cursor.
template <bool consider_order = true>
bool ALWAYS_INLINE greaterAt(const SortCursor & rhs, size_t lhs_pos, size_t rhs_pos) const
{
#if USE_EMBEDDED_COMPILER
@ -218,7 +228,10 @@ struct SortCursor : SortCursorHelper<SortCursor>
if (res < 0)
return false;
return impl->order > rhs.impl->order;
if constexpr (consider_order)
return impl->order > rhs.impl->order;
else
return false;
}
#endif
@ -235,7 +248,10 @@ struct SortCursor : SortCursorHelper<SortCursor>
return false;
}
return impl->order > rhs.impl->order;
if constexpr (consider_order)
return impl->order > rhs.impl->order;
else
return false;
}
};
@ -245,6 +261,7 @@ struct SimpleSortCursor : SortCursorHelper<SimpleSortCursor>
{
using SortCursorHelper<SimpleSortCursor>::SortCursorHelper;
template <bool consider_order = true>
bool ALWAYS_INLINE greaterAt(const SimpleSortCursor & rhs, size_t lhs_pos, size_t rhs_pos) const
{
int res = 0;
@ -271,7 +288,10 @@ struct SimpleSortCursor : SortCursorHelper<SimpleSortCursor>
if (res < 0)
return false;
return impl->order > rhs.impl->order;
if constexpr (consider_order)
return impl->order > rhs.impl->order;
else
return false;
}
};
@ -280,6 +300,7 @@ struct SpecializedSingleColumnSortCursor : SortCursorHelper<SpecializedSingleCol
{
using SortCursorHelper<SpecializedSingleColumnSortCursor>::SortCursorHelper;
template <bool consider_order = true>
bool ALWAYS_INLINE greaterAt(const SortCursorHelper<SpecializedSingleColumnSortCursor> & rhs, size_t lhs_pos, size_t rhs_pos) const
{
auto & this_impl = this->impl;
@ -302,7 +323,10 @@ struct SpecializedSingleColumnSortCursor : SortCursorHelper<SpecializedSingleCol
if (res < 0)
return false;
return this_impl->order > rhs.impl->order;
if constexpr (consider_order)
return this_impl->order > rhs.impl->order;
else
return false;
}
};
@ -311,6 +335,7 @@ struct SortCursorWithCollation : SortCursorHelper<SortCursorWithCollation>
{
using SortCursorHelper<SortCursorWithCollation>::SortCursorHelper;
template <bool consider_order = true>
bool ALWAYS_INLINE greaterAt(const SortCursorWithCollation & rhs, size_t lhs_pos, size_t rhs_pos) const
{
for (size_t i = 0; i < impl->sort_columns_size; ++i)
@ -330,7 +355,10 @@ struct SortCursorWithCollation : SortCursorHelper<SortCursorWithCollation>
if (res < 0)
return false;
}
return impl->order > rhs.impl->order;
if constexpr (consider_order)
return impl->order > rhs.impl->order;
else
return false;
}
};
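
The `consider_order` template parameter added throughout this file controls whether equal rows are tie-broken by the cursor's source order. A reduced sketch of the idea follows, assuming a hypothetical `MiniCursor` with a single sorted `int` column; it is not the real cursor type, and it only mirrors the shape of `greaterAt` and `totallyLess` shown in the hunks above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sort cursor: one ascending int column plus a source index.
struct MiniCursor
{
    std::vector<int> rows;
    size_t order = 0;

    // True if this cursor's row at lhs_pos sorts after rhs's row at rhs_pos.
    // With consider_order = false, equal rows are never reported as greater.
    template <bool consider_order = true>
    bool greaterAt(const MiniCursor & rhs, size_t lhs_pos, size_t rhs_pos) const
    {
        if (rows[lhs_pos] != rhs.rows[rhs_pos])
            return rows[lhs_pos] > rhs.rows[rhs_pos];

        if constexpr (consider_order)
            return order > rhs.order;
        else
            return false;
    }

    // True only if the last row here is strictly less than the first row of rhs.
    bool totallyLess(const MiniCursor & rhs) const
    {
        if (rows.empty() || rhs.rows.empty())
            return false;
        return rhs.greaterAt<false>(*this, 0, rows.size() - 1);
    }
};
```

In this sketch, disabling the tie-break is what makes the check strict: if the boundary rows are equal, `greaterAt<false>` returns `false`, so `totallyLess` does not claim the two ranges are disjoint.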

View File

@ -161,7 +161,7 @@ String getNameForSubstreamPath(
String stream_name,
SubstreamIterator begin,
SubstreamIterator end,
bool escape_tuple_delimiter)
bool escape_for_file_name)
{
using Substream = ISerialization::Substream;
@ -186,7 +186,7 @@ String getNameForSubstreamPath(
/// Because nested data may be represented not by Array of Tuple,
/// but by separate Array columns with names in a form of a.b,
/// and name is encoded as a whole.
if (it->type == Substream::TupleElement && escape_tuple_delimiter)
if (it->type == Substream::TupleElement && escape_for_file_name)
stream_name += escapeForFileName(substream_name);
else
stream_name += substream_name;
@ -206,7 +206,7 @@ String getNameForSubstreamPath(
else if (it->type == SubstreamType::ObjectSharedData)
stream_name += ".object_shared_data";
else if (it->type == SubstreamType::ObjectTypedPath || it->type == SubstreamType::ObjectDynamicPath)
stream_name += "." + it->object_path_name;
stream_name += "." + (escape_for_file_name ? escapeForFileName(it->object_path_name) : it->object_path_name);
}
return stream_name;
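
The last change in this hunk escapes object path names when the stream name is destined for a file name. The sketch below shows the effect on a path containing a character that is unsafe in file names. `escapeSketch` is a hypothetical percent-style escaper written only for this illustration; it stands in for `escapeForFileName`, whose exact escaping rules are not shown in this diff, and `objectPathStreamName` is likewise an illustrative helper rather than a function from the codebase.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical escaper: keeps [A-Za-z0-9_] and percent-encodes every other byte.
// This is an assumption for illustration, not the real escapeForFileName.
std::string escapeSketch(const std::string & s)
{
    static const char * hex = "0123456789ABCDEF";
    std::string res;
    for (unsigned char c : s)
    {
        if (std::isalnum(c) || c == '_')
        {
            res += static_cast<char>(c);
        }
        else
        {
            res += '%';
            res += hex[c / 16];
            res += hex[c % 16];
        }
    }
    return res;
}

// Mirrors the conditional from the hunk above: escape only when a file name is being built.
std::string objectPathStreamName(
    const std::string & stream_name, const std::string & object_path_name, bool escape_for_file_name)
{
    return stream_name + "." + (escape_for_file_name ? escapeSketch(object_path_name) : object_path_name);
}
```

Under these assumptions, a path such as `a/b` would be appended verbatim when `escape_for_file_name` is `false` and as `a%2Fb` when it is `true`, so the resulting name cannot introduce a directory separator.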
@ -434,6 +434,14 @@ bool ISerialization::isDynamicSubcolumn(const DB::ISerialization::SubstreamPath
return false;
}
bool ISerialization::isLowCardinalityDictionarySubcolumn(const DB::ISerialization::SubstreamPath & path)
{
if (path.empty())
return false;
return path[path.size() - 1].type == SubstreamType::DictionaryKeys;
}
ISerialization::SubstreamData ISerialization::createFromPath(const SubstreamPath & path, size_t prefix_len)
{
assert(prefix_len <= path.size());
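The change to `getNameForSubstreamPath` above escapes object path names only when `escape_for_file_name` is set. The sketch below illustrates that conditional escaping in isolation. `toyEscapeForFileName` is a simplified, assumed stand-in (it percent-encodes every byte that is not an ASCII letter, digit, or underscore); the real `escapeForFileName` may differ in detail. `appendObjectPath` is likewise a hypothetical helper.

```cpp
#include <cctype>
#include <string>

/// Simplified stand-in for escapeForFileName (assumption, not the real implementation):
/// percent-encode every byte that is not an ASCII letter, digit, or underscore.
std::string toyEscapeForFileName(const std::string & s)
{
    static const char * hex = "0123456789ABCDEF";
    std::string res;
    for (unsigned char c : s)
    {
        if (std::isalnum(c) || c == '_')
        {
            res += static_cast<char>(c);
        }
        else
        {
            res += '%';
            res += hex[c / 16];
            res += hex[c % 16];
        }
    }
    return res;
}

/// Appends an object path to a stream name, escaping it only when the
/// resulting name is going to be used as a file name.
std::string appendObjectPath(std::string stream_name, const std::string & object_path_name, bool escape_for_file_name)
{
    stream_name += "." + (escape_for_file_name ? toyEscapeForFileName(object_path_name) : object_path_name);
    return stream_name;
}
```

With escaping enabled, a path containing characters such as `/` no longer leaks into the file name unmodified; with it disabled, the path is appended verbatim.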

@ -463,6 +463,8 @@ public:
/// Returns true if stream with specified path corresponds to dynamic subcolumn.
static bool isDynamicSubcolumn(const SubstreamPath & path, size_t prefix_len);
static bool isLowCardinalityDictionarySubcolumn(const SubstreamPath & path);
protected:
template <typename State, typename StatePtr>
State * checkAndGetState(const StatePtr & state) const;

@ -54,7 +54,7 @@ void SerializationLowCardinality::enumerateStreams(
.withSerializationInfo(data.serialization_info);
settings.path.back().data = dict_data;
dict_inner_serialization->enumerateStreams(settings, callback, dict_data);
callback(settings.path);
settings.path.back() = Substream::DictionaryIndexes;
settings.path.back().data = data;

@ -365,7 +365,7 @@ AsynchronousBoundedReadBuffer::~AsynchronousBoundedReadBuffer()
}
catch (...)
{
tryLogCurrentException(__PRETTY_FUNCTION__);
tryLogCurrentException(log);
}
}

@ -30,6 +30,7 @@ namespace DB
namespace ErrorCodes
{
extern const int AZURE_BLOB_STORAGE_ERROR;
extern const int LOGICAL_ERROR;
}
@ -72,6 +73,7 @@ WriteBufferFromAzureBlobStorage::WriteBufferFromAzureBlobStorage(
std::move(schedule_),
settings_->max_inflight_parts_for_one_file,
limited_log))
, check_objects_after_upload(settings_->check_objects_after_upload)
{
allocateBuffer();
}
@ -178,6 +180,24 @@ void WriteBufferFromAzureBlobStorage::finalizeImpl()
execWithRetry([&](){ block_blob_client.CommitBlockList(block_ids); }, max_unexpected_write_error_retries);
LOG_TRACE(log, "Committed {} blocks for blob `{}`", block_ids.size(), blob_path);
}
if (check_objects_after_upload)
{
try
{
auto blob_client = blob_container_client->GetBlobClient(blob_path);
blob_client.GetProperties();
}
catch (const Azure::Storage::StorageException & e)
{
if (e.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound)
throw Exception(
ErrorCodes::AZURE_BLOB_STORAGE_ERROR,
"Object {} not uploaded to azure blob storage, it's a bug in Azure Blob Storage or its API.",
blob_path);
throw;
}
}
}
void WriteBufferFromAzureBlobStorage::nextImpl()
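The `finalizeImpl` hunk above adds an optional check after upload: once the blocks are committed, the blob is looked up again and a missing object is reported as an error. The sketch below shows that pattern without the Azure SDK. `ToyBlobStore` and `finalizeUpload` are hypothetical names, and the in-memory set only simulates a container.

```cpp
#include <set>
#include <stdexcept>
#include <string>

/// Hypothetical in-memory store standing in for a blob container.
struct ToyBlobStore
{
    std::set<std::string> blobs;
    bool drop_writes = false; /// simulates a store that silently loses an upload

    void commit(const std::string & path)
    {
        if (!drop_writes)
            blobs.insert(path);
    }

    bool exists(const std::string & path) const { return blobs.count(path) != 0; }
};

/// Commits an object and, if requested, verifies that it is visible afterwards.
void finalizeUpload(ToyBlobStore & store, const std::string & path, bool check_objects_after_upload)
{
    store.commit(path);

    if (check_objects_after_upload && !store.exists(path))
        throw std::runtime_error("Object " + path + " not uploaded");
}
```

The check costs one extra request per object, which is presumably why it is controlled by a setting rather than always on.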

@ -90,6 +90,7 @@ private:
size_t hidden_size = 0;
std::unique_ptr<TaskTracker> task_tracker;
bool check_objects_after_upload = false;
std::deque<PartData> detached_part_data;
};

@ -39,6 +39,7 @@ namespace Setting
extern const SettingsUInt64 azure_sdk_max_retries;
extern const SettingsUInt64 azure_sdk_retry_initial_backoff_ms;
extern const SettingsUInt64 azure_sdk_retry_max_backoff_ms;
extern const SettingsBool azure_check_objects_after_upload;
}
namespace ErrorCodes
@ -352,6 +353,7 @@ std::unique_ptr<RequestSettings> getRequestSettings(const Settings & query_setti
settings->sdk_max_retries = query_settings[Setting::azure_sdk_max_retries];
settings->sdk_retry_initial_backoff_ms = query_settings[Setting::azure_sdk_retry_initial_backoff_ms];
settings->sdk_retry_max_backoff_ms = query_settings[Setting::azure_sdk_retry_max_backoff_ms];
settings->check_objects_after_upload = query_settings[Setting::azure_check_objects_after_upload];
return settings;
}
@ -389,6 +391,8 @@ std::unique_ptr<RequestSettings> getRequestSettings(const Poco::Util::AbstractCo
settings->sdk_retry_initial_backoff_ms = config.getUInt64(config_prefix + ".retry_initial_backoff_ms", settings_ref[Setting::azure_sdk_retry_initial_backoff_ms]);
settings->sdk_retry_max_backoff_ms = config.getUInt64(config_prefix + ".retry_max_backoff_ms", settings_ref[Setting::azure_sdk_retry_max_backoff_ms]);
settings->check_objects_after_upload = config.getBool(config_prefix + ".check_objects_after_upload", settings_ref[Setting::azure_check_objects_after_upload]);
if (config.has(config_prefix + ".curl_ip_resolve"))
{
using CurlOptions = Azure::Core::Http::CurlTransportOptions;

@ -50,6 +50,7 @@ struct RequestSettings
size_t sdk_retry_initial_backoff_ms = 10;
size_t sdk_retry_max_backoff_ms = 1000;
bool use_native_copy = false;
bool check_objects_after_upload = false;
using CurlOptions = Azure::Core::Http::CurlTransportOptions;
CurlOptions::CurlOptIPResolve curl_ip_resolve = CurlOptions::CURL_IPRESOLVE_WHATEVER;

@ -18,7 +18,8 @@
#include <Disks/FakeDiskTransaction.h>
#include <Poco/Util/AbstractConfiguration.h>
#include <Interpreters/Context.h>
#include <Common/Scheduler/Workload/IWorkloadEntityStorage.h>
#include <Parsers/ASTCreateResourceQuery.h>
namespace DB
{
@ -71,8 +72,8 @@ DiskObjectStorage::DiskObjectStorage(
, metadata_storage(std::move(metadata_storage_))
, object_storage(std::move(object_storage_))
, send_metadata(config.getBool(config_prefix + ".send_metadata", false))
, read_resource_name(config.getString(config_prefix + ".read_resource", ""))
, write_resource_name(config.getString(config_prefix + ".write_resource", ""))
, read_resource_name_from_config(config.getString(config_prefix + ".read_resource", ""))
, write_resource_name_from_config(config.getString(config_prefix + ".write_resource", ""))
, metadata_helper(std::make_unique<DiskObjectStorageRemoteMetadataRestoreHelper>(this, ReadSettings{}, WriteSettings{}))
{
data_source_description = DataSourceDescription{
@ -83,6 +84,98 @@ DiskObjectStorage::DiskObjectStorage(
.is_encrypted = false,
.is_cached = object_storage->supportsCache(),
};
resource_changes_subscription = Context::getGlobalContextInstance()->getWorkloadEntityStorage().getAllEntitiesAndSubscribe(
[this] (const std::vector<IWorkloadEntityStorage::Event> & events)
{
std::unique_lock lock{resource_mutex};
// Sets of matching resource names. Required to resolve possible conflicts in deterministic way
std::set<String> new_read_resource_name_from_sql;
std::set<String> new_write_resource_name_from_sql;
std::set<String> new_read_resource_name_from_sql_any;
std::set<String> new_write_resource_name_from_sql_any;
// Current state
if (!read_resource_name_from_sql.empty())
new_read_resource_name_from_sql.insert(read_resource_name_from_sql);
if (!write_resource_name_from_sql.empty())
new_write_resource_name_from_sql.insert(write_resource_name_from_sql);
if (!read_resource_name_from_sql_any.empty())
new_read_resource_name_from_sql_any.insert(read_resource_name_from_sql_any);
if (!write_resource_name_from_sql_any.empty())
new_write_resource_name_from_sql_any.insert(write_resource_name_from_sql_any);
// Process all updates in specified order
for (const auto & [entity_type, resource_name, resource] : events)
{
if (entity_type == WorkloadEntityType::Resource)
{
if (resource) // CREATE RESOURCE
{
auto * create = typeid_cast<ASTCreateResourceQuery *>(resource.get());
chassert(create);
for (const auto & [mode, disk] : create->operations)
{
if (!disk)
{
switch (mode)
{
case ASTCreateResourceQuery::AccessMode::Read: new_read_resource_name_from_sql_any.insert(resource_name); break;
case ASTCreateResourceQuery::AccessMode::Write: new_write_resource_name_from_sql_any.insert(resource_name); break;
}
}
else if (*disk == name)
{
switch (mode)
{
case ASTCreateResourceQuery::AccessMode::Read: new_read_resource_name_from_sql.insert(resource_name); break;
case ASTCreateResourceQuery::AccessMode::Write: new_write_resource_name_from_sql.insert(resource_name); break;
}
}
}
}
else // DROP RESOURCE
{
new_read_resource_name_from_sql.erase(resource_name);
new_write_resource_name_from_sql.erase(resource_name);
new_read_resource_name_from_sql_any.erase(resource_name);
new_write_resource_name_from_sql_any.erase(resource_name);
}
}
}
String old_read_resource = getReadResourceNameNoLock();
String old_write_resource = getWriteResourceNameNoLock();
// Apply changes
if (!new_read_resource_name_from_sql_any.empty())
read_resource_name_from_sql_any = *new_read_resource_name_from_sql_any.begin();
else
read_resource_name_from_sql_any.clear();
if (!new_write_resource_name_from_sql_any.empty())
write_resource_name_from_sql_any = *new_write_resource_name_from_sql_any.begin();
else
write_resource_name_from_sql_any.clear();
if (!new_read_resource_name_from_sql.empty())
read_resource_name_from_sql = *new_read_resource_name_from_sql.begin();
else
read_resource_name_from_sql.clear();
if (!new_write_resource_name_from_sql.empty())
write_resource_name_from_sql = *new_write_resource_name_from_sql.begin();
else
write_resource_name_from_sql.clear();
String new_read_resource = getReadResourceNameNoLock();
String new_write_resource = getWriteResourceNameNoLock();
if (old_read_resource != new_read_resource)
LOG_INFO(log, "Using resource '{}' instead of '{}' for READ", new_read_resource, old_read_resource);
if (old_write_resource != new_write_resource)
LOG_INFO(log, "Using resource '{}' instead of '{}' for WRITE", new_write_resource, old_write_resource);
});
}
StoredObjects DiskObjectStorage::getStorageObjects(const String & local_path) const
@ -480,13 +573,29 @@ static inline Settings updateIOSchedulingSettings(const Settings & settings, con
String DiskObjectStorage::getReadResourceName() const
{
std::unique_lock lock(resource_mutex);
return read_resource_name;
return getReadResourceNameNoLock();
}
String DiskObjectStorage::getWriteResourceName() const
{
std::unique_lock lock(resource_mutex);
return write_resource_name;
return getWriteResourceNameNoLock();
}
String DiskObjectStorage::getReadResourceNameNoLock() const
{
if (read_resource_name_from_config.empty())
return read_resource_name_from_sql.empty() ? read_resource_name_from_sql_any : read_resource_name_from_sql;
else
return read_resource_name_from_config;
}
String DiskObjectStorage::getWriteResourceNameNoLock() const
{
if (write_resource_name_from_config.empty())
return write_resource_name_from_sql.empty() ? write_resource_name_from_sql_any : write_resource_name_from_sql;
else
return write_resource_name_from_config;
}
std::unique_ptr<ReadBufferFromFileBase> DiskObjectStorage::readFile(
@ -607,10 +716,10 @@ void DiskObjectStorage::applyNewSettings(
{
std::unique_lock lock(resource_mutex);
if (String new_read_resource_name = config.getString(config_prefix + ".read_resource", ""); new_read_resource_name != read_resource_name)
read_resource_name = new_read_resource_name;
if (String new_write_resource_name = config.getString(config_prefix + ".write_resource", ""); new_write_resource_name != write_resource_name)
write_resource_name = new_write_resource_name;
if (String new_read_resource_name = config.getString(config_prefix + ".read_resource", ""); new_read_resource_name != read_resource_name_from_config)
read_resource_name_from_config = new_read_resource_name;
if (String new_write_resource_name = config.getString(config_prefix + ".write_resource", ""); new_write_resource_name != write_resource_name_from_config)
write_resource_name_from_config = new_write_resource_name;
}
IDisk::applyNewSettings(config, context_, config_prefix, disk_map);
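The hunks above give a disk's resource name three possible sources and resolve them in a fixed order: the name from the disk config wins, then a name from a `CREATE RESOURCE` query that targets this disk, then one from a query that targets any disk. When several SQL resources match, the subscription callback collects them in a `std::set` and takes `*begin()`, so the choice is deterministic. The sketch below restates both rules in isolation. `ToyResourceNames` and `pickDeterministic` are hypothetical names used only for illustration.

```cpp
#include <set>
#include <string>

/// Hypothetical holder for the three sources of a resource name.
struct ToyResourceNames
{
    std::string from_config;  /// read_resource / write_resource element of the disk config
    std::string from_sql;     /// CREATE RESOURCE naming this particular disk
    std::string from_sql_any; /// CREATE RESOURCE matching any disk

    /// Config takes precedence, then the disk-specific SQL name, then the "any disk" SQL name.
    std::string resolve() const
    {
        if (!from_config.empty())
            return from_config;
        return from_sql.empty() ? from_sql_any : from_sql;
    }
};

/// Picks one name out of several candidates deterministically:
/// std::set keeps its elements ordered, so begin() is the smallest name.
std::string pickDeterministic(const std::set<std::string> & candidates)
{
    return candidates.empty() ? std::string{} : *candidates.begin();
}
```

An empty result from `resolve()` means no resource is associated with the disk for that access mode.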

@ -6,6 +6,8 @@
#include <Disks/ObjectStorages/IMetadataStorage.h>
#include <Common/re2.h>
#include <base/scope_guard.h>
#include "config.h"
@ -228,6 +230,8 @@ private:
String getReadResourceName() const;
String getWriteResourceName() const;
String getReadResourceNameNoLock() const;
String getWriteResourceNameNoLock() const;
const String object_key_prefix;
LoggerPtr log;
@ -246,8 +250,13 @@ private:
const bool send_metadata;
mutable std::mutex resource_mutex;
String read_resource_name;
String write_resource_name;
String read_resource_name_from_config; // specified in disk config.xml read_resource element
String write_resource_name_from_config; // specified in disk config.xml write_resource element
String read_resource_name_from_sql; // described by CREATE RESOURCE query with READ DISK clause
String write_resource_name_from_sql; // described by CREATE RESOURCE query with WRITE DISK clause
String read_resource_name_from_sql_any; // described by CREATE RESOURCE query with READ ANY DISK clause
String write_resource_name_from_sql_any; // described by CREATE RESOURCE query with WRITE ANY DISK clause
scope_guard resource_changes_subscription;
std::unique_ptr<DiskObjectStorageRemoteMetadataRestoreHelper> metadata_helper;
};

Some files were not shown because too many files have changed in this diff