Mirror of https://github.com/ClickHouse/ClickHouse.git, synced 2024-11-21 23:21:59 +00:00

Merge branch 'master' into tostring-fix

Commit b61ac84527
.gitignore (vendored)

@@ -159,6 +159,7 @@ website/package-lock.json
/programs/server/store
/programs/server/uuid
/programs/server/coordination
/programs/server/workload

# temporary test files
tests/queries/0_stateless/test_*
CHANGELOG.md

@@ -1,4 +1,5 @@
### Table of Contents
**[ClickHouse release v24.10, 2024-10-31](#2410)**<br/>
**[ClickHouse release v24.9, 2024-09-26](#249)**<br/>
**[ClickHouse release v24.8 LTS, 2024-08-20](#248)**<br/>
**[ClickHouse release v24.7, 2024-07-30](#247)**<br/>

@@ -12,6 +13,165 @@

# 2024 Changelog

### <a id="2410"></a> ClickHouse release 24.10, 2024-10-31
#### Backward Incompatible Change
* Allow writing `SETTINGS` before `FORMAT` in a chain of queries with `UNION` when subqueries are inside parentheses. This closes [#39712](https://github.com/ClickHouse/ClickHouse/issues/39712). Change the behavior when a query has the `SETTINGS` clause specified twice in a sequence: the closest `SETTINGS` clause now takes preference for the corresponding subquery. In previous versions, the outermost `SETTINGS` clause could take preference over the inner one. [#68614](https://github.com/ClickHouse/ClickHouse/pull/68614) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Reordering of filter conditions from the `[PRE]WHERE` clause is now allowed by default. It can be disabled by setting `allow_reorder_prewhere_conditions` to `false`. [#70657](https://github.com/ClickHouse/ClickHouse/pull/70657) ([Nikita Taranov](https://github.com/nickitat)).
* Remove the `idxd-config` library, which has an incompatible license. This also removes the experimental Intel DeflateQPL codec. [#70987](https://github.com/ClickHouse/ClickHouse/pull/70987) ([Alexey Milovidov](https://github.com/alexey-milovidov)).

#### New Feature
* Allow granting access to wildcard prefixes: `GRANT SELECT ON db.table_prefix_* TO user`. [#65311](https://github.com/ClickHouse/ClickHouse/pull/65311) ([pufit](https://github.com/pufit)).
* If you press the space bar during query runtime, the client will display a real-time table with detailed metrics. You can enable it globally with the new `--progress-table` option in clickhouse-client; a new `--enable-progress-table-toggle` option is associated with `--progress-table` and toggles the rendering of the progress table by pressing the control key (Space). [#63689](https://github.com/ClickHouse/ClickHouse/pull/63689) ([Maria Khristenko](https://github.com/mariaKhr)), [#70423](https://github.com/ClickHouse/ClickHouse/pull/70423) ([Julia Kartseva](https://github.com/jkartseva)).
* Allow caching read files for object storage table engines and data lakes, using a hash of the ETag and file path as the cache key. [#70135](https://github.com/ClickHouse/ClickHouse/pull/70135) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Support creating a table with the query `CREATE TABLE ... CLONE AS ...`. It clones the source table's schema and then attaches all partitions to the newly created table. This feature is only supported for tables of the `MergeTree` family. Closes [#65015](https://github.com/ClickHouse/ClickHouse/issues/65015). [#69091](https://github.com/ClickHouse/ClickHouse/pull/69091) ([tuanpach](https://github.com/tuanpach)).
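  A minimal sketch of the new clause; the table names here are hypothetical:

  ```sql
  -- events_v2 gets the schema of events, and all partitions of events are attached to it
  CREATE TABLE events_v2 CLONE AS events;
  ```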
* Add a new system table, `system.query_metric_log`, which contains a history of memory and metric values from the `system.events` table for individual queries, periodically flushed to disk. [#66532](https://github.com/ClickHouse/ClickHouse/pull/66532) ([Pablo Marcos](https://github.com/pamarcos)).
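  A sketch of inspecting the new table; the `query_id` value and the `memory_usage` column name are assumptions for illustration:

  ```sql
  SELECT event_time, memory_usage
  FROM system.query_metric_log
  WHERE query_id = 'abc-123'  -- hypothetical query_id
  ORDER BY event_time
  LIMIT 5;
  ```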
* A simple SELECT query can be written with implicit SELECT to enable calculator-style expressions, e.g., `ch "1 + 2"`. This is controlled by a new setting, `implicit_select`. [#68502](https://github.com/ClickHouse/ClickHouse/pull/68502) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Support the `--copy` mode for `clickhouse-local` as a shortcut for format conversion. [#68503](https://github.com/ClickHouse/ClickHouse/issues/68503). [#68583](https://github.com/ClickHouse/ClickHouse/pull/68583) ([Denis Hananein](https://github.com/denis-hananein)).
* Add a built-in HTML page for visualizing merges, available at the `/merges` path. [#70821](https://github.com/ClickHouse/ClickHouse/pull/70821) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add support for the `arrayUnion` function. [#68989](https://github.com/ClickHouse/ClickHouse/pull/68989) ([Peter Nguyen](https://github.com/petern48)).
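  A sketch of the new function; like other set-like array functions, the order of the result elements should not be relied upon:

  ```sql
  SELECT arrayUnion([1, 2], [2, 3]);  -- [1, 2, 3], up to ordering
  ```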
* Allow parametrised SQL aliases. [#50665](https://github.com/ClickHouse/ClickHouse/pull/50665) ([Anton Kozlov](https://github.com/tonickkozlov)).
* A new aggregate function, `quantileExactWeightedInterpolated`, which is an interpolated version of `quantileExactWeighted`. Some people may wonder why we need a new `quantileExactWeightedInterpolated` when we already have `quantileExactInterpolatedWeighted`. The reason is that the new one is more accurate than the old one. This is for Spark compatibility. [#69619](https://github.com/ClickHouse/ClickHouse/pull/69619) ([李扬](https://github.com/taiyang-li)).
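  A sketch of the call syntax, with a hypothetical table and value/weight columns:

  ```sql
  SELECT quantileExactWeightedInterpolated(0.5)(value, weight) FROM t;
  ```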
* A new function, `arrayElementOrNull`. It returns `NULL` if the array index is out of range or a `Map` key is not found. [#69646](https://github.com/ClickHouse/ClickHouse/pull/69646) ([李扬](https://github.com/taiyang-li)).
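  For example:

  ```sql
  SELECT arrayElementOrNull([1, 2, 3], 5);      -- NULL instead of the type's default value
  SELECT arrayElementOrNull(map('a', 1), 'b');  -- NULL for a missing Map key
  ```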
* Allow users to specify regular expressions through the new `message_regexp` and `message_regexp_negative` fields in the `config.xml` file to filter out logging. The filtering is applied to the formatted, un-colored text for the most intuitive developer experience. [#69657](https://github.com/ClickHouse/ClickHouse/pull/69657) ([Peter Nguyen](https://github.com/petern48)).
* Added the `RIPEMD160` function, which computes the RIPEMD-160 cryptographic hash of a string. Example: `SELECT HEX(RIPEMD160('The quick brown fox jumps over the lazy dog'))` returns `37F332F68DB77BD9D7EDD4969571AD671CF9DD3B`. [#70087](https://github.com/ClickHouse/ClickHouse/pull/70087) ([Dergousov Maxim](https://github.com/m7kss1)).
* Support reading `Iceberg` tables on `HDFS`. [#70268](https://github.com/ClickHouse/ClickHouse/pull/70268) ([flynn](https://github.com/ucasfl)).
* Support for CTEs in the form of `WITH ... INSERT`; previously we only supported `INSERT ... WITH ...`. [#70593](https://github.com/ClickHouse/ClickHouse/pull/70593) ([Shichao Jin](https://github.com/jsc0218)).
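  A sketch of the newly supported form; the table `t` is hypothetical:

  ```sql
  WITH cte AS (SELECT 1 AS x)
  INSERT INTO t
  SELECT x FROM cte;
  ```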
* MongoDB integration: support for all MongoDB types, support for `WHERE` and `ORDER BY` statements on the MongoDB side, and restriction of expressions unsupported by MongoDB. Note that the new integration is disabled by default; to use it, set `<use_legacy_mongodb_integration>` to `false` in the server config. [#63279](https://github.com/ClickHouse/ClickHouse/pull/63279) ([Kirill Nikiforov](https://github.com/allmazz)).
* A new function, `getSettingOrDefault`, added to return a default value and avoid an exception if a custom setting is not found in the current profile. [#69917](https://github.com/ClickHouse/ClickHouse/pull/69917) ([Shankar](https://github.com/shiyer7474)).
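  For example, assuming the `custom_` prefix is allowed by `custom_settings_prefixes`:

  ```sql
  SELECT getSettingOrDefault('custom_retention_days', 30);  -- 30 if the setting is not defined in the current profile
  ```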

#### Experimental Feature
* Refreshable materialized views are production ready. [#70550](https://github.com/ClickHouse/ClickHouse/pull/70550) ([Michael Kolupaev](https://github.com/al13n321)). Refreshable materialized views are now supported in Replicated databases. [#60669](https://github.com/ClickHouse/ClickHouse/pull/60669) ([Michael Kolupaev](https://github.com/al13n321)).
* Parallel replicas are moved from experimental to beta. Reworked the settings that control the behavior of the parallel replicas algorithms. A quick recap: ClickHouse has four different algorithms for parallel reading involving multiple replicas, reflected in the setting `parallel_replicas_mode`; its default value is `read_tasks`. Additionally, the toggle-switch setting `enable_parallel_replicas` has been added. [#63151](https://github.com/ClickHouse/ClickHouse/pull/63151) ([Alexey Milovidov](https://github.com/alexey-milovidov)), ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)).
* Support for the `Dynamic` type in most functions by executing them on the internal types inside `Dynamic`. [#69691](https://github.com/ClickHouse/ClickHouse/pull/69691) ([Pavel Kruglov](https://github.com/Avogar)).
* Allow reading/writing the `JSON` type as a binary string in the `RowBinary` format under the settings `input_format_binary_read_json_as_string`/`output_format_binary_write_json_as_string`. [#70288](https://github.com/ClickHouse/ClickHouse/pull/70288) ([Pavel Kruglov](https://github.com/Avogar)).
* Allow serializing/deserializing a `JSON` column as a single `String` column in the Native format. For output, use the setting `output_format_native_write_json_as_string`. For input, use serialization version `1` before the column data. [#70312](https://github.com/ClickHouse/ClickHouse/pull/70312) ([Pavel Kruglov](https://github.com/Avogar)).
* Introduced a special (experimental) mode of the merge selector for MergeTree tables which makes it more aggressive for partitions that are close to the limit on the number of parts. It is controlled by the `merge_selector_use_blurry_base` MergeTree-level setting. [#70645](https://github.com/ClickHouse/ClickHouse/pull/70645) ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)).
* Implement generic serialization/deserialization between Avro's `Union` and ClickHouse's `Variant` types. Resolves [#69713](https://github.com/ClickHouse/ClickHouse/issues/69713). [#69712](https://github.com/ClickHouse/ClickHouse/pull/69712) ([Jiří Kozlovský](https://github.com/jirislav)).
#### Performance Improvement
* Refactor `IDisk` and `IObjectStorage` for better performance. Tables from `plain` and `plain_rewritable` object storages will initialize faster. [#68146](https://github.com/ClickHouse/ClickHouse/pull/68146) ([Alexey Milovidov](https://github.com/alexey-milovidov), [Julia Kartseva](https://github.com/jkartseva)). Do not call the LIST object storage API when determining if a file or directory exists on the plain rewritable disk, as it can be cost-inefficient. [#70852](https://github.com/ClickHouse/ClickHouse/pull/70852) ([Julia Kartseva](https://github.com/jkartseva)). Reduce the number of object storage HEAD API requests in the plain_rewritable disk. [#70915](https://github.com/ClickHouse/ClickHouse/pull/70915) ([Julia Kartseva](https://github.com/jkartseva)).
* Added the ability to parse data directly into sparse columns. [#69828](https://github.com/ClickHouse/ClickHouse/pull/69828) ([Anton Popov](https://github.com/CurtizJ)).
* Improved performance of parsing formats with a high number of missed values (e.g. `JSONEachRow`). [#69875](https://github.com/ClickHouse/ClickHouse/pull/69875) ([Anton Popov](https://github.com/CurtizJ)).
* Support parallel reading of Parquet row groups and prefetching of row groups in single-threaded mode. [#69862](https://github.com/ClickHouse/ClickHouse/pull/69862) ([LiuNeng](https://github.com/liuneng1994)).
* Support the minmax index for `pointInPolygon`. [#62085](https://github.com/ClickHouse/ClickHouse/pull/62085) ([JackyWoo](https://github.com/JackyWoo)).
* Use bloom filters when reading Parquet files. [#62966](https://github.com/ClickHouse/ClickHouse/pull/62966) ([Arthur Passos](https://github.com/arthurpassos)).
* Lock-free parts rename, to avoid INSERT affecting SELECT through the parts lock. Under normal circumstances with `fsync_part_directory`, the QPS of SELECT with a parallel INSERT increased 2x; under heavy load the effect is even bigger. Note that this only includes `ReplicatedMergeTree` for now. [#64955](https://github.com/ClickHouse/ClickHouse/pull/64955) ([Azat Khuzhin](https://github.com/azat)).
* Respect `ttl_only_drop_parts` on `MATERIALIZE TTL`; only read the columns necessary to recalculate TTL, and drop parts by replacing them with an empty one. [#65488](https://github.com/ClickHouse/ClickHouse/pull/65488) ([Andrey Zvonov](https://github.com/zvonand)).
* Optimized thread creation in the ThreadPool to minimize lock contention. Thread creation is now performed outside of the critical section to avoid delays in job scheduling and thread management under high load. This makes ClickHouse much more responsive under heavy concurrent load. [#68694](https://github.com/ClickHouse/ClickHouse/pull/68694) ([filimonov](https://github.com/filimonov)).
* Enable reading `LowCardinality` string columns from `ORC`. [#69481](https://github.com/ClickHouse/ClickHouse/pull/69481) ([李扬](https://github.com/taiyang-li)).
* Use `LowCardinality` for `ProfileEvents` in system logs such as `part_log`, `query_views_log`, and `filesystem_cache_log`. [#70152](https://github.com/ClickHouse/ClickHouse/pull/70152) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Improve performance of the `fromUnixTimestamp`/`toUnixTimestamp` functions. [#71042](https://github.com/ClickHouse/ClickHouse/pull/71042) ([kevinyhzou](https://github.com/KevinyhZou)).
* Don't disable nonblocking reads from the page cache for the entire server when reading from blocking I/O. This led to poorer performance when a single filesystem (e.g., tmpfs) didn't support the `preadv2` syscall while others did. [#70299](https://github.com/ClickHouse/ClickHouse/pull/70299) ([Antonio Andelic](https://github.com/antonio2368)).
* `ALTER TABLE .. REPLACE PARTITION` no longer waits for mutations/merges that happen in other partitions. [#59138](https://github.com/ClickHouse/ClickHouse/pull/59138) ([Vasily Nemkov](https://github.com/Enmk)).
* Don't do validation when synchronizing ACLs from Keeper; they are validated during creation. It shouldn't matter much, but there are installations with tens of thousands of users or more, and the unnecessary hash validation can take a long time during server startup (which synchronizes everything from Keeper). [#70644](https://github.com/ClickHouse/ClickHouse/pull/70644) ([Raúl Marín](https://github.com/Algunenano)).

#### Improvement
* `CREATE TABLE AS` will copy `PRIMARY KEY`, `ORDER BY`, and similar clauses (of `MergeTree` tables). [#69739](https://github.com/ClickHouse/ClickHouse/pull/69739) ([sakulali](https://github.com/sakulali)).
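  A sketch with hypothetical table names; the copy now inherits the source's key clauses:

  ```sql
  CREATE TABLE t_copy AS t_src;  -- t_copy keeps the ORDER BY / PRIMARY KEY of t_src
  ```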
* Support 64-bit XIDs in Keeper. It can be enabled with the `use_xid_64` configuration value. [#69908](https://github.com/ClickHouse/ClickHouse/pull/69908) ([Antonio Andelic](https://github.com/antonio2368)).
* Command-line arguments for Bool settings are set to true when no value is provided for the argument (e.g. `clickhouse-client --optimize_aggregation_in_order --query "SELECT 1"`). [#70459](https://github.com/ClickHouse/ClickHouse/pull/70459) ([davidtsuk](https://github.com/davidtsuk)).
* Added user-level settings `min_free_disk_bytes_to_throw_insert` and `min_free_disk_ratio_to_throw_insert` to prevent insertions on disks that are almost full. [#69755](https://github.com/ClickHouse/ClickHouse/pull/69755) ([Marco Vilas Boas](https://github.com/marco-vb)).
* Embedded documentation for settings will be strictly more detailed and complete than the documentation on the website. This is the first step before making the website documentation always auto-generated from the source code. This has long-standing implications: it is guaranteed to cover every setting; there is no chance of default values becoming obsolete; we can generate this documentation for each ClickHouse version; and the documentation can be displayed by the server itself even without Internet access. Generate the docs on the website from the source code. [#70289](https://github.com/ClickHouse/ClickHouse/pull/70289) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Allow an empty needle in the function `replace`, matching the behavior of PostgreSQL. [#69918](https://github.com/ClickHouse/ClickHouse/pull/69918) ([zhanglistar](https://github.com/zhanglistar)).
* Allow an empty needle in the functions `replaceRegexp*`. [#70053](https://github.com/ClickHouse/ClickHouse/pull/70053) ([zhanglistar](https://github.com/zhanglistar)).
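  For example, an empty needle is now accepted instead of throwing an error; as in PostgreSQL, `replace` leaves the haystack unchanged:

  ```sql
  SELECT replace('ClickHouse', '', '_');  -- 'ClickHouse'
  ```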
* Symbolic links for tables in the `data/database_name/` directory are created to the actual paths of the table's data, depending on the storage policy, instead of the `store/...` directory on the default disk. [#61777](https://github.com/ClickHouse/ClickHouse/pull/61777) ([Kirill](https://github.com/kirillgarbar)).
* While parsing an `Enum` field from `JSON`, a string containing an integer will be interpreted as the corresponding `Enum` element. This closes [#65119](https://github.com/ClickHouse/ClickHouse/issues/65119). [#66801](https://github.com/ClickHouse/ClickHouse/pull/66801) ([scanhex12](https://github.com/scanhex12)).
* Allow `TRIM`-ing a `LEADING` or `TRAILING` empty string as a no-op. Closes [#67792](https://github.com/ClickHouse/ClickHouse/issues/67792). [#68455](https://github.com/ClickHouse/ClickHouse/pull/68455) ([Peter Nguyen](https://github.com/petern48)).
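  For example:

  ```sql
  SELECT trim(LEADING '' FROM ' abc ');  -- ' abc ' is returned unchanged (no-op)
  ```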
* Improve compatibility of `cast(timestamp as String)` with Spark. [#69179](https://github.com/ClickHouse/ClickHouse/pull/69179) ([Wenzheng Liu](https://github.com/lwz9103)).
* Always use the new analyzer to calculate constant expressions when `enable_analyzer` is set to `true`. Support calculation of `executable` table function arguments without using a `SELECT` query for constant expressions. [#69292](https://github.com/ClickHouse/ClickHouse/pull/69292) ([Dmitry Novik](https://github.com/novikd)).
* Add a setting `enable_secure_identifiers` to disallow identifiers with special characters. [#69411](https://github.com/ClickHouse/ClickHouse/pull/69411) ([tuanpach](https://github.com/tuanpach)).
* Add `show_create_query_identifier_quoting_rule` to define identifier quoting behavior in the `SHOW CREATE TABLE` query result. Possible values: `user_display`: when the identifier is a keyword; `when_necessary`: when the identifier is one of `{"distinct", "all", "table"}` or when it could lead to ambiguity (column names, dictionary attribute names); `always`: always quote identifiers. [#69448](https://github.com/ClickHouse/ClickHouse/pull/69448) ([tuanpach](https://github.com/tuanpach)).
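  A sketch of switching the rule in a session; the table `t` is hypothetical:

  ```sql
  SET show_create_query_identifier_quoting_rule = 'always';
  SHOW CREATE TABLE t;  -- all identifiers in the result are quoted
  ```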
* Improve restoring of access entities' dependencies. [#69563](https://github.com/ClickHouse/ClickHouse/pull/69563) ([Vitaly Baranov](https://github.com/vitlibar)).
* If you run `clickhouse-client` or another CLI application, and it starts up slowly due to an overloaded server, and you start typing your query, such as `SELECT`, previous versions would display the remainder of the terminal echo contents before printing the greeting message, such as `SELECTClickHouse local version 24.10.1.1.` instead of `ClickHouse local version 24.10.1.1.`. Now it is fixed. This closes [#31696](https://github.com/ClickHouse/ClickHouse/issues/31696). [#69856](https://github.com/ClickHouse/ClickHouse/pull/69856) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add a new column, `readonly_duration`, to the `system.replicas` table. Needed to be able to distinguish actual readonly replicas from sentinel ones in alerts. [#69871](https://github.com/ClickHouse/ClickHouse/pull/69871) ([Miсhael Stetsyuk](https://github.com/mstetsyuk)).
* Change the type of the `join_output_by_rowlist_perkey_rows_threshold` setting to unsigned integer. [#69886](https://github.com/ClickHouse/ClickHouse/pull/69886) ([kevinyhzou](https://github.com/KevinyhZou)).
* Enhance OpenTelemetry span logging to include query settings. [#70011](https://github.com/ClickHouse/ClickHouse/pull/70011) ([sharathks118](https://github.com/sharathks118)).
* Add diagnostic info about higher-order array functions if the lambda result type is unexpected. [#70093](https://github.com/ClickHouse/ClickHouse/pull/70093) ([ttanay](https://github.com/ttanay)).
* Keeper improvement: less locking during cluster changes. [#70275](https://github.com/ClickHouse/ClickHouse/pull/70275) ([Antonio Andelic](https://github.com/antonio2368)).
* Add `WITH IMPLICIT` and `FINAL` keywords to the `SHOW GRANTS` command. Fix a minor bug with implicit grants: [#70094](https://github.com/ClickHouse/ClickHouse/issues/70094). [#70293](https://github.com/ClickHouse/ClickHouse/pull/70293) ([pufit](https://github.com/pufit)).
* Respect `compatibility` for MergeTree settings. The `compatibility` value is taken from the `default` profile on server startup, and default MergeTree settings are changed accordingly. Further changes of the `compatibility` setting do not affect MergeTree settings. [#70322](https://github.com/ClickHouse/ClickHouse/pull/70322) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Avoid spamming the logs with large HTTP response bodies in case of errors during inter-server communication. [#70487](https://github.com/ClickHouse/ClickHouse/pull/70487) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Added a new setting, `max_parts_to_move`, to control the maximum number of parts that can be moved at once. [#70520](https://github.com/ClickHouse/ClickHouse/pull/70520) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Limit the frequency of certain log messages. [#70601](https://github.com/ClickHouse/ClickHouse/pull/70601) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* `CHECK TABLE` with the `PART` qualifier was incorrectly formatted in the client. [#70660](https://github.com/ClickHouse/ClickHouse/pull/70660) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Support writing the column index and the offset index using the Parquet native writer. [#70669](https://github.com/ClickHouse/ClickHouse/pull/70669) ([LiuNeng](https://github.com/liuneng1994)).
* Support parsing `DateTime64` for microseconds and timezones in Joda syntax ("Joda" is a popular Java library for date and time, and the "Joda syntax" is that library's style). [#70737](https://github.com/ClickHouse/ClickHouse/pull/70737) ([kevinyhzou](https://github.com/KevinyhZou)).
* Changed the approach for figuring out whether a cloud storage supports [batch delete](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html) or not. [#70786](https://github.com/ClickHouse/ClickHouse/pull/70786) ([Vitaly Baranov](https://github.com/vitlibar)).
* Support for Parquet page v2 in the native reader. [#70807](https://github.com/ClickHouse/ClickHouse/pull/70807) ([Arthur Passos](https://github.com/arthurpassos)).
* Added a check that a table does not have both the `storage_policy` and `disk` settings set, and a check that a new storage policy is compatible with the old one when the `disk` setting is used. [#70839](https://github.com/ClickHouse/ClickHouse/pull/70839) ([Kirill](https://github.com/kirillgarbar)).
* Add `system.s3_queue_settings` and `system.azure_queue_settings`. [#70841](https://github.com/ClickHouse/ClickHouse/pull/70841) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Functions `base58Encode` and `base58Decode` now accept arguments of type `FixedString`. Example: `SELECT base58Encode(toFixedString('plaintext', 9));`. [#70846](https://github.com/ClickHouse/ClickHouse/pull/70846) ([Faizan Patel](https://github.com/faizan2786)).
* Add the `partition` column to every entry type of the part log. Previously, it was set only for some entries. This closes [#70819](https://github.com/ClickHouse/ClickHouse/issues/70819). [#70848](https://github.com/ClickHouse/ClickHouse/pull/70848) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add `MergeStart` and `MutateStart` events to `system.part_log`, which helps with merge analysis and visualization. [#70850](https://github.com/ClickHouse/ClickHouse/pull/70850) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Add a profile event for the number of merged source parts. It allows monitoring the fanout of the merge tree in production. [#70908](https://github.com/ClickHouse/ClickHouse/pull/70908) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Background downloads to the filesystem cache have been re-enabled. [#70929](https://github.com/ClickHouse/ClickHouse/pull/70929) ([Nikita Taranov](https://github.com/nickitat)).
* Add a new merge selector algorithm, named `Trivial`, for professional usage only. It is worse than the `Simple` merge selector. [#70969](https://github.com/ClickHouse/ClickHouse/pull/70969) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Support for atomic `CREATE OR REPLACE VIEW`. [#70536](https://github.com/ClickHouse/ClickHouse/pull/70536) ([tuanpach](https://github.com/tuanpach)).
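  A sketch of the now-atomic replacement; concurrent readers of `v` never observe the view missing between the drop and the re-create:

  ```sql
  CREATE OR REPLACE VIEW v AS SELECT number AS n FROM system.numbers LIMIT 10;
  ```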
* Added a `strict_once` mode to the aggregate function `windowFunnel` to avoid counting one event several times when it matches multiple conditions. Closes [#21835](https://github.com/ClickHouse/ClickHouse/issues/21835). [#69738](https://github.com/ClickHouse/ClickHouse/pull/69738) ([Vladimir Cherkasov](https://github.com/vdimir)).
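  A sketch of the new mode; the table and columns are hypothetical:

  ```sql
  SELECT user_id,
         windowFunnel(3600, 'strict_once')(ts, event = 'view', event = 'cart', event = 'buy') AS level
  FROM events
  GROUP BY user_id;
  ```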

#### Bug Fix (user-visible misbehavior in an official stable release)
* Apply configuration updates in the global context object. This fixes issues like [#62308](https://github.com/ClickHouse/ClickHouse/issues/62308). [#62944](https://github.com/ClickHouse/ClickHouse/pull/62944) ([Amos Bird](https://github.com/amosbird)).
* Fix `ReadSettings` not using user-set values, because only the defaults were used. [#65625](https://github.com/ClickHouse/ClickHouse/pull/65625) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Fix a type mismatch issue in `sumMapFiltered` when using signed arguments. [#58408](https://github.com/ClickHouse/ClickHouse/pull/58408) ([Chen768959](https://github.com/Chen768959)).
* Fix the monotonicity of `toHour`-like conversion functions when the optional time zone argument is passed. [#60264](https://github.com/ClickHouse/ClickHouse/pull/60264) ([Amos Bird](https://github.com/amosbird)).
* Relax the `supportsPrewhere` check for `Merge` tables. This fixes [#61064](https://github.com/ClickHouse/ClickHouse/issues/61064). It was hardened unnecessarily in [#60082](https://github.com/ClickHouse/ClickHouse/issues/60082). [#61091](https://github.com/ClickHouse/ClickHouse/pull/61091) ([Amos Bird](https://github.com/amosbird)).
* Fix handling of the `use_concurrency_control` setting so the `concurrent_threads_soft_limit_num` limit is properly enforced. This enables concurrency control by default, because previously it was broken. [#61473](https://github.com/ClickHouse/ClickHouse/pull/61473) ([Sergei Trifonov](https://github.com/serxa)).
* Fix an incorrect `JOIN ON` section optimization in the case of an `IS NULL` check under any other function (like `NOT`), which could lead to wrong results. Closes [#67915](https://github.com/ClickHouse/ClickHouse/issues/67915). [#68049](https://github.com/ClickHouse/ClickHouse/pull/68049) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Prevent `ALTER` queries that would make the `CREATE` query of tables invalid. [#68574](https://github.com/ClickHouse/ClickHouse/pull/68574) ([János Benjamin Antal](https://github.com/antaljanosbenjamin)).
* Fix inconsistent AST formatting for `negate` (`-`) and `NOT` functions with tuples and arrays. [#68600](https://github.com/ClickHouse/ClickHouse/pull/68600) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Fix insertion of an incomplete type into `Dynamic` during deserialization. It could lead to `Parameter out of bound` errors. [#69291](https://github.com/ClickHouse/ClickHouse/pull/69291) ([Pavel Kruglov](https://github.com/Avogar)).
* Zero-copy replication, which is experimental and should not be used in production: fix an infinite loop after `RESTORE REPLICA` in the replicated merge tree with zero-copy replication. [#69293](https://github.com/ClickHouse/ClickHouse/pull/69293) ([MikhailBurdukov](https://github.com/MikhailBurdukov)).
* Restore the default value of `processing_threads_num` (the number of CPU cores) in the `S3Queue` storage. [#69384](https://github.com/ClickHouse/ClickHouse/pull/69384) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Bypass the try/catch flow when de/serializing nested repeated protobuf messages into nested columns (fixes [#41971](https://github.com/ClickHouse/ClickHouse/issues/41971)). [#69556](https://github.com/ClickHouse/ClickHouse/pull/69556) ([Eliot Hautefeuille](https://github.com/hileef)).
* Fix a crash during insertion into a FixedString column in the PostgreSQL engine. [#69584](https://github.com/ClickHouse/ClickHouse/pull/69584) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix a crash when executing `create view t as (with recursive 42 as ttt select ttt);`. [#69676](https://github.com/ClickHouse/ClickHouse/pull/69676) ([Han Fei](https://github.com/hanfei1991)).
* Fixed `maxMapState` throwing 'Bad get' if the value type is DateTime64. [#69787](https://github.com/ClickHouse/ClickHouse/pull/69787) ([Michael Kolupaev](https://github.com/al13n321)).
* Fix `getSubcolumn` with `LowCardinality` columns by overriding `useDefaultImplementationForLowCardinalityColumns` to return `true`. [#69831](https://github.com/ClickHouse/ClickHouse/pull/69831) ([Miсhael Stetsyuk](https://github.com/mstetsyuk)).
* Fix permanently blocked distributed sends if a DROP of a distributed table failed. [#69843](https://github.com/ClickHouse/ClickHouse/pull/69843) ([Azat Khuzhin](https://github.com/azat)).
* Fix non-cancellable queries containing WITH FILL with NaN keys. This closes [#69261](https://github.com/ClickHouse/ClickHouse/issues/69261). [#69845](https://github.com/ClickHouse/ClickHouse/pull/69845) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix the analyzer default with an old compatibility value. [#69895](https://github.com/ClickHouse/ClickHouse/pull/69895) ([Raúl Marín](https://github.com/Algunenano)).
* Don't check dependencies when dropping the old table during `CREATE OR REPLACE VIEW`. Previously, a `CREATE OR REPLACE` query failed when there were tables dependent on the recreated view. [#69907](https://github.com/ClickHouse/ClickHouse/pull/69907) ([Pavel Kruglov](https://github.com/Avogar)).
* A fix for `Decimal` handling. Fixes [#69730](https://github.com/ClickHouse/ClickHouse/issues/69730). [#69978](https://github.com/ClickHouse/ClickHouse/pull/69978) ([Arthur Passos](https://github.com/arthurpassos)).
* Now DEFINER/INVOKER will work with parameterized views. [#69984](https://github.com/ClickHouse/ClickHouse/pull/69984) ([pufit](https://github.com/pufit)).
* Fix parsing of views' definers. [#69985](https://github.com/ClickHouse/ClickHouse/pull/69985) ([pufit](https://github.com/pufit)).
* Fixed a bug where the timezone could change the result of a query with a `Date` or `Date32` argument. [#70036](https://github.com/ClickHouse/ClickHouse/pull/70036) ([Yarik Briukhovetskyi](https://github.com/yariks5s)).
* Fixes `Block structure mismatch` for queries with nested views and a `WHERE` condition. Fixes [#66209](https://github.com/ClickHouse/ClickHouse/issues/66209). [#70054](https://github.com/ClickHouse/ClickHouse/pull/70054) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
* Avoid reusing columns among different named tuples when evaluating `tuple` functions. This fixes [#70022](https://github.com/ClickHouse/ClickHouse/issues/70022). [#70103](https://github.com/ClickHouse/ClickHouse/pull/70103) ([Amos Bird](https://github.com/amosbird)).
* Fix a wrong LOGICAL_ERROR when replacing literals in ranges. [#70122](https://github.com/ClickHouse/ClickHouse/pull/70122) ([Pablo Marcos](https://github.com/pamarcos)).
* Check for the `Nullable(Nothing)` type during `ALTER TABLE MODIFY COLUMN/QUERY` to prevent tables with such a data type. [#70123](https://github.com/ClickHouse/ClickHouse/pull/70123) ([Pavel Kruglov](https://github.com/Avogar)).
* Proper error message for the illegal query `JOIN ... ON *`; closes [#68650](https://github.com/ClickHouse/ClickHouse/issues/68650). [#70124](https://github.com/ClickHouse/ClickHouse/pull/70124) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Fix a wrong result with a skipping index. [#70127](https://github.com/ClickHouse/ClickHouse/pull/70127) ([Raúl Marín](https://github.com/Algunenano)).
* Fix a data race in the ColumnObject/ColumnTuple decompress method that could lead to heap use after free. [#70137](https://github.com/ClickHouse/ClickHouse/pull/70137) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix a possible hang in ALTER COLUMN with the Dynamic type. [#70144](https://github.com/ClickHouse/ClickHouse/pull/70144) ([Pavel Kruglov](https://github.com/Avogar)).
* Now ClickHouse will consider more errors as retriable and will not mark data parts as broken in case of such errors. [#70145](https://github.com/ClickHouse/ClickHouse/pull/70145) ([alesapin](https://github.com/alesapin)).
* Use the correct `max_types` parameter during Dynamic type creation for a JSON subcolumn. [#70147](https://github.com/ClickHouse/ClickHouse/pull/70147) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix the password being displayed in `system.query_log` for users with the bcrypt password authentication method. [#70148](https://github.com/ClickHouse/ClickHouse/pull/70148) ([Nikolay Degterinsky](https://github.com/evillique)).
* Fix the event counter for the native interface (InterfaceNativeSendBytes). [#70153](https://github.com/ClickHouse/ClickHouse/pull/70153) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
* Fix a possible crash related to JSON columns. [#70172](https://github.com/ClickHouse/ClickHouse/pull/70172) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix multiple issues with `arrayMin` and `arrayMax`. [#70207](https://github.com/ClickHouse/ClickHouse/pull/70207) ([Raúl Marín](https://github.com/Algunenano)).
* Respect the setting `allow_simdjson` in the JSON type parser. [#70218](https://github.com/ClickHouse/ClickHouse/pull/70218) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix a null pointer dereference on creating a materialized view with two selects and an `INTERSECT`, e.g. `CREATE MATERIALIZED VIEW v0 AS (SELECT 1) INTERSECT (SELECT 1);`. [#70264](https://github.com/ClickHouse/ClickHouse/pull/70264) ([Konstantin Bogdanov](https://github.com/thevar1able)).
* Don't modify global settings with startup scripts. Previously, changing a setting in a startup script would change it globally. [#70310](https://github.com/ClickHouse/ClickHouse/pull/70310) ([Antonio Andelic](https://github.com/antonio2368)).
* Fix ALTER of the `Dynamic` type with a reduced `max_types` parameter, which could lead to a server crash. [#70328](https://github.com/ClickHouse/ClickHouse/pull/70328) ([Pavel Kruglov](https://github.com/Avogar)).
* Fix a crash when using WITH FILL incorrectly. [#70338](https://github.com/ClickHouse/ClickHouse/pull/70338) ([Raúl Marín](https://github.com/Algunenano)).
* Fix a possible use-after-free in `SYSTEM DROP FORMAT SCHEMA CACHE FOR Protobuf`. [#70358](https://github.com/ClickHouse/ClickHouse/pull/70358) ([Azat Khuzhin](https://github.com/azat)).
* Fix a crash during GROUP BY on a JSON sub-object subcolumn. [#70374](https://github.com/ClickHouse/ClickHouse/pull/70374) ([Pavel Kruglov](https://github.com/Avogar)).
* Don't prefetch parts for vertical merges if the part has no rows. [#70452](https://github.com/ClickHouse/ClickHouse/pull/70452) ([Antonio Andelic](https://github.com/antonio2368)).
* Fix a crash in WHERE with lambda functions. [#70464](https://github.com/ClickHouse/ClickHouse/pull/70464) ([Raúl Marín](https://github.com/Algunenano)).
* Fix table creation with `CREATE ... AS table_function(...)` with a `Replicated` database and an unavailable table function source on a secondary replica. [#70511](https://github.com/ClickHouse/ClickHouse/pull/70511) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Ignore all output on an async insert with `wait_for_async_insert=1`. Closes [#62644](https://github.com/ClickHouse/ClickHouse/issues/62644). [#70530](https://github.com/ClickHouse/ClickHouse/pull/70530) ([Konstantin Bogdanov](https://github.com/thevar1able)).
* Ignore `frozen_metadata.txt` while traversing the shadow directory from `system.remote_data_paths`. [#70590](https://github.com/ClickHouse/ClickHouse/pull/70590) ([Aleksei Filatov](https://github.com/aalexfvk)).
* Fix creation of stateful window functions on misaligned memory. [#70631](https://github.com/ClickHouse/ClickHouse/pull/70631) ([Raúl Marín](https://github.com/Algunenano)).
* Fixed rare crashes in `SELECT`s and merges after adding a column of `Array` type with a non-empty default expression. [#70695](https://github.com/ClickHouse/ClickHouse/pull/70695) ([Anton Popov](https://github.com/CurtizJ)).
* `INSERT` into the `s3` table function will respect query settings. [#70696](https://github.com/ClickHouse/ClickHouse/pull/70696) ([Vladimir Cherkasov](https://github.com/vdimir)).
* Fix infinite recursion when inferring a protobuf schema while skipping unsupported fields is enabled. [#70697](https://github.com/ClickHouse/ClickHouse/pull/70697) ([Raúl Marín](https://github.com/Algunenano)).
* Disable `enable_named_columns_in_function_tuple` by default. [#70833](https://github.com/ClickHouse/ClickHouse/pull/70833) ([Raúl Marín](https://github.com/Algunenano)).
* Fix the S3Queue table engine setting `processing_threads_num` not being effective when it was deduced from the number of CPU cores on the server. [#70837](https://github.com/ClickHouse/ClickHouse/pull/70837) ([Kseniia Sumarokova](https://github.com/kssenii)).
* Normalize named tuple arguments in aggregation states. This fixes [#69732](https://github.com/ClickHouse/ClickHouse/issues/69732). [#70853](https://github.com/ClickHouse/ClickHouse/pull/70853) ([Amos Bird](https://github.com/amosbird)).
* Fix a logical error due to negative zeros in the two-level hash table. This closes [#70973](https://github.com/ClickHouse/ClickHouse/issues/70973). [#70979](https://github.com/ClickHouse/ClickHouse/pull/70979) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
* Fix `LIMIT BY` and `LIMIT WITH TIES` for distributed queries and parallel replicas. [#70880](https://github.com/ClickHouse/ClickHouse/pull/70880) ([Nikita Taranov](https://github.com/nickitat)).


### <a id="249"></a> ClickHouse release 24.9, 2024-09-26

#### Backward Incompatible Change
README.md

@@ -42,31 +42,19 @@ Keep an eye out for upcoming meetups and events around the world. Somewhere else
Upcoming meetups

* [Jakarta Meetup](https://www.meetup.com/clickhouse-indonesia-user-group/events/303191359/) - October 1
* [Singapore Meetup](https://www.meetup.com/clickhouse-singapore-meetup-group/events/303212064/) - October 3
* [Madrid Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096564/) - October 22
* [Oslo Meetup](https://www.meetup.com/open-source-real-time-data-warehouse-real-time-analytics/events/302938622) - October 31
* [Barcelona Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096876/) - November 12
* [Ghent Meetup](https://www.meetup.com/clickhouse-belgium-user-group/events/303049405/) - November 19
* [Dubai Meetup](https://www.meetup.com/clickhouse-dubai-meetup-group/events/303096989/) - November 21
* [Paris Meetup](https://www.meetup.com/clickhouse-france-user-group/events/303096434) - November 26
* [Amsterdam Meetup](https://www.meetup.com/clickhouse-netherlands-user-group/events/303638814) - December 3
* [New York Meetup](https://www.meetup.com/clickhouse-new-york-user-group/events/304268174) - December 9
* [San Francisco Meetup](https://www.meetup.com/clickhouse-silicon-valley-meetup-group/events/304286951/) - December 12

Recently completed meetups

* [ClickHouse Guangzhou User Group Meetup](https://mp.weixin.qq.com/s/GSvo-7xUoVzCsuUvlLTpCw) - August 25
* [Seattle Meetup (Statsig)](https://www.meetup.com/clickhouse-seattle-user-group/events/302518075/) - August 27
* [Melbourne Meetup](https://www.meetup.com/clickhouse-australia-user-group/events/302732666/) - August 27
* [Sydney Meetup](https://www.meetup.com/clickhouse-australia-user-group/events/302862966/) - September 5
* [Zurich Meetup](https://www.meetup.com/clickhouse-switzerland-meetup-group/events/302267429/) - September 5
* [San Francisco Meetup (Cloudflare)](https://www.meetup.com/clickhouse-silicon-valley-meetup-group/events/302540575) - September 5
* [Raleigh Meetup (Deutsche Bank)](https://www.meetup.com/triangletechtalks/events/302723486/) - September 9
* [New York Meetup (Rokt)](https://www.meetup.com/clickhouse-new-york-user-group/events/302575342) - September 10
* [Toronto Meetup (Shopify)](https://www.meetup.com/clickhouse-toronto-user-group/events/301490855/) - September 10
* [Chicago Meetup (Jump Capital)](https://lu.ma/43tvmrfw) - September 12
* [London Meetup](https://www.meetup.com/clickhouse-london-user-group/events/302977267) - September 17
* [Austin Meetup](https://www.meetup.com/clickhouse-austin-user-group/events/302558689/) - September 17
* [Bangalore Meetup](https://www.meetup.com/clickhouse-bangalore-user-group/events/303208274/) - September 18
* [Tel Aviv Meetup](https://www.meetup.com/clickhouse-meetup-israel/events/303095121) - September 22
* [Madrid Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096564/) - October 22
* [Singapore Meetup](https://www.meetup.com/clickhouse-singapore-meetup-group/events/303212064/) - October 3
* [Jakarta Meetup](https://www.meetup.com/clickhouse-indonesia-user-group/events/303191359/) - October 1

## Recent Recordings
* **Recent Meetup Videos**: [Meetup Playlist](https://www.youtube.com/playlist?list=PL0Z2YDlm0b3iNDUzpY1S3L_iV4nARda_U) Whenever possible, recordings of the ClickHouse Community Meetups are edited and presented as individual talks. Currently featuring "Modern SQL in 2023", "Fast, Concurrent, and Consistent Asynchronous INSERTS in ClickHouse", and "Full-Text Indices: Design and Experiments".
@@ -25,7 +25,7 @@ EXTRA_COLUMNS_EXPRESSION_TRACE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayMap(x -> d

# coverage_log needs more columns for symbolization, but only symbol names (the line numbers are too heavy to calculate)
EXTRA_COLUMNS_COVERAGE_LOG="${EXTRA_COLUMNS} symbols Array(LowCardinality(String)), "
EXTRA_COLUMNS_EXPRESSION_COVERAGE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayMap(x -> demangle(addressToSymbol(x)), coverage)::Array(LowCardinality(String)) AS symbols"
EXTRA_COLUMNS_EXPRESSION_COVERAGE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayDistinct(arrayMap(x -> demangle(addressToSymbol(x)), coverage))::Array(LowCardinality(String)) AS symbols"


function __set_connection_args
@@ -28,7 +28,7 @@ COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && locale-gen en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LC_ALL=en_US.UTF-8

# Architecture of the image when BuildKit/buildx is used
ARG TARGETARCH
@@ -12,6 +12,7 @@ charset-normalizer==3.3.2
click==8.1.7
codespell==2.2.1
cryptography==43.0.1
datacompy==0.7.3
Deprecated==1.2.14
dill==0.3.8
flake8==4.0.1

@@ -23,6 +24,7 @@ mccabe==0.6.1
multidict==6.0.5
mypy==1.8.0
mypy-extensions==1.0.0
pandas==2.2.3
packaging==24.1
pathspec==0.9.0
pip==24.1.1
@@ -290,6 +290,7 @@ The following settings can be specified in configuration file for given endpoint
- `expiration_window_seconds` — Grace period for checking if expiration-based credentials have expired. Optional, default value is `120`.
- `no_sign_request` — Ignore all the credentials, so requests are not signed. Useful for accessing public buckets.
- `header` — Adds the specified HTTP header to a request to the given endpoint. Optional, can be specified multiple times.
- `access_header` — Adds the specified HTTP header to a request to the given endpoint, in cases where there are no other credentials from another source.
- `server_side_encryption_customer_key_base64` — If specified, required headers for accessing S3 objects with SSE-C encryption will be set. Optional.
- `server_side_encryption_kms_key_id` — If specified, required headers for accessing S3 objects with [SSE-KMS encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) will be set. If an empty string is specified, the AWS managed S3 key will be used. Optional.
- `server_side_encryption_kms_encryption_context` — If specified alongside `server_side_encryption_kms_key_id`, the given encryption context header for SSE-KMS will be set. Optional.

@@ -320,6 +321,32 @@ The following settings can be specified in configuration file for given endpoint
</s3>
```

## Working with archives

Suppose that we have several archive files with the following URIs on S3:

- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-10.csv.zip'
- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-11.csv.zip'
- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-12.csv.zip'

Extracting data from these archives is possible using `::`. Globs can be used both in the URL part and in the part after `::` (responsible for the name of the file inside the archive).

``` sql
SELECT *
FROM s3(
   'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-1{0..2}.csv.zip :: *.csv'
);
```

:::note
ClickHouse supports three archive formats: ZIP, TAR, and 7Z. While ZIP and TAR archives can be accessed from any supported storage location, 7Z archives can only be read from the local filesystem where ClickHouse is installed.
:::

## Accessing public buckets

ClickHouse tries to fetch credentials from many different types of sources.
@@ -3224,6 +3224,34 @@ Default value: "default"

**See Also**
- [Workload Scheduling](/docs/en/operations/workload-scheduling.md)

## workload_path {#workload_path}

The directory used as storage for all `CREATE WORKLOAD` and `CREATE RESOURCE` queries. By default, the `/workload/` folder under the server working directory is used.

**Example**

``` xml
<workload_path>/var/lib/clickhouse/workload/</workload_path>
```

**See Also**
- [Workload Hierarchy](/docs/en/operations/workload-scheduling.md#workloads)
- [workload_zookeeper_path](#workload_zookeeper_path)

## workload_zookeeper_path {#workload_zookeeper_path}

The path to a ZooKeeper node which is used as storage for all `CREATE WORKLOAD` and `CREATE RESOURCE` queries. For consistency, all SQL definitions are stored as the value of this single znode. By default, ZooKeeper is not used and definitions are stored on [disk](#workload_path).

**Example**

``` xml
<workload_zookeeper_path>/clickhouse/workload/definitions.sql</workload_zookeeper_path>
```

**See Also**
- [Workload Hierarchy](/docs/en/operations/workload-scheduling.md#workloads)
- [workload_path](#workload_path)

## max_authentication_methods_per_user {#max_authentication_methods_per_user}

The maximum number of authentication methods a user can be created with or altered to.
@@ -18,6 +18,11 @@ Columns:
- `1` — Current user can’t change the setting.
- `type` ([String](../../sql-reference/data-types/string.md)) — Setting type (implementation specific string value).
- `is_obsolete` ([UInt8](../../sql-reference/data-types/int-uint.md#uint-ranges)) — Shows whether a setting is obsolete.
- `tier` ([Enum8](../../sql-reference/data-types/enum.md)) — Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their development and the expectations one might have when using them. Values:
    - `'Production'` — The feature is stable, safe to use, and does not have issues interacting with other **production** features.
    - `'Beta'` — The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome.
    - `'Experimental'` — The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time.
    - `'Obsolete'` — No longer supported. Either it is already removed or it will be removed in future releases.

**Example**
docs/en/operations/system-tables/resources.md (new file)

@@ -0,0 +1,37 @@
---
slug: /en/operations/system-tables/resources
---
# resources

Contains information for [resources](/docs/en/operations/workload-scheduling.md#workload_entity_storage) residing on the local server. The table contains a row for every resource.

Example:

``` sql
SELECT *
FROM system.resources
FORMAT Vertical
```

``` text
Row 1:
──────
name: io_read
read_disks: ['s3']
write_disks: []
create_query: CREATE RESOURCE io_read (READ DISK s3)

Row 2:
──────
name: io_write
read_disks: []
write_disks: ['s3']
create_query: CREATE RESOURCE io_write (WRITE DISK s3)
```

Columns:

- `name` (`String`) - Resource name.
- `read_disks` (`Array(String)`) - The array of disk names that use this resource for read operations.
- `write_disks` (`Array(String)`) - The array of disk names that use this resource for write operations.
- `create_query` (`String`) - The definition of the resource.
@@ -18,6 +18,11 @@ Columns:
- `1` — Current user can’t change the setting.
- `default` ([String](../../sql-reference/data-types/string.md)) — Setting default value.
- `is_obsolete` ([UInt8](../../sql-reference/data-types/int-uint.md#uint-ranges)) — Shows whether a setting is obsolete.
- `tier` ([Enum8](../../sql-reference/data-types/enum.md)) — Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their development and the expectations one might have when using them. Values:
    - `'Production'` — The feature is stable, safe to use, and does not have issues interacting with other **production** features.
    - `'Beta'` — The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome.
    - `'Experimental'` — The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time.
    - `'Obsolete'` — No longer supported. Either it is already removed or it will be removed in future releases.

**Example**

@@ -26,19 +31,99 @@ The following example shows how to get information about settings which name contains `min_i`.
``` sql
SELECT *
FROM system.settings
WHERE name LIKE '%min_i%'
WHERE name LIKE '%min_insert_block_size_%'
FORMAT Vertical
```
``` text
┌─name───────────────────────────────────────────────┬─value─────┬─changed─┬─description────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─min──┬─max──┬─readonly─┬─type─────────┬─default───┬─alias_for─┬─is_obsolete─┐
│ min_insert_block_size_rows                          │ 1048449   │       0 │ Squash blocks passed to INSERT query to specified size in rows, if blocks are not big enough.                                                                         │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ UInt64       │ 1048449   │           │           0 │
│ min_insert_block_size_bytes                         │ 268402944 │       0 │ Squash blocks passed to INSERT query to specified size in bytes, if blocks are not big enough.                                                                        │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ UInt64       │ 268402944 │           │           0 │
│ min_insert_block_size_rows_for_materialized_views   │ 0         │       0 │ Like min_insert_block_size_rows, but applied only during pushing to MATERIALIZED VIEW (default: min_insert_block_size_rows)                                           │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ UInt64       │ 0         │           │           0 │
│ min_insert_block_size_bytes_for_materialized_views  │ 0         │       0 │ Like min_insert_block_size_bytes, but applied only during pushing to MATERIALIZED VIEW (default: min_insert_block_size_bytes)                                         │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ UInt64       │ 0         │           │           0 │
│ read_backoff_min_interval_between_events_ms         │ 1000      │       0 │ Settings to reduce the number of threads in case of slow reads. Do not pay attention to the event, if the previous one has passed less than a certain amount of time. │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ Milliseconds │ 1000      │           │           0 │
└─────────────────────────────────────────────────────┴───────────┴─────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────┴──────┴──────────┴──────────────┴───────────┴───────────┴─────────────┘
```
```
Row 1:
──────
name: min_insert_block_size_rows
value: 1048449
changed: 0
description: Sets the minimum number of rows in the block that can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones.

Possible values:

- Positive integer.
- 0 — Squashing disabled.

min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: UInt64
default: 1048449
alias_for:
is_obsolete: 0
tier: Production

Row 2:
──────
name: min_insert_block_size_bytes
value: 268402944
changed: 0
description: Sets the minimum number of bytes in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones.

Possible values:

- Positive integer.
- 0 — Squashing disabled.

min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: UInt64
default: 268402944
alias_for:
is_obsolete: 0
tier: Production

Row 3:
──────
name: min_insert_block_size_rows_for_materialized_views
value: 0
changed: 0
description: Sets the minimum number of rows in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. This setting is applied only for blocks inserted into [materialized view](../../sql-reference/statements/create/view.md). By adjusting this setting, you control blocks squashing while pushing to materialized view and avoid excessive memory usage.

Possible values:

- Any positive integer.
- 0 — Squashing disabled.

**See Also**

- [min_insert_block_size_rows](#min-insert-block-size-rows)

min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: UInt64
default: 0
alias_for:
is_obsolete: 0
tier: Production

Row 4:
──────
name: min_insert_block_size_bytes_for_materialized_views
value: 0
changed: 0
description: Sets the minimum number of bytes in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. This setting is applied only for blocks inserted into [materialized view](../../sql-reference/statements/create/view.md). By adjusting this setting, you control blocks squashing while pushing to materialized view and avoid excessive memory usage.

Possible values:

- Any positive integer.
- 0 — Squashing disabled.

**See also**

- [min_insert_block_size_bytes](#min-insert-block-size-bytes)

min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: UInt64
default: 0
alias_for:
is_obsolete: 0
tier: Production
```
Using `WHERE changed` can be useful, for example, when you want to check:
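
A minimal sketch of such a check (the column names are taken from the output above):

```sql
-- Show only the settings whose values differ from the defaults,
-- e.g. to verify that values from configuration files were picked up
SELECT name, value, `default`
FROM system.settings
WHERE changed
```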

40 docs/en/operations/system-tables/workloads.md Normal file
@ -0,0 +1,40 @@
---
slug: /en/operations/system-tables/workloads
---
# workloads

Contains information for [workloads](/docs/en/operations/workload-scheduling.md#workload_entity_storage) residing on the local server. The table contains a row for every workload.

Example:

``` sql
SELECT *
FROM system.workloads
FORMAT Vertical
```

``` text
Row 1:
──────
name: production
parent: all
create_query: CREATE WORKLOAD production IN `all` SETTINGS weight = 9

Row 2:
──────
name: development
parent: all
create_query: CREATE WORKLOAD development IN `all`

Row 3:
──────
name: all
parent:
create_query: CREATE WORKLOAD `all`
```

Columns:

- `name` (`String`) - Workload name.
- `parent` (`String`) - Parent workload name.
- `create_query` (`String`) - The definition of the workload.
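
The table can also be used to navigate the hierarchy; a small sketch using the workload names from the example above:

```sql
-- List the children of the root workload `all`
SELECT name, create_query
FROM system.workloads
WHERE parent = 'all'
```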
@ -43,6 +43,20 @@ Example:
</clickhouse>
```

An alternative way to express which disks are used by a resource is SQL syntax:

```sql
CREATE RESOURCE resource_name (WRITE DISK disk1, READ DISK disk2)
```

A resource can be used for any number of disks, for READ, for WRITE, or for both READ and WRITE. There is also a syntax for using a resource for all disks:

```sql
CREATE RESOURCE all_io (READ ANY DISK, WRITE ANY DISK);
```

Note that server configuration options take priority over the SQL way of defining resources.
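
To check which resources are currently defined via SQL, the [system.resources](/docs/en/operations/system-tables/resources.md) table can be queried; a small sketch, assuming it exposes `name` and `create_query` columns analogous to `system.workloads`:

```sql
SELECT name, create_query
FROM system.resources
```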

## Workload markup {#workload_markup}

Queries can be marked with the setting `workload` to distinguish between different workloads. If `workload` is not set, the value "default" is used. Note that you can specify another value using settings profiles. Setting constraints can be used to make `workload` constant if you want all queries from a user to be marked with a fixed value of the `workload` setting.
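
A short sketch of marking queries (the workload names are the ones used in the examples below; `my_table` is a placeholder):

```sql
-- Mark a single query
SELECT count() FROM my_table SETTINGS workload = 'production';

-- Or fix the value for the whole session
SET workload = 'development';
```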

@ -153,9 +167,48 @@ Example:
</clickhouse>
```

## Workload hierarchy (SQL only) {#workloads}

Defining resources and classifiers in XML can be challenging. ClickHouse provides SQL syntax that is much more convenient. All resources that were created with `CREATE RESOURCE` share the same structure of the hierarchy, but can differ in some aspects. Every workload created with `CREATE WORKLOAD` maintains a few automatically created scheduling nodes for every resource. A child workload can be created inside another parent workload. Here is an example that defines exactly the same hierarchy as the XML configuration above:

```sql
CREATE RESOURCE network_write (WRITE DISK s3)
CREATE RESOURCE network_read (READ DISK s3)
CREATE WORKLOAD all SETTINGS max_requests = 100
CREATE WORKLOAD development IN all
CREATE WORKLOAD production IN all SETTINGS weight = 3
```

The name of a leaf workload without children can be used in query settings: `SETTINGS workload = 'name'`. Note that workload classifiers are also created automatically when the SQL syntax is used.

To customize a workload, the following settings can be used:
* `priority` - sibling workloads are served according to static priority values (a lower value means a higher priority).
* `weight` - sibling workloads having the same static priority share resources according to weights.
* `max_requests` - the limit on the number of concurrent resource requests in this workload.
* `max_cost` - the limit on the total inflight bytes count of concurrent resource requests in this workload.
* `max_speed` - the limit on the byte processing rate of this workload (the limit is independent for every resource).
* `max_burst` - the maximum number of bytes that can be processed by the workload without being throttled (independently for every resource).

Note that workload settings are translated into a proper set of scheduling nodes. For more details, see the description of the scheduling node [types and options](#hierarchy).
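
For example, a sketch combining these settings (the names and numeric values are illustrative, not recommendations):

```sql
-- `admin` is served before its siblings because of the lower priority value
CREATE WORKLOAD admin IN all SETTINGS priority = -1;

-- `production` and `development` share what is left according to weights 9:1
CREATE OR REPLACE WORKLOAD production IN all SETTINGS weight = 9;
CREATE OR REPLACE WORKLOAD development IN all SETTINGS weight = 1;
```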

There is no way to specify different hierarchies of workloads for different resources. But there is a way to specify a different workload setting value for a specific resource:

```sql
CREATE OR REPLACE WORKLOAD all SETTINGS max_requests = 100, max_speed = 1000000 FOR network_read, max_speed = 2000000 FOR network_write
```

Also note that a workload or resource cannot be dropped while it is referenced by another workload. To update the definition of a workload, use the `CREATE OR REPLACE WORKLOAD` query.
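
Consequently, a hierarchy has to be dismantled leaf-first; a sketch based on the example hierarchy above:

```sql
DROP WORKLOAD production;
DROP WORKLOAD development;
DROP WORKLOAD all;           -- possible only once no workload references it
DROP RESOURCE network_read;
DROP RESOURCE network_write;
```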

## Workloads and resources storage {#workload_entity_storage}

Definitions of all workloads and resources in the form of `CREATE WORKLOAD` and `CREATE RESOURCE` queries are stored persistently either on disk at `workload_path` or in ZooKeeper at `workload_zookeeper_path`. ZooKeeper storage is recommended to achieve consistency between nodes. Alternatively, the `ON CLUSTER` clause can be used along with disk storage.
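
A sketch of the `ON CLUSTER` alternative (the cluster name is a placeholder, and the clause is assumed to take its usual position in the DDL statement):

```sql
CREATE WORKLOAD all ON CLUSTER my_cluster SETTINGS max_requests = 100
```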

## See also
- [system.scheduler](/docs/en/operations/system-tables/scheduler.md)
- [system.workloads](/docs/en/operations/system-tables/workloads.md)
- [system.resources](/docs/en/operations/system-tables/resources.md)
- [merge_workload](/docs/en/operations/settings/merge-tree-settings.md#merge_workload) merge tree setting
- [merge_workload](/docs/en/operations/server-configuration-parameters/settings.md#merge_workload) global server setting
- [mutation_workload](/docs/en/operations/settings/merge-tree-settings.md#mutation_workload) merge tree setting
- [mutation_workload](/docs/en/operations/server-configuration-parameters/settings.md#mutation_workload) global server setting
- [workload_path](/docs/en/operations/server-configuration-parameters/settings.md#workload_path) global server setting
- [workload_zookeeper_path](/docs/en/operations/server-configuration-parameters/settings.md#workload_zookeeper_path) global server setting

@ -12,7 +12,7 @@ Syntax:

``` sql
ALTER USER [IF EXISTS] name1 [RENAME TO new_name |, name2 [,...]]
[ON CLUSTER cluster_name]
[NOT IDENTIFIED | RESET AUTHENTICATION METHODS TO NEW | {IDENTIFIED | ADD IDENTIFIED} {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']}
[NOT IDENTIFIED | RESET AUTHENTICATION METHODS TO NEW | {IDENTIFIED | ADD IDENTIFIED} {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} [VALID UNTIL datetime]
[, {[{plaintext_password | sha256_password | sha256_hash | ...}] BY {'password' | 'hash'}} | {ldap SERVER 'server_name'} | {...} | ... [,...]]]
[[ADD | DROP] HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE]
[VALID UNTIL datetime]

@ -91,3 +91,15 @@ Reset authentication methods and keep the most recent added one:

``` sql
ALTER USER user1 RESET AUTHENTICATION METHODS TO NEW
```

## VALID UNTIL Clause

Allows you to specify the expiration date and, optionally, the time for an authentication method. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
The `VALID UNTIL` clause can only be specified along with an authentication method, except for the case where no authentication method has been specified in the query. In this scenario, the `VALID UNTIL` clause will be applied to all existing authentication methods.

Examples:

- `ALTER USER name1 VALID UNTIL '2025-01-01'`
- `ALTER USER name1 VALID UNTIL '2025-01-01 12:00:00 UTC'`
- `ALTER USER name1 VALID UNTIL 'infinity'`
- `ALTER USER name1 IDENTIFIED WITH plaintext_password BY 'no_expiration', bcrypt_password BY 'expiration_set' VALID UNTIL '2025-01-01'`

@ -11,7 +11,7 @@ Syntax:

``` sql
CREATE USER [IF NOT EXISTS | OR REPLACE] name1 [, name2 [,...]] [ON CLUSTER cluster_name]
[NOT IDENTIFIED | IDENTIFIED {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']}
[NOT IDENTIFIED | IDENTIFIED {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} [VALID UNTIL datetime]
[, {[{plaintext_password | sha256_password | sha256_hash | ...}] BY {'password' | 'hash'}} | {ldap SERVER 'server_name'} | {...} | ... [,...]]]
[HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE]
[VALID UNTIL datetime]

@ -178,7 +178,8 @@ ClickHouse treats `user_name@'address'` as a username as a whole. Thus, technica

## VALID UNTIL Clause

Allows you to specify the expiration date and, optionally, the time for user credentials. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
Allows you to specify the expiration date and, optionally, the time for an authentication method. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
The `VALID UNTIL` clause can only be specified along with an authentication method, except for the case where no authentication method has been specified in the query. In this scenario, the `VALID UNTIL` clause will be applied to all existing authentication methods.

Examples:

@ -186,6 +187,7 @@ Examples:
- `CREATE USER name1 VALID UNTIL '2025-01-01 12:00:00 UTC'`
- `CREATE USER name1 VALID UNTIL 'infinity'`
- ```CREATE USER name1 VALID UNTIL '2025-01-01 12:00:00 `Asia/Tokyo`'```
- `CREATE USER name1 IDENTIFIED WITH plaintext_password BY 'no_expiration', bcrypt_password BY 'expiration_set' VALID UNTIL '2025-01-01'`

## GRANTEES Clause

@ -78,6 +78,10 @@ Specifying privileges you can use asterisk (`*`) instead of a table or a databas
Also, you can omit the database name. In this case privileges are granted for the current database.
For example, `GRANT SELECT ON * TO john` grants the privilege on all the tables in the current database, `GRANT SELECT ON mytable TO john` grants the privilege on the `mytable` table in the current database.

:::note
The feature described below is available starting from ClickHouse version 24.10.
:::

You can also put asterisks at the end of a table or a database name. This feature allows you to grant privileges on an abstract prefix of the table's path.
Example: `GRANT SELECT ON db.my_tables* TO john`. This query allows `john` to execute the `SELECT` query over all the tables in the `db` database whose names start with the prefix `my_tables`.
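
A further hedged example, combining the prefix wildcard with an omitted database name (so the grant applies to the current database):

```sql
-- Allow john to read any table in the current database whose name starts with 'sales_'
GRANT SELECT ON sales_* TO john
```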
@ -83,7 +83,7 @@ The presence of long-running or incomplete mutations often indicates that a Clic
- Or manually kill some of these mutations by sending a `KILL` command.

``` sql
KILL MUTATION [ON CLUSTER cluster]
KILL MUTATION
WHERE <where expression to SELECT FROM system.mutations query>
[TEST]
[FORMAT format]

@ -135,7 +135,6 @@ KILL MUTATION WHERE database = 'default' AND table = 'table'
-- Cancel the specific mutation:
KILL MUTATION WHERE database = 'default' AND table = 'table' AND mutation_id = 'mutation_3.txt'
```

:::tip
If you are killing a mutation in ClickHouse Cloud or in a self-managed cluster, be sure to use the `ON CLUSTER [cluster-name]` option to ensure the mutation is killed on all replicas.
:::

The query is useful when a mutation is stuck and cannot finish (e.g. if some function in the mutation query throws an exception when applied to the data contained in the table).
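
Before killing anything, it can help to inspect what exactly is stuck; a small sketch using `system.mutations`:

```sql
-- Unfinished mutations together with their latest failure, oldest first
SELECT database, table, mutation_id, latest_fail_reason
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time
```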
@ -284,6 +284,14 @@ FROM s3(
);
```

:::note
ClickHouse supports three archive formats:
- ZIP
- TAR
- 7Z

While ZIP and TAR archives can be accessed from any supported storage location, 7Z archives can only be read from the local filesystem where ClickHouse is installed.
:::
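
A hedged sketch of reading a file from a ZIP archive on S3, assuming the `::` archive-path separator applies to the `s3` table function as it does for `file` (the bucket and file names are placeholders):

```sql
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/backup.zip :: data.csv', 'CSVWithNames');
```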

## Virtual Columns {#virtual-columns}

@ -138,6 +138,7 @@ CREATE TABLE table_with_asterisk (name String, value UInt32)
- `use_insecure_imds_request` — if `true`, a less secure connection is used when requesting credentials from the Amazon EC2 instance metadata service (IMDS). Default value: `false`.
- `region` — the name of the S3 region.
- `header` — adds the specified HTTP header to a request to the given endpoint. Can be specified multiple times.
- `access_header` — adds the specified HTTP header to a request to the given endpoint, when no other means of authentication are specified.
- `server_side_encryption_customer_key_base64` — sets the headers required to access S3 objects with SSE-C encryption.
- `single_read_retries` — the maximum number of retry attempts for a single read. Default value: `4`.

@ -590,6 +590,7 @@ try

#if USE_SSL
CertificateReloader::instance().tryLoad(*config);
CertificateReloader::instance().tryLoadClient(*config);
#endif
});

@ -86,7 +86,7 @@
#include <Dictionaries/registerDictionaries.h>
#include <Disks/registerDisks.h>
#include <Common/Scheduler/Nodes/registerSchedulerNodes.h>
#include <Common/Scheduler/Nodes/registerResourceManagers.h>
#include <Common/Scheduler/Workload/IWorkloadEntityStorage.h>
#include <Common/Config/ConfigReloader.h>
#include <Server/HTTPHandlerFactory.h>
#include "MetricsTransmitter.h"

@ -168,6 +168,7 @@ namespace ServerSetting
{
extern const ServerSettingsUInt32 asynchronous_heavy_metrics_update_period_s;
extern const ServerSettingsUInt32 asynchronous_metrics_update_period_s;
extern const ServerSettingsBool asynchronous_metrics_enable_heavy_metrics;
extern const ServerSettingsBool async_insert_queue_flush_on_shutdown;
extern const ServerSettingsUInt64 async_insert_threads;
extern const ServerSettingsBool async_load_databases;

@ -206,7 +207,6 @@ namespace ServerSetting
extern const ServerSettingsBool format_alter_operations_with_parentheses;
extern const ServerSettingsUInt64 global_profiler_cpu_time_period_ns;
extern const ServerSettingsUInt64 global_profiler_real_time_period_ns;
extern const ServerSettingsDouble gwp_asan_force_sample_probability;
extern const ServerSettingsUInt64 http_connections_soft_limit;
extern const ServerSettingsUInt64 http_connections_store_limit;
extern const ServerSettingsUInt64 http_connections_warn_limit;

@ -621,7 +621,7 @@ void sanityChecks(Server & server)
#if defined(OS_LINUX)
try
{
const std::unordered_set<std::string> fastClockSources = {
const std::unordered_set<std::string> fast_clock_sources = {
// ARM clock
"arch_sys_counter",
// KVM guest clock

@ -630,7 +630,7 @@ void sanityChecks(Server & server)
"tsc",
};
const char * filename = "/sys/devices/system/clocksource/clocksource0/current_clocksource";
if (!fastClockSources.contains(readLine(filename)))
if (!fast_clock_sources.contains(readLine(filename)))
server.context()->addWarningMessage("Linux is not using a fast clock source. Performance can be degraded. Check " + String(filename));
}
catch (...) // NOLINT(bugprone-empty-catch)

@ -920,7 +920,6 @@ try
registerFormats();
registerRemoteFileMetadatas();
registerSchedulerNodes();
registerResourceManagers();

CurrentMetrics::set(CurrentMetrics::Revision, ClickHouseRevision::getVersionRevision());
CurrentMetrics::set(CurrentMetrics::VersionInteger, ClickHouseRevision::getVersionInteger());

@ -1061,6 +1060,7 @@ try
ServerAsynchronousMetrics async_metrics(
global_context,
server_settings[ServerSetting::asynchronous_metrics_update_period_s],
server_settings[ServerSetting::asynchronous_metrics_enable_heavy_metrics],
server_settings[ServerSetting::asynchronous_heavy_metrics_update_period_s],
[&]() -> std::vector<ProtocolServerMetrics>
{

@ -1928,10 +1928,6 @@ try
if (global_context->isServerCompletelyStarted())
CannotAllocateThreadFaultInjector::setFaultProbability(new_server_settings[ServerSetting::cannot_allocate_thread_fault_injection_probability]);

#if USE_GWP_ASAN
GWPAsan::setForceSampleProbability(new_server_settings[ServerSetting::gwp_asan_force_sample_probability]);
#endif

ProfileEvents::increment(ProfileEvents::MainConfigLoads);

/// Must be the last.

@ -2256,6 +2252,8 @@ try
database_catalog.assertDatabaseExists(default_database);
/// Load user-defined SQL functions.
global_context->getUserDefinedSQLObjectsStorage().loadObjects();
/// Load WORKLOADs and RESOURCEs.
global_context->getWorkloadEntityStorage().loadEntities();

global_context->getRefreshSet().setRefreshesStopped(false);
}

@ -2343,6 +2341,7 @@ try

#if USE_SSL
CertificateReloader::instance().tryLoad(config());
CertificateReloader::instance().tryLoadClient(config());
#endif

/// Must be done after initialization of `servers`, because async_metrics will access `servers` variable from its thread.

@ -2438,7 +2437,6 @@ try

#if USE_GWP_ASAN
GWPAsan::initFinished();
GWPAsan::setForceSampleProbability(server_settings[ServerSetting::gwp_asan_force_sample_probability]);
#endif

try

@ -1399,6 +1399,10 @@
If not specified they will be stored locally. -->
<!-- <user_defined_zookeeper_path>/clickhouse/user_defined</user_defined_zookeeper_path> -->

<!-- Path in ZooKeeper to store workloads and resources created by the commands CREATE WORKLOAD and CREATE RESOURCE.
If not specified they will be stored locally. -->
<!-- <workload_zookeeper_path>/clickhouse/workload/definitions.sql</workload_zookeeper_path> -->

<!-- Uncomment if you want data to be compressed 30-100% better.
Don't do that if you just started using ClickHouse.
-->

@ -1,12 +1,16 @@
#include <Access/AccessControl.h>
#include <Access/AuthenticationData.h>
#include <Common/Exception.h>
#include <Interpreters/Access/getValidUntilFromAST.h>
#include <Interpreters/Context.h>
#include <Interpreters/evaluateConstantExpression.h>
#include <Parsers/ASTExpressionList.h>
#include <Parsers/ASTLiteral.h>
#include <Parsers/Access/ASTPublicSSHKey.h>
#include <Storages/checkAndGetLiteralArgument.h>
#include <IO/parseDateTimeBestEffort.h>
#include <IO/ReadHelpers.h>
#include <IO/ReadBufferFromString.h>

#include <Common/OpenSSLHelpers.h>
#include <Poco/SHA1Engine.h>

@ -113,7 +117,8 @@ bool operator ==(const AuthenticationData & lhs, const AuthenticationData & rhs)
&& (lhs.ssh_keys == rhs.ssh_keys)
#endif
&& (lhs.http_auth_scheme == rhs.http_auth_scheme)
&& (lhs.http_auth_server_name == rhs.http_auth_server_name);
&& (lhs.http_auth_server_name == rhs.http_auth_server_name)
&& (lhs.valid_until == rhs.valid_until);
}

@ -384,14 +389,34 @@ std::shared_ptr<ASTAuthenticationData> AuthenticationData::toAST() const
throw Exception(ErrorCodes::LOGICAL_ERROR, "AST: Unexpected authentication type {}", toString(auth_type));
}

if (valid_until)
{
WriteBufferFromOwnString out;
writeDateTimeText(valid_until, out);

node->valid_until = std::make_shared<ASTLiteral>(out.str());
}

return node;
}

AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & query, ContextPtr context, bool validate)
{
time_t valid_until = 0;

if (query.valid_until)
{
valid_until = getValidUntilFromAST(query.valid_until, context);
}

if (query.type && query.type == AuthenticationType::NO_PASSWORD)
return AuthenticationData();
{
AuthenticationData auth_data;
auth_data.setValidUntil(valid_until);
return auth_data;
}

/// For this type of authentication we have ASTPublicSSHKey as children for ASTAuthenticationData
if (query.type && query.type == AuthenticationType::SSH_KEY)

@ -418,6 +443,7 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que
}

auth_data.setSSHKeys(std::move(keys));
auth_data.setValidUntil(valid_until);
return auth_data;
#else
throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "SSH is disabled, because ClickHouse is built without libssh");

@ -451,6 +477,8 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que

AuthenticationData auth_data(current_type);

auth_data.setValidUntil(valid_until);

if (validate)
context->getAccessControl().checkPasswordComplexityRules(value);

@ -494,6 +522,7 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que
}

AuthenticationData auth_data(*query.type);
auth_data.setValidUntil(valid_until);

if (query.contains_hash)
{

@ -74,6 +74,9 @@ public:
const String & getHTTPAuthenticationServerName() const { return http_auth_server_name; }
void setHTTPAuthenticationServerName(const String & name) { http_auth_server_name = name; }

time_t getValidUntil() const { return valid_until; }
void setValidUntil(time_t valid_until_) { valid_until = valid_until_; }

friend bool operator ==(const AuthenticationData & lhs, const AuthenticationData & rhs);
friend bool operator !=(const AuthenticationData & lhs, const AuthenticationData & rhs) { return !(lhs == rhs); }

@ -106,6 +109,7 @@ private:
/// HTTP authentication properties
String http_auth_server_name;
HTTPAuthenticationScheme http_auth_scheme = HTTPAuthenticationScheme::BASIC;
time_t valid_until = 0;
};

}

@ -99,6 +99,8 @@ enum class AccessType : uint8_t
M(CREATE_ARBITRARY_TEMPORARY_TABLE, "", GLOBAL, CREATE) /* allows to create and manipulate temporary tables
with arbitrary table engine */\
M(CREATE_FUNCTION, "", GLOBAL, CREATE) /* allows to execute CREATE FUNCTION */ \
M(CREATE_WORKLOAD, "", GLOBAL, CREATE) /* allows to execute CREATE WORKLOAD */ \
M(CREATE_RESOURCE, "", GLOBAL, CREATE) /* allows to execute CREATE RESOURCE */ \
M(CREATE_NAMED_COLLECTION, "", NAMED_COLLECTION, NAMED_COLLECTION_ADMIN) /* allows to execute CREATE NAMED COLLECTION */ \
M(CREATE, "", GROUP, ALL) /* allows to execute {CREATE|ATTACH} */ \
\

@ -108,6 +110,8 @@ enum class AccessType : uint8_t
implicitly enabled by the grant DROP_TABLE */\
M(DROP_DICTIONARY, "", DICTIONARY, DROP) /* allows to execute {DROP|DETACH} DICTIONARY */\
M(DROP_FUNCTION, "", GLOBAL, DROP) /* allows to execute DROP FUNCTION */\
M(DROP_WORKLOAD, "", GLOBAL, DROP) /* allows to execute DROP WORKLOAD */\
M(DROP_RESOURCE, "", GLOBAL, DROP) /* allows to execute DROP RESOURCE */\
M(DROP_NAMED_COLLECTION, "", NAMED_COLLECTION, NAMED_COLLECTION_ADMIN) /* allows to execute DROP NAMED COLLECTION */\
M(DROP, "", GROUP, ALL) /* allows to execute {DROP|DETACH} */\
\

@ -159,6 +163,7 @@ enum class AccessType : uint8_t
M(SYSTEM_SHUTDOWN, "SYSTEM KILL, SHUTDOWN", GLOBAL, SYSTEM) \
M(SYSTEM_DROP_DNS_CACHE, "SYSTEM DROP DNS, DROP DNS CACHE, DROP DNS", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_CONNECTIONS_CACHE, "SYSTEM DROP CONNECTIONS CACHE, DROP CONNECTIONS CACHE", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_PREWARM_MARK_CACHE, "SYSTEM PREWARM MARK, PREWARM MARK CACHE, PREWARM MARKS", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_MARK_CACHE, "SYSTEM DROP MARK, DROP MARK CACHE, DROP MARKS", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_UNCOMPRESSED_CACHE, "SYSTEM DROP UNCOMPRESSED, DROP UNCOMPRESSED CACHE, DROP UNCOMPRESSED", GLOBAL, SYSTEM_DROP_CACHE) \
M(SYSTEM_DROP_MMAP_CACHE, "SYSTEM DROP MMAP, DROP MMAP CACHE, DROP MMAP", GLOBAL, SYSTEM_DROP_CACHE) \

@ -701,15 +701,17 @@ bool ContextAccess::checkAccessImplHelper(const ContextPtr & context, AccessFlag

const AccessFlags dictionary_ddl = AccessType::CREATE_DICTIONARY | AccessType::DROP_DICTIONARY;
const AccessFlags function_ddl = AccessType::CREATE_FUNCTION | AccessType::DROP_FUNCTION;
const AccessFlags workload_ddl = AccessType::CREATE_WORKLOAD | AccessType::DROP_WORKLOAD;
const AccessFlags resource_ddl = AccessType::CREATE_RESOURCE | AccessType::DROP_RESOURCE;
const AccessFlags table_and_dictionary_ddl = table_ddl | dictionary_ddl;
const AccessFlags table_and_dictionary_and_function_ddl = table_ddl | dictionary_ddl | function_ddl;
const AccessFlags write_table_access = AccessType::INSERT | AccessType::OPTIMIZE;
const AccessFlags write_dcl_access = AccessType::ACCESS_MANAGEMENT - AccessType::SHOW_ACCESS;

const AccessFlags not_readonly_flags = write_table_access | table_and_dictionary_and_function_ddl | write_dcl_access | AccessType::SYSTEM | AccessType::KILL_QUERY;
const AccessFlags not_readonly_flags = write_table_access | table_and_dictionary_and_function_ddl | workload_ddl | resource_ddl | write_dcl_access | AccessType::SYSTEM | AccessType::KILL_QUERY;
const AccessFlags not_readonly_1_flags = AccessType::CREATE_TEMPORARY_TABLE;

const AccessFlags ddl_flags = table_ddl | dictionary_ddl | function_ddl;
const AccessFlags ddl_flags = table_ddl | dictionary_ddl | function_ddl | workload_ddl | resource_ddl;
const AccessFlags introspection_flags = AccessType::INTROSPECTION;
};
static const PrecalculatedFlags precalc;

@ -554,7 +554,7 @@ std::optional<AuthResult> IAccessStorage::authenticateImpl(
continue;
}

if (areCredentialsValid(user->getName(), user->valid_until, auth_method, credentials, external_authenticators, auth_result.settings))
if (areCredentialsValid(user->getName(), auth_method, credentials, external_authenticators, auth_result.settings))
{
auth_result.authentication_data = auth_method;
return auth_result;

@ -579,7 +579,6 @@ std::optional<AuthResult> IAccessStorage::authenticateImpl(

bool IAccessStorage::areCredentialsValid(
const std::string & user_name,
time_t valid_until,
const AuthenticationData & authentication_method,
const Credentials & credentials,
const ExternalAuthenticators & external_authenticators,

@ -591,6 +590,7 @@ bool IAccessStorage::areCredentialsValid(
if (credentials.getUserName() != user_name)
return false;

auto valid_until = authentication_method.getValidUntil();
if (valid_until)
{
const time_t now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());

@ -236,7 +236,6 @@ protected:
bool allow_plaintext_password) const;
virtual bool areCredentialsValid(
const std::string & user_name,
time_t valid_until,
const AuthenticationData & authentication_method,
const Credentials & credentials,
const ExternalAuthenticators & external_authenticators,

@ -19,8 +19,7 @@ bool User::equal(const IAccessEntity & other) const
return (authentication_methods == other_user.authentication_methods)
&& (allowed_client_hosts == other_user.allowed_client_hosts)
&& (access == other_user.access) && (granted_roles == other_user.granted_roles) && (default_roles == other_user.default_roles)
&& (settings == other_user.settings) && (grantees == other_user.grantees) && (default_database == other_user.default_database)
&& (valid_until == other_user.valid_until);
&& (settings == other_user.settings) && (grantees == other_user.grantees) && (default_database == other_user.default_database);
}

void User::setName(const String & name_)

@ -88,7 +87,6 @@ void User::clearAllExceptDependencies()
access = {};
settings.removeSettingsKeepProfiles();
default_database = {};
valid_until = 0;
}

}

@ -23,7 +23,6 @@ struct User : public IAccessEntity
SettingsProfileElements settings;
RolesOrUsersSet grantees = RolesOrUsersSet::AllTag{};
String default_database;
time_t valid_until = 0;

bool equal(const IAccessEntity & other) const override;
std::shared_ptr<IAccessEntity> clone() const override { return cloneImpl<User>(); }

@ -227,8 +227,13 @@ void QueryAnalyzer::resolveConstantExpression(QueryTreeNodePtr & node, const Que
scope.context = context;

auto node_type = node->getNodeType();
if (node_type == QueryTreeNodeType::QUERY || node_type == QueryTreeNodeType::UNION)
{
evaluateScalarSubqueryIfNeeded(node, scope);
return;
}

if (table_expression && node_type != QueryTreeNodeType::QUERY && node_type != QueryTreeNodeType::UNION)
if (table_expression)
{
scope.expression_join_tree_node = table_expression;
validateTableExpressionModifiers(scope.expression_join_tree_node, scope);

@ -136,6 +136,7 @@ add_headers_and_sources(dbms Storages/ObjectStorage/HDFS)
add_headers_and_sources(dbms Storages/ObjectStorage/Local)
add_headers_and_sources(dbms Storages/ObjectStorage/DataLakes)
add_headers_and_sources(dbms Common/NamedCollections)
add_headers_and_sources(dbms Common/Scheduler/Workload)

if (TARGET ch_contrib::amqp_cpp)
add_headers_and_sources(dbms Storages/RabbitMQ)

@ -418,7 +418,7 @@ void ClientApplicationBase::init(int argc, char ** argv)
UInt64 max_client_memory_usage_int = parseWithSizeSuffix<UInt64>(max_client_memory_usage.c_str(), max_client_memory_usage.length());

total_memory_tracker.setHardLimit(max_client_memory_usage_int);
total_memory_tracker.setDescription("(total)");
total_memory_tracker.setDescription("Global");
total_memory_tracker.setMetric(CurrentMetrics::MemoryTracking);
}

@ -470,8 +470,7 @@ void ClientBase::onData(Block & block, ASTPtr parsed_query)
{
if (!need_render_progress && select_into_file && !select_into_file_and_stdout)
error_stream << "\r";
bool toggle_enabled = getClientConfiguration().getBool("enable-progress-table-toggle", true);
progress_table.writeTable(*tty_buf, progress_table_toggle_on.load(), toggle_enabled);
progress_table.writeTable(*tty_buf, progress_table_toggle_on.load(), progress_table_toggle_enabled);
}
}

@ -825,6 +824,9 @@ void ClientBase::initTTYBuffer(ProgressOption progress_option, ProgressOption pr
if (!need_render_progress && !need_render_progress_table)
return;

progress_table_toggle_enabled = getClientConfiguration().getBool("enable-progress-table-toggle");
progress_table_toggle_on = !progress_table_toggle_enabled;

/// If need_render_progress and need_render_progress_table are enabled,
/// use ProgressOption that was set for the progress bar for progress table as well.
ProgressOption progress = progress_option ? progress_option : progress_table_option;

@ -881,7 +883,7 @@ void ClientBase::initTTYBuffer(ProgressOption progress_option, ProgressOption pr

void ClientBase::initKeystrokeInterceptor()
{
if (is_interactive && need_render_progress_table && getClientConfiguration().getBool("enable-progress-table-toggle", true))
if (is_interactive && need_render_progress_table && progress_table_toggle_enabled)
{
keystroke_interceptor = std::make_unique<TerminalKeystrokeInterceptor>(in_fd, error_stream);
keystroke_interceptor->registerCallback(' ', [this]() { progress_table_toggle_on = !progress_table_toggle_on; });

@ -1151,6 +1153,7 @@ void ClientBase::receiveResult(ASTPtr parsed_query, Int32 signals_before_stop, b

if (keystroke_interceptor)
{
progress_table_toggle_on = false;
try
{
keystroke_interceptor->startIntercept();

@ -1446,10 +1449,27 @@ void ClientBase::onProfileEvents(Block & block)
/// Flush all buffers.
void ClientBase::resetOutput()
{
if (need_render_progress_table && tty_buf)
progress_table.clearTableOutput(*tty_buf);

/// Order is important: format, compression, file

if (output_format)
output_format->finalize();
try
{
if (output_format)
output_format->finalize();
}
catch (...)
{
/// We need to make sure we continue resetting output_format (will stop threads on parallel output)
/// as well as cleaning other output related setup
if (!have_error)
{
client_exception
= std::make_unique<Exception>(getCurrentExceptionMessageAndPattern(print_stack_trace), getCurrentExceptionCode());
have_error = true;
}
}
output_format.reset();

logs_out_stream.reset();

@ -340,6 +340,7 @@ protected:
ProgressTable progress_table;
bool need_render_progress = true;
bool need_render_progress_table = true;
bool progress_table_toggle_enabled = true;
std::atomic_bool progress_table_toggle_on = false;
bool need_render_profile_events = true;
bool written_first_block = false;

@ -180,9 +180,12 @@ void writeWithWidth(Out & out, std::string_view s, size_t width)
template <typename Out>
void writeWithWidthStrict(Out & out, std::string_view s, size_t width)
{
chassert(width != 0);
constexpr std::string_view ellipsis = "…";
if (s.size() > width)
out << s.substr(0, width - 1) << "…";
if (width <= ellipsis.size())
out << s.substr(0, width);
else
out << s.substr(0, width - ellipsis.size()) << ellipsis;
else
out << s;
}

@ -219,7 +222,9 @@ void ProgressTable::writeTable(WriteBufferFromFileDescriptor & message, bool sho
writeWithWidth(message, COLUMN_EVENT_NAME, column_event_name_width);
writeWithWidth(message, COLUMN_VALUE, COLUMN_VALUE_WIDTH);
writeWithWidth(message, COLUMN_PROGRESS, COLUMN_PROGRESS_WIDTH);
writeWithWidth(message, COLUMN_DOCUMENTATION_NAME, COLUMN_DOCUMENTATION_WIDTH);
auto col_doc_width = getColumnDocumentationWidth(terminal_width);
if (col_doc_width)
writeWithWidth(message, COLUMN_DOCUMENTATION_NAME, col_doc_width);
message << CLEAR_TO_END_OF_LINE;

double elapsed_sec = watch.elapsedSeconds();

@ -257,9 +262,12 @@ void ProgressTable::writeTable(WriteBufferFromFileDescriptor & message, bool sho

writeWithWidth(message, formatReadableValue(value_type, progress) + "/s", COLUMN_PROGRESS_WIDTH);

message << setColorForDocumentation();
const auto * doc = getDocumentation(event_name_to_event.at(name));
writeWithWidthStrict(message, doc, COLUMN_DOCUMENTATION_WIDTH);
if (col_doc_width)
{
message << setColorForDocumentation();
const auto * doc = getDocumentation(event_name_to_event.at(name));
writeWithWidthStrict(message, doc, col_doc_width);
}

message << RESET_COLOR;
message << CLEAR_TO_END_OF_LINE;

@ -372,6 +380,14 @@ size_t ProgressTable::tableSize() const
return metrics.empty() ? 0 : metrics.size() + 1;
}

size_t ProgressTable::getColumnDocumentationWidth(size_t terminal_width) const
{
auto fixed_columns_width = column_event_name_width + COLUMN_VALUE_WIDTH + COLUMN_PROGRESS_WIDTH;
if (terminal_width < fixed_columns_width + COLUMN_DOCUMENTATION_MIN_WIDTH)
return 0;
return terminal_width - fixed_columns_width;
}

ProgressTable::MetricInfo::MetricInfo(ProfileEvents::Type t) : type(t)
{
}

@ -87,6 +87,7 @@ private:
};

size_t tableSize() const;
size_t getColumnDocumentationWidth(size_t terminal_width) const;

using MetricName = String;

@ -110,7 +111,7 @@ private:
static constexpr std::string_view COLUMN_DOCUMENTATION_NAME = "Documentation";
static constexpr size_t COLUMN_VALUE_WIDTH = 20;
static constexpr size_t COLUMN_PROGRESS_WIDTH = 20;
static constexpr size_t COLUMN_DOCUMENTATION_WIDTH = 100;
static constexpr size_t COLUMN_DOCUMENTATION_MIN_WIDTH = COLUMN_DOCUMENTATION_NAME.size();

std::ostream & output_stream;
int in_fd;

@ -369,6 +369,23 @@ void ColumnArray::popBack(size_t n)
offsets_data.resize_assume_reserved(offsets_data.size() - n);
}

ColumnCheckpointPtr ColumnArray::getCheckpoint() const
{
return std::make_shared<ColumnCheckpointWithNested>(size(), getData().getCheckpoint());
}

void ColumnArray::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
checkpoint.size = size();
getData().updateCheckpoint(*assert_cast<ColumnCheckpointWithNested &>(checkpoint).nested);
}

void ColumnArray::rollback(const ColumnCheckpoint & checkpoint)
{
getOffsets().resize_assume_reserved(checkpoint.size);
getData().rollback(*assert_cast<const ColumnCheckpointWithNested &>(checkpoint).nested);
}

int ColumnArray::compareAtImpl(size_t n, size_t m, const IColumn & rhs_, int nan_direction_hint, const Collator * collator) const
{
const ColumnArray & rhs = assert_cast<const ColumnArray &>(rhs_);

@ -161,6 +161,10 @@ public:

ColumnPtr compress() const override;

ColumnCheckpointPtr getCheckpoint() const override;
void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;
void rollback(const ColumnCheckpoint & checkpoint) override;

void forEachSubcolumn(MutableColumnCallback callback) override
{
callback(offsets);

@ -1000,6 +1000,56 @@ ColumnPtr ColumnDynamic::compress() const
});
}

void ColumnDynamic::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
auto & nested = assert_cast<ColumnCheckpointWithMultipleNested &>(checkpoint).nested;
const auto & variants = variant_column_ptr->getVariants();

size_t old_size = nested.size();
chassert(old_size <= variants.size());

for (size_t i = 0; i < old_size; ++i)
{
variants[i]->updateCheckpoint(*nested[i]);
}

/// If column has new variants since last checkpoint create checkpoints for them.
if (old_size < variants.size())
{
nested.resize(variants.size());
for (size_t i = old_size; i < variants.size(); ++i)
nested[i] = variants[i]->getCheckpoint();
}

checkpoint.size = size();
}

void ColumnDynamic::rollback(const ColumnCheckpoint & checkpoint)
{
const auto & nested = assert_cast<const ColumnCheckpointWithMultipleNested &>(checkpoint).nested;
chassert(nested.size() <= variant_column_ptr->getNumVariants());

/// The structure hasn't changed, so we can use generic rollback of Variant column
if (nested.size() == variant_column_ptr->getNumVariants())
{
variant_column_ptr->rollback(checkpoint);
return;
}

/// Manually rollback internals of Variant column
variant_column_ptr->getOffsets().resize_assume_reserved(checkpoint.size);
variant_column_ptr->getLocalDiscriminators().resize_assume_reserved(checkpoint.size);

auto & variants = variant_column_ptr->getVariants();
for (size_t i = 0; i < nested.size(); ++i)
variants[i]->rollback(*nested[i]);

/// Keep the structure of variant as is but rollback
/// to 0 variants that are not in the checkpoint.
for (size_t i = nested.size(); i < variants.size(); ++i)
variants[i] = variants[i]->cloneEmpty();
}

String ColumnDynamic::getTypeNameAt(size_t row_num) const
{
const auto & variant_col = getVariantColumn();

@ -304,6 +304,15 @@ public:
variant_column_ptr->protect();
}

ColumnCheckpointPtr getCheckpoint() const override
{
return variant_column_ptr->getCheckpoint();
}

void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;

void rollback(const ColumnCheckpoint & checkpoint) override;

void forEachSubcolumn(MutableColumnCallback callback) override
{
callback(variant_column);

@ -312,6 +312,21 @@ void ColumnMap::getExtremes(Field & min, Field & max) const
max = std::move(map_max_value);
}

ColumnCheckpointPtr ColumnMap::getCheckpoint() const
{
return nested->getCheckpoint();
}

void ColumnMap::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
nested->updateCheckpoint(checkpoint);
}

void ColumnMap::rollback(const ColumnCheckpoint & checkpoint)
{
nested->rollback(checkpoint);
}

void ColumnMap::forEachSubcolumn(MutableColumnCallback callback)
{
callback(nested);

@ -102,6 +102,9 @@ public:
size_t byteSizeAt(size_t n) const override;
size_t allocatedBytes() const override;
void protect() override;
ColumnCheckpointPtr getCheckpoint() const override;
void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;
void rollback(const ColumnCheckpoint & checkpoint) override;
void forEachSubcolumn(MutableColumnCallback callback) override;
void forEachSubcolumnRecursively(RecursiveMutableColumnCallback callback) override;
bool structureEquals(const IColumn & rhs) const override;

@ -302,6 +302,23 @@ void ColumnNullable::popBack(size_t n)
getNullMapColumn().popBack(n);
}

ColumnCheckpointPtr ColumnNullable::getCheckpoint() const
{
return std::make_shared<ColumnCheckpointWithNested>(size(), nested_column->getCheckpoint());
}

void ColumnNullable::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
checkpoint.size = size();
nested_column->updateCheckpoint(*assert_cast<ColumnCheckpointWithNested &>(checkpoint).nested);
}

void ColumnNullable::rollback(const ColumnCheckpoint & checkpoint)
{
getNullMapData().resize_assume_reserved(checkpoint.size);
nested_column->rollback(*assert_cast<const ColumnCheckpointWithNested &>(checkpoint).nested);
}

ColumnPtr ColumnNullable::filter(const Filter & filt, ssize_t result_size_hint) const
{
ColumnPtr filtered_data = getNestedColumn().filter(filt, result_size_hint);

@ -143,6 +143,10 @@ public:

ColumnPtr compress() const override;

ColumnCheckpointPtr getCheckpoint() const override;
void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;
void rollback(const ColumnCheckpoint & checkpoint) override;

void forEachSubcolumn(MutableColumnCallback callback) override
{
callback(nested_column);

@ -30,6 +30,23 @@ const std::shared_ptr<SerializationDynamic> & getDynamicSerialization()
return dynamic_serialization;
}

struct ColumnObjectCheckpoint : public ColumnCheckpoint
{
using CheckpointsMap = std::unordered_map<std::string_view, ColumnCheckpointPtr>;

ColumnObjectCheckpoint(size_t size_, CheckpointsMap typed_paths_, CheckpointsMap dynamic_paths_, ColumnCheckpointPtr shared_data_)
: ColumnCheckpoint(size_)
, typed_paths(std::move(typed_paths_))
, dynamic_paths(std::move(dynamic_paths_))
, shared_data(std::move(shared_data_))
{
}

CheckpointsMap typed_paths;
CheckpointsMap dynamic_paths;
ColumnCheckpointPtr shared_data;
};

}

ColumnObject::ColumnObject(

@ -698,6 +715,69 @@ void ColumnObject::popBack(size_t n)
shared_data->popBack(n);
}

ColumnCheckpointPtr ColumnObject::getCheckpoint() const
{
auto get_checkpoints = [](const auto & columns)
{
ColumnObjectCheckpoint::CheckpointsMap checkpoints;
for (const auto & [name, column] : columns)
checkpoints[name] = column->getCheckpoint();

return checkpoints;
};

return std::make_shared<ColumnObjectCheckpoint>(size(), get_checkpoints(typed_paths), get_checkpoints(dynamic_paths_ptrs), shared_data->getCheckpoint());
}

void ColumnObject::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
auto & object_checkpoint = assert_cast<ColumnObjectCheckpoint &>(checkpoint);

auto update_checkpoints = [&](const auto & columns_map, auto & checkpoints_map)
{
for (const auto & [name, column] : columns_map)
{
auto & nested = checkpoints_map[name];
if (!nested)
nested = column->getCheckpoint();
else
column->updateCheckpoint(*nested);
}
};

checkpoint.size = size();
update_checkpoints(typed_paths, object_checkpoint.typed_paths);
update_checkpoints(dynamic_paths, object_checkpoint.dynamic_paths);
shared_data->updateCheckpoint(*object_checkpoint.shared_data);
}

void ColumnObject::rollback(const ColumnCheckpoint & checkpoint)
{
const auto & object_checkpoint = assert_cast<const ColumnObjectCheckpoint &>(checkpoint);

auto rollback_columns = [&](auto & columns_map, const auto & checkpoints_map)
{
NameSet names_to_remove;

/// Rollback subcolumns and remove paths that were not in checkpoint.
for (auto & [name, column] : columns_map)
{
auto it = checkpoints_map.find(name);
if (it == checkpoints_map.end())
names_to_remove.insert(name);
else
column->rollback(*it->second);
}

for (const auto & name : names_to_remove)
columns_map.erase(name);
};

rollback_columns(typed_paths, object_checkpoint.typed_paths);
rollback_columns(dynamic_paths, object_checkpoint.dynamic_paths);
shared_data->rollback(*object_checkpoint.shared_data);
}

StringRef ColumnObject::serializeValueIntoArena(size_t n, Arena & arena, const char *& begin) const
{
StringRef res(begin, 0);

@ -161,6 +161,9 @@ public:
size_t byteSizeAt(size_t n) const override;
size_t allocatedBytes() const override;
void protect() override;
ColumnCheckpointPtr getCheckpoint() const override;
void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;
void rollback(const ColumnCheckpoint & checkpoint) override;

void forEachSubcolumn(MutableColumnCallback callback) override;

@ -308,6 +308,28 @@ void ColumnSparse::popBack(size_t n)
_size = new_size;
}

ColumnCheckpointPtr ColumnSparse::getCheckpoint() const
{
return std::make_shared<ColumnCheckpointWithNested>(size(), values->getCheckpoint());
}

void ColumnSparse::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
checkpoint.size = size();
values->updateCheckpoint(*assert_cast<ColumnCheckpointWithNested &>(checkpoint).nested);
}

void ColumnSparse::rollback(const ColumnCheckpoint & checkpoint)
{
_size = checkpoint.size;

const auto & nested = *assert_cast<const ColumnCheckpointWithNested &>(checkpoint).nested;
chassert(nested.size > 0);

values->rollback(nested);
getOffsetsData().resize_assume_reserved(nested.size - 1);
}

ColumnPtr ColumnSparse::filter(const Filter & filt, ssize_t) const
{
if (_size != filt.size())

@ -149,6 +149,10 @@ public:

ColumnPtr compress() const override;

ColumnCheckpointPtr getCheckpoint() const override;
void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;
void rollback(const ColumnCheckpoint & checkpoint) override;

void forEachSubcolumn(MutableColumnCallback callback) override;
void forEachSubcolumnRecursively(RecursiveMutableColumnCallback callback) override;

@ -240,6 +240,23 @@ ColumnPtr ColumnString::permute(const Permutation & perm, size_t limit) const
return permuteImpl(*this, perm, limit);
}

ColumnCheckpointPtr ColumnString::getCheckpoint() const
{
auto nested = std::make_shared<ColumnCheckpoint>(chars.size());
return std::make_shared<ColumnCheckpointWithNested>(size(), std::move(nested));
}

void ColumnString::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
checkpoint.size = size();
assert_cast<ColumnCheckpointWithNested &>(checkpoint).nested->size = chars.size();
}

void ColumnString::rollback(const ColumnCheckpoint & checkpoint)
{
offsets.resize_assume_reserved(checkpoint.size);
chars.resize_assume_reserved(assert_cast<const ColumnCheckpointWithNested &>(checkpoint).nested->size);
}

void ColumnString::collectSerializedValueSizes(PaddedPODArray<UInt64> & sizes, const UInt8 * is_null) const
{

@ -194,6 +194,10 @@ public:
offsets.resize_assume_reserved(offsets.size() - n);
}

ColumnCheckpointPtr getCheckpoint() const override;
void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;
void rollback(const ColumnCheckpoint & checkpoint) override;

void collectSerializedValueSizes(PaddedPODArray<UInt64> & sizes, const UInt8 * is_null) const override;

StringRef serializeValueIntoArena(size_t n, Arena & arena, char const *& begin) const override;

@ -254,6 +254,37 @@ void ColumnTuple::popBack(size_t n)
column->popBack(n);
}

ColumnCheckpointPtr ColumnTuple::getCheckpoint() const
{
ColumnCheckpoints checkpoints;
checkpoints.reserve(columns.size());

for (const auto & column : columns)
checkpoints.push_back(column->getCheckpoint());

return std::make_shared<ColumnCheckpointWithMultipleNested>(size(), std::move(checkpoints));
}

void ColumnTuple::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
auto & checkpoints = assert_cast<ColumnCheckpointWithMultipleNested &>(checkpoint).nested;
chassert(checkpoints.size() == columns.size());

checkpoint.size = size();
for (size_t i = 0; i < columns.size(); ++i)
columns[i]->updateCheckpoint(*checkpoints[i]);
}

void ColumnTuple::rollback(const ColumnCheckpoint & checkpoint)
{
column_length = checkpoint.size;
const auto & checkpoints = assert_cast<const ColumnCheckpointWithMultipleNested &>(checkpoint).nested;

chassert(columns.size() == checkpoints.size());
for (size_t i = 0; i < columns.size(); ++i)
columns[i]->rollback(*checkpoints[i]);
}

StringRef ColumnTuple::serializeValueIntoArena(size_t n, Arena & arena, char const *& begin) const
{
if (columns.empty())

@ -118,6 +118,9 @@ public:
size_t byteSizeAt(size_t n) const override;
size_t allocatedBytes() const override;
void protect() override;
ColumnCheckpointPtr getCheckpoint() const override;
void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;
void rollback(const ColumnCheckpoint & checkpoint) override;
void forEachSubcolumn(MutableColumnCallback callback) override;
void forEachSubcolumnRecursively(RecursiveMutableColumnCallback callback) override;
bool structureEquals(const IColumn & rhs) const override;

@ -739,6 +739,39 @@ void ColumnVariant::popBack(size_t n)
offsets->popBack(n);
}

ColumnCheckpointPtr ColumnVariant::getCheckpoint() const
{
ColumnCheckpoints checkpoints;
checkpoints.reserve(variants.size());

for (const auto & column : variants)
checkpoints.push_back(column->getCheckpoint());

return std::make_shared<ColumnCheckpointWithMultipleNested>(size(), std::move(checkpoints));
}

void ColumnVariant::updateCheckpoint(ColumnCheckpoint & checkpoint) const
{
auto & checkpoints = assert_cast<ColumnCheckpointWithMultipleNested &>(checkpoint).nested;
chassert(checkpoints.size() == variants.size());

checkpoint.size = size();
for (size_t i = 0; i < variants.size(); ++i)
variants[i]->updateCheckpoint(*checkpoints[i]);
}

void ColumnVariant::rollback(const ColumnCheckpoint & checkpoint)
{
getOffsets().resize_assume_reserved(checkpoint.size);
getLocalDiscriminators().resize_assume_reserved(checkpoint.size);

const auto & checkpoints = assert_cast<const ColumnCheckpointWithMultipleNested &>(checkpoint).nested;
chassert(variants.size() == checkpoints.size());

for (size_t i = 0; i < variants.size(); ++i)
variants[i]->rollback(*checkpoints[i]);
}

StringRef ColumnVariant::serializeValueIntoArena(size_t n, Arena & arena, char const *& begin) const
{
/// During any serialization/deserialization we should always use global discriminators.

@@ -248,6 +248,9 @@ public:
    size_t byteSizeAt(size_t n) const override;
    size_t allocatedBytes() const override;
    void protect() override;
    ColumnCheckpointPtr getCheckpoint() const override;
    void updateCheckpoint(ColumnCheckpoint & checkpoint) const override;
    void rollback(const ColumnCheckpoint & checkpoint) override;
    void forEachSubcolumn(MutableColumnCallback callback) override;
    void forEachSubcolumnRecursively(RecursiveMutableColumnCallback callback) override;
    bool structureEquals(const IColumn & rhs) const override;
@@ -49,6 +49,40 @@ struct EqualRange

using EqualRanges = std::vector<EqualRange>;

/// A checkpoint that contains size of column and all its subcolumns.
/// It can be used to roll back the column to a previous state, for example
/// after failed parsing when the column may be in an inconsistent state.
struct ColumnCheckpoint
{
    size_t size;

    explicit ColumnCheckpoint(size_t size_) : size(size_) {}
    virtual ~ColumnCheckpoint() = default;
};

using ColumnCheckpointPtr = std::shared_ptr<ColumnCheckpoint>;
using ColumnCheckpoints = std::vector<ColumnCheckpointPtr>;

struct ColumnCheckpointWithNested : public ColumnCheckpoint
{
    ColumnCheckpointWithNested(size_t size_, ColumnCheckpointPtr nested_)
        : ColumnCheckpoint(size_), nested(std::move(nested_))
    {
    }

    ColumnCheckpointPtr nested;
};

struct ColumnCheckpointWithMultipleNested : public ColumnCheckpoint
{
    ColumnCheckpointWithMultipleNested(size_t size_, ColumnCheckpoints nested_)
        : ColumnCheckpoint(size_), nested(std::move(nested_))
    {
    }

    ColumnCheckpoints nested;
};

/// Declares interface to store columns in memory.
class IColumn : public COW<IColumn>
{

@@ -509,6 +543,17 @@ public:
    /// The operation is slow and performed only for debug builds.
    virtual void protect() {}

    /// Returns checkpoint of current state of column.
    virtual ColumnCheckpointPtr getCheckpoint() const { return std::make_shared<ColumnCheckpoint>(size()); }

    /// Updates the checkpoint with current state. It is used to avoid extra allocations in 'getCheckpoint'.
    virtual void updateCheckpoint(ColumnCheckpoint & checkpoint) const { checkpoint.size = size(); }

    /// Rolls back the column to the checkpoint.
    /// Unlike 'popBack' this method should work correctly even if the column has invalid state.
    /// Sizes of columns in the checkpoint must be less than or equal to the current size.
    virtual void rollback(const ColumnCheckpoint & checkpoint) { popBack(size() - checkpoint.size); }

    /// If the column contains subcolumns (such as Array, Nullable, etc), do callback on them.
    /// Shallow: doesn't do recursive calls; doesn't call it for itself.
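Editor's note: a minimal usage sketch of the checkpoint API above. The `IColumn` calls are exactly the ones declared in this hunk; `tryParseRowInto` is a hypothetical stand-in for any insertion that can fail midway and leave subcolumns inconsistent.

#include <Columns/IColumn.h>

void tryParseRowInto(DB::IColumn & column); // hypothetical: may throw halfway through a row

void insertRowsSafely(DB::IColumn & column, size_t rows)
{
    auto checkpoint = column.getCheckpoint();       // remember a consistent state
    for (size_t i = 0; i < rows; ++i)
    {
        try
        {
            tryParseRowInto(column);
            column.updateCheckpoint(*checkpoint);   // advance in place, no extra allocation
        }
        catch (...)
        {
            column.rollback(*checkpoint);           // restore the column and all subcolumns
            throw;
        }
    }
}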
@@ -920,3 +920,71 @@ TEST(ColumnDynamic, compare)
    ASSERT_EQ(column_from->compareAt(3, 2, *column_from, -1), -1);
    ASSERT_EQ(column_from->compareAt(3, 4, *column_from, -1), -1);
}

TEST(ColumnDynamic, rollback)
{
    auto check_variant = [](const ColumnVariant & column_variant, std::vector<size_t> sizes)
    {
        ASSERT_EQ(column_variant.getNumVariants(), sizes.size());
        size_t num_rows = 0;

        for (size_t i = 0; i < sizes.size(); ++i)
        {
            ASSERT_EQ(column_variant.getVariants()[i]->size(), sizes[i]);
            num_rows += sizes[i];
        }

        ASSERT_EQ(num_rows, column_variant.size());
    };

    auto check_checkpoint = [&](const ColumnCheckpoint & cp, std::vector<size_t> sizes)
    {
        const auto & nested = assert_cast<const ColumnCheckpointWithMultipleNested &>(cp).nested;
        size_t num_rows = 0;

        for (size_t i = 0; i < nested.size(); ++i)
        {
            ASSERT_EQ(nested[i]->size, sizes[i]);
            num_rows += sizes[i];
        }

        ASSERT_EQ(num_rows, cp.size);
    };

    std::vector<std::pair<ColumnCheckpointPtr, std::vector<size_t>>> checkpoints;

    auto column = ColumnDynamic::create(2);
    auto checkpoint = column->getCheckpoint();

    column->insert(Field(42));

    column->updateCheckpoint(*checkpoint);
    checkpoints.emplace_back(checkpoint, std::vector<size_t>{0, 1, 0});

    column->insert(Field("str1"));
    column->rollback(*checkpoint);

    check_checkpoint(*checkpoint, checkpoints.back().second);
    check_variant(column->getVariantColumn(), checkpoints.back().second);

    column->insert("str1");
    checkpoints.emplace_back(column->getCheckpoint(), std::vector<size_t>{0, 1, 1});

    column->insert("str2");
    checkpoints.emplace_back(column->getCheckpoint(), std::vector<size_t>{0, 1, 2});

    column->insert(Array({1, 2}));
    checkpoints.emplace_back(column->getCheckpoint(), std::vector<size_t>{1, 1, 2});

    column->insert(Field(42.42));
    checkpoints.emplace_back(column->getCheckpoint(), std::vector<size_t>{2, 1, 2});

    for (const auto & [cp, sizes] : checkpoints)
    {
        auto column_copy = column->clone();
        column_copy->rollback(*cp);

        check_checkpoint(*cp, sizes);
        check_variant(assert_cast<const ColumnDynamic &>(*column_copy).getVariantColumn(), sizes);
    }
}
@@ -5,6 +5,7 @@
#include <IO/WriteBufferFromString.h>

#include <Common/Arena.h>
#include "Core/Field.h"
#include <gtest/gtest.h>

using namespace DB;

@@ -349,3 +350,65 @@ TEST(ColumnObject, SkipSerializedInArena)
    pos = col2->skipSerializedInArena(pos);
    ASSERT_EQ(pos, end);
}

TEST(ColumnObject, rollback)
{
    auto type = DataTypeFactory::instance().get("JSON(max_dynamic_types=10, max_dynamic_paths=2, a.a UInt32, a.b UInt32)");
    auto col = type->createColumn();
    auto & col_object = assert_cast<ColumnObject &>(*col);
    const auto & typed_paths = col_object.getTypedPaths();
    const auto & dynamic_paths = col_object.getDynamicPaths();
    const auto & shared_data = col_object.getSharedDataColumn();

    auto assert_sizes = [&](size_t size)
    {
        for (const auto & [name, column] : typed_paths)
            ASSERT_EQ(column->size(), size);

        for (const auto & [name, column] : dynamic_paths)
            ASSERT_EQ(column->size(), size);

        ASSERT_EQ(shared_data.size(), size);
    };

    auto checkpoint = col_object.getCheckpoint();

    col_object.insert(Object{{"a.a", Field{1u}}});
    col_object.updateCheckpoint(*checkpoint);

    col_object.insert(Object{{"a.b", Field{2u}}});
    col_object.insert(Object{{"a.a", Field{3u}}});

    col_object.rollback(*checkpoint);

    assert_sizes(1);
    ASSERT_EQ(typed_paths.size(), 2);
    ASSERT_EQ(dynamic_paths.size(), 0);

    ASSERT_EQ((*typed_paths.at("a.a"))[0], Field{1u});
    ASSERT_EQ((*typed_paths.at("a.b"))[0], Field{0u});

    col_object.insert(Object{{"a.c", Field{"ccc"}}});

    checkpoint = col_object.getCheckpoint();

    col_object.insert(Object{{"a.d", Field{"ddd"}}});
    col_object.insert(Object{{"a.e", Field{"eee"}}});

    assert_sizes(4);
    ASSERT_EQ(typed_paths.size(), 2);
    ASSERT_EQ(dynamic_paths.size(), 2);

    ASSERT_EQ((*typed_paths.at("a.a"))[0], Field{1u});
    ASSERT_EQ((*dynamic_paths.at("a.c"))[1], Field{"ccc"});
    ASSERT_EQ((*dynamic_paths.at("a.d"))[2], Field{"ddd"});

    col_object.rollback(*checkpoint);

    assert_sizes(2);
    ASSERT_EQ(typed_paths.size(), 2);
    ASSERT_EQ(dynamic_paths.size(), 1);

    ASSERT_EQ((*typed_paths.at("a.a"))[0], Field{1u});
    ASSERT_EQ((*dynamic_paths.at("a.c"))[1], Field{"ccc"});
}
@@ -183,8 +183,14 @@
    M(BuildVectorSimilarityIndexThreadsScheduled, "Number of queued or active jobs in the build vector similarity index thread pool.") \
    \
    M(DiskPlainRewritableAzureDirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \
    M(DiskPlainRewritableAzureFileCount, "Number of file entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \
    M(DiskPlainRewritableAzureUniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for AzureObjectStorage.") \
    M(DiskPlainRewritableLocalDirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \
    M(DiskPlainRewritableLocalFileCount, "Number of file entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \
    M(DiskPlainRewritableLocalUniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for LocalObjectStorage.") \
    M(DiskPlainRewritableS3DirectoryMapSize, "Number of local-to-remote path entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \
    M(DiskPlainRewritableS3FileCount, "Number of file entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \
    M(DiskPlainRewritableS3UniqueFileNamesCount, "Number of unique file name entries in the 'plain_rewritable' in-memory map for S3ObjectStorage.") \
    \
    M(MergeTreePartsLoaderThreads, "Number of threads in the MergeTree parts loader thread pool.") \
    M(MergeTreePartsLoaderThreadsActive, "Number of threads in the MergeTree parts loader thread pool running a task.") \
@@ -57,7 +57,7 @@ static bool guarded_alloc_initialized = []
    opts.MaxSimultaneousAllocations = 1024;

    if (!env_options_raw || !std::string_view{env_options_raw}.contains("SampleRate"))
        opts.SampleRate = 10000;
        opts.SampleRate = 0;

    const char * collect_stacktraces = std::getenv("GWP_ASAN_COLLECT_STACKTRACES"); // NOLINT(concurrency-mt-unsafe)
    if (collect_stacktraces && std::string_view{collect_stacktraces} == "1")
@@ -8,7 +8,6 @@
#include <Common/thread_local_rng.h>

#include <atomic>
#include <random>

namespace GWPAsan
{

@@ -39,14 +38,6 @@ inline bool shouldSample()
    return init_finished.load(std::memory_order_relaxed) && GuardedAlloc.shouldSample();
}

inline bool shouldForceSample()
{
    if (!init_finished.load(std::memory_order_relaxed))
        return false;
    std::bernoulli_distribution dist(force_sample_probability.load(std::memory_order_relaxed));
    return dist(thread_local_rng);
}

}

#endif
@@ -68,15 +68,15 @@ inline std::string_view toDescription(OvercommitResult result)
        case OvercommitResult::NONE:
            return "";
        case OvercommitResult::DISABLED:
            return "Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero.";
            return "Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero";
        case OvercommitResult::MEMORY_FREED:
            throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "OvercommitResult::MEMORY_FREED shouldn't be asked for description");
        case OvercommitResult::SELECTED:
            return "Query was selected to stop by OvercommitTracker.";
            return "Query was selected to stop by OvercommitTracker";
        case OvercommitResult::TIMEOUTED:
            return "Waiting timeout for memory to be freed is reached.";
            return "Waiting timeout for memory to be freed is reached";
        case OvercommitResult::NOT_ENOUGH_FREED:
            return "Memory overcommit has freed not enough memory.";
            return "Memory overcommit has not freed enough memory";
    }
}
@@ -150,15 +150,23 @@ void MemoryTracker::logPeakMemoryUsage()
    auto peak_bytes = peak.load(std::memory_order::relaxed);
    if (peak_bytes < 128 * 1024)
        return;
    LOG_DEBUG(getLogger("MemoryTracker"),
        "Peak memory usage{}: {}.", (description ? " " + std::string(description) : ""), ReadableSize(peak_bytes));
    LOG_DEBUG(
        getLogger("MemoryTracker"),
        "{}{} memory usage: {}.",
        description ? std::string(description) : "",
        description ? " peak" : "Peak",
        ReadableSize(peak_bytes));
}

void MemoryTracker::logMemoryUsage(Int64 current) const
{
    const auto * description = description_ptr.load(std::memory_order_relaxed);
    LOG_DEBUG(getLogger("MemoryTracker"),
        "Current memory usage{}: {}.", (description ? " " + std::string(description) : ""), ReadableSize(current));
    LOG_DEBUG(
        getLogger("MemoryTracker"),
        "{}{} memory usage: {}.",
        description ? std::string(description) : "",
        description ? " current" : "Current",
        ReadableSize(current));
}

void MemoryTracker::injectFault() const

@@ -178,9 +186,9 @@ void MemoryTracker::injectFault() const
    const auto * description = description_ptr.load(std::memory_order_relaxed);
    throw DB::Exception(
        DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
        "Memory tracker{}{}: fault injected (at specific point)",
        description ? " " : "",
        description ? description : "");
        "{}{}: fault injected (at specific point)",
        description ? description : "",
        description ? " memory tracker" : "Memory tracker");
}

void MemoryTracker::debugLogBigAllocationWithoutCheck(Int64 size [[maybe_unused]])

@@ -282,9 +290,9 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed
        const auto * description = description_ptr.load(std::memory_order_relaxed);
        throw DB::Exception(
            DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
            "Memory tracker{}{}: fault injected. Would use {} (attempt to allocate chunk of {} bytes), maximum: {}",
            description ? " " : "",
            "{}{}: fault injected. Would use {} (attempt to allocate chunk of {} bytes), maximum: {}",
            description ? description : "",
            description ? " memory tracker" : "Memory tracker",
            formatReadableSizeWithBinarySuffix(will_be),
            size,
            formatReadableSizeWithBinarySuffix(current_hard_limit));

@@ -305,6 +313,8 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed

    if (overcommit_result != OvercommitResult::MEMORY_FREED)
    {
        bool overcommit_result_ignore
            = overcommit_result == OvercommitResult::NONE || overcommit_result == OvercommitResult::DISABLED;
        /// Revert
        amount.fetch_sub(size, std::memory_order_relaxed);
        rss.fetch_sub(size, std::memory_order_relaxed);

@@ -314,18 +324,18 @@ AllocationTrace MemoryTracker::allocImpl(Int64 size, bool throw_if_memory_exceed
        ProfileEvents::increment(ProfileEvents::QueryMemoryLimitExceeded);
        const auto * description = description_ptr.load(std::memory_order_relaxed);
        throw DB::Exception(
            DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
            "Memory limit{}{} exceeded: "
            "would use {} (attempt to allocate chunk of {} bytes), current RSS {}, maximum: {}."
            "{}{}",
            description ? " " : "",
            description ? description : "",
            formatReadableSizeWithBinarySuffix(will_be),
            size,
            formatReadableSizeWithBinarySuffix(rss.load(std::memory_order_relaxed)),
            formatReadableSizeWithBinarySuffix(current_hard_limit),
            overcommit_result == OvercommitResult::NONE ? "" : " OvercommitTracker decision: ",
            toDescription(overcommit_result));
            DB::ErrorCodes::MEMORY_LIMIT_EXCEEDED,
            "{}{} exceeded: "
            "would use {} (attempt to allocate chunk of {} bytes), current RSS {}, maximum: {}."
            "{}{}",
            description ? description : "",
            description ? " memory limit" : "Memory limit",
            formatReadableSizeWithBinarySuffix(will_be),
            size,
            formatReadableSizeWithBinarySuffix(rss.load(std::memory_order_relaxed)),
            formatReadableSizeWithBinarySuffix(current_hard_limit),
            overcommit_result_ignore ? "" : " OvercommitTracker decision: ",
            overcommit_result_ignore ? "" : toDescription(overcommit_result));
    }

    // If OvercommitTracker::needToStopQuery returned false, it guarantees that enough memory is freed.
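Editor's note: the reworked wording above can be illustrated standalone. The sketch below is an illustration only (assuming the fmt library, which ClickHouse already uses for message formatting); the sample description "(for query)" is a plausible value, not taken from this diff. It shows how the ternary pair yields either a capitalized default or a description-prefixed message.

#include <fmt/format.h>
#include <iostream>
#include <string>

// Mirrors the format used in MemoryTracker::logPeakMemoryUsage above.
static std::string formatPeak(const char * description, const std::string & size)
{
    return fmt::format("{}{} memory usage: {}.",
        description ? std::string(description) : "",
        description ? " peak" : "Peak",
        size);
}

int main()
{
    std::cout << formatPeak(nullptr, "1.00 GiB") << '\n';       // "Peak memory usage: 1.00 GiB."
    std::cout << formatPeak("(for query)", "1.00 GiB") << '\n'; // "(for query) peak memory usage: 1.00 GiB."
}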
@@ -115,11 +115,6 @@ protected:
    template <typename ... TAllocatorParams>
    void alloc(size_t bytes, TAllocatorParams &&... allocator_params)
    {
#if USE_GWP_ASAN
        if (unlikely(GWPAsan::shouldForceSample()))
            gwp_asan::getThreadLocals()->NextSampleCounter = 1;
#endif

        char * allocated = reinterpret_cast<char *>(TAllocator::alloc(bytes, std::forward<TAllocatorParams>(allocator_params)...));

        c_start = allocated + pad_left;

@@ -149,11 +144,6 @@ protected:
            return;
        }

#if USE_GWP_ASAN
        if (unlikely(GWPAsan::shouldForceSample()))
            gwp_asan::getThreadLocals()->NextSampleCounter = 1;
#endif

        unprotect();

        ptrdiff_t end_diff = c_end - c_start;
@@ -6,6 +6,7 @@
/// Separate type (rather than `Int64`) is used just to avoid implicit conversion errors and to default-initialize
struct Priority
{
    Int64 value = 0; /// Note that lower value means higher priority.
    constexpr operator Int64() const { return value; } /// NOLINT
    using Value = Int64;
    Value value = 0; /// Note that lower value means higher priority.
    constexpr operator Value() const { return value; } /// NOLINT
};
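Editor's note: a small self-contained illustration of the `Value` alias and the ordering convention (lower value = higher priority). The struct is restated locally so the snippet compiles on its own; it mirrors the header above rather than including it.

#include <cstdint>
#include <cassert>

struct Priority
{
    using Value = int64_t;
    Value value = 0;
    constexpr operator Value() const { return value; } // implicit conversion enables built-in comparisons
};

int main()
{
    Priority normal{};            // default priority 0
    Priority urgent{.value = -1}; // lower value wins
    assert(urgent < normal);      // served first by the scheduler
}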
@@ -26,6 +26,9 @@ class IClassifier : private boost::noncopyable
public:
    virtual ~IClassifier() = default;

    /// Returns true iff resource access is allowed by this classifier
    virtual bool has(const String & resource_name) = 0;

    /// Returns ResourceLink that should be used to access resource.
    /// Returned link is valid until classifier destruction.
    virtual ResourceLink get(const String & resource_name) = 0;

@@ -46,12 +49,15 @@ public:
    /// Initialize or reconfigure manager.
    virtual void updateConfiguration(const Poco::Util::AbstractConfiguration & config) = 0;

    /// Returns true iff given resource is controlled through this manager.
    virtual bool hasResource(const String & resource_name) const = 0;

    /// Obtain a classifier instance required to get access to resources.
    /// Note that it holds resource configuration, so should be destructed when query is done.
    virtual ClassifierPtr acquire(const String & classifier_name) = 0;

    /// For introspection, see `system.scheduler` table
    using VisitorFunc = std::function<void(const String & resource, const String & path, const String & type, const SchedulerNodePtr & node)>;
    using VisitorFunc = std::function<void(const String & resource, const String & path, ISchedulerNode * node)>;
    virtual void forEachNode(VisitorFunc visitor) = 0;
};
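Editor's note: the new `VisitorFunc` signature passes a raw `ISchedulerNode *` instead of a type string plus shared pointer (the type can now be queried via `getTypeName()`, shown elsewhere in this diff). A hypothetical introspection visitor, assuming `manager` is some `IResourceManager` instance, might look like:

// Hypothetical visitor matching the new VisitorFunc signature (illustration only).
manager->forEachNode([](const String & resource, const String & path, ISchedulerNode * node)
{
    LOG_DEBUG(getLogger("SchedulerDump"), "resource={} path={} type={}",
        resource, path, node->getTypeName());
});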
@@ -15,8 +15,7 @@ namespace DB
 * When constraint is again satisfied, scheduleActivation() is called from finishRequest().
 *
 * Derived class behaviour requirements:
 * - dequeueRequest() must fill `request->constraint` iff it is nullptr;
 * - finishRequest() must be recursive: call to `parent_constraint->finishRequest()`.
 * - dequeueRequest() must call `request->addConstraint()`.
 */
class ISchedulerConstraint : public ISchedulerNode
{

@@ -25,34 +24,16 @@ public:
        : ISchedulerNode(event_queue_, config, config_prefix)
    {}

    ISchedulerConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
        : ISchedulerNode(event_queue_, info_)
    {}

    /// Resource consumption by `request` is finished.
    /// Should be called outside of scheduling subsystem, implementation must be thread-safe.
    virtual void finishRequest(ResourceRequest * request) = 0;

    void setParent(ISchedulerNode * parent_) override
    {
        ISchedulerNode::setParent(parent_);

        // Assign `parent_constraint` to the nearest parent derived from ISchedulerConstraint
        for (ISchedulerNode * node = parent_; node != nullptr; node = node->parent)
        {
            if (auto * constraint = dynamic_cast<ISchedulerConstraint *>(node))
            {
                parent_constraint = constraint;
                break;
            }
        }
    }

    /// For introspection of current state (true = satisfied, false = violated)
    virtual bool isSatisfied() = 0;

protected:
    // Reference to nearest parent that is also derived from ISchedulerConstraint.
    // Request can traverse through multiple constraints while being dequeued from the hierarchy,
    // while finishing request should traverse the same chain in reverse order.
    // NOTE: it must be immutable after initialization, because it is accessed in not thread-safe way from finishRequest()
    ISchedulerConstraint * parent_constraint = nullptr;
};

}
@@ -57,7 +57,13 @@ struct SchedulerNodeInfo

    SchedulerNodeInfo() = default;

    explicit SchedulerNodeInfo(const Poco::Util::AbstractConfiguration & config = emptyConfig(), const String & config_prefix = {})
    explicit SchedulerNodeInfo(double weight_, Priority priority_ = {})
    {
        setWeight(weight_);
        setPriority(priority_);
    }

    explicit SchedulerNodeInfo(const Poco::Util::AbstractConfiguration & config, const String & config_prefix = {})
    {
        setWeight(config.getDouble(config_prefix + ".weight", weight));
        setPriority(config.getInt64(config_prefix + ".priority", priority));

@@ -68,7 +74,7 @@ struct SchedulerNodeInfo
        if (value <= 0 || !isfinite(value))
            throw Exception(
                ErrorCodes::INVALID_SCHEDULER_NODE,
                "Negative and non-finite node weights are not allowed: {}",
                "Zero, negative and non-finite node weights are not allowed: {}",
                value);
        weight = value;
    }

@@ -78,6 +84,11 @@ struct SchedulerNodeInfo
        priority.value = value;
    }

    void setPriority(Priority value)
    {
        priority = value;
    }

    // To check if configuration update required
    bool equals(const SchedulerNodeInfo & o) const
    {

@@ -123,7 +134,14 @@ public:
        , info(config, config_prefix)
    {}

    virtual ~ISchedulerNode() = default;
    ISchedulerNode(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
        : event_queue(event_queue_)
        , info(info_)
    {}

    virtual ~ISchedulerNode();

    virtual const String & getTypeName() const = 0;

    /// Checks if two nodes configuration is equal
    virtual bool equals(ISchedulerNode * other)

@@ -134,10 +152,11 @@ public:
    /// Attach new child
    virtual void attachChild(const std::shared_ptr<ISchedulerNode> & child) = 0;

    /// Detach and destroy child
    /// Detach child
    /// NOTE: child might be destroyed if the only reference was stored in parent
    virtual void removeChild(ISchedulerNode * child) = 0;

    /// Get attached child by name
    /// Get attached child by name (for tests only)
    virtual ISchedulerNode * getChild(const String & child_name) = 0;

    /// Activation of child due to the first pending request

@@ -147,7 +166,7 @@ public:
    /// Returns true iff node is active
    virtual bool isActive() = 0;

    /// Returns number of active children
    /// Returns number of active children (for introspection only).
    virtual size_t activeChildren() = 0;

    /// Returns the first request to be executed as the first component of resulting pair.

@@ -155,10 +174,10 @@ public:
    virtual std::pair<ResourceRequest *, bool> dequeueRequest() = 0;

    /// Returns full path string using names of every parent
    String getPath()
    String getPath() const
    {
        String result;
        ISchedulerNode * ptr = this;
        const ISchedulerNode * ptr = this;
        while (ptr->parent)
        {
            result = "/" + ptr->basename + result;

@@ -168,10 +187,7 @@ public:
    }

    /// Attach to a parent (used by attachChild)
    virtual void setParent(ISchedulerNode * parent_)
    {
        parent = parent_;
    }
    void setParent(ISchedulerNode * parent_);

protected:
    /// Notify parents about the first pending request or constraint becoming satisfied.

@@ -307,6 +323,15 @@ public:
        pending.notify_one();
    }

    /// Removes an activation from queue
    void cancelActivation(ISchedulerNode * node)
    {
        std::unique_lock lock{mutex};
        if (node->is_linked())
            activations.erase(activations.iterator_to(*node));
        node->activation_event_id = 0;
    }

    /// Process single event if it exists
    /// Note that postponing constraint are ignored, use it to empty the queue including postponed events on shutdown
    /// Returns `true` iff event has been processed

@@ -471,6 +496,20 @@ private:
    std::atomic<TimePoint> manual_time{TimePoint()}; // for tests only
};

inline ISchedulerNode::~ISchedulerNode()
{
    // Make sure there is no dangling reference in activations queue
    event_queue->cancelActivation(this);
}

inline void ISchedulerNode::setParent(ISchedulerNode * parent_)
{
    parent = parent_;
    // Avoid activation of a detached node
    if (parent == nullptr)
        event_queue->cancelActivation(this);
}

inline void ISchedulerNode::scheduleActivation()
{
    if (likely(parent))
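Editor's note: a minimal sketch of the invariant these out-of-line definitions establish (illustration only, not from the patch). Detaching a node now also cancels any activation still sitting in the event queue, so the queue can never dereference a destroyed node:

// Hypothetical teardown path relying on the cancelActivation() calls above.
void detachAndDestroy(std::shared_ptr<ISchedulerNode> & child)
{
    child->setParent(nullptr); // cancels a pending activation, see setParent() above
    child.reset();             // ~ISchedulerNode() cancels again, defensively
}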
@@ -21,6 +21,10 @@ public:
        : ISchedulerNode(event_queue_, config, config_prefix)
    {}

    ISchedulerQueue(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
        : ISchedulerNode(event_queue_, info_)
    {}

    // Wrapper for `enqueueRequest()` that should be used to account for available resource budget
    // Returns `estimated_cost` that should be passed later to `adjustBudget()`
    [[ nodiscard ]] ResourceCost enqueueRequestUsingBudget(ResourceRequest * request)

@@ -47,6 +51,11 @@ public:
    /// Should be called outside of scheduling subsystem, implementation must be thread-safe.
    virtual bool cancelRequest(ResourceRequest * request) = 0;

    /// Fails all the resource requests in queue and marks this queue as not usable.
    /// Afterwards any new request will be failed on `enqueueRequest()`.
    /// NOTE: This is done for queues that are about to be destructed.
    virtual void purgeQueue() = 0;

    /// For introspection
    ResourceCost getBudget() const
    {
@@ -5,11 +5,6 @@
namespace DB
{

namespace ErrorCodes
{
    extern const int RESOURCE_NOT_FOUND;
}

ClassifierDescription::ClassifierDescription(const Poco::Util::AbstractConfiguration & config, const String & config_prefix)
{
    Poco::Util::AbstractConfiguration::Keys keys;

@@ -31,9 +26,11 @@ ClassifiersConfig::ClassifiersConfig(const Poco::Util::AbstractConfiguration & c

const ClassifierDescription & ClassifiersConfig::get(const String & classifier_name)
{
    static ClassifierDescription empty;
    if (auto it = classifiers.find(classifier_name); it != classifiers.end())
        return it->second;
    throw Exception(ErrorCodes::RESOURCE_NOT_FOUND, "Unknown workload classifier '{}' to access resources", classifier_name);
    else
        return empty;
}

}
@@ -10,6 +10,7 @@ namespace DB
/// Mapping of resource name into path string (e.g. "disk1" -> "/path/to/class")
struct ClassifierDescription : std::unordered_map<String, String>
{
    ClassifierDescription() = default;
    ClassifierDescription(const Poco::Util::AbstractConfiguration & config, const String & config_prefix);
};
@@ -1,7 +1,6 @@
#include <Common/Scheduler/Nodes/DynamicResourceManager.h>
#include <Common/Scheduler/Nodes/CustomResourceManager.h>

#include <Common/Scheduler/Nodes/SchedulerNodeFactory.h>
#include <Common/Scheduler/ResourceManagerFactory.h>
#include <Common/Scheduler/ISchedulerQueue.h>

#include <Common/Exception.h>

@@ -21,7 +20,7 @@ namespace ErrorCodes
    extern const int INVALID_SCHEDULER_NODE;
}

DynamicResourceManager::State::State(EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config)
CustomResourceManager::State::State(EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config)
    : classifiers(config)
{
    Poco::Util::AbstractConfiguration::Keys keys;

@@ -35,7 +34,7 @@ DynamicResourceManager::State::State(EventQueue * event_queue, const Poco::Util:
    }
}

DynamicResourceManager::State::Resource::Resource(
CustomResourceManager::State::Resource::Resource(
    const String & name,
    EventQueue * event_queue,
    const Poco::Util::AbstractConfiguration & config,

@@ -92,7 +91,7 @@ DynamicResourceManager::State::Resource::Resource(
        throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "undefined root node path '/' for resource '{}'", name);
}

DynamicResourceManager::State::Resource::~Resource()
CustomResourceManager::State::Resource::~Resource()
{
    // NOTE: we should rely on `attached_to` and cannot use `parent`,
    // NOTE: because `parent` can be `nullptr` in case attachment is still in event queue

@@ -106,14 +105,14 @@ DynamicResourceManager::State::Resource::~Resource()
    }
}

DynamicResourceManager::State::Node::Node(const String & name, EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix)
CustomResourceManager::State::Node::Node(const String & name, EventQueue * event_queue, const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix)
    : type(config.getString(config_prefix + ".type", "fifo"))
    , ptr(SchedulerNodeFactory::instance().get(type, event_queue, config, config_prefix))
{
    ptr->basename = name;
}

bool DynamicResourceManager::State::Resource::equals(const DynamicResourceManager::State::Resource & o) const
bool CustomResourceManager::State::Resource::equals(const CustomResourceManager::State::Resource & o) const
{
    if (nodes.size() != o.nodes.size())
        return false;

@@ -130,14 +129,14 @@ bool DynamicResourceManager::State::Resource::equals(const DynamicResourceManage
    return true;
}

bool DynamicResourceManager::State::Node::equals(const DynamicResourceManager::State::Node & o) const
bool CustomResourceManager::State::Node::equals(const CustomResourceManager::State::Node & o) const
{
    if (type != o.type)
        return false;
    return ptr->equals(o.ptr.get());
}

DynamicResourceManager::Classifier::Classifier(const DynamicResourceManager::StatePtr & state_, const String & classifier_name)
CustomResourceManager::Classifier::Classifier(const CustomResourceManager::StatePtr & state_, const String & classifier_name)
    : state(state_)
{
    // State is immutable, but nodes are mutable and thread-safe

@@ -162,20 +161,25 @@ DynamicResourceManager::Classifier::Classifier(const DynamicResourceManager::Sta
    }
}

ResourceLink DynamicResourceManager::Classifier::get(const String & resource_name)
bool CustomResourceManager::Classifier::has(const String & resource_name)
{
    return resources.contains(resource_name);
}

ResourceLink CustomResourceManager::Classifier::get(const String & resource_name)
{
    if (auto iter = resources.find(resource_name); iter != resources.end())
        return iter->second;
    throw Exception(ErrorCodes::RESOURCE_ACCESS_DENIED, "Access denied to resource '{}'", resource_name);
}

DynamicResourceManager::DynamicResourceManager()
CustomResourceManager::CustomResourceManager()
    : state(new State())
{
    scheduler.start();
}

void DynamicResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration & config)
void CustomResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration & config)
{
    StatePtr new_state = std::make_shared<State>(scheduler.event_queue, config);

@@ -217,7 +221,13 @@ void DynamicResourceManager::updateConfiguration(const Poco::Util::AbstractConfi
    // NOTE: after mutex unlock `state` became available for Classifier(s) and must be immutable
}

ClassifierPtr DynamicResourceManager::acquire(const String & classifier_name)
bool CustomResourceManager::hasResource(const String & resource_name) const
{
    std::lock_guard lock{mutex};
    return state->resources.contains(resource_name);
}

ClassifierPtr CustomResourceManager::acquire(const String & classifier_name)
{
    // Acquire a reference to the current state
    StatePtr state_ref;

@@ -229,7 +239,7 @@ ClassifierPtr DynamicResourceManager::acquire(const String & classifier_name)
    return std::make_shared<Classifier>(state_ref, classifier_name);
}

void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
void CustomResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
{
    // Acquire a reference to the current state
    StatePtr state_ref;

@@ -244,7 +254,7 @@ void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
    {
        for (auto & [name, resource] : state_ref->resources)
            for (auto & [path, node] : resource->nodes)
                visitor(name, path, node.type, node.ptr);
                visitor(name, path, node.ptr.get());
        promise.set_value();
    });

@@ -252,9 +262,4 @@ void DynamicResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
    future.get();
}

void registerDynamicResourceManager(ResourceManagerFactory & factory)
{
    factory.registerMethod<DynamicResourceManager>("dynamic");
}

}
@@ -10,7 +10,9 @@ namespace DB
{

/*
 * Implementation of `IResourceManager` supporting arbitrary dynamic hierarchy of scheduler nodes.
 * Implementation of `IResourceManager` supporting arbitrary hierarchy of scheduler nodes.
 * Scheduling hierarchies for every resource is described through server xml or yaml configuration.
 * Configuration could be changed dynamically without server restart.
 * All resources are controlled by single root `SchedulerRoot`.
 *
 * State of manager is set of resources attached to the scheduler. States are referenced by classifiers.

@@ -24,11 +26,12 @@ namespace DB
 * violation will apply to fairness. Old version exists as long as there is at least one classifier
 * instance referencing it. Classifiers are typically attached to queries and will be destructed with them.
 */
class DynamicResourceManager : public IResourceManager
class CustomResourceManager : public IResourceManager
{
public:
    DynamicResourceManager();
    CustomResourceManager();
    void updateConfiguration(const Poco::Util::AbstractConfiguration & config) override;
    bool hasResource(const String & resource_name) const override;
    ClassifierPtr acquire(const String & classifier_name) override;
    void forEachNode(VisitorFunc visitor) override;

@@ -79,6 +82,7 @@ private:
    {
    public:
        Classifier(const StatePtr & state_, const String & classifier_name);
        bool has(const String & resource_name) override;
        ResourceLink get(const String & resource_name) override;
    private:
        std::unordered_map<String, ResourceLink> resources; // accessible resources by names

@@ -86,7 +90,7 @@ private:
    };

    SchedulerRoot scheduler;
    std::mutex mutex;
    mutable std::mutex mutex;
    StatePtr state;
};
@@ -28,7 +28,7 @@ namespace ErrorCodes
 * of a child is set to vruntime of "start" of the last request. This guarantees immediate processing
 * of at least single request of newly activated children and thus best isolation and scheduling latency.
 */
class FairPolicy : public ISchedulerNode
class FairPolicy final : public ISchedulerNode
{
    /// Scheduling state of a child
    struct Item

@@ -48,6 +48,23 @@ public:
        : ISchedulerNode(event_queue_, config, config_prefix)
    {}

    FairPolicy(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
        : ISchedulerNode(event_queue_, info_)
    {}

    ~FairPolicy() override
    {
        // We need to clear `parent` in all children to avoid dangling references
        while (!children.empty())
            removeChild(children.begin()->second.get());
    }

    const String & getTypeName() const override
    {
        static String type_name("fair");
        return type_name;
    }

    bool equals(ISchedulerNode * other) override
    {
        if (!ISchedulerNode::equals(other))
@@ -23,13 +23,28 @@ namespace ErrorCodes
/*
 * FIFO queue to hold pending resource requests
 */
class FifoQueue : public ISchedulerQueue
class FifoQueue final : public ISchedulerQueue
{
public:
    FifoQueue(EventQueue * event_queue_, const Poco::Util::AbstractConfiguration & config, const String & config_prefix)
        : ISchedulerQueue(event_queue_, config, config_prefix)
    {}

    FifoQueue(EventQueue * event_queue_, const SchedulerNodeInfo & info_)
        : ISchedulerQueue(event_queue_, info_)
    {}

    ~FifoQueue() override
    {
        purgeQueue();
    }

    const String & getTypeName() const override
    {
        static String type_name("fifo");
        return type_name;
    }

    bool equals(ISchedulerNode * other) override
    {
        if (!ISchedulerNode::equals(other))

@@ -42,6 +57,8 @@ public:
    void enqueueRequest(ResourceRequest * request) override
    {
        std::lock_guard lock(mutex);
        if (is_not_usable)
            throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Scheduler queue is about to be destructed");
        queue_cost += request->cost;
        bool was_empty = requests.empty();
        requests.push_back(*request);

@@ -66,6 +83,8 @@ public:
    bool cancelRequest(ResourceRequest * request) override
    {
        std::lock_guard lock(mutex);
        if (is_not_usable)
            return false; // Any request should already be failed or executed
        if (request->is_linked())
        {
            // It's impossible to check that `request` is indeed inserted to this queue and not another queue.

@@ -88,6 +107,19 @@ public:
        return false;
    }

    void purgeQueue() override
    {
        std::lock_guard lock(mutex);
        is_not_usable = true;
        while (!requests.empty())
        {
            ResourceRequest * request = &requests.front();
            requests.pop_front();
            request->failed(std::make_exception_ptr(
                Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Scheduler queue with resource request is about to be destructed")));
        }
    }

    bool isActive() override
    {
        std::lock_guard lock(mutex);

@@ -131,6 +163,7 @@ private:
    std::mutex mutex;
    Int64 queue_cost = 0;
    boost::intrusive::list<ResourceRequest> requests;
    bool is_not_usable = false;
};

}
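Editor's note: a hedged sketch of the teardown contract introduced above. `purgeQueue()` fails every pending request and later calls to `enqueueRequest()` throw, so request owners must be prepared for `failed()`. `MyRequest` is a hypothetical subclass; the `execute()`/`failed()` override points are the ones exercised by the diff above.

// Hypothetical ResourceRequest subclass (illustration only).
struct MyRequest : ResourceRequest
{
    void execute() override
    {
        // Grant the resource to the waiting consumer.
    }
    void failed(const std::exception_ptr & /*error*/) override
    {
        // Reached for each request still queued when purgeQueue() runs,
        // carrying the INVALID_SCHEDULER_NODE exception shown above.
    }
};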
532	src/Common/Scheduler/Nodes/IOResourceManager.cpp (new file)
@@ -0,0 +1,532 @@
#include <Common/Scheduler/Nodes/IOResourceManager.h>

#include <Common/Scheduler/Nodes/FifoQueue.h>
#include <Common/Scheduler/Nodes/FairPolicy.h>

#include <Common/logger_useful.h>
#include <Common/Exception.h>
#include <Common/StringUtils.h>
#include <Common/assert_cast.h>
#include <Common/typeid_cast.h>
#include <Common/Priority.h>

#include <Parsers/ASTCreateWorkloadQuery.h>
#include <Parsers/ASTCreateResourceQuery.h>

#include <memory>
#include <mutex>
#include <map>

namespace DB
{

namespace ErrorCodes
{
    extern const int RESOURCE_NOT_FOUND;
    extern const int INVALID_SCHEDULER_NODE;
    extern const int LOGICAL_ERROR;
}

namespace
{
    String getEntityName(const ASTPtr & ast)
    {
        if (auto * create = typeid_cast<ASTCreateWorkloadQuery *>(ast.get()))
            return create->getWorkloadName();
        if (auto * create = typeid_cast<ASTCreateResourceQuery *>(ast.get()))
            return create->getResourceName();
        return "unknown-workload-entity";
    }
}

IOResourceManager::NodeInfo::NodeInfo(const ASTPtr & ast, const String & resource_name)
{
    auto * create = assert_cast<ASTCreateWorkloadQuery *>(ast.get());
    name = create->getWorkloadName();
    parent = create->getWorkloadParent();
    settings.updateFromChanges(create->changes, resource_name);
}

IOResourceManager::Resource::Resource(const ASTPtr & resource_entity_)
    : resource_entity(resource_entity_)
    , resource_name(getEntityName(resource_entity))
{
    scheduler.start();
}

IOResourceManager::Resource::~Resource()
{
    scheduler.stop();
}

void IOResourceManager::Resource::createNode(const NodeInfo & info)
{
    if (info.name.empty())
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Workload must have a name in resource '{}'",
            resource_name);

    if (info.name == info.parent)
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Self-referencing workload '{}' is not allowed in resource '{}'",
            info.name, resource_name);

    if (node_for_workload.contains(info.name))
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Node for creating workload '{}' already exist in resource '{}'",
            info.name, resource_name);

    if (!info.parent.empty() && !node_for_workload.contains(info.parent))
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Parent node '{}' for creating workload '{}' does not exist in resource '{}'",
            info.parent, info.name, resource_name);

    if (info.parent.empty() && root_node)
        throw Exception(ErrorCodes::LOGICAL_ERROR, "The second root workload '{}' is not allowed (current root '{}') in resource '{}'",
            info.name, root_node->basename, resource_name);

    executeInSchedulerThread([&, this]
    {
        auto node = std::make_shared<UnifiedSchedulerNode>(scheduler.event_queue, info.settings);
        node->basename = info.name;
        if (!info.parent.empty())
            node_for_workload[info.parent]->attachUnifiedChild(node);
        else
        {
            root_node = node;
            scheduler.attachChild(root_node);
        }
        node_for_workload[info.name] = node;

        updateCurrentVersion();
    });
}

void IOResourceManager::Resource::deleteNode(const NodeInfo & info)
{
    if (!node_for_workload.contains(info.name))
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Node for removing workload '{}' does not exist in resource '{}'",
            info.name, resource_name);

    if (!info.parent.empty() && !node_for_workload.contains(info.parent))
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Parent node '{}' for removing workload '{}' does not exist in resource '{}'",
            info.parent, info.name, resource_name);

    auto node = node_for_workload[info.name];

    if (node->hasUnifiedChildren())
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Removing workload '{}' with children in resource '{}'",
            info.name, resource_name);

    executeInSchedulerThread([&]
    {
        if (!info.parent.empty())
            node_for_workload[info.parent]->detachUnifiedChild(node);
        else
        {
            chassert(node == root_node);
            scheduler.removeChild(root_node.get());
            root_node.reset();
        }

        node_for_workload.erase(info.name);

        updateCurrentVersion();
    });
}

void IOResourceManager::Resource::updateNode(const NodeInfo & old_info, const NodeInfo & new_info)
{
    if (old_info.name != new_info.name)
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Updating a name of workload '{}' to '{}' is not allowed in resource '{}'",
            old_info.name, new_info.name, resource_name);

    if (old_info.parent != new_info.parent && (old_info.parent.empty() || new_info.parent.empty()))
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Workload '{}' invalid update of parent from '{}' to '{}' in resource '{}'",
            old_info.name, old_info.parent, new_info.parent, resource_name);

    if (!node_for_workload.contains(old_info.name))
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Node for updating workload '{}' does not exist in resource '{}'",
            old_info.name, resource_name);

    if (!old_info.parent.empty() && !node_for_workload.contains(old_info.parent))
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Old parent node '{}' for updating workload '{}' does not exist in resource '{}'",
            old_info.parent, old_info.name, resource_name);

    if (!new_info.parent.empty() && !node_for_workload.contains(new_info.parent))
        throw Exception(ErrorCodes::LOGICAL_ERROR, "New parent node '{}' for updating workload '{}' does not exist in resource '{}'",
            new_info.parent, new_info.name, resource_name);

    executeInSchedulerThread([&, this]
    {
        auto node = node_for_workload[old_info.name];
        bool detached = false;
        if (UnifiedSchedulerNode::updateRequiresDetach(old_info.parent, new_info.parent, old_info.settings, new_info.settings))
        {
            if (!old_info.parent.empty())
                node_for_workload[old_info.parent]->detachUnifiedChild(node);
            detached = true;
        }

        node->updateSchedulingSettings(new_info.settings);

        if (detached)
        {
            if (!new_info.parent.empty())
                node_for_workload[new_info.parent]->attachUnifiedChild(node);
        }
        updateCurrentVersion();
    });
}

void IOResourceManager::Resource::updateCurrentVersion()
{
    auto previous_version = current_version;

    // Create a full list of constraints and queues in the current hierarchy
    current_version = std::make_shared<Version>();
    if (root_node)
        root_node->addRawPointerNodes(current_version->nodes);

    // See details in version control section of description in IOResourceManager.h
    if (previous_version)
    {
        previous_version->newer_version = current_version;
        previous_version.reset(); // Destroys previous version nodes if there are no classifiers referencing it
    }
}
IOResourceManager::Workload::Workload(IOResourceManager * resource_manager_, const ASTPtr & workload_entity_)
    : resource_manager(resource_manager_)
    , workload_entity(workload_entity_)
{
    try
    {
        for (auto & [resource_name, resource] : resource_manager->resources)
            resource->createNode(NodeInfo(workload_entity, resource_name));
    }
    catch (...)
    {
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}",
            getCurrentExceptionMessage(/* with_stacktrace = */ true));
    }
}

IOResourceManager::Workload::~Workload()
{
    try
    {
        for (auto & [resource_name, resource] : resource_manager->resources)
            resource->deleteNode(NodeInfo(workload_entity, resource_name));
    }
    catch (...)
    {
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}",
            getCurrentExceptionMessage(/* with_stacktrace = */ true));
    }
}

void IOResourceManager::Workload::updateWorkload(const ASTPtr & new_entity)
{
    try
    {
        for (auto & [resource_name, resource] : resource_manager->resources)
            resource->updateNode(NodeInfo(workload_entity, resource_name), NodeInfo(new_entity, resource_name));
        workload_entity = new_entity;
    }
    catch (...)
    {
        throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected error in IOResourceManager: {}",
            getCurrentExceptionMessage(/* with_stacktrace = */ true));
    }
}

String IOResourceManager::Workload::getParent() const
{
    return assert_cast<ASTCreateWorkloadQuery *>(workload_entity.get())->getWorkloadParent();
}

IOResourceManager::IOResourceManager(IWorkloadEntityStorage & storage_)
    : storage(storage_)
    , log{getLogger("IOResourceManager")}
{
    subscription = storage.getAllEntitiesAndSubscribe(
        [this] (const std::vector<IWorkloadEntityStorage::Event> & events)
        {
            for (const auto & [entity_type, entity_name, entity] : events)
            {
                switch (entity_type)
                {
                    case WorkloadEntityType::Workload:
                    {
                        if (entity)
                            createOrUpdateWorkload(entity_name, entity);
                        else
                            deleteWorkload(entity_name);
                        break;
                    }
                    case WorkloadEntityType::Resource:
                    {
                        if (entity)
                            createOrUpdateResource(entity_name, entity);
                        else
                            deleteResource(entity_name);
                        break;
                    }
                    case WorkloadEntityType::MAX: break;
                }
            }
        });
}

IOResourceManager::~IOResourceManager()
{
    subscription.reset();
    resources.clear();
    workloads.clear();
}

void IOResourceManager::updateConfiguration(const Poco::Util::AbstractConfiguration &)
{
    // No-op
}

void IOResourceManager::createOrUpdateWorkload(const String & workload_name, const ASTPtr & ast)
{
    std::unique_lock lock{mutex};
    if (auto workload_iter = workloads.find(workload_name); workload_iter != workloads.end())
        workload_iter->second->updateWorkload(ast);
    else
        workloads.emplace(workload_name, std::make_shared<Workload>(this, ast));
}

void IOResourceManager::deleteWorkload(const String & workload_name)
{
    std::unique_lock lock{mutex};
    if (auto workload_iter = workloads.find(workload_name); workload_iter != workloads.end())
    {
        // Note that we rely on the fact that workload entity storage will not drop workload that is used as a parent
        workloads.erase(workload_iter);
    }
    else // Workload to be deleted does not exist -- do nothing, throwing exceptions from a subscription is pointless
        LOG_ERROR(log, "Delete workload that doesn't exist: {}", workload_name);
}

void IOResourceManager::createOrUpdateResource(const String & resource_name, const ASTPtr & ast)
{
    std::unique_lock lock{mutex};
    if (auto resource_iter = resources.find(resource_name); resource_iter != resources.end())
        resource_iter->second->updateResource(ast);
    else
    {
        // Add all workloads into the new resource
        auto resource = std::make_shared<Resource>(ast);
        for (Workload * workload : topologicallySortedWorkloads())
            resource->createNode(NodeInfo(workload->workload_entity, resource_name));

        // Attach the resource
        resources.emplace(resource_name, resource);
    }
}

void IOResourceManager::deleteResource(const String & resource_name)
{
    std::unique_lock lock{mutex};
    if (auto resource_iter = resources.find(resource_name); resource_iter != resources.end())
    {
        resources.erase(resource_iter);
    }
    else // Resource to be deleted does not exist -- do nothing, throwing exceptions from a subscription is pointless
        LOG_ERROR(log, "Delete resource that doesn't exist: {}", resource_name);
}
IOResourceManager::Classifier::~Classifier()
|
||||
{
|
||||
// Detach classifier from all resources in parallel (executed in every scheduler thread)
|
||||
std::vector<std::future<void>> futures;
|
||||
{
|
||||
std::unique_lock lock{mutex};
|
||||
futures.reserve(attachments.size());
|
||||
for (auto & [resource_name, attachment] : attachments)
|
||||
{
|
||||
futures.emplace_back(attachment.resource->detachClassifier(std::move(attachment.version)));
|
||||
attachment.link.reset(); // Just in case because it is not valid any longer
|
||||
}
|
||||
}
|
||||
|
||||
// Wait for all tasks to finish (to avoid races in case of exceptions)
|
||||
for (auto & future : futures)
|
||||
future.wait();
|
||||
|
||||
// There should not be any exceptions because it just destruct few objects, but let's rethrow just in case
|
||||
for (auto & future : futures)
|
||||
future.get();
|
||||
|
||||
// This unreferences and probably destroys `Resource` objects.
|
||||
// NOTE: We cannot do it in the scheduler threads (because thread cannot join itself).
|
||||
attachments.clear();
|
||||
}
|
||||
|
||||
std::future<void> IOResourceManager::Resource::detachClassifier(VersionPtr && version)
|
||||
{
|
||||
auto detach_promise = std::make_shared<std::promise<void>>(); // event queue task is std::function, which requires copy semanticss
|
||||
auto future = detach_promise->get_future();
|
||||
scheduler.event_queue->enqueue([detached_version = std::move(version), promise = std::move(detach_promise)] mutable
|
||||
{
|
||||
try
|
||||
{
|
||||
// Unreferences and probably destroys the version and scheduler nodes it owns.
|
||||
// The main reason from moving destruction into the scheduler thread is to
|
||||
// free memory in the same thread it was allocated to avoid memtrackers drift.
|
||||
detached_version.reset();
|
||||
promise->set_value();
|
||||
}
|
||||
catch (...)
|
||||
{
|
||||
promise->set_exception(std::current_exception());
|
||||
}
|
||||
});
|
||||
return future;
|
||||
}

bool IOResourceManager::Classifier::has(const String & resource_name)
{
    std::unique_lock lock{mutex};
    return attachments.contains(resource_name);
}

ResourceLink IOResourceManager::Classifier::get(const String & resource_name)
{
    std::unique_lock lock{mutex};
    if (auto iter = attachments.find(resource_name); iter != attachments.end())
        return iter->second.link;
    else
        throw Exception(ErrorCodes::RESOURCE_NOT_FOUND, "Access denied to resource '{}'", resource_name);
}

void IOResourceManager::Classifier::attach(const ResourcePtr & resource, const VersionPtr & version, ResourceLink link)
{
    std::unique_lock lock{mutex};
    chassert(!attachments.contains(resource->getName()));
    attachments[resource->getName()] = Attachment{.resource = resource, .version = version, .link = link};
}

void IOResourceManager::Resource::updateResource(const ASTPtr & new_resource_entity)
{
    chassert(getEntityName(new_resource_entity) == resource_name);
    resource_entity = new_resource_entity;
}

std::future<void> IOResourceManager::Resource::attachClassifier(Classifier & classifier, const String & workload_name)
{
    auto attach_promise = std::make_shared<std::promise<void>>(); // event queue task is std::function, which requires copy semantics
    auto future = attach_promise->get_future();
    scheduler.event_queue->enqueue([&, this, promise = std::move(attach_promise)]
    {
        try
        {
            if (auto iter = node_for_workload.find(workload_name); iter != node_for_workload.end())
            {
                auto queue = iter->second->getQueue();
                if (!queue)
                    throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Unable to use workload '{}' that has children for resource '{}'",
                        workload_name, resource_name);
                classifier.attach(shared_from_this(), current_version, ResourceLink{.queue = queue.get()});
            }
            else
            {
                // This resource does not have the specified workload. It is either unknown or managed by another resource manager.
                // We leave this resource not attached to the classifier. "Access denied" will be thrown later on `classifier->get(resource_name)`.
            }
            promise->set_value();
        }
        catch (...)
        {
            promise->set_exception(std::current_exception());
        }
    });
    return future;
}

bool IOResourceManager::hasResource(const String & resource_name) const
{
    std::unique_lock lock{mutex};
    return resources.contains(resource_name);
}

ClassifierPtr IOResourceManager::acquire(const String & workload_name)
{
    auto classifier = std::make_shared<Classifier>();

    // Attach classifier to all resources in parallel (executed in every scheduler thread)
    std::vector<std::future<void>> futures;
    {
        std::unique_lock lock{mutex};
        futures.reserve(resources.size());
        for (auto & [resource_name, resource] : resources)
            futures.emplace_back(resource->attachClassifier(*classifier, workload_name));
    }

    // Wait for all tasks to finish (to avoid races in case of exceptions)
    for (auto & future : futures)
        future.wait();

    // Rethrow exceptions if any
    for (auto & future : futures)
        future.get();

    return classifier;
}
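For orientation, a minimal usage sketch of `acquire()` from a query thread. The wrapper function and the entity names are assumptions borrowed from the SQL example in IOResourceManager.h, not actual sources:

void consumeExample(IOResourceManager & manager)
{
    ClassifierPtr classifier = manager.acquire("production");
    if (classifier->has("my_io_resource"))
    {
        // Throws RESOURCE_NOT_FOUND ("Access denied") for resources this workload is not attached to
        ResourceLink link = classifier->get("my_io_resource");
        (void)link; // e.g. pass to a ResourceGuard for actual consumption
        // `link` and the scheduler nodes behind it stay valid for as long as `classifier` is held
    }
}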

void IOResourceManager::Resource::forEachResourceNode(IResourceManager::VisitorFunc & visitor)
{
    executeInSchedulerThread([&, this]
    {
        for (auto & [path, node] : node_for_workload)
        {
            node->forEachSchedulerNode([&] (ISchedulerNode * scheduler_node)
            {
                visitor(resource_name, scheduler_node->getPath(), scheduler_node);
            });
        }
    });
}

void IOResourceManager::forEachNode(IResourceManager::VisitorFunc visitor)
{
    // Copy resources to avoid holding the mutex for a long time
    std::unordered_map<String, ResourcePtr> resources_copy;
    {
        std::unique_lock lock{mutex};
        resources_copy = resources;
    }

    /// Run tasks one by one to avoid concurrent calls to visitor
    for (auto & [resource_name, resource] : resources_copy)
        resource->forEachResourceNode(visitor);
}

void IOResourceManager::topologicallySortedWorkloadsImpl(Workload * workload, std::unordered_set<Workload *> & visited, std::vector<Workload *> & sorted_workloads)
{
    if (visited.contains(workload))
        return;
    visited.insert(workload);

    // Recurse into parent (if any)
    String parent = workload->getParent();
    if (!parent.empty())
    {
        auto parent_iter = workloads.find(parent);
        chassert(parent_iter != workloads.end()); // validations check that all parents exist
        topologicallySortedWorkloadsImpl(parent_iter->second.get(), visited, sorted_workloads);
    }

    sorted_workloads.push_back(workload);
}

std::vector<IOResourceManager::Workload *> IOResourceManager::topologicallySortedWorkloads()
{
    std::vector<Workload *> sorted_workloads;
    std::unordered_set<Workload *> visited;
    for (auto & [workload_name, workload] : workloads)
        topologicallySortedWorkloadsImpl(workload.get(), visited, sorted_workloads);
    return sorted_workloads;
}

}
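The DFS above emits every parent before its children, which is what allows `createNode()` to assume the parent scheduler node already exists. A self-contained sketch of the same ordering over plain strings (simplified, not the actual member function):

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Maps workload name -> parent name ("" for roots); returns parents before children.
std::vector<std::string> topologicallySorted(const std::unordered_map<std::string, std::string> & parent_of)
{
    std::vector<std::string> sorted;
    std::unordered_set<std::string> visited;
    auto visit = [&](auto && self, const std::string & name) -> void
    {
        if (!visited.insert(name).second)
            return; // already emitted
        if (auto it = parent_of.find(name); it != parent_of.end() && !it->second.empty())
            self(self, it->second); // emit the whole parent chain first
        sorted.push_back(name);
    };
    for (const auto & [name, parent] : parent_of)
        visit(visit, name);
    return sorted;
}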
281 src/Common/Scheduler/Nodes/IOResourceManager.h Normal file
@ -0,0 +1,281 @@
#pragma once

#include <base/defines.h>
#include <base/scope_guard.h>

#include <Common/Logger.h>
#include <Common/Scheduler/SchedulingSettings.h>
#include <Common/Scheduler/IResourceManager.h>
#include <Common/Scheduler/SchedulerRoot.h>
#include <Common/Scheduler/Nodes/UnifiedSchedulerNode.h>
#include <Common/Scheduler/Workload/IWorkloadEntityStorage.h>

#include <Parsers/IAST_fwd.h>

#include <boost/core/noncopyable.hpp>

#include <exception>
#include <memory>
#include <mutex>
#include <future>
#include <unordered_set>

namespace DB
{

/*
 * Implementation of `IResourceManager` that creates a hierarchy of scheduler nodes according to
 * workload entities (WORKLOADs and RESOURCEs). It subscribes to updates in IWorkloadEntityStorage and
 * creates a hierarchy of UnifiedSchedulerNode identical to the hierarchy of WORKLOADs.
 * For every RESOURCE an independent hierarchy of scheduler nodes is created.
 *
 * The manager processes updates of WORKLOADs and RESOURCEs: CREATE/DROP/ALTER.
 * When a RESOURCE is created (dropped), a corresponding scheduler node hierarchy is created (destroyed).
 * After DROP RESOURCE, parts of the hierarchy might be kept alive while at least one query uses it.
 *
 * The manager is specific to IO only because it creates scheduler node hierarchies for RESOURCEs having
 * WRITE DISK and/or READ DISK definitions. CPU and memory resources are managed separately.
 *
 * Classifiers are used (1) to access IO resources and (2) to keep shared ownership of scheduling nodes.
 * This allows `ResourceRequest` and `ResourceLink` to hold raw pointers as long as
 * `ClassifierPtr` is acquired and held.
 *
 * === RESOURCE ARCHITECTURE ===
 * Let's consider how a single resource is implemented. Every workload is represented by a corresponding UnifiedSchedulerNode.
 * Every UnifiedSchedulerNode manages its own subtree of ISchedulerNode objects (see details in UnifiedSchedulerNode.h).
 * A UnifiedSchedulerNode for a workload w/o children has a queue, which provides a ResourceLink for consumption.
 * The parent of the root workload for a resource is SchedulerRoot with its own scheduler thread.
 * So every resource has its dedicated thread for processing of resource requests and other events (see EventQueue).
 *
 * Here is an example of SQL and the corresponding hierarchy of scheduler nodes:
 *    CREATE RESOURCE my_io_resource (...)
 *    CREATE WORKLOAD all
 *    CREATE WORKLOAD production PARENT all
 *    CREATE WORKLOAD development PARENT all
 *
 *            root - SchedulerRoot (with scheduler thread and EventQueue)
 *             |
 *            all - UnifiedSchedulerNode
 *             |
 *         p0_fair - FairPolicy (part of parent UnifiedSchedulerNode internal structure)
 *          /     \
 *  production   development - UnifiedSchedulerNode
 *      |             |
 *    queue         queue - FifoQueue (part of parent UnifiedSchedulerNode internal structure)
 *
 * === UPDATING WORKLOADS ===
 * A workload may be created, updated or deleted.
 * Updating a child of a workload might lead to updating other workloads:
 * 1. The workload itself: its structure depends on the settings of children workloads
 *    (e.g. the fifo node of a leaf workload is removed when the first child is added;
 *    and a fair node is inserted after the first two children are added).
 * 2. Other children: for them the path to root might be changed (e.g. an intermediate priority node is inserted).
 *
 * === VERSION CONTROL ===
 * Versions are created on hierarchy updates and hold ownership of nodes that are used through raw pointers.
 * A classifier references a version of every resource it uses. Older versions reference newer versions.
 * Here is a diagram explaining version control based on Version objects (for 1 resource):
 *
 *    [nodes]      [nodes]        [nodes]
 *       ^            ^              ^
 *       |            |              |
 *    version1 --> version2 -...-> versionN
 *       ^            ^              ^
 *       |            |              |
 * old_classifier  new_classifier  current_version
 *
 * A previous version should hold a reference to a newer version. It is required for proper handling of updates.
 * Classifiers that were created for any of the old versions may use nodes of a newer version due to updateNode().
 * It may move a queue to a new position in the hierarchy or create/destroy constraints, thus resource requests
 * created by an old classifier may reference constraints of newer versions through `request->constraints`, which
 * is filled during dequeueRequest().
 *
 * === THREADS ===
 * scheduler thread:
 *  - one thread per resource
 *  - uses event_queue (per resource) for processing w/o holding a mutex for every scheduler node
 *  - handles resource requests
 *  - node activations
 *  - scheduler hierarchy updates
 * query thread:
 *  - multiple independent threads
 *  - send resource requests
 *  - acquire and release classifiers (via scheduler event queues)
 * control thread:
 *  - modifies workloads and resources through subscription
 *
 * === SYNCHRONIZATION ===
 * List of related sync primitives and their roles:
 * IOResourceManager::mutex
 *  - protects resource manager data structures - resources and workloads
 *  - serializes control thread actions
 * IOResourceManager::Resource::scheduler->event_queue
 *  - serializes scheduler hierarchy events
 *  - events are created in control and query threads
 *  - all events are processed by a specific scheduler thread
 *  - hierarchy-wide actions: request dequeueing, activation propagation and node updates
 *  - resource version control management
 * FifoQueue::mutex and SemaphoreConstraint::mutex
 *  - serialize query and scheduler threads on specific node accesses
 *  - resource request processing: enqueueRequest(), dequeueRequest() and finishRequest()
 */
class IOResourceManager : public IResourceManager
{
public:
    explicit IOResourceManager(IWorkloadEntityStorage & storage_);
    ~IOResourceManager() override;
    void updateConfiguration(const Poco::Util::AbstractConfiguration & config) override;
    bool hasResource(const String & resource_name) const override;
    ClassifierPtr acquire(const String & workload_name) override;
    void forEachNode(VisitorFunc visitor) override;

private:
    // Forward declarations
    struct NodeInfo;
    struct Version;
    class Resource;
    struct Workload;
    class Classifier;

    friend struct Workload;

    using VersionPtr = std::shared_ptr<Version>;
    using ResourcePtr = std::shared_ptr<Resource>;
    using WorkloadPtr = std::shared_ptr<Workload>;

    /// Helper for parsing workload AST for a specific resource
    struct NodeInfo
    {
        String name; // Workload name
        String parent; // Name of parent workload
        SchedulingSettings settings; // Settings specific for a given resource

        NodeInfo(const ASTPtr & ast, const String & resource_name);
    };

    /// Ownership control for scheduler nodes, which could be referenced by raw pointers
    struct Version
    {
        std::vector<SchedulerNodePtr> nodes;
        VersionPtr newer_version;
    };

    /// Holds a thread and hierarchy of unified scheduler nodes for a specific RESOURCE
    class Resource : public std::enable_shared_from_this<Resource>, boost::noncopyable
    {
    public:
        explicit Resource(const ASTPtr & resource_entity_);
        ~Resource();

        const String & getName() const { return resource_name; }

        /// Hierarchy management
        void createNode(const NodeInfo & info);
        void deleteNode(const NodeInfo & info);
        void updateNode(const NodeInfo & old_info, const NodeInfo & new_info);

        /// Updates resource entity
        void updateResource(const ASTPtr & new_resource_entity);

        /// Updates a classifier to contain a reference for the specified workload
        std::future<void> attachClassifier(Classifier & classifier, const String & workload_name);

        /// Removes a classifier reference. This destroys scheduler nodes in the proper scheduler thread
        std::future<void> detachClassifier(VersionPtr && version);

        /// Introspection
        void forEachResourceNode(IOResourceManager::VisitorFunc & visitor);

    private:
        void updateCurrentVersion();

        template <class Task>
        void executeInSchedulerThread(Task && task)
        {
            std::promise<void> promise;
            auto future = promise.get_future();
            scheduler.event_queue->enqueue([&]
            {
                try
                {
                    task();
                    promise.set_value();
                }
                catch (...)
                {
                    promise.set_exception(std::current_exception());
                }
            });
            future.get(); // Blocks until execution is done in the scheduler thread
        }

        ASTPtr resource_entity;
        const String resource_name;
        SchedulerRoot scheduler;

        // TODO(serxa): consider using resource_manager->mutex + scheduler thread for updates and mutex only for reading to avoid slow acquire/release of classifier
        /// These fields should be accessed only by the scheduler thread
        std::unordered_map<String, UnifiedSchedulerNodePtr> node_for_workload;
        UnifiedSchedulerNodePtr root_node;
        VersionPtr current_version;
    };

    struct Workload : boost::noncopyable
    {
        IOResourceManager * resource_manager;
        ASTPtr workload_entity;

        Workload(IOResourceManager * resource_manager_, const ASTPtr & workload_entity_);
        ~Workload();

        void updateWorkload(const ASTPtr & new_entity);
        String getParent() const;
    };

    class Classifier : public IClassifier
    {
    public:
        ~Classifier() override;

        /// Implements IClassifier interface
        /// NOTE: It is called from query threads (possibly multiple)
        bool has(const String & resource_name) override;
        ResourceLink get(const String & resource_name) override;

        /// Attaches/detaches a specific resource
        /// NOTE: It is called from scheduler threads (possibly multiple)
        void attach(const ResourcePtr & resource, const VersionPtr & version, ResourceLink link);
        void detach(const ResourcePtr & resource);

    private:
        IOResourceManager * resource_manager;
        std::mutex mutex;
        struct Attachment
        {
            ResourcePtr resource;
            VersionPtr version;
            ResourceLink link;
        };
        std::unordered_map<String, Attachment> attachments; // TSA_GUARDED_BY(mutex);
    };

    void createOrUpdateWorkload(const String & workload_name, const ASTPtr & ast);
    void deleteWorkload(const String & workload_name);
    void createOrUpdateResource(const String & resource_name, const ASTPtr & ast);
    void deleteResource(const String & resource_name);

    // Topological sorting of workloads
    void topologicallySortedWorkloadsImpl(Workload * workload, std::unordered_set<Workload *> & visited, std::vector<Workload *> & sorted_workloads);
    std::vector<Workload *> topologicallySortedWorkloads();

    IWorkloadEntityStorage & storage;
    scope_guard subscription;

    mutable std::mutex mutex;
    std::unordered_map<String, WorkloadPtr> workloads; // TSA_GUARDED_BY(mutex);
    std::unordered_map<String, ResourcePtr> resources; // TSA_GUARDED_BY(mutex);

    LoggerPtr log;
};

}
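To make the version-chain invariant concrete, a tiny standalone sketch with simplified stand-in types (not the real declarations): holding any version transitively keeps all newer versions, and hence their nodes, alive.

#include <memory>
#include <vector>

struct Node {};

struct Version
{
    std::vector<std::shared_ptr<Node>> nodes; // nodes owned by this version
    std::shared_ptr<Version> newer_version;   // older versions keep newer ones alive
};

// A classifier that stores `version1` transitively owns version2..versionN and all
// their nodes, so raw pointers handed out via ResourceLink cannot dangle:
//     std::shared_ptr<Version> held_by_classifier = version1;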
@ -19,7 +19,7 @@ namespace ErrorCodes
 * Scheduler node that implements priority scheduling policy.
 * Requests are scheduled in order of priorities.
 */
-class PriorityPolicy : public ISchedulerNode
+class PriorityPolicy final : public ISchedulerNode
{
    /// Scheduling state of a child
    struct Item
@ -39,6 +39,23 @@ public:
        : ISchedulerNode(event_queue_, config, config_prefix)
    {}

+    explicit PriorityPolicy(EventQueue * event_queue_, const SchedulerNodeInfo & node_info)
+        : ISchedulerNode(event_queue_, node_info)
+    {}
+
+    ~PriorityPolicy() override
+    {
+        // We need to clear `parent` in all children to avoid dangling references
+        while (!children.empty())
+            removeChild(children.begin()->second.get());
+    }
+
    const String & getTypeName() const override
    {
        static String type_name("priority");
        return type_name;
    }

    bool equals(ISchedulerNode * other) override
    {
        if (!ISchedulerNode::equals(other))
@ -1,5 +1,6 @@
#pragma once

+#include "Common/Scheduler/ISchedulerNode.h"
#include <Common/Scheduler/ISchedulerConstraint.h>

#include <mutex>
@ -13,7 +14,7 @@ namespace DB
 * Limited concurrency constraint.
 * Blocks if either the number of concurrent in-flight requests exceeds `max_requests`, or their total cost exceeds `max_cost`
 */
-class SemaphoreConstraint : public ISchedulerConstraint
+class SemaphoreConstraint final : public ISchedulerConstraint
{
    static constexpr Int64 default_max_requests = std::numeric_limits<Int64>::max();
    static constexpr Int64 default_max_cost = std::numeric_limits<Int64>::max();
@ -24,6 +25,25 @@ public:
        , max_cost(config.getInt64(config_prefix + ".max_cost", config.getInt64(config_prefix + ".max_bytes", default_max_cost)))
    {}

+    SemaphoreConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_, Int64 max_requests_, Int64 max_cost_)
+        : ISchedulerConstraint(event_queue_, info_)
+        , max_requests(max_requests_)
+        , max_cost(max_cost_)
+    {}
+
+    ~SemaphoreConstraint() override
+    {
+        // We need to clear `parent` in the child to avoid dangling references
+        if (child)
+            removeChild(child.get());
+    }
+
    const String & getTypeName() const override
    {
        static String type_name("inflight_limit");
        return type_name;
    }

    bool equals(ISchedulerNode * other) override
    {
        if (!ISchedulerNode::equals(other))
@ -68,15 +88,14 @@ public:
        if (!request)
            return {nullptr, false};

-        // Request has a reference to the first (closest to leaf) `constraint`, which can have a `parent_constraint`.
-        // The former is initialized here dynamically and the latter is initialized once during hierarchy construction.
-        if (!request->constraint)
-            request->constraint = this;
-
-        // Update state on request arrival
        std::unique_lock lock(mutex);
-        requests++;
-        cost += request->cost;
+        if (request->addConstraint(this))
+        {
+            // Update state on request arrival
+            requests++;
+            cost += request->cost;
+        }

        child_active = child_now_active;
        if (!active())
            busy_periods++;
@ -86,10 +105,6 @@ public:

    void finishRequest(ResourceRequest * request) override
    {
-        // Recursive traverse of parent flow controls in reverse order
-        if (parent_constraint)
-            parent_constraint->finishRequest(request);
-
        // Update state on request departure
        std::unique_lock lock(mutex);
        bool was_active = active();
@ -109,6 +124,32 @@ public:
            parent->activateChild(this);
    }

+    /// Update limits.
+    /// Should be called from the scheduler thread because it could lead to activation or deactivation
+    void updateConstraints(const SchedulerNodePtr & self, Int64 new_max_requests, UInt64 new_max_cost)
+    {
+        std::unique_lock lock(mutex);
+        bool was_active = active();
+        max_requests = new_max_requests;
+        max_cost = new_max_cost;
+
+        if (parent)
+        {
+            // Activate on transition from inactive state
+            if (!was_active && active())
+                parent->activateChild(this);
+            // Deactivate on transition into inactive state
+            else if (was_active && !active())
+            {
+                // Node deactivation is usually done in dequeueRequest(), but we do not want to
+                // do an extra call to active() on every request just to make sure there was no update().
+                // There is no interface method to do deactivation, so we do the following trick.
+                parent->removeChild(this);
+                parent->attachChild(self); // This call is the only reason we have `recursive_mutex`
+            }
+        }
+    }
+
    bool isActive() override
    {
        std::unique_lock lock(mutex);
@ -150,10 +191,10 @@ private:
        return satisfied() && child_active;
    }

-    const Int64 max_requests = default_max_requests;
-    const Int64 max_cost = default_max_cost;
+    Int64 max_requests = default_max_requests;
+    Int64 max_cost = default_max_cost;

-    std::mutex mutex;
+    std::recursive_mutex mutex;
    Int64 requests = 0;
    Int64 cost = 0;
    bool child_active = false;
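A worked sketch of the admission rule this constraint models (simplified; the exact comparison lives in `satisfied()` and its strictness is an assumption here): with `max_requests = 2` and `max_cost = 100`, a third in-flight request stays queued until `finishRequest()` frees capacity.

// Simplified model of the semaphore admission check (assumed comparison strictness).
bool admits(Int64 requests, Int64 cost, Int64 max_requests, Int64 max_cost)
{
    return requests < max_requests && cost < max_cost;
}
// admits(0, 0,  2, 100) == true   -> first request may be dequeued
// admits(2, 80, 2, 100) == false  -> blocked on the request-count limit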
@ -3,8 +3,6 @@
#include <Common/Scheduler/ISchedulerConstraint.h>

#include <chrono>
#include <mutex>
#include <limits>
#include <utility>

@ -15,7 +13,7 @@ namespace DB
 * Limited throughput constraint. Blocks if the token-bucket constraint is violated:
 * i.e. more than `max_burst + duration * max_speed` cost units (aka tokens) dequeued from this node in the last `duration` seconds.
 */
-class ThrottlerConstraint : public ISchedulerConstraint
+class ThrottlerConstraint final : public ISchedulerConstraint
{
public:
    static constexpr double default_burst_seconds = 1.0;
@ -28,10 +26,28 @@ public:
        , tokens(max_burst)
    {}

+    ThrottlerConstraint(EventQueue * event_queue_, const SchedulerNodeInfo & info_, double max_speed_, double max_burst_)
+        : ISchedulerConstraint(event_queue_, info_)
+        , max_speed(max_speed_)
+        , max_burst(max_burst_)
+        , last_update(event_queue_->now())
+        , tokens(max_burst)
+    {}
+
+    ~ThrottlerConstraint() override
+    {
+        // We should cancel the event on destruction to avoid dangling references from the event queue
+        event_queue->cancelPostponed(postponed);
+
+        // We need to clear `parent` in the child to avoid dangling references
+        if (child)
+            removeChild(child.get());
+    }
+
    const String & getTypeName() const override
    {
        static String type_name("bandwidth_limit");
        return type_name;
    }

    bool equals(ISchedulerNode * other) override
@ -78,10 +94,7 @@ public:
        if (!request)
            return {nullptr, false};

-        // Request has a reference to the first (closest to leaf) `constraint`, which can have a `parent_constraint`.
-        // The former is initialized here dynamically and the latter is initialized once during hierarchy construction.
-        if (!request->constraint)
-            request->constraint = this;
+        // We don't do `request->addConstraint(this)` because `finishRequest()` is a no-op

        updateBucket(request->cost);

@ -92,12 +105,8 @@ public:
        return {request, active()};
    }

-    void finishRequest(ResourceRequest * request) override
+    void finishRequest(ResourceRequest *) override
    {
-        // Recursive traverse of parent flow controls in reverse order
-        if (parent_constraint)
-            parent_constraint->finishRequest(request);
-
        // NOTE: Token-bucket constraint does not require any action when consumption ends
    }

@ -108,6 +117,21 @@ public:
            parent->activateChild(this);
    }

+    /// Update limits.
+    /// Should be called from the scheduler thread because it could lead to activation
+    void updateConstraints(double new_max_speed, double new_max_burst)
+    {
+        event_queue->cancelPostponed(postponed);
+        postponed = EventQueue::not_postponed;
+        bool was_active = active();
+        updateBucket(0, true); // To apply the previous params for the duration since `last_update`
+        max_speed = new_max_speed;
+        max_burst = new_max_burst;
+        updateBucket(0, false); // To postpone (if needed) using the new params
+        if (!was_active && active() && parent)
+            parent->activateChild(this);
+    }
+
    bool isActive() override
    {
        return active();
@ -150,7 +174,7 @@ private:
            parent->activateChild(this);
    }

-    void updateBucket(ResourceCost use = 0)
+    void updateBucket(ResourceCost use = 0, bool do_not_postpone = false)
    {
        auto now = event_queue->now();
        if (max_speed > 0.0)
@ -160,7 +184,7 @@ private:
        tokens -= use; // This is done outside min() to avoid passing large requests w/o token consumption after a long idle period

        // Postpone activation until there is a positive amount of tokens
-        if (tokens < 0.0)
+        if (!do_not_postpone && tokens < 0.0)
        {
            auto delay_ns = std::chrono::nanoseconds(static_cast<Int64>(-tokens / max_speed * 1e9));
            if (postponed == EventQueue::not_postponed)
@ -184,8 +208,8 @@ private:
        return satisfied() && child_active;
    }

-    const double max_speed{0}; /// in tokens per second
-    const double max_burst{0}; /// in tokens
+    double max_speed{0}; /// in tokens per second
+    double max_burst{0}; /// in tokens

    EventQueue::TimePoint last_update;
    UInt64 postponed = EventQueue::not_postponed;
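A standalone sketch of the token-bucket arithmetic used by `updateBucket()` (simplified, assumed field names): tokens refill at `max_speed` up to `max_burst`; consumption may drive the balance negative, in which case the node postpones itself for `-tokens / max_speed` seconds.

#include <algorithm>

struct Bucket
{
    double max_speed;  // tokens per second
    double max_burst;  // bucket capacity in tokens
    double tokens;     // current balance, may go negative after a large request

    // Returns the delay in seconds before the next request may pass (0 if none)
    double consume(double elapsed_seconds, double cost)
    {
        tokens = std::min(max_burst, tokens + max_speed * elapsed_seconds);
        tokens -= cost; // subtracted outside min() so large requests cannot bypass accounting
        return tokens < 0.0 ? -tokens / max_speed : 0.0;
    }
};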
606 src/Common/Scheduler/Nodes/UnifiedSchedulerNode.h Normal file
@ -0,0 +1,606 @@
#pragma once

#include <Common/Priority.h>
#include <Common/Scheduler/Nodes/PriorityPolicy.h>
#include <Common/Scheduler/Nodes/FairPolicy.h>
#include <Common/Scheduler/Nodes/ThrottlerConstraint.h>
#include <Common/Scheduler/Nodes/SemaphoreConstraint.h>
#include <Common/Scheduler/ISchedulerQueue.h>
#include <Common/Scheduler/Nodes/FifoQueue.h>
#include <Common/Scheduler/ISchedulerNode.h>
#include <Common/Scheduler/SchedulingSettings.h>
#include <Common/Exception.h>

#include <memory>
#include <unordered_map>

namespace DB
{

namespace ErrorCodes
{
    extern const int INVALID_SCHEDULER_NODE;
    extern const int LOGICAL_ERROR;
}

class UnifiedSchedulerNode;
using UnifiedSchedulerNodePtr = std::shared_ptr<UnifiedSchedulerNode>;

/*
 * Unified scheduler node combines multiple nodes internally to provide all available scheduling policies and constraints.
 * The whole scheduling hierarchy could "logically" consist of unified nodes only. Physically, intermediate "internal" nodes
 * are also present. This approach is easier to manipulate at runtime than using multiple types of nodes.
 *
 * A unified node is capable of updating its internal structure based on:
 * 1. Number of children (fifo if =0 or fairness/priority if >0).
 * 2. Priorities of its children (for subtree structure).
 * 3. `SchedulingSettings` associated with the unified node (for throttler and semaphore constraints).
 *
 * In general, a unified node has an "internal" subtree with the following structure:
 *
 *                    THIS          <-- UnifiedSchedulerNode object
 *                      |
 *                  THROTTLER       <-- [Optional] Throttling scheduling constraint
 *                      |
 *    [If no children]------ SEMAPHORE  <-- [Optional] Semaphore constraint
 *           |                  |
 *         FIFO             PRIORITY    <-- [Optional] Scheduling policy distinguishing priorities
 *                        .-------'  '-------.
 *                  FAIRNESS[p1]   ...   FAIRNESS[pN]  <-- [Optional] Policies for fairness if priorities are equal
 *                   /        \            /        \
 *        CHILD[p1,w1] ... CHILD[p1,wM]  CHILD[pN,w1] ... CHILD[pN,wM]  <-- Unified children (UnifiedSchedulerNode objects)
 *
 * NOTE: to distinguish different kinds of children we use the following terms:
 *  - immediate child: child of unified object (THROTTLER);
 *  - unified child: leaf of this "internal" subtree (CHILD[p,w]);
 *  - intermediate node: any child that is not UnifiedSchedulerNode (unified child or `this`)
 */
class UnifiedSchedulerNode final : public ISchedulerNode
{
private:
    /// Helper function for managing a parent of a node
    static void reparent(const SchedulerNodePtr & node, const SchedulerNodePtr & new_parent)
    {
        reparent(node, new_parent.get());
    }

    /// Helper function for managing a parent of a node
    static void reparent(const SchedulerNodePtr & node, ISchedulerNode * new_parent)
    {
        chassert(node);
        chassert(new_parent);
        if (new_parent == node->parent)
            return;
        if (node->parent)
            node->parent->removeChild(node.get());
        new_parent->attachChild(node);
    }

    /// Helper function for managing a parent of a node
    static void detach(const SchedulerNodePtr & node)
    {
        if (node->parent)
            node->parent->removeChild(node.get());
    }

    /// A branch of the tree for a specific priority value
    struct FairnessBranch
    {
        SchedulerNodePtr root; /// FairPolicy node is used if multiple children with the same priority are attached
        std::unordered_map<String, UnifiedSchedulerNodePtr> children; // basename -> child

        bool empty() const { return children.empty(); }

        SchedulerNodePtr getRoot()
        {
            chassert(!children.empty());
            if (root)
                return root;
            chassert(children.size() == 1);
            return children.begin()->second;
        }

        /// Attaches a new child.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        [[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
        {
            if (auto [it, inserted] = children.emplace(child->basename, child); !inserted)
                throw Exception(
                    ErrorCodes::INVALID_SCHEDULER_NODE,
                    "Can't add another child with the same path: {}",
                    it->second->getPath());

            if (children.size() == 2)
            {
                // Insert a fair node if we have just added the second child
                chassert(!root);
                root = std::make_shared<FairPolicy>(event_queue_, SchedulerNodeInfo{});
                root->info.setPriority(child->info.priority);
                root->basename = fmt::format("p{}_fair", child->info.priority.value);
                for (auto & [_, node] : children)
                    reparent(node, root);
                return root; // New root has been created
            }
            else if (children.size() == 1)
                return child; // We have added a single child so far and it is the new root
            else
                reparent(child, root);
            return {}; // Root is the same
        }

        /// Detaches a child.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        /// NOTE: It could also return null if `empty()` after detaching
        [[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue *, const UnifiedSchedulerNodePtr & child)
        {
            auto it = children.find(child->basename);
            if (it == children.end())
                return {}; // unknown child

            detach(child);
            children.erase(it);
            if (children.size() == 1)
            {
                // Remove the fair node if only one child is left
                chassert(root);
                detach(root);
                root.reset();
                return children.begin()->second; // The last child is the new root now
            }
            else if (children.empty())
                return {}; // We have detached the last child
            else
                return {}; // Root is the same (two or more children remain)
        }
    };

    /// Handles all the children nodes with intermediate fair and/or priority nodes
    struct ChildrenBranch
    {
        SchedulerNodePtr root; /// PriorityPolicy node is used if multiple children with different priorities are attached
        std::unordered_map<Priority::Value, FairnessBranch> branches; /// Branches for different priority values

        // Returns true iff there are no unified children attached
        bool empty() const { return branches.empty(); }

        SchedulerNodePtr getRoot()
        {
            chassert(!branches.empty());
            if (root)
                return root;
            return branches.begin()->second.getRoot(); // There should be exactly one child-branch
        }

        /// Attaches a new child.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        [[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
        {
            auto [it, new_branch] = branches.try_emplace(child->info.priority);
            auto & child_branch = it->second;
            auto branch_root = child_branch.attachUnifiedChild(event_queue_, child);
            if (!new_branch)
            {
                if (branch_root)
                {
                    if (root)
                        reparent(branch_root, root);
                    else
                        return branch_root;
                }
                return {};
            }
            else
            {
                chassert(branch_root);
                if (branches.size() == 2)
                {
                    // Insert a priority node if we have just added the second branch
                    chassert(!root);
                    root = std::make_shared<PriorityPolicy>(event_queue_, SchedulerNodeInfo{});
                    root->basename = "prio";
                    for (auto & [_, branch] : branches)
                        reparent(branch.getRoot(), root);
                    return root; // New root has been created
                }
                else if (branches.size() == 1)
                    return child; // We have added a single child so far and it is the new root
                else
                    reparent(child, root);
                return {}; // Root is the same
            }
        }

        /// Detaches a child.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        /// NOTE: It could also return null if `empty()` after detaching
        [[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
        {
            auto it = branches.find(child->info.priority);
            if (it == branches.end())
                return {}; // unknown child

            auto & child_branch = it->second;
            auto branch_root = child_branch.detachUnifiedChild(event_queue_, child);
            if (child_branch.empty())
            {
                branches.erase(it);
                if (branches.size() == 1)
                {
                    // Remove the priority node if only one child-branch is left
                    chassert(root);
                    detach(root);
                    root.reset();
                    return branches.begin()->second.getRoot(); // The last child-branch is the new root now
                }
                else if (branches.empty())
                    return {}; // We have detached the last child
                else
                    return {}; // Root is the same (two or more child-branches remain)
            }
            if (branch_root)
            {
                if (root)
                    reparent(branch_root, root);
                else
                    return branch_root;
            }
            return {}; // Root is the same
        }
    };

    /// Handles the degenerate case of zero children (a fifo queue) or delegates to `ChildrenBranch`.
    struct QueueOrChildrenBranch
    {
        SchedulerNodePtr queue; /// FifoQueue node is used if there are no children
        ChildrenBranch branch; /// Used if there is at least one child

        SchedulerNodePtr getRoot()
        {
            if (queue)
                return queue;
            else
                return branch.getRoot();
        }

        // Should be called after the constructor, before any other methods
        [[nodiscard]] SchedulerNodePtr initialize(EventQueue * event_queue_)
        {
            createQueue(event_queue_);
            return queue;
        }

        /// Attaches a new child.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        [[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
        {
            if (queue)
                removeQueue();
            return branch.attachUnifiedChild(event_queue_, child);
        }

        /// Detaches a child.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        [[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
        {
            if (queue)
                return {}; // No-op, it already has no children
            auto branch_root = branch.detachUnifiedChild(event_queue_, child);
            if (branch.empty())
            {
                createQueue(event_queue_);
                return queue;
            }
            return branch_root;
        }

    private:
        void createQueue(EventQueue * event_queue_)
        {
            queue = std::make_shared<FifoQueue>(event_queue_, SchedulerNodeInfo{});
            queue->basename = "fifo";
        }

        void removeQueue()
        {
            // This unified node will not be able to process resource requests any longer
            // All remaining resource requests are aborted on queue destruction
            detach(queue);
            std::static_pointer_cast<ISchedulerQueue>(queue)->purgeQueue();
            queue.reset();
        }
    };

    /// Handles all the nodes under this unified node
    /// Specifically handles constraints with `QueueOrChildrenBranch` under it
    struct ConstraintsBranch
    {
        SchedulerNodePtr throttler;
        SchedulerNodePtr semaphore;
        QueueOrChildrenBranch branch;
        SchedulingSettings settings;

        // Should be called after the constructor, before any other methods
        [[nodiscard]] SchedulerNodePtr initialize(EventQueue * event_queue_, const SchedulingSettings & settings_)
        {
            settings = settings_;
            SchedulerNodePtr node = branch.initialize(event_queue_);
            if (settings.hasSemaphore())
            {
                semaphore = std::make_shared<SemaphoreConstraint>(event_queue_, SchedulerNodeInfo{}, settings.max_requests, settings.max_cost);
                semaphore->basename = "semaphore";
                reparent(node, semaphore);
                node = semaphore;
            }
            if (settings.hasThrottler())
            {
                throttler = std::make_shared<ThrottlerConstraint>(event_queue_, SchedulerNodeInfo{}, settings.max_speed, settings.max_burst);
                throttler->basename = "throttler";
                reparent(node, throttler);
                node = throttler;
            }
            return node;
        }

        /// Attaches a new child.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        [[nodiscard]] SchedulerNodePtr attachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
        {
            if (auto branch_root = branch.attachUnifiedChild(event_queue_, child))
            {
                // If both semaphore and throttler exist we should reparent to the farthest from the root
                if (semaphore)
                    reparent(branch_root, semaphore);
                else if (throttler)
                    reparent(branch_root, throttler);
                else
                    return branch_root;
            }
            return {};
        }

        /// Detaches a child.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        [[nodiscard]] SchedulerNodePtr detachUnifiedChild(EventQueue * event_queue_, const UnifiedSchedulerNodePtr & child)
        {
            if (auto branch_root = branch.detachUnifiedChild(event_queue_, child))
            {
                if (semaphore)
                    reparent(branch_root, semaphore);
                else if (throttler)
                    reparent(branch_root, throttler);
                else
                    return branch_root;
            }
            return {};
        }

        /// Updates constraint-related nodes.
        /// Returns the new root node if it has been changed to a different node, otherwise returns null.
        [[nodiscard]] SchedulerNodePtr updateSchedulingSettings(EventQueue * event_queue_, const SchedulingSettings & new_settings)
        {
            SchedulerNodePtr node = branch.getRoot();

            if (!settings.hasSemaphore() && new_settings.hasSemaphore()) // Add semaphore
            {
                semaphore = std::make_shared<SemaphoreConstraint>(event_queue_, SchedulerNodeInfo{}, new_settings.max_requests, new_settings.max_cost);
                semaphore->basename = "semaphore";
                reparent(node, semaphore);
                node = semaphore;
            }
            else if (settings.hasSemaphore() && !new_settings.hasSemaphore()) // Remove semaphore
            {
                detach(semaphore);
                semaphore.reset();
            }
            else if (settings.hasSemaphore() && new_settings.hasSemaphore()) // Update semaphore
            {
                static_cast<SemaphoreConstraint &>(*semaphore).updateConstraints(semaphore, new_settings.max_requests, new_settings.max_cost);
                node = semaphore;
            }

            if (!settings.hasThrottler() && new_settings.hasThrottler()) // Add throttler
            {
                throttler = std::make_shared<ThrottlerConstraint>(event_queue_, SchedulerNodeInfo{}, new_settings.max_speed, new_settings.max_burst);
                throttler->basename = "throttler";
                reparent(node, throttler);
                node = throttler;
            }
            else if (settings.hasThrottler() && !new_settings.hasThrottler()) // Remove throttler
            {
                detach(throttler);
                throttler.reset();
            }
            else if (settings.hasThrottler() && new_settings.hasThrottler()) // Update throttler
            {
                static_cast<ThrottlerConstraint &>(*throttler).updateConstraints(new_settings.max_speed, new_settings.max_burst);
                node = throttler;
            }

            settings = new_settings;
            return node;
        }
    };

public:
    explicit UnifiedSchedulerNode(EventQueue * event_queue_, const SchedulingSettings & settings)
        : ISchedulerNode(event_queue_, SchedulerNodeInfo(settings.weight, settings.priority))
    {
        immediate_child = impl.initialize(event_queue, settings);
        reparent(immediate_child, this);
    }

    ~UnifiedSchedulerNode() override
    {
        // We need to clear `parent` in the child to avoid dangling references
        if (immediate_child)
            removeChild(immediate_child.get());
    }

    /// Attaches a unified child as a leaf of the internal subtree and inserts or updates all the intermediate nodes
    /// NOTE: Do not confuse with `attachChild()` which is used only for immediate children
    void attachUnifiedChild(const UnifiedSchedulerNodePtr & child)
    {
        if (auto new_child = impl.attachUnifiedChild(event_queue, child))
            reparent(new_child, this);
    }

    /// Detaches a unified child and updates all the intermediate nodes.
    /// A detached child could be safely attached to another parent.
    /// NOTE: Do not confuse with `removeChild()` which is used only for immediate children
    void detachUnifiedChild(const UnifiedSchedulerNodePtr & child)
    {
        if (auto new_child = impl.detachUnifiedChild(event_queue, child))
            reparent(new_child, this);
    }

    static bool updateRequiresDetach(const String & old_parent, const String & new_parent, const SchedulingSettings & old_settings, const SchedulingSettings & new_settings)
    {
        return old_parent != new_parent || old_settings.priority != new_settings.priority;
    }

    /// Updates scheduling settings. The set of constraints might change.
    /// NOTE: Caller is responsible for detaching and attaching if `updateRequiresDetach` returns true
    void updateSchedulingSettings(const SchedulingSettings & new_settings)
    {
        info.setPriority(new_settings.priority);
        info.setWeight(new_settings.weight);
        if (auto new_child = impl.updateSchedulingSettings(event_queue, new_settings))
            reparent(new_child, this);
    }

    const SchedulingSettings & getSettings() const
    {
        return impl.settings;
    }

    /// Returns the queue to be used for resource requests or `nullptr` if it has unified children
    std::shared_ptr<ISchedulerQueue> getQueue() const
    {
        return static_pointer_cast<ISchedulerQueue>(impl.branch.queue);
    }

    /// Collects nodes that could be accessed with raw pointers by resource requests (queue and constraints)
    /// NOTE: This is a building block for the classifier. Note that due to a possible movement of a queue, the set of constraints
    /// for that queue might change in the future, and `request->constraints` might reference nodes not in
    /// the initial set of nodes returned by `addRawPointerNodes()`. To avoid destruction of such additional nodes,
    /// the classifier must (indirectly) hold the nodes returned by `addRawPointerNodes()` for all future versions of
    /// all unified nodes. Such version control is done by `IOResourceManager`.
    void addRawPointerNodes(std::vector<SchedulerNodePtr> & nodes)
    {
        // NOTE: `impl.throttler` could be skipped, because ThrottlerConstraint does not call `request->addConstraint()`
        if (impl.semaphore)
            nodes.push_back(impl.semaphore);
        if (impl.branch.queue)
            nodes.push_back(impl.branch.queue);
        for (auto & [_, branch] : impl.branch.branch.branches)
        {
            for (auto & [_, child] : branch.children)
                child->addRawPointerNodes(nodes);
        }
    }

    bool hasUnifiedChildren() const
    {
        return impl.branch.queue == nullptr;
    }

    /// Introspection. Calls a visitor for self and every internal node. Does not recurse into unified children.
    void forEachSchedulerNode(std::function<void(ISchedulerNode *)> visitor)
    {
        visitor(this);
        if (impl.throttler)
            visitor(impl.throttler.get());
        if (impl.semaphore)
            visitor(impl.semaphore.get());
        if (impl.branch.queue)
            visitor(impl.branch.queue.get());
        if (impl.branch.branch.root) // priority
            visitor(impl.branch.branch.root.get());
        for (auto & [_, branch] : impl.branch.branch.branches)
        {
            if (branch.root) // fairness
                visitor(branch.root.get());
        }
    }

protected: // Hide all the ISchedulerNode interface methods as implementation details
    const String & getTypeName() const override
    {
        static String type_name("unified");
        return type_name;
    }

    bool equals(ISchedulerNode *) override
    {
        throw Exception(ErrorCodes::LOGICAL_ERROR, "UnifiedSchedulerNode should not be used with CustomResourceManager");
    }

    /// Attaches an immediate child (used through `reparent()`)
    void attachChild(const SchedulerNodePtr & child_) override
    {
        immediate_child = child_;
        immediate_child->setParent(this);

        // Activate if required
        if (immediate_child->isActive())
            activateChild(immediate_child.get());
    }

    /// Removes an immediate child (used through `reparent()`)
    void removeChild(ISchedulerNode * child) override
    {
        if (immediate_child.get() == child)
        {
            child_active = false; // deactivate
            immediate_child->setParent(nullptr); // detach
            immediate_child.reset();
        }
    }

    ISchedulerNode * getChild(const String & child_name) override
    {
        if (immediate_child->basename == child_name)
            return immediate_child.get();
        else
            return nullptr;
    }

    std::pair<ResourceRequest *, bool> dequeueRequest() override
    {
        auto [request, child_now_active] = immediate_child->dequeueRequest();
        if (!request)
            return {nullptr, false};

        child_active = child_now_active;
        if (!child_active)
            busy_periods++;
        incrementDequeued(request->cost);
        return {request, child_active};
    }

    bool isActive() override
    {
        return child_active;
    }

    /// Shows the number of active immediate children (for introspection)
    size_t activeChildren() override
    {
        return child_active;
    }

    /// Activate an immediate child
    void activateChild(ISchedulerNode * child) override
    {
        if (child == immediate_child.get())
            if (!std::exchange(child_active, true) && parent)
                parent->activateChild(this);
    }

private:
    ConstraintsBranch impl;
    SchedulerNodePtr immediate_child; // An immediate child (actually the root of the whole subtree)
    bool child_active = false;
};

}
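A minimal construction sketch (assumptions: default settings, running in the scheduler thread that owns `event_queue`; not part of the actual sources) of how the SQL example from IOResourceManager.h maps onto `attachUnifiedChild()`:

void buildExampleHierarchy(EventQueue * event_queue)
{
    auto all = std::make_shared<UnifiedSchedulerNode>(event_queue, SchedulingSettings{});
    auto production = std::make_shared<UnifiedSchedulerNode>(event_queue, SchedulingSettings{});
    auto development = std::make_shared<UnifiedSchedulerNode>(event_queue, SchedulingSettings{});
    production->basename = "production";
    development->basename = "development";
    all->attachUnifiedChild(production);  // first child: "all" drops its fifo queue
    all->attachUnifiedChild(development); // second child: a FairPolicy ("p0_fair") is inserted
    chassert(production->getQueue() != nullptr); // leaf workloads keep queues
    chassert(all->getQueue() == nullptr);        // inner workloads do not
}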
@ -1,15 +0,0 @@
#include <Common/Scheduler/Nodes/registerResourceManagers.h>
#include <Common/Scheduler/ResourceManagerFactory.h>

namespace DB
{

void registerDynamicResourceManager(ResourceManagerFactory &);

void registerResourceManagers()
{
    auto & factory = ResourceManagerFactory::instance();
    registerDynamicResourceManager(factory);
}

}
@ -1,8 +0,0 @@
#pragma once

namespace DB
{

void registerResourceManagers();

}
@ -1,5 +1,8 @@
|
||||
#pragma once
|
||||
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
#include <Common/Scheduler/SchedulingSettings.h>
|
||||
#include <Common/Scheduler/IResourceManager.h>
|
||||
#include <Common/Scheduler/SchedulerRoot.h>
|
||||
#include <Common/Scheduler/ResourceGuard.h>
|
||||
@ -7,26 +10,35 @@
|
||||
#include <Common/Scheduler/Nodes/PriorityPolicy.h>
|
||||
#include <Common/Scheduler/Nodes/FifoQueue.h>
|
||||
#include <Common/Scheduler/Nodes/SemaphoreConstraint.h>
|
||||
#include <Common/Scheduler/Nodes/UnifiedSchedulerNode.h>
|
||||
#include <Common/Scheduler/Nodes/registerSchedulerNodes.h>
|
||||
#include <Common/Scheduler/Nodes/registerResourceManagers.h>
|
||||
|
||||
#include <Poco/Util/XMLConfiguration.h>
|
||||
|
||||
#include <atomic>
|
||||
#include <barrier>
|
||||
#include <exception>
|
||||
#include <functional>
|
||||
#include <memory>
|
||||
#include <unordered_map>
|
||||
#include <mutex>
|
||||
#include <set>
|
||||
#include <sstream>
|
||||
#include <utility>
|
||||
|
||||
namespace DB
|
||||
{
|
||||
|
||||
namespace ErrorCodes
|
||||
{
|
||||
extern const int RESOURCE_ACCESS_DENIED;
|
||||
}
|
||||
|
||||
struct ResourceTestBase
|
||||
{
|
||||
ResourceTestBase()
|
||||
{
|
||||
[[maybe_unused]] static bool typesRegistered = [] { registerSchedulerNodes(); registerResourceManagers(); return true; }();
|
||||
[[maybe_unused]] static bool typesRegistered = [] { registerSchedulerNodes(); return true; }();
|
||||
}
|
||||
|
||||
template <class TClass>
|
||||
@ -37,10 +49,16 @@ struct ResourceTestBase
|
||||
Poco::AutoPtr config{new Poco::Util::XMLConfiguration(stream)};
|
||||
String config_prefix = "node";
|
||||
|
||||
return add<TClass>(event_queue, root_node, path, std::ref(*config), config_prefix);
|
||||
}
|
||||
|
||||
template <class TClass, class... Args>
|
||||
static TClass * add(EventQueue * event_queue, SchedulerNodePtr & root_node, const String & path, Args... args)
|
||||
{
|
||||
if (path == "/")
|
||||
{
|
||||
EXPECT_TRUE(root_node.get() == nullptr);
|
||||
root_node.reset(new TClass(event_queue, *config, config_prefix));
|
||||
root_node.reset(new TClass(event_queue, std::forward<Args>(args)...));
|
||||
return static_cast<TClass *>(root_node.get());
|
||||
}
|
||||
|
||||
@ -65,73 +83,114 @@ struct ResourceTestBase
|
||||
}
|
||||
|
||||
EXPECT_TRUE(!child_name.empty()); // wrong path
|
||||
SchedulerNodePtr node = std::make_shared<TClass>(event_queue, *config, config_prefix);
|
||||
SchedulerNodePtr node = std::make_shared<TClass>(event_queue, std::forward<Args>(args)...);
|
||||
node->basename = child_name;
|
||||
parent->attachChild(node);
|
||||
return static_cast<TClass *>(node.get());
|
||||
}
|
||||
};
|
||||
|
||||
|
||||
struct ConstraintTest : public SemaphoreConstraint
|
||||
{
|
||||
explicit ConstraintTest(EventQueue * event_queue_, const Poco::Util::AbstractConfiguration & config = emptyConfig(), const String & config_prefix = {})
|
||||
: SemaphoreConstraint(event_queue_, config, config_prefix)
|
||||
{}
|
||||
|
||||
std::pair<ResourceRequest *, bool> dequeueRequest() override
|
||||
{
|
||||
auto [request, active] = SemaphoreConstraint::dequeueRequest();
|
||||
if (request)
|
||||
{
|
||||
std::unique_lock lock(mutex);
|
||||
requests.insert(request);
|
||||
}
|
||||
return {request, active};
|
||||
}
|
||||
|
||||
void finishRequest(ResourceRequest * request) override
|
||||
{
|
||||
{
|
||||
std::unique_lock lock(mutex);
|
||||
requests.erase(request);
|
||||
}
|
||||
SemaphoreConstraint::finishRequest(request);
|
||||
}
|
||||
|
||||
std::mutex mutex;
|
||||
std::set<ResourceRequest *> requests;
|
||||
};
|
||||
|
||||
class ResourceTestClass : public ResourceTestBase
|
||||
{
|
||||
struct Request : public ResourceRequest
|
||||
{
|
||||
ResourceTestClass * test;
|
||||
String name;
|
||||
|
||||
Request(ResourceCost cost_, const String & name_)
|
||||
Request(ResourceTestClass * test_, ResourceCost cost_, const String & name_)
|
||||
: ResourceRequest(cost_)
|
||||
, test(test_)
|
||||
, name(name_)
|
||||
{}
|
||||
|
||||
void execute() override
|
||||
{
|
||||
}
|
||||
|
||||
void failed(const std::exception_ptr &) override
|
||||
{
|
||||
test->failed_cost += cost;
|
||||
delete this;
|
||||
}
|
||||
};

public:
    ~ResourceTestClass()
    {
        if (root_node)
            dequeue(); // Just to avoid any leaks of `Request` objects
    }

    template <class TClass>
    void add(const String & path, const String & xml = {})
    {
        ResourceTestBase::add<TClass>(&event_queue, root_node, path, xml);
    }

    template <class TClass, class... Args>
    void addCustom(const String & path, Args... args)
    {
        ResourceTestBase::add<TClass>(&event_queue, root_node, path, std::forward<Args>(args)...);
    }

    UnifiedSchedulerNodePtr createUnifiedNode(const String & basename, const SchedulingSettings & settings = {})
    {
        return createUnifiedNode(basename, {}, settings);
    }

    UnifiedSchedulerNodePtr createUnifiedNode(const String & basename, const UnifiedSchedulerNodePtr & parent, const SchedulingSettings & settings = {})
    {
        auto node = std::make_shared<UnifiedSchedulerNode>(&event_queue, settings);
        node->basename = basename;
        if (parent)
        {
            parent->attachUnifiedChild(node);
        }
        else
        {
            EXPECT_TRUE(root_node.get() == nullptr);
            root_node = node;
        }
        return node;
    }

    // Updates the parent and/or scheduling settings for a specified `node`.
    // The unit test implementation must make sure that all needed queues and constraints are not going to be destroyed.
    // Normally it is the responsibility of IOResourceManager, but we do not use it here, so manual version control is required.
    // (see IOResourceManager::Resource::updateCurrentVersion() for details)
    void updateUnifiedNode(const UnifiedSchedulerNodePtr & node, const UnifiedSchedulerNodePtr & old_parent, const UnifiedSchedulerNodePtr & new_parent, const SchedulingSettings & new_settings)
    {
        EXPECT_TRUE((old_parent && new_parent) || (!old_parent && !new_parent)); // changing the root node is not supported
        bool detached = false;
        if (UnifiedSchedulerNode::updateRequiresDetach(
                old_parent ? old_parent->basename : "",
                new_parent ? new_parent->basename : "",
                node->getSettings(),
                new_settings))
        {
            if (old_parent)
                old_parent->detachUnifiedChild(node);
            detached = true;
        }

        node->updateSchedulingSettings(new_settings);

        if (detached && new_parent)
            new_parent->attachUnifiedChild(node);
    }

    void enqueue(const UnifiedSchedulerNodePtr & node, const std::vector<ResourceCost> & costs)
    {
        enqueueImpl(node->getQueue().get(), costs, node->basename);
    }

    void enqueue(const String & path, const std::vector<ResourceCost> & costs)
    {
        ASSERT_TRUE(root_node.get() != nullptr); // root should be initialized first
        ISchedulerNode * node = root_node.get();
        size_t pos = 1;
-       while (pos < path.length())
+       while (node && pos < path.length())
        {
            size_t slash = path.find('/', pos);
            if (slash != String::npos)

@@ -146,13 +205,17 @@ public:

                pos = String::npos;
            }
        }
-       ISchedulerQueue * queue = dynamic_cast<ISchedulerQueue *>(node);
-       ASSERT_TRUE(queue != nullptr); // not a queue
+       if (node)
+           enqueueImpl(dynamic_cast<ISchedulerQueue *>(node), costs);
    }

    void enqueueImpl(ISchedulerQueue * queue, const std::vector<ResourceCost> & costs, const String & name = {})
    {
        ASSERT_TRUE(queue != nullptr); // not a queue
        if (!queue)
            return; // to make clang-analyzer-core.NonNullParamChecker happy
        for (ResourceCost cost : costs)
        {
-           queue->enqueueRequest(new Request(cost, queue->basename));
+           queue->enqueueRequest(new Request(this, cost, name.empty() ? queue->basename : name));
        }
        processEvents(); // to activate queues
    }

@@ -208,6 +271,12 @@ public:

        consumed_cost[name] -= value;
    }

    void failed(ResourceCost value)
    {
        EXPECT_EQ(failed_cost, value);
        failed_cost -= value;
    }

    void processEvents()
    {
        while (event_queue.tryProcess()) {}

@@ -217,8 +286,11 @@ private:

    EventQueue event_queue;
    SchedulerNodePtr root_node;
    std::unordered_map<String, ResourceCost> consumed_cost;
    ResourceCost failed_cost = 0;
};

enum EnqueueOnlyEnum { EnqueueOnly };

template <class TManager>
struct ResourceTestManager : public ResourceTestBase
{

@@ -230,16 +302,49 @@ struct ResourceTestManager : public ResourceTestBase

    struct Guard : public ResourceGuard
    {
        ResourceTestManager & t;
        ResourceCost cost;

-       Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost)
-           : ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost, Lock::Defer)
+       /// Works like regular ResourceGuard, ready for consumption after constructor
+       Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost_)
+           : ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost_, Lock::Defer)
            , t(t_)
            , cost(cost_)
        {
            t.onEnqueue(link);
            waitExecute();
        }

        /// Just enqueue resource request, do not block (needed for tests to sync). Call `waitExecute()` afterwards.
        Guard(ResourceTestManager & t_, ResourceLink link_, ResourceCost cost_, EnqueueOnlyEnum)
            : ResourceGuard(ResourceGuard::Metrics::getIOWrite(), link_, cost_, Lock::Defer)
            , t(t_)
            , cost(cost_)
        {
            t.onEnqueue(link);
        }

        /// Waits for ResourceRequest::execute() to be called for the enqueued request
        void waitExecute()
        {
            lock();
            t.onExecute(link);
            consume(cost);
        }

        /// Waits for ResourceRequest::failed() to be called for the enqueued request
        void waitFailed(const String & pattern)
        {
            try
            {
                lock();
                FAIL();
            }
            catch (Exception & e)
            {
                ASSERT_EQ(e.code(), ErrorCodes::RESOURCE_ACCESS_DENIED);
                ASSERT_TRUE(e.message().contains(pattern));
            }
        }
    };

    struct TItem

@@ -264,10 +369,24 @@ struct ResourceTestManager : public ResourceTestBase

        , busy_period(thread_count)
    {}

    enum DoNotInitManagerEnum { DoNotInitManager };

    explicit ResourceTestManager(size_t thread_count, DoNotInitManagerEnum)
        : busy_period(thread_count)
    {}

    ~ResourceTestManager()
    {
        wait();
    }

    void wait()
    {
        for (auto & thread : threads)
-           thread.join();
+       {
+           if (thread.joinable())
+               thread.join();
+       }
    }

    void update(const String & xml)
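
The two-phase Guard above is the main synchronization device in these tests: a thread can enqueue a request without blocking and only later wait for the scheduler to execute or fail it. A minimal sketch of the intended usage, assuming a ResourceTestManager `t` and a ResourceLink `link` as in the surrounding code (illustrative, not part of the diff):

    // Enqueue without blocking, so the test can arrange ordering with barriers first.
    TestGuard g(t, link, /* cost = */ 1, EnqueueOnly);
    // ... synchronize with other threads here ...
    g.waitExecute(); // blocks until the scheduler dequeues the request, then consumes `cost`
    // alternatively: g.waitFailed("pattern"); // expects the request to be rejected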
@@ -2,15 +2,15 @@

#include <Common/Scheduler/Nodes/tests/ResourceTest.h>

-#include <Common/Scheduler/Nodes/DynamicResourceManager.h>
+#include <Common/Scheduler/Nodes/CustomResourceManager.h>
#include <Poco/Util/XMLConfiguration.h>

using namespace DB;

-using ResourceTest = ResourceTestManager<DynamicResourceManager>;
+using ResourceTest = ResourceTestManager<CustomResourceManager>;
using TestGuard = ResourceTest::Guard;

-TEST(SchedulerDynamicResourceManager, Smoke)
+TEST(SchedulerCustomResourceManager, Smoke)
{
    ResourceTest t;

@@ -31,25 +31,25 @@ TEST(SchedulerDynamicResourceManager, Smoke)

    </clickhouse>
    )CONFIG");

-   ClassifierPtr cA = t.manager->acquire("A");
-   ClassifierPtr cB = t.manager->acquire("B");
+   ClassifierPtr c_a = t.manager->acquire("A");
+   ClassifierPtr c_b = t.manager->acquire("B");

    for (int i = 0; i < 10; i++)
    {
-       ResourceGuard gA(ResourceGuard::Metrics::getIOWrite(), cA->get("res1"), 1, ResourceGuard::Lock::Defer);
-       gA.lock();
-       gA.consume(1);
-       gA.unlock();
+       ResourceGuard g_a(ResourceGuard::Metrics::getIOWrite(), c_a->get("res1"), 1, ResourceGuard::Lock::Defer);
+       g_a.lock();
+       g_a.consume(1);
+       g_a.unlock();

-       ResourceGuard gB(ResourceGuard::Metrics::getIOWrite(), cB->get("res1"));
-       gB.unlock();
+       ResourceGuard g_b(ResourceGuard::Metrics::getIOWrite(), c_b->get("res1"));
+       g_b.unlock();

-       ResourceGuard gC(ResourceGuard::Metrics::getIORead(), cB->get("res1"));
-       gB.consume(2);
+       ResourceGuard g_c(ResourceGuard::Metrics::getIORead(), c_b->get("res1"));
+       g_b.consume(2);
    }
}

-TEST(SchedulerDynamicResourceManager, Fairness)
+TEST(SchedulerCustomResourceManager, Fairness)
{
    // Total cost for A and B cannot differ by more than 1 (every request has cost equal to 1).
    // Requests from A use `value = 1` and requests from B use `value = -1`.

@@ -13,6 +13,12 @@ public:

        , log(log_)
    {}

+   const String & getTypeName() const override
+   {
+       static String type_name("fake");
+       return type_name;
+   }

    void attachChild(const SchedulerNodePtr & child) override
    {
        log += " +" + child->basename;

src/Common/Scheduler/Nodes/tests/gtest_io_resource_manager.cpp (new file, 335 lines)
@@ -0,0 +1,335 @@

#include <gtest/gtest.h>

#include <Core/Defines.h>
#include <Core/Settings.h>

#include <Common/Scheduler/Nodes/tests/ResourceTest.h>
#include <Common/Scheduler/Workload/WorkloadEntityStorageBase.h>
#include <Common/Scheduler/Nodes/IOResourceManager.h>

#include <Interpreters/Context.h>

#include <Parsers/parseQuery.h>
#include <Parsers/ASTCreateWorkloadQuery.h>
#include <Parsers/ASTCreateResourceQuery.h>
#include <Parsers/ASTDropWorkloadQuery.h>
#include <Parsers/ASTDropResourceQuery.h>
#include <Parsers/ParserCreateWorkloadQuery.h>
#include <Parsers/ParserCreateResourceQuery.h>
#include <Parsers/ParserDropWorkloadQuery.h>
#include <Parsers/ParserDropResourceQuery.h>

using namespace DB;

class WorkloadEntityTestStorage : public WorkloadEntityStorageBase
{
public:
    WorkloadEntityTestStorage()
        : WorkloadEntityStorageBase(Context::getGlobalContextInstance())
    {}

    void loadEntities() override {}

    void executeQuery(const String & query)
    {
        ParserCreateWorkloadQuery create_workload_p;
        ParserDropWorkloadQuery drop_workload_p;
        ParserCreateResourceQuery create_resource_p;
        ParserDropResourceQuery drop_resource_p;

        auto parse = [&] (IParser & parser)
        {
            String error;
            const char * end = query.data();
            return tryParseQuery(
                parser,
                end,
                query.data() + query.size(),
                error,
                false,
                "",
                false,
                0,
                DBMS_DEFAULT_MAX_PARSER_DEPTH,
                DBMS_DEFAULT_MAX_PARSER_BACKTRACKS,
                true);
        };

        if (ASTPtr create_workload = parse(create_workload_p))
        {
            auto & parsed = create_workload->as<ASTCreateWorkloadQuery &>();
            auto workload_name = parsed.getWorkloadName();
            bool throw_if_exists = !parsed.if_not_exists && !parsed.or_replace;
            bool replace_if_exists = parsed.or_replace;

            storeEntity(
                nullptr,
                WorkloadEntityType::Workload,
                workload_name,
                create_workload,
                throw_if_exists,
                replace_if_exists,
                {});
        }
        else if (ASTPtr create_resource = parse(create_resource_p))
        {
            auto & parsed = create_resource->as<ASTCreateResourceQuery &>();
            auto resource_name = parsed.getResourceName();
            bool throw_if_exists = !parsed.if_not_exists && !parsed.or_replace;
            bool replace_if_exists = parsed.or_replace;

            storeEntity(
                nullptr,
                WorkloadEntityType::Resource,
                resource_name,
                create_resource,
                throw_if_exists,
                replace_if_exists,
                {});
        }
        else if (ASTPtr drop_workload = parse(drop_workload_p))
        {
            auto & parsed = drop_workload->as<ASTDropWorkloadQuery &>();
            bool throw_if_not_exists = !parsed.if_exists;
            removeEntity(
                nullptr,
                WorkloadEntityType::Workload,
                parsed.workload_name,
                throw_if_not_exists);
        }
        else if (ASTPtr drop_resource = parse(drop_resource_p))
        {
            auto & parsed = drop_resource->as<ASTDropResourceQuery &>();
            bool throw_if_not_exists = !parsed.if_exists;
            removeEntity(
                nullptr,
                WorkloadEntityType::Resource,
                parsed.resource_name,
                throw_if_not_exists);
        }
        else
            throw Exception(ErrorCodes::LOGICAL_ERROR, "Invalid query in WorkloadEntityTestStorage: {}", query);
    }

private:
    WorkloadEntityStorageBase::OperationResult storeEntityImpl(
        const ContextPtr & current_context,
        WorkloadEntityType entity_type,
        const String & entity_name,
        ASTPtr create_entity_query,
        bool throw_if_exists,
        bool replace_if_exists,
        const Settings & settings) override
    {
        UNUSED(current_context, entity_type, entity_name, create_entity_query, throw_if_exists, replace_if_exists, settings);
        return OperationResult::Ok;
    }

    WorkloadEntityStorageBase::OperationResult removeEntityImpl(
        const ContextPtr & current_context,
        WorkloadEntityType entity_type,
        const String & entity_name,
        bool throw_if_not_exists) override
    {
        UNUSED(current_context, entity_type, entity_name, throw_if_not_exists);
        return OperationResult::Ok;
    }
};

struct ResourceTest : ResourceTestManager<IOResourceManager>
{
    WorkloadEntityTestStorage storage;

    explicit ResourceTest(size_t thread_count = 1)
        : ResourceTestManager(thread_count, DoNotInitManager)
    {
        manager = std::make_shared<IOResourceManager>(storage);
    }

    void query(const String & query_str)
    {
        storage.executeQuery(query_str);
    }

    template <class Func>
    void async(const String & workload, Func func)
    {
        threads.emplace_back([=, this, func2 = std::move(func)]
        {
            ClassifierPtr classifier = manager->acquire(workload);
            func2(classifier);
        });
    }

    template <class Func>
    void async(const String & workload, const String & resource, Func func)
    {
        threads.emplace_back([=, this, func2 = std::move(func)]
        {
            ClassifierPtr classifier = manager->acquire(workload);
            ResourceLink link = classifier->get(resource);
            func2(link);
        });
    }
};

using TestGuard = ResourceTest::Guard;

TEST(SchedulerIOResourceManager, Smoke)
{
    ResourceTest t;

    t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)");
    t.query("CREATE WORKLOAD all SETTINGS max_requests = 10");
    t.query("CREATE WORKLOAD A in all");
    t.query("CREATE WORKLOAD B in all SETTINGS weight = 3");

    ClassifierPtr c_a = t.manager->acquire("A");
    ClassifierPtr c_b = t.manager->acquire("B");

    for (int i = 0; i < 10; i++)
    {
        ResourceGuard g_a(ResourceGuard::Metrics::getIOWrite(), c_a->get("res1"), 1, ResourceGuard::Lock::Defer);
        g_a.lock();
        g_a.consume(1);
        g_a.unlock();

        ResourceGuard g_b(ResourceGuard::Metrics::getIOWrite(), c_b->get("res1"));
        g_b.unlock();

        ResourceGuard g_c(ResourceGuard::Metrics::getIORead(), c_b->get("res1"));
        g_b.consume(2);
    }
}
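
The four CREATE queries above imply a small scheduling hierarchy. A sketch of one reading of it (an illustration, not produced by the test itself):

    // resource res1 serves both READ and WRITE for `disk`
    //
    //        all  (max_requests = 10)
    //       /   \
    //      A     B (weight = 3)
    //
    // Under contention, weight = 3 should give workload B roughly three times
    // the share of workload A; the Fairness test below checks the equal-weight case.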

TEST(SchedulerIOResourceManager, Fairness)
{
    // Total cost for A and B cannot differ by more than 1 (every request has cost equal to 1).
    // Requests from A use `value = 1` and requests from B use `value = -1`.
    std::atomic<Int64> unfairness = 0;
    auto fairness_diff = [&] (Int64 value)
    {
        Int64 cur_unfairness = unfairness.fetch_add(value, std::memory_order_relaxed) + value;
        EXPECT_NEAR(cur_unfairness, 0, 1);
    };

    constexpr size_t threads_per_queue = 2;
    int requests_per_thread = 100;
    ResourceTest t(2 * threads_per_queue + 1);

    t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)");
    t.query("CREATE WORKLOAD all SETTINGS max_requests = 1");
    t.query("CREATE WORKLOAD A IN all");
    t.query("CREATE WORKLOAD B IN all");
    t.query("CREATE WORKLOAD leader IN all");

    for (int thread = 0; thread < threads_per_queue; thread++)
    {
        t.threads.emplace_back([&]
        {
            ClassifierPtr c = t.manager->acquire("A");
            ResourceLink link = c->get("res1");
            t.startBusyPeriod(link, 1, requests_per_thread);
            for (int request = 0; request < requests_per_thread; request++)
            {
                TestGuard g(t, link, 1);
                fairness_diff(1);
            }
        });
    }

    for (int thread = 0; thread < threads_per_queue; thread++)
    {
        t.threads.emplace_back([&]
        {
            ClassifierPtr c = t.manager->acquire("B");
            ResourceLink link = c->get("res1");
            t.startBusyPeriod(link, 1, requests_per_thread);
            for (int request = 0; request < requests_per_thread; request++)
            {
                TestGuard g(t, link, 1);
                fairness_diff(-1);
            }
        });
    }

    ClassifierPtr c = t.manager->acquire("leader");
    ResourceLink link = c->get("res1");
    t.blockResource(link);

    t.wait(); // Wait for threads to finish before destructing locals
}
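
A self-contained sketch of the invariant this test checks (illustrative only, not part of the diff): with max_requests = 1 only one request is in flight at a time, and a fair scheduler with equal weights alternates between the A and B queues, so the running sum of +1/-1 samples stays within the same ±1 bound used by EXPECT_NEAR above.

    #include <cassert>
    #include <cstdint>

    int main()
    {
        // Simulate a fair scheduler alternating between queues A (+1) and B (-1),
        // one request in flight at a time (max_requests = 1).
        int64_t unfairness = 0;
        for (int i = 0; i < 200; ++i)
        {
            unfairness += (i % 2 == 0) ? 1 : -1; // A, B, A, B, ...
            assert(unfairness >= -1 && unfairness <= 1); // the same ±1 bound as EXPECT_NEAR
        }
    }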

TEST(SchedulerIOResourceManager, DropNotEmptyQueue)
{
    ResourceTest t;

    t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)");
    t.query("CREATE WORKLOAD all SETTINGS max_requests = 1");
    t.query("CREATE WORKLOAD intermediate IN all");

    std::barrier sync_before_enqueue(2);
    std::barrier sync_before_drop(3);
    std::barrier sync_after_drop(2);
    t.async("intermediate", "res1", [&] (ResourceLink link)
    {
        TestGuard g(t, link, 1);
        sync_before_enqueue.arrive_and_wait();
        sync_before_drop.arrive_and_wait(); // 1st resource request is consuming
        sync_after_drop.arrive_and_wait(); // 1st resource request is still consuming
    });

    sync_before_enqueue.arrive_and_wait(); // to maintain correct order of resource requests

    t.async("intermediate", "res1", [&] (ResourceLink link)
    {
        TestGuard g(t, link, 1, EnqueueOnly);
        sync_before_drop.arrive_and_wait(); // 2nd resource request is enqueued
        g.waitFailed("is about to be destructed");
    });

    sync_before_drop.arrive_and_wait(); // main thread triggers FifoQueue destruction by adding a unified child
    t.query("CREATE WORKLOAD leaf IN intermediate");
    sync_after_drop.arrive_and_wait();

    t.wait(); // Wait for threads to finish before destructing locals
}

TEST(SchedulerIOResourceManager, DropNotEmptyQueueLong)
{
    ResourceTest t;

    t.query("CREATE RESOURCE res1 (WRITE DISK disk, READ DISK disk)");
    t.query("CREATE WORKLOAD all SETTINGS max_requests = 1");
    t.query("CREATE WORKLOAD intermediate IN all");

    static constexpr int queue_size = 100;
    std::barrier sync_before_enqueue(2);
    std::barrier sync_before_drop(2 + queue_size);
    std::barrier sync_after_drop(2);
    t.async("intermediate", "res1", [&] (ResourceLink link)
    {
        TestGuard g(t, link, 1);
        sync_before_enqueue.arrive_and_wait();
        sync_before_drop.arrive_and_wait(); // 1st resource request is consuming
        sync_after_drop.arrive_and_wait(); // 1st resource request is still consuming
    });

    sync_before_enqueue.arrive_and_wait(); // to maintain correct order of resource requests

    for (int i = 0; i < queue_size; i++)
    {
        t.async("intermediate", "res1", [&] (ResourceLink link)
        {
            TestGuard g(t, link, 1, EnqueueOnly);
            sync_before_drop.arrive_and_wait(); // many resource requests are enqueued
            g.waitFailed("is about to be destructed");
        });
    }

    sync_before_drop.arrive_and_wait(); // main thread triggers FifoQueue destruction by adding a unified child
    t.query("CREATE WORKLOAD leaf IN intermediate");
    sync_after_drop.arrive_and_wait();

    t.wait(); // Wait for threads to finish before destructing locals
}

@@ -8,18 +8,17 @@ using namespace DB;

using ResourceTest = ResourceTestClass;

-/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678

-TEST(DISABLED_SchedulerFairPolicy, Factory)
+TEST(SchedulerFairPolicy, Factory)
{
    ResourceTest t;

    Poco::AutoPtr cfg = new Poco::Util::XMLConfiguration();
-   SchedulerNodePtr fair = SchedulerNodeFactory::instance().get("fair", /* event_queue = */ nullptr, *cfg, "");
+   EventQueue event_queue;
+   SchedulerNodePtr fair = SchedulerNodeFactory::instance().get("fair", &event_queue, *cfg, "");
    EXPECT_TRUE(dynamic_cast<FairPolicy *>(fair.get()) != nullptr);
}

-TEST(DISABLED_SchedulerFairPolicy, FairnessWeights)
+TEST(SchedulerFairPolicy, FairnessWeights)
{
    ResourceTest t;

@@ -43,7 +42,7 @@ TEST(DISABLED_SchedulerFairPolicy, FairnessWeights)

    t.consumed("B", 20);
}

-TEST(DISABLED_SchedulerFairPolicy, Activation)
+TEST(SchedulerFairPolicy, Activation)
{
    ResourceTest t;

@@ -79,7 +78,7 @@ TEST(DISABLED_SchedulerFairPolicy, Activation)

    t.consumed("B", 10);
}

-TEST(DISABLED_SchedulerFairPolicy, FairnessMaxMin)
+TEST(SchedulerFairPolicy, FairnessMaxMin)
{
    ResourceTest t;

@@ -103,7 +102,7 @@ TEST(DISABLED_SchedulerFairPolicy, FairnessMaxMin)

    t.consumed("A", 20);
}

-TEST(DISABLED_SchedulerFairPolicy, HierarchicalFairness)
+TEST(SchedulerFairPolicy, HierarchicalFairness)
{
    ResourceTest t;

@@ -8,18 +8,17 @@ using namespace DB;

using ResourceTest = ResourceTestClass;

-/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678

-TEST(DISABLED_SchedulerPriorityPolicy, Factory)
+TEST(SchedulerPriorityPolicy, Factory)
{
    ResourceTest t;

    Poco::AutoPtr cfg = new Poco::Util::XMLConfiguration();
-   SchedulerNodePtr prio = SchedulerNodeFactory::instance().get("priority", /* event_queue = */ nullptr, *cfg, "");
+   EventQueue event_queue;
+   SchedulerNodePtr prio = SchedulerNodeFactory::instance().get("priority", &event_queue, *cfg, "");
    EXPECT_TRUE(dynamic_cast<PriorityPolicy *>(prio.get()) != nullptr);
}

-TEST(DISABLED_SchedulerPriorityPolicy, Priorities)
+TEST(SchedulerPriorityPolicy, Priorities)
{
    ResourceTest t;

@@ -53,7 +52,7 @@ TEST(DISABLED_SchedulerPriorityPolicy, Priorities)

    t.consumed("C", 0);
}

-TEST(DISABLED_SchedulerPriorityPolicy, Activation)
+TEST(SchedulerPriorityPolicy, Activation)
{
    ResourceTest t;

@@ -94,7 +93,7 @@ TEST(DISABLED_SchedulerPriorityPolicy, Activation)

    t.consumed("C", 0);
}

-TEST(DISABLED_SchedulerPriorityPolicy, SinglePriority)
+TEST(SchedulerPriorityPolicy, SinglePriority)
{
    ResourceTest t;

@@ -1,5 +1,6 @@

#include <gtest/gtest.h>

+#include <Common/Scheduler/Nodes/SemaphoreConstraint.h>
#include <Common/Scheduler/Nodes/tests/ResourceTest.h>

#include <Common/Scheduler/SchedulerRoot.h>

@@ -101,6 +102,11 @@ struct MyRequest : public ResourceRequest

        if (on_execute)
            on_execute();
    }

+   void failed(const std::exception_ptr &) override
+   {
+       FAIL();
+   }
};

TEST(SchedulerRoot, Smoke)

@@ -108,14 +114,14 @@ TEST(SchedulerRoot, Smoke)

    ResourceTest t;

    ResourceHolder r1(t);
-   auto * fc1 = r1.add<ConstraintTest>("/", "<max_requests>1</max_requests>");
+   auto * fc1 = r1.add<SemaphoreConstraint>("/", "<max_requests>1</max_requests>");
    r1.add<PriorityPolicy>("/prio");
    auto a = r1.addQueue("/prio/A", "<priority>1</priority>");
    auto b = r1.addQueue("/prio/B", "<priority>2</priority>");
    r1.registerResource();

    ResourceHolder r2(t);
-   auto * fc2 = r2.add<ConstraintTest>("/", "<max_requests>1</max_requests>");
+   auto * fc2 = r2.add<SemaphoreConstraint>("/", "<max_requests>1</max_requests>");
    r2.add<PriorityPolicy>("/prio");
    auto c = r2.addQueue("/prio/C", "<priority>-1</priority>");
    auto d = r2.addQueue("/prio/D", "<priority>-2</priority>");

@@ -123,25 +129,25 @@ TEST(SchedulerRoot, Smoke)

    {
        ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), a);
-       EXPECT_TRUE(fc1->requests.contains(&rg.request));
+       EXPECT_TRUE(fc1->getInflights().first == 1);
        rg.consume(1);
    }

    {
        ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), b);
-       EXPECT_TRUE(fc1->requests.contains(&rg.request));
+       EXPECT_TRUE(fc1->getInflights().first == 1);
        rg.consume(1);
    }

    {
        ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), c);
-       EXPECT_TRUE(fc2->requests.contains(&rg.request));
+       EXPECT_TRUE(fc2->getInflights().first == 1);
        rg.consume(1);
    }

    {
        ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), d);
-       EXPECT_TRUE(fc2->requests.contains(&rg.request));
+       EXPECT_TRUE(fc2->getInflights().first == 1);
        rg.consume(1);
    }
}

@@ -151,7 +157,7 @@ TEST(SchedulerRoot, Budget)

    ResourceTest t;

    ResourceHolder r1(t);
-   r1.add<ConstraintTest>("/", "<max_requests>1</max_requests>");
+   r1.add<SemaphoreConstraint>("/", "<max_requests>1</max_requests>");
    r1.add<PriorityPolicy>("/prio");
    auto a = r1.addQueue("/prio/A", "");
    r1.registerResource();

@@ -176,7 +182,7 @@ TEST(SchedulerRoot, Cancel)

    ResourceTest t;

    ResourceHolder r1(t);
-   auto * fc1 = r1.add<ConstraintTest>("/", "<max_requests>1</max_requests>");
+   auto * fc1 = r1.add<SemaphoreConstraint>("/", "<max_requests>1</max_requests>");
    r1.add<PriorityPolicy>("/prio");
    auto a = r1.addQueue("/prio/A", "<priority>1</priority>");
    auto b = r1.addQueue("/prio/B", "<priority>2</priority>");

@@ -189,7 +195,7 @@ TEST(SchedulerRoot, Cancel)

    MyRequest request(1,[&]
    {
        sync.arrive_and_wait(); // (A)
-       EXPECT_TRUE(fc1->requests.contains(&request));
+       EXPECT_TRUE(fc1->getInflights().first == 1);
        sync.arrive_and_wait(); // (B)
        request.finish();
        destruct_sync.arrive_and_wait(); // (C)

@@ -214,5 +220,5 @@ TEST(SchedulerRoot, Cancel)

    consumer1.join();
    consumer2.join();

-   EXPECT_TRUE(fc1->requests.empty());
+   EXPECT_TRUE(fc1->getInflights().first == 0);
}

@@ -10,9 +10,7 @@ using namespace DB;

using ResourceTest = ResourceTestClass;

-/// Tests disabled because of leaks in the test themselves: https://github.com/ClickHouse/ClickHouse/issues/67678

-TEST(DISABLED_SchedulerThrottlerConstraint, LeakyBucketConstraint)
+TEST(SchedulerThrottlerConstraint, LeakyBucketConstraint)
{
    ResourceTest t;
    EventQueue::TimePoint start = std::chrono::system_clock::now();

@@ -42,7 +40,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, LeakyBucketConstraint)

    t.consumed("A", 10);
}

-TEST(DISABLED_SchedulerThrottlerConstraint, Unlimited)
+TEST(SchedulerThrottlerConstraint, Unlimited)
{
    ResourceTest t;
    EventQueue::TimePoint start = std::chrono::system_clock::now();

@@ -59,7 +57,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, Unlimited)

    }
}

-TEST(DISABLED_SchedulerThrottlerConstraint, Pacing)
+TEST(SchedulerThrottlerConstraint, Pacing)
{
    ResourceTest t;
    EventQueue::TimePoint start = std::chrono::system_clock::now();

@@ -79,7 +77,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, Pacing)

    }
}

-TEST(DISABLED_SchedulerThrottlerConstraint, BucketFilling)
+TEST(SchedulerThrottlerConstraint, BucketFilling)
{
    ResourceTest t;
    EventQueue::TimePoint start = std::chrono::system_clock::now();

@@ -113,7 +111,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, BucketFilling)

    t.consumed("A", 3);
}

-TEST(DISABLED_SchedulerThrottlerConstraint, PeekAndAvgLimits)
+TEST(SchedulerThrottlerConstraint, PeekAndAvgLimits)
{
    ResourceTest t;
    EventQueue::TimePoint start = std::chrono::system_clock::now();

@@ -141,7 +139,7 @@ TEST(DISABLED_SchedulerThrottlerConstraint, PeekAndAvgLimits)

    }
}

-TEST(DISABLED_SchedulerThrottlerConstraint, ThrottlerAndFairness)
+TEST(SchedulerThrottlerConstraint, ThrottlerAndFairness)
{
    ResourceTest t;
    EventQueue::TimePoint start = std::chrono::system_clock::now();

@@ -160,22 +158,22 @@ TEST(DISABLED_SchedulerThrottlerConstraint, ThrottlerAndFairness)

        t.enqueue("/fair/B", {req_cost});
    }

-   double shareA = 0.1;
-   double shareB = 0.9;
+   double share_a = 0.1;
+   double share_b = 0.9;

    // Bandwidth-latency coupling due to fairness: worst latency is inversely proportional to share
-   auto max_latencyA = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / shareA));
-   auto max_latencyB = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / shareB));
+   auto max_latency_a = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share_a));
+   auto max_latency_b = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share_b));

-   double consumedA = 0;
-   double consumedB = 0;
+   double consumed_a = 0;
+   double consumed_b = 0;
    for (int seconds = 0; seconds < 100; seconds++)
    {
        t.process(start + std::chrono::seconds(seconds));
        double arrival_curve = 100.0 + 10.0 * seconds + req_cost;
-       t.consumed("A", static_cast<ResourceCost>(arrival_curve * shareA - consumedA), max_latencyA);
-       t.consumed("B", static_cast<ResourceCost>(arrival_curve * shareB - consumedB), max_latencyB);
-       consumedA = arrival_curve * shareA;
-       consumedB = arrival_curve * shareB;
+       t.consumed("A", static_cast<ResourceCost>(arrival_curve * share_a - consumed_a), max_latency_a);
+       t.consumed("B", static_cast<ResourceCost>(arrival_curve * share_b - consumed_b), max_latency_b);
+       consumed_a = arrival_curve * share_a;
+       consumed_b = arrival_curve * share_b;
    }
}
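
To make the latency bound above concrete (a worked example, not part of the diff): with req_cost = 1, the share-0.1 queue is bounded by 1 * (1 + 1/0.1) = 11 cost units, while the share-0.9 queue gets 1 * (1 + 1/0.9) ≈ 2.11; the smaller a queue's share, the longer its requests can wait behind the other queue's traffic.

    #include <cstdio>

    int main()
    {
        // Worst-case latency bound used above: req_cost * (1 + 1 / share)
        double req_cost = 1.0;
        for (double share : {0.1, 0.9})
            std::printf("share %.1f -> max latency %.2f\n", share, req_cost * (1.0 + 1.0 / share)); // 11.00 and 2.11
    }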
|
||||
|
@ -0,0 +1,748 @@
|
||||
#include <chrono>
|
||||
#include <gtest/gtest.h>
|
||||
|
||||
#include <Common/Scheduler/ResourceGuard.h>
|
||||
#include <Common/Scheduler/ResourceLink.h>
|
||||
#include <Common/Scheduler/Nodes/tests/ResourceTest.h>
|
||||
|
||||
#include <Common/Priority.h>
|
||||
#include <Common/Scheduler/Nodes/FairPolicy.h>
|
||||
#include <Common/Scheduler/Nodes/UnifiedSchedulerNode.h>
|
||||
|
||||
using namespace DB;
|
||||
|
||||
using ResourceTest = ResourceTestClass;
|
||||
|
||||
TEST(SchedulerUnifiedNode, Smoke)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
t.addCustom<UnifiedSchedulerNode>("/", SchedulingSettings{});
|
||||
|
||||
t.enqueue("/fifo", {10, 10});
|
||||
t.dequeue(2);
|
||||
t.consumed("fifo", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, FairnessWeight)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}});
|
||||
auto b = t.createUnifiedNode("B", all, {.weight = 3.0, .priority = Priority{}});
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("A", 10);
|
||||
t.consumed("B", 30);
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("A", 10);
|
||||
t.consumed("B", 30);
|
||||
|
||||
t.dequeue();
|
||||
t.consumed("A", 60);
|
||||
t.consumed("B", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, FairnessActivation)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all);
|
||||
auto b = t.createUnifiedNode("B", all);
|
||||
auto c = t.createUnifiedNode("C", all);
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10});
|
||||
t.enqueue(c, {10, 10});
|
||||
|
||||
t.dequeue(3);
|
||||
t.consumed("A", 10);
|
||||
t.consumed("B", 10);
|
||||
t.consumed("C", 10);
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("A", 30);
|
||||
t.consumed("B", 0);
|
||||
t.consumed("C", 10);
|
||||
|
||||
t.enqueue(b, {10, 10});
|
||||
t.dequeue(1);
|
||||
t.consumed("B", 10);
|
||||
|
||||
t.enqueue(c, {10, 10});
|
||||
t.dequeue(1);
|
||||
t.consumed("C", 10);
|
||||
|
||||
t.dequeue(2); // A B or B A
|
||||
t.consumed("A", 10);
|
||||
t.consumed("B", 10);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, FairnessMaxMin)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all);
|
||||
auto b = t.createUnifiedNode("B", all);
|
||||
|
||||
t.enqueue(a, {10, 10}); // make sure A is never empty
|
||||
|
||||
for (int i = 0; i < 10; i++)
|
||||
{
|
||||
t.enqueue(a, {10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10});
|
||||
|
||||
t.dequeue(6);
|
||||
t.consumed("A", 40);
|
||||
t.consumed("B", 20);
|
||||
}
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, FairnessHierarchical)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto x = t.createUnifiedNode("X", all);
|
||||
auto y = t.createUnifiedNode("Y", all);
|
||||
auto a = t.createUnifiedNode("A", x);
|
||||
auto b = t.createUnifiedNode("B", x);
|
||||
auto c = t.createUnifiedNode("C", y);
|
||||
auto d = t.createUnifiedNode("D", y);
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
for (int i = 0; i < 4; i++)
|
||||
{
|
||||
t.dequeue(8);
|
||||
t.consumed("A", 20);
|
||||
t.consumed("B", 20);
|
||||
t.consumed("C", 20);
|
||||
t.consumed("D", 20);
|
||||
}
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
for (int i = 0; i < 4; i++)
|
||||
{
|
||||
t.dequeue(8);
|
||||
t.consumed("A", 40);
|
||||
t.consumed("C", 20);
|
||||
t.consumed("D", 20);
|
||||
}
|
||||
|
||||
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
for (int i = 0; i < 4; i++)
|
||||
{
|
||||
t.dequeue(8);
|
||||
t.consumed("B", 40);
|
||||
t.consumed("C", 20);
|
||||
t.consumed("D", 20);
|
||||
}
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(c, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
for (int i = 0; i < 4; i++)
|
||||
{
|
||||
t.dequeue(8);
|
||||
t.consumed("A", 20);
|
||||
t.consumed("B", 20);
|
||||
t.consumed("C", 40);
|
||||
}
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
for (int i = 0; i < 4; i++)
|
||||
{
|
||||
t.dequeue(8);
|
||||
t.consumed("A", 20);
|
||||
t.consumed("B", 20);
|
||||
t.consumed("D", 40);
|
||||
}
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(d, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
for (int i = 0; i < 4; i++)
|
||||
{
|
||||
t.dequeue(8);
|
||||
t.consumed("A", 40);
|
||||
t.consumed("D", 40);
|
||||
}
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, Priority)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all, {.priority = Priority{3}});
|
||||
auto b = t.createUnifiedNode("B", all, {.priority = Priority{2}});
|
||||
auto c = t.createUnifiedNode("C", all, {.priority = Priority{1}});
|
||||
|
||||
t.enqueue(a, {10, 10, 10});
|
||||
t.enqueue(b, {10, 10, 10});
|
||||
t.enqueue(c, {10, 10, 10});
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 0);
|
||||
t.consumed("B", 0);
|
||||
t.consumed("C", 20);
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 0);
|
||||
t.consumed("B", 10);
|
||||
t.consumed("C", 10);
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 0);
|
||||
t.consumed("B", 20);
|
||||
t.consumed("C", 0);
|
||||
|
||||
t.dequeue();
|
||||
t.consumed("A", 30);
|
||||
t.consumed("B", 0);
|
||||
t.consumed("C", 0);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, PriorityActivation)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all, {.priority = Priority{3}});
|
||||
auto b = t.createUnifiedNode("B", all, {.priority = Priority{2}});
|
||||
auto c = t.createUnifiedNode("C", all, {.priority = Priority{1}});
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10});
|
||||
t.enqueue(c, {10, 10});
|
||||
|
||||
t.dequeue(3);
|
||||
t.consumed("A", 0);
|
||||
t.consumed("B", 10);
|
||||
t.consumed("C", 20);
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 20);
|
||||
t.consumed("B", 0);
|
||||
t.consumed("C", 0);
|
||||
|
||||
t.enqueue(b, {10, 10, 10});
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 0);
|
||||
t.consumed("B", 20);
|
||||
t.consumed("C", 0);
|
||||
|
||||
t.enqueue(c, {10, 10});
|
||||
t.dequeue(3);
|
||||
t.consumed("A", 0);
|
||||
t.consumed("B", 10);
|
||||
t.consumed("C", 20);
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 20);
|
||||
t.consumed("B", 0);
|
||||
t.consumed("C", 0);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, List)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
std::list<UnifiedSchedulerNodePtr> list;
|
||||
list.push_back(t.createUnifiedNode("all"));
|
||||
|
||||
for (int length = 1; length < 5; length++)
|
||||
{
|
||||
String name = fmt::format("L{}", length);
|
||||
list.push_back(t.createUnifiedNode(name, list.back()));
|
||||
|
||||
for (int i = 0; i < 3; i++)
|
||||
{
|
||||
t.enqueue(list.back(), {10, 10});
|
||||
t.dequeue(1);
|
||||
t.consumed(name, 10);
|
||||
|
||||
for (int j = 0; j < 3; j++)
|
||||
{
|
||||
t.enqueue(list.back(), {10, 10, 10});
|
||||
t.dequeue(1);
|
||||
t.consumed(name, 10);
|
||||
t.dequeue(1);
|
||||
t.consumed(name, 10);
|
||||
t.dequeue(1);
|
||||
t.consumed(name, 10);
|
||||
}
|
||||
|
||||
t.dequeue(1);
|
||||
t.consumed(name, 10);
|
||||
}
|
||||
}
|
||||
}

TEST(SchedulerUnifiedNode, ThrottlerLeakyBucket)
{
    ResourceTest t;
    EventQueue::TimePoint start = std::chrono::system_clock::now();
    t.process(start, 0);

    auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 20.0});

    t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10});

    t.process(start + std::chrono::seconds(0));
    t.consumed("all", 30); // It is allowed to go below zero for exactly one resource request

    t.process(start + std::chrono::seconds(1));
    t.consumed("all", 10);

    t.process(start + std::chrono::seconds(2));
    t.consumed("all", 10);

    t.process(start + std::chrono::seconds(3));
    t.consumed("all", 10);

    t.process(start + std::chrono::seconds(4));
    t.consumed("all", 10);

    t.process(start + std::chrono::seconds(100500));
    t.consumed("all", 10);
}
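
A minimal sketch of the token-bucket accounting these expectations rely on (illustrative only, not the scheduler's actual implementation): tokens refill at max_speed up to max_burst, and a request is admitted while the balance is non-negative, so one request may drive it below zero. That is why the first call above consumes 30: the 20-token burst plus one extra 10-cost request.

    #include <algorithm>
    #include <cassert>

    struct TokenBucket
    {
        double tokens;      // may go below zero by at most one request
        double max_speed;   // tokens per second
        double max_burst;   // capacity

        void refill(double seconds) { tokens = std::min(max_burst, tokens + max_speed * seconds); }
        bool allowed() const { return tokens >= 0; } // a non-negative balance admits a request
        void consume(double cost) { tokens -= cost; }
    };

    int main()
    {
        TokenBucket bucket{.tokens = 20.0, .max_speed = 10.0, .max_burst = 20.0};
        double consumed = 0;
        for (int i = 0; i < 4 && bucket.allowed(); ++i) // requests of cost 10
        {
            bucket.consume(10);
            consumed += 10;
        }
        assert(consumed == 30); // 20 of burst + one request allowed to go below zero
    }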
|
||||
|
||||
TEST(SchedulerUnifiedNode, ThrottlerPacing)
|
||||
{
|
||||
ResourceTest t;
|
||||
EventQueue::TimePoint start = std::chrono::system_clock::now();
|
||||
t.process(start, 0);
|
||||
|
||||
// Zero burst allows you to send one request of any `size` and than throttle for `size/max_speed` seconds.
|
||||
// Useful if outgoing traffic should be "paced", i.e. have the least possible burstiness.
|
||||
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 1.0, .max_burst = 0.0});
|
||||
|
||||
t.enqueue(all, {1, 2, 3, 1, 2, 1});
|
||||
int output[] = {1, 2, 0, 3, 0, 0, 1, 2, 0, 1, 0};
|
||||
for (int i = 0; i < std::size(output); i++)
|
||||
{
|
||||
t.process(start + std::chrono::seconds(i));
|
||||
t.consumed("all", output[i]);
|
||||
}
|
||||
}
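
A worked reading of the `output` schedule above (a simplified model of the zero-burst semantics described in the comment, which happens to reproduce the expected array; not the scheduler's code): each request of cost c is followed by c/max_speed = c seconds of throttling, so costs {1, 2, 3, 1, 2, 1} unfold second by second as 1, 2, 0, 3, 0, 0, 1, 2, 0, 1, 0.

    #include <cstdio>
    #include <vector>

    int main()
    {
        // Pace requests at max_speed = 1 with zero burst: a request of cost c
        // is admitted only when the previous one has been "paid off" (c seconds later).
        std::vector<int> costs = {1, 2, 3, 1, 2, 1};
        size_t next = 0;
        int pay_off_at = 0; // second at which the throttler becomes idle again
        for (int second = 0; second <= 10; ++second)
        {
            int consumed = 0;
            while (next < costs.size() && second >= pay_off_at)
            {
                consumed += costs[next];
                pay_off_at = second + costs[next]; // throttle for cost/max_speed seconds
                ++next;
            }
            std::printf("second %d: consumed %d\n", second, consumed); // matches output[]
        }
    }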
|
||||
|
||||
TEST(SchedulerUnifiedNode, ThrottlerBucketFilling)
|
||||
{
|
||||
ResourceTest t;
|
||||
EventQueue::TimePoint start = std::chrono::system_clock::now();
|
||||
t.process(start, 0);
|
||||
|
||||
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0});
|
||||
|
||||
t.enqueue(all, {100});
|
||||
|
||||
t.process(start + std::chrono::seconds(0));
|
||||
t.consumed("all", 100); // consume all tokens, but it is still active (not negative)
|
||||
|
||||
t.process(start + std::chrono::seconds(5));
|
||||
t.consumed("all", 0); // There was nothing to consume
|
||||
|
||||
t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.process(start + std::chrono::seconds(5));
|
||||
t.consumed("all", 60); // 5 sec * 10 tokens/sec = 50 tokens + 1 extra request to go below zero
|
||||
|
||||
t.process(start + std::chrono::seconds(100));
|
||||
t.consumed("all", 40); // Consume rest
|
||||
|
||||
t.process(start + std::chrono::seconds(200));
|
||||
|
||||
t.enqueue(all, {95, 1, 1, 1, 1, 1, 1, 1, 1, 1});
|
||||
t.process(start + std::chrono::seconds(200));
|
||||
t.consumed("all", 101); // check we cannot consume more than max_burst + 1 request
|
||||
|
||||
t.process(start + std::chrono::seconds(100500));
|
||||
t.consumed("all", 3);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, ThrottlerAndFairness)
|
||||
{
|
||||
ResourceTest t;
|
||||
EventQueue::TimePoint start = std::chrono::system_clock::now();
|
||||
t.process(start, 0);
|
||||
|
||||
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0});
|
||||
auto a = t.createUnifiedNode("A", all, {.weight = 10.0, .priority = Priority{}});
|
||||
auto b = t.createUnifiedNode("B", all, {.weight = 90.0, .priority = Priority{}});
|
||||
|
||||
ResourceCost req_cost = 1;
|
||||
ResourceCost total_cost = 2000;
|
||||
for (int i = 0; i < total_cost / req_cost; i++)
|
||||
{
|
||||
t.enqueue(a, {req_cost});
|
||||
t.enqueue(b, {req_cost});
|
||||
}
|
||||
|
||||
double share_a = 0.1;
|
||||
double share_b = 0.9;
|
||||
|
||||
// Bandwidth-latency coupling due to fairness: worst latency is inversely proportional to share
|
||||
auto max_latency_a = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share_a));
|
||||
auto max_latency_b = static_cast<ResourceCost>(req_cost * (1.0 + 1.0 / share_b));
|
||||
|
||||
double consumed_a = 0;
|
||||
double consumed_b = 0;
|
||||
for (int seconds = 0; seconds < 100; seconds++)
|
||||
{
|
||||
t.process(start + std::chrono::seconds(seconds));
|
||||
double arrival_curve = 100.0 + 10.0 * seconds + req_cost;
|
||||
t.consumed("A", static_cast<ResourceCost>(arrival_curve * share_a - consumed_a), max_latency_a);
|
||||
t.consumed("B", static_cast<ResourceCost>(arrival_curve * share_b - consumed_b), max_latency_b);
|
||||
consumed_a = arrival_curve * share_a;
|
||||
consumed_b = arrival_curve * share_b;
|
||||
}
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, QueueWithRequestsDestruction)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
|
||||
t.enqueue(all, {10, 10}); // enqueue reqeuests to be canceled
|
||||
|
||||
// This will destroy the queue and fail both requests
|
||||
auto a = t.createUnifiedNode("A", all);
|
||||
t.failed(20);
|
||||
|
||||
// Check that everything works fine after destruction
|
||||
auto b = t.createUnifiedNode("B", all);
|
||||
t.enqueue(a, {10, 10}); // make sure A is never empty
|
||||
for (int i = 0; i < 10; i++)
|
||||
{
|
||||
t.enqueue(a, {10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10});
|
||||
|
||||
t.dequeue(6);
|
||||
t.consumed("A", 40);
|
||||
t.consumed("B", 20);
|
||||
}
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, ResourceGuardException)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
|
||||
t.enqueue(all, {10, 10}); // enqueue reqeuests to be canceled
|
||||
|
||||
std::thread consumer([queue = all->getQueue()]
|
||||
{
|
||||
ResourceLink link{.queue = queue.get()};
|
||||
bool caught = false;
|
||||
try
|
||||
{
|
||||
ResourceGuard rg(ResourceGuard::Metrics::getIOWrite(), link);
|
||||
}
|
||||
catch (...)
|
||||
{
|
||||
caught = true;
|
||||
}
|
||||
ASSERT_TRUE(caught);
|
||||
});
|
||||
|
||||
// This will destroy the queue and fail both requests
|
||||
auto a = t.createUnifiedNode("A", all);
|
||||
t.failed(20);
|
||||
consumer.join();
|
||||
|
||||
// Check that everything works fine after destruction
|
||||
auto b = t.createUnifiedNode("B", all);
|
||||
t.enqueue(a, {10, 10}); // make sure A is never empty
|
||||
for (int i = 0; i < 10; i++)
|
||||
{
|
||||
t.enqueue(a, {10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10});
|
||||
|
||||
t.dequeue(6);
|
||||
t.consumed("A", 40);
|
||||
t.consumed("B", 20);
|
||||
}
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, UpdateWeight)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}});
|
||||
auto b = t.createUnifiedNode("B", all, {.weight = 3.0, .priority = Priority{}});
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("A", 10);
|
||||
t.consumed("B", 30);
|
||||
|
||||
t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{}});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("A", 20);
|
||||
t.consumed("B", 20);
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("A", 20);
|
||||
t.consumed("B", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, UpdatePriority)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{}});
|
||||
auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{}});
|
||||
|
||||
t.enqueue(a, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(b, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 10);
|
||||
t.consumed("B", 10);
|
||||
|
||||
t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{-1}});
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 20);
|
||||
t.consumed("B", 0);
|
||||
|
||||
t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{-2}});
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 0);
|
||||
t.consumed("B", 20);
|
||||
|
||||
t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{-2}});
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("A", 10);
|
||||
t.consumed("B", 10);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, UpdateParentOfLeafNode)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}});
|
||||
auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}});
|
||||
auto x = t.createUnifiedNode("X", a, {});
|
||||
auto y = t.createUnifiedNode("Y", b, {});
|
||||
|
||||
t.enqueue(x, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(y, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("X", 20);
|
||||
t.consumed("Y", 0);
|
||||
|
||||
t.updateUnifiedNode(x, a, b, {});
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("X", 10);
|
||||
t.consumed("Y", 10);
|
||||
|
||||
t.updateUnifiedNode(y, b, a, {});
|
||||
|
||||
t.dequeue(2);
|
||||
t.consumed("X", 0);
|
||||
t.consumed("Y", 20);
|
||||
|
||||
t.updateUnifiedNode(y, a, all, {});
|
||||
t.updateUnifiedNode(x, b, all, {});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("X", 20);
|
||||
t.consumed("Y", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, UpdatePriorityOfIntermediateNode)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}});
|
||||
auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}});
|
||||
auto x1 = t.createUnifiedNode("X1", a, {});
|
||||
auto y1 = t.createUnifiedNode("Y1", b, {});
|
||||
auto x2 = t.createUnifiedNode("X2", a, {});
|
||||
auto y2 = t.createUnifiedNode("Y2", b, {});
|
||||
|
||||
t.enqueue(x1, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(y1, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(x2, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(y2, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("X1", 20);
|
||||
t.consumed("Y1", 0);
|
||||
t.consumed("X2", 20);
|
||||
t.consumed("Y2", 0);
|
||||
|
||||
t.updateUnifiedNode(a, all, all, {.weight = 1.0, .priority = Priority{2}});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("X1", 10);
|
||||
t.consumed("Y1", 10);
|
||||
t.consumed("X2", 10);
|
||||
t.consumed("Y2", 10);
|
||||
|
||||
t.updateUnifiedNode(b, all, all, {.weight = 1.0, .priority = Priority{1}});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("X1", 0);
|
||||
t.consumed("Y1", 20);
|
||||
t.consumed("X2", 0);
|
||||
t.consumed("Y2", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, UpdateParentOfIntermediateNode)
|
||||
{
|
||||
ResourceTest t;
|
||||
|
||||
auto all = t.createUnifiedNode("all");
|
||||
auto a = t.createUnifiedNode("A", all, {.weight = 1.0, .priority = Priority{1}});
|
||||
auto b = t.createUnifiedNode("B", all, {.weight = 1.0, .priority = Priority{2}});
|
||||
auto c = t.createUnifiedNode("C", a, {});
|
||||
auto d = t.createUnifiedNode("D", b, {});
|
||||
auto x1 = t.createUnifiedNode("X1", c, {});
|
||||
auto y1 = t.createUnifiedNode("Y1", d, {});
|
||||
auto x2 = t.createUnifiedNode("X2", c, {});
|
||||
auto y2 = t.createUnifiedNode("Y2", d, {});
|
||||
|
||||
t.enqueue(x1, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(y1, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(x2, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.enqueue(y2, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("X1", 20);
|
||||
t.consumed("Y1", 0);
|
||||
t.consumed("X2", 20);
|
||||
t.consumed("Y2", 0);
|
||||
|
||||
t.updateUnifiedNode(c, a, b, {});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("X1", 10);
|
||||
t.consumed("Y1", 10);
|
||||
t.consumed("X2", 10);
|
||||
t.consumed("Y2", 10);
|
||||
|
||||
t.updateUnifiedNode(d, b, a, {});
|
||||
|
||||
t.dequeue(4);
|
||||
t.consumed("X1", 0);
|
||||
t.consumed("Y1", 20);
|
||||
t.consumed("X2", 0);
|
||||
t.consumed("Y2", 20);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, UpdateThrottlerMaxSpeed)
|
||||
{
|
||||
ResourceTest t;
|
||||
EventQueue::TimePoint start = std::chrono::system_clock::now();
|
||||
t.process(start, 0);
|
||||
|
||||
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 20.0});
|
||||
|
||||
t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10});
|
||||
|
||||
t.process(start + std::chrono::seconds(0));
|
||||
t.consumed("all", 30); // It is allowed to go below zero for exactly one resource request
|
||||
|
||||
t.process(start + std::chrono::seconds(1));
|
||||
t.consumed("all", 10);
|
||||
|
||||
t.process(start + std::chrono::seconds(2));
|
||||
t.consumed("all", 10);
|
||||
|
||||
t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 1.0, .max_burst = 20.0});
|
||||
|
||||
t.process(start + std::chrono::seconds(12));
|
||||
t.consumed("all", 10);
|
||||
|
||||
t.process(start + std::chrono::seconds(22));
|
||||
t.consumed("all", 10);
|
||||
|
||||
t.process(start + std::chrono::seconds(100500));
|
||||
t.consumed("all", 10);
|
||||
}
|
||||
|
||||
TEST(SchedulerUnifiedNode, UpdateThrottlerMaxBurst)
|
||||
{
|
||||
ResourceTest t;
|
||||
EventQueue::TimePoint start = std::chrono::system_clock::now();
|
||||
t.process(start, 0);
|
||||
|
||||
auto all = t.createUnifiedNode("all", {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0});
|
||||
|
||||
t.enqueue(all, {100});
|
||||
|
||||
t.process(start + std::chrono::seconds(0));
|
||||
t.consumed("all", 100); // consume all tokens, but it is still active (not negative)
|
||||
|
||||
t.process(start + std::chrono::seconds(2));
|
||||
t.consumed("all", 0); // There was nothing to consume
|
||||
t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 10.0, .max_burst = 30.0});
|
||||
|
||||
t.process(start + std::chrono::seconds(5));
|
||||
t.consumed("all", 0); // There was nothing to consume
|
||||
|
||||
t.enqueue(all, {10, 10, 10, 10, 10, 10, 10, 10, 10, 10});
|
||||
t.process(start + std::chrono::seconds(5));
|
||||
t.consumed("all", 40); // min(30 tokens, 5 sec * 10 tokens/sec) = 30 tokens + 1 extra request to go below zero
|
||||
|
||||
t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 10.0, .max_burst = 100.0});
|
||||
|
||||
t.process(start + std::chrono::seconds(100));
|
||||
t.consumed("all", 60); // Consume rest
|
||||
|
||||
t.process(start + std::chrono::seconds(150));
|
||||
t.updateUnifiedNode(all, {}, {}, {.priority = Priority{}, .max_speed = 100.0, .max_burst = 200.0});
|
||||
|
||||
t.process(start + std::chrono::seconds(200));
|
||||
|
||||
t.enqueue(all, {195, 1, 1, 1, 1, 1, 1, 1, 1, 1});
|
||||
t.process(start + std::chrono::seconds(200));
|
||||
t.consumed("all", 201); // check we cannot consume more than max_burst + 1 request
|
||||
|
||||
t.process(start + std::chrono::seconds(100500));
|
||||
t.consumed("all", 3);
|
||||
}
|
@ -12,6 +12,7 @@
|
||||
#include <Common/CurrentMetrics.h>
|
||||
|
||||
#include <condition_variable>
|
||||
#include <exception>
|
||||
#include <mutex>
|
||||
|
||||
|
||||
@ -34,6 +35,11 @@ namespace CurrentMetrics
|
||||
namespace DB
|
||||
{
|
||||
|
||||
namespace ErrorCodes
|
||||
{
|
||||
extern const int RESOURCE_ACCESS_DENIED;
|
||||
}
|
||||
|
||||
/*
|
||||
* Scoped resource guard.
|
||||
* Waits for resource to be available in constructor and releases resource in destructor
|
||||
@ -109,12 +115,25 @@ public:
|
||||
dequeued_cv.notify_one();
|
||||
}
|
||||
|
||||
// This function is executed inside scheduler thread and wakes thread that issued this `request`.
|
||||
// That thread will throw an exception.
|
||||
void failed(const std::exception_ptr & ptr) override
|
||||
{
|
||||
std::unique_lock lock(mutex);
|
||||
chassert(state == Enqueued);
|
||||
state = Dequeued;
|
||||
exception = ptr;
|
||||
dequeued_cv.notify_one();
|
||||
}
|
||||
|
||||
void wait()
|
||||
{
|
||||
CurrentMetrics::Increment scheduled(metrics->scheduled_count);
|
||||
auto timer = CurrentThread::getProfileEvents().timer(metrics->wait_microseconds);
|
||||
std::unique_lock lock(mutex);
|
||||
dequeued_cv.wait(lock, [this] { return state == Dequeued; });
|
||||
if (exception)
|
||||
throw Exception(ErrorCodes::RESOURCE_ACCESS_DENIED, "Resource request failed: {}", getExceptionMessage(exception, /* with_stacktrace = */ false));
|
||||
}
|
||||
|
||||
void finish(ResourceCost real_cost_, ResourceLink link_)
|
||||
@ -151,6 +170,7 @@ public:
|
||||
std::mutex mutex;
|
||||
std::condition_variable dequeued_cv;
|
||||
RequestState state = Finished;
|
||||
std::exception_ptr exception;
|
||||
};
|
||||
|
||||
/// Creates pending request for resource; blocks while resource is not available (unless `Lock::Defer`)

@@ -1,55 +0,0 @@

#pragma once

#include <Common/ErrorCodes.h>
#include <Common/Exception.h>

#include <Common/Scheduler/IResourceManager.h>

#include <boost/noncopyable.hpp>

#include <memory>
#include <mutex>
#include <unordered_map>

namespace DB
{

namespace ErrorCodes
{
    extern const int INVALID_SCHEDULER_NODE;
}

class ResourceManagerFactory : private boost::noncopyable
{
public:
    static ResourceManagerFactory & instance()
    {
        static ResourceManagerFactory ret;
        return ret;
    }

    ResourceManagerPtr get(const String & name)
    {
        std::lock_guard lock{mutex};
        if (auto iter = methods.find(name); iter != methods.end())
            return iter->second();
        throw Exception(ErrorCodes::INVALID_SCHEDULER_NODE, "Unknown scheduler node type: {}", name);
    }

    template <class TDerived>
    void registerMethod(const String & name)
    {
        std::lock_guard lock{mutex};
        methods[name] = [] ()
        {
            return std::make_shared<TDerived>();
        };
    }

private:
    std::mutex mutex;
    using Method = std::function<ResourceManagerPtr()>;
    std::unordered_map<String, Method> methods;
};

}
|
@@ -1,13 +1,34 @@
#include <Common/Scheduler/ResourceRequest.h>
#include <Common/Scheduler/ISchedulerConstraint.h>

#include <Common/Exception.h>

#include <ranges>

namespace DB
{

void ResourceRequest::finish()
{
    if (constraint)
        constraint->finishRequest(this);
    // Iterate over constraints in reverse order
    for (ISchedulerConstraint * constraint : std::ranges::reverse_view(constraints))
    {
        if (constraint)
            constraint->finishRequest(this);
    }
}

bool ResourceRequest::addConstraint(ISchedulerConstraint * new_constraint)
{
    for (auto & constraint : constraints)
    {
        if (!constraint)
        {
            constraint = new_constraint;
            return true;
        }
    }
    return false;
}

}
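finish() above releases constraints in reverse acquisition order, while addConstraint() fills the first free slot of a fixed-size array so the scheduler thread never allocates. A minimal sketch of that chain with hypothetical standalone types (the real ISchedulerConstraint carries much more):

#include <array>
#include <cstddef>
#include <ranges>

struct Request;
struct Constraint
{
    virtual ~Constraint() = default;
    virtual void finishRequest(Request *) = 0;
};

constexpr size_t MaxConstraints = 8; // mirrors ResourceMaxConstraints

struct Request
{
    std::array<Constraint *, MaxConstraints> constraints{}; // value-initialized to nullptr

    bool addConstraint(Constraint * c) // fill the first free slot; no allocation
    {
        for (auto & slot : constraints)
        {
            if (!slot)
            {
                slot = c;
                return true;
            }
        }
        return false; // chain is full
    }

    void finish() // release in reverse acquisition (LIFO) order
    {
        for (Constraint * c : std::views::reverse(constraints))
            if (c)
                c->finishRequest(this);
    }
};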
@@ -2,7 +2,9 @@

#include <boost/intrusive/list.hpp>
#include <base/types.h>
#include <array>
#include <limits>
#include <exception>

namespace DB
{
@@ -15,6 +17,9 @@ class ISchedulerConstraint;
using ResourceCost = Int64;
constexpr ResourceCost ResourceCostMax = std::numeric_limits<int>::max();

/// Max number of constraints for a request to pass through (depth of constraints chain)
constexpr size_t ResourceMaxConstraints = 8;

/*
 * Request for a resource consumption. The main moving part of the scheduling subsystem.
 * Resource requests processing workflow:
@@ -39,8 +44,7 @@ constexpr ResourceCost ResourceCostMax = std::numeric_limits<int>::max();
 *
 * Request can also be canceled before (3) using ISchedulerQueue::cancelRequest().
 * Returning false means it is too late for the request to be canceled. It should be processed in a regular way.
 * Returning true means successful cancel and therefore steps (4) and (5) are not going to happen
 * and step (6) MUST be omitted.
 * Returning true means successful cancel and therefore steps (4) and (5) are not going to happen.
 */
class ResourceRequest : public boost::intrusive::list_base_hook<>
{
@@ -49,9 +53,10 @@ public:
    /// NOTE: If cost is not known in advance, ResourceBudget should be used (note that every ISchedulerQueue has it)
    ResourceCost cost;

    /// Scheduler node to be notified on consumption finish
    /// Auto-filled during request enqueue/dequeue
    ISchedulerConstraint * constraint;
    /// Scheduler nodes to be notified on consumption finish
    /// Auto-filled during request dequeue
    /// Vector is not used to avoid allocations in the scheduler thread
    std::array<ISchedulerConstraint *, ResourceMaxConstraints> constraints;

    explicit ResourceRequest(ResourceCost cost_ = 1)
    {
@@ -62,7 +67,8 @@ public:
    void reset(ResourceCost cost_)
    {
        cost = cost_;
        constraint = nullptr;
        for (auto & constraint : constraints)
            constraint = nullptr;
        // Note that list_base_hook should be reset independently (by intrusive list)
    }

@@ -74,11 +80,18 @@ public:
    /// (e.g. setting an std::promise or creating a job in a thread pool)
    virtual void execute() = 0;

    /// Callback to trigger an error in case the resource is unavailable.
    virtual void failed(const std::exception_ptr & ptr) = 0;

    /// Stop resource consumption and notify resource scheduler.
    /// Should be called when resource consumption is finished by the consumer.
    /// ResourceRequest should not be destructed or reset before calling `finish()`.
    /// WARNING: this function MUST NOT be called if the request was canceled.
    /// It is okay to call finish() even for failed and canceled requests (it will be a no-op)
    void finish();

    /// Is called from the scheduler thread to fill the `constraints` chain
    /// Returns `true` iff the constraint was added successfully
    bool addConstraint(ISchedulerConstraint * new_constraint);
};

}
@@ -28,27 +28,27 @@ namespace ErrorCodes
 * Resource scheduler root node with a dedicated thread.
 * Immediate children correspond to different resources.
 */
class SchedulerRoot : public ISchedulerNode
class SchedulerRoot final : public ISchedulerNode
{
private:
    struct TResource
    struct Resource
    {
        SchedulerNodePtr root;

        // Intrusive cyclic list of active resources
        TResource * next = nullptr;
        TResource * prev = nullptr;
        Resource * next = nullptr;
        Resource * prev = nullptr;

        explicit TResource(const SchedulerNodePtr & root_)
        explicit Resource(const SchedulerNodePtr & root_)
            : root(root_)
        {
            root->info.parent.ptr = this;
        }

        // Get pointer stored by ctor in info
        static TResource * get(SchedulerNodeInfo & info)
        static Resource * get(SchedulerNodeInfo & info)
        {
            return reinterpret_cast<TResource *>(info.parent.ptr);
            return reinterpret_cast<Resource *>(info.parent.ptr);
        }
    };

@@ -60,6 +60,8 @@ public:
    ~SchedulerRoot() override
    {
        stop();
        while (!children.empty())
            removeChild(children.begin()->first);
    }

    /// Runs separate scheduler thread
@@ -95,6 +97,12 @@ public:
        }
    }

    const String & getTypeName() const override
    {
        static String type_name("scheduler");
        return type_name;
    }

    bool equals(ISchedulerNode * other) override
    {
        if (!ISchedulerNode::equals(other))
@@ -179,16 +187,11 @@ public:

    void activateChild(ISchedulerNode * child) override
    {
        activate(TResource::get(child->info));
    }

    void setParent(ISchedulerNode *) override
    {
        abort(); // scheduler must be the root and this function should not be called
        activate(Resource::get(child->info));
    }

private:
    void activate(TResource * value)
    void activate(Resource * value)
    {
        assert(value->next == nullptr && value->prev == nullptr);
        if (current == nullptr) // No active children
@@ -206,7 +209,7 @@ private:
        }
    }

    void deactivate(TResource * value)
    void deactivate(Resource * value)
    {
        if (value->next == nullptr)
            return; // Already deactivated
@@ -251,8 +254,8 @@ private:
        request->execute();
    }

    TResource * current = nullptr; // round-robin pointer
    std::unordered_map<ISchedulerNode *, TResource> children; // resources by pointer
    Resource * current = nullptr; // round-robin pointer
    std::unordered_map<ISchedulerNode *, Resource> children; // resources by pointer
    std::atomic<bool> stop_flag = false;
    EventQueue events;
    ThreadFromGlobalPool scheduler;
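SchedulerRoot serves resources round-robin by rotating a `current` pointer over the intrusive cyclic list formed by Resource::next/prev. A minimal model of that ring (illustrative only, without the intrusive bookkeeping of the real class):

struct Node
{
    Node * next = nullptr;
    Node * prev = nullptr;
};

struct RoundRobin
{
    Node * current = nullptr;

    void activate(Node * value) // insert before current, i.e. last in service order
    {
        if (!current) // empty ring: node points at itself
        {
            value->prev = value->next = value;
            current = value;
        }
        else
        {
            value->prev = current->prev;
            value->next = current;
            value->prev->next = value;
            current->prev = value;
        }
    }

    Node * pick() // serve current, then rotate to the next active node
    {
        Node * result = current;
        if (current)
            current = current->next;
        return result;
    }
};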
130 src/Common/Scheduler/SchedulingSettings.cpp Normal file
@@ -0,0 +1,130 @@
#include <limits>
#include <Common/Scheduler/SchedulingSettings.h>
#include <Common/Scheduler/ISchedulerNode.h>
#include <Parsers/ASTSetQuery.h>

namespace DB
{

namespace ErrorCodes
{
    extern const int BAD_ARGUMENTS;
}

void SchedulingSettings::updateFromChanges(const ASTCreateWorkloadQuery::SettingsChanges & changes, const String & resource_name)
{
    struct {
        std::optional<Float64> new_weight;
        std::optional<Priority> new_priority;
        std::optional<Float64> new_max_speed;
        std::optional<Float64> new_max_burst;
        std::optional<Int64> new_max_requests;
        std::optional<Int64> new_max_cost;

        static Float64 getNotNegativeFloat64(const String & name, const Field & field)
        {
            {
                UInt64 val;
                if (field.tryGet(val))
                    return static_cast<Float64>(val); // We don't mind slight loss of precision
            }

            {
                Int64 val;
                if (field.tryGet(val))
                {
                    if (val < 0)
                        throw Exception(ErrorCodes::BAD_ARGUMENTS, "Unexpected negative Int64 value for workload setting '{}'", name);
                    return static_cast<Float64>(val); // We don't mind slight loss of precision
                }
            }

            return field.safeGet<Float64>();
        }

        static Int64 getNotNegativeInt64(const String & name, const Field & field)
        {
            {
                UInt64 val;
                if (field.tryGet(val))
                {
                    // Saturate on overflow
                    if (val > static_cast<UInt64>(std::numeric_limits<Int64>::max()))
                        val = std::numeric_limits<Int64>::max();
                    return static_cast<Int64>(val);
                }
            }

            {
                Int64 val;
                if (field.tryGet(val))
                {
                    if (val < 0)
                        throw Exception(ErrorCodes::BAD_ARGUMENTS, "Unexpected negative Int64 value for workload setting '{}'", name);
                    return val;
                }
            }

            return field.safeGet<Int64>();
        }

        void read(const String & name, const Field & value)
        {
            if (name == "weight")
                new_weight = getNotNegativeFloat64(name, value);
            else if (name == "priority")
                new_priority = Priority{value.safeGet<Priority::Value>()};
            else if (name == "max_speed")
                new_max_speed = getNotNegativeFloat64(name, value);
            else if (name == "max_burst")
                new_max_burst = getNotNegativeFloat64(name, value);
            else if (name == "max_requests")
                new_max_requests = getNotNegativeInt64(name, value);
            else if (name == "max_cost")
                new_max_cost = getNotNegativeInt64(name, value);
        }
    } regular, specific;

    // Read changed setting values
    for (const auto & [name, value, resource] : changes)
    {
        if (resource.empty())
            regular.read(name, value);
        else if (resource == resource_name)
            specific.read(name, value);
    }

    auto get_value = [] <typename T> (const std::optional<T> & specific_new, const std::optional<T> & regular_new, T & old)
    {
        if (specific_new)
            return *specific_new;
        if (regular_new)
            return *regular_new;
        return old;
    };

    // Validate that we could use values read in a scheduler node
    {
        SchedulerNodeInfo validating_node(
            get_value(specific.new_weight, regular.new_weight, weight),
            get_value(specific.new_priority, regular.new_priority, priority));
    }

    // Commit new values.
    // Previous values are kept intentionally so that an ALTER query can skip setting values that are not mentioned
    weight = get_value(specific.new_weight, regular.new_weight, weight);
    priority = get_value(specific.new_priority, regular.new_priority, priority);
    if (specific.new_max_speed || regular.new_max_speed)
    {
        max_speed = get_value(specific.new_max_speed, regular.new_max_speed, max_speed);
        // We always set max_burst if max_speed is changed.
        // This is done so that users can ignore the more advanced max_burst setting and rely only on max_speed
        max_burst = default_burst_seconds * max_speed;
    }
    max_burst = get_value(specific.new_max_burst, regular.new_max_burst, max_burst);
    max_requests = get_value(specific.new_max_requests, regular.new_max_requests, max_requests);
    max_cost = get_value(specific.new_max_cost, regular.new_max_cost, max_cost);
}

}
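The get_value lambda above implements a three-level precedence: a resource-specific change wins over an all-resources change, which wins over the previous value; and any change to max_speed re-derives max_burst. A small sketch of this rule with a hypothetical helper mirroring the lambda:

#include <cassert>
#include <optional>

// specific (per-resource) beats regular (all-resources) beats the old value
static double resolve(std::optional<double> specific, std::optional<double> regular, double old_value)
{
    if (specific)
        return *specific;
    if (regular)
        return *regular;
    return old_value;
}

int main()
{
    constexpr double default_burst_seconds = 1.0; // mirrors SchedulingSettings
    double max_speed = 0, max_burst = 0;

    // Example: ALTER sets max_speed = 100 for all resources; nothing resource-specific.
    std::optional<double> regular_speed = 100.0, specific_speed;
    max_speed = resolve(specific_speed, regular_speed, max_speed);
    max_burst = default_burst_seconds * max_speed; // re-derived because max_speed changed

    assert(max_speed == 100.0 && max_burst == 100.0);
}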
39 src/Common/Scheduler/SchedulingSettings.h Normal file
@@ -0,0 +1,39 @@
#pragma once

#include <base/types.h>

#include <Common/Priority.h>
#include <Parsers/ASTCreateWorkloadQuery.h>

#include <limits>

namespace DB
{

struct SchedulingSettings
{
    /// Priority and weight among siblings
    Float64 weight = 1.0;
    Priority priority;

    /// Throttling constraints.
    /// Up to 2 independent throttlers: one for average speed and one for peak speed.
    static constexpr Float64 default_burst_seconds = 1.0;
    Float64 max_speed = 0; // Zero means unlimited
    Float64 max_burst = 0; // default is `default_burst_seconds * max_speed`

    /// Limits the total number of concurrent resource requests that are allowed to consume
    static constexpr Int64 default_max_requests = std::numeric_limits<Int64>::max();
    Int64 max_requests = default_max_requests;

    /// Limits the total cost of concurrent resource requests that are allowed to consume
    static constexpr Int64 default_max_cost = std::numeric_limits<Int64>::max();
    Int64 max_cost = default_max_cost;

    bool hasThrottler() const { return max_speed != 0; }
    bool hasSemaphore() const { return max_requests != default_max_requests || max_cost != default_max_cost; }

    void updateFromChanges(const ASTCreateWorkloadQuery::SettingsChanges & changes, const String & resource_name = {});
};

}
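The struct above encodes "unlimited" through sentinel defaults: max_speed == 0 disables the throttler, and max_requests/max_cost at Int64 max disable the semaphore. A compile-time check mirroring hasThrottler()/hasSemaphore() (illustrative free functions, not the actual API):

#include <cstdint>
#include <limits>

constexpr int64_t unlimited = std::numeric_limits<int64_t>::max();

constexpr bool has_throttler(double max_speed) { return max_speed != 0; }
constexpr bool has_semaphore(int64_t max_requests, int64_t max_cost)
{
    return max_requests != unlimited || max_cost != unlimited;
}

static_assert(!has_throttler(0));               // zero speed: no throttler node is needed
static_assert(has_semaphore(10, unlimited));    // either limit alone enables the semaphore
static_assert(!has_semaphore(unlimited, unlimited));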
91 src/Common/Scheduler/Workload/IWorkloadEntityStorage.h Normal file
@@ -0,0 +1,91 @@
#pragma once

#include <base/types.h>
#include <base/scope_guard.h>

#include <Interpreters/Context_fwd.h>

#include <Parsers/IAST_fwd.h>

namespace DB
{

class IAST;
struct Settings;

enum class WorkloadEntityType : uint8_t
{
    Workload,
    Resource,

    MAX
};

/// Interface for a storage of workload entities (WORKLOAD and RESOURCE).
class IWorkloadEntityStorage
{
public:
    virtual ~IWorkloadEntityStorage() = default;

    /// Whether this storage can replicate entities to another node.
    virtual bool isReplicated() const { return false; }
    virtual String getReplicationID() const { return ""; }

    /// Loads all entities. Can be called multiple times: if entities are already loaded, the function does nothing.
    virtual void loadEntities() = 0;

    /// Get entity by name. Throws an exception if no entity with entity_name is stored.
    virtual ASTPtr get(const String & entity_name) const = 0;

    /// Get entity by name. Returns nullptr if no entity with entity_name is stored.
    virtual ASTPtr tryGet(const String & entity_name) const = 0;

    /// Check if an entity with entity_name is stored.
    virtual bool has(const String & entity_name) const = 0;

    /// Get all entity names.
    virtual std::vector<String> getAllEntityNames() const = 0;

    /// Get all entity names of the specified type.
    virtual std::vector<String> getAllEntityNames(WorkloadEntityType entity_type) const = 0;

    /// Get all entities.
    virtual std::vector<std::pair<String, ASTPtr>> getAllEntities() const = 0;

    /// Check whether any entity has been stored.
    virtual bool empty() const = 0;

    /// Stops watching.
    virtual void stopWatching() {}

    /// Stores an entity.
    virtual bool storeEntity(
        const ContextPtr & current_context,
        WorkloadEntityType entity_type,
        const String & entity_name,
        ASTPtr create_entity_query,
        bool throw_if_exists,
        bool replace_if_exists,
        const Settings & settings) = 0;

    /// Removes an entity.
    virtual bool removeEntity(
        const ContextPtr & current_context,
        WorkloadEntityType entity_type,
        const String & entity_name,
        bool throw_if_not_exists) = 0;

    struct Event
    {
        WorkloadEntityType type;
        String name;
        ASTPtr entity; /// new or changed entity, null if removed
    };
    using OnChangedHandler = std::function<void(const std::vector<Event> &)>;

    /// Gets all current entities, passes them through `handler`, and subscribes to all later changes.
    virtual scope_guard getAllEntitiesAndSubscribe(const OnChangedHandler & handler) = 0;
};

}
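getAllEntitiesAndSubscribe() returns a scope_guard that unsubscribes the handler when it is destroyed. A generic sketch of that subscribe-with-RAII-token pattern (hand-rolled guard and storage types for illustration, not the actual scope_guard or storage API):

#include <cstdio>
#include <functional>
#include <list>
#include <string>
#include <vector>

struct Event { std::string name; };
using Handler = std::function<void(const std::vector<Event> &)>;

struct Guard // runs the stored callback (unsubscribe) on destruction
{
    std::function<void()> fn;
    explicit Guard(std::function<void()> f) : fn(std::move(f)) {}
    Guard(const Guard &) = delete;
    Guard(Guard && other) noexcept : fn(std::move(other.fn)) { other.fn = nullptr; }
    ~Guard() { if (fn) fn(); }
};

struct Storage
{
    std::list<Handler> handlers; // std::list: iterators stay valid across insert/erase

    Guard subscribe(Handler h)
    {
        auto it = handlers.insert(handlers.end(), std::move(h));
        return Guard([this, it] { handlers.erase(it); });
    }

    void notify(const std::vector<Event> & events)
    {
        for (const auto & h : handlers)
            h(events);
    }
};

int main()
{
    Storage storage;
    {
        Guard subscription = storage.subscribe([](const std::vector<Event> & events)
        {
            for (const auto & e : events)
                std::printf("changed: %s\n", e.name.c_str());
        });
        storage.notify({{"production"}});
    } // subscription destroyed here -> handler removed
    storage.notify({{"ignored"}}); // prints nothing
}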
287 src/Common/Scheduler/Workload/WorkloadEntityDiskStorage.cpp Normal file
@@ -0,0 +1,287 @@
#include <Common/Scheduler/Workload/WorkloadEntityDiskStorage.h>

#include <Common/StringUtils.h>
#include <Common/atomicRename.h>
#include <Common/escapeForFileName.h>
#include <Common/logger_useful.h>
#include <Common/quoteString.h>

#include <Core/Settings.h>

#include <IO/ReadBufferFromFile.h>
#include <IO/ReadHelpers.h>
#include <IO/WriteBufferFromFile.h>
#include <IO/WriteHelpers.h>

#include <Interpreters/Context.h>

#include <Parsers/parseQuery.h>
#include <Parsers/formatAST.h>
#include <Parsers/ParserCreateWorkloadQuery.h>
#include <Parsers/ParserCreateResourceQuery.h>

#include <Poco/DirectoryIterator.h>
#include <Poco/Logger.h>

#include <filesystem>

namespace fs = std::filesystem;

namespace DB
{

namespace Setting
{
    extern const SettingsUInt64 max_parser_backtracks;
    extern const SettingsUInt64 max_parser_depth;
    extern const SettingsBool fsync_metadata;
}

namespace ErrorCodes
{
    extern const int DIRECTORY_DOESNT_EXIST;
    extern const int BAD_ARGUMENTS;
}

namespace
{
    constexpr std::string_view workload_prefix = "workload_";
    constexpr std::string_view resource_prefix = "resource_";
    constexpr std::string_view sql_suffix = ".sql";

    /// Converts a path to an absolute path and appends a trailing separator.
    String makeDirectoryPathCanonical(const String & directory_path)
    {
        auto canonical_directory_path = std::filesystem::weakly_canonical(directory_path);
        if (canonical_directory_path.has_filename())
            canonical_directory_path += std::filesystem::path::preferred_separator;
        return canonical_directory_path;
    }
}

WorkloadEntityDiskStorage::WorkloadEntityDiskStorage(const ContextPtr & global_context_, const String & dir_path_)
    : WorkloadEntityStorageBase(global_context_)
    , dir_path{makeDirectoryPathCanonical(dir_path_)}
{
    log = getLogger("WorkloadEntityDiskStorage");
}

ASTPtr WorkloadEntityDiskStorage::tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name)
{
    return tryLoadEntity(entity_type, entity_name, getFilePath(entity_type, entity_name), /* check_file_exists= */ true);
}

ASTPtr WorkloadEntityDiskStorage::tryLoadEntity(WorkloadEntityType entity_type, const String & entity_name, const String & path, bool check_file_exists)
{
    LOG_DEBUG(log, "Loading workload entity {} from file {}", backQuote(entity_name), path);

    try
    {
        if (check_file_exists && !fs::exists(path))
            return nullptr;

        /// There is a .sql file with the workload entity creation statement.
        ReadBufferFromFile in(path);

        String entity_create_query;
        readStringUntilEOF(entity_create_query, in);

        auto parse = [&] (auto parser)
        {
            return parseQuery(
                parser,
                entity_create_query.data(),
                entity_create_query.data() + entity_create_query.size(),
                "",
                0,
                global_context->getSettingsRef()[Setting::max_parser_depth],
                global_context->getSettingsRef()[Setting::max_parser_backtracks]);
        };

        switch (entity_type)
        {
            case WorkloadEntityType::Workload: return parse(ParserCreateWorkloadQuery());
            case WorkloadEntityType::Resource: return parse(ParserCreateResourceQuery());
            case WorkloadEntityType::MAX: return nullptr;
        }
    }
    catch (...)
    {
        tryLogCurrentException(log, fmt::format("while loading workload entity {} from path {}", backQuote(entity_name), path));
        return nullptr; /// Failed to load this entity, will ignore it
    }
}

void WorkloadEntityDiskStorage::loadEntities()
{
    if (!entities_loaded)
        loadEntitiesImpl();
}

void WorkloadEntityDiskStorage::loadEntitiesImpl()
{
    LOG_INFO(log, "Loading workload entities from {}", dir_path);

    if (!std::filesystem::exists(dir_path))
    {
        LOG_DEBUG(log, "The directory for workload entities ({}) does not exist: nothing to load", dir_path);
        return;
    }

    std::vector<std::pair<String, ASTPtr>> entities_name_and_queries;

    Poco::DirectoryIterator dir_end;
    for (Poco::DirectoryIterator it(dir_path); it != dir_end; ++it)
    {
        if (it->isDirectory())
            continue;

        const String & file_name = it.name();

        if (file_name.starts_with(workload_prefix) && file_name.ends_with(sql_suffix))
        {
            String name = unescapeForFileName(file_name.substr(
                workload_prefix.size(),
                file_name.size() - workload_prefix.size() - sql_suffix.size()));

            if (name.empty())
                continue;

            ASTPtr ast = tryLoadEntity(WorkloadEntityType::Workload, name, dir_path + it.name(), /* check_file_exists= */ false);
            if (ast)
                entities_name_and_queries.emplace_back(name, ast);
        }

        if (file_name.starts_with(resource_prefix) && file_name.ends_with(sql_suffix))
        {
            String name = unescapeForFileName(file_name.substr(
                resource_prefix.size(),
                file_name.size() - resource_prefix.size() - sql_suffix.size()));

            if (name.empty())
                continue;

            ASTPtr ast = tryLoadEntity(WorkloadEntityType::Resource, name, dir_path + it.name(), /* check_file_exists= */ false);
            if (ast)
                entities_name_and_queries.emplace_back(name, ast);
        }
    }

    setAllEntities(entities_name_and_queries);
    entities_loaded = true;

    LOG_DEBUG(log, "Workload entities loaded");
}

void WorkloadEntityDiskStorage::createDirectory()
{
    std::error_code create_dir_error_code;
    fs::create_directories(dir_path, create_dir_error_code);
    if (!fs::exists(dir_path) || !fs::is_directory(dir_path) || create_dir_error_code)
        throw Exception(ErrorCodes::DIRECTORY_DOESNT_EXIST, "Couldn't create directory {} reason: '{}'",
            dir_path, create_dir_error_code.message());
}

WorkloadEntityStorageBase::OperationResult WorkloadEntityDiskStorage::storeEntityImpl(
    const ContextPtr & /*current_context*/,
    WorkloadEntityType entity_type,
    const String & entity_name,
    ASTPtr create_entity_query,
    bool throw_if_exists,
    bool replace_if_exists,
    const Settings & settings)
{
    createDirectory();
    String file_path = getFilePath(entity_type, entity_name);
    LOG_DEBUG(log, "Storing workload entity {} to file {}", backQuote(entity_name), file_path);

    if (fs::exists(file_path))
    {
        if (throw_if_exists)
            throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' already exists", entity_name);
        else if (!replace_if_exists)
            return OperationResult::Failed;
    }

    String temp_file_path = file_path + ".tmp";

    try
    {
        WriteBufferFromFile out(temp_file_path);
        formatAST(*create_entity_query, out, false);
        writeChar('\n', out);
        out.next();
        if (settings[Setting::fsync_metadata])
            out.sync();
        out.close();

        if (replace_if_exists)
            fs::rename(temp_file_path, file_path);
        else
            renameNoReplace(temp_file_path, file_path);
    }
    catch (...)
    {
        fs::remove(temp_file_path);
        throw;
    }

    LOG_TRACE(log, "Entity {} stored", backQuote(entity_name));
    return OperationResult::Ok;
}

WorkloadEntityStorageBase::OperationResult WorkloadEntityDiskStorage::removeEntityImpl(
    const ContextPtr & /*current_context*/,
    WorkloadEntityType entity_type,
    const String & entity_name,
    bool throw_if_not_exists)
{
    String file_path = getFilePath(entity_type, entity_name);
    LOG_DEBUG(log, "Removing workload entity {} stored in file {}", backQuote(entity_name), file_path);

    bool existed = fs::remove(file_path);

    if (!existed)
    {
        if (throw_if_not_exists)
            throw Exception(ErrorCodes::BAD_ARGUMENTS, "Workload entity '{}' doesn't exist", entity_name);
        else
            return OperationResult::Failed;
    }

    LOG_TRACE(log, "Entity {} removed", backQuote(entity_name));
    return OperationResult::Ok;
}

String WorkloadEntityDiskStorage::getFilePath(WorkloadEntityType entity_type, const String & entity_name) const
{
    String file_path;
    switch (entity_type)
    {
        case WorkloadEntityType::Workload:
        {
            file_path = dir_path + "workload_" + escapeForFileName(entity_name) + ".sql";
            break;
        }
        case WorkloadEntityType::Resource:
        {
            file_path = dir_path + "resource_" + escapeForFileName(entity_name) + ".sql";
            break;
        }
        case WorkloadEntityType::MAX: break;
    }
    return file_path;
}

}
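storeEntityImpl() above makes stores crash-safe: it writes the definition to a temporary file, optionally syncs it (fsync_metadata), and renames it into place, using renameNoReplace() when replacement is forbidden. A plain-filesystem sketch of the same write-then-rename pattern (illustrative, omitting the fsync and no-replace variants):

#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

// Write to a side file, then rename into place. On POSIX, rename within one
// filesystem is atomic, so readers never observe a half-written definition.
void atomicStore(const std::string & path, const std::string & contents)
{
    const std::string temp_path = path + ".tmp";
    {
        std::ofstream out(temp_path, std::ios::trunc);
        out << contents << '\n';
        if (!out)
            throw std::runtime_error("write failed: " + temp_path);
    } // flushed and closed here
    std::filesystem::rename(temp_path, path); // replaces any existing file
}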
Some files were not shown because too many files have changed in this diff.