diff --git a/.gitignore b/.gitignore
index 4bc162c1b0f..8a745655cbf 100644
--- a/.gitignore
+++ b/.gitignore
@@ -159,6 +159,7 @@ website/package-lock.json
/programs/server/store
/programs/server/uuid
/programs/server/coordination
+/programs/server/workload
# temporary test files
tests/queries/0_stateless/test_*
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6c0d21a4698..90285582b4e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,4 +1,5 @@
### Table of Contents
+**[ClickHouse release v24.10, 2024-10-31](#2410)**
**[ClickHouse release v24.9, 2024-09-26](#249)**
**[ClickHouse release v24.8 LTS, 2024-08-20](#248)**
**[ClickHouse release v24.7, 2024-07-30](#247)**
@@ -12,6 +13,165 @@
# 2024 Changelog
+### ClickHouse release 24.10, 2024-10-31
+
+#### Backward Incompatible Change
+* Allow to write `SETTINGS` before `FORMAT` in a chain of queries with `UNION` when subqueries are inside parentheses. This closes [#39712](https://github.com/ClickHouse/ClickHouse/issues/39712). Change the behavior when a query has the SETTINGS clause specified twice in a sequence. The closest SETTINGS clause will have a preference for the corresponding subquery. In the previous versions, the outermost SETTINGS clause could take a preference over the inner one. [#68614](https://github.com/ClickHouse/ClickHouse/pull/68614) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Reordering of filter conditions from `[PRE]WHERE` clause is now allowed by default. It could be disabled by setting `allow_reorder_prewhere_conditions` to `false`. [#70657](https://github.com/ClickHouse/ClickHouse/pull/70657) ([Nikita Taranov](https://github.com/nickitat)).
+* Remove the `idxd-config` library, which has an incompatible license. This also removes the experimental Intel DeflateQPL codec. [#70987](https://github.com/ClickHouse/ClickHouse/pull/70987) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+
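+The new `SETTINGS` placement can be sketched as follows (a hedged example of the behavior described above; the closest `SETTINGS` clause takes preference for its own subquery):
+
+```sql
+-- SETTINGS may now appear before FORMAT in a UNION chain of
+-- parenthesized subqueries; the inner clause wins for its subquery.
+(SELECT number FROM system.numbers LIMIT 1 SETTINGS max_threads = 1)
+UNION ALL
+(SELECT number FROM system.numbers LIMIT 1)
+SETTINGS max_threads = 4
+FORMAT TSV
+```
+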
+#### New Feature
+* Allow granting access to wildcard table-name prefixes: `GRANT SELECT ON db.table_prefix_* TO user`. [#65311](https://github.com/ClickHouse/ClickHouse/pull/65311) ([pufit](https://github.com/pufit)).
+* If you press the space bar during query runtime, the client will display a real-time table with detailed metrics. You can enable it globally with the new `--progress-table` option in clickhouse-client; the new `--enable-progress-table-toggle` option is associated with `--progress-table` and toggles the rendering of the progress table with the Space key. [#63689](https://github.com/ClickHouse/ClickHouse/pull/63689) ([Maria Khristenko](https://github.com/mariaKhr)), [#70423](https://github.com/ClickHouse/ClickHouse/pull/70423) ([Julia Kartseva](https://github.com/jkartseva)).
+* Allow to cache read files for object storage table engines and data lakes using hash from ETag + file path as cache key. [#70135](https://github.com/ClickHouse/ClickHouse/pull/70135) ([Kseniia Sumarokova](https://github.com/kssenii)).
+* Support creating a table with a query: `CREATE TABLE ... CLONE AS ...`. It clones the source table's schema and then attaches all partitions to the newly created table. This feature is only supported for tables of the `MergeTree` family. Closes [#65015](https://github.com/ClickHouse/ClickHouse/issues/65015). [#69091](https://github.com/ClickHouse/ClickHouse/pull/69091) ([tuanpach](https://github.com/tuanpach)).
+* Add a new system table, `system.query_metric_log`, which contains a history of memory and metric values from the `system.events` table for individual queries, periodically flushed to disk. [#66532](https://github.com/ClickHouse/ClickHouse/pull/66532) ([Pablo Marcos](https://github.com/pamarcos)).
+* A simple SELECT query can be written with implicit SELECT to enable calculator-style expressions, e.g., `ch "1 + 2"`. This is controlled by a new setting, `implicit_select`. [#68502](https://github.com/ClickHouse/ClickHouse/pull/68502) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Support the `--copy` mode for clickhouse local as a shortcut for format conversion [#68503](https://github.com/ClickHouse/ClickHouse/issues/68503). [#68583](https://github.com/ClickHouse/ClickHouse/pull/68583) ([Denis Hananein](https://github.com/denis-hananein)).
+* Add a builtin HTML page for visualizing merges which is available at the `/merges` path. [#70821](https://github.com/ClickHouse/ClickHouse/pull/70821) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Add support for `arrayUnion` function. [#68989](https://github.com/ClickHouse/ClickHouse/pull/68989) ([Peter Nguyen](https://github.com/petern48)).
+* Allow parametrised SQL aliases. [#50665](https://github.com/ClickHouse/ClickHouse/pull/50665) ([Anton Kozlov](https://github.com/tonickkozlov)).
+* A new aggregate function `quantileExactWeightedInterpolated`, an interpolated version of `quantileExactWeighted`. Some people may wonder why we need a new `quantileExactWeightedInterpolated` since we already have `quantileExactInterpolatedWeighted`. The reason is that the new one is more accurate. This is for Spark compatibility. [#69619](https://github.com/ClickHouse/ClickHouse/pull/69619) ([李扬](https://github.com/taiyang-li)).
+* A new function `arrayElementOrNull`. It returns `NULL` if the array index is out of range or a `Map` key is not found. [#69646](https://github.com/ClickHouse/ClickHouse/pull/69646) ([李扬](https://github.com/taiyang-li)).
+* Allow users to specify regular expressions through the new `message_regexp` and `message_regexp_negative` fields in the `config.xml` file to filter out logging. The filter is applied to the formatted, un-colored text for the most intuitive developer experience. [#69657](https://github.com/ClickHouse/ClickHouse/pull/69657) ([Peter Nguyen](https://github.com/petern48)).
+* Added `RIPEMD160` function, which computes the RIPEMD-160 cryptographic hash of a string. Example: `SELECT HEX(RIPEMD160('The quick brown fox jumps over the lazy dog'))` returns `37F332F68DB77BD9D7EDD4969571AD671CF9DD3B`. [#70087](https://github.com/ClickHouse/ClickHouse/pull/70087) ([Dergousov Maxim](https://github.com/m7kss1)).
+* Support reading `Iceberg` tables on `HDFS`. [#70268](https://github.com/ClickHouse/ClickHouse/pull/70268) ([flynn](https://github.com/ucasfl)).
+* Support for CTE in the form of `WITH ... INSERT`, as previously we only supported `INSERT ... WITH ...`. [#70593](https://github.com/ClickHouse/ClickHouse/pull/70593) ([Shichao Jin](https://github.com/jsc0218)).
+* MongoDB integration: support for all MongoDB types, support for evaluating `WHERE` and `ORDER BY` on the MongoDB side, and rejection of expressions unsupported by MongoDB. Note that the new integration is disabled by default; to use it, set `` to `false` in the server config. [#63279](https://github.com/ClickHouse/ClickHouse/pull/63279) ([Kirill Nikiforov](https://github.com/allmazz)).
+* A new function `getSettingOrDefault` added to return the default value and avoid exception if a custom setting is not found in the current profile. [#69917](https://github.com/ClickHouse/ClickHouse/pull/69917) ([Shankar](https://github.com/shiyer7474)).
+
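+A few of the features above, sketched (database, table, user names, and setting values are illustrative, not from the release notes):
+
+```sql
+-- grant on every table whose name starts with a prefix
+GRANT SELECT ON db.events_* TO analyst;
+
+-- clone a MergeTree table's schema and attach all of its partitions
+CREATE TABLE db.events_copy CLONE AS db.events;
+
+-- NULL instead of an exception for an out-of-range array index
+SELECT arrayElementOrNull([1, 2, 3], 10);
+
+-- a default value instead of an exception for a missing custom setting
+SELECT getSettingOrDefault('custom_x', 42);
+```
+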
+#### Experimental feature
+* Refreshable materialized views are production ready. [#70550](https://github.com/ClickHouse/ClickHouse/pull/70550) ([Michael Kolupaev](https://github.com/al13n321)). Refreshable materialized views are now supported in Replicated databases. [#60669](https://github.com/ClickHouse/ClickHouse/pull/60669) ([Michael Kolupaev](https://github.com/al13n321)).
+* Parallel replicas are moved from experimental to beta. Reworked settings that control the behavior of parallel replicas algorithms. A quick recap: ClickHouse has four different algorithms for parallel reading involving multiple replicas, reflected in the setting `parallel_replicas_mode`; its default value is `read_tasks`. Additionally, the toggle-switch setting `enable_parallel_replicas` has been added. [#63151](https://github.com/ClickHouse/ClickHouse/pull/63151) ([Alexey Milovidov](https://github.com/alexey-milovidov)), ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)).
+* Support for the `Dynamic` type in most functions by executing them on internal types inside `Dynamic`. [#69691](https://github.com/ClickHouse/ClickHouse/pull/69691) ([Pavel Kruglov](https://github.com/Avogar)).
+* Allow to read/write the `JSON` type as a binary string in `RowBinary` format under settings `input_format_binary_read_json_as_string/output_format_binary_write_json_as_string`. [#70288](https://github.com/ClickHouse/ClickHouse/pull/70288) ([Pavel Kruglov](https://github.com/Avogar)).
+* Allow to serialize/deserialize `JSON` column as single String column in the Native format. For output use setting `output_format_native_write_json_as_string`. For input, use serialization version `1` before the column data. [#70312](https://github.com/ClickHouse/ClickHouse/pull/70312) ([Pavel Kruglov](https://github.com/Avogar)).
+* Introduced a special (experimental) mode of the merge selector for `MergeTree` tables which makes it more aggressive for partitions that are close to the parts-count limit. It is controlled by the `merge_selector_use_blurry_base` MergeTree-level setting. [#70645](https://github.com/ClickHouse/ClickHouse/pull/70645) ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)).
+* Implement generic ser/de between Avro's `Union` and ClickHouse's `Variant` types. Resolves [#69713](https://github.com/ClickHouse/ClickHouse/issues/69713). [#69712](https://github.com/ClickHouse/ClickHouse/pull/69712) ([Jiří Kozlovský](https://github.com/jirislav)).
+
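+A minimal sketch of the `RowBinary` behavior for the `JSON` type mentioned above:
+
+```sql
+-- write the JSON type as a binary string instead of its native encoding
+SELECT '{"a": 1}'::JSON AS j
+SETTINGS output_format_binary_write_json_as_string = 1
+FORMAT RowBinary;
+```
+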
+#### Performance Improvement
+* Refactor `IDisk` and `IObjectStorage` for better performance. Tables from `plain` and `plain_rewritable` object storages will initialize faster. [#68146](https://github.com/ClickHouse/ClickHouse/pull/68146) ([Alexey Milovidov](https://github.com/alexey-milovidov), [Julia Kartseva](https://github.com/jkartseva)). Do not call the LIST object storage API when determining if a file or directory exists on the plain rewritable disk, as it can be cost-inefficient. [#70852](https://github.com/ClickHouse/ClickHouse/pull/70852) ([Julia Kartseva](https://github.com/jkartseva)). Reduce the number of object storage HEAD API requests in the plain_rewritable disk. [#70915](https://github.com/ClickHouse/ClickHouse/pull/70915) ([Julia Kartseva](https://github.com/jkartseva)).
+* Added an ability to parse data directly into sparse columns. [#69828](https://github.com/ClickHouse/ClickHouse/pull/69828) ([Anton Popov](https://github.com/CurtizJ)).
+* Improved performance of parsing formats with high number of missed values (e.g. `JSONEachRow`). [#69875](https://github.com/ClickHouse/ClickHouse/pull/69875) ([Anton Popov](https://github.com/CurtizJ)).
+* Support parallel reading of Parquet row groups, and prefetching of row groups in single-threaded mode. [#69862](https://github.com/ClickHouse/ClickHouse/pull/69862) ([LiuNeng](https://github.com/liuneng1994)).
+* Support minmax index for `pointInPolygon`. [#62085](https://github.com/ClickHouse/ClickHouse/pull/62085) ([JackyWoo](https://github.com/JackyWoo)).
+* Use bloom filters when reading Parquet files. [#62966](https://github.com/ClickHouse/ClickHouse/pull/62966) ([Arthur Passos](https://github.com/arthurpassos)).
+* Lock-free parts rename to avoid INSERTs affecting SELECTs (due to the parts lock). Under normal circumstances with `fsync_part_directory`, the QPS of SELECTs running in parallel with INSERTs increased 2x; under heavy load the effect is even bigger. Note that this currently only covers `ReplicatedMergeTree`. [#64955](https://github.com/ClickHouse/ClickHouse/pull/64955) ([Azat Khuzhin](https://github.com/azat)).
+* Respect `ttl_only_drop_parts` on `materialize ttl`; only read necessary columns to recalculate TTL and drop parts by replacing them with an empty one. [#65488](https://github.com/ClickHouse/ClickHouse/pull/65488) ([Andrey Zvonov](https://github.com/zvonand)).
+* Optimized thread creation in the ThreadPool to minimize lock contention. Thread creation is now performed outside of the critical section to avoid delays in job scheduling and thread management under high load conditions. This leads to a much more responsive ClickHouse under heavy concurrent load. [#68694](https://github.com/ClickHouse/ClickHouse/pull/68694) ([filimonov](https://github.com/filimonov)).
+* Enable reading `LowCardinality` string columns from `ORC`. [#69481](https://github.com/ClickHouse/ClickHouse/pull/69481) ([李扬](https://github.com/taiyang-li)).
+* Use `LowCardinality` for `ProfileEvents` in system logs such as `part_log`, `query_views_log`, `filesystem_cache_log`. [#70152](https://github.com/ClickHouse/ClickHouse/pull/70152) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Improve performance of `fromUnixTimestamp`/`toUnixTimestamp` functions. [#71042](https://github.com/ClickHouse/ClickHouse/pull/71042) ([kevinyhzou](https://github.com/KevinyhZou)).
+* Don't disable nonblocking read from page cache for the entire server when reading from a blocking I/O. This was leading to a poorer performance when a single filesystem (e.g., tmpfs) didn't support the `preadv2` syscall while others do. [#70299](https://github.com/ClickHouse/ClickHouse/pull/70299) ([Antonio Andelic](https://github.com/antonio2368)).
+* `ALTER TABLE .. REPLACE PARTITION` doesn't wait anymore for mutations/merges that happen in other partitions. [#59138](https://github.com/ClickHouse/ClickHouse/pull/59138) ([Vasily Nemkov](https://github.com/Enmk)).
+* Don't do validation when synchronizing ACLs from Keeper; they are validated during creation. It shouldn't matter that much, but there are installations with tens of thousands of users or even more, and the unnecessary hash validation can take a long time during server startup (everything is synchronized from Keeper). [#70644](https://github.com/ClickHouse/ClickHouse/pull/70644) ([Raúl Marín](https://github.com/Algunenano)).
+
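+The `materialize ttl` improvement above can be illustrated as follows (the table and column names are hypothetical):
+
+```sql
+-- with ttl_only_drop_parts, expired parts are dropped whole instead of
+-- being rewritten, and only the columns needed for TTL recalculation
+-- are read
+ALTER TABLE events MODIFY SETTING ttl_only_drop_parts = 1;
+ALTER TABLE events MODIFY TTL ts + INTERVAL 30 DAY;
+ALTER TABLE events MATERIALIZE TTL;
+```
+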
+#### Improvement
+* `CREATE TABLE AS` will copy `PRIMARY KEY`, `ORDER BY`, and similar clauses (of `MergeTree` tables). [#69739](https://github.com/ClickHouse/ClickHouse/pull/69739) ([sakulali](https://github.com/sakulali)).
+* Support 64-bit XID in Keeper. It can be enabled with the `use_xid_64` configuration value. [#69908](https://github.com/ClickHouse/ClickHouse/pull/69908) ([Antonio Andelic](https://github.com/antonio2368)).
+* Command-line arguments for Bool settings are set to true when no value is provided for the argument (e.g. `clickhouse-client --optimize_aggregation_in_order --query "SELECT 1"`). [#70459](https://github.com/ClickHouse/ClickHouse/pull/70459) ([davidtsuk](https://github.com/davidtsuk)).
+* Added user-level settings `min_free_disk_bytes_to_throw_insert` and `min_free_disk_ratio_to_throw_insert` to prevent insertions on disks that are almost full. [#69755](https://github.com/ClickHouse/ClickHouse/pull/69755) ([Marco Vilas Boas](https://github.com/marco-vb)).
+* Embedded documentation for settings will be strictly more detailed and complete than the documentation on the website. This is the first step before making the website documentation always auto-generated from the source code. This has long-standing implications: every setting is guaranteed to be documented; default values can never become obsolete; the documentation can be generated for each ClickHouse version; and the server itself can display the documentation even without Internet access. The website documentation is now generated from the source code. [#70289](https://github.com/ClickHouse/ClickHouse/pull/70289) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Allow empty needle in the function `replace`, the same behavior with PostgreSQL. [#69918](https://github.com/ClickHouse/ClickHouse/pull/69918) ([zhanglistar](https://github.com/zhanglistar)).
+* Allow empty needle in functions `replaceRegexp*`. [#70053](https://github.com/ClickHouse/ClickHouse/pull/70053) ([zhanglistar](https://github.com/zhanglistar)).
+* Symbolic links for tables in the `data/database_name/` directory are created for the actual paths to the table's data, depending on the storage policy, instead of the `store/...` directory on the default disk. [#61777](https://github.com/ClickHouse/ClickHouse/pull/61777) ([Kirill](https://github.com/kirillgarbar)).
+* While parsing an `Enum` field from `JSON`, a string containing an integer will be interpreted as the corresponding `Enum` element. This closes [#65119](https://github.com/ClickHouse/ClickHouse/issues/65119). [#66801](https://github.com/ClickHouse/ClickHouse/pull/66801) ([scanhex12](https://github.com/scanhex12)).
+* Allow `TRIM` -ing `LEADING` or `TRAILING` empty string as a no-op. Closes [#67792](https://github.com/ClickHouse/ClickHouse/issues/67792). [#68455](https://github.com/ClickHouse/ClickHouse/pull/68455) ([Peter Nguyen](https://github.com/petern48)).
+* Improve compatibility of `cast(timestamp as String)` with Spark. [#69179](https://github.com/ClickHouse/ClickHouse/pull/69179) ([Wenzheng Liu](https://github.com/lwz9103)).
+* Always use the new analyzer to calculate constant expressions when `enable_analyzer` is set to `true`. Support calculation of `executable` table function arguments without using `SELECT` query for constant expressions. [#69292](https://github.com/ClickHouse/ClickHouse/pull/69292) ([Dmitry Novik](https://github.com/novikd)).
+* Add a setting `enable_secure_identifiers` to disallow identifiers with special characters. [#69411](https://github.com/ClickHouse/ClickHouse/pull/69411) ([tuanpach](https://github.com/tuanpach)).
+* Add `show_create_query_identifier_quoting_rule` to define identifier quoting behavior in the `SHOW CREATE TABLE` query result. Possible values: `user_display` - only when the identifier is a keyword; `when_necessary` - when the identifier is one of `{"distinct", "all", "table"}`, or when it could lead to ambiguity (column names, dictionary attribute names); `always` - always quote identifiers. [#69448](https://github.com/ClickHouse/ClickHouse/pull/69448) ([tuanpach](https://github.com/tuanpach)).
+* Improve restoring of access entities' dependencies. [#69563](https://github.com/ClickHouse/ClickHouse/pull/69563) ([Vitaly Baranov](https://github.com/vitlibar)).
+* If you run `clickhouse-client` or another CLI application and it starts up slowly due to an overloaded server, and you start typing your query, such as `SELECT`, previous versions would display the remainder of the terminal echo contents before printing the greeting message, e.g. `SELECTClickHouse local version 24.10.1.1.` instead of `ClickHouse local version 24.10.1.1.`. Now it is fixed. This closes [#31696](https://github.com/ClickHouse/ClickHouse/issues/31696). [#69856](https://github.com/ClickHouse/ClickHouse/pull/69856) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Add new column `readonly_duration` to the `system.replicas` table. Needed to be able to distinguish actual readonly replicas from sentinel ones in alerts. [#69871](https://github.com/ClickHouse/ClickHouse/pull/69871) ([Miсhael Stetsyuk](https://github.com/mstetsyuk)).
+* Change the type of the `join_output_by_rowlist_perkey_rows_threshold` setting to unsigned integer. [#69886](https://github.com/ClickHouse/ClickHouse/pull/69886) ([kevinyhzou](https://github.com/KevinyhZou)).
+* Enhance OpenTelemetry span logging to include query settings. [#70011](https://github.com/ClickHouse/ClickHouse/pull/70011) ([sharathks118](https://github.com/sharathks118)).
+* Add diagnostic info about higher-order array functions if lambda result type is unexpected. [#70093](https://github.com/ClickHouse/ClickHouse/pull/70093) ([ttanay](https://github.com/ttanay)).
+* Keeper improvement: less locking during cluster changes. [#70275](https://github.com/ClickHouse/ClickHouse/pull/70275) ([Antonio Andelic](https://github.com/antonio2368)).
+* Add `WITH IMPLICIT` and `FINAL` keywords to the `SHOW GRANTS` command. Fix a minor bug with implicit grants: [#70094](https://github.com/ClickHouse/ClickHouse/issues/70094). [#70293](https://github.com/ClickHouse/ClickHouse/pull/70293) ([pufit](https://github.com/pufit)).
+* Respect `compatibility` for MergeTree settings. The `compatibility` value is taken from the `default` profile on server startup, and default MergeTree settings are changed accordingly. Further changes of the `compatibility` setting do not affect MergeTree settings. [#70322](https://github.com/ClickHouse/ClickHouse/pull/70322) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Avoid spamming the logs with large HTTP response bodies in case of errors during inter-server communication. [#70487](https://github.com/ClickHouse/ClickHouse/pull/70487) ([Vladimir Cherkasov](https://github.com/vdimir)).
+* Added a new setting `max_parts_to_move` to control the maximum number of parts that can be moved at once. [#70520](https://github.com/ClickHouse/ClickHouse/pull/70520) ([Vladimir Cherkasov](https://github.com/vdimir)).
+* Limit the frequency of certain log messages. [#70601](https://github.com/ClickHouse/ClickHouse/pull/70601) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Fixed incorrect formatting of `CHECK TABLE` with the `PART` qualifier in the client. [#70660](https://github.com/ClickHouse/ClickHouse/pull/70660) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Support writing the column index and the offset index using parquet native writer. [#70669](https://github.com/ClickHouse/ClickHouse/pull/70669) ([LiuNeng](https://github.com/liuneng1994)).
+* Support parsing `DateTime64` for microsecond and timezone in joda syntax ("joda" is a popular Java library for date and time, and the "joda syntax" is that library's style). [#70737](https://github.com/ClickHouse/ClickHouse/pull/70737) ([kevinyhzou](https://github.com/KevinyhZou)).
+* Changed an approach to figure out if a cloud storage supports [batch delete](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html) or not. [#70786](https://github.com/ClickHouse/ClickHouse/pull/70786) ([Vitaly Baranov](https://github.com/vitlibar)).
+* Support for Parquet page v2 in the native reader. [#70807](https://github.com/ClickHouse/ClickHouse/pull/70807) ([Arthur Passos](https://github.com/arthurpassos)).
+* Added a check that a table does not have both the `storage_policy` and `disk` settings set, and a check that a new storage policy is compatible with the old one when the `disk` setting is used. [#70839](https://github.com/ClickHouse/ClickHouse/pull/70839) ([Kirill](https://github.com/kirillgarbar)).
+* Add `system.s3_queue_settings` and `system.azure_queue_settings`. [#70841](https://github.com/ClickHouse/ClickHouse/pull/70841) ([Kseniia Sumarokova](https://github.com/kssenii)).
+* Functions `base58Encode` and `base58Decode` now accept arguments of type `FixedString`. Example: `SELECT base58Encode(toFixedString('plaintext', 9));`. [#70846](https://github.com/ClickHouse/ClickHouse/pull/70846) ([Faizan Patel](https://github.com/faizan2786)).
+* Add the `partition` column to every entry type of the part log. Previously, it was set only for some entries. This closes [#70819](https://github.com/ClickHouse/ClickHouse/issues/70819). [#70848](https://github.com/ClickHouse/ClickHouse/pull/70848) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Add `MergeStart` and `MutateStart` events into `system.part_log` which helps with merges analysis and visualization. [#70850](https://github.com/ClickHouse/ClickHouse/pull/70850) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Add a profile event about the number of merged source parts. It allows the monitoring of the fanout of the merge tree in production. [#70908](https://github.com/ClickHouse/ClickHouse/pull/70908) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Background downloads to the filesystem cache were re-enabled. [#70929](https://github.com/ClickHouse/ClickHouse/pull/70929) ([Nikita Taranov](https://github.com/nickitat)).
+* Add a new merge selector algorithm, named `Trivial`, for professional usage only. It is worse than the `Simple` merge selector. [#70969](https://github.com/ClickHouse/ClickHouse/pull/70969) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Support for atomic `CREATE OR REPLACE VIEW`. [#70536](https://github.com/ClickHouse/ClickHouse/pull/70536) ([tuanpach](https://github.com/tuanpach)).
+* Added `strict_once` mode to aggregate function `windowFunnel` to avoid counting one event several times in case it matches multiple conditions, close [#21835](https://github.com/ClickHouse/ClickHouse/issues/21835). [#69738](https://github.com/ClickHouse/ClickHouse/pull/69738) ([Vladimir Cherkasov](https://github.com/vdimir)).
+
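+Two of the improvements above, sketched as queries (the ratio value is an illustrative choice):
+
+```sql
+-- base58Encode/base58Decode now accept FixedString arguments
+SELECT base58Encode(toFixedString('plaintext', 9));
+
+-- user-level guard against inserting onto an almost-full disk
+SET min_free_disk_ratio_to_throw_insert = 0.1;
+```
+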
+#### Bug Fix (user-visible misbehavior in an official stable release)
+* Apply configuration updates in global context object. It fixes issues like [#62308](https://github.com/ClickHouse/ClickHouse/issues/62308). [#62944](https://github.com/ClickHouse/ClickHouse/pull/62944) ([Amos Bird](https://github.com/amosbird)).
+* Fix `ReadSettings` not using user set values, because defaults were only used. [#65625](https://github.com/ClickHouse/ClickHouse/pull/65625) ([Kseniia Sumarokova](https://github.com/kssenii)).
+* Fix type mismatch issue in `sumMapFiltered` when using signed arguments. [#58408](https://github.com/ClickHouse/ClickHouse/pull/58408) ([Chen768959](https://github.com/Chen768959)).
+* Fix the monotonicity of `toHour`-like conversion functions when an optional time zone argument is passed. [#60264](https://github.com/ClickHouse/ClickHouse/pull/60264) ([Amos Bird](https://github.com/amosbird)).
+* Relax `supportsPrewhere` check for `Merge` tables. This fixes [#61064](https://github.com/ClickHouse/ClickHouse/issues/61064). It was hardened unnecessarily in [#60082](https://github.com/ClickHouse/ClickHouse/issues/60082). [#61091](https://github.com/ClickHouse/ClickHouse/pull/61091) ([Amos Bird](https://github.com/amosbird)).
+* Fix `use_concurrency_control` setting handling for proper `concurrent_threads_soft_limit_num` limit enforcing. This enables concurrency control by default because previously it was broken. [#61473](https://github.com/ClickHouse/ClickHouse/pull/61473) ([Sergei Trifonov](https://github.com/serxa)).
+* Fix incorrect `JOIN ON` section optimization in case of `IS NULL` check under any other function (like `NOT`) that may lead to wrong results. Closes [#67915](https://github.com/ClickHouse/ClickHouse/issues/67915). [#68049](https://github.com/ClickHouse/ClickHouse/pull/68049) ([Vladimir Cherkasov](https://github.com/vdimir)).
+* Prevent `ALTER` queries that would make the `CREATE` query of tables invalid. [#68574](https://github.com/ClickHouse/ClickHouse/pull/68574) ([János Benjamin Antal](https://github.com/antaljanosbenjamin)).
+* Fix inconsistent AST formatting for `negate` (`-`) and `NOT` functions with tuples and arrays. [#68600](https://github.com/ClickHouse/ClickHouse/pull/68600) ([Vladimir Cherkasov](https://github.com/vdimir)).
+* Fix insertion of incomplete type into `Dynamic` during deserialization. It could lead to `Parameter out of bound` errors. [#69291](https://github.com/ClickHouse/ClickHouse/pull/69291) ([Pavel Kruglov](https://github.com/Avogar)).
+* Zero-copy replication, which is experimental and should not be used in production: fix an infinite loop after `RESTORE REPLICA` in the replicated merge tree with zero copy. [#69293](https://github.com/ClickHouse/ClickHouse/pull/69293) ([MikhailBurdukov](https://github.com/MikhailBurdukov)).
+* Restore the default value of `processing_threads_num` to the number of CPU cores in the `S3Queue` storage. [#69384](https://github.com/ClickHouse/ClickHouse/pull/69384) ([Kseniia Sumarokova](https://github.com/kssenii)).
+* Bypass try/catch flow when de/serializing nested repeated protobuf to nested columns (fixes [#41971](https://github.com/ClickHouse/ClickHouse/issues/41971)). [#69556](https://github.com/ClickHouse/ClickHouse/pull/69556) ([Eliot Hautefeuille](https://github.com/hileef)).
+* Fix crash during insertion into FixedString column in PostgreSQL engine. [#69584](https://github.com/ClickHouse/ClickHouse/pull/69584) ([Pavel Kruglov](https://github.com/Avogar)).
+* Fix crash when executing `create view t as (with recursive 42 as ttt select ttt);`. [#69676](https://github.com/ClickHouse/ClickHouse/pull/69676) ([Han Fei](https://github.com/hanfei1991)).
+* Fixed `maxMapState` throwing 'Bad get' if value type is DateTime64. [#69787](https://github.com/ClickHouse/ClickHouse/pull/69787) ([Michael Kolupaev](https://github.com/al13n321)).
+* Fix `getSubcolumn` with `LowCardinality` columns by overriding `useDefaultImplementationForLowCardinalityColumns` to return `true`. [#69831](https://github.com/ClickHouse/ClickHouse/pull/69831) ([Miсhael Stetsyuk](https://github.com/mstetsyuk)).
+* Fix permanently blocked distributed sends if a DROP of a distributed table failed. [#69843](https://github.com/ClickHouse/ClickHouse/pull/69843) ([Azat Khuzhin](https://github.com/azat)).
+* Fix non-cancellable queries containing WITH FILL with NaN keys. This closes [#69261](https://github.com/ClickHouse/ClickHouse/issues/69261). [#69845](https://github.com/ClickHouse/ClickHouse/pull/69845) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Fix the analyzer default when an old `compatibility` setting value is used. [#69895](https://github.com/ClickHouse/ClickHouse/pull/69895) ([Raúl Marín](https://github.com/Algunenano)).
+* Don't check dependencies during `CREATE OR REPLACE VIEW` while dropping the old table. Previously, a `CREATE OR REPLACE` query failed when there were tables dependent on the recreated view. [#69907](https://github.com/ClickHouse/ClickHouse/pull/69907) ([Pavel Kruglov](https://github.com/Avogar)).
+* Something for Decimal. Fixes [#69730](https://github.com/ClickHouse/ClickHouse/issues/69730). [#69978](https://github.com/ClickHouse/ClickHouse/pull/69978) ([Arthur Passos](https://github.com/arthurpassos)).
+* Now DEFINER/INVOKER will work with parameterized views. [#69984](https://github.com/ClickHouse/ClickHouse/pull/69984) ([pufit](https://github.com/pufit)).
+* Fix parsing for view's definers. [#69985](https://github.com/ClickHouse/ClickHouse/pull/69985) ([pufit](https://github.com/pufit)).
+* Fixed a bug where the timezone could change the result of a query with `Date` or `Date32` arguments. [#70036](https://github.com/ClickHouse/ClickHouse/pull/70036) ([Yarik Briukhovetskyi](https://github.com/yariks5s)).
+* Fixes `Block structure mismatch` for queries with nested views and `WHERE` condition. Fixes [#66209](https://github.com/ClickHouse/ClickHouse/issues/66209). [#70054](https://github.com/ClickHouse/ClickHouse/pull/70054) ([Nikolai Kochetov](https://github.com/KochetovNicolai)).
+* Avoid reusing columns among different named tuples when evaluating `tuple` functions. This fixes [#70022](https://github.com/ClickHouse/ClickHouse/issues/70022). [#70103](https://github.com/ClickHouse/ClickHouse/pull/70103) ([Amos Bird](https://github.com/amosbird)).
+* Fix wrong LOGICAL_ERROR when replacing literals in ranges. [#70122](https://github.com/ClickHouse/ClickHouse/pull/70122) ([Pablo Marcos](https://github.com/pamarcos)).
+* Check for Nullable(Nothing) type during ALTER TABLE MODIFY COLUMN/QUERY to prevent tables with such data type. [#70123](https://github.com/ClickHouse/ClickHouse/pull/70123) ([Pavel Kruglov](https://github.com/Avogar)).
+* Proper error message for illegal query `JOIN ... ON *` , close [#68650](https://github.com/ClickHouse/ClickHouse/issues/68650). [#70124](https://github.com/ClickHouse/ClickHouse/pull/70124) ([Vladimir Cherkasov](https://github.com/vdimir)).
+* Fix wrong result with skipping index. [#70127](https://github.com/ClickHouse/ClickHouse/pull/70127) ([Raúl Marín](https://github.com/Algunenano)).
+* Fix data race in ColumnObject/ColumnTuple decompress method that could lead to heap use after free. [#70137](https://github.com/ClickHouse/ClickHouse/pull/70137) ([Pavel Kruglov](https://github.com/Avogar)).
+* Fix a possible hang in `ALTER COLUMN` with the `Dynamic` type. [#70144](https://github.com/ClickHouse/ClickHouse/pull/70144) ([Pavel Kruglov](https://github.com/Avogar)).
+* Now ClickHouse will consider more errors as retriable and will not mark data parts as broken in case of such errors. [#70145](https://github.com/ClickHouse/ClickHouse/pull/70145) ([alesapin](https://github.com/alesapin)).
+* Use correct `max_types` parameter during Dynamic type creation for JSON subcolumn. [#70147](https://github.com/ClickHouse/ClickHouse/pull/70147) ([Pavel Kruglov](https://github.com/Avogar)).
+* Fix the password being displayed in `system.query_log` for users with bcrypt password authentication method. [#70148](https://github.com/ClickHouse/ClickHouse/pull/70148) ([Nikolay Degterinsky](https://github.com/evillique)).
+* Fix event counter for the native interface (InterfaceNativeSendBytes). [#70153](https://github.com/ClickHouse/ClickHouse/pull/70153) ([Yakov Olkhovskiy](https://github.com/yakov-olkhovskiy)).
+* Fix possible crash related to JSON columns. [#70172](https://github.com/ClickHouse/ClickHouse/pull/70172) ([Pavel Kruglov](https://github.com/Avogar)).
+* Fix multiple issues with arrayMin and arrayMax. [#70207](https://github.com/ClickHouse/ClickHouse/pull/70207) ([Raúl Marín](https://github.com/Algunenano)).
+* Respect the setting `allow_simdjson` in the JSON type parser. [#70218](https://github.com/ClickHouse/ClickHouse/pull/70218) ([Pavel Kruglov](https://github.com/Avogar)).
+* Fix a null pointer dereference on creating a materialized view with two selects and an `INTERSECT`, e.g. `CREATE MATERIALIZED VIEW v0 AS (SELECT 1) INTERSECT (SELECT 1);`. [#70264](https://github.com/ClickHouse/ClickHouse/pull/70264) ([Konstantin Bogdanov](https://github.com/thevar1able)).
+* Don't modify global settings with startup scripts. Previously, changing a setting in a startup script would change it globally. [#70310](https://github.com/ClickHouse/ClickHouse/pull/70310) ([Antonio Andelic](https://github.com/antonio2368)).
+* Fix `ALTER` of the `Dynamic` type when reducing the `max_types` parameter, which could lead to a server crash. [#70328](https://github.com/ClickHouse/ClickHouse/pull/70328) ([Pavel Kruglov](https://github.com/Avogar)).
+* Fix crash when using WITH FILL incorrectly. [#70338](https://github.com/ClickHouse/ClickHouse/pull/70338) ([Raúl Marín](https://github.com/Algunenano)).
+* Fix possible use-after-free in `SYSTEM DROP FORMAT SCHEMA CACHE FOR Protobuf`. [#70358](https://github.com/ClickHouse/ClickHouse/pull/70358) ([Azat Khuzhin](https://github.com/azat)).
+* Fix a crash during `GROUP BY` on a JSON sub-object subcolumn. [#70374](https://github.com/ClickHouse/ClickHouse/pull/70374) ([Pavel Kruglov](https://github.com/Avogar)).
+* Don't prefetch parts for vertical merges if the part has no rows. [#70452](https://github.com/ClickHouse/ClickHouse/pull/70452) ([Antonio Andelic](https://github.com/antonio2368)).
+* Fix crash in WHERE with lambda functions. [#70464](https://github.com/ClickHouse/ClickHouse/pull/70464) ([Raúl Marín](https://github.com/Algunenano)).
+* Fix table creation with `CREATE ... AS table_function(...)` with database `Replicated` and unavailable table function source on secondary replica. [#70511](https://github.com/ClickHouse/ClickHouse/pull/70511) ([Kseniia Sumarokova](https://github.com/kssenii)).
+* Ignore all output on async insert with `wait_for_async_insert=1`. Closes [#62644](https://github.com/ClickHouse/ClickHouse/issues/62644). [#70530](https://github.com/ClickHouse/ClickHouse/pull/70530) ([Konstantin Bogdanov](https://github.com/thevar1able)).
+* Ignore `frozen_metadata.txt` while traversing the shadow directory from `system.remote_data_paths`. [#70590](https://github.com/ClickHouse/ClickHouse/pull/70590) ([Aleksei Filatov](https://github.com/aalexfvk)).
+* Fix creation of stateful window functions on misaligned memory. [#70631](https://github.com/ClickHouse/ClickHouse/pull/70631) ([Raúl Marín](https://github.com/Algunenano)).
+* Fixed rare crashes in `SELECT` queries and merges after adding a column of `Array` type with a non-empty default expression. [#70695](https://github.com/ClickHouse/ClickHouse/pull/70695) ([Anton Popov](https://github.com/CurtizJ)).
+* `INSERT` into the `s3` table function now respects query settings. [#70696](https://github.com/ClickHouse/ClickHouse/pull/70696) ([Vladimir Cherkasov](https://github.com/vdimir)).
+* Fix infinite recursion when inferring a protobuf schema when skipping unsupported fields is enabled. [#70697](https://github.com/ClickHouse/ClickHouse/pull/70697) ([Raúl Marín](https://github.com/Algunenano)).
+* Disable `enable_named_columns_in_function_tuple` by default. [#70833](https://github.com/ClickHouse/ClickHouse/pull/70833) ([Raúl Marín](https://github.com/Algunenano)).
+* Fix the S3Queue table engine setting `processing_threads_num` not being effective when it was deduced from the number of CPU cores on the server. [#70837](https://github.com/ClickHouse/ClickHouse/pull/70837) ([Kseniia Sumarokova](https://github.com/kssenii)).
+* Normalize named tuple arguments in aggregation states. This fixes [#69732](https://github.com/ClickHouse/ClickHouse/issues/69732) . [#70853](https://github.com/ClickHouse/ClickHouse/pull/70853) ([Amos Bird](https://github.com/amosbird)).
+* Fix a logical error due to negative zeros in the two-level hash table. This closes [#70973](https://github.com/ClickHouse/ClickHouse/issues/70973). [#70979](https://github.com/ClickHouse/ClickHouse/pull/70979) ([Alexey Milovidov](https://github.com/alexey-milovidov)).
+* Fix `LIMIT BY` and `LIMIT WITH TIES` for distributed queries and parallel replicas. [#70880](https://github.com/ClickHouse/ClickHouse/pull/70880) ([Nikita Taranov](https://github.com/nickitat)).
+
+
### ClickHouse release 24.9, 2024-09-26
#### Backward Incompatible Change
diff --git a/README.md b/README.md
index 3b5209dcbe9..dcaeda13acd 100644
--- a/README.md
+++ b/README.md
@@ -42,31 +42,19 @@ Keep an eye out for upcoming meetups and events around the world. Somewhere else
Upcoming meetups
-* [Jakarta Meetup](https://www.meetup.com/clickhouse-indonesia-user-group/events/303191359/) - October 1
-* [Singapore Meetup](https://www.meetup.com/clickhouse-singapore-meetup-group/events/303212064/) - October 3
-* [Madrid Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096564/) - October 22
-* [Oslo Meetup](https://www.meetup.com/open-source-real-time-data-warehouse-real-time-analytics/events/302938622) - October 31
* [Barcelona Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096876/) - November 12
* [Ghent Meetup](https://www.meetup.com/clickhouse-belgium-user-group/events/303049405/) - November 19
* [Dubai Meetup](https://www.meetup.com/clickhouse-dubai-meetup-group/events/303096989/) - November 21
* [Paris Meetup](https://www.meetup.com/clickhouse-france-user-group/events/303096434) - November 26
+* [Amsterdam Meetup](https://www.meetup.com/clickhouse-netherlands-user-group/events/303638814) - December 3
+* [New York Meetup](https://www.meetup.com/clickhouse-new-york-user-group/events/304268174) - December 9
+* [San Francisco Meetup](https://www.meetup.com/clickhouse-silicon-valley-meetup-group/events/304286951/) - December 12
Recently completed meetups
-* [ClickHouse Guangzhou User Group Meetup](https://mp.weixin.qq.com/s/GSvo-7xUoVzCsuUvlLTpCw) - August 25
-* [Seattle Meetup (Statsig)](https://www.meetup.com/clickhouse-seattle-user-group/events/302518075/) - August 27
-* [Melbourne Meetup](https://www.meetup.com/clickhouse-australia-user-group/events/302732666/) - August 27
-* [Sydney Meetup](https://www.meetup.com/clickhouse-australia-user-group/events/302862966/) - September 5
-* [Zurich Meetup](https://www.meetup.com/clickhouse-switzerland-meetup-group/events/302267429/) - September 5
-* [San Francisco Meetup (Cloudflare)](https://www.meetup.com/clickhouse-silicon-valley-meetup-group/events/302540575) - September 5
-* [Raleigh Meetup (Deutsche Bank)](https://www.meetup.com/triangletechtalks/events/302723486/) - September 9
-* [New York Meetup (Rokt)](https://www.meetup.com/clickhouse-new-york-user-group/events/302575342) - September 10
-* [Toronto Meetup (Shopify)](https://www.meetup.com/clickhouse-toronto-user-group/events/301490855/) - September 10
-* [Chicago Meetup (Jump Capital)](https://lu.ma/43tvmrfw) - September 12
-* [London Meetup](https://www.meetup.com/clickhouse-london-user-group/events/302977267) - September 17
-* [Austin Meetup](https://www.meetup.com/clickhouse-austin-user-group/events/302558689/) - September 17
-* [Bangalore Meetup](https://www.meetup.com/clickhouse-bangalore-user-group/events/303208274/) - September 18
-* [Tel Aviv Meetup](https://www.meetup.com/clickhouse-meetup-israel/events/303095121) - September 22
+* [Madrid Meetup](https://www.meetup.com/clickhouse-spain-user-group/events/303096564/) - October 22
+* [Singapore Meetup](https://www.meetup.com/clickhouse-singapore-meetup-group/events/303212064/) - October 3
+* [Jakarta Meetup](https://www.meetup.com/clickhouse-indonesia-user-group/events/303191359/) - October 1
## Recent Recordings
* **Recent Meetup Videos**: [Meetup Playlist](https://www.youtube.com/playlist?list=PL0Z2YDlm0b3iNDUzpY1S3L_iV4nARda_U) Whenever possible, recordings of the ClickHouse Community Meetups are edited and presented as individual talks. Currently featuring "Modern SQL in 2023", "Fast, Concurrent, and Consistent Asynchronous INSERTS in ClickHouse", and "Full-Text Indices: Design and Experiments"
diff --git a/base/base/chrono_io.h b/base/base/chrono_io.h
index 4ee8dec6634..d55aa11bc1d 100644
--- a/base/base/chrono_io.h
+++ b/base/base/chrono_io.h
@@ -4,6 +4,7 @@
#include
#include
#include
+#include
inline std::string to_string(const std::time_t & time)
@@ -11,18 +12,6 @@ inline std::string to_string(const std::time_t & time)
return cctz::format("%Y-%m-%d %H:%M:%S", std::chrono::system_clock::from_time_t(time), cctz::local_time_zone());
}
-template <typename Clock, typename Duration = typename Clock::duration>
-std::string to_string(const std::chrono::time_point<Clock, Duration> & tp)
-{
- // Don't use DateLUT because it shows weird characters for
- // TimePoint::max(). I wish we could use C++20 format, but it's not
- // there yet.
- // return DateLUT::instance().timeToString(std::chrono::system_clock::to_time_t(tp));
-
- auto in_time_t = std::chrono::system_clock::to_time_t(tp);
- return to_string(in_time_t);
-}
-
template <typename Rep, typename Period = std::ratio<1>>
std::string to_string(const std::chrono::duration<Rep, Period> & duration)
{
@@ -33,6 +22,20 @@ std::string to_string(const std::chrono::duration & duration)
return std::to_string(seconds_as_double.count()) + "s";
}
+template <typename Clock, typename Duration = typename Clock::duration>
+std::string to_string(const std::chrono::time_point<Clock, Duration> & tp)
+{
+ // Don't use DateLUT because it shows weird characters for
+ // TimePoint::max(). I wish we could use C++20 format, but it's not
+ // there yet.
+ // return DateLUT::instance().timeToString(std::chrono::system_clock::to_time_t(tp));
+
+    if constexpr (std::is_same_v<Clock, std::chrono::system_clock>)
+ return to_string(std::chrono::system_clock::to_time_t(tp));
+ else
+ return to_string(tp.time_since_epoch());
+}
+
template <typename Clock, typename Duration = typename Clock::duration>
std::ostream & operator<<(std::ostream & o, const std::chrono::time_point<Clock, Duration> & tp)
{
@@ -44,3 +47,23 @@ std::ostream & operator<<(std::ostream & o, const std::chrono::duration
+template <typename Clock, typename Duration>
+struct fmt::formatter<std::chrono::time_point<Clock, Duration>> : fmt::formatter<std::string>
+{
+    template <typename FormatCtx>
+    auto format(const std::chrono::time_point<Clock, Duration> & tp, FormatCtx & ctx) const
+    {
+        return fmt::formatter<std::string>::format(::to_string(tp), ctx);
+    }
+};
+
+template <typename Rep, typename Period>
+struct fmt::formatter<std::chrono::duration<Rep, Period>> : fmt::formatter<std::string>
+{
+    template <typename FormatCtx>
+    auto format(const std::chrono::duration<Rep, Period> & duration, FormatCtx & ctx) const
+    {
+        return fmt::formatter<std::string>::format(::to_string(duration), ctx);
+    }
+};
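The `chrono_io.h` change above replaces an unconditional `system_clock::to_time_t` call with a compile-time dispatch: only `std::chrono::system_clock` time points can be converted to `time_t`, so time points of other clocks (such as `steady_clock`) fall back to printing the duration since epoch. A minimal standalone sketch of that dispatch, simplified from the patch (the duration formatting here always prints fractional seconds, unlike the original):

```cpp
#include <chrono>
#include <ctime>
#include <string>
#include <type_traits>

// Any duration is printed as (possibly fractional) seconds, e.g. "1.500000s".
template <typename Rep, typename Period>
std::string to_string(const std::chrono::duration<Rep, Period> & duration)
{
    auto seconds = std::chrono::duration_cast<std::chrono::duration<double>>(duration);
    return std::to_string(seconds.count()) + "s";
}

// system_clock time points are convertible to calendar time via to_time_t;
// other clocks (e.g. steady_clock) have no calendar meaning, so print the
// raw duration since the clock's epoch instead.
template <typename Clock, typename Duration = typename Clock::duration>
std::string to_string(const std::chrono::time_point<Clock, Duration> & tp)
{
    if constexpr (std::is_same_v<Clock, std::chrono::system_clock>)
        return std::to_string(std::chrono::system_clock::to_time_t(tp));
    else
        return to_string(tp.time_since_epoch());
}
```

Without the `if constexpr` branch, instantiating `to_string` for a `steady_clock` time point would fail to compile, since `system_clock::to_time_t` only accepts `system_clock` time points.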
diff --git a/contrib/arrow b/contrib/arrow
index 5cfccd8ea65..6e2574f5013 160000
--- a/contrib/arrow
+++ b/contrib/arrow
@@ -1 +1 @@
-Subproject commit 5cfccd8ea65f33d4517e7409815d761c7650b45d
+Subproject commit 6e2574f5013a005c050c9a7787d341aef09d0063
diff --git a/contrib/arrow-cmake/CMakeLists.txt b/contrib/arrow-cmake/CMakeLists.txt
index 96d1f4adda7..208d48df178 100644
--- a/contrib/arrow-cmake/CMakeLists.txt
+++ b/contrib/arrow-cmake/CMakeLists.txt
@@ -213,13 +213,19 @@ target_include_directories(_orc SYSTEM PRIVATE
set(LIBRARY_DIR "${ClickHouse_SOURCE_DIR}/contrib/arrow/cpp/src/arrow")
# arrow/cpp/src/arrow/CMakeLists.txt (ARROW_SRCS + ARROW_COMPUTE + ARROW_IPC)
+# find . \( -iname \*.cc -o -iname \*.cpp -o -iname \*.c \) | sort | awk '{print "\"${LIBRARY_DIR}" substr($1,2) "\"" }' | grep -v 'test.cc' | grep -v 'json' | grep -v 'flight' \|
+# grep -v 'csv' | grep -v 'acero' | grep -v 'dataset' | grep -v 'testing' | grep -v 'gpu' | grep -v 'engine' | grep -v 'filesystem' | grep -v 'benchmark.cc'
set(ARROW_SRCS
+ "${LIBRARY_DIR}/adapters/orc/adapter.cc"
+ "${LIBRARY_DIR}/adapters/orc/options.cc"
+ "${LIBRARY_DIR}/adapters/orc/util.cc"
"${LIBRARY_DIR}/array/array_base.cc"
"${LIBRARY_DIR}/array/array_binary.cc"
"${LIBRARY_DIR}/array/array_decimal.cc"
"${LIBRARY_DIR}/array/array_dict.cc"
"${LIBRARY_DIR}/array/array_nested.cc"
"${LIBRARY_DIR}/array/array_primitive.cc"
+ "${LIBRARY_DIR}/array/array_run_end.cc"
"${LIBRARY_DIR}/array/builder_adaptive.cc"
"${LIBRARY_DIR}/array/builder_base.cc"
"${LIBRARY_DIR}/array/builder_binary.cc"
@@ -227,124 +233,26 @@ set(ARROW_SRCS
"${LIBRARY_DIR}/array/builder_dict.cc"
"${LIBRARY_DIR}/array/builder_nested.cc"
"${LIBRARY_DIR}/array/builder_primitive.cc"
- "${LIBRARY_DIR}/array/builder_union.cc"
"${LIBRARY_DIR}/array/builder_run_end.cc"
- "${LIBRARY_DIR}/array/array_run_end.cc"
+ "${LIBRARY_DIR}/array/builder_union.cc"
"${LIBRARY_DIR}/array/concatenate.cc"
"${LIBRARY_DIR}/array/data.cc"
"${LIBRARY_DIR}/array/diff.cc"
"${LIBRARY_DIR}/array/util.cc"
"${LIBRARY_DIR}/array/validate.cc"
- "${LIBRARY_DIR}/builder.cc"
"${LIBRARY_DIR}/buffer.cc"
- "${LIBRARY_DIR}/chunked_array.cc"
- "${LIBRARY_DIR}/chunk_resolver.cc"
- "${LIBRARY_DIR}/compare.cc"
- "${LIBRARY_DIR}/config.cc"
- "${LIBRARY_DIR}/datum.cc"
- "${LIBRARY_DIR}/device.cc"
- "${LIBRARY_DIR}/extension_type.cc"
- "${LIBRARY_DIR}/memory_pool.cc"
- "${LIBRARY_DIR}/pretty_print.cc"
- "${LIBRARY_DIR}/record_batch.cc"
- "${LIBRARY_DIR}/result.cc"
- "${LIBRARY_DIR}/scalar.cc"
- "${LIBRARY_DIR}/sparse_tensor.cc"
- "${LIBRARY_DIR}/status.cc"
- "${LIBRARY_DIR}/table.cc"
- "${LIBRARY_DIR}/table_builder.cc"
- "${LIBRARY_DIR}/tensor.cc"
- "${LIBRARY_DIR}/tensor/coo_converter.cc"
- "${LIBRARY_DIR}/tensor/csf_converter.cc"
- "${LIBRARY_DIR}/tensor/csx_converter.cc"
- "${LIBRARY_DIR}/type.cc"
- "${LIBRARY_DIR}/visitor.cc"
+ "${LIBRARY_DIR}/builder.cc"
"${LIBRARY_DIR}/c/bridge.cc"
- "${LIBRARY_DIR}/io/buffered.cc"
- "${LIBRARY_DIR}/io/caching.cc"
- "${LIBRARY_DIR}/io/compressed.cc"
- "${LIBRARY_DIR}/io/file.cc"
- "${LIBRARY_DIR}/io/hdfs.cc"
- "${LIBRARY_DIR}/io/hdfs_internal.cc"
- "${LIBRARY_DIR}/io/interfaces.cc"
- "${LIBRARY_DIR}/io/memory.cc"
- "${LIBRARY_DIR}/io/slow.cc"
- "${LIBRARY_DIR}/io/stdio.cc"
- "${LIBRARY_DIR}/io/transform.cc"
- "${LIBRARY_DIR}/util/async_util.cc"
- "${LIBRARY_DIR}/util/basic_decimal.cc"
- "${LIBRARY_DIR}/util/bit_block_counter.cc"
- "${LIBRARY_DIR}/util/bit_run_reader.cc"
- "${LIBRARY_DIR}/util/bit_util.cc"
- "${LIBRARY_DIR}/util/bitmap.cc"
- "${LIBRARY_DIR}/util/bitmap_builders.cc"
- "${LIBRARY_DIR}/util/bitmap_ops.cc"
- "${LIBRARY_DIR}/util/bpacking.cc"
- "${LIBRARY_DIR}/util/cancel.cc"
- "${LIBRARY_DIR}/util/compression.cc"
- "${LIBRARY_DIR}/util/counting_semaphore.cc"
- "${LIBRARY_DIR}/util/cpu_info.cc"
- "${LIBRARY_DIR}/util/decimal.cc"
- "${LIBRARY_DIR}/util/delimiting.cc"
- "${LIBRARY_DIR}/util/formatting.cc"
- "${LIBRARY_DIR}/util/future.cc"
- "${LIBRARY_DIR}/util/int_util.cc"
- "${LIBRARY_DIR}/util/io_util.cc"
- "${LIBRARY_DIR}/util/logging.cc"
- "${LIBRARY_DIR}/util/key_value_metadata.cc"
- "${LIBRARY_DIR}/util/memory.cc"
- "${LIBRARY_DIR}/util/mutex.cc"
- "${LIBRARY_DIR}/util/string.cc"
- "${LIBRARY_DIR}/util/string_builder.cc"
- "${LIBRARY_DIR}/util/task_group.cc"
- "${LIBRARY_DIR}/util/tdigest.cc"
- "${LIBRARY_DIR}/util/thread_pool.cc"
- "${LIBRARY_DIR}/util/time.cc"
- "${LIBRARY_DIR}/util/trie.cc"
- "${LIBRARY_DIR}/util/unreachable.cc"
- "${LIBRARY_DIR}/util/uri.cc"
- "${LIBRARY_DIR}/util/utf8.cc"
- "${LIBRARY_DIR}/util/value_parsing.cc"
- "${LIBRARY_DIR}/util/byte_size.cc"
- "${LIBRARY_DIR}/util/debug.cc"
- "${LIBRARY_DIR}/util/tracing.cc"
- "${LIBRARY_DIR}/util/atfork_internal.cc"
- "${LIBRARY_DIR}/util/crc32.cc"
- "${LIBRARY_DIR}/util/hashing.cc"
- "${LIBRARY_DIR}/util/ree_util.cc"
- "${LIBRARY_DIR}/util/union_util.cc"
- "${LIBRARY_DIR}/vendored/base64.cpp"
- "${LIBRARY_DIR}/vendored/datetime/tz.cpp"
- "${LIBRARY_DIR}/vendored/musl/strptime.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriCommon.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriCompare.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriEscape.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriFile.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriIp4Base.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriIp4.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriMemory.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriNormalizeBase.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriNormalize.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriParseBase.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriParse.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriQuery.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriRecompose.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriResolve.c"
- "${LIBRARY_DIR}/vendored/uriparser/UriShorten.c"
- "${LIBRARY_DIR}/vendored/double-conversion/bignum.cc"
- "${LIBRARY_DIR}/vendored/double-conversion/bignum-dtoa.cc"
- "${LIBRARY_DIR}/vendored/double-conversion/cached-powers.cc"
- "${LIBRARY_DIR}/vendored/double-conversion/double-to-string.cc"
- "${LIBRARY_DIR}/vendored/double-conversion/fast-dtoa.cc"
- "${LIBRARY_DIR}/vendored/double-conversion/fixed-dtoa.cc"
- "${LIBRARY_DIR}/vendored/double-conversion/string-to-double.cc"
- "${LIBRARY_DIR}/vendored/double-conversion/strtod.cc"
-
+ "${LIBRARY_DIR}/c/dlpack.cc"
+ "${LIBRARY_DIR}/chunk_resolver.cc"
+ "${LIBRARY_DIR}/chunked_array.cc"
+ "${LIBRARY_DIR}/compare.cc"
"${LIBRARY_DIR}/compute/api_aggregate.cc"
"${LIBRARY_DIR}/compute/api_scalar.cc"
"${LIBRARY_DIR}/compute/api_vector.cc"
"${LIBRARY_DIR}/compute/cast.cc"
"${LIBRARY_DIR}/compute/exec.cc"
+ "${LIBRARY_DIR}/compute/expression.cc"
"${LIBRARY_DIR}/compute/function.cc"
"${LIBRARY_DIR}/compute/function_internal.cc"
"${LIBRARY_DIR}/compute/kernel.cc"
@@ -355,6 +263,7 @@ set(ARROW_SRCS
"${LIBRARY_DIR}/compute/kernels/aggregate_var_std.cc"
"${LIBRARY_DIR}/compute/kernels/codegen_internal.cc"
"${LIBRARY_DIR}/compute/kernels/hash_aggregate.cc"
+ "${LIBRARY_DIR}/compute/kernels/ree_util_internal.cc"
"${LIBRARY_DIR}/compute/kernels/row_encoder.cc"
"${LIBRARY_DIR}/compute/kernels/scalar_arithmetic.cc"
"${LIBRARY_DIR}/compute/kernels/scalar_boolean.cc"
@@ -382,30 +291,139 @@ set(ARROW_SRCS
"${LIBRARY_DIR}/compute/kernels/vector_cumulative_ops.cc"
"${LIBRARY_DIR}/compute/kernels/vector_hash.cc"
"${LIBRARY_DIR}/compute/kernels/vector_nested.cc"
+ "${LIBRARY_DIR}/compute/kernels/vector_pairwise.cc"
"${LIBRARY_DIR}/compute/kernels/vector_rank.cc"
"${LIBRARY_DIR}/compute/kernels/vector_replace.cc"
+ "${LIBRARY_DIR}/compute/kernels/vector_run_end_encode.cc"
"${LIBRARY_DIR}/compute/kernels/vector_select_k.cc"
"${LIBRARY_DIR}/compute/kernels/vector_selection.cc"
- "${LIBRARY_DIR}/compute/kernels/vector_sort.cc"
- "${LIBRARY_DIR}/compute/kernels/vector_selection_internal.cc"
"${LIBRARY_DIR}/compute/kernels/vector_selection_filter_internal.cc"
+ "${LIBRARY_DIR}/compute/kernels/vector_selection_internal.cc"
"${LIBRARY_DIR}/compute/kernels/vector_selection_take_internal.cc"
- "${LIBRARY_DIR}/compute/light_array.cc"
- "${LIBRARY_DIR}/compute/registry.cc"
- "${LIBRARY_DIR}/compute/expression.cc"
+ "${LIBRARY_DIR}/compute/kernels/vector_sort.cc"
+ "${LIBRARY_DIR}/compute/key_hash_internal.cc"
+ "${LIBRARY_DIR}/compute/key_map_internal.cc"
+ "${LIBRARY_DIR}/compute/light_array_internal.cc"
"${LIBRARY_DIR}/compute/ordering.cc"
+ "${LIBRARY_DIR}/compute/registry.cc"
"${LIBRARY_DIR}/compute/row/compare_internal.cc"
"${LIBRARY_DIR}/compute/row/encode_internal.cc"
"${LIBRARY_DIR}/compute/row/grouper.cc"
"${LIBRARY_DIR}/compute/row/row_internal.cc"
-
+ "${LIBRARY_DIR}/compute/util.cc"
+ "${LIBRARY_DIR}/config.cc"
+ "${LIBRARY_DIR}/datum.cc"
+ "${LIBRARY_DIR}/device.cc"
+ "${LIBRARY_DIR}/extension_type.cc"
+ "${LIBRARY_DIR}/integration/c_data_integration_internal.cc"
+ "${LIBRARY_DIR}/io/buffered.cc"
+ "${LIBRARY_DIR}/io/caching.cc"
+ "${LIBRARY_DIR}/io/compressed.cc"
+ "${LIBRARY_DIR}/io/file.cc"
+ "${LIBRARY_DIR}/io/hdfs.cc"
+ "${LIBRARY_DIR}/io/hdfs_internal.cc"
+ "${LIBRARY_DIR}/io/interfaces.cc"
+ "${LIBRARY_DIR}/io/memory.cc"
+ "${LIBRARY_DIR}/io/slow.cc"
+ "${LIBRARY_DIR}/io/stdio.cc"
+ "${LIBRARY_DIR}/io/transform.cc"
"${LIBRARY_DIR}/ipc/dictionary.cc"
"${LIBRARY_DIR}/ipc/feather.cc"
+ "${LIBRARY_DIR}/ipc/file_to_stream.cc"
"${LIBRARY_DIR}/ipc/message.cc"
"${LIBRARY_DIR}/ipc/metadata_internal.cc"
"${LIBRARY_DIR}/ipc/options.cc"
"${LIBRARY_DIR}/ipc/reader.cc"
+ "${LIBRARY_DIR}/ipc/stream_to_file.cc"
"${LIBRARY_DIR}/ipc/writer.cc"
+ "${LIBRARY_DIR}/memory_pool.cc"
+ "${LIBRARY_DIR}/pretty_print.cc"
+ "${LIBRARY_DIR}/record_batch.cc"
+ "${LIBRARY_DIR}/result.cc"
+ "${LIBRARY_DIR}/scalar.cc"
+ "${LIBRARY_DIR}/sparse_tensor.cc"
+ "${LIBRARY_DIR}/status.cc"
+ "${LIBRARY_DIR}/table.cc"
+ "${LIBRARY_DIR}/table_builder.cc"
+ "${LIBRARY_DIR}/tensor.cc"
+ "${LIBRARY_DIR}/tensor/coo_converter.cc"
+ "${LIBRARY_DIR}/tensor/csf_converter.cc"
+ "${LIBRARY_DIR}/tensor/csx_converter.cc"
+ "${LIBRARY_DIR}/type.cc"
+ "${LIBRARY_DIR}/type_traits.cc"
+ "${LIBRARY_DIR}/util/align_util.cc"
+ "${LIBRARY_DIR}/util/async_util.cc"
+ "${LIBRARY_DIR}/util/atfork_internal.cc"
+ "${LIBRARY_DIR}/util/basic_decimal.cc"
+ "${LIBRARY_DIR}/util/bit_block_counter.cc"
+ "${LIBRARY_DIR}/util/bit_run_reader.cc"
+ "${LIBRARY_DIR}/util/bit_util.cc"
+ "${LIBRARY_DIR}/util/bitmap.cc"
+ "${LIBRARY_DIR}/util/bitmap_builders.cc"
+ "${LIBRARY_DIR}/util/bitmap_ops.cc"
+ "${LIBRARY_DIR}/util/bpacking.cc"
+ "${LIBRARY_DIR}/util/byte_size.cc"
+ "${LIBRARY_DIR}/util/cancel.cc"
+ "${LIBRARY_DIR}/util/compression.cc"
+ "${LIBRARY_DIR}/util/counting_semaphore.cc"
+ "${LIBRARY_DIR}/util/cpu_info.cc"
+ "${LIBRARY_DIR}/util/crc32.cc"
+ "${LIBRARY_DIR}/util/debug.cc"
+ "${LIBRARY_DIR}/util/decimal.cc"
+ "${LIBRARY_DIR}/util/delimiting.cc"
+ "${LIBRARY_DIR}/util/dict_util.cc"
+ "${LIBRARY_DIR}/util/float16.cc"
+ "${LIBRARY_DIR}/util/formatting.cc"
+ "${LIBRARY_DIR}/util/future.cc"
+ "${LIBRARY_DIR}/util/hashing.cc"
+ "${LIBRARY_DIR}/util/int_util.cc"
+ "${LIBRARY_DIR}/util/io_util.cc"
+ "${LIBRARY_DIR}/util/key_value_metadata.cc"
+ "${LIBRARY_DIR}/util/list_util.cc"
+ "${LIBRARY_DIR}/util/logging.cc"
+ "${LIBRARY_DIR}/util/memory.cc"
+ "${LIBRARY_DIR}/util/mutex.cc"
+ "${LIBRARY_DIR}/util/ree_util.cc"
+ "${LIBRARY_DIR}/util/string.cc"
+ "${LIBRARY_DIR}/util/string_builder.cc"
+ "${LIBRARY_DIR}/util/task_group.cc"
+ "${LIBRARY_DIR}/util/tdigest.cc"
+ "${LIBRARY_DIR}/util/thread_pool.cc"
+ "${LIBRARY_DIR}/util/time.cc"
+ "${LIBRARY_DIR}/util/tracing.cc"
+ "${LIBRARY_DIR}/util/trie.cc"
+ "${LIBRARY_DIR}/util/union_util.cc"
+ "${LIBRARY_DIR}/util/unreachable.cc"
+ "${LIBRARY_DIR}/util/uri.cc"
+ "${LIBRARY_DIR}/util/utf8.cc"
+ "${LIBRARY_DIR}/util/value_parsing.cc"
+ "${LIBRARY_DIR}/vendored/base64.cpp"
+ "${LIBRARY_DIR}/vendored/datetime/tz.cpp"
+ "${LIBRARY_DIR}/vendored/double-conversion/bignum-dtoa.cc"
+ "${LIBRARY_DIR}/vendored/double-conversion/bignum.cc"
+ "${LIBRARY_DIR}/vendored/double-conversion/cached-powers.cc"
+ "${LIBRARY_DIR}/vendored/double-conversion/double-to-string.cc"
+ "${LIBRARY_DIR}/vendored/double-conversion/fast-dtoa.cc"
+ "${LIBRARY_DIR}/vendored/double-conversion/fixed-dtoa.cc"
+ "${LIBRARY_DIR}/vendored/double-conversion/string-to-double.cc"
+ "${LIBRARY_DIR}/vendored/double-conversion/strtod.cc"
+ "${LIBRARY_DIR}/vendored/musl/strptime.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriCommon.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriCompare.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriEscape.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriFile.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriIp4.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriIp4Base.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriMemory.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriNormalize.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriNormalizeBase.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriParse.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriParseBase.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriQuery.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriRecompose.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriResolve.c"
+ "${LIBRARY_DIR}/vendored/uriparser/UriShorten.c"
+ "${LIBRARY_DIR}/visitor.cc"
"${ARROW_SRC_DIR}/arrow/adapters/orc/adapter.cc"
"${ARROW_SRC_DIR}/arrow/adapters/orc/util.cc"
@@ -465,22 +483,38 @@ set(PARQUET_SRCS
"${LIBRARY_DIR}/arrow/schema.cc"
"${LIBRARY_DIR}/arrow/schema_internal.cc"
"${LIBRARY_DIR}/arrow/writer.cc"
+ "${LIBRARY_DIR}/benchmark_util.cc"
"${LIBRARY_DIR}/bloom_filter.cc"
+ "${LIBRARY_DIR}/bloom_filter_reader.cc"
"${LIBRARY_DIR}/column_reader.cc"
"${LIBRARY_DIR}/column_scanner.cc"
"${LIBRARY_DIR}/column_writer.cc"
"${LIBRARY_DIR}/encoding.cc"
+ "${LIBRARY_DIR}/encryption/crypto_factory.cc"
"${LIBRARY_DIR}/encryption/encryption.cc"
"${LIBRARY_DIR}/encryption/encryption_internal.cc"
+ "${LIBRARY_DIR}/encryption/encryption_internal_nossl.cc"
+ "${LIBRARY_DIR}/encryption/file_key_unwrapper.cc"
+ "${LIBRARY_DIR}/encryption/file_key_wrapper.cc"
+ "${LIBRARY_DIR}/encryption/file_system_key_material_store.cc"
"${LIBRARY_DIR}/encryption/internal_file_decryptor.cc"
"${LIBRARY_DIR}/encryption/internal_file_encryptor.cc"
+ "${LIBRARY_DIR}/encryption/key_material.cc"
+ "${LIBRARY_DIR}/encryption/key_metadata.cc"
+ "${LIBRARY_DIR}/encryption/key_toolkit.cc"
+ "${LIBRARY_DIR}/encryption/key_toolkit_internal.cc"
+ "${LIBRARY_DIR}/encryption/kms_client.cc"
+ "${LIBRARY_DIR}/encryption/local_wrap_kms_client.cc"
+ "${LIBRARY_DIR}/encryption/openssl_internal.cc"
"${LIBRARY_DIR}/exception.cc"
"${LIBRARY_DIR}/file_reader.cc"
"${LIBRARY_DIR}/file_writer.cc"
- "${LIBRARY_DIR}/page_index.cc"
- "${LIBRARY_DIR}/level_conversion.cc"
"${LIBRARY_DIR}/level_comparison.cc"
+ "${LIBRARY_DIR}/level_comparison_avx2.cc"
+ "${LIBRARY_DIR}/level_conversion.cc"
+ "${LIBRARY_DIR}/level_conversion_bmi2.cc"
"${LIBRARY_DIR}/metadata.cc"
+ "${LIBRARY_DIR}/page_index.cc"
"${LIBRARY_DIR}/platform.cc"
"${LIBRARY_DIR}/printer.cc"
"${LIBRARY_DIR}/properties.cc"
@@ -489,7 +523,6 @@ set(PARQUET_SRCS
"${LIBRARY_DIR}/stream_reader.cc"
"${LIBRARY_DIR}/stream_writer.cc"
"${LIBRARY_DIR}/types.cc"
- "${LIBRARY_DIR}/bloom_filter_reader.cc"
"${LIBRARY_DIR}/xxhasher.cc"
"${GEN_LIBRARY_DIR}/parquet_constants.cpp"
@@ -520,6 +553,9 @@ endif ()
add_definitions(-DPARQUET_THRIFT_VERSION_MAJOR=0)
add_definitions(-DPARQUET_THRIFT_VERSION_MINOR=16)
+# As per https://github.com/apache/arrow/pull/35672 you need to enable it explicitly.
+add_definitions(-DARROW_ENABLE_THREADING)
+
# === tools
set(TOOLS_DIR "${ClickHouse_SOURCE_DIR}/contrib/arrow/cpp/tools/parquet")
diff --git a/contrib/flatbuffers b/contrib/flatbuffers
index eb3f8279482..0100f6a5779 160000
--- a/contrib/flatbuffers
+++ b/contrib/flatbuffers
@@ -1 +1 @@
-Subproject commit eb3f827948241ce0e701516f16cd67324802bce9
+Subproject commit 0100f6a5779831fa7a651e4b67ef389a8752bd9b
diff --git a/docker/test/base/setup_export_logs.sh b/docker/test/base/setup_export_logs.sh
index a39f96867be..12f1cc4d357 100755
--- a/docker/test/base/setup_export_logs.sh
+++ b/docker/test/base/setup_export_logs.sh
@@ -25,7 +25,7 @@ EXTRA_COLUMNS_EXPRESSION_TRACE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayMap(x -> d
# coverage_log needs more columns for symbolization, but only symbol names (the line numbers are too heavy to calculate)
EXTRA_COLUMNS_COVERAGE_LOG="${EXTRA_COLUMNS} symbols Array(LowCardinality(String)), "
-EXTRA_COLUMNS_EXPRESSION_COVERAGE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayMap(x -> demangle(addressToSymbol(x)), coverage)::Array(LowCardinality(String)) AS symbols"
+EXTRA_COLUMNS_EXPRESSION_COVERAGE_LOG="${EXTRA_COLUMNS_EXPRESSION}, arrayDistinct(arrayMap(x -> demangle(addressToSymbol(x)), coverage))::Array(LowCardinality(String)) AS symbols"
function __set_connection_args
diff --git a/docker/test/style/Dockerfile b/docker/test/style/Dockerfile
index fa6b087eb7d..564301f447c 100644
--- a/docker/test/style/Dockerfile
+++ b/docker/test/style/Dockerfile
@@ -28,7 +28,7 @@ COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && locale-gen en_US.UTF-8
-ENV LC_ALL en_US.UTF-8
+ENV LC_ALL=en_US.UTF-8
# Architecture of the image when BuildKit/buildx is used
ARG TARGETARCH
diff --git a/docker/test/style/requirements.txt b/docker/test/style/requirements.txt
index cc87f6e548d..aab20b5bee0 100644
--- a/docker/test/style/requirements.txt
+++ b/docker/test/style/requirements.txt
@@ -12,6 +12,7 @@ charset-normalizer==3.3.2
click==8.1.7
codespell==2.2.1
cryptography==43.0.1
+datacompy==0.7.3
Deprecated==1.2.14
dill==0.3.8
flake8==4.0.1
@@ -23,6 +24,7 @@ mccabe==0.6.1
multidict==6.0.5
mypy==1.8.0
mypy-extensions==1.0.0
+pandas==2.2.3
packaging==24.1
pathspec==0.9.0
pip==24.1.1
diff --git a/docs/en/engines/table-engines/integrations/s3.md b/docs/en/engines/table-engines/integrations/s3.md
index 2675c193519..fd27d4b6ed9 100644
--- a/docs/en/engines/table-engines/integrations/s3.md
+++ b/docs/en/engines/table-engines/integrations/s3.md
@@ -290,6 +290,7 @@ The following settings can be specified in configuration file for given endpoint
- `expiration_window_seconds` — Grace period for checking if expiration-based credentials have expired. Optional, default value is `120`.
- `no_sign_request` - Ignore all the credentials so requests are not signed. Useful for accessing public buckets.
- `header` — Adds specified HTTP header to a request to given endpoint. Optional, can be specified multiple times.
+- `access_header` - Adds the specified HTTP header to a request to the given endpoint, used when there are no credentials from another source.
- `server_side_encryption_customer_key_base64` — If specified, required headers for accessing S3 objects with SSE-C encryption will be set. Optional.
- `server_side_encryption_kms_key_id` - If specified, required headers for accessing S3 objects with [SSE-KMS encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) will be set. If an empty string is specified, the AWS managed S3 key will be used. Optional.
- `server_side_encryption_kms_encryption_context` - If specified alongside `server_side_encryption_kms_key_id`, the given encryption context header for SSE-KMS will be set. Optional.
@@ -320,6 +321,32 @@ The following settings can be specified in configuration file for given endpoint
```
+## Working with archives
+
+Suppose that we have several archive files with the following URIs on S3:
+
+- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-10.csv.zip'
+- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-11.csv.zip'
+- 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-12.csv.zip'
+
+Extracting data from these archives is possible using `::`. Globs can be used both in the URL part and in the part after `::` (which specifies the name of a file inside the archive).
+
+``` sql
+SELECT *
+FROM s3(
+ 'https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-2018-01-1{0..2}.csv.zip :: *.csv'
+);
+```
+
+:::note
+ClickHouse supports three archive formats: ZIP, TAR, and 7Z. While ZIP and TAR archives can be accessed from any supported storage location, 7Z archives can only be read from the local filesystem where ClickHouse is installed.
+:::
+
+
## Accessing public buckets
ClickHouse tries to fetch credentials from many different types of sources.
diff --git a/docs/en/interfaces/cli.md b/docs/en/interfaces/cli.md
index 66291014ed7..504f6eec6de 100644
--- a/docs/en/interfaces/cli.md
+++ b/docs/en/interfaces/cli.md
@@ -190,6 +190,7 @@ You can pass parameters to `clickhouse-client` (all parameters have a default va
- `--config-file` – The name of the configuration file.
- `--secure` – If specified, will connect to server over secure connection (TLS). You might need to configure your CA certificates in the [configuration file](#configuration_files). The available configuration settings are the same as for [server-side TLS configuration](../operations/server-configuration-parameters/settings.md#openssl).
- `--history_file` — Path to a file containing command history.
+- `--history_max_entries` — Maximum number of entries in the history file. Default value: 1 000 000.
- `--param_` — Value for a [query with parameters](#cli-queries-with-parameters).
- `--hardware-utilization` — Print hardware utilization information in progress bar.
- `--print-profile-events` – Print `ProfileEvents` packets.
diff --git a/docs/en/operations/server-configuration-parameters/settings.md b/docs/en/operations/server-configuration-parameters/settings.md
index 76d6f5388e3..02fa5a8ca58 100644
--- a/docs/en/operations/server-configuration-parameters/settings.md
+++ b/docs/en/operations/server-configuration-parameters/settings.md
@@ -3224,6 +3224,34 @@ Default value: "default"
**See Also**
- [Workload Scheduling](/docs/en/operations/workload-scheduling.md)
+## workload_path {#workload_path}
+
+The directory used as storage for all `CREATE WORKLOAD` and `CREATE RESOURCE` queries. By default, the `/workload/` folder under the server working directory is used.
+
+**Example**
+
+``` xml
+/var/lib/clickhouse/workload/
+```
+
+**See Also**
+- [Workload Hierarchy](/docs/en/operations/workload-scheduling.md#workloads)
+- [workload_zookeeper_path](#workload_zookeeper_path)
+
+## workload_zookeeper_path {#workload_zookeeper_path}
+
+The path to a ZooKeeper node, which is used as storage for all `CREATE WORKLOAD` and `CREATE RESOURCE` queries. For consistency, all SQL definitions are stored as the value of this single znode. By default, ZooKeeper is not used and definitions are stored on [disk](#workload_path).
+
+**Example**
+
+``` xml
+/clickhouse/workload/definitions.sql
+```
+
+**See Also**
+- [Workload Hierarchy](/docs/en/operations/workload-scheduling.md#workloads)
+- [workload_path](#workload_path)
+
## max_authentication_methods_per_user {#max_authentication_methods_per_user}
The maximum number of authentication methods a user can be created with or altered to.
diff --git a/docs/en/operations/system-tables/merge_tree_settings.md b/docs/en/operations/system-tables/merge_tree_settings.md
index 48217d63f9d..473315d3941 100644
--- a/docs/en/operations/system-tables/merge_tree_settings.md
+++ b/docs/en/operations/system-tables/merge_tree_settings.md
@@ -18,6 +18,11 @@ Columns:
- `1` — Current user can’t change the setting.
- `type` ([String](../../sql-reference/data-types/string.md)) — Setting type (implementation specific string value).
- `is_obsolete` ([UInt8](../../sql-reference/data-types/int-uint.md#uint-ranges)) - Shows whether a setting is obsolete.
+- `tier` ([Enum8](../../sql-reference/data-types/enum.md)) — Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their development and the expectations one might have when using them. Values:
+ - `'Production'` — The feature is stable, safe to use and does not have issues interacting with other **production** features.
+ - `'Beta'` — The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome.
+ - `'Experimental'` — The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time.
+ - `'Obsolete'` — No longer supported. Either it is already removed or it will be removed in future releases.
**Example**
```sql
diff --git a/docs/en/operations/system-tables/resources.md b/docs/en/operations/system-tables/resources.md
new file mode 100644
index 00000000000..6329f05f610
--- /dev/null
+++ b/docs/en/operations/system-tables/resources.md
@@ -0,0 +1,37 @@
+---
+slug: /en/operations/system-tables/resources
+---
+# resources
+
+Contains information for [resources](/docs/en/operations/workload-scheduling.md#workload_entity_storage) residing on the local server. The table contains a row for every resource.
+
+Example:
+
+``` sql
+SELECT *
+FROM system.resources
+FORMAT Vertical
+```
+
+``` text
+Row 1:
+──────
+name: io_read
+read_disks: ['s3']
+write_disks: []
+create_query: CREATE RESOURCE io_read (READ DISK s3)
+
+Row 2:
+──────
+name: io_write
+read_disks: []
+write_disks: ['s3']
+create_query: CREATE RESOURCE io_write (WRITE DISK s3)
+```
+
+Columns:
+
+- `name` (`String`) - Resource name.
+- `read_disks` (`Array(String)`) - The array of disk names that use this resource for read operations.
+- `write_disks` (`Array(String)`) - The array of disk names that use this resource for write operations.
+- `create_query` (`String`) - The definition of the resource.
diff --git a/docs/en/operations/system-tables/settings.md b/docs/en/operations/system-tables/settings.md
index a04e095e990..1cfee0ba5f4 100644
--- a/docs/en/operations/system-tables/settings.md
+++ b/docs/en/operations/system-tables/settings.md
@@ -18,6 +18,11 @@ Columns:
- `1` — Current user can’t change the setting.
- `default` ([String](../../sql-reference/data-types/string.md)) — Setting default value.
- `is_obsolete` ([UInt8](../../sql-reference/data-types/int-uint.md#uint-ranges)) - Shows whether a setting is obsolete.
+- `tier` ([Enum8](../../sql-reference/data-types/enum.md)) — Support level for this feature. ClickHouse features are organized in tiers, varying depending on the current status of their development and the expectations one might have when using them. Values:
+ - `'Production'` — The feature is stable, safe to use and does not have issues interacting with other **production** features.
+ - `'Beta'` — The feature is stable and safe. The outcome of using it together with other features is unknown and correctness is not guaranteed. Testing and reports are welcome.
+ - `'Experimental'` — The feature is under development. Only intended for developers and ClickHouse enthusiasts. The feature might or might not work and could be removed at any time.
+ - `'Obsolete'` — No longer supported. Either it is already removed or it will be removed in future releases.
**Example**
@@ -26,19 +31,99 @@ The following example shows how to get information about settings which name con
``` sql
SELECT *
FROM system.settings
-WHERE name LIKE '%min_i%'
+WHERE name LIKE '%min_insert_block_size_%'
+FORMAT Vertical
```
``` text
-┌─name───────────────────────────────────────────────_─value─────_─changed─_─description───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────_─min──_─max──_─readonly─_─type─────────_─default───_─alias_for─_─is_obsolete─┐
-│ min_insert_block_size_rows │ 1048449 │ 0 │ Squash blocks passed to INSERT query to specified size in rows, if blocks are not big enough. │ ____ │ ____ │ 0 │ UInt64 │ 1048449 │ │ 0 │
-│ min_insert_block_size_bytes │ 268402944 │ 0 │ Squash blocks passed to INSERT query to specified size in bytes, if blocks are not big enough. │ ____ │ ____ │ 0 │ UInt64 │ 268402944 │ │ 0 │
-│ min_insert_block_size_rows_for_materialized_views │ 0 │ 0 │ Like min_insert_block_size_rows, but applied only during pushing to MATERIALIZED VIEW (default: min_insert_block_size_rows) │ ____ │ ____ │ 0 │ UInt64 │ 0 │ │ 0 │
-│ min_insert_block_size_bytes_for_materialized_views │ 0 │ 0 │ Like min_insert_block_size_bytes, but applied only during pushing to MATERIALIZED VIEW (default: min_insert_block_size_bytes) │ ____ │ ____ │ 0 │ UInt64 │ 0 │ │ 0 │
-│ read_backoff_min_interval_between_events_ms │ 1000 │ 0 │ Settings to reduce the number of threads in case of slow reads. Do not pay attention to the event, if the previous one has passed less than a certain amount of time. │ ____ │ ____ │ 0 │ Milliseconds │ 1000 │ │ 0 │
-└────────────────────────────────────────────────────┴───────────┴─────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-──────────────────────────────────────────────────────┴──────┴──────┴──────────┴──────────────┴───────────┴───────────┴─────────────┘
-```
+Row 1:
+──────
+name: min_insert_block_size_rows
+value: 1048449
+changed: 0
+description: Sets the minimum number of rows in the block that can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones.
+
+Possible values:
+
+- Positive integer.
+- 0 — Squashing disabled.
+min: ᴺᵁᴸᴸ
+max: ᴺᵁᴸᴸ
+readonly: 0
+type: UInt64
+default: 1048449
+alias_for:
+is_obsolete: 0
+tier: Production
+
+Row 2:
+──────
+name: min_insert_block_size_bytes
+value: 268402944
+changed: 0
+description: Sets the minimum number of bytes in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones.
+
+Possible values:
+
+- Positive integer.
+- 0 — Squashing disabled.
+min: ᴺᵁᴸᴸ
+max: ᴺᵁᴸᴸ
+readonly: 0
+type: UInt64
+default: 268402944
+alias_for:
+is_obsolete: 0
+tier: Production
+
+Row 3:
+──────
+name: min_insert_block_size_rows_for_materialized_views
+value: 0
+changed: 0
+description: Sets the minimum number of rows in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. This setting is applied only for blocks inserted into [materialized view](../../sql-reference/statements/create/view.md). By adjusting this setting, you control blocks squashing while pushing to materialized view and avoid excessive memory usage.
+
+Possible values:
+
+- Any positive integer.
+- 0 — Squashing disabled.
+
+**See Also**
+
+- [min_insert_block_size_rows](#min-insert-block-size-rows)
+min: ᴺᵁᴸᴸ
+max: ᴺᵁᴸᴸ
+readonly: 0
+type: UInt64
+default: 0
+alias_for:
+is_obsolete: 0
+tier: Production
+
+Row 4:
+──────
+name: min_insert_block_size_bytes_for_materialized_views
+value: 0
+changed: 0
+description: Sets the minimum number of bytes in the block which can be inserted into a table by an `INSERT` query. Smaller-sized blocks are squashed into bigger ones. This setting is applied only for blocks inserted into [materialized view](../../sql-reference/statements/create/view.md). By adjusting this setting, you control blocks squashing while pushing to materialized view and avoid excessive memory usage.
+
+Possible values:
+
+- Any positive integer.
+- 0 — Squashing disabled.
+
+**See also**
+
+- [min_insert_block_size_bytes](#min-insert-block-size-bytes)
+min: ᴺᵁᴸᴸ
+max: ᴺᵁᴸᴸ
+readonly: 0
+type: UInt64
+default: 0
+alias_for:
+is_obsolete: 0
+tier: Production
+```
Using of `WHERE changed` can be useful, for example, when you want to check:
diff --git a/docs/en/operations/system-tables/workloads.md b/docs/en/operations/system-tables/workloads.md
new file mode 100644
index 00000000000..d9c62372044
--- /dev/null
+++ b/docs/en/operations/system-tables/workloads.md
@@ -0,0 +1,40 @@
+---
+slug: /en/operations/system-tables/workloads
+---
+# workloads
+
+Contains information for [workloads](/docs/en/operations/workload-scheduling.md#workload_entity_storage) residing on the local server. The table contains a row for every workload.
+
+Example:
+
+``` sql
+SELECT *
+FROM system.workloads
+FORMAT Vertical
+```
+
+``` text
+Row 1:
+──────
+name: production
+parent: all
+create_query: CREATE WORKLOAD production IN `all` SETTINGS weight = 9
+
+Row 2:
+──────
+name: development
+parent: all
+create_query: CREATE WORKLOAD development IN `all`
+
+Row 3:
+──────
+name: all
+parent:
+create_query: CREATE WORKLOAD `all`
+```
+
+Columns:
+
+- `name` (`String`) - Workload name.
+- `parent` (`String`) - Parent workload name.
+- `create_query` (`String`) - The definition of the workload.
diff --git a/docs/en/operations/workload-scheduling.md b/docs/en/operations/workload-scheduling.md
index 08629492ec6..a43bea7a5b1 100644
--- a/docs/en/operations/workload-scheduling.md
+++ b/docs/en/operations/workload-scheduling.md
@@ -43,6 +43,20 @@ Example:
```
+An alternative way to express which disks are used by a resource is SQL syntax:
+
+```sql
+CREATE RESOURCE resource_name (WRITE DISK disk1, READ DISK disk2)
+```
+
+A resource can be used for any number of disks, for READ, for WRITE, or for both READ and WRITE. There is also a syntax allowing a resource to be used for all disks:
+
+```sql
+CREATE RESOURCE all_io (READ ANY DISK, WRITE ANY DISK);
+```
+
+Note that server configuration options have priority over the SQL way of defining resources.
+
## Workload markup {#workload_markup}
Queries can be marked with the setting `workload` to distinguish different workloads. If `workload` is not set, then the value "default" is used. Note that you can specify another value using settings profiles. Setting constraints can be used to make `workload` constant if you want all queries from a user to be marked with a fixed value of the `workload` setting.
@@ -153,9 +167,48 @@ Example:
```
+## Workload hierarchy (SQL only) {#workloads}
+
+Defining resources and classifiers in XML can be challenging, so ClickHouse provides a more convenient SQL syntax. All resources created with `CREATE RESOURCE` share the same hierarchy structure but may differ in some aspects. Every workload created with `CREATE WORKLOAD` maintains a few automatically created scheduling nodes for every resource. A child workload can be created inside another parent workload. Here is an example that defines exactly the same hierarchy as the XML configuration above:
+
+```sql
+CREATE RESOURCE network_write (WRITE DISK s3)
+CREATE RESOURCE network_read (READ DISK s3)
+CREATE WORKLOAD all SETTINGS max_requests = 100
+CREATE WORKLOAD development IN all
+CREATE WORKLOAD production IN all SETTINGS weight = 3
+```
+
+The name of a leaf workload without children can be used in the query setting `SETTINGS workload = 'name'`. Note that workload classifiers are also created automatically when using SQL syntax.
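+
+For example, given the hierarchy above, a query can be attributed to the `production` workload (the table name here is hypothetical):
+
+```sql
+SELECT count() FROM my_table SETTINGS workload = 'production'
+```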
+
+To customize a workload, the following settings can be used:
+* `priority` - sibling workloads are served according to static priority values (a lower value means higher priority).
+* `weight` - sibling workloads having the same static priority share resources according to weights.
+* `max_requests` - the limit on the number of concurrent resource requests in this workload.
+* `max_cost` - the limit on the total number of in-flight bytes of concurrent resource requests in this workload.
+* `max_speed` - the limit on the byte processing rate of this workload (the limit is independent for every resource).
+* `max_burst` - the maximum number of bytes that can be processed by the workload without being throttled (independently for every resource).
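+
+Several of these settings can be combined in one definition; a sketch (the workload name `analytics` is illustrative):
+
+```sql
+CREATE WORKLOAD analytics IN production SETTINGS priority = 1, weight = 2, max_requests = 10
+```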
+
+Note that workload settings are translated into a proper set of scheduling nodes. For more details, see the description of the scheduling node [types and options](#hierarchy).
+
+There is no way to specify different hierarchies of workloads for different resources, but it is possible to specify a different workload setting value for a specific resource:
+
+```sql
+CREATE OR REPLACE WORKLOAD all SETTINGS max_requests = 100, max_speed = 1000000 FOR network_read, max_speed = 2000000 FOR network_write
+```
+
+Also note that a workload or resource cannot be dropped if it is referenced by another workload. To update the definition of a workload, use the `CREATE OR REPLACE WORKLOAD` query.
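+
+For instance, a sketch of removing the hierarchy defined earlier, assuming child workloads are dropped before their parents:
+
+```sql
+DROP WORKLOAD production
+DROP WORKLOAD development
+DROP WORKLOAD all
+DROP RESOURCE network_write
+DROP RESOURCE network_read
+```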
+
+## Workloads and resources storage {#workload_entity_storage}
+Definitions of all workloads and resources in the form of `CREATE WORKLOAD` and `CREATE RESOURCE` queries are stored persistently either on disk at `workload_path` or in ZooKeeper at `workload_zookeeper_path`. ZooKeeper storage is recommended to achieve consistency between nodes. Alternatively, the `ON CLUSTER` clause can be used along with disk storage.
+
## See also
- [system.scheduler](/docs/en/operations/system-tables/scheduler.md)
+ - [system.workloads](/docs/en/operations/system-tables/workloads.md)
+ - [system.resources](/docs/en/operations/system-tables/resources.md)
- [merge_workload](/docs/en/operations/settings/merge-tree-settings.md#merge_workload) merge tree setting
- [merge_workload](/docs/en/operations/server-configuration-parameters/settings.md#merge_workload) global server setting
- [mutation_workload](/docs/en/operations/settings/merge-tree-settings.md#mutation_workload) merge tree setting
- [mutation_workload](/docs/en/operations/server-configuration-parameters/settings.md#mutation_workload) global server setting
+ - [workload_path](/docs/en/operations/server-configuration-parameters/settings.md#workload_path) global server setting
+ - [workload_zookeeper_path](/docs/en/operations/server-configuration-parameters/settings.md#workload_zookeeper_path) global server setting
diff --git a/docs/en/sql-reference/statements/alter/user.md b/docs/en/sql-reference/statements/alter/user.md
index a56532e2ab0..1514b16a657 100644
--- a/docs/en/sql-reference/statements/alter/user.md
+++ b/docs/en/sql-reference/statements/alter/user.md
@@ -12,7 +12,7 @@ Syntax:
``` sql
ALTER USER [IF EXISTS] name1 [RENAME TO new_name |, name2 [,...]]
[ON CLUSTER cluster_name]
- [NOT IDENTIFIED | RESET AUTHENTICATION METHODS TO NEW | {IDENTIFIED | ADD IDENTIFIED} {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']}
+ [NOT IDENTIFIED | RESET AUTHENTICATION METHODS TO NEW | {IDENTIFIED | ADD IDENTIFIED} {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} [VALID UNTIL datetime]
[, {[{plaintext_password | sha256_password | sha256_hash | ...}] BY {'password' | 'hash'}} | {ldap SERVER 'server_name'} | {...} | ... [,...]]]
[[ADD | DROP] HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE]
[VALID UNTIL datetime]
@@ -91,3 +91,15 @@ Reset authentication methods and keep the most recent added one:
``` sql
ALTER USER user1 RESET AUTHENTICATION METHODS TO NEW
```
+
+## VALID UNTIL Clause
+
+Allows you to specify the expiration date and, optionally, the time for an authentication method. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
+The `VALID UNTIL` clause can only be specified along with an authentication method, except for the case where no authentication method has been specified in the query. In this scenario, the `VALID UNTIL` clause will be applied to all existing authentication methods.
+
+Examples:
+
+- `ALTER USER name1 VALID UNTIL '2025-01-01'`
+- `ALTER USER name1 VALID UNTIL '2025-01-01 12:00:00 UTC'`
+- `ALTER USER name1 VALID UNTIL 'infinity'`
+- `ALTER USER name1 IDENTIFIED WITH plaintext_password BY 'no_expiration', bcrypt_password BY 'expiration_set' VALID UNTIL '2025-01-01'`
diff --git a/docs/en/sql-reference/statements/create/user.md b/docs/en/sql-reference/statements/create/user.md
index 21f3dd6692c..03d93fc3365 100644
--- a/docs/en/sql-reference/statements/create/user.md
+++ b/docs/en/sql-reference/statements/create/user.md
@@ -11,7 +11,7 @@ Syntax:
``` sql
CREATE USER [IF NOT EXISTS | OR REPLACE] name1 [, name2 [,...]] [ON CLUSTER cluster_name]
- [NOT IDENTIFIED | IDENTIFIED {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']}
+ [NOT IDENTIFIED | IDENTIFIED {[WITH {plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | WITH NO_PASSWORD | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']} | {WITH ssl_certificate CN 'common_name' | SAN 'TYPE:subject_alt_name'} | {WITH ssh_key BY KEY 'public_key' TYPE 'ssh-rsa|...'} | {WITH http SERVER 'server_name' [SCHEME 'Basic']} [VALID UNTIL datetime]
[, {[{plaintext_password | sha256_password | sha256_hash | ...}] BY {'password' | 'hash'}} | {ldap SERVER 'server_name'} | {...} | ... [,...]]]
[HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE]
[VALID UNTIL datetime]
@@ -178,7 +178,8 @@ ClickHouse treats `user_name@'address'` as a username as a whole. Thus, technica
## VALID UNTIL Clause
-Allows you to specify the expiration date and, optionally, the time for user credentials. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
+Allows you to specify the expiration date and, optionally, the time for an authentication method. It accepts a string as a parameter. It is recommended to use the `YYYY-MM-DD [hh:mm:ss] [timezone]` format for datetime. By default, this parameter equals `'infinity'`.
+The `VALID UNTIL` clause can only be specified along with an authentication method, except for the case where no authentication method has been specified in the query. In this scenario, the `VALID UNTIL` clause will be applied to all existing authentication methods.
Examples:
@@ -186,6 +187,7 @@ Examples:
- `CREATE USER name1 VALID UNTIL '2025-01-01 12:00:00 UTC'`
- `CREATE USER name1 VALID UNTIL 'infinity'`
- ```CREATE USER name1 VALID UNTIL '2025-01-01 12:00:00 `Asia/Tokyo`'```
+- `CREATE USER name1 IDENTIFIED WITH plaintext_password BY 'no_expiration', bcrypt_password BY 'expiration_set' VALID UNTIL '2025-01-01'`
## GRANTEES Clause
diff --git a/docs/en/sql-reference/statements/create/view.md b/docs/en/sql-reference/statements/create/view.md
index 0e5d5250e0f..c770348bce0 100644
--- a/docs/en/sql-reference/statements/create/view.md
+++ b/docs/en/sql-reference/statements/create/view.md
@@ -55,7 +55,7 @@ SELECT * FROM view(column1=value1, column2=value2 ...)
## Materialized View
``` sql
-CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] [TO[db.]name] [ENGINE = engine] [POPULATE]
+CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster_name] [TO[db.]name] [ENGINE = engine] [POPULATE]
[DEFINER = { user | CURRENT_USER }] [SQL SECURITY { DEFINER | INVOKER | NONE }]
AS SELECT ...
[COMMENT 'comment']
diff --git a/docs/en/sql-reference/statements/grant.md b/docs/en/sql-reference/statements/grant.md
index c11299baf38..decb28d44d5 100644
--- a/docs/en/sql-reference/statements/grant.md
+++ b/docs/en/sql-reference/statements/grant.md
@@ -78,6 +78,10 @@ Specifying privileges you can use asterisk (`*`) instead of a table or a databas
Also, you can omit database name. In this case privileges are granted for current database.
For example, `GRANT SELECT ON * TO john` grants the privilege on all the tables in the current database, `GRANT SELECT ON mytable TO john` grants the privilege on the `mytable` table in the current database.
+:::note
+The feature described below is available starting from ClickHouse version 24.10.
+:::
+
You can also put asterisks at the end of a table or a database name. This feature allows you to grant privileges on an abstract prefix of the table's path.
Example: `GRANT SELECT ON db.my_tables* TO john`. This query allows `john` to execute the `SELECT` query over all the `db` database tables with the prefix `my_tables*`.
@@ -238,10 +242,13 @@ Hierarchy of privileges:
- `HDFS`
- `HIVE`
- `JDBC`
+ - `KAFKA`
- `MONGO`
- `MYSQL`
+ - `NATS`
- `ODBC`
- `POSTGRES`
+ - `RABBITMQ`
- `REDIS`
- `REMOTE`
- `S3`
@@ -520,10 +527,13 @@ Allows using external data sources. Applies to [table engines](../../engines/tab
- `HDFS`. Level: `GLOBAL`
- `HIVE`. Level: `GLOBAL`
- `JDBC`. Level: `GLOBAL`
+ - `KAFKA`. Level: `GLOBAL`
- `MONGO`. Level: `GLOBAL`
- `MYSQL`. Level: `GLOBAL`
+ - `NATS`. Level: `GLOBAL`
- `ODBC`. Level: `GLOBAL`
- `POSTGRES`. Level: `GLOBAL`
+ - `RABBITMQ`. Level: `GLOBAL`
- `REDIS`. Level: `GLOBAL`
- `REMOTE`. Level: `GLOBAL`
- `S3`. Level: `GLOBAL`
diff --git a/docs/en/sql-reference/statements/kill.md b/docs/en/sql-reference/statements/kill.md
index 667a5b51f5c..ff6f64a97fe 100644
--- a/docs/en/sql-reference/statements/kill.md
+++ b/docs/en/sql-reference/statements/kill.md
@@ -83,7 +83,7 @@ The presence of long-running or incomplete mutations often indicates that a Clic
- Or manually kill some of these mutations by sending a `KILL` command.
``` sql
-KILL MUTATION [ON CLUSTER cluster]
+KILL MUTATION
WHERE
[TEST]
[FORMAT format]
@@ -135,7 +135,6 @@ KILL MUTATION WHERE database = 'default' AND table = 'table'
-- Cancel the specific mutation:
KILL MUTATION WHERE database = 'default' AND table = 'table' AND mutation_id = 'mutation_3.txt'
```
-:::tip If you are killing a mutation in ClickHouse Cloud or in a self-managed cluster, then be sure to use the ```ON CLUSTER [cluster-name]``` option, in order to ensure the mutation is killed on all replicas:::
The query is useful when a mutation is stuck and cannot finish (e.g. if some function in the mutation query throws an exception when applied to the data contained in the table).
diff --git a/docs/en/sql-reference/table-functions/s3.md b/docs/en/sql-reference/table-functions/s3.md
index df4e10425a5..b14eb84392f 100644
--- a/docs/en/sql-reference/table-functions/s3.md
+++ b/docs/en/sql-reference/table-functions/s3.md
@@ -284,6 +284,14 @@ FROM s3(
);
```
+:::note
+ClickHouse supports three archive formats:
+- ZIP
+- TAR
+- 7Z
+
+While ZIP and TAR archives can be accessed from any supported storage location, 7Z archives can only be read from the local filesystem where ClickHouse is installed.
+:::
+
## Virtual Columns {#virtual-columns}
diff --git a/docs/ru/engines/table-engines/integrations/s3.md b/docs/ru/engines/table-engines/integrations/s3.md
index a1c69df4d0a..2bab78c0612 100644
--- a/docs/ru/engines/table-engines/integrations/s3.md
+++ b/docs/ru/engines/table-engines/integrations/s3.md
@@ -138,6 +138,7 @@ CREATE TABLE table_with_asterisk (name String, value UInt32)
- `use_insecure_imds_request` — признак использования менее безопасного соединения при выполнении запроса к IMDS при получении учётных данных из метаданных Amazon EC2. Значение по умолчанию — `false`.
- `region` — название региона S3.
- `header` — добавляет указанный HTTP-заголовок к запросу на заданную точку приема запроса. Может быть определен несколько раз.
+- `access_header` - добавляет указанный HTTP-заголовок к запросу на заданную точку приема запроса, в случае если не указаны другие способы авторизации.
- `server_side_encryption_customer_key_base64` — устанавливает необходимые заголовки для доступа к объектам S3 с шифрованием SSE-C.
- `single_read_retries` — Максимальное количество попыток запроса при единичном чтении. Значение по умолчанию — `4`.
diff --git a/docs/ru/sql-reference/statements/create/view.md b/docs/ru/sql-reference/statements/create/view.md
index 8fa30446bb3..5dbffd90205 100644
--- a/docs/ru/sql-reference/statements/create/view.md
+++ b/docs/ru/sql-reference/statements/create/view.md
@@ -39,7 +39,7 @@ SELECT a, b, c FROM (SELECT ...)
## Материализованные представления {#materialized}
``` sql
-CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] [TO[db.]name] [ENGINE = engine] [POPULATE]
+CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster_name] [TO[db.]name] [ENGINE = engine] [POPULATE]
[DEFINER = { user | CURRENT_USER }] [SQL SECURITY { DEFINER | INVOKER | NONE }]
AS SELECT ...
```
diff --git a/docs/ru/sql-reference/statements/grant.md b/docs/ru/sql-reference/statements/grant.md
index 2ccc2d05452..79682dc42cd 100644
--- a/docs/ru/sql-reference/statements/grant.md
+++ b/docs/ru/sql-reference/statements/grant.md
@@ -192,14 +192,23 @@ GRANT SELECT(x,y) ON db.table TO john WITH GRANT OPTION
- `addressToSymbol`
- `demangle`
- [SOURCES](#grant-sources)
+ - `AZURE`
- `FILE`
- - `URL`
- - `REMOTE`
- - `MYSQL`
- - `ODBC`
- - `JDBC`
- `HDFS`
+ - `HIVE`
+ - `JDBC`
+ - `KAFKA`
+ - `MONGO`
+ - `MYSQL`
+ - `NATS`
+ - `ODBC`
+ - `POSTGRES`
+ - `RABBITMQ`
+ - `REDIS`
+ - `REMOTE`
- `S3`
+ - `SQLITE`
+ - `URL`
- [dictGet](#grant-dictget)
Примеры того, как трактуется данная иерархия:
@@ -461,14 +470,23 @@ GRANT INSERT(x,y) ON db.table TO john
Разрешает использовать внешние источники данных. Применяется к [движкам таблиц](../../engines/table-engines/index.md) и [табличным функциям](../table-functions/index.md#table-functions).
- `SOURCES`. Уровень: `GROUP`
+ - `AZURE`. Уровень: `GLOBAL`
- `FILE`. Уровень: `GLOBAL`
- - `URL`. Уровень: `GLOBAL`
- - `REMOTE`. Уровень: `GLOBAL`
- - `MYSQL`. Уровень: `GLOBAL`
- - `ODBC`. Уровень: `GLOBAL`
- - `JDBC`. Уровень: `GLOBAL`
- `HDFS`. Уровень: `GLOBAL`
+ - `HIVE`. Уровень: `GLOBAL`
+ - `JDBC`. Уровень: `GLOBAL`
+ - `KAFKA`. Уровень: `GLOBAL`
+ - `MONGO`. Уровень: `GLOBAL`
+ - `MYSQL`. Уровень: `GLOBAL`
+ - `NATS`. Уровень: `GLOBAL`
+ - `ODBC`. Уровень: `GLOBAL`
+ - `POSTGRES`. Уровень: `GLOBAL`
+ - `RABBITMQ`. Уровень: `GLOBAL`
+ - `REDIS`. Уровень: `GLOBAL`
+ - `REMOTE`. Уровень: `GLOBAL`
- `S3`. Уровень: `GLOBAL`
+ - `SQLITE`. Уровень: `GLOBAL`
+ - `URL`. Уровень: `GLOBAL`
Привилегия `SOURCES` разрешает использование всех источников. Также вы можете присвоить привилегию для каждого источника отдельно. Для использования источников необходимы дополнительные привилегии.
diff --git a/docs/zh/sql-reference/statements/create/view.md b/docs/zh/sql-reference/statements/create/view.md
index 49a1d66bdf1..6c93240644d 100644
--- a/docs/zh/sql-reference/statements/create/view.md
+++ b/docs/zh/sql-reference/statements/create/view.md
@@ -39,7 +39,7 @@ SELECT a, b, c FROM (SELECT ...)
## Materialized {#materialized}
``` sql
-CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
+CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster_name] [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
```
物化视图存储由相应的[SELECT](../../../sql-reference/statements/select/index.md)管理.
diff --git a/docs/zh/sql-reference/statements/grant.md b/docs/zh/sql-reference/statements/grant.md
index fea51d590d5..3fd314c791f 100644
--- a/docs/zh/sql-reference/statements/grant.md
+++ b/docs/zh/sql-reference/statements/grant.md
@@ -170,14 +170,23 @@ GRANT SELECT(x,y) ON db.table TO john WITH GRANT OPTION
- `addressToSymbol`
- `demangle`
- [SOURCES](#grant-sources)
+ - `AZURE`
- `FILE`
- - `URL`
- - `REMOTE`
- - `YSQL`
- - `ODBC`
- - `JDBC`
- `HDFS`
+ - `HIVE`
+ - `JDBC`
+ - `KAFKA`
+ - `MONGO`
+ - `MYSQL`
+ - `NATS`
+ - `ODBC`
+ - `POSTGRES`
+ - `RABBITMQ`
+ - `REDIS`
+ - `REMOTE`
- `S3`
+ - `SQLITE`
+ - `URL`
- [dictGet](#grant-dictget)
如何对待该层级的示例:
@@ -428,14 +437,23 @@ GRANT INSERT(x,y) ON db.table TO john
允许在 [table engines](../../engines/table-engines/index.md) 和 [table functions](../../sql-reference/table-functions/index.md#table-functions)中使用外部数据源。
- `SOURCES`. Level: `GROUP`
+ - `AZURE`. Level: `GLOBAL`
- `FILE`. Level: `GLOBAL`
- - `URL`. Level: `GLOBAL`
- - `REMOTE`. Level: `GLOBAL`
- - `YSQL`. Level: `GLOBAL`
- - `ODBC`. Level: `GLOBAL`
- - `JDBC`. Level: `GLOBAL`
- `HDFS`. Level: `GLOBAL`
+ - `HIVE`. Level: `GLOBAL`
+ - `JDBC`. Level: `GLOBAL`
+ - `KAFKA`. Level: `GLOBAL`
+ - `MONGO`. Level: `GLOBAL`
+ - `MYSQL`. Level: `GLOBAL`
+ - `NATS`. Level: `GLOBAL`
+ - `ODBC`. Level: `GLOBAL`
+ - `POSTGRES`. Level: `GLOBAL`
+ - `RABBITMQ`. Level: `GLOBAL`
+ - `REDIS`. Level: `GLOBAL`
+ - `REMOTE`. Level: `GLOBAL`
- `S3`. Level: `GLOBAL`
+ - `SQLITE`. Level: `GLOBAL`
+ - `URL`. Level: `GLOBAL`
The `SOURCES` privilege allows the use of all data sources. You can also grant the privilege for each data source individually. Additional privileges are required to use a data source.
diff --git a/programs/client/Client.cpp b/programs/client/Client.cpp
index 4aab7fcae14..d7190444f0b 100644
--- a/programs/client/Client.cpp
+++ b/programs/client/Client.cpp
@@ -192,6 +192,10 @@ void Client::parseConnectionsCredentials(Poco::Util::AbstractConfiguration & con
history_file = home_path + "/" + history_file.substr(1);
config.setString("history_file", history_file);
}
+ if (config.has(prefix + ".history_max_entries"))
+ {
+ config.setUInt("history_max_entries", history_max_entries);
+ }
if (config.has(prefix + ".accept-invalid-certificate"))
config.setBool("accept-invalid-certificate", config.getBool(prefix + ".accept-invalid-certificate"));
}
diff --git a/programs/disks/DisksApp.cpp b/programs/disks/DisksApp.cpp
index 5fddfce0678..610d8eaa638 100644
--- a/programs/disks/DisksApp.cpp
+++ b/programs/disks/DisksApp.cpp
@@ -236,6 +236,7 @@ void DisksApp::runInteractiveReplxx()
ReplxxLineReader lr(
suggest,
history_file,
+ history_max_entries,
/* multiline= */ false,
/* ignore_shell_suspend= */ false,
query_extenders,
@@ -398,6 +399,8 @@ void DisksApp::initializeHistoryFile()
throw;
}
}
+
+ history_max_entries = config().getUInt("history-max-entries", 1000000);
}
void DisksApp::init(const std::vector & common_arguments)
diff --git a/programs/disks/DisksApp.h b/programs/disks/DisksApp.h
index 5b240648508..4f2bd7fcad6 100644
--- a/programs/disks/DisksApp.h
+++ b/programs/disks/DisksApp.h
@@ -62,6 +62,8 @@ private:
// Fields responsible for the REPL work
String history_file;
+ UInt32 history_max_entries = 0; /// Maximum number of entries in the history file. Needs to be initialized to 0 since we don't have a proper constructor. Worry not, actual value is set within the initializeHistoryFile method.
+
LineReader::Suggest suggest;
static LineReader::Patterns query_extenders;
static LineReader::Patterns query_delimiters;
diff --git a/programs/keeper-client/KeeperClient.cpp b/programs/keeper-client/KeeperClient.cpp
index 101ed270fc5..2a426fad7ac 100644
--- a/programs/keeper-client/KeeperClient.cpp
+++ b/programs/keeper-client/KeeperClient.cpp
@@ -243,6 +243,8 @@ void KeeperClient::initialize(Poco::Util::Application & /* self */)
}
}
+ history_max_entries = config().getUInt("history-max-entries", 1000000);
+
String default_log_level;
if (config().has("query"))
/// We don't want to see any information log in query mode, unless it was set explicitly
@@ -319,6 +321,7 @@ void KeeperClient::runInteractiveReplxx()
ReplxxLineReader lr(
suggest,
history_file,
+ history_max_entries,
/* multiline= */ false,
/* ignore_shell_suspend= */ false,
query_extenders,
diff --git a/programs/keeper-client/KeeperClient.h b/programs/keeper-client/KeeperClient.h
index 0d3db3c2f02..359663c6a13 100644
--- a/programs/keeper-client/KeeperClient.h
+++ b/programs/keeper-client/KeeperClient.h
@@ -59,6 +59,8 @@ protected:
std::vector getCompletions(const String & prefix) const;
String history_file;
+ UInt32 history_max_entries; /// Maximum number of entries in the history file.
+
LineReader::Suggest suggest;
zkutil::ZooKeeperArgs zk_args;
diff --git a/programs/keeper/Keeper.cpp b/programs/keeper/Keeper.cpp
index 3007df60765..74af9950e13 100644
--- a/programs/keeper/Keeper.cpp
+++ b/programs/keeper/Keeper.cpp
@@ -590,6 +590,7 @@ try
#if USE_SSL
CertificateReloader::instance().tryLoad(*config);
+ CertificateReloader::instance().tryLoadClient(*config);
#endif
});
diff --git a/programs/server/Server.cpp b/programs/server/Server.cpp
index 56b43a39351..1f481381b2b 100644
--- a/programs/server/Server.cpp
+++ b/programs/server/Server.cpp
@@ -86,7 +86,7 @@
#include
#include
#include
-#include
+#include
#include
#include
#include "MetricsTransmitter.h"
@@ -207,7 +207,6 @@ namespace ServerSetting
extern const ServerSettingsBool format_alter_operations_with_parentheses;
extern const ServerSettingsUInt64 global_profiler_cpu_time_period_ns;
extern const ServerSettingsUInt64 global_profiler_real_time_period_ns;
- extern const ServerSettingsDouble gwp_asan_force_sample_probability;
extern const ServerSettingsUInt64 http_connections_soft_limit;
extern const ServerSettingsUInt64 http_connections_store_limit;
extern const ServerSettingsUInt64 http_connections_warn_limit;
@@ -622,7 +621,7 @@ void sanityChecks(Server & server)
#if defined(OS_LINUX)
try
{
- const std::unordered_set fastClockSources = {
+ const std::unordered_set fast_clock_sources = {
// ARM clock
"arch_sys_counter",
// KVM guest clock
@@ -631,7 +630,7 @@ void sanityChecks(Server & server)
"tsc",
};
const char * filename = "/sys/devices/system/clocksource/clocksource0/current_clocksource";
- if (!fastClockSources.contains(readLine(filename)))
+ if (!fast_clock_sources.contains(readLine(filename)))
server.context()->addWarningMessage("Linux is not using a fast clock source. Performance can be degraded. Check " + String(filename));
}
catch (...) // NOLINT(bugprone-empty-catch)
@@ -921,7 +920,6 @@ try
registerFormats();
registerRemoteFileMetadatas();
registerSchedulerNodes();
- registerResourceManagers();
CurrentMetrics::set(CurrentMetrics::Revision, ClickHouseRevision::getVersionRevision());
CurrentMetrics::set(CurrentMetrics::VersionInteger, ClickHouseRevision::getVersionInteger());
@@ -1930,10 +1928,6 @@ try
if (global_context->isServerCompletelyStarted())
CannotAllocateThreadFaultInjector::setFaultProbability(new_server_settings[ServerSetting::cannot_allocate_thread_fault_injection_probability]);
-#if USE_GWP_ASAN
- GWPAsan::setForceSampleProbability(new_server_settings[ServerSetting::gwp_asan_force_sample_probability]);
-#endif
-
ProfileEvents::increment(ProfileEvents::MainConfigLoads);
/// Must be the last.
@@ -2258,6 +2252,8 @@ try
database_catalog.assertDatabaseExists(default_database);
/// Load user-defined SQL functions.
global_context->getUserDefinedSQLObjectsStorage().loadObjects();
+ /// Load WORKLOADs and RESOURCEs.
+ global_context->getWorkloadEntityStorage().loadEntities();
global_context->getRefreshSet().setRefreshesStopped(false);
}
@@ -2345,6 +2341,7 @@ try
#if USE_SSL
CertificateReloader::instance().tryLoad(config());
+ CertificateReloader::instance().tryLoadClient(config());
#endif
/// Must be done after initialization of `servers`, because async_metrics will access `servers` variable from its thread.
@@ -2440,7 +2437,6 @@ try
#if USE_GWP_ASAN
GWPAsan::initFinished();
- GWPAsan::setForceSampleProbability(server_settings[ServerSetting::gwp_asan_force_sample_probability]);
#endif
try
diff --git a/programs/server/config.xml b/programs/server/config.xml
index 15649b5c95d..9807f8c0d5a 100644
--- a/programs/server/config.xml
+++ b/programs/server/config.xml
@@ -1399,6 +1399,10 @@
If not specified they will be stored locally. -->
+
+
+
diff --git a/src/Access/AuthenticationData.cpp b/src/Access/AuthenticationData.cpp
index 57a1cd756ff..37a4e356af8 100644
--- a/src/Access/AuthenticationData.cpp
+++ b/src/Access/AuthenticationData.cpp
@@ -1,12 +1,16 @@
#include
#include
#include
+#include
#include
#include
#include
#include
#include
#include
+#include
+#include
+#include
#include
#include
@@ -113,7 +117,8 @@ bool operator ==(const AuthenticationData & lhs, const AuthenticationData & rhs)
&& (lhs.ssh_keys == rhs.ssh_keys)
#endif
&& (lhs.http_auth_scheme == rhs.http_auth_scheme)
- && (lhs.http_auth_server_name == rhs.http_auth_server_name);
+ && (lhs.http_auth_server_name == rhs.http_auth_server_name)
+ && (lhs.valid_until == rhs.valid_until);
}
@@ -384,14 +389,34 @@ std::shared_ptr AuthenticationData::toAST() const
throw Exception(ErrorCodes::LOGICAL_ERROR, "AST: Unexpected authentication type {}", toString(auth_type));
}
+
+ if (valid_until)
+ {
+ WriteBufferFromOwnString out;
+ writeDateTimeText(valid_until, out);
+
+ node->valid_until = std::make_shared(out.str());
+ }
+
return node;
}
AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & query, ContextPtr context, bool validate)
{
+ time_t valid_until = 0;
+
+ if (query.valid_until)
+ {
+ valid_until = getValidUntilFromAST(query.valid_until, context);
+ }
+
if (query.type && query.type == AuthenticationType::NO_PASSWORD)
- return AuthenticationData();
+ {
+ AuthenticationData auth_data;
+ auth_data.setValidUntil(valid_until);
+ return auth_data;
+ }
/// For this type of authentication we have ASTPublicSSHKey as children for ASTAuthenticationData
if (query.type && query.type == AuthenticationType::SSH_KEY)
@@ -418,6 +443,7 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que
}
auth_data.setSSHKeys(std::move(keys));
+ auth_data.setValidUntil(valid_until);
return auth_data;
#else
throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "SSH is disabled, because ClickHouse is built without libssh");
@@ -451,6 +477,8 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que
AuthenticationData auth_data(current_type);
+ auth_data.setValidUntil(valid_until);
+
if (validate)
context->getAccessControl().checkPasswordComplexityRules(value);
@@ -494,6 +522,7 @@ AuthenticationData AuthenticationData::fromAST(const ASTAuthenticationData & que
}
AuthenticationData auth_data(*query.type);
+ auth_data.setValidUntil(valid_until);
if (query.contains_hash)
{
diff --git a/src/Access/AuthenticationData.h b/src/Access/AuthenticationData.h
index a0c100264f8..2d8d008c925 100644
--- a/src/Access/AuthenticationData.h
+++ b/src/Access/AuthenticationData.h
@@ -74,6 +74,9 @@ public:
const String & getHTTPAuthenticationServerName() const { return http_auth_server_name; }
void setHTTPAuthenticationServerName(const String & name) { http_auth_server_name = name; }
+ time_t getValidUntil() const { return valid_until; }
+ void setValidUntil(time_t valid_until_) { valid_until = valid_until_; }
+
friend bool operator ==(const AuthenticationData & lhs, const AuthenticationData & rhs);
friend bool operator !=(const AuthenticationData & lhs, const AuthenticationData & rhs) { return !(lhs == rhs); }
@@ -106,6 +109,7 @@ private:
/// HTTP authentication properties
String http_auth_server_name;
HTTPAuthenticationScheme http_auth_scheme = HTTPAuthenticationScheme::BASIC;
+ time_t valid_until = 0;
};
}
diff --git a/src/Access/Common/AccessType.h b/src/Access/Common/AccessType.h
index 010d11e533a..ec543104167 100644
--- a/src/Access/Common/AccessType.h
+++ b/src/Access/Common/AccessType.h
@@ -99,6 +99,8 @@ enum class AccessType : uint8_t
M(CREATE_ARBITRARY_TEMPORARY_TABLE, "", GLOBAL, CREATE) /* allows to create and manipulate temporary tables
with arbitrary table engine */\
M(CREATE_FUNCTION, "", GLOBAL, CREATE) /* allows to execute CREATE FUNCTION */ \
+ M(CREATE_WORKLOAD, "", GLOBAL, CREATE) /* allows to execute CREATE WORKLOAD */ \
+ M(CREATE_RESOURCE, "", GLOBAL, CREATE) /* allows to execute CREATE RESOURCE */ \
M(CREATE_NAMED_COLLECTION, "", NAMED_COLLECTION, NAMED_COLLECTION_ADMIN) /* allows to execute CREATE NAMED COLLECTION */ \
M(CREATE, "", GROUP, ALL) /* allows to execute {CREATE|ATTACH} */ \
\
@@ -108,6 +110,8 @@ enum class AccessType : uint8_t
implicitly enabled by the grant DROP_TABLE */\
M(DROP_DICTIONARY, "", DICTIONARY, DROP) /* allows to execute {DROP|DETACH} DICTIONARY */\
M(DROP_FUNCTION, "", GLOBAL, DROP) /* allows to execute DROP FUNCTION */\
+ M(DROP_WORKLOAD, "", GLOBAL, DROP) /* allows to execute DROP WORKLOAD */\
+ M(DROP_RESOURCE, "", GLOBAL, DROP) /* allows to execute DROP RESOURCE */\
M(DROP_NAMED_COLLECTION, "", NAMED_COLLECTION, NAMED_COLLECTION_ADMIN) /* allows to execute DROP NAMED COLLECTION */\
M(DROP, "", GROUP, ALL) /* allows to execute {DROP|DETACH} */\
\
@@ -239,6 +243,9 @@ enum class AccessType : uint8_t
M(S3, "", GLOBAL, SOURCES) \
M(HIVE, "", GLOBAL, SOURCES) \
M(AZURE, "", GLOBAL, SOURCES) \
+ M(KAFKA, "", GLOBAL, SOURCES) \
+ M(NATS, "", GLOBAL, SOURCES) \
+ M(RABBITMQ, "", GLOBAL, SOURCES) \
M(SOURCES, "", GROUP, ALL) \
\
M(CLUSTER, "", GLOBAL, ALL) /* ON CLUSTER queries */ \
diff --git a/src/Access/ContextAccess.cpp b/src/Access/ContextAccess.cpp
index 949fd37e403..06e89d78339 100644
--- a/src/Access/ContextAccess.cpp
+++ b/src/Access/ContextAccess.cpp
@@ -52,7 +52,10 @@ namespace
{AccessType::HDFS, "HDFS"},
{AccessType::S3, "S3"},
{AccessType::HIVE, "Hive"},
- {AccessType::AZURE, "AzureBlobStorage"}
+ {AccessType::AZURE, "AzureBlobStorage"},
+ {AccessType::KAFKA, "Kafka"},
+ {AccessType::NATS, "NATS"},
+ {AccessType::RABBITMQ, "RabbitMQ"}
};
@@ -701,15 +704,17 @@ bool ContextAccess::checkAccessImplHelper(const ContextPtr & context, AccessFlag
const AccessFlags dictionary_ddl = AccessType::CREATE_DICTIONARY | AccessType::DROP_DICTIONARY;
const AccessFlags function_ddl = AccessType::CREATE_FUNCTION | AccessType::DROP_FUNCTION;
+ const AccessFlags workload_ddl = AccessType::CREATE_WORKLOAD | AccessType::DROP_WORKLOAD;
+ const AccessFlags resource_ddl = AccessType::CREATE_RESOURCE | AccessType::DROP_RESOURCE;
const AccessFlags table_and_dictionary_ddl = table_ddl | dictionary_ddl;
const AccessFlags table_and_dictionary_and_function_ddl = table_ddl | dictionary_ddl | function_ddl;
const AccessFlags write_table_access = AccessType::INSERT | AccessType::OPTIMIZE;
const AccessFlags write_dcl_access = AccessType::ACCESS_MANAGEMENT - AccessType::SHOW_ACCESS;
- const AccessFlags not_readonly_flags = write_table_access | table_and_dictionary_and_function_ddl | write_dcl_access | AccessType::SYSTEM | AccessType::KILL_QUERY;
+ const AccessFlags not_readonly_flags = write_table_access | table_and_dictionary_and_function_ddl | workload_ddl | resource_ddl | write_dcl_access | AccessType::SYSTEM | AccessType::KILL_QUERY;
const AccessFlags not_readonly_1_flags = AccessType::CREATE_TEMPORARY_TABLE;
- const AccessFlags ddl_flags = table_ddl | dictionary_ddl | function_ddl;
+ const AccessFlags ddl_flags = table_ddl | dictionary_ddl | function_ddl | workload_ddl | resource_ddl;
const AccessFlags introspection_flags = AccessType::INTROSPECTION;
};
static const PrecalculatedFlags precalc;
diff --git a/src/Access/IAccessStorage.cpp b/src/Access/IAccessStorage.cpp
index 3249d89ba87..72e0933e214 100644
--- a/src/Access/IAccessStorage.cpp
+++ b/src/Access/IAccessStorage.cpp
@@ -554,7 +554,7 @@ std::optional IAccessStorage::authenticateImpl(
continue;
}
- if (areCredentialsValid(user->getName(), user->valid_until, auth_method, credentials, external_authenticators, auth_result.settings))
+ if (areCredentialsValid(user->getName(), auth_method, credentials, external_authenticators, auth_result.settings))
{
auth_result.authentication_data = auth_method;
return auth_result;
@@ -579,7 +579,6 @@ std::optional IAccessStorage::authenticateImpl(
bool IAccessStorage::areCredentialsValid(
const std::string & user_name,
- time_t valid_until,
const AuthenticationData & authentication_method,
const Credentials & credentials,
const ExternalAuthenticators & external_authenticators,
@@ -591,6 +590,7 @@ bool IAccessStorage::areCredentialsValid(
if (credentials.getUserName() != user_name)
return false;
+ auto valid_until = authentication_method.getValidUntil();
if (valid_until)
{
const time_t now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
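The hunk above moves the expiration deadline off the `User` entity and onto each authentication method: `areCredentialsValid` now reads `valid_until` from the `AuthenticationData` being checked, so each method can expire independently. A minimal sketch of the expiration rule, where `0` means "never expires" (the helper name is illustrative, not the actual ClickHouse API):

```cpp
#include <cassert>
#include <ctime>

// Mirrors the check in IAccessStorage::areCredentialsValid: a valid_until of 0
// means no expiration is configured; otherwise the credentials are rejected
// once the current time passes the deadline.
bool credentialsStillValid(std::time_t valid_until, std::time_t now)
{
    if (valid_until == 0)
        return true;            // no expiration configured
    return now <= valid_until;  // valid until the deadline passes
}
```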
diff --git a/src/Access/IAccessStorage.h b/src/Access/IAccessStorage.h
index 84cbdd0a751..4e2b27a1864 100644
--- a/src/Access/IAccessStorage.h
+++ b/src/Access/IAccessStorage.h
@@ -236,7 +236,6 @@ protected:
bool allow_plaintext_password) const;
virtual bool areCredentialsValid(
const std::string & user_name,
- time_t valid_until,
const AuthenticationData & authentication_method,
const Credentials & credentials,
const ExternalAuthenticators & external_authenticators,
diff --git a/src/Access/User.cpp b/src/Access/User.cpp
index 887abc213f9..1c92f467003 100644
--- a/src/Access/User.cpp
+++ b/src/Access/User.cpp
@@ -19,8 +19,7 @@ bool User::equal(const IAccessEntity & other) const
return (authentication_methods == other_user.authentication_methods)
&& (allowed_client_hosts == other_user.allowed_client_hosts)
&& (access == other_user.access) && (granted_roles == other_user.granted_roles) && (default_roles == other_user.default_roles)
- && (settings == other_user.settings) && (grantees == other_user.grantees) && (default_database == other_user.default_database)
- && (valid_until == other_user.valid_until);
+ && (settings == other_user.settings) && (grantees == other_user.grantees) && (default_database == other_user.default_database);
}
void User::setName(const String & name_)
@@ -88,7 +87,6 @@ void User::clearAllExceptDependencies()
access = {};
settings.removeSettingsKeepProfiles();
default_database = {};
- valid_until = 0;
}
}
diff --git a/src/Access/User.h b/src/Access/User.h
index 03d62bf2277..f54e74a305d 100644
--- a/src/Access/User.h
+++ b/src/Access/User.h
@@ -23,7 +23,6 @@ struct User : public IAccessEntity
SettingsProfileElements settings;
RolesOrUsersSet grantees = RolesOrUsersSet::AllTag{};
String default_database;
- time_t valid_until = 0;
bool equal(const IAccessEntity & other) const override;
std::shared_ptr clone() const override { return cloneImpl(); }
diff --git a/src/Analyzer/Resolve/QueryAnalyzer.cpp b/src/Analyzer/Resolve/QueryAnalyzer.cpp
index 381edee607d..cb3087af707 100644
--- a/src/Analyzer/Resolve/QueryAnalyzer.cpp
+++ b/src/Analyzer/Resolve/QueryAnalyzer.cpp
@@ -227,8 +227,13 @@ void QueryAnalyzer::resolveConstantExpression(QueryTreeNodePtr & node, const Que
scope.context = context;
auto node_type = node->getNodeType();
+ if (node_type == QueryTreeNodeType::QUERY || node_type == QueryTreeNodeType::UNION)
+ {
+ evaluateScalarSubqueryIfNeeded(node, scope);
+ return;
+ }
- if (table_expression && node_type != QueryTreeNodeType::QUERY && node_type != QueryTreeNodeType::UNION)
+ if (table_expression)
{
scope.expression_join_tree_node = table_expression;
validateTableExpressionModifiers(scope.expression_join_tree_node, scope);
diff --git a/src/Backups/BackupConcurrencyCheck.cpp b/src/Backups/BackupConcurrencyCheck.cpp
new file mode 100644
index 00000000000..8b29ae41b53
--- /dev/null
+++ b/src/Backups/BackupConcurrencyCheck.cpp
@@ -0,0 +1,135 @@
+#include
+
+#include
+#include
+
+
+namespace DB
+{
+
+namespace ErrorCodes
+{
+ extern const int CONCURRENT_ACCESS_NOT_SUPPORTED;
+}
+
+
+BackupConcurrencyCheck::BackupConcurrencyCheck(
+ const UUID & backup_or_restore_uuid_,
+ bool is_restore_,
+ bool on_cluster_,
+ bool allow_concurrency_,
+ BackupConcurrencyCounters & counters_)
+ : is_restore(is_restore_), backup_or_restore_uuid(backup_or_restore_uuid_), on_cluster(on_cluster_), counters(counters_)
+{
+ std::lock_guard lock{counters.mutex};
+
+ if (!allow_concurrency_)
+ {
+ bool found_concurrent_operation = false;
+ if (is_restore)
+ {
+ size_t num_local_restores = counters.local_restores;
+ size_t num_on_cluster_restores = counters.on_cluster_restores.size();
+ if (on_cluster)
+ {
+ if (!counters.on_cluster_restores.contains(backup_or_restore_uuid))
+ ++num_on_cluster_restores;
+ }
+ else
+ {
+ ++num_local_restores;
+ }
+ found_concurrent_operation = (num_local_restores + num_on_cluster_restores > 1);
+ }
+ else
+ {
+ size_t num_local_backups = counters.local_backups;
+ size_t num_on_cluster_backups = counters.on_cluster_backups.size();
+ if (on_cluster)
+ {
+ if (!counters.on_cluster_backups.contains(backup_or_restore_uuid))
+ ++num_on_cluster_backups;
+ }
+ else
+ {
+ ++num_local_backups;
+ }
+ found_concurrent_operation = (num_local_backups + num_on_cluster_backups > 1);
+ }
+
+ if (found_concurrent_operation)
+ throwConcurrentOperationNotAllowed(is_restore);
+ }
+
+ if (on_cluster)
+ {
+ if (is_restore)
+ ++counters.on_cluster_restores[backup_or_restore_uuid];
+ else
+ ++counters.on_cluster_backups[backup_or_restore_uuid];
+ }
+ else
+ {
+ if (is_restore)
+ ++counters.local_restores;
+ else
+ ++counters.local_backups;
+ }
+}
+
+
+BackupConcurrencyCheck::~BackupConcurrencyCheck()
+{
+ std::lock_guard lock{counters.mutex};
+
+ if (on_cluster)
+ {
+ if (is_restore)
+ {
+ auto it = counters.on_cluster_restores.find(backup_or_restore_uuid);
+ if (it != counters.on_cluster_restores.end())
+ {
+ if (!--it->second)
+ counters.on_cluster_restores.erase(it);
+ }
+ }
+ else
+ {
+ auto it = counters.on_cluster_backups.find(backup_or_restore_uuid);
+ if (it != counters.on_cluster_backups.end())
+ {
+ if (!--it->second)
+ counters.on_cluster_backups.erase(it);
+ }
+ }
+ }
+ else
+ {
+ if (is_restore)
+ --counters.local_restores;
+ else
+ --counters.local_backups;
+ }
+}
+
+
+void BackupConcurrencyCheck::throwConcurrentOperationNotAllowed(bool is_restore)
+{
+ throw Exception(
+ ErrorCodes::CONCURRENT_ACCESS_NOT_SUPPORTED,
+ "Concurrent {} are not allowed, turn on setting '{}'",
+ is_restore ? "restores" : "backups",
+ is_restore ? "allow_concurrent_restores" : "allow_concurrent_backups");
+}
+
+
+BackupConcurrencyCounters::BackupConcurrencyCounters() = default;
+
+
+BackupConcurrencyCounters::~BackupConcurrencyCounters()
+{
+ if (local_backups > 0 || local_restores > 0 || !on_cluster_backups.empty() || !on_cluster_restores.empty())
+ LOG_ERROR(getLogger(__PRETTY_FUNCTION__), "Some backups or restores are processing");
+}
+
+}
diff --git a/src/Backups/BackupConcurrencyCheck.h b/src/Backups/BackupConcurrencyCheck.h
new file mode 100644
index 00000000000..048a23a716a
--- /dev/null
+++ b/src/Backups/BackupConcurrencyCheck.h
@@ -0,0 +1,55 @@
+#pragma once
+
+#include
+#include
+#include
+#include
+
+
+namespace DB
+{
+class BackupConcurrencyCounters;
+
+/// Local checker for concurrent BACKUP or RESTORE operations.
+/// This class is used by implementations of IBackupCoordination and IRestoreCoordination
+/// to throw an exception if concurrent backups or restores are not allowed.
+class BackupConcurrencyCheck
+{
+public:
+ /// Checks concurrency of a BACKUP operation or a RESTORE operation.
+ /// Keep a constructed instance of BackupConcurrencyCheck until the operation is done.
+ BackupConcurrencyCheck(
+ const UUID & backup_or_restore_uuid_,
+ bool is_restore_,
+ bool on_cluster_,
+ bool allow_concurrency_,
+ BackupConcurrencyCounters & counters_);
+
+ ~BackupConcurrencyCheck();
+
+ [[noreturn]] static void throwConcurrentOperationNotAllowed(bool is_restore);
+
+private:
+ const bool is_restore;
+ const UUID backup_or_restore_uuid;
+ const bool on_cluster;
+ BackupConcurrencyCounters & counters;
+};
+
+
+class BackupConcurrencyCounters
+{
+public:
+ BackupConcurrencyCounters();
+ ~BackupConcurrencyCounters();
+
+private:
+ friend class BackupConcurrencyCheck;
+ size_t local_backups TSA_GUARDED_BY(mutex) = 0;
+ size_t local_restores TSA_GUARDED_BY(mutex) = 0;
+ std::unordered_map on_cluster_backups TSA_GUARDED_BY(mutex);
+ std::unordered_map on_cluster_restores TSA_GUARDED_BY(mutex);
+ std::mutex mutex;
+};
+
+}
diff --git a/src/Backups/BackupCoordinationCleaner.cpp b/src/Backups/BackupCoordinationCleaner.cpp
new file mode 100644
index 00000000000..1f5068a94de
--- /dev/null
+++ b/src/Backups/BackupCoordinationCleaner.cpp
@@ -0,0 +1,64 @@
+#include
+
+
+namespace DB
+{
+
+BackupCoordinationCleaner::BackupCoordinationCleaner(const String & zookeeper_path_, const WithRetries & with_retries_, LoggerPtr log_)
+ : zookeeper_path(zookeeper_path_), with_retries(with_retries_), log(log_)
+{
+}
+
+void BackupCoordinationCleaner::cleanup()
+{
+ tryRemoveAllNodes(/* throw_if_error = */ true, /* retries_kind = */ WithRetries::kNormal);
+}
+
+bool BackupCoordinationCleaner::tryCleanupAfterError() noexcept
+{
+ return tryRemoveAllNodes(/* throw_if_error = */ false, /* retries_kind = */ WithRetries::kNormal);
+}
+
+bool BackupCoordinationCleaner::tryRemoveAllNodes(bool throw_if_error, WithRetries::Kind retries_kind)
+{
+ {
+ std::lock_guard lock{mutex};
+ if (cleanup_result.succeeded)
+ return true;
+ if (cleanup_result.exception)
+ {
+ if (throw_if_error)
+ std::rethrow_exception(cleanup_result.exception);
+ return false;
+ }
+ }
+
+ try
+ {
+ LOG_TRACE(log, "Removing nodes from ZooKeeper");
+ auto holder = with_retries.createRetriesControlHolder("removeAllNodes", retries_kind);
+ holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]()
+ {
+ with_retries.renewZooKeeper(zookeeper);
+ zookeeper->removeRecursive(zookeeper_path);
+ });
+
+ std::lock_guard lock{mutex};
+ cleanup_result.succeeded = true;
+ return true;
+ }
+ catch (...)
+ {
+ LOG_TRACE(log, "Caught exception while removing nodes from ZooKeeper for this restore: {}",
+ getCurrentExceptionMessage(/* with_stacktrace= */ false, /* check_embedded_stacktrace= */ true));
+
+ std::lock_guard lock{mutex};
+ cleanup_result.exception = std::current_exception();
+
+ if (throw_if_error)
+ throw;
+ return false;
+ }
+}
+
+}
diff --git a/src/Backups/BackupCoordinationCleaner.h b/src/Backups/BackupCoordinationCleaner.h
new file mode 100644
index 00000000000..43e095d9f33
--- /dev/null
+++ b/src/Backups/BackupCoordinationCleaner.h
@@ -0,0 +1,40 @@
+#pragma once
+
+#include
+
+
+namespace DB
+{
+
+/// Removes all the nodes from ZooKeeper used to coordinate a BACKUP ON CLUSTER operation or
+/// a RESTORE ON CLUSTER operation (successful or not).
+/// This class is used by BackupCoordinationOnCluster and RestoreCoordinationOnCluster to cleanup.
+class BackupCoordinationCleaner
+{
+public:
+ BackupCoordinationCleaner(const String & zookeeper_path_, const WithRetries & with_retries_, LoggerPtr log_);
+
+ void cleanup();
+ bool tryCleanupAfterError() noexcept;
+
+private:
+ bool tryRemoveAllNodes(bool throw_if_error, WithRetries::Kind retries_kind);
+
+ const String zookeeper_path;
+
+ /// A reference to a field of the parent object which is either BackupCoordinationOnCluster or RestoreCoordinationOnCluster.
+ const WithRetries & with_retries;
+
+ const LoggerPtr log;
+
+ struct CleanupResult
+ {
+ bool succeeded = false;
+ std::exception_ptr exception;
+ };
+ CleanupResult cleanup_result TSA_GUARDED_BY(mutex);
+
+ std::mutex mutex;
+};
+
+}
diff --git a/src/Backups/BackupCoordinationLocal.cpp b/src/Backups/BackupCoordinationLocal.cpp
index efdc18cc29c..8bd6b4d327d 100644
--- a/src/Backups/BackupCoordinationLocal.cpp
+++ b/src/Backups/BackupCoordinationLocal.cpp
@@ -1,5 +1,7 @@
#include
+
#include
+#include
#include
#include
#include
@@ -8,27 +10,20 @@
namespace DB
{
-BackupCoordinationLocal::BackupCoordinationLocal(bool plain_backup_)
- : log(getLogger("BackupCoordinationLocal")), file_infos(plain_backup_)
+BackupCoordinationLocal::BackupCoordinationLocal(
+ const UUID & backup_uuid_,
+ bool is_plain_backup_,
+ bool allow_concurrent_backup_,
+ BackupConcurrencyCounters & concurrency_counters_)
+ : log(getLogger("BackupCoordinationLocal"))
+ , concurrency_check(backup_uuid_, /* is_restore = */ false, /* on_cluster = */ false, allow_concurrent_backup_, concurrency_counters_)
+ , file_infos(is_plain_backup_)
{
}
BackupCoordinationLocal::~BackupCoordinationLocal() = default;
-void BackupCoordinationLocal::setStage(const String &, const String &)
-{
-}
-
-void BackupCoordinationLocal::setError(const Exception &)
-{
-}
-
-Strings BackupCoordinationLocal::waitForStage(const String &)
-{
- return {};
-}
-
-Strings BackupCoordinationLocal::waitForStage(const String &, std::chrono::milliseconds)
+ZooKeeperRetriesInfo BackupCoordinationLocal::getOnClusterInitializationKeeperRetriesInfo() const
{
return {};
}
@@ -135,15 +130,4 @@ bool BackupCoordinationLocal::startWritingFile(size_t data_file_index)
return writing_files.emplace(data_file_index).second;
}
-
-bool BackupCoordinationLocal::hasConcurrentBackups(const std::atomic & num_active_backups) const
-{
- if (num_active_backups > 1)
- {
- LOG_WARNING(log, "Found concurrent backups: num_active_backups={}", num_active_backups);
- return true;
- }
- return false;
-}
-
}
diff --git a/src/Backups/BackupCoordinationLocal.h b/src/Backups/BackupCoordinationLocal.h
index a7f15c79649..09991c0d301 100644
--- a/src/Backups/BackupCoordinationLocal.h
+++ b/src/Backups/BackupCoordinationLocal.h
@@ -1,6 +1,7 @@
#pragma once
#include
+#include
#include
#include
#include
@@ -21,13 +22,21 @@ namespace DB
class BackupCoordinationLocal : public IBackupCoordination
{
public:
- explicit BackupCoordinationLocal(bool plain_backup_);
+ explicit BackupCoordinationLocal(
+ const UUID & backup_uuid_,
+ bool is_plain_backup_,
+ bool allow_concurrent_backup_,
+ BackupConcurrencyCounters & concurrency_counters_);
+
~BackupCoordinationLocal() override;
- void setStage(const String & new_stage, const String & message) override;
- void setError(const Exception & exception) override;
- Strings waitForStage(const String & stage_to_wait) override;
- Strings waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) override;
+ Strings setStage(const String &, const String &, bool) override { return {}; }
+ void setBackupQueryWasSentToOtherHosts() override {}
+ bool trySetError(std::exception_ptr) override { return true; }
+ void finish() override {}
+ bool tryFinishAfterError() noexcept override { return true; }
+ void waitForOtherHostsToFinish() override {}
+ bool tryWaitForOtherHostsToFinishAfterError() noexcept override { return true; }
void addReplicatedPartNames(const String & table_zk_path, const String & table_name_for_logs, const String & replica_name,
const std::vector & part_names_and_checksums) override;
@@ -54,17 +63,18 @@ public:
BackupFileInfos getFileInfosForAllHosts() const override;
bool startWritingFile(size_t data_file_index) override;
- bool hasConcurrentBackups(const std::atomic & num_active_backups) const override;
+ ZooKeeperRetriesInfo getOnClusterInitializationKeeperRetriesInfo() const override;
private:
LoggerPtr const log;
+ BackupConcurrencyCheck concurrency_check;
- BackupCoordinationReplicatedTables TSA_GUARDED_BY(replicated_tables_mutex) replicated_tables;
- BackupCoordinationReplicatedAccess TSA_GUARDED_BY(replicated_access_mutex) replicated_access;
- BackupCoordinationReplicatedSQLObjects TSA_GUARDED_BY(replicated_sql_objects_mutex) replicated_sql_objects;
- BackupCoordinationFileInfos TSA_GUARDED_BY(file_infos_mutex) file_infos;
+ BackupCoordinationReplicatedTables replicated_tables TSA_GUARDED_BY(replicated_tables_mutex);
+ BackupCoordinationReplicatedAccess replicated_access TSA_GUARDED_BY(replicated_access_mutex);
+ BackupCoordinationReplicatedSQLObjects replicated_sql_objects TSA_GUARDED_BY(replicated_sql_objects_mutex);
+ BackupCoordinationFileInfos file_infos TSA_GUARDED_BY(file_infos_mutex);
BackupCoordinationKeeperMapTables keeper_map_tables TSA_GUARDED_BY(keeper_map_tables_mutex);
- std::unordered_set TSA_GUARDED_BY(writing_files_mutex) writing_files;
+ std::unordered_set writing_files TSA_GUARDED_BY(writing_files_mutex);
mutable std::mutex replicated_tables_mutex;
mutable std::mutex replicated_access_mutex;
diff --git a/src/Backups/BackupCoordinationRemote.cpp b/src/Backups/BackupCoordinationOnCluster.cpp
similarity index 73%
rename from src/Backups/BackupCoordinationRemote.cpp
rename to src/Backups/BackupCoordinationOnCluster.cpp
index a60ac0c636f..dc34939f805 100644
--- a/src/Backups/BackupCoordinationRemote.cpp
+++ b/src/Backups/BackupCoordinationOnCluster.cpp
@@ -1,7 +1,4 @@
-#include
-
-#include
-#include
+#include
#include
#include
@@ -26,8 +23,6 @@ namespace ErrorCodes
extern const int LOGICAL_ERROR;
}
-namespace Stage = BackupCoordinationStage;
-
namespace
{
using PartNameAndChecksum = IBackupCoordination::PartNameAndChecksum;
@@ -149,144 +144,152 @@ namespace
};
}
-size_t BackupCoordinationRemote::findCurrentHostIndex(const Strings & all_hosts, const String & current_host)
+Strings BackupCoordinationOnCluster::excludeInitiator(const Strings & all_hosts)
+{
+ Strings all_hosts_without_initiator = all_hosts;
+ bool has_initiator = (std::erase(all_hosts_without_initiator, kInitiator) > 0);
+ chassert(has_initiator);
+ return all_hosts_without_initiator;
+}
+
+size_t BackupCoordinationOnCluster::findCurrentHostIndex(const String & current_host, const Strings & all_hosts)
{
auto it = std::find(all_hosts.begin(), all_hosts.end(), current_host);
if (it == all_hosts.end())
- return 0;
+ return all_hosts.size();
return it - all_hosts.begin();
}
-BackupCoordinationRemote::BackupCoordinationRemote(
- zkutil::GetZooKeeper get_zookeeper_,
+
+BackupCoordinationOnCluster::BackupCoordinationOnCluster(
+ const UUID & backup_uuid_,
+ bool is_plain_backup_,
const String & root_zookeeper_path_,
+ zkutil::GetZooKeeper get_zookeeper_,
const BackupKeeperSettings & keeper_settings_,
- const String & backup_uuid_,
- const Strings & all_hosts_,
const String & current_host_,
- bool plain_backup_,
- bool is_internal_,
+ const Strings & all_hosts_,
+ bool allow_concurrent_backup_,
+ BackupConcurrencyCounters & concurrency_counters_,
+ ThreadPoolCallbackRunnerUnsafe<void> schedule_,
QueryStatusPtr process_list_element_)
: root_zookeeper_path(root_zookeeper_path_)
- , zookeeper_path(root_zookeeper_path_ + "/backup-" + backup_uuid_)
+ , zookeeper_path(root_zookeeper_path_ + "/backup-" + toString(backup_uuid_))
, keeper_settings(keeper_settings_)
, backup_uuid(backup_uuid_)
, all_hosts(all_hosts_)
+ , all_hosts_without_initiator(excludeInitiator(all_hosts))
, current_host(current_host_)
- , current_host_index(findCurrentHostIndex(all_hosts, current_host))
- , plain_backup(plain_backup_)
- , is_internal(is_internal_)
- , log(getLogger("BackupCoordinationRemote"))
- , with_retries(
- log,
- get_zookeeper_,
- keeper_settings,
- process_list_element_,
- [my_zookeeper_path = zookeeper_path, my_current_host = current_host, my_is_internal = is_internal]
- (WithRetries::FaultyKeeper & zk)
- {
- /// Recreate this ephemeral node to signal that we are alive.
- if (my_is_internal)
- {
- String alive_node_path = my_zookeeper_path + "/stage/alive|" + my_current_host;
-
- /// Delete the ephemeral node from the previous connection so we don't have to wait for keeper to do it automatically.
- zk->tryRemove(alive_node_path);
-
- zk->createAncestors(alive_node_path);
- zk->create(alive_node_path, "", zkutil::CreateMode::Ephemeral);
- }
- })
+ , current_host_index(findCurrentHostIndex(current_host, all_hosts))
+ , plain_backup(is_plain_backup_)
+ , log(getLogger("BackupCoordinationOnCluster"))
+ , with_retries(log, get_zookeeper_, keeper_settings, process_list_element_, [root_zookeeper_path_](Coordination::ZooKeeperWithFaultInjection::Ptr zk) { zk->sync(root_zookeeper_path_); })
+ , concurrency_check(backup_uuid_, /* is_restore = */ false, /* on_cluster = */ true, allow_concurrent_backup_, concurrency_counters_)
+ , stage_sync(/* is_restore = */ false, fs::path{zookeeper_path} / "stage", current_host, all_hosts, allow_concurrent_backup_, with_retries, schedule_, process_list_element_, log)
+ , cleaner(zookeeper_path, with_retries, log)
{
createRootNodes();
-
- stage_sync.emplace(
- zookeeper_path,
- with_retries,
- log);
}
-BackupCoordinationRemote::~BackupCoordinationRemote()
+BackupCoordinationOnCluster::~BackupCoordinationOnCluster()
{
- try
- {
- if (!is_internal)
- removeAllNodes();
- }
- catch (...)
- {
- tryLogCurrentException(__PRETTY_FUNCTION__);
- }
+ tryFinishImpl();
}
-void BackupCoordinationRemote::createRootNodes()
+void BackupCoordinationOnCluster::createRootNodes()
{
- auto holder = with_retries.createRetriesControlHolder("createRootNodes");
+ auto holder = with_retries.createRetriesControlHolder("createRootNodes", WithRetries::kInitialization);
holder.retries_ctl.retryLoop(
[&, &zk = holder.faulty_zookeeper]()
{
with_retries.renewZooKeeper(zk);
zk->createAncestors(zookeeper_path);
-
- Coordination::Requests ops;
- Coordination::Responses responses;
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path, "", zkutil::CreateMode::Persistent));
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_part_names", "", zkutil::CreateMode::Persistent));
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_mutations", "", zkutil::CreateMode::Persistent));
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_data_paths", "", zkutil::CreateMode::Persistent));
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_access", "", zkutil::CreateMode::Persistent));
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/repl_sql_objects", "", zkutil::CreateMode::Persistent));
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/keeper_map_tables", "", zkutil::CreateMode::Persistent));
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/file_infos", "", zkutil::CreateMode::Persistent));
- ops.emplace_back(zkutil::makeCreateRequest(zookeeper_path + "/writing_files", "", zkutil::CreateMode::Persistent));
- zk->tryMulti(ops, responses);
+ zk->createIfNotExists(zookeeper_path, "");
+ zk->createIfNotExists(zookeeper_path + "/repl_part_names", "");
+ zk->createIfNotExists(zookeeper_path + "/repl_mutations", "");
+ zk->createIfNotExists(zookeeper_path + "/repl_data_paths", "");
+ zk->createIfNotExists(zookeeper_path + "/repl_access", "");
+ zk->createIfNotExists(zookeeper_path + "/repl_sql_objects", "");
+ zk->createIfNotExists(zookeeper_path + "/keeper_map_tables", "");
+ zk->createIfNotExists(zookeeper_path + "/file_infos", "");
+ zk->createIfNotExists(zookeeper_path + "/writing_files", "");
});
}
-void BackupCoordinationRemote::removeAllNodes()
+Strings BackupCoordinationOnCluster::setStage(const String & new_stage, const String & message, bool sync)
{
- auto holder = with_retries.createRetriesControlHolder("removeAllNodes");
- holder.retries_ctl.retryLoop(
- [&, &zk = holder.faulty_zookeeper]()
+ stage_sync.setStage(new_stage, message);
+
+ if (!sync)
+ return {};
+
+ return stage_sync.waitForHostsToReachStage(new_stage, all_hosts_without_initiator);
+}
+
+void BackupCoordinationOnCluster::setBackupQueryWasSentToOtherHosts()
+{
+ backup_query_was_sent_to_other_hosts = true;
+}
+
+bool BackupCoordinationOnCluster::trySetError(std::exception_ptr exception)
+{
+ return stage_sync.trySetError(exception);
+}
+
+void BackupCoordinationOnCluster::finish()
+{
+ bool other_hosts_also_finished = false;
+ stage_sync.finish(other_hosts_also_finished);
+
+ if ((current_host == kInitiator) && (other_hosts_also_finished || !backup_query_was_sent_to_other_hosts))
+ cleaner.cleanup();
+}
+
+bool BackupCoordinationOnCluster::tryFinishAfterError() noexcept
+{
+ return tryFinishImpl();
+}
+
+bool BackupCoordinationOnCluster::tryFinishImpl() noexcept
+{
+ bool other_hosts_also_finished = false;
+ if (!stage_sync.tryFinishAfterError(other_hosts_also_finished))
+ return false;
+
+ if ((current_host == kInitiator) && (other_hosts_also_finished || !backup_query_was_sent_to_other_hosts))
{
- /// Usually this function is called by the initiator when a backup is complete so we don't need the coordination anymore.
- ///
- /// However there can be a rare situation when this function is called after an error occurs on the initiator of a query
- /// while some hosts are still making the backup. Removing all the nodes will remove the parent node of the backup coordination
- /// at `zookeeper_path` which might cause such hosts to stop with exception "ZNONODE". Or such hosts might still do some useless part
- /// of their backup work before that. Anyway in this case backup won't be finalized (because only an initiator can do that).
- with_retries.renewZooKeeper(zk);
- zk->removeRecursive(zookeeper_path);
- });
+ if (!cleaner.tryCleanupAfterError())
+ return false;
+ }
+
+ return true;
}
-
-void BackupCoordinationRemote::setStage(const String & new_stage, const String & message)
+void BackupCoordinationOnCluster::waitForOtherHostsToFinish()
{
- if (is_internal)
- stage_sync->set(current_host, new_stage, message);
- else
- stage_sync->set(current_host, new_stage, /* message */ "", /* all_hosts */ true);
+ if ((current_host != kInitiator) || !backup_query_was_sent_to_other_hosts)
+ return;
+ stage_sync.waitForOtherHostsToFinish();
}
-void BackupCoordinationRemote::setError(const Exception & exception)
+bool BackupCoordinationOnCluster::tryWaitForOtherHostsToFinishAfterError() noexcept
{
- stage_sync->setError(current_host, exception);
+ if (current_host != kInitiator)
+ return false;
+ if (!backup_query_was_sent_to_other_hosts)
+ return true;
+ return stage_sync.tryWaitForOtherHostsToFinishAfterError();
}
-Strings BackupCoordinationRemote::waitForStage(const String & stage_to_wait)
+ZooKeeperRetriesInfo BackupCoordinationOnCluster::getOnClusterInitializationKeeperRetriesInfo() const
{
- return stage_sync->wait(all_hosts, stage_to_wait);
+ return ZooKeeperRetriesInfo{keeper_settings.max_retries_while_initializing,
+ static_cast<UInt64>(keeper_settings.retry_initial_backoff_ms.count()),
+ static_cast<UInt64>(keeper_settings.retry_max_backoff_ms.count())};
}
-Strings BackupCoordinationRemote::waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout)
-{
- return stage_sync->waitFor(all_hosts, stage_to_wait, timeout);
-}
-
-
-void BackupCoordinationRemote::serializeToMultipleZooKeeperNodes(const String & path, const String & value, const String & logging_name)
+void BackupCoordinationOnCluster::serializeToMultipleZooKeeperNodes(const String & path, const String & value, const String & logging_name)
{
{
auto holder = with_retries.createRetriesControlHolder(logging_name + "::create");
@@ -301,7 +304,7 @@ void BackupCoordinationRemote::serializeToMultipleZooKeeperNodes(const String &
if (value.empty())
return;
- size_t max_part_size = keeper_settings.keeper_value_max_size;
+ size_t max_part_size = keeper_settings.value_max_size;
if (!max_part_size)
max_part_size = value.size();
@@ -324,7 +327,7 @@ void BackupCoordinationRemote::serializeToMultipleZooKeeperNodes(const String &
}
}
-String BackupCoordinationRemote::deserializeFromMultipleZooKeeperNodes(const String & path, const String & logging_name) const
+String BackupCoordinationOnCluster::deserializeFromMultipleZooKeeperNodes(const String & path, const String & logging_name) const
{
Strings part_names;
@@ -357,7 +360,7 @@ String BackupCoordinationRemote::deserializeFromMultipleZooKeeperNodes(const Str
}
-void BackupCoordinationRemote::addReplicatedPartNames(
+void BackupCoordinationOnCluster::addReplicatedPartNames(
const String & table_zk_path,
const String & table_name_for_logs,
const String & replica_name,
@@ -381,14 +384,14 @@ void BackupCoordinationRemote::addReplicatedPartNames(
});
}
-Strings BackupCoordinationRemote::getReplicatedPartNames(const String & table_zk_path, const String & replica_name) const
+Strings BackupCoordinationOnCluster::getReplicatedPartNames(const String & table_zk_path, const String & replica_name) const
{
std::lock_guard lock{replicated_tables_mutex};
prepareReplicatedTables();
return replicated_tables->getPartNames(table_zk_path, replica_name);
}
-void BackupCoordinationRemote::addReplicatedMutations(
+void BackupCoordinationOnCluster::addReplicatedMutations(
const String & table_zk_path,
const String & table_name_for_logs,
const String & replica_name,
@@ -412,7 +415,7 @@ void BackupCoordinationRemote::addReplicatedMutations(
});
}
-std::vector<IBackupCoordination::MutationInfo> BackupCoordinationRemote::getReplicatedMutations(const String & table_zk_path, const String & replica_name) const
+std::vector<IBackupCoordination::MutationInfo> BackupCoordinationOnCluster::getReplicatedMutations(const String & table_zk_path, const String & replica_name) const
{
std::lock_guard lock{replicated_tables_mutex};
prepareReplicatedTables();
@@ -420,7 +423,7 @@ std::vector BackupCoordinationRemote::getRepl
}
-void BackupCoordinationRemote::addReplicatedDataPath(
+void BackupCoordinationOnCluster::addReplicatedDataPath(
const String & table_zk_path, const String & data_path)
{
{
@@ -441,7 +444,7 @@ void BackupCoordinationRemote::addReplicatedDataPath(
});
}
-Strings BackupCoordinationRemote::getReplicatedDataPaths(const String & table_zk_path) const
+Strings BackupCoordinationOnCluster::getReplicatedDataPaths(const String & table_zk_path) const
{
std::lock_guard lock{replicated_tables_mutex};
prepareReplicatedTables();
@@ -449,7 +452,7 @@ Strings BackupCoordinationRemote::getReplicatedDataPaths(const String & table_zk
}
-void BackupCoordinationRemote::prepareReplicatedTables() const
+void BackupCoordinationOnCluster::prepareReplicatedTables() const
{
if (replicated_tables)
return;
@@ -536,7 +539,7 @@ void BackupCoordinationRemote::prepareReplicatedTables() const
replicated_tables->addDataPath(std::move(data_paths));
}
-void BackupCoordinationRemote::addReplicatedAccessFilePath(const String & access_zk_path, AccessEntityType access_entity_type, const String & file_path)
+void BackupCoordinationOnCluster::addReplicatedAccessFilePath(const String & access_zk_path, AccessEntityType access_entity_type, const String & file_path)
{
{
std::lock_guard lock{replicated_access_mutex};
@@ -558,14 +561,14 @@ void BackupCoordinationRemote::addReplicatedAccessFilePath(const String & access
});
}
-Strings BackupCoordinationRemote::getReplicatedAccessFilePaths(const String & access_zk_path, AccessEntityType access_entity_type) const
+Strings BackupCoordinationOnCluster::getReplicatedAccessFilePaths(const String & access_zk_path, AccessEntityType access_entity_type) const
{
std::lock_guard lock{replicated_access_mutex};
prepareReplicatedAccess();
return replicated_access->getFilePaths(access_zk_path, access_entity_type, current_host);
}
-void BackupCoordinationRemote::prepareReplicatedAccess() const
+void BackupCoordinationOnCluster::prepareReplicatedAccess() const
{
if (replicated_access)
return;
@@ -601,7 +604,7 @@ void BackupCoordinationRemote::prepareReplicatedAccess() const
replicated_access->addFilePath(std::move(file_path));
}
-void BackupCoordinationRemote::addReplicatedSQLObjectsDir(const String & loader_zk_path, UserDefinedSQLObjectType object_type, const String & dir_path)
+void BackupCoordinationOnCluster::addReplicatedSQLObjectsDir(const String & loader_zk_path, UserDefinedSQLObjectType object_type, const String & dir_path)
{
{
std::lock_guard lock{replicated_sql_objects_mutex};
@@ -631,14 +634,14 @@ void BackupCoordinationRemote::addReplicatedSQLObjectsDir(const String & loader_
});
}
-Strings BackupCoordinationRemote::getReplicatedSQLObjectsDirs(const String & loader_zk_path, UserDefinedSQLObjectType object_type) const
+Strings BackupCoordinationOnCluster::getReplicatedSQLObjectsDirs(const String & loader_zk_path, UserDefinedSQLObjectType object_type) const
{
std::lock_guard lock{replicated_sql_objects_mutex};
prepareReplicatedSQLObjects();
return replicated_sql_objects->getDirectories(loader_zk_path, object_type, current_host);
}
-void BackupCoordinationRemote::prepareReplicatedSQLObjects() const
+void BackupCoordinationOnCluster::prepareReplicatedSQLObjects() const
{
if (replicated_sql_objects)
return;
@@ -674,7 +677,7 @@ void BackupCoordinationRemote::prepareReplicatedSQLObjects() const
replicated_sql_objects->addDirectory(std::move(directory));
}
-void BackupCoordinationRemote::addKeeperMapTable(const String & table_zookeeper_root_path, const String & table_id, const String & data_path_in_backup)
+void BackupCoordinationOnCluster::addKeeperMapTable(const String & table_zookeeper_root_path, const String & table_id, const String & data_path_in_backup)
{
{
std::lock_guard lock{keeper_map_tables_mutex};
@@ -695,7 +698,7 @@ void BackupCoordinationRemote::addKeeperMapTable(const String & table_zookeeper_
});
}
-void BackupCoordinationRemote::prepareKeeperMapTables() const
+void BackupCoordinationOnCluster::prepareKeeperMapTables() const
{
if (keeper_map_tables)
return;
@@ -740,7 +743,7 @@ void BackupCoordinationRemote::prepareKeeperMapTables() const
}
-String BackupCoordinationRemote::getKeeperMapDataPath(const String & table_zookeeper_root_path) const
+String BackupCoordinationOnCluster::getKeeperMapDataPath(const String & table_zookeeper_root_path) const
{
std::lock_guard lock(keeper_map_tables_mutex);
prepareKeeperMapTables();
@@ -748,7 +751,7 @@ String BackupCoordinationRemote::getKeeperMapDataPath(const String & table_zooke
}
-void BackupCoordinationRemote::addFileInfos(BackupFileInfos && file_infos_)
+void BackupCoordinationOnCluster::addFileInfos(BackupFileInfos && file_infos_)
{
{
std::lock_guard lock{file_infos_mutex};
@@ -761,21 +764,21 @@ void BackupCoordinationRemote::addFileInfos(BackupFileInfos && file_infos_)
serializeToMultipleZooKeeperNodes(zookeeper_path + "/file_infos/" + current_host, file_infos_str, "addFileInfos");
}
-BackupFileInfos BackupCoordinationRemote::getFileInfos() const
+BackupFileInfos BackupCoordinationOnCluster::getFileInfos() const
{
std::lock_guard lock{file_infos_mutex};
prepareFileInfos();
return file_infos->getFileInfos(current_host);
}
-BackupFileInfos BackupCoordinationRemote::getFileInfosForAllHosts() const
+BackupFileInfos BackupCoordinationOnCluster::getFileInfosForAllHosts() const
{
std::lock_guard lock{file_infos_mutex};
prepareFileInfos();
return file_infos->getFileInfosForAllHosts();
}
-void BackupCoordinationRemote::prepareFileInfos() const
+void BackupCoordinationOnCluster::prepareFileInfos() const
{
if (file_infos)
return;
@@ -801,7 +804,7 @@ void BackupCoordinationRemote::prepareFileInfos() const
}
}
-bool BackupCoordinationRemote::startWritingFile(size_t data_file_index)
+bool BackupCoordinationOnCluster::startWritingFile(size_t data_file_index)
{
{
/// Check if this host is already writing this file.
@@ -842,66 +845,4 @@ bool BackupCoordinationRemote::startWritingFile(size_t data_file_index)
}
}
-bool BackupCoordinationRemote::hasConcurrentBackups(const std::atomic<size_t> &) const
-{
- /// If its internal concurrency will be checked for the base backup
- if (is_internal)
- return false;
-
- std::string backup_stage_path = zookeeper_path + "/stage";
-
- bool result = false;
-
- auto holder = with_retries.createRetriesControlHolder("getAllArchiveSuffixes");
- holder.retries_ctl.retryLoop(
- [&, &zk = holder.faulty_zookeeper]()
- {
- with_retries.renewZooKeeper(zk);
-
- if (!zk->exists(root_zookeeper_path))
- zk->createAncestors(root_zookeeper_path);
-
- for (size_t attempt = 0; attempt < MAX_ZOOKEEPER_ATTEMPTS; ++attempt)
- {
- Coordination::Stat stat;
- zk->get(root_zookeeper_path, &stat);
- Strings existing_backup_paths = zk->getChildren(root_zookeeper_path);
-
- for (const auto & existing_backup_path : existing_backup_paths)
- {
- if (startsWith(existing_backup_path, "restore-"))
- continue;
-
- String existing_backup_uuid = existing_backup_path;
- existing_backup_uuid.erase(0, String("backup-").size());
-
- if (existing_backup_uuid == toString(backup_uuid))
- continue;
-
- String status;
- if (zk->tryGet(root_zookeeper_path + "/" + existing_backup_path + "/stage", status))
- {
- /// Check if some other backup is in progress
- if (status == Stage::SCHEDULED_TO_START)
- {
- LOG_WARNING(log, "Found a concurrent backup: {}, current backup: {}", existing_backup_uuid, toString(backup_uuid));
- result = true;
- return;
- }
- }
- }
-
- zk->createIfNotExists(backup_stage_path, "");
- auto code = zk->trySet(backup_stage_path, Stage::SCHEDULED_TO_START, stat.version);
- if (code == Coordination::Error::ZOK)
- break;
- bool is_last_attempt = (attempt == MAX_ZOOKEEPER_ATTEMPTS - 1);
- if ((code != Coordination::Error::ZBADVERSION) || is_last_attempt)
- throw zkutil::KeeperException::fromPath(code, backup_stage_path);
- }
- });
-
- return result;
-}
-
}
diff --git a/src/Backups/BackupCoordinationRemote.h b/src/Backups/BackupCoordinationOnCluster.h
similarity index 67%
rename from src/Backups/BackupCoordinationRemote.h
rename to src/Backups/BackupCoordinationOnCluster.h
index 7a56b1a4eb8..7369c2cc746 100644
--- a/src/Backups/BackupCoordinationRemote.h
+++ b/src/Backups/BackupCoordinationOnCluster.h
@@ -1,6 +1,8 @@
#pragma once
#include
+#include
+#include
#include
#include
#include
@@ -13,32 +15,35 @@
namespace DB
{
-/// We try to store data to zookeeper several times due to possible version conflicts.
-constexpr size_t MAX_ZOOKEEPER_ATTEMPTS = 10;
-
/// Implementation of the IBackupCoordination interface performing coordination via ZooKeeper. It's necessary for "BACKUP ON CLUSTER".
-class BackupCoordinationRemote : public IBackupCoordination
+class BackupCoordinationOnCluster : public IBackupCoordination
{
public:
- using BackupKeeperSettings = WithRetries::KeeperSettings;
+ /// Empty string as the current host is used to mark the initiator of a BACKUP ON CLUSTER query.
+ static const constexpr std::string_view kInitiator;
- BackupCoordinationRemote(
- zkutil::GetZooKeeper get_zookeeper_,
+ BackupCoordinationOnCluster(
+ const UUID & backup_uuid_,
+ bool is_plain_backup_,
const String & root_zookeeper_path_,
+ zkutil::GetZooKeeper get_zookeeper_,
const BackupKeeperSettings & keeper_settings_,
- const String & backup_uuid_,
- const Strings & all_hosts_,
const String & current_host_,
- bool plain_backup_,
- bool is_internal_,
+ const Strings & all_hosts_,
+ bool allow_concurrent_backup_,
+ BackupConcurrencyCounters & concurrency_counters_,
+ ThreadPoolCallbackRunnerUnsafe<void> schedule_,
QueryStatusPtr process_list_element_);
- ~BackupCoordinationRemote() override;
+ ~BackupCoordinationOnCluster() override;
- void setStage(const String & new_stage, const String & message) override;
- void setError(const Exception & exception) override;
- Strings waitForStage(const String & stage_to_wait) override;
- Strings waitForStage(const String & stage_to_wait, std::chrono::milliseconds timeout) override;
+ Strings setStage(const String & new_stage, const String & message, bool sync) override;
+ void setBackupQueryWasSentToOtherHosts() override;
+ bool trySetError(std::exception_ptr exception) override;
+ void finish() override;
+ bool tryFinishAfterError() noexcept override;
+ void waitForOtherHostsToFinish() override;
+ bool tryWaitForOtherHostsToFinishAfterError() noexcept override;
void addReplicatedPartNames(
const String & table_zk_path,
@@ -73,13 +78,14 @@ public:
BackupFileInfos getFileInfosForAllHosts() const override;
bool startWritingFile(size_t data_file_index) override;
- bool hasConcurrentBackups(const std::atomic<size_t> & num_active_backups) const override;
+ ZooKeeperRetriesInfo getOnClusterInitializationKeeperRetriesInfo() const override;
- static size_t findCurrentHostIndex(const Strings & all_hosts, const String & current_host);
+ static Strings excludeInitiator(const Strings & all_hosts);
+ static size_t findCurrentHostIndex(const String & current_host, const Strings & all_hosts);
private:
void createRootNodes();
- void removeAllNodes();
+ bool tryFinishImpl() noexcept;
void serializeToMultipleZooKeeperNodes(const String & path, const String & value, const String & logging_name);
String deserializeFromMultipleZooKeeperNodes(const String & path, const String & logging_name) const;
@@ -96,26 +102,27 @@ private:
const String root_zookeeper_path;
const String zookeeper_path;
const BackupKeeperSettings keeper_settings;
- const String backup_uuid;
+ const UUID backup_uuid;
const Strings all_hosts;
+ const Strings all_hosts_without_initiator;
const String current_host;
const size_t current_host_index;
const bool plain_backup;
- const bool is_internal;
LoggerPtr const log;
- /// The order of these two fields matters, because stage_sync holds a reference to with_retries object
- mutable WithRetries with_retries;
- std::optional<BackupCoordinationStageSync> stage_sync;
+ const WithRetries with_retries;
+ BackupConcurrencyCheck concurrency_check;
+ BackupCoordinationStageSync stage_sync;
+ BackupCoordinationCleaner cleaner;
+ std::atomic backup_query_was_sent_to_other_hosts = false;
- mutable std::optional<BackupCoordinationReplicatedTables> TSA_GUARDED_BY(replicated_tables_mutex) replicated_tables;
- mutable std::optional<BackupCoordinationReplicatedAccess> TSA_GUARDED_BY(replicated_access_mutex) replicated_access;
- mutable std::optional<BackupCoordinationReplicatedSQLObjects> TSA_GUARDED_BY(replicated_sql_objects_mutex) replicated_sql_objects;
- mutable std::optional<BackupCoordinationFileInfos> TSA_GUARDED_BY(file_infos_mutex) file_infos;
+ mutable std::optional<BackupCoordinationReplicatedTables> replicated_tables TSA_GUARDED_BY(replicated_tables_mutex);
+ mutable std::optional<BackupCoordinationReplicatedAccess> replicated_access TSA_GUARDED_BY(replicated_access_mutex);
+ mutable std::optional<BackupCoordinationReplicatedSQLObjects> replicated_sql_objects TSA_GUARDED_BY(replicated_sql_objects_mutex);
+ mutable std::optional<BackupCoordinationFileInfos> file_infos TSA_GUARDED_BY(file_infos_mutex);
mutable std::optional<BackupCoordinationKeeperMapTables> keeper_map_tables TSA_GUARDED_BY(keeper_map_tables_mutex);
- std::unordered_set<size_t> TSA_GUARDED_BY(writing_files_mutex) writing_files;
+ std::unordered_set<size_t> writing_files TSA_GUARDED_BY(writing_files_mutex);
- mutable std::mutex zookeeper_mutex;
mutable std::mutex replicated_tables_mutex;
mutable std::mutex replicated_access_mutex;
mutable std::mutex replicated_sql_objects_mutex;
diff --git a/src/Backups/BackupCoordinationStage.h b/src/Backups/BackupCoordinationStage.h
index 9abdc019784..2cd1efb5404 100644
--- a/src/Backups/BackupCoordinationStage.h
+++ b/src/Backups/BackupCoordinationStage.h
@@ -8,10 +8,6 @@ namespace DB
namespace BackupCoordinationStage
{
- /// This stage is set after concurrency check so ensure we dont start other backup/restores
- /// when concurrent backup/restores are not allowed
- constexpr const char * SCHEDULED_TO_START = "scheduled to start";
-
/// Finding all tables and databases which we're going to put to the backup and collecting their metadata.
constexpr const char * GATHERING_METADATA = "gathering metadata";
@@ -46,10 +42,6 @@ namespace BackupCoordinationStage
/// Coordination stage meaning that a host finished its work.
constexpr const char * COMPLETED = "completed";
-
- /// Coordination stage meaning that backup/restore has failed due to an error
- /// Check '/error' for the error message
- constexpr const char * ERROR = "error";
}
}
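The remaining stage constants drive the synchronization implemented below: each host publishes the stage it has reached, and the initiator waits until every host reports the target stage. A toy model of that check (the names `StageMap` and `allHostsReachedStage` are illustrative, not from the source; the real `BackupCoordinationStageSync` additionally tracks 'alive' nodes, errors, and timeouts via ZooKeeper):

```cpp
#include <cassert>
#include <map>
#include <string>

// Stage constants in the same style as BackupCoordinationStage.
constexpr const char * GATHERING_METADATA = "gathering metadata";
constexpr const char * COMPLETED = "completed";

// host name -> the stage that host last published.
using StageMap = std::map<std::string, std::string>;

// True once every host has published the stage we are waiting for.
bool allHostsReachedStage(const StageMap & host_stages, const std::string & stage_to_wait)
{
    for (const auto & entry : host_stages)
        if (entry.second != stage_to_wait)
            return false;
    return true;
}
```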
diff --git a/src/Backups/BackupCoordinationStageSync.cpp b/src/Backups/BackupCoordinationStageSync.cpp
index 17ef163ce35..9a05f9490c2 100644
--- a/src/Backups/BackupCoordinationStageSync.cpp
+++ b/src/Backups/BackupCoordinationStageSync.cpp
@@ -9,267 +9,1117 @@
#include
#include
#include
+#include
+#include
+#include
+
namespace DB
{
-namespace Stage = BackupCoordinationStage;
-
namespace ErrorCodes
{
extern const int FAILED_TO_SYNC_BACKUP_OR_RESTORE;
+ extern const int LOGICAL_ERROR;
}
+namespace
+{
+ /// The coordination version is stored in the 'start' node
+ /// by each host when it starts working on this backup or restore.
+ enum Version
+ {
+ kInitialVersion = 1,
+
+ /// This old version didn't create the 'finish' node; it used the stage "completed" to tell other hosts that the work was done.
+ /// If an error happened, this old version didn't change any nodes to tell other hosts that the error handling was done.
+ /// So while using this old version hosts couldn't know when other hosts were done with the error handling,
+ /// which sometimes produced spurious errors in the logs.
+ /// Also this old version didn't create the 'start' node for the initiator.
+ kVersionWithoutFinishNode = 1,
+
+ /// Now we create the 'finish' node both if the work is done or if the error handling is done.
+
+ kCurrentVersion = 2,
+ };
+
+ /// Empty string as the current host is used to mark the initiator of a BACKUP ON CLUSTER or RESTORE ON CLUSTER query.
+ const constexpr std::string_view kInitiator;
+}
+
+bool BackupCoordinationStageSync::HostInfo::operator ==(const HostInfo & other) const
+{
+ /// We don't compare `last_connection_time` here.
+ return (host == other.host) && (started == other.started) && (connected == other.connected) && (finished == other.finished)
+ && (stages == other.stages) && (!!exception == !!other.exception);
+}
+
+bool BackupCoordinationStageSync::HostInfo::operator !=(const HostInfo & other) const
+{
+ return !(*this == other);
+}
+
+bool BackupCoordinationStageSync::State::operator ==(const State & other) const = default;
+bool BackupCoordinationStageSync::State::operator !=(const State & other) const = default;
+
BackupCoordinationStageSync::BackupCoordinationStageSync(
- const String & root_zookeeper_path_,
- WithRetries & with_retries_,
- LoggerPtr log_)
- : zookeeper_path(root_zookeeper_path_ + "/stage")
+ bool is_restore_,
+ const String & zookeeper_path_,
+ const String & current_host_,
+ const Strings & all_hosts_,
+ bool allow_concurrency_,
+ const WithRetries & with_retries_,
+ ThreadPoolCallbackRunnerUnsafe<void> schedule_,
+ QueryStatusPtr process_list_element_,
+ LoggerPtr log_)
+ : is_restore(is_restore_)
+ , operation_name(is_restore ? "restore" : "backup")
+ , current_host(current_host_)
+ , current_host_desc(getHostDesc(current_host))
+ , all_hosts(all_hosts_)
+ , allow_concurrency(allow_concurrency_)
, with_retries(with_retries_)
+ , schedule(schedule_)
+ , process_list_element(process_list_element_)
, log(log_)
+ , failure_after_host_disconnected_for_seconds(with_retries.getKeeperSettings().failure_after_host_disconnected_for_seconds)
+ , finish_timeout_after_error(with_retries.getKeeperSettings().finish_timeout_after_error)
+ , sync_period_ms(with_retries.getKeeperSettings().sync_period_ms)
+ , max_attempts_after_bad_version(with_retries.getKeeperSettings().max_attempts_after_bad_version)
+ , zookeeper_path(zookeeper_path_)
+ , root_zookeeper_path(zookeeper_path.parent_path().parent_path())
+ , operation_node_path(zookeeper_path.parent_path())
+ , operation_node_name(zookeeper_path.parent_path().filename())
+ , stage_node_path(zookeeper_path)
+ , start_node_path(zookeeper_path / ("started|" + current_host))
+ , finish_node_path(zookeeper_path / ("finished|" + current_host))
+ , num_hosts_node_path(zookeeper_path / "num_hosts")
+ , alive_node_path(zookeeper_path / ("alive|" + current_host))
+ , alive_tracker_node_path(fs::path{root_zookeeper_path} / "alive_tracker")
+ , error_node_path(zookeeper_path / "error")
+ , zk_nodes_changed(std::make_shared<Poco::Event>())
{
+ if ((zookeeper_path.filename() != "stage") || !operation_node_name.starts_with(is_restore ? "restore-" : "backup-")
+ || (root_zookeeper_path == operation_node_path))
+ {
+ throw Exception(ErrorCodes::LOGICAL_ERROR, "Unexpected path in ZooKeeper specified: {}", zookeeper_path);
+ }
+
+ initializeState();
createRootNodes();
+
+ try
+ {
+ createStartAndAliveNodes();
+ startWatchingThread();
+ }
+ catch (...)
+ {
+ trySetError(std::current_exception());
+ tryFinishImpl();
+ throw;
+ }
}
+
+BackupCoordinationStageSync::~BackupCoordinationStageSync()
+{
+ tryFinishImpl();
+}
+
+
+void BackupCoordinationStageSync::initializeState()
+{
+ std::lock_guard lock{mutex};
+ auto now = std::chrono::system_clock::now();
+ auto monotonic_now = std::chrono::steady_clock::now();
+
+ for (const String & host : all_hosts)
+ state.hosts.emplace(host, HostInfo{.host = host, .last_connection_time = now, .last_connection_time_monotonic = monotonic_now});
+}
+
+
+String BackupCoordinationStageSync::getHostDesc(const String & host)
+{
+ String res;
+ if (host.empty())
+ {
+ res = "the initiator";
+ }
+ else
+ {
+ try
+ {
+ res = "host ";
+ Poco::URI::decode(host, res); /// Append the decoded host name to `res`.
+ }
+ catch (const Poco::URISyntaxException &)
+ {
+ res = "host " + host;
+ }
+ }
+ return res;
+}
+
+
+String BackupCoordinationStageSync::getHostsDesc(const Strings & hosts)
+{
+ String res = "[";
+ for (const String & host : hosts)
+ {
+ if (res != "[")
+ res += ", ";
+ res += getHostDesc(host);
+ }
+ res += "]";
+ return res;
+}
+
+
void BackupCoordinationStageSync::createRootNodes()
{
- auto holder = with_retries.createRetriesControlHolder("createRootNodes");
+ auto holder = with_retries.createRetriesControlHolder("BackupStageSync::createRootNodes", WithRetries::kInitialization);
holder.retries_ctl.retryLoop(
[&, &zookeeper = holder.faulty_zookeeper]()
+ {
+ with_retries.renewZooKeeper(zookeeper);
+ zookeeper->createAncestors(root_zookeeper_path);
+ zookeeper->createIfNotExists(root_zookeeper_path, "");
+ });
+}
+
+
+void BackupCoordinationStageSync::createStartAndAliveNodes()
+{
+ auto holder = with_retries.createRetriesControlHolder("BackupStageSync::createStartAndAliveNodes", WithRetries::kInitialization);
+ holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]()
{
with_retries.renewZooKeeper(zookeeper);
- zookeeper->createAncestors(zookeeper_path);
- zookeeper->createIfNotExists(zookeeper_path, "");
+ createStartAndAliveNodes(zookeeper);
});
}
-void BackupCoordinationStageSync::set(const String & current_host, const String & new_stage, const String & message, const bool & all_hosts)
-{
- auto holder = with_retries.createRetriesControlHolder("set");
- holder.retries_ctl.retryLoop(
- [&, &zookeeper = holder.faulty_zookeeper]()
- {
- with_retries.renewZooKeeper(zookeeper);
- if (all_hosts)
+void BackupCoordinationStageSync::createStartAndAliveNodes(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper)
+{
+ /// The "num_hosts" node keeps the number of hosts which started (created the "started" node)
+ /// but not yet finished (not created the "finished" node).
+ /// The number of alive hosts can be less than that.
+
+ /// The "alive_tracker" node always keeps an empty string, we track its version only.
+ /// The version of the "alive_tracker" node is incremented each time an "alive" node is created,
+ /// so we use it to detect concurrent backups/restores.
+ zookeeper->createIfNotExists(alive_tracker_node_path, "");
+
+ std::optional<size_t> num_hosts;
+ int num_hosts_version = -1;
+
+ bool check_concurrency = !allow_concurrency;
+ int alive_tracker_version = -1;
+
+ for (size_t attempt_no = 1; attempt_no <= max_attempts_after_bad_version; ++attempt_no)
+ {
+ if (!num_hosts)
{
- auto code = zookeeper->trySet(zookeeper_path, new_stage);
- if (code != Coordination::Error::ZOK)
- throw zkutil::KeeperException::fromPath(code, zookeeper_path);
+ String num_hosts_str;
+ Coordination::Stat stat;
+ if (zookeeper->tryGet(num_hosts_node_path, num_hosts_str, &stat))
+ {
+ num_hosts = parseFromString<size_t>(num_hosts_str);
+ num_hosts_version = stat.version;
+ }
+ }
+
+ String serialized_error;
+ if (zookeeper->tryGet(error_node_path, serialized_error))
+ {
+ auto [exception, host] = parseErrorNode(serialized_error);
+ if (exception)
+ std::rethrow_exception(exception);
+ }
+
+ if (check_concurrency)
+ {
+ Coordination::Stat stat;
+ zookeeper->exists(alive_tracker_node_path, &stat);
+ alive_tracker_version = stat.version;
+
+ checkConcurrency(zookeeper);
+ check_concurrency = false;
+ }
+
+ Coordination::Requests requests;
+ requests.reserve(6);
+
+ size_t operation_node_path_pos = static_cast<size_t>(-1);
+ if (!zookeeper->exists(operation_node_path))
+ {
+ operation_node_path_pos = requests.size();
+ requests.emplace_back(zkutil::makeCreateRequest(operation_node_path, "", zkutil::CreateMode::Persistent));
+ }
+
+ size_t stage_node_path_pos = static_cast<size_t>(-1);
+ if (!zookeeper->exists(stage_node_path))
+ {
+ stage_node_path_pos = requests.size();
+ requests.emplace_back(zkutil::makeCreateRequest(stage_node_path, "", zkutil::CreateMode::Persistent));
+ }
+
+ size_t num_hosts_node_path_pos = requests.size();
+ if (num_hosts)
+ requests.emplace_back(zkutil::makeSetRequest(num_hosts_node_path, toString(*num_hosts + 1), num_hosts_version));
+ else
+ requests.emplace_back(zkutil::makeCreateRequest(num_hosts_node_path, "1", zkutil::CreateMode::Persistent));
+
+ size_t alive_tracker_node_path_pos = requests.size();
+ requests.emplace_back(zkutil::makeSetRequest(alive_tracker_node_path, "", alive_tracker_version));
+
+ requests.emplace_back(zkutil::makeCreateRequest(start_node_path, std::to_string(kCurrentVersion), zkutil::CreateMode::Persistent));
+ requests.emplace_back(zkutil::makeCreateRequest(alive_node_path, "", zkutil::CreateMode::Ephemeral));
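+
+ /// Note: all the requests above are executed as a single ZooKeeper transaction ("multi"),
+ /// so either all of them succeed or none does. The version checks on the "num_hosts" and
+ /// "alive_tracker" nodes make this an optimistic-concurrency loop: if another host changed
+ /// those nodes between our reads and this write, the transaction fails and we retry.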
+
+ Coordination::Responses responses;
+ auto code = zookeeper->tryMulti(requests, responses);
+
+ if (code == Coordination::Error::ZOK)
+ {
+ LOG_INFO(log, "Created start node #{} in ZooKeeper for {} (coordination version: {})",
+ num_hosts.value_or(0) + 1, current_host_desc, kCurrentVersion);
+ return;
+ }
+
+ auto show_error_before_next_attempt = [&](const String & message)
+ {
+ bool will_try_again = (attempt_no < max_attempts_after_bad_version);
+ LOG_TRACE(log, "{} (attempt #{}){}", message, attempt_no, will_try_again ? ", will try again" : "");
+ };
+
+ if ((responses.size() > operation_node_path_pos) &&
+ (responses[operation_node_path_pos]->error == Coordination::Error::ZNODEEXISTS))
+ {
+ show_error_before_next_attempt(fmt::format("Node {} in ZooKeeper already exists", operation_node_path));
+ /// needs another attempt
+ }
+ else if ((responses.size() > stage_node_path_pos) &&
+ (responses[stage_node_path_pos]->error == Coordination::Error::ZNODEEXISTS))
+ {
+ show_error_before_next_attempt(fmt::format("Node {} in ZooKeeper already exists", stage_node_path));
+ /// needs another attempt
+ }
+ else if ((responses.size() > num_hosts_node_path_pos) && num_hosts &&
+ (responses[num_hosts_node_path_pos]->error == Coordination::Error::ZBADVERSION))
+ {
+ show_error_before_next_attempt("Other host changed the 'num_hosts' node in ZooKeeper");
+ num_hosts.reset(); /// needs to reread 'num_hosts' again
+ }
+ else if ((responses.size() > num_hosts_node_path_pos) && num_hosts &&
+ (responses[num_hosts_node_path_pos]->error == Coordination::Error::ZNONODE))
+ {
+ show_error_before_next_attempt("Other host removed the 'num_hosts' node in ZooKeeper");
+ num_hosts.reset(); /// needs to reread 'num_hosts' again
+ }
+ else if ((responses.size() > num_hosts_node_path_pos) && !num_hosts &&
+ (responses[num_hosts_node_path_pos]->error == Coordination::Error::ZNODEEXISTS))
+ {
+ show_error_before_next_attempt("Other host created the 'num_hosts' node in ZooKeeper");
+ /// needs another attempt
+ }
+ else if ((responses.size() > alive_tracker_node_path_pos) &&
+ (responses[alive_tracker_node_path_pos]->error == Coordination::Error::ZBADVERSION))
+ {
+ show_error_before_next_attempt("Concurrent backup or restore changed some 'alive' nodes in ZooKeeper");
+ check_concurrency = true; /// needs to recheck for concurrency again
}
else
{
- zookeeper->createIfNotExists(zookeeper_path + "/started|" + current_host, "");
- zookeeper->createIfNotExists(zookeeper_path + "/current|" + current_host + "|" + new_stage, message);
+ zkutil::KeeperMultiException::check(code, requests, responses);
}
+ }
+
+ throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
+ "Couldn't create the 'start' node in ZooKeeper for {} after {} attempts",
+ current_host_desc, max_attempts_after_bad_version);
+}
+
+
+void BackupCoordinationStageSync::checkConcurrency(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper)
+{
+ if (allow_concurrency)
+ return;
+
+ Strings found_operations;
+ auto code = zookeeper->tryGetChildren(root_zookeeper_path, found_operations);
+
+ if (!((code == Coordination::Error::ZOK) || (code == Coordination::Error::ZNONODE)))
+ throw zkutil::KeeperException::fromPath(code, root_zookeeper_path);
+
+ if (code == Coordination::Error::ZNONODE)
+ return;
+
+ for (const String & found_operation : found_operations)
+ {
+ if (found_operation.starts_with(is_restore ? "restore-" : "backup-") && (found_operation != operation_node_name))
+ {
+ Strings stages;
+ code = zookeeper->tryGetChildren(fs::path{root_zookeeper_path} / found_operation / "stage", stages);
+
+ if (!((code == Coordination::Error::ZOK) || (code == Coordination::Error::ZNONODE)))
+ throw zkutil::KeeperException::fromPath(code, fs::path{root_zookeeper_path} / found_operation / "stage");
+
+ if (code == Coordination::Error::ZOK)
+ {
+ for (const String & stage : stages)
+ {
+ if (stage.starts_with("alive"))
+ BackupConcurrencyCheck::throwConcurrentOperationNotAllowed(is_restore);
+ }
+ }
+ }
+ }
+}
+
+
+void BackupCoordinationStageSync::startWatchingThread()
+{
+ watching_thread_future = schedule([this]() { watchingThread(); }, Priority{});
+}
+
+
+void BackupCoordinationStageSync::stopWatchingThread()
+{
+ should_stop_watching_thread = true;
+
+ /// Wake up waiting threads.
+ if (zk_nodes_changed)
+ zk_nodes_changed->set();
+ state_changed.notify_all();
+
+ if (watching_thread_future.valid())
+ watching_thread_future.wait();
+}
+
+
+void BackupCoordinationStageSync::watchingThread()
+{
+ while (!should_stop_watching_thread)
+ {
+ try
+ {
+ /// Check if the current BACKUP or RESTORE command is already cancelled.
+ checkIfQueryCancelled();
+
+ /// Reset the `connected` flag for each host, we'll set them to true again after we find the 'alive' nodes.
+ resetConnectedFlag();
+
+ /// Recreate the 'alive' node if necessary and read a new state from ZooKeeper.
+ auto holder = with_retries.createRetriesControlHolder("BackupStageSync::watchingThread");
+ auto & zookeeper = holder.faulty_zookeeper;
+ with_retries.renewZooKeeper(zookeeper);
+
+ if (should_stop_watching_thread)
+ return;
+
+ /// Recreate the 'alive' node if it was removed.
+ createAliveNode(zookeeper);
+
+ /// Reads the current state from nodes in ZooKeeper.
+ readCurrentState(zookeeper);
+ }
+ catch (...)
+ {
+ tryLogCurrentException(log, "Caught exception while watching");
+ }
+
+ try
+ {
+ /// Cancel the query if there is an error on another host or if some host was disconnected too long.
+ cancelQueryIfError();
+ cancelQueryIfDisconnectedTooLong();
+ }
+ catch (...)
+ {
+ tryLogCurrentException(log, "Caught exception while checking if the query should be cancelled");
+ }
+
+ zk_nodes_changed->tryWait(sync_period_ms.count());
+ }
+}
+
+
+void BackupCoordinationStageSync::createAliveNode(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper)
+{
+ if (zookeeper->exists(alive_node_path))
+ return;
+
+ Coordination::Requests requests;
+ requests.emplace_back(zkutil::makeCreateRequest(alive_node_path, "", zkutil::CreateMode::Ephemeral));
+ requests.emplace_back(zkutil::makeSetRequest(alive_tracker_node_path, "", -1));
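+ /// Setting "alive_tracker" with expected version -1 (which matches any version) unconditionally
+ /// bumps its version, so concurrency checks on other hosts notice the recreated "alive" node.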
+ zookeeper->multi(requests);
+
+ LOG_INFO(log, "The alive node was recreated for {}", current_host_desc);
+}
+
+
+void BackupCoordinationStageSync::resetConnectedFlag()
+{
+ std::lock_guard lock{mutex};
+ for (auto & [_, host_info] : state.hosts)
+ host_info.connected = false;
+}
+
+
+void BackupCoordinationStageSync::readCurrentState(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper)
+{
+ zk_nodes_changed->reset();
+
+ /// Get zk nodes and subscribe on their changes.
+ Strings new_zk_nodes = zookeeper->getChildren(stage_node_path, nullptr, zk_nodes_changed);
+ std::sort(new_zk_nodes.begin(), new_zk_nodes.end()); /// Sorting is necessary because we compare the list of zk nodes with its previous versions.
+
+ State new_state;
+
+ {
+ std::lock_guard lock{mutex};
+
+ /// Log all changes in zookeeper nodes in the "stage" folder to make debugging easier.
+ Strings added_zk_nodes, removed_zk_nodes;
+ std::set_difference(new_zk_nodes.begin(), new_zk_nodes.end(), zk_nodes.begin(), zk_nodes.end(), back_inserter(added_zk_nodes));
+ std::set_difference(zk_nodes.begin(), zk_nodes.end(), new_zk_nodes.begin(), new_zk_nodes.end(), back_inserter(removed_zk_nodes));
+ if (!added_zk_nodes.empty())
+ LOG_TRACE(log, "Detected new zookeeper nodes appeared in the stage folder: {}", boost::algorithm::join(added_zk_nodes, ", "));
+ if (!removed_zk_nodes.empty())
+ LOG_TRACE(log, "Detected that some zookeeper nodes disappeared from the stage folder: {}", boost::algorithm::join(removed_zk_nodes, ", "));
+
+ zk_nodes = new_zk_nodes;
+ new_state = state;
+ }
+
+ auto get_host_info = [&](const String & host) -> HostInfo *
+ {
+ auto it = new_state.hosts.find(host);
+ if (it == new_state.hosts.end())
+ return nullptr;
+ return &it->second;
+ };
+
+ auto now = std::chrono::system_clock::now();
+ auto monotonic_now = std::chrono::steady_clock::now();
+
+ /// Read the current state from zookeeper nodes.
+ for (const auto & zk_node : new_zk_nodes)
+ {
+ if (zk_node == "error")
+ {
+ if (!new_state.host_with_error)
+ {
+ String serialized_error = zookeeper->get(error_node_path);
+ auto [exception, host] = parseErrorNode(serialized_error);
+ if (auto * host_info = get_host_info(host))
+ {
+ host_info->exception = exception;
+ new_state.host_with_error = host;
+ }
+ }
+ }
+ else if (zk_node.starts_with("started|"))
+ {
+ String host = zk_node.substr(strlen("started|"));
+ if (auto * host_info = get_host_info(host))
+ {
+ if (!host_info->started)
+ {
+ host_info->version = parseStartNode(zookeeper->get(fs::path{zookeeper_path} / zk_node), host);
+ host_info->started = true;
+ }
+ }
+ }
+ else if (zk_node.starts_with("finished|"))
+ {
+ String host = zk_node.substr(strlen("finished|"));
+ if (auto * host_info = get_host_info(host))
+ host_info->finished = true;
+ }
+ else if (zk_node.starts_with("alive|"))
+ {
+ String host = zk_node.substr(strlen("alive|"));
+ if (auto * host_info = get_host_info(host))
+ {
+ host_info->connected = true;
+ host_info->last_connection_time = now;
+ host_info->last_connection_time_monotonic = monotonic_now;
+ }
+ }
+ else if (zk_node.starts_with("current|"))
+ {
+ String host_and_stage = zk_node.substr(strlen("current|"));
+ size_t separator_pos = host_and_stage.find('|');
+ if (separator_pos != String::npos)
+ {
+ String host = host_and_stage.substr(0, separator_pos);
+ String stage = host_and_stage.substr(separator_pos + 1);
+ if (auto * host_info = get_host_info(host))
+ {
+ String result = zookeeper->get(fs::path{zookeeper_path} / zk_node);
+ host_info->stages[stage] = std::move(result);
+
+ /// Hosts with that old version didn't create the 'finish' node, so we consider a host
+ /// to have finished its work once it reaches the "completed" stage.
+ if ((host_info->version == kVersionWithoutFinishNode) && (stage == BackupCoordinationStage::COMPLETED))
+ host_info->finished = true;
+ }
+ }
+ }
+ }
+
+ /// Check if the state has been just changed, and if so then wake up waiting threads (see waitHostsReachStage()).
+ bool was_state_changed = false;
+
+ {
+ std::lock_guard lock{mutex};
+ was_state_changed = (new_state != state);
+ state = std::move(new_state);
+ }
+
+ if (was_state_changed)
+ state_changed.notify_all();
+}
+
+
+int BackupCoordinationStageSync::parseStartNode(const String & start_node_contents, const String & host) const
+{
+ int version;
+ if (start_node_contents.empty())
+ {
+ version = kInitialVersion;
+ }
+ else if (!tryParse(version, start_node_contents) || (version < kInitialVersion))
+ {
+ throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
+ "Coordination version {} used by {} is not supported", start_node_contents, getHostDesc(host));
+ }
+
+ if (version < kCurrentVersion)
+ LOG_WARNING(log, "Coordination version {} used by {} is outdated", version, getHostDesc(host));
+ return version;
+}
+
+
+std::pair<std::exception_ptr, String> BackupCoordinationStageSync::parseErrorNode(const String & error_node_contents)
+{
+ ReadBufferFromOwnString buf{error_node_contents};
+ String host;
+ readStringBinary(host, buf);
+ auto exception = std::make_exception_ptr(readException(buf, fmt::format("Got error from {}", getHostDesc(host))));
+ return {exception, host};
+}
+
+
+void BackupCoordinationStageSync::checkIfQueryCancelled()
+{
+ if (process_list_element->checkTimeLimitSoft())
+ return; /// Not cancelled.
+
+ std::lock_guard lock{mutex};
+ if (state.cancelled)
+ return; /// Already marked as cancelled.
+
+ state.cancelled = true;
+ state_changed.notify_all();
+}
+
+
+void BackupCoordinationStageSync::cancelQueryIfError()
+{
+ std::exception_ptr exception;
+
+ {
+ std::lock_guard lock{mutex};
+ if (state.cancelled || !state.host_with_error)
+ return;
+
+ state.cancelled = true;
+ exception = state.hosts.at(*state.host_with_error).exception;
+ }
+
+ process_list_element->cancelQuery(false, exception);
+ state_changed.notify_all();
+}
+
+
+void BackupCoordinationStageSync::cancelQueryIfDisconnectedTooLong()
+{
+ std::exception_ptr exception;
+
+ {
+ std::lock_guard lock{mutex};
+ if (state.cancelled || state.host_with_error || (failure_after_host_disconnected_for_seconds.count() == 0))
+ return;
+
+ auto monotonic_now = std::chrono::steady_clock::now();
+ bool info_shown = false;
+
+ for (auto & [host, host_info] : state.hosts)
+ {
+ if (!host_info.connected && !host_info.finished && (host != current_host))
+ {
+ auto disconnected_duration = std::chrono::duration_cast<std::chrono::seconds>(monotonic_now - host_info.last_connection_time_monotonic);
+ if (disconnected_duration > failure_after_host_disconnected_for_seconds)
+ {
+ /// Host `host` was disconnected too long.
+ /// We can't just throw an exception here because readCurrentState() is called from a background thread.
+ /// So here we're writing the error to the `process_list_element` and let it be thrown later
+ /// from `process_list_element->checkTimeLimit()`.
+ String message = fmt::format("The 'alive' node hasn't been updated in ZooKeeper for {} for {} "
+ "which is more than the specified timeout {}. Last time the 'alive' node was detected at {}",
+ getHostDesc(host), disconnected_duration, failure_after_host_disconnected_for_seconds,
+ host_info.last_connection_time);
+ LOG_WARNING(log, "Lost connection to {}: {}", getHostDesc(host), message);
+ exception = std::make_exception_ptr(Exception{ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, "Lost connection to {}: {}", getHostDesc(host), message});
+ break;
+ }
+
+ if ((disconnected_duration >= std::chrono::seconds{1}) && !info_shown)
+ {
+ LOG_TRACE(log, "The 'alive' node hasn't been updated in ZooKeeper for {} for {}", getHostDesc(host), disconnected_duration);
+ info_shown = true;
+ }
+ }
+ }
+
+ if (!exception)
+ return;
+
+ state.cancelled = true;
+ }
+
+ process_list_element->cancelQuery(false, exception);
+ state_changed.notify_all();
+}
+
+
+void BackupCoordinationStageSync::setStage(const String & stage, const String & stage_result)
+{
+ LOG_INFO(log, "{} reached stage {}", current_host_desc, stage);
+ auto holder = with_retries.createRetriesControlHolder("BackupStageSync::setStage");
+ holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]()
+ {
+ with_retries.renewZooKeeper(zookeeper);
+ zookeeper->createIfNotExists(getStageNodePath(stage), stage_result);
});
}
-void BackupCoordinationStageSync::setError(const String & current_host, const Exception & exception)
+
+String BackupCoordinationStageSync::getStageNodePath(const String & stage) const
{
- auto holder = with_retries.createRetriesControlHolder("setError");
- holder.retries_ctl.retryLoop(
- [&, &zookeeper = holder.faulty_zookeeper]()
+ return fs::path{zookeeper_path} / ("current|" + current_host + "|" + stage);
+}
+
+
+bool BackupCoordinationStageSync::trySetError(std::exception_ptr exception) noexcept
+{
+ try
+ {
+ std::rethrow_exception(exception);
+ }
+ catch (const Exception & e)
+ {
+ return trySetError(e);
+ }
+ catch (...)
+ {
+ return trySetError(Exception(getCurrentExceptionMessageAndPattern(true, true), getCurrentExceptionCode()));
+ }
+}
+
+
+bool BackupCoordinationStageSync::trySetError(const Exception & exception)
+{
+ try
+ {
+ setError(exception);
+ return true;
+ }
+ catch (...)
+ {
+ return false;
+ }
+}
+
+
+void BackupCoordinationStageSync::setError(const Exception & exception)
+{
+ /// Most likely this exception has already been logged, so here we log it without a stacktrace.
+ String exception_message = getExceptionMessage(exception, /* with_stacktrace= */ false, /* check_embedded_stacktrace= */ true);
+ LOG_INFO(log, "Sending exception from {} to other hosts: {}", current_host_desc, exception_message);
+
+ auto holder = with_retries.createRetriesControlHolder("BackupStageSync::setError", WithRetries::kErrorHandling);
+
+ holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]()
{
with_retries.renewZooKeeper(zookeeper);
WriteBufferFromOwnString buf;
writeStringBinary(current_host, buf);
writeException(exception, buf, true);
- zookeeper->createIfNotExists(zookeeper_path + "/error", buf.str());
+ auto code = zookeeper->tryCreate(error_node_path, buf.str(), zkutil::CreateMode::Persistent);
- /// When backup/restore fails, it removes the nodes from Zookeeper.
- /// Sometimes it fails to remove all nodes. It's possible that it removes /error node, but fails to remove /stage node,
- /// so the following line tries to preserve the error status.
- auto code = zookeeper->trySet(zookeeper_path, Stage::ERROR);
- if (code != Coordination::Error::ZOK)
- throw zkutil::KeeperException::fromPath(code, zookeeper_path);
+ if (code == Coordination::Error::ZOK)
+ {
+ LOG_TRACE(log, "Sent exception from {} to other hosts", current_host_desc);
+ }
+ else if (code == Coordination::Error::ZNODEEXISTS)
+ {
+ LOG_INFO(log, "An error has already been assigned for this {}", operation_name);
+ }
+ else
+ {
+ throw zkutil::KeeperException::fromPath(code, error_node_path);
+ }
});
}
-Strings BackupCoordinationStageSync::wait(const Strings & all_hosts, const String & stage_to_wait)
+
+Strings BackupCoordinationStageSync::waitForHostsToReachStage(const String & stage_to_wait, const Strings & hosts, std::optional<std::chrono::milliseconds> timeout) const
{
- return waitImpl(all_hosts, stage_to_wait, {});
-}
-
-Strings BackupCoordinationStageSync::waitFor(const Strings & all_hosts, const String & stage_to_wait, std::chrono::milliseconds timeout)
-{
- return waitImpl(all_hosts, stage_to_wait, timeout);
-}
-
-namespace
-{
- struct UnreadyHost
- {
- String host;
- bool started = false;
- };
-}
-
-struct BackupCoordinationStageSync::State
-{
- std::optional<Strings> results;
- std::optional<std::pair<String, Exception>> error;
- std::optional<String> disconnected_host;
- std::optional<UnreadyHost> unready_host;
-};
-
-BackupCoordinationStageSync::State BackupCoordinationStageSync::readCurrentState(
- WithRetries::RetriesControlHolder & retries_control_holder,
- const Strings & zk_nodes,
- const Strings & all_hosts,
- const String & stage_to_wait) const
-{
- auto zookeeper = retries_control_holder.faulty_zookeeper;
- auto & retries_ctl = retries_control_holder.retries_ctl;
-
- std::unordered_set<String> zk_nodes_set{zk_nodes.begin(), zk_nodes.end()};
-
- State state;
- if (zk_nodes_set.contains("error"))
- {
- String errors = zookeeper->get(zookeeper_path + "/error");
- ReadBufferFromOwnString buf{errors};
- String host;
- readStringBinary(host, buf);
- state.error = std::make_pair(host, readException(buf, fmt::format("Got error from {}", host)));
- return state;
- }
-
- std::optional<UnreadyHost> unready_host;
-
- for (const auto & host : all_hosts)
- {
- if (!zk_nodes_set.contains("current|" + host + "|" + stage_to_wait))
- {
- const String started_node_name = "started|" + host;
- const String alive_node_name = "alive|" + host;
-
- bool started = zk_nodes_set.contains(started_node_name);
- bool alive = zk_nodes_set.contains(alive_node_name);
-
- if (!alive)
- {
- /// If the "alive" node doesn't exist then we don't have connection to the corresponding host.
- /// This node is ephemeral so probably it will be recreated soon. We use zookeeper retries to wait.
- /// In worst case when we won't manage to see the alive node for a long time we will just abort the backup.
- const auto * const suffix = retries_ctl.isLastRetry() ? "" : ", will retry";
- if (started)
- retries_ctl.setUserError(Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
- "Lost connection to host {}{}", host, suffix));
- else
- retries_ctl.setUserError(Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
- "No connection to host {} yet{}", host, suffix));
-
- state.disconnected_host = host;
- return state;
- }
-
- if (!unready_host)
- unready_host.emplace(UnreadyHost{.host = host, .started = started});
- }
- }
-
- if (unready_host)
- {
- state.unready_host = std::move(unready_host);
- return state;
- }
-
Strings results;
- for (const auto & host : all_hosts)
- results.emplace_back(zookeeper->get(zookeeper_path + "/current|" + host + "|" + stage_to_wait));
- state.results = std::move(results);
+ results.resize(hosts.size());
- return state;
+ std::unique_lock lock{mutex};
+
+ /// TSA_NO_THREAD_SAFETY_ANALYSIS is here because Clang Thread Safety Analysis doesn't understand std::unique_lock.
+ auto check_if_hosts_ready = [&](bool time_is_out) TSA_NO_THREAD_SAFETY_ANALYSIS
+ {
+ return checkIfHostsReachStage(hosts, stage_to_wait, time_is_out, timeout, results);
+ };
+
+ if (timeout)
+ {
+ if (!state_changed.wait_for(lock, *timeout, [&] { return check_if_hosts_ready(/* time_is_out = */ false); }))
+ check_if_hosts_ready(/* time_is_out = */ true);
+ }
+ else
+ {
+ state_changed.wait(lock, [&] { return check_if_hosts_ready(/* time_is_out = */ false); });
+ }
+
+ return results;
}
-Strings BackupCoordinationStageSync::waitImpl(
- const Strings & all_hosts, const String & stage_to_wait, std::optional timeout) const
+
+bool BackupCoordinationStageSync::checkIfHostsReachStage(
+ const Strings & hosts,
+ const String & stage_to_wait,
+ bool time_is_out,
+ std::optional<std::chrono::milliseconds> timeout,
+ Strings & results) const
{
- if (all_hosts.empty())
- return {};
+ if (should_stop_watching_thread)
+ throw Exception(ErrorCodes::LOGICAL_ERROR, "finish() was called while waiting for a stage");
- /// Wait until all hosts are ready or an error happens or time is out.
+ process_list_element->checkTimeLimit();
- bool use_timeout = timeout.has_value();
- std::chrono::steady_clock::time_point end_of_timeout;
- if (use_timeout)
- end_of_timeout = std::chrono::steady_clock::now() + std::chrono::duration_cast(*timeout);
-
- State state;
- for (;;)
+ for (size_t i = 0; i != hosts.size(); ++i)
{
- LOG_INFO(log, "Waiting for the stage {}", stage_to_wait);
- /// Set by ZooKepper when list of zk nodes have changed.
- auto watch = std::make_shared<Poco::Event>();
- Strings zk_nodes;
- {
- auto holder = with_retries.createRetriesControlHolder("waitImpl");
- holder.retries_ctl.retryLoop(
- [&, &zookeeper = holder.faulty_zookeeper]()
- {
- with_retries.renewZooKeeper(zookeeper);
- watch->reset();
- /// Get zk nodes and subscribe on their changes.
- zk_nodes = zookeeper->getChildren(zookeeper_path, nullptr, watch);
+ const String & host = hosts[i];
+ auto it = state.hosts.find(host);
- /// Read the current state of zk nodes.
- state = readCurrentState(holder, zk_nodes, all_hosts, stage_to_wait);
- });
+ if (it == state.hosts.end())
+ throw Exception(ErrorCodes::LOGICAL_ERROR, "waitForHostsToReachStage() was called for unexpected {}, all hosts are {}", getHostDesc(host), getHostsDesc(all_hosts));
+
+ const HostInfo & host_info = it->second;
+ auto stage_it = host_info.stages.find(stage_to_wait);
+ if (stage_it != host_info.stages.end())
+ {
+ results[i] = stage_it->second;
+ continue;
}
- /// Analyze the current state of zk nodes.
- chassert(state.results || state.error || state.disconnected_host || state.unready_host);
-
- if (state.results || state.error || state.disconnected_host)
- break; /// Everything is ready or error happened.
-
- /// Log what we will wait.
- const auto & unready_host = *state.unready_host;
- LOG_INFO(log, "Waiting on ZooKeeper watch for any node to be changed (currently waiting for host {}{})",
- unready_host.host,
- (!unready_host.started ? " which didn't start the operation yet" : ""));
-
- /// Wait until `watch_callback` is called by ZooKeeper meaning that zk nodes have changed.
+ if (host_info.finished)
{
- if (use_timeout)
+ throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
+ "{} finished without coming to stage {}", getHostDesc(host), stage_to_wait);
+ }
+
+ String host_status;
+ if (!host_info.started)
+ host_status = fmt::format(": the host hasn't started working on this {} yet", operation_name);
+ else if (!host_info.connected)
+ host_status = fmt::format(": the host is currently disconnected, last connection was at {}", host_info.last_connection_time);
+
+ if (!time_is_out)
+ {
+ LOG_TRACE(log, "Waiting for {} to reach stage {}{}", getHostDesc(host), stage_to_wait, host_status);
+ return false;
+ }
+ else
+ {
+ throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
+ "Waited longer than timeout {} for {} to reach stage {}{}",
+ *timeout, getHostDesc(host), stage_to_wait, host_status);
+ }
+ }
+
+ LOG_INFO(log, "Hosts {} reached stage {}", getHostsDesc(hosts), stage_to_wait);
+ return true;
+}
+
+
+void BackupCoordinationStageSync::finish(bool & other_hosts_also_finished)
+{
+ tryFinishImpl(other_hosts_also_finished, /* throw_if_error = */ true, /* retries_kind = */ WithRetries::kNormal);
+}
+
+
+bool BackupCoordinationStageSync::tryFinishAfterError(bool & other_hosts_also_finished) noexcept
+{
+ return tryFinishImpl(other_hosts_also_finished, /* throw_if_error = */ false, /* retries_kind = */ WithRetries::kErrorHandling);
+}
+
+
+bool BackupCoordinationStageSync::tryFinishImpl()
+{
+ bool other_hosts_also_finished;
+ return tryFinishAfterError(other_hosts_also_finished);
+}
+
+
+bool BackupCoordinationStageSync::tryFinishImpl(bool & other_hosts_also_finished, bool throw_if_error, WithRetries::Kind retries_kind)
+{
+ auto get_value_other_hosts_also_finished = [&]() TSA_REQUIRES(mutex)
+ {
+ other_hosts_also_finished = true;
+ for (const auto & [host, host_info] : state.hosts)
+ {
+ if ((host != current_host) && !host_info.finished)
+ other_hosts_also_finished = false;
+ }
+ };
+
+ {
+ std::lock_guard lock{mutex};
+ if (finish_result.succeeded)
+ {
+ get_value_other_hosts_also_finished();
+ return true;
+ }
+ if (finish_result.exception)
+ {
+ if (throw_if_error)
+ std::rethrow_exception(finish_result.exception);
+ return false;
+ }
+ }
+
+ try
+ {
+ stopWatchingThread();
+
+ auto holder = with_retries.createRetriesControlHolder("BackupStageSync::finish", retries_kind);
+ holder.retries_ctl.retryLoop([&, &zookeeper = holder.faulty_zookeeper]()
+ {
+ with_retries.renewZooKeeper(zookeeper);
+ createFinishNodeAndRemoveAliveNode(zookeeper);
+ });
+
+ std::lock_guard lock{mutex};
+ finish_result.succeeded = true;
+ get_value_other_hosts_also_finished();
+ return true;
+ }
+ catch (...)
+ {
+ LOG_TRACE(log, "Caught exception while creating the 'finish' node for {}: {}",
+ current_host_desc,
+ getCurrentExceptionMessage(/* with_stacktrace= */ false, /* check_embedded_stacktrace= */ true));
+
+ std::lock_guard lock{mutex};
+ finish_result.exception = std::current_exception();
+ if (throw_if_error)
+ throw;
+ return false;
+ }
+}
+
+
+void BackupCoordinationStageSync::createFinishNodeAndRemoveAliveNode(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper)
+{
+ if (zookeeper->exists(finish_node_path))
+ return;
+
+ /// If the initiator of the query has that old version then it doesn't expect us to create the 'finish' node and moreover
+ /// the initiator can start removing all the nodes immediately after all hosts report about reaching the "completed" status.
+ /// So to avoid weird errors in the logs we won't create the 'finish' node if the initiator of the query has that old version.
+ if ((getInitiatorVersion() == kVersionWithoutFinishNode) && (current_host != kInitiator))
+ {
+ LOG_INFO(log, "Skipped creating the 'finish' node because the initiator uses outdated version {}", getInitiatorVersion());
+ return;
+ }
+
+ std::optional<size_t> num_hosts;
+ int num_hosts_version = -1;
+
+ for (size_t attempt_no = 1; attempt_no <= max_attempts_after_bad_version; ++attempt_no)
+ {
+ if (!num_hosts)
+ {
+ Coordination::Stat stat;
+ num_hosts = parseFromString<size_t>(zookeeper->get(num_hosts_node_path, &stat));
+ num_hosts_version = stat.version;
+ }
+
+ Coordination::Requests requests;
+ requests.reserve(3);
+
+ requests.emplace_back(zkutil::makeCreateRequest(finish_node_path, "", zkutil::CreateMode::Persistent));
+
+ size_t num_hosts_node_path_pos = requests.size();
+ requests.emplace_back(zkutil::makeSetRequest(num_hosts_node_path, toString(*num_hosts - 1), num_hosts_version));
+
+ size_t alive_node_path_pos = static_cast<size_t>(-1);
+ if (zookeeper->exists(alive_node_path))
+ {
+ alive_node_path_pos = requests.size();
+ requests.emplace_back(zkutil::makeRemoveRequest(alive_node_path, -1));
+ }
+
+ Coordination::Responses responses;
+ auto code = zookeeper->tryMulti(requests, responses);
+
+ if (code == Coordination::Error::ZOK)
+ {
+ --*num_hosts;
+ String hosts_left_desc = ((*num_hosts == 0) ? "no hosts left" : fmt::format("{} hosts left", *num_hosts));
+ LOG_INFO(log, "Created the 'finish' node in ZooKeeper for {}, {}", current_host_desc, hosts_left_desc);
+ return;
+ }
+
+ auto show_error_before_next_attempt = [&](const String & message)
+ {
+ bool will_try_again = (attempt_no < max_attempts_after_bad_version);
+ LOG_TRACE(log, "{} (attempt #{}){}", message, attempt_no, will_try_again ? ", will try again" : "");
+ };
+
+ if ((responses.size() > num_hosts_node_path_pos) &&
+ (responses[num_hosts_node_path_pos]->error == Coordination::Error::ZBADVERSION))
+ {
+ show_error_before_next_attempt("Other host changed the 'num_hosts' node in ZooKeeper");
+ num_hosts.reset(); /// needs to reread 'num_hosts' again
+ }
+ else if ((responses.size() > alive_node_path_pos) &&
+ (responses[alive_node_path_pos]->error == Coordination::Error::ZNONODE))
+ {
+ show_error_before_next_attempt(fmt::format("Node {} in ZooKeeper doesn't exist", alive_node_path));
+ /// needs another attempt
+ }
+ else
+ {
+ zkutil::KeeperMultiException::check(code, requests, responses);
+ }
+ }
+
+ throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
+ "Couldn't create the 'finish' node for {} after {} attempts",
+ current_host_desc, max_attempts_after_bad_version);
+}
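The loop above is the classic ZooKeeper optimistic-concurrency pattern: read `num_hosts` together with its version, attempt the multi-op conditioned on that version, and re-read only when the conditional write fails with `ZBADVERSION`. A minimal standalone sketch of the same read/try/re-read cycle (`VersionedStore` is a hypothetical in-memory stand-in for a ZooKeeper node, not ClickHouse code):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

/// Hypothetical stand-in for a ZooKeeper node: a value plus a version
/// that is bumped on every successful write.
struct VersionedStore
{
    std::string value;
    int version = 0;

    std::pair<std::string, int> get() const { return {value, version}; }

    /// Conditional write: succeeds only if `expected_version` is current
    /// (a failure here is analogous to Coordination::Error::ZBADVERSION).
    bool trySet(const std::string & new_value, int expected_version)
    {
        if (expected_version != version)
            return false;
        value = new_value;
        ++version;
        return true;
    }
};

/// Decrements a numeric counter with the same read/try/re-read loop
/// shape as createFinishNodeAndRemoveAliveNode().
bool decrementWithRetries(VersionedStore & store, size_t max_attempts)
{
    std::optional<int> num;
    int num_version = -1;

    for (size_t attempt = 1; attempt <= max_attempts; ++attempt)
    {
        if (!num)
        {
            auto [contents, version] = store.get();
            num = std::stoi(contents);
            num_version = version;
        }

        if (store.trySet(std::to_string(*num - 1), num_version))
            return true; /// the conditional write succeeded

        num.reset(); /// someone else changed the node: re-read and retry
    }
    return false; /// out of attempts, analogous to FAILED_TO_SYNC_BACKUP_OR_RESTORE
}
```

The re-read happens only on a version conflict, so the common uncontended case costs a single round trip, mirroring how `num_hosts.reset()` is used above.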
+
+
+int BackupCoordinationStageSync::getInitiatorVersion() const
+{
+ std::lock_guard lock{mutex};
+ auto it = state.hosts.find(String{kInitiator});
+ if (it == state.hosts.end())
+ throw Exception(ErrorCodes::LOGICAL_ERROR, "There is no initiator of this {} query, it's a bug", operation_name);
+ const HostInfo & host_info = it->second;
+ return host_info.version;
+}
+
+
+void BackupCoordinationStageSync::waitForOtherHostsToFinish() const
+{
+ tryWaitForOtherHostsToFinishImpl(/* reason = */ "", /* throw_if_error = */ true, /* timeout = */ {});
+}
+
+
+bool BackupCoordinationStageSync::tryWaitForOtherHostsToFinishAfterError() const noexcept
+{
+ std::optional<std::chrono::milliseconds> timeout;
+ if (finish_timeout_after_error.count() != 0)
+ timeout = finish_timeout_after_error;
+
+ String reason = fmt::format("{} needs other hosts to finish before cleanup", current_host_desc);
+ return tryWaitForOtherHostsToFinishImpl(reason, /* throw_if_error = */ false, timeout);
+}
+
+
+bool BackupCoordinationStageSync::tryWaitForOtherHostsToFinishImpl(const String & reason, bool throw_if_error, std::optional<std::chrono::milliseconds> timeout) const
+{
+ std::unique_lock lock{mutex};
+
+ /// TSA_NO_THREAD_SAFETY_ANALYSIS is here because Clang Thread Safety Analysis doesn't understand std::unique_lock.
+ auto check_if_other_hosts_finish = [&](bool time_is_out) TSA_NO_THREAD_SAFETY_ANALYSIS
+ {
+ return checkIfOtherHostsFinish(reason, throw_if_error, time_is_out, timeout);
+ };
+
+ if (timeout)
+ {
+ if (state_changed.wait_for(lock, *timeout, [&] { return check_if_other_hosts_finish(/* time_is_out = */ false); }))
+ return true;
+ return check_if_other_hosts_finish(/* time_is_out = */ true);
+ }
+ else
+ {
+ state_changed.wait(lock, [&] { return check_if_other_hosts_finish(/* time_is_out = */ false); });
+ return true;
+ }
+}
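The wait above follows a common `condition_variable` idiom: `wait_for` with a predicate while the timeout runs, then one final predicate call with `time_is_out = true` so the timeout path can still log or throw with full context under the lock. A self-contained sketch of that shape (the `Waiter` class is illustrative, not part of ClickHouse):

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <optional>

class Waiter
{
public:
    void setDone()
    {
        {
            std::lock_guard lock{mutex};
            done = true;
        }
        state_changed.notify_all();
    }

    bool waitDone(std::optional<std::chrono::milliseconds> timeout)
    {
        std::unique_lock lock{mutex};

        /// In the real code this predicate also logs and can throw when
        /// `time_is_out` is true; here it just reports the flag's shape.
        auto check = [&](bool /* time_is_out */) { return done; };

        if (timeout)
        {
            if (state_changed.wait_for(lock, *timeout, [&] { return check(false); }))
                return true;
            /// Timed out: one final check under the lock decides the result.
            return check(/* time_is_out = */ true);
        }
        state_changed.wait(lock, [&] { return check(false); });
        return true;
    }

private:
    std::mutex mutex;
    std::condition_variable state_changed;
    bool done = false;
};
```

The final `check(true)` matters because the state can change between the timeout expiring and the waiter re-acquiring the lock; deciding only inside the predicate avoids a lost-wakeup race.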
+
+
+bool BackupCoordinationStageSync::checkIfOtherHostsFinish(const String & reason, bool throw_if_error, bool time_is_out, std::optional<std::chrono::milliseconds> timeout) const
+{
+ if (should_stop_watching_thread)
+ throw Exception(ErrorCodes::LOGICAL_ERROR, "finish() was called while waiting for other hosts to finish");
+
+ if (throw_if_error)
+ process_list_element->checkTimeLimit();
+
+ for (const auto & [host, host_info] : state.hosts)
+ {
+ if ((host == current_host) || host_info.finished)
+ continue;
+
+ String host_status;
+ if (!host_info.started)
+ host_status = fmt::format(": the host hasn't started working on this {} yet", operation_name);
+ else if (!host_info.connected)
+ host_status = fmt::format(": the host is currently disconnected, last connection was at {}", host_info.last_connection_time);
+
+ if (!time_is_out)
+ {
+ String reason_text = reason.empty() ? "" : (" because " + reason);
+ LOG_TRACE(log, "Waiting for {} to finish{}{}", getHostDesc(host), reason_text, host_status);
+ return false;
+ }
+ else
+ {
+ String reason_text = reason.empty() ? "" : fmt::format(" (reason of waiting: {})", reason);
+ if (!throw_if_error)
{
- auto current_time = std::chrono::steady_clock::now();
- if ((current_time > end_of_timeout)
- || !watch->tryWait(std::chrono::duration_cast<std::chrono::milliseconds>(end_of_timeout - current_time).count()))
- break;
+ LOG_INFO(log, "Waited longer than timeout {} for {} to finish{}{}",
+ *timeout, getHostDesc(host), host_status, reason_text);
+ return false;
}
else
{
- watch->wait();
+ throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
+ "Waited longer than timeout {} for {} to finish{}{}",
+ *timeout, getHostDesc(host), host_status, reason_text);
}
}
}
- /// Rethrow an error raised originally on another host.
- if (state.error)
- state.error->second.rethrow();
-
- /// Another host terminated without errors.
- if (state.disconnected_host)
- throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE, "No connection to host {}", *state.disconnected_host);
-
- /// Something's unready, timeout is probably not enough.
- if (state.unready_host)
- {
- const auto & unready_host = *state.unready_host;
- throw Exception(
- ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
- "Waited for host {} too long (> {}){}",
- unready_host.host,
- to_string(*timeout),
- unready_host.started ? "" : ": Operation didn't start");
- }
-
- LOG_TRACE(log, "Everything is Ok. All hosts achieved stage {}", stage_to_wait);
- return std::move(*state.results);
+ LOG_TRACE(log, "Other hosts finished working on this {}", operation_name);
+ return true;
}
}
diff --git a/src/Backups/BackupCoordinationStageSync.h b/src/Backups/BackupCoordinationStageSync.h
index a06c5c61041..dc0d3c3c83d 100644
--- a/src/Backups/BackupCoordinationStageSync.h
+++ b/src/Backups/BackupCoordinationStageSync.h
@@ -10,33 +10,193 @@ class BackupCoordinationStageSync
{
public:
BackupCoordinationStageSync(
- const String & root_zookeeper_path_,
- WithRetries & with_retries_,
+ bool is_restore_, /// true if this is a RESTORE ON CLUSTER command, false if this is a BACKUP ON CLUSTER command
+ const String & zookeeper_path_, /// path to the "stage" folder in ZooKeeper
+ const String & current_host_, /// the current host, or an empty string if it's the initiator of the BACKUP/RESTORE ON CLUSTER command
+ const Strings & all_hosts_, /// all the hosts (including the initiator and the current host) performing the BACKUP/RESTORE ON CLUSTER command
+ bool allow_concurrency_, /// whether it's allowed to have concurrent backups or restores.
+ const WithRetries & with_retries_,
+ ThreadPoolCallbackRunnerUnsafe<void> schedule_,
+ QueryStatusPtr process_list_element_,
LoggerPtr log_);
+ ~BackupCoordinationStageSync();
+
/// Sets the stage of the current host and signal other hosts if there were other hosts waiting for that.
- void set(const String & current_host, const String & new_stage, const String & message, const bool & all_hosts = false);
- void setError(const String & current_host, const Exception & exception);
+ void setStage(const String & stage, const String & stage_result = {});
- /// Sets the stage of the current host and waits until all hosts come to the same stage.
- /// The function returns the messages all hosts set when they come to the required stage.
- Strings wait(const Strings & all_hosts, const String & stage_to_wait);
+ /// Waits until all the specified hosts come to the specified stage.
+ /// The function returns the results which specified hosts set when they came to the required stage.
+ /// If it doesn't happen before the timeout then the function will stop waiting and throw an exception.
+ Strings waitForHostsToReachStage(const String & stage_to_wait, const Strings & hosts, std::optional<std::chrono::milliseconds> timeout = {}) const;
- /// Almost the same as setAndWait() but this one stops waiting and throws an exception after a specific amount of time.
- Strings waitFor(const Strings & all_hosts, const String & stage_to_wait, std::chrono::milliseconds timeout);
+ /// Waits until all the other hosts finish their work.
+ /// Stops waiting and throws an exception if another host encounters an error or if some host gets cancelled.
+ void waitForOtherHostsToFinish() const;
+
+ /// Lets other hosts know that the current host has finished its work.
+ void finish(bool & other_hosts_also_finished);
+
+ /// Lets other hosts know that the current host has encountered an error.
+ bool trySetError(std::exception_ptr exception) noexcept;
+
+ /// Waits until all the other hosts finish their work (as a part of error-handling process).
+ /// Doesn't stop waiting if some host encounters an error or gets cancelled.
+ bool tryWaitForOtherHostsToFinishAfterError() const noexcept;
+
+ /// Lets other hosts know that the current host has finished its work (as a part of error-handling process).
+ bool tryFinishAfterError(bool & other_hosts_also_finished) noexcept;
+
+ /// Returns a printable name of a specific host. For an empty host the function returns "initiator".
+ static String getHostDesc(const String & host);
+ static String getHostsDesc(const Strings & hosts);
private:
+ /// Initializes the original state. It will be updated then with readCurrentState().
+ void initializeState();
+
+ /// Creates the root node in ZooKeeper.
void createRootNodes();
- struct State;
- State readCurrentState(WithRetries::RetriesControlHolder & retries_control_holder, const Strings & zk_nodes, const Strings & all_hosts, const String & stage_to_wait) const;
+ /// Atomically creates both 'start' and 'alive' nodes and also checks that there is no concurrent backup or restore if `allow_concurrency` is false.
+ void createStartAndAliveNodes();
+ void createStartAndAliveNodes(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper);
- Strings waitImpl(const Strings & all_hosts, const String & stage_to_wait, std::optional<std::chrono::milliseconds> timeout) const;
+ /// Deserialize the version of a node stored in the 'start' node.
+ int parseStartNode(const String & start_node_contents, const String & host) const;
- String zookeeper_path;
- /// A reference to the field of parent object - BackupCoordinationRemote or RestoreCoordinationRemote
- WithRetries & with_retries;
- LoggerPtr log;
+ /// Recreates the 'alive' node if it doesn't exist. It's an ephemeral node so it's removed automatically after disconnections.
+ void createAliveNode(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper);
+
+ /// Checks that there is no concurrent backup or restore if `allow_concurrency` is false.
+ void checkConcurrency(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper);
+
+ /// Watching thread periodically reads the current state from ZooKeeper and recreates the 'alive' node.
+ void startWatchingThread();
+ void stopWatchingThread();
+ void watchingThread();
+
+ /// Reads the current state from ZooKeeper without throwing exceptions.
+ void readCurrentState(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper);
+ String getStageNodePath(const String & stage) const;
+
+ /// Lets other hosts know that the current host has encountered an error.
+ bool trySetError(const Exception & exception);
+ void setError(const Exception & exception);
+
+ /// Deserializes an error stored in the error node.
+ static std::pair<std::exception_ptr, String> parseErrorNode(const String & error_node_contents);
+
+ /// Reset the `connected` flag for each host.
+ void resetConnectedFlag();
+
+ /// Checks if the current query is cancelled, and if so then the function sets the `cancelled` flag in the current state.
+ void checkIfQueryCancelled();
+
+ /// Checks if the current state contains an error, and if so then the function passes this error to the query status
+ /// to cancel the current BACKUP or RESTORE command.
+ void cancelQueryIfError();
+
+ /// Checks if some host was disconnected for too long, and if so then the function generates an error and passes it to the query status
+ /// to cancel the current BACKUP or RESTORE command.
+ void cancelQueryIfDisconnectedTooLong();
+
+ /// Used by waitForHostsToReachStage() to check if everything is ready to return.
+ bool checkIfHostsReachStage(const Strings & hosts, const String & stage_to_wait, bool time_is_out, std::optional<std::chrono::milliseconds> timeout, Strings & results) const TSA_REQUIRES(mutex);
+
+ /// Creates the 'finish' node.
+ bool tryFinishImpl();
+ bool tryFinishImpl(bool & other_hosts_also_finished, bool throw_if_error, WithRetries::Kind retries_kind);
+ void createFinishNodeAndRemoveAliveNode(Coordination::ZooKeeperWithFaultInjection::Ptr zookeeper);
+
+ /// Returns the version used by the initiator.
+ int getInitiatorVersion() const;
+
+ /// Waits until all the other hosts finish their work.
+ bool tryWaitForOtherHostsToFinishImpl(const String & reason, bool throw_if_error, std::optional<std::chrono::milliseconds> timeout) const;
+ bool checkIfOtherHostsFinish(const String & reason, bool throw_if_error, bool time_is_out, std::optional<std::chrono::milliseconds> timeout) const TSA_REQUIRES(mutex);
+
+ const bool is_restore;
+ const String operation_name;
+ const String current_host;
+ const String current_host_desc;
+ const Strings all_hosts;
+ const bool allow_concurrency;
+
+ /// A reference to a field of the parent object which is either BackupCoordinationOnCluster or RestoreCoordinationOnCluster.
+ const WithRetries & with_retries;
+
+ const ThreadPoolCallbackRunnerUnsafe<void> schedule;
+ const QueryStatusPtr process_list_element;
+ const LoggerPtr log;
+
+ const std::chrono::seconds failure_after_host_disconnected_for_seconds;
+ const std::chrono::seconds finish_timeout_after_error;
+ const std::chrono::milliseconds sync_period_ms;
+ const size_t max_attempts_after_bad_version;
+
+ /// Paths in ZooKeeper.
+ const std::filesystem::path zookeeper_path;
+ const String root_zookeeper_path;
+ const String operation_node_path;
+ const String operation_node_name;
+ const String stage_node_path;
+ const String start_node_path;
+ const String finish_node_path;
+ const String num_hosts_node_path;
+ const String alive_node_path;
+ const String alive_tracker_node_path;
+ const String error_node_path;
+
+ std::shared_ptr<Poco::Event> zk_nodes_changed;
+
+ /// We store list of previously found ZooKeeper nodes to show better logging messages.
+ Strings zk_nodes;
+
+ /// Information about one host read from ZooKeeper.
+ struct HostInfo
+ {
+ String host;
+ bool started = false;
+ bool connected = false;
+ bool finished = false;
+ int version = 1;
+ std::map<String, String> stages = {}; /// std::map because we need to compare states
+ std::exception_ptr exception = nullptr;
+
+ std::chrono::time_point<std::chrono::system_clock> last_connection_time = {};
+ std::chrono::time_point<std::chrono::steady_clock> last_connection_time_monotonic = {};
+
+ bool operator ==(const HostInfo & other) const;
+ bool operator !=(const HostInfo & other) const;
+ };
+
+ /// Information about all the hosts participating in the current BACKUP or RESTORE operation.
+ struct State
+ {
+ std::map<String, HostInfo> hosts; /// std::map because we need to compare states
+ std::optional<String> host_with_error;
+ bool cancelled = false;
+
+ bool operator ==(const State & other) const;
+ bool operator !=(const State & other) const;
+ };
+
+ State state TSA_GUARDED_BY(mutex);
+ mutable std::condition_variable state_changed;
+
+ std::future<void> watching_thread_future;
+ std::atomic<bool> should_stop_watching_thread = false;
+
+ struct FinishResult
+ {
+ bool succeeded = false;
+ std::exception_ptr exception;
+ bool other_hosts_also_finished = false;
+ };
+ FinishResult finish_result TSA_GUARDED_BY(mutex);
+
+ mutable std::mutex mutex;
};
}
diff --git a/src/Backups/BackupEntriesCollector.cpp b/src/Backups/BackupEntriesCollector.cpp
index ae73630d41c..00a4471d994 100644
--- a/src/Backups/BackupEntriesCollector.cpp
+++ b/src/Backups/BackupEntriesCollector.cpp
@@ -102,7 +102,6 @@ BackupEntriesCollector::BackupEntriesCollector(
, read_settings(read_settings_)
, context(context_)
, process_list_element(context->getProcessListElement())
- , on_cluster_first_sync_timeout(context->getConfigRef().getUInt64("backups.on_cluster_first_sync_timeout", 180000))
, collect_metadata_timeout(context->getConfigRef().getUInt64(
"backups.collect_metadata_timeout", context->getConfigRef().getUInt64("backups.consistent_metadata_snapshot_timeout", 600000)))
, attempts_to_collect_metadata_before_sleep(context->getConfigRef().getUInt("backups.attempts_to_collect_metadata_before_sleep", 2))
@@ -176,21 +175,7 @@ Strings BackupEntriesCollector::setStage(const String & new_stage, const String
checkIsQueryCancelled();
current_stage = new_stage;
- backup_coordination->setStage(new_stage, message);
-
- if (new_stage == Stage::formatGatheringMetadata(0))
- {
- return backup_coordination->waitForStage(new_stage, on_cluster_first_sync_timeout);
- }
- if (new_stage.starts_with(Stage::GATHERING_METADATA))
- {
- auto current_time = std::chrono::steady_clock::now();
- auto end_of_timeout = std::max(current_time, collect_metadata_end_time);
- return backup_coordination->waitForStage(
- new_stage, std::chrono::duration_cast<std::chrono::milliseconds>(end_of_timeout - current_time));
- }
-
- return backup_coordination->waitForStage(new_stage);
+ return backup_coordination->setStage(new_stage, message, /* sync = */ true);
}
void BackupEntriesCollector::checkIsQueryCancelled() const
diff --git a/src/Backups/BackupEntriesCollector.h b/src/Backups/BackupEntriesCollector.h
index ae076a84c8b..504489cce6b 100644
--- a/src/Backups/BackupEntriesCollector.h
+++ b/src/Backups/BackupEntriesCollector.h
@@ -111,10 +111,6 @@ private:
ContextPtr context;
QueryStatusPtr process_list_element;
- /// The time a BACKUP ON CLUSTER or RESTORE ON CLUSTER command will wait until all the nodes receive the BACKUP (or RESTORE) query and start working.
- /// This setting is similar to `distributed_ddl_task_timeout`.
- const std::chrono::milliseconds on_cluster_first_sync_timeout;
-
/// The time a BACKUP command will try to collect the metadata of tables & databases.
const std::chrono::milliseconds collect_metadata_timeout;
diff --git a/src/Backups/BackupIO.h b/src/Backups/BackupIO.h
index ee2f38c785b..c9e0f25f9a0 100644
--- a/src/Backups/BackupIO.h
+++ b/src/Backups/BackupIO.h
@@ -5,6 +5,7 @@
namespace DB
{
+
class IDisk;
using DiskPtr = std::shared_ptr<IDisk>;
class SeekableReadBuffer;
@@ -63,9 +64,13 @@ public:
virtual void copyFile(const String & destination, const String & source, size_t size) = 0;
+ /// Removes a file written to the backup, if it still exists.
virtual void removeFile(const String & file_name) = 0;
virtual void removeFiles(const Strings & file_names) = 0;
+ /// Removes the backup folder if it's empty or contains empty subfolders.
+ virtual void removeEmptyDirectories() = 0;
+
virtual const ReadSettings & getReadSettings() const = 0;
virtual const WriteSettings & getWriteSettings() const = 0;
virtual size_t getWriteBufferSize() const = 0;
diff --git a/src/Backups/BackupIO_AzureBlobStorage.h b/src/Backups/BackupIO_AzureBlobStorage.h
index c3b88f245ab..c90a030a1e7 100644
--- a/src/Backups/BackupIO_AzureBlobStorage.h
+++ b/src/Backups/BackupIO_AzureBlobStorage.h
@@ -81,6 +81,7 @@ public:
void removeFile(const String & file_name) override;
void removeFiles(const Strings & file_names) override;
+ void removeEmptyDirectories() override {}
private:
std::unique_ptr<ReadBuffer> readFile(const String & file_name, size_t expected_file_size) override;
diff --git a/src/Backups/BackupIO_Disk.cpp b/src/Backups/BackupIO_Disk.cpp
index aeb07b154f5..794fb5be936 100644
--- a/src/Backups/BackupIO_Disk.cpp
+++ b/src/Backups/BackupIO_Disk.cpp
@@ -91,16 +91,36 @@ std::unique_ptr<WriteBuffer> BackupWriterDisk::writeFile(const String & file_nam
void BackupWriterDisk::removeFile(const String & file_name)
{
disk->removeFileIfExists(root_path / file_name);
- if (disk->existsDirectory(root_path) && disk->isDirectoryEmpty(root_path))
- disk->removeDirectory(root_path);
}
void BackupWriterDisk::removeFiles(const Strings & file_names)
{
for (const auto & file_name : file_names)
disk->removeFileIfExists(root_path / file_name);
- if (disk->existsDirectory(root_path) && disk->isDirectoryEmpty(root_path))
- disk->removeDirectory(root_path);
+}
+
+void BackupWriterDisk::removeEmptyDirectories()
+{
+ removeEmptyDirectoriesImpl(root_path);
+}
+
+void BackupWriterDisk::removeEmptyDirectoriesImpl(const fs::path & current_dir)
+{
+ if (!disk->existsDirectory(current_dir))
+ return;
+
+ if (disk->isDirectoryEmpty(current_dir))
+ {
+ disk->removeDirectory(current_dir);
+ return;
+ }
+
+ /// Backups are not too deep, so recursion is good enough here.
+ for (auto it = disk->iterateDirectory(current_dir); it->isValid(); it->next())
+ removeEmptyDirectoriesImpl(current_dir / it->name());
+
+ if (disk->isDirectoryEmpty(current_dir))
+ disk->removeDirectory(current_dir);
}
void BackupWriterDisk::copyFileFromDisk(const String & path_in_backup, DiskPtr src_disk, const String & src_path,
diff --git a/src/Backups/BackupIO_Disk.h b/src/Backups/BackupIO_Disk.h
index 3d3253877bd..c77513935a9 100644
--- a/src/Backups/BackupIO_Disk.h
+++ b/src/Backups/BackupIO_Disk.h
@@ -50,9 +50,11 @@ public:
void removeFile(const String & file_name) override;
void removeFiles(const Strings & file_names) override;
+ void removeEmptyDirectories() override;
private:
std::unique_ptr<ReadBuffer> readFile(const String & file_name, size_t expected_file_size) override;
+ void removeEmptyDirectoriesImpl(const std::filesystem::path & current_dir);
const DiskPtr disk;
const std::filesystem::path root_path;
diff --git a/src/Backups/BackupIO_File.cpp b/src/Backups/BackupIO_File.cpp
index 681513bf7ce..80f084d241c 100644
--- a/src/Backups/BackupIO_File.cpp
+++ b/src/Backups/BackupIO_File.cpp
@@ -106,16 +106,36 @@ std::unique_ptr<WriteBuffer> BackupWriterFile::writeFile(const String & file_nam
void BackupWriterFile::removeFile(const String & file_name)
{
(void)fs::remove(root_path / file_name);
- if (fs::is_directory(root_path) && fs::is_empty(root_path))
- (void)fs::remove(root_path);
}
void BackupWriterFile::removeFiles(const Strings & file_names)
{
for (const auto & file_name : file_names)
(void)fs::remove(root_path / file_name);
- if (fs::is_directory(root_path) && fs::is_empty(root_path))
- (void)fs::remove(root_path);
+}
+
+void BackupWriterFile::removeEmptyDirectories()
+{
+ removeEmptyDirectoriesImpl(root_path);
+}
+
+void BackupWriterFile::removeEmptyDirectoriesImpl(const fs::path & current_dir)
+{
+ if (!fs::is_directory(current_dir))
+ return;
+
+ if (fs::is_empty(current_dir))
+ {
+ (void)fs::remove(current_dir);
+ return;
+ }
+
+ /// Backups are not too deep, so recursion is good enough here.
+ for (const auto & it : std::filesystem::directory_iterator{current_dir})
+ removeEmptyDirectoriesImpl(it.path());
+
+ if (fs::is_empty(current_dir))
+ (void)fs::remove(current_dir);
}
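The same post-order pruning can be exercised standalone with `std::filesystem`: recurse into children first, then remove the directory itself once everything inside it has turned out to be empty. A sketch with hypothetical temp paths:

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

/// Post-order pruning of empty directories, mirroring the shape of
/// BackupWriterFile::removeEmptyDirectoriesImpl() above.
void removeEmptyDirectories(const fs::path & current_dir)
{
    if (!fs::is_directory(current_dir))
        return;

    if (fs::is_empty(current_dir))
    {
        (void)fs::remove(current_dir);
        return;
    }

    /// Backups are not too deep, so recursion is good enough here.
    for (const auto & entry : fs::directory_iterator{current_dir})
        removeEmptyDirectories(entry.path());

    if (fs::is_empty(current_dir))
        (void)fs::remove(current_dir);
}
```

A directory containing only empty subdirectories gets removed bottom-up, while any directory on a path to a real file survives.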
void BackupWriterFile::copyFileFromDisk(const String & path_in_backup, DiskPtr src_disk, const String & src_path,
diff --git a/src/Backups/BackupIO_File.h b/src/Backups/BackupIO_File.h
index ebe9a0f02cb..a2169ac7b4b 100644
--- a/src/Backups/BackupIO_File.h
+++ b/src/Backups/BackupIO_File.h
@@ -42,9 +42,11 @@ public:
void removeFile(const String & file_name) override;
void removeFiles(const Strings & file_names) override;
+ void removeEmptyDirectories() override;
private:
std::unique_ptr<ReadBuffer> readFile(const String & file_name, size_t expected_file_size) override;
+ void removeEmptyDirectoriesImpl(const std::filesystem::path & current_dir);
const std::filesystem::path root_path;
const DataSourceDescription data_source_description;
diff --git a/src/Backups/BackupIO_S3.h b/src/Backups/BackupIO_S3.h
index a04f1c915b9..4ccf477b369 100644
--- a/src/Backups/BackupIO_S3.h
+++ b/src/Backups/BackupIO_S3.h
@@ -74,6 +74,7 @@ public:
void removeFile(const String & file_name) override;
void removeFiles(const Strings & file_names) override;
+ void removeEmptyDirectories() override {}
private:
std::unique_ptr<ReadBuffer> readFile(const String & file_name, size_t expected_file_size) override;
diff --git a/src/Backups/BackupImpl.cpp b/src/Backups/BackupImpl.cpp
index b95a2e10b4d..af3fa5531b8 100644
--- a/src/Backups/BackupImpl.cpp
+++ b/src/Backups/BackupImpl.cpp
@@ -147,11 +147,11 @@ BackupImpl::BackupImpl(
BackupImpl::~BackupImpl()
{
- if ((open_mode == OpenMode::WRITE) && !is_internal_backup && !writing_finalized && !std::uncaught_exceptions() && !std::current_exception())
+ if ((open_mode == OpenMode::WRITE) && !writing_finalized && !corrupted)
{
/// It is suspicious to destroy BackupImpl without finalization while writing a backup when there is no exception.
- LOG_ERROR(log, "BackupImpl is not finalized when destructor is called. Stack trace: {}", StackTrace().toString());
- chassert(false && "BackupImpl is not finalized when destructor is called.");
+ LOG_ERROR(log, "BackupImpl is not finalized or marked as corrupted when destructor is called. Stack trace: {}", StackTrace().toString());
+ chassert(false, "BackupImpl is not finalized or marked as corrupted when destructor is called.");
}
try
@@ -196,9 +196,6 @@ void BackupImpl::open()
if (open_mode == OpenMode::READ)
readBackupMetadata();
-
- if ((open_mode == OpenMode::WRITE) && base_backup_info)
- base_backup_uuid = getBaseBackupUnlocked()->getUUID();
}
void BackupImpl::close()
@@ -280,6 +277,8 @@ std::shared_ptr BackupImpl::getBaseBackupUnlocked() const
toString(base_backup->getUUID()),
(base_backup_uuid ? toString(*base_backup_uuid) : ""));
}
+
+ base_backup_uuid = base_backup->getUUID();
}
return base_backup;
}
@@ -369,7 +368,7 @@ void BackupImpl::writeBackupMetadata()
if (base_backup_in_use)
{
*out << "" << xml << base_backup_info->toString() << "";
- *out << "" << toString(*base_backup_uuid) << "";
+ *out << "" << getBaseBackupUnlocked()->getUUID() << "";
}
}
@@ -594,9 +593,6 @@ bool BackupImpl::checkLockFile(bool throw_if_failed) const
void BackupImpl::removeLockFile()
{
- if (is_internal_backup)
- return; /// Internal backup must not remove the lock file (it's still used by the initiator).
-
if (checkLockFile(false))
writer->removeFile(lock_file_name);
}
@@ -989,8 +985,11 @@ void BackupImpl::finalizeWriting()
if (open_mode != OpenMode::WRITE)
throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup is not opened for writing");
+ if (corrupted)
+ throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup can't be finalized after an error happened");
+
if (writing_finalized)
- throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup is already finalized");
+ return;
if (!is_internal_backup)
{
@@ -1015,20 +1014,58 @@ void BackupImpl::setCompressedSize()
}
-void BackupImpl::tryRemoveAllFiles()
+bool BackupImpl::setIsCorrupted() noexcept
{
- if (open_mode != OpenMode::WRITE)
- throw Exception(ErrorCodes::LOGICAL_ERROR, "Backup is not opened for writing");
-
- if (is_internal_backup)
- return;
-
try
{
- LOG_INFO(log, "Removing all files of backup {}", backup_name_for_logging);
+ std::lock_guard lock{mutex};
+ if (open_mode != OpenMode::WRITE)
+ {
+ LOG_ERROR(log, "Backup is not opened for writing. Stack trace: {}", StackTrace().toString());
+ chassert(false, "Backup is not opened for writing when setIsCorrupted() is called");
+ return false;
+ }
+
+ if (writing_finalized)
+ {
+ LOG_WARNING(log, "An error happened after the backup was completed successfully, the backup must be correct!");
+ return false;
+ }
+
+ if (corrupted)
+ return true;
+
+ LOG_WARNING(log, "An error happened, the backup won't be completed");
+
closeArchive(/* finalize= */ false);
+ corrupted = true;
+ return true;
+ }
+ catch (...)
+ {
+ DB::tryLogCurrentException(log, "Caught exception while setting that the backup was corrupted");
+ return false;
+ }
+}
+
+
+bool BackupImpl::tryRemoveAllFiles() noexcept
+{
+ try
+ {
+ std::lock_guard lock{mutex};
+ if (!corrupted)
+ {
+ LOG_ERROR(log, "Backup is not set as corrupted. Stack trace: {}", StackTrace().toString());
+ chassert(false, "Backup is not set as corrupted when tryRemoveAllFiles() is called");
+ return false;
+ }
+
+ LOG_INFO(log, "Removing all files of backup {}", backup_name_for_logging);
+
Strings files_to_remove;
+
if (use_archive)
{
files_to_remove.push_back(archive_params.archive_name);
@@ -1041,14 +1078,17 @@ void BackupImpl::tryRemoveAllFiles()
}
if (!checkLockFile(false))
- return;
+ return false;
writer->removeFiles(files_to_remove);
removeLockFile();
+ writer->removeEmptyDirectories();
+ return true;
}
catch (...)
{
- DB::tryLogCurrentException(__PRETTY_FUNCTION__);
+ DB::tryLogCurrentException(log, "Caught exception while removing files of a corrupted backup");
+ return false;
}
}
diff --git a/src/Backups/BackupImpl.h b/src/Backups/BackupImpl.h
index d7846104c4c..4b0f9f879ec 100644
--- a/src/Backups/BackupImpl.h
+++ b/src/Backups/BackupImpl.h
@@ -86,7 +86,8 @@ public:
void writeFile(const BackupFileInfo & info, BackupEntryPtr entry) override;
bool supportsWritingInMultipleThreads() const override { return !use_archive; }
void finalizeWriting() override;
- void tryRemoveAllFiles() override;
+ bool setIsCorrupted() noexcept override;
+ bool tryRemoveAllFiles() noexcept override;
private:
void open();
@@ -146,13 +147,14 @@ private:
int version;
mutable std::optional<BackupInfo> base_backup_info;
mutable std::shared_ptr<const IBackup> base_backup;
- std::optional<UUID> base_backup_uuid;
+ mutable std::optional<UUID> base_backup_uuid;
std::shared_ptr<IArchiveReader> archive_reader;
std::shared_ptr<IArchiveWriter> archive_writer;
String lock_file_name;
std::atomic<bool> lock_file_before_first_file_checked = false;
bool writing_finalized = false;
+ bool corrupted = false;
bool deduplicate_files = true;
bool use_same_s3_credentials_for_base_backup = false;
bool use_same_password_for_base_backup = false;
diff --git a/src/Backups/BackupKeeperSettings.cpp b/src/Backups/BackupKeeperSettings.cpp
new file mode 100644
index 00000000000..180633cea1f
--- /dev/null
+++ b/src/Backups/BackupKeeperSettings.cpp
@@ -0,0 +1,58 @@
+#include <Backups/BackupKeeperSettings.h>
+
+#include <Core/Settings.h>
+#include <Interpreters/Context.h>
+#include <Poco/Util/AbstractConfiguration.h>
+
+
+namespace DB
+{
+
+namespace Setting
+{
+ extern const SettingsUInt64 backup_restore_keeper_max_retries;
+ extern const SettingsUInt64 backup_restore_keeper_retry_initial_backoff_ms;
+ extern const SettingsUInt64 backup_restore_keeper_retry_max_backoff_ms;
+ extern const SettingsUInt64 backup_restore_failure_after_host_disconnected_for_seconds;
+ extern const SettingsUInt64 backup_restore_keeper_max_retries_while_initializing;
+ extern const SettingsUInt64 backup_restore_keeper_max_retries_while_handling_error;
+ extern const SettingsUInt64 backup_restore_finish_timeout_after_error_sec;
+ extern const SettingsUInt64 backup_restore_keeper_value_max_size;
+ extern const SettingsUInt64 backup_restore_batch_size_for_keeper_multi;
+ extern const SettingsUInt64 backup_restore_batch_size_for_keeper_multiread;
+ extern const SettingsFloat backup_restore_keeper_fault_injection_probability;
+ extern const SettingsUInt64 backup_restore_keeper_fault_injection_seed;
+}
+
+BackupKeeperSettings BackupKeeperSettings::fromContext(const ContextPtr & context)
+{
+ BackupKeeperSettings keeper_settings;
+
+ const auto & settings = context->getSettingsRef();
+ const auto & config = context->getConfigRef();
+
+ keeper_settings.max_retries = settings[Setting::backup_restore_keeper_max_retries];
+ keeper_settings.retry_initial_backoff_ms = std::chrono::milliseconds{settings[Setting::backup_restore_keeper_retry_initial_backoff_ms]};
+ keeper_settings.retry_max_backoff_ms = std::chrono::milliseconds{settings[Setting::backup_restore_keeper_retry_max_backoff_ms]};
+
+ keeper_settings.failure_after_host_disconnected_for_seconds = std::chrono::seconds{settings[Setting::backup_restore_failure_after_host_disconnected_for_seconds]};
+ keeper_settings.max_retries_while_initializing = settings[Setting::backup_restore_keeper_max_retries_while_initializing];
+ keeper_settings.max_retries_while_handling_error = settings[Setting::backup_restore_keeper_max_retries_while_handling_error];
+ keeper_settings.finish_timeout_after_error = std::chrono::seconds(settings[Setting::backup_restore_finish_timeout_after_error_sec]);
+
+ if (config.has("backups.sync_period_ms"))
+ keeper_settings.sync_period_ms = std::chrono::milliseconds{config.getUInt64("backups.sync_period_ms")};
+
+ if (config.has("backups.max_attempts_after_bad_version"))
+ keeper_settings.max_attempts_after_bad_version = config.getUInt64("backups.max_attempts_after_bad_version");
+
+ keeper_settings.value_max_size = settings[Setting::backup_restore_keeper_value_max_size];
+ keeper_settings.batch_size_for_multi = settings[Setting::backup_restore_batch_size_for_keeper_multi];
+ keeper_settings.batch_size_for_multiread = settings[Setting::backup_restore_batch_size_for_keeper_multiread];
+ keeper_settings.fault_injection_probability = settings[Setting::backup_restore_keeper_fault_injection_probability];
+ keeper_settings.fault_injection_seed = settings[Setting::backup_restore_keeper_fault_injection_seed];
+
+ return keeper_settings;
+}
+
+}
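The `fromContext` function above follows a defaults-plus-overrides pattern: struct defaults apply unless the server config provides a key. A minimal standalone sketch of that pattern (illustrative only; `FakeConfig` and `KeeperSettingsSketch` are made-up stand-ins for Poco's configuration object and `BackupKeeperSettings`):

```cpp
#include <chrono>
#include <map>
#include <string>

// Illustrative stand-in for the server config object; only the two
// accessors used by the pattern are modeled.
struct FakeConfig
{
    std::map<std::string, unsigned long long> values;
    bool has(const std::string & key) const { return values.count(key) != 0; }
    unsigned long long getUInt64(const std::string & key) const { return values.at(key); }
};

struct KeeperSettingsSketch
{
    std::chrono::milliseconds sync_period_ms{5000};
    unsigned long long max_attempts_after_bad_version = 10;

    // Same shape as BackupKeeperSettings::fromContext: the in-struct
    // defaults stay in effect unless the config provides an override.
    static KeeperSettingsSketch fromConfig(const FakeConfig & config)
    {
        KeeperSettingsSketch s;
        if (config.has("backups.sync_period_ms"))
            s.sync_period_ms = std::chrono::milliseconds{config.getUInt64("backups.sync_period_ms")};
        if (config.has("backups.max_attempts_after_bad_version"))
            s.max_attempts_after_bad_version = config.getUInt64("backups.max_attempts_after_bad_version");
        return s;
    }
};
```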
diff --git a/src/Backups/BackupKeeperSettings.h b/src/Backups/BackupKeeperSettings.h
new file mode 100644
index 00000000000..6c4b2187094
--- /dev/null
+++ b/src/Backups/BackupKeeperSettings.h
@@ -0,0 +1,64 @@
+#pragma once
+
+#include
+
+
+namespace DB
+{
+
+/// Settings for [Zoo]Keeper-related work during BACKUP or RESTORE.
+struct BackupKeeperSettings
+{
+ /// Maximum number of retries in the middle of a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation.
+ /// Should be big enough so the whole operation won't be cancelled in the middle of it because of a temporary ZooKeeper failure.
+ UInt64 max_retries{1000};
+
+ /// Initial backoff timeout for ZooKeeper operations during backup or restore.
+ std::chrono::milliseconds retry_initial_backoff_ms{100};
+
+ /// Max backoff timeout for ZooKeeper operations during backup or restore.
+ std::chrono::milliseconds retry_max_backoff_ms{5000};
+
+ /// If a host during BACKUP ON CLUSTER or RESTORE ON CLUSTER doesn't recreate its 'alive' node in ZooKeeper
+ /// for this amount of time then the whole backup or restore is considered as failed.
+ /// Should be bigger than any reasonable time for a host to reconnect to ZooKeeper after a failure.
+ /// Set to zero to disable (if it's zero and some host crashed then BACKUP ON CLUSTER or RESTORE ON CLUSTER will be waiting
+ /// for the crashed host forever until the operation is explicitly cancelled with KILL QUERY).
+ std::chrono::seconds failure_after_host_disconnected_for_seconds{3600};
+
+ /// Maximum number of retries during the initialization of a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation.
+ /// Shouldn't be too big because if the operation is going to fail then it's better if it fails faster.
+ UInt64 max_retries_while_initializing{20};
+
+ /// Maximum number of retries while handling an error of a BACKUP ON CLUSTER or RESTORE ON CLUSTER operation.
+ /// Shouldn't be too big because those retries are just for cleanup after the operation has failed already.
+ UInt64 max_retries_while_handling_error{20};
+
+ /// How long the initiator should wait for other hosts to handle the 'error' node and finish their work.
+ std::chrono::seconds finish_timeout_after_error{180};
+
+ /// How often the "stage" folder in ZooKeeper must be scanned in a background thread to track changes done by other hosts.
+ std::chrono::milliseconds sync_period_ms{5000};
+
+ /// Number of attempts after getting error ZBADVERSION from ZooKeeper.
+ size_t max_attempts_after_bad_version{10};
+
+ /// Maximum size of data of a ZooKeeper's node during backup.
+ UInt64 value_max_size{1048576};
+
+ /// Maximum size of a batch for a multi request.
+ UInt64 batch_size_for_multi{1000};
+
+ /// Maximum size of a batch for a multiread request.
+ UInt64 batch_size_for_multiread{10000};
+
+ /// Approximate probability of failure for a keeper request during backup or restore. Valid value is in interval [0.0f, 1.0f].
+ Float64 fault_injection_probability{0};
+
+ /// Seed for `fault_injection_probability`: 0 - random seed, otherwise the setting value.
+ UInt64 fault_injection_seed{0};
+
+ static BackupKeeperSettings fromContext(const ContextPtr & context);
+};
+
+}
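The three retry settings above (`max_retries`, `retry_initial_backoff_ms`, `retry_max_backoff_ms`) describe a capped exponential backoff. A minimal sketch of how such a schedule behaves (illustrative only; `backoffForAttempt` is a hypothetical helper, not a ClickHouse function):

```cpp
#include <algorithm>
#include <cstdint>

// Compute the backoff (in ms) before retry number `attempt`, doubling
// from `initial_ms` and never exceeding the `max_ms` cap.
inline uint64_t backoffForAttempt(uint64_t attempt, uint64_t initial_ms, uint64_t max_ms)
{
    uint64_t backoff = initial_ms;
    for (uint64_t i = 0; i < attempt && backoff < max_ms; ++i)
        backoff = std::min(backoff * 2, max_ms);
    return backoff;
}
```

With the defaults above (100 ms initial, 5000 ms cap), the schedule reaches the cap after a handful of retries and then stays flat for the remaining attempts.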
diff --git a/src/Backups/BackupSettings.cpp b/src/Backups/BackupSettings.cpp
index 9b8117c6587..915989735c3 100644
--- a/src/Backups/BackupSettings.cpp
+++ b/src/Backups/BackupSettings.cpp
@@ -74,6 +74,17 @@ BackupSettings BackupSettings::fromBackupQuery(const ASTBackupQuery & query)
return res;
}
+bool BackupSettings::isAsync(const ASTBackupQuery & query)
+{
+ if (query.settings)
+ {
+ const auto * field = query.settings->as<const ASTSetQuery &>().changes.tryGet("async");
+ if (field)
+ return field->safeGet<bool>();
+ }
+ return false; /// `async` is false by default.
+}
+
void BackupSettings::copySettingsToQuery(ASTBackupQuery & query) const
{
auto query_settings = std::make_shared<ASTSetQuery>();
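The `isAsync` helper added above scans the query's SETTINGS changes for an `async` flag and defaults to false. A standalone sketch of that lookup (`SettingChangeSketch` and `isAsyncSketch` are illustrative stand-ins, not the real `SettingChange` / `SettingsChanges` API):

```cpp
#include <string>
#include <vector>

// Illustrative stand-in for a setting change (name + bool value only).
struct SettingChangeSketch
{
    std::string name;
    bool value;
};

// Mirrors BackupSettings::isAsync: return the `async` change if present,
// otherwise fall back to the default.
inline bool isAsyncSketch(const std::vector<SettingChangeSketch> & changes)
{
    for (const auto & change : changes)
        if (change.name == "async")
            return change.value;
    return false; /// `async` is false by default.
}
```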
diff --git a/src/Backups/BackupSettings.h b/src/Backups/BackupSettings.h
index 8c2ea21df01..fa1e5025935 100644
--- a/src/Backups/BackupSettings.h
+++ b/src/Backups/BackupSettings.h
@@ -101,6 +101,8 @@ struct BackupSettings
static BackupSettings fromBackupQuery(const ASTBackupQuery & query);
void copySettingsToQuery(ASTBackupQuery & query) const;
+ static bool isAsync(const ASTBackupQuery & query);
+
struct Util
{
static std::vector<Strings> clusterHostIDsFromAST(const IAST & ast);
diff --git a/src/Backups/BackupsWorker.cpp b/src/Backups/BackupsWorker.cpp
index d3889295598..8480dc5d64d 100644
--- a/src/Backups/BackupsWorker.cpp
+++ b/src/Backups/BackupsWorker.cpp
@@ -1,4 +1,6 @@
#include
+
+#include
#include
#include
#include
@@ -6,9 +8,9 @@
#include
#include
#include
-#include
+#include
#include
-#include
+#include
#include
#include
#include
@@ -43,21 +45,11 @@ namespace CurrentMetrics
namespace DB
{
-namespace Setting
-{
- extern const SettingsUInt64 backup_restore_batch_size_for_keeper_multiread;
- extern const SettingsUInt64 backup_restore_keeper_max_retries;
- extern const SettingsUInt64 backup_restore_keeper_retry_initial_backoff_ms;
- extern const SettingsUInt64 backup_restore_keeper_retry_max_backoff_ms;
- extern const SettingsUInt64 backup_restore_keeper_fault_injection_seed;
- extern const SettingsFloat backup_restore_keeper_fault_injection_probability;
-}
namespace ErrorCodes
{
extern const int BAD_ARGUMENTS;
extern const int LOGICAL_ERROR;
- extern const int CONCURRENT_ACCESS_NOT_SUPPORTED;
extern const int QUERY_WAS_CANCELLED;
}
@@ -66,102 +58,6 @@ namespace Stage = BackupCoordinationStage;
namespace
{
- std::shared_ptr<IBackupCoordination> makeBackupCoordination(const ContextPtr & context, const BackupSettings & backup_settings, bool remote)
- {
- if (remote)
- {
- String root_zk_path = context->getConfigRef().getString("backups.zookeeper_path", "/clickhouse/backups");
-
- auto get_zookeeper = [global_context = context->getGlobalContext()] { return global_context->getZooKeeper(); };
-
- BackupCoordinationRemote::BackupKeeperSettings keeper_settings = WithRetries::KeeperSettings::fromContext(context);
-
- auto all_hosts = BackupSettings::Util::filterHostIDs(
- backup_settings.cluster_host_ids, backup_settings.shard_num, backup_settings.replica_num);
-
- return std::make_shared<BackupCoordinationRemote>(
- get_zookeeper,
- root_zk_path,
- keeper_settings,
- toString(*backup_settings.backup_uuid),
- all_hosts,
- backup_settings.host_id,
- !backup_settings.deduplicate_files,
- backup_settings.internal,
- context->getProcessListElement());
- }
-
- return std::make_shared<BackupCoordinationLocal>(!backup_settings.deduplicate_files);
- }
-
- std::shared_ptr<IRestoreCoordination>
- makeRestoreCoordination(const ContextPtr & context, const RestoreSettings & restore_settings, bool remote)
- {
- if (remote)
- {
- String root_zk_path = context->getConfigRef().getString("backups.zookeeper_path", "/clickhouse/backups");
-
- auto get_zookeeper = [global_context = context->getGlobalContext()] { return global_context->getZooKeeper(); };
-
- RestoreCoordinationRemote::RestoreKeeperSettings keeper_settings
- {
- .keeper_max_retries = context->getSettingsRef()[Setting::backup_restore_keeper_max_retries],
- .keeper_retry_initial_backoff_ms = context->getSettingsRef()[Setting::backup_restore_keeper_retry_initial_backoff_ms],
- .keeper_retry_max_backoff_ms = context->getSettingsRef()[Setting::backup_restore_keeper_retry_max_backoff_ms],
- .batch_size_for_keeper_multiread = context->getSettingsRef()[Setting::backup_restore_batch_size_for_keeper_multiread],
- .keeper_fault_injection_probability = context->getSettingsRef()[Setting::backup_restore_keeper_fault_injection_probability],
- .keeper_fault_injection_seed = context->getSettingsRef()[Setting::backup_restore_keeper_fault_injection_seed]
- };
-
- auto all_hosts = BackupSettings::Util::filterHostIDs(
- restore_settings.cluster_host_ids, restore_settings.shard_num, restore_settings.replica_num);
-
- return std::make_shared<RestoreCoordinationRemote>(
- get_zookeeper,
- root_zk_path,
- keeper_settings,
- toString(*restore_settings.restore_uuid),
- all_hosts,
- restore_settings.host_id,
- restore_settings.internal,
- context->getProcessListElement());
- }
-
- return std::make_shared<RestoreCoordinationLocal>();
- }
-
- /// Sends information about an exception to IBackupCoordination or IRestoreCoordination.
- template <typename CoordinationType>
- void sendExceptionToCoordination(std::shared_ptr<CoordinationType> coordination, const Exception & exception)
- {
- try
- {
- if (coordination)
- coordination->setError(exception);
- }
- catch (...) // NOLINT(bugprone-empty-catch)
- {
- }
- }
-
- /// Sends information about the current exception to IBackupCoordination or IRestoreCoordination.
- template <typename CoordinationType>
- void sendCurrentExceptionToCoordination(std::shared_ptr<CoordinationType> coordination)
- {
- try
- {
- throw;
- }
- catch (const Exception & e)
- {
- sendExceptionToCoordination(coordination, e);
- }
- catch (...)
- {
- sendExceptionToCoordination(coordination, Exception(getCurrentExceptionMessageAndPattern(true, true), getCurrentExceptionCode()));
- }
- }
-
bool isFinishedSuccessfully(BackupStatus status)
{
return (status == BackupStatus::BACKUP_CREATED) || (status == BackupStatus::RESTORED);
@@ -262,24 +158,27 @@ namespace
/// while the thread pool is still occupied with the waiting task then a scheduled task can be never executed).
enum class BackupsWorker::ThreadPoolId : uint8_t
{
- /// "BACKUP ON CLUSTER ASYNC" waits in background while "BACKUP ASYNC" is finished on the nodes of the cluster, then finalizes the backup.
- BACKUP_ASYNC_ON_CLUSTER = 0,
+ /// Making a list of files to copy or copying those files.
+ BACKUP,
- /// "BACKUP ASYNC" waits in background while all file infos are built and then it copies the backup's files.
- BACKUP_ASYNC = 1,
+ /// Creation of tables and databases during RESTORE and filling them with data.
+ RESTORE,
- /// Making a list of files to copy and copying of those files is always sequential, so those operations can share one thread pool.
- BACKUP_MAKE_FILES_LIST = 2,
- BACKUP_COPY_FILES = BACKUP_MAKE_FILES_LIST,
+ /// We need background threads for ASYNC backups and restores.
+ ASYNC_BACKGROUND_BACKUP,
+ ASYNC_BACKGROUND_RESTORE,
- /// "RESTORE ON CLUSTER ASYNC" waits in background while "BACKUP ASYNC" is finished on the nodes of the cluster, then finalizes the backup.
- RESTORE_ASYNC_ON_CLUSTER = 3,
+ /// We need background threads for coordination workers (see BackgroundCoordinationStageSync).
+ ON_CLUSTER_COORDINATION_BACKUP,
+ ON_CLUSTER_COORDINATION_RESTORE,
- /// "RESTORE ASYNC" waits in background while the data of all tables are restored.
- RESTORE_ASYNC = 4,
-
- /// Restores from backups.
- RESTORE = 5,
+ /// We need separate threads for internal backups and restores.
+ /// An internal backup is a helper backup invoked on some shard and replica by a BACKUP ON CLUSTER command,
+ /// (see BackupSettings.internal); and the same for restores.
+ ASYNC_BACKGROUND_INTERNAL_BACKUP,
+ ASYNC_BACKGROUND_INTERNAL_RESTORE,
+ ON_CLUSTER_COORDINATION_INTERNAL_BACKUP,
+ ON_CLUSTER_COORDINATION_INTERNAL_RESTORE,
};
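The ordering of these pool IDs matters for shutdown: a pool whose tasks schedule work into another pool must be drained first, otherwise shutdown could block on a scheduled task that never runs. That invariant can be checked with a small illustrative helper (not ClickHouse code; `PoolId` and `drainedBefore` are made-up names):

```cpp
#include <algorithm>
#include <vector>

// Simplified pool identifiers mirroring the enum's backup/restore split.
enum class PoolId { Backup, Restore, AsyncBackup, AsyncRestore, AsyncInternalBackup, AsyncInternalRestore };

// True when `first` appears before `second` in the drain order, i.e. the
// scheduling pool is shut down before the pool it schedules into.
inline bool drainedBefore(const std::vector<PoolId> & order, PoolId first, PoolId second)
{
    auto a = std::find(order.begin(), order.end(), first);
    auto b = std::find(order.begin(), order.end(), second);
    return a != order.end() && b != order.end() && a < b;
}
```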
@@ -312,22 +211,26 @@ public:
switch (thread_pool_id)
{
- case ThreadPoolId::BACKUP_ASYNC:
- case ThreadPoolId::BACKUP_ASYNC_ON_CLUSTER:
- case ThreadPoolId::BACKUP_COPY_FILES:
+ case ThreadPoolId::BACKUP:
+ case ThreadPoolId::ASYNC_BACKGROUND_BACKUP:
+ case ThreadPoolId::ON_CLUSTER_COORDINATION_BACKUP:
+ case ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_BACKUP:
+ case ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_BACKUP:
{
metric_threads = CurrentMetrics::BackupsThreads;
metric_active_threads = CurrentMetrics::BackupsThreadsActive;
metric_scheduled_threads = CurrentMetrics::BackupsThreadsScheduled;
max_threads = num_backup_threads;
/// We don't use thread pool queues for thread pools with a lot of tasks otherwise that queue could be memory-wasting.
- use_queue = (thread_pool_id != ThreadPoolId::BACKUP_COPY_FILES);
+ use_queue = (thread_pool_id != ThreadPoolId::BACKUP);
break;
}
- case ThreadPoolId::RESTORE_ASYNC:
- case ThreadPoolId::RESTORE_ASYNC_ON_CLUSTER:
case ThreadPoolId::RESTORE:
+ case ThreadPoolId::ASYNC_BACKGROUND_RESTORE:
+ case ThreadPoolId::ON_CLUSTER_COORDINATION_RESTORE:
+ case ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_RESTORE:
+ case ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_RESTORE:
{
metric_threads = CurrentMetrics::RestoreThreads;
metric_active_threads = CurrentMetrics::RestoreThreadsActive;
@@ -352,12 +255,20 @@ public:
void wait()
{
auto wait_sequence = {
- ThreadPoolId::RESTORE_ASYNC_ON_CLUSTER,
- ThreadPoolId::RESTORE_ASYNC,
+ /// ASYNC_BACKGROUND_BACKUP must be before ASYNC_BACKGROUND_INTERNAL_BACKUP,
+ /// ASYNC_BACKGROUND_RESTORE must be before ASYNC_BACKGROUND_INTERNAL_RESTORE,
+ /// and everything else is after those ones.
+ ThreadPoolId::ASYNC_BACKGROUND_BACKUP,
+ ThreadPoolId::ASYNC_BACKGROUND_RESTORE,
+ ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_BACKUP,
+ ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_RESTORE,
+ /// Others:
+ ThreadPoolId::BACKUP,
ThreadPoolId::RESTORE,
- ThreadPoolId::BACKUP_ASYNC_ON_CLUSTER,
- ThreadPoolId::BACKUP_ASYNC,
- ThreadPoolId::BACKUP_COPY_FILES,
+ ThreadPoolId::ON_CLUSTER_COORDINATION_BACKUP,
+ ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_BACKUP,
+ ThreadPoolId::ON_CLUSTER_COORDINATION_RESTORE,
+ ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_RESTORE,
};
for (auto thread_pool_id : wait_sequence)
@@ -392,6 +303,7 @@ BackupsWorker::BackupsWorker(ContextMutablePtr global_context, size_t num_backup
, log(getLogger("BackupsWorker"))
, backup_log(global_context->getBackupLog())
, process_list(global_context->getProcessList())
+ , concurrency_counters(std::make_unique<BackupConcurrencyCounters>())
{
}
@@ -405,7 +317,7 @@ ThreadPool & BackupsWorker::getThreadPool(ThreadPoolId thread_pool_id)
}
-OperationID BackupsWorker::start(const ASTPtr & backup_or_restore_query, ContextMutablePtr context)
+std::pair<OperationID, BackupStatus> BackupsWorker::start(const ASTPtr & backup_or_restore_query, ContextMutablePtr context)
{
const ASTBackupQuery & backup_query = typeid_cast<const ASTBackupQuery &>(*backup_or_restore_query);
if (backup_query.kind == ASTBackupQuery::Kind::BACKUP)
@@ -414,180 +326,147 @@ OperationID BackupsWorker::start(const ASTPtr & backup_or_restore_query, Context
}
-OperationID BackupsWorker::startMakingBackup(const ASTPtr & query, const ContextPtr & context)
+struct BackupsWorker::BackupStarter
{
- auto backup_query = std::static_pointer_cast<ASTBackupQuery>(query->clone());
- auto backup_settings = BackupSettings::fromBackupQuery(*backup_query);
-
- auto backup_info = BackupInfo::fromAST(*backup_query->backup_name);
- String backup_name_for_logging = backup_info.toStringForLogging();
-
- if (!backup_settings.backup_uuid)
- backup_settings.backup_uuid = UUIDHelpers::generateV4();
-
- /// `backup_id` will be used as a key to the `infos` map, so it should be unique.
- OperationID backup_id;
- if (backup_settings.internal)
- backup_id = "internal-" + toString(UUIDHelpers::generateV4()); /// Always generate `backup_id` for internal backup to avoid collision if both internal and non-internal backups are on the same host
- else if (!backup_settings.id.empty())
- backup_id = backup_settings.id;
- else
- backup_id = toString(*backup_settings.backup_uuid);
-
+ BackupsWorker & backups_worker;
+ std::shared_ptr<ASTBackupQuery> backup_query;
+ ContextPtr query_context; /// We have to keep `query_context` until the end of the operation because a pointer to it is stored inside the ThreadGroup we're using.
+ ContextMutablePtr backup_context;
+ BackupSettings backup_settings;
+ BackupInfo backup_info;
+ String backup_id;
+ String backup_name_for_logging;
+ bool on_cluster;
+ bool is_internal_backup;
std::shared_ptr<IBackupCoordination> backup_coordination;
+ ClusterPtr cluster;
BackupMutablePtr backup;
+ std::shared_ptr<ProcessListEntry> process_list_element_holder;
- /// Called in exception handlers below. This lambda function can be called on a separate thread, so it can't capture local variables by reference.
- auto on_exception = [this](BackupMutablePtr & backup_, const OperationID & backup_id_, const String & backup_name_for_logging_,
- const BackupSettings & backup_settings_, const std::shared_ptr<IBackupCoordination> & backup_coordination_)
+ BackupStarter(BackupsWorker & backups_worker_, const ASTPtr & query_, const ContextPtr & context_)
+ : backups_worker(backups_worker_)
+ , backup_query(std::static_pointer_cast<ASTBackupQuery>(query_->clone()))
+ , query_context(context_)
+ , backup_context(Context::createCopy(query_context))
{
- /// Something bad happened, the backup has not built.
- tryLogCurrentException(log, fmt::format("Failed to make {} {}", (backup_settings_.internal ? "internal backup" : "backup"), backup_name_for_logging_));
- setStatusSafe(backup_id_, getBackupStatusFromCurrentException());
- sendCurrentExceptionToCoordination(backup_coordination_);
+ backup_context->makeQueryContext();
+ backup_settings = BackupSettings::fromBackupQuery(*backup_query);
+ backup_info = BackupInfo::fromAST(*backup_query->backup_name);
+ backup_name_for_logging = backup_info.toStringForLogging();
+ is_internal_backup = backup_settings.internal;
+ on_cluster = !backup_query->cluster.empty() || is_internal_backup;
- if (backup_ && remove_backup_files_after_failure)
- backup_->tryRemoveAllFiles();
- backup_.reset();
- };
+ if (!backup_settings.backup_uuid)
+ backup_settings.backup_uuid = UUIDHelpers::generateV4();
+
+ /// `backup_id` will be used as a key to the `infos` map, so it should be unique.
+ if (is_internal_backup)
+ backup_id = "internal-" + toString(UUIDHelpers::generateV4()); /// Always generate `backup_id` for internal backup to avoid collision if both internal and non-internal backups are on the same host
+ else if (!backup_settings.id.empty())
+ backup_id = backup_settings.id;
+ else
+ backup_id = toString(*backup_settings.backup_uuid);
- try
- {
String base_backup_name;
if (backup_settings.base_backup_info)
base_backup_name = backup_settings.base_backup_info->toStringForLogging();
- addInfo(backup_id,
+ /// process_list_element_holder is used to make an element in ProcessList live while BACKUP is working asynchronously.
+ auto process_list_element = backup_context->getProcessListElement();
+ if (process_list_element)
+ process_list_element_holder = process_list_element->getProcessListEntry();
+
+ backups_worker.addInfo(backup_id,
backup_name_for_logging,
base_backup_name,
- context->getCurrentQueryId(),
- backup_settings.internal,
- context->getProcessListElement(),
+ backup_context->getCurrentQueryId(),
+ is_internal_backup,
+ process_list_element,
BackupStatus::CREATING_BACKUP);
+ }
- if (backup_settings.internal)
+ void doBackup()
+ {
+ chassert(!backup_coordination);
+ if (on_cluster && !is_internal_backup)
{
- /// The following call of makeBackupCoordination() is not essential because doBackup() will later create a backup coordination
- /// if it's not created here. However to handle errors better it's better to make a coordination here because this way
- /// if an exception will be thrown in startMakingBackup() other hosts will know about that.
- backup_coordination = makeBackupCoordination(context, backup_settings, /* remote= */ true);
+ backup_query->cluster = backup_context->getMacros()->expand(backup_query->cluster);
+ cluster = backup_context->getCluster(backup_query->cluster);
+ backup_settings.cluster_host_ids = cluster->getHostIDs();
+ }
+ backup_coordination = backups_worker.makeBackupCoordination(on_cluster, backup_settings, backup_context);
+
+ chassert(!backup);
+ backup = backups_worker.openBackupForWriting(backup_info, backup_settings, backup_coordination, backup_context);
+
+ backups_worker.doBackup(
+ backup, backup_query, backup_id, backup_name_for_logging, backup_settings, backup_coordination, backup_context,
+ on_cluster, cluster);
+ }
+
+ void onException()
+ {
+ /// Something bad happened, the backup has not been built.
+ tryLogCurrentException(backups_worker.log, fmt::format("Failed to make {} {}",
+ (is_internal_backup ? "internal backup" : "backup"),
+ backup_name_for_logging));
+
+ bool should_remove_files_in_backup = backup && !is_internal_backup && backups_worker.remove_backup_files_after_failure;
+
+ if (backup && !backup->setIsCorrupted())
+ should_remove_files_in_backup = false;
+
+ if (backup_coordination && backup_coordination->trySetError(std::current_exception()))
+ {
+ bool other_hosts_finished = backup_coordination->tryWaitForOtherHostsToFinishAfterError();
+
+ if (should_remove_files_in_backup && other_hosts_finished)
+ backup->tryRemoveAllFiles();
+
+ backup_coordination->tryFinishAfterError();
}
- /// Prepare context to use.
- ContextPtr context_in_use = context;
- ContextMutablePtr mutable_context;
- bool on_cluster = !backup_query->cluster.empty();
- if (on_cluster || backup_settings.async)
- {
- /// We have to clone the query context here because:
- /// if this is an "ON CLUSTER" query we need to change some settings, and
- /// if this is an "ASYNC" query it's going to be executed in another thread.
- context_in_use = mutable_context = Context::createCopy(context);
- mutable_context->makeQueryContext();
- }
+ backups_worker.setStatusSafe(backup_id, getBackupStatusFromCurrentException());
+ }
+};
- if (backup_settings.async)
- {
- auto & thread_pool = getThreadPool(on_cluster ? ThreadPoolId::BACKUP_ASYNC_ON_CLUSTER : ThreadPoolId::BACKUP_ASYNC);
- /// process_list_element_holder is used to make an element in ProcessList live while BACKUP is working asynchronously.
- auto process_list_element = context_in_use->getProcessListElement();
+std::pair<OperationID, BackupStatus> BackupsWorker::startMakingBackup(const ASTPtr & query, const ContextPtr & context)
+{
+ auto starter = std::make_shared<BackupStarter>(*this, query, context);
- thread_pool.scheduleOrThrowOnError(
- [this,
- backup_query,
- backup_id,
- backup_name_for_logging,
- backup_info,
- backup_settings,
- backup_coordination,
- context_in_use,
- mutable_context,
- on_exception,
- process_list_element_holder = process_list_element ? process_list_element->getProcessListEntry() : nullptr]
+ try
+ {
+ auto thread_pool_id = starter->is_internal_backup ? ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_BACKUP : ThreadPoolId::ASYNC_BACKGROUND_BACKUP;
+ String thread_name = starter->is_internal_backup ? "BackupAsyncInt" : "BackupAsync";
+ auto schedule = threadPoolCallbackRunnerUnsafe<void>(thread_pools->getThreadPool(thread_pool_id), thread_name);
+
+ schedule([starter]
+ {
+ try
{
- BackupMutablePtr backup_async;
- try
- {
- setThreadName("BackupWorker");
- CurrentThread::QueryScope query_scope(context_in_use);
- doBackup(
- backup_async,
- backup_query,
- backup_id,
- backup_name_for_logging,
- backup_info,
- backup_settings,
- backup_coordination,
- context_in_use,
- mutable_context);
- }
- catch (...)
- {
- on_exception(backup_async, backup_id, backup_name_for_logging, backup_settings, backup_coordination);
- }
- });
- }
- else
- {
- doBackup(
- backup,
- backup_query,
- backup_id,
- backup_name_for_logging,
- backup_info,
- backup_settings,
- backup_coordination,
- context_in_use,
- mutable_context);
- }
+ starter->doBackup();
+ }
+ catch (...)
+ {
+ starter->onException();
+ }
+ },
+ Priority{});
- return backup_id;
+ return {starter->backup_id, BackupStatus::CREATING_BACKUP};
}
catch (...)
{
- on_exception(backup, backup_id, backup_name_for_logging, backup_settings, backup_coordination);
+ starter->onException();
throw;
}
}
-void BackupsWorker::doBackup(
- BackupMutablePtr & backup,
- const std::shared_ptr & backup_query,
- const OperationID & backup_id,
- const String & backup_name_for_logging,
- const BackupInfo & backup_info,
- BackupSettings backup_settings,
- std::shared_ptr backup_coordination,
- const ContextPtr & context,
- ContextMutablePtr mutable_context)
+BackupMutablePtr BackupsWorker::openBackupForWriting(const BackupInfo & backup_info, const BackupSettings & backup_settings, std::shared_ptr<IBackupCoordination> backup_coordination, const ContextPtr & context) const
{
- bool on_cluster = !backup_query->cluster.empty();
- assert(!on_cluster || mutable_context);
-
- /// Checks access rights if this is not ON CLUSTER query.
- /// (If this is ON CLUSTER query executeDDLQueryOnCluster() will check access rights later.)
- auto required_access = BackupUtils::getRequiredAccessToBackup(backup_query->elements);
- if (!on_cluster)
- context->checkAccess(required_access);
-
- ClusterPtr cluster;
- if (on_cluster)
- {
- backup_query->cluster = context->getMacros()->expand(backup_query->cluster);
- cluster = context->getCluster(backup_query->cluster);
- backup_settings.cluster_host_ids = cluster->getHostIDs();
- }
-
- /// Make a backup coordination.
- if (!backup_coordination)
- backup_coordination = makeBackupCoordination(context, backup_settings, /* remote= */ on_cluster);
-
- if (!allow_concurrent_backups && backup_coordination->hasConcurrentBackups(std::ref(num_active_backups)))
- throw Exception(ErrorCodes::CONCURRENT_ACCESS_NOT_SUPPORTED, "Concurrent backups not supported, turn on setting 'allow_concurrent_backups'");
-
- /// Opens a backup for writing.
+ LOG_TRACE(log, "Opening backup for writing");
BackupFactory::CreateParams backup_create_params;
backup_create_params.open_mode = IBackup::OpenMode::WRITE;
backup_create_params.context = context;
@@ -608,37 +487,57 @@ void BackupsWorker::doBackup(
backup_create_params.azure_attempt_to_create_container = backup_settings.azure_attempt_to_create_container;
backup_create_params.read_settings = getReadSettingsForBackup(context, backup_settings);
backup_create_params.write_settings = getWriteSettingsForBackup(context);
- backup = BackupFactory::instance().createBackup(backup_create_params);
+ auto backup = BackupFactory::instance().createBackup(backup_create_params);
+ LOG_INFO(log, "Opened backup for writing");
+ return backup;
+}
+
+
+void BackupsWorker::doBackup(
+ BackupMutablePtr backup,
+ const std::shared_ptr & backup_query,
+ const OperationID & backup_id,
+ const String & backup_name_for_logging,
+ const BackupSettings & backup_settings,
+ std::shared_ptr<IBackupCoordination> backup_coordination,
+ ContextMutablePtr context,
+ bool on_cluster,
+ const ClusterPtr & cluster)
+{
+ bool is_internal_backup = backup_settings.internal;
+
+ /// Checks access rights if this is not ON CLUSTER query.
+ /// (If this is ON CLUSTER query executeDDLQueryOnCluster() will check access rights later.)
+ auto required_access = BackupUtils::getRequiredAccessToBackup(backup_query->elements);
+ if (!on_cluster)
+ context->checkAccess(required_access);
+
+ maybeSleepForTesting();
/// Write the backup.
- if (on_cluster)
+ if (on_cluster && !is_internal_backup)
{
- DDLQueryOnClusterParams params;
- params.cluster = cluster;
- params.only_shard_num = backup_settings.shard_num;
- params.only_replica_num = backup_settings.replica_num;
- params.access_to_check = required_access;
+ /// Send the BACKUP query to other hosts.
backup_settings.copySettingsToQuery(*backup_query);
-
- // executeDDLQueryOnCluster() will return without waiting for completion
- mutable_context->setSetting("distributed_ddl_task_timeout", Field{0});
- mutable_context->setSetting("distributed_ddl_output_mode", Field{"none"});
- executeDDLQueryOnCluster(backup_query, mutable_context, params);
+ sendQueryToOtherHosts(*backup_query, cluster, backup_settings.shard_num, backup_settings.replica_num,
+ context, required_access, backup_coordination->getOnClusterInitializationKeeperRetriesInfo());
+ backup_coordination->setBackupQueryWasSentToOtherHosts();
/// Wait until all the hosts have written their backup entries.
- backup_coordination->waitForStage(Stage::COMPLETED);
- backup_coordination->setStage(Stage::COMPLETED,"");
+ backup_coordination->waitForOtherHostsToFinish();
}
else
{
backup_query->setCurrentDatabase(context->getCurrentDatabase());
+ auto read_settings = getReadSettingsForBackup(context, backup_settings);
+
/// Prepare backup entries.
BackupEntries backup_entries;
{
BackupEntriesCollector backup_entries_collector(
backup_query->elements, backup_settings, backup_coordination,
- backup_create_params.read_settings, context, getThreadPool(ThreadPoolId::BACKUP_MAKE_FILES_LIST));
+ read_settings, context, getThreadPool(ThreadPoolId::BACKUP));
backup_entries = backup_entries_collector.run();
}
@@ -646,11 +545,11 @@ void BackupsWorker::doBackup(
chassert(backup);
chassert(backup_coordination);
chassert(context);
- buildFileInfosForBackupEntries(backup, backup_entries, backup_create_params.read_settings, backup_coordination, context->getProcessListElement());
- writeBackupEntries(backup, std::move(backup_entries), backup_id, backup_coordination, backup_settings.internal, context->getProcessListElement());
+ buildFileInfosForBackupEntries(backup, backup_entries, read_settings, backup_coordination, context->getProcessListElement());
+ writeBackupEntries(backup, std::move(backup_entries), backup_id, backup_coordination, is_internal_backup, context->getProcessListElement());
- /// We have written our backup entries, we need to tell other hosts (they could be waiting for it).
- backup_coordination->setStage(Stage::COMPLETED,"");
+ /// We have written our backup entries (there is no need to sync it with other hosts because it's the last stage).
+ backup_coordination->setStage(Stage::COMPLETED, "", /* sync = */ false);
}
size_t num_files = 0;
@@ -660,9 +559,9 @@ void BackupsWorker::doBackup(
UInt64 compressed_size = 0;
/// Finalize backup (write its metadata).
- if (!backup_settings.internal)
+ backup->finalizeWriting();
+ if (!is_internal_backup)
{
- backup->finalizeWriting();
num_files = backup->getNumFiles();
total_size = backup->getTotalSize();
num_entries = backup->getNumEntries();
@@ -673,19 +572,22 @@ void BackupsWorker::doBackup(
/// Close the backup.
backup.reset();
- LOG_INFO(log, "{} {} was created successfully", (backup_settings.internal ? "Internal backup" : "Backup"), backup_name_for_logging);
+ /// The backup coordination is not needed anymore.
+ backup_coordination->finish();
+
/// NOTE: we need to update metadata again after backup->finalizeWriting(), because backup metadata is written there.
setNumFilesAndSize(backup_id, num_files, total_size, num_entries, uncompressed_size, compressed_size, 0, 0);
+
/// NOTE: setStatus is called after setNumFilesAndSize in order to have actual information in a backup log record
+ LOG_INFO(log, "{} {} was created successfully", (is_internal_backup ? "Internal backup" : "Backup"), backup_name_for_logging);
setStatus(backup_id, BackupStatus::BACKUP_CREATED);
}
void BackupsWorker::buildFileInfosForBackupEntries(const BackupPtr & backup, const BackupEntries & backup_entries, const ReadSettings & read_settings, std::shared_ptr<IBackupCoordination> backup_coordination, QueryStatusPtr process_list_element)
{
- backup_coordination->setStage(Stage::BUILDING_FILE_INFOS, "");
- backup_coordination->waitForStage(Stage::BUILDING_FILE_INFOS);
- backup_coordination->addFileInfos(::DB::buildFileInfosForBackupEntries(backup_entries, backup->getBaseBackup(), read_settings, getThreadPool(ThreadPoolId::BACKUP_MAKE_FILES_LIST), process_list_element));
+ backup_coordination->setStage(Stage::BUILDING_FILE_INFOS, "", /* sync = */ true);
+ backup_coordination->addFileInfos(::DB::buildFileInfosForBackupEntries(backup_entries, backup->getBaseBackup(), read_settings, getThreadPool(ThreadPoolId::BACKUP), process_list_element));
}
@@ -694,12 +596,11 @@ void BackupsWorker::writeBackupEntries(
BackupEntries && backup_entries,
const OperationID & backup_id,
std::shared_ptr<IBackupCoordination> backup_coordination,
- bool internal,
+ bool is_internal_backup,
QueryStatusPtr process_list_element)
{
LOG_TRACE(log, "{}, num backup entries={}", Stage::WRITING_BACKUP, backup_entries.size());
- backup_coordination->setStage(Stage::WRITING_BACKUP, "");
- backup_coordination->waitForStage(Stage::WRITING_BACKUP);
+ backup_coordination->setStage(Stage::WRITING_BACKUP, "", /* sync = */ true);
auto file_infos = backup_coordination->getFileInfos();
if (file_infos.size() != backup_entries.size())
@@ -715,7 +616,7 @@ void BackupsWorker::writeBackupEntries(
std::atomic_bool failed = false;
bool always_single_threaded = !backup->supportsWritingInMultipleThreads();
- auto & thread_pool = getThreadPool(ThreadPoolId::BACKUP_COPY_FILES);
+ auto & thread_pool = getThreadPool(ThreadPoolId::BACKUP);
std::vector<size_t> writing_order;
if (test_randomize_order)
@@ -751,7 +652,7 @@ void BackupsWorker::writeBackupEntries(
maybeSleepForTesting();
// Update metadata
- if (!internal)
+ if (!is_internal_backup)
{
setNumFilesAndSize(
backup_id,
@@ -783,142 +684,139 @@ void BackupsWorker::writeBackupEntries(
}
-OperationID BackupsWorker::startRestoring(const ASTPtr & query, ContextMutablePtr context)
+struct BackupsWorker::RestoreStarter
{
- auto restore_query = std::static_pointer_cast<ASTBackupQuery>(query->clone());
- auto restore_settings = RestoreSettings::fromRestoreQuery(*restore_query);
-
- auto backup_info = BackupInfo::fromAST(*restore_query->backup_name);
- String backup_name_for_logging = backup_info.toStringForLogging();
-
- if (!restore_settings.restore_uuid)
- restore_settings.restore_uuid = UUIDHelpers::generateV4();
-
- /// `restore_id` will be used as a key to the `infos` map, so it should be unique.
- OperationID restore_id;
- if (restore_settings.internal)
- restore_id = "internal-" + toString(UUIDHelpers::generateV4()); /// Always generate `restore_id` for internal restore to avoid collision if both internal and non-internal restores are on the same host
- else if (!restore_settings.id.empty())
- restore_id = restore_settings.id;
- else
- restore_id = toString(*restore_settings.restore_uuid);
-
+ BackupsWorker & backups_worker;
+ std::shared_ptr<ASTBackupQuery> restore_query;
+ ContextPtr query_context; /// We have to keep `query_context` until the end of the operation because a pointer to it is stored inside the ThreadGroup we're using.
+ ContextMutablePtr restore_context;
+ RestoreSettings restore_settings;
+ BackupInfo backup_info;
+ String restore_id;
+ String backup_name_for_logging;
+ bool on_cluster;
+ bool is_internal_restore;
std::shared_ptr<IRestoreCoordination> restore_coordination;
+ ClusterPtr cluster;
+ std::shared_ptr<ProcessListEntry> process_list_element_holder;
- /// Called in exception handlers below. This lambda function can be called on a separate thread, so it can't capture local variables by reference.
- auto on_exception = [this](const OperationID & restore_id_, const String & backup_name_for_logging_,
- const RestoreSettings & restore_settings_, const std::shared_ptr<IRestoreCoordination> & restore_coordination_)
+ RestoreStarter(BackupsWorker & backups_worker_, const ASTPtr & query_, const ContextPtr & context_)
+ : backups_worker(backups_worker_)
+ , restore_query(std::static_pointer_cast<ASTBackupQuery>(query_->clone()))
+ , query_context(context_)
+ , restore_context(Context::createCopy(query_context))
{
- /// Something bad happened, some data were not restored.
- tryLogCurrentException(log, fmt::format("Failed to restore from {} {}", (restore_settings_.internal ? "internal backup" : "backup"), backup_name_for_logging_));
- setStatusSafe(restore_id_, getRestoreStatusFromCurrentException());
- sendCurrentExceptionToCoordination(restore_coordination_);
- };
+ restore_context->makeQueryContext();
+ restore_settings = RestoreSettings::fromRestoreQuery(*restore_query);
+ backup_info = BackupInfo::fromAST(*restore_query->backup_name);
+ backup_name_for_logging = backup_info.toStringForLogging();
+ is_internal_restore = restore_settings.internal;
+ on_cluster = !restore_query->cluster.empty() || is_internal_restore;
+
+ if (!restore_settings.restore_uuid)
+ restore_settings.restore_uuid = UUIDHelpers::generateV4();
+
+ /// `restore_id` will be used as a key to the `infos` map, so it should be unique.
+ if (is_internal_restore)
+ restore_id = "internal-" + toString(UUIDHelpers::generateV4()); /// Always generate `restore_id` for internal restore to avoid collision if both internal and non-internal restores are on the same host
+ else if (!restore_settings.id.empty())
+ restore_id = restore_settings.id;
+ else
+ restore_id = toString(*restore_settings.restore_uuid);
- try
- {
String base_backup_name;
if (restore_settings.base_backup_info)
base_backup_name = restore_settings.base_backup_info->toStringForLogging();
- addInfo(restore_id,
+ /// process_list_element_holder is used to make an element in ProcessList live while RESTORE is working asynchronously.
+ auto process_list_element = restore_context->getProcessListElement();
+ if (process_list_element)
+ process_list_element_holder = process_list_element->getProcessListEntry();
+
+ backups_worker.addInfo(restore_id,
backup_name_for_logging,
base_backup_name,
- context->getCurrentQueryId(),
- restore_settings.internal,
- context->getProcessListElement(),
+ restore_context->getCurrentQueryId(),
+ is_internal_restore,
+ process_list_element,
BackupStatus::RESTORING);
+ }
- if (restore_settings.internal)
+ void doRestore()
+ {
+ chassert(!restore_coordination);
+ if (on_cluster && !is_internal_restore)
{
- /// The following call of makeRestoreCoordination() is not essential because doRestore() will later create a restore coordination
- /// if it's not created here. However to handle errors better it's better to make a coordination here because this way
- /// if an exception will be thrown in startRestoring() other hosts will know about that.
- restore_coordination = makeRestoreCoordination(context, restore_settings, /* remote= */ true);
+ restore_query->cluster = restore_context->getMacros()->expand(restore_query->cluster);
+ cluster = restore_context->getCluster(restore_query->cluster);
+ restore_settings.cluster_host_ids = cluster->getHostIDs();
+ }
+ restore_coordination = backups_worker.makeRestoreCoordination(on_cluster, restore_settings, restore_context);
+
+ backups_worker.doRestore(
+ restore_query,
+ restore_id,
+ backup_name_for_logging,
+ backup_info,
+ restore_settings,
+ restore_coordination,
+ restore_context,
+ on_cluster,
+ cluster);
+ }
+
+ void onException()
+ {
+ /// Something bad happened, some data were not restored.
+ tryLogCurrentException(backups_worker.log, fmt::format("Failed to restore from {} {}", (is_internal_restore ? "internal backup" : "backup"), backup_name_for_logging));
+
+ if (restore_coordination && restore_coordination->trySetError(std::current_exception()))
+ {
+ restore_coordination->tryWaitForOtherHostsToFinishAfterError();
+ restore_coordination->tryFinishAfterError();
}
- /// Prepare context to use.
- ContextMutablePtr context_in_use = context;
- bool on_cluster = !restore_query->cluster.empty();
- if (restore_settings.async || on_cluster)
- {
- /// We have to clone the query context here because:
- /// if this is an "ON CLUSTER" query we need to change some settings, and
- /// if this is an "ASYNC" query it's going to be executed in another thread.
- context_in_use = Context::createCopy(context);
- context_in_use->makeQueryContext();
- }
+ backups_worker.setStatusSafe(restore_id, getRestoreStatusFromCurrentException());
+ }
+};
- if (restore_settings.async)
- {
- auto & thread_pool = getThreadPool(on_cluster ? ThreadPoolId::RESTORE_ASYNC_ON_CLUSTER : ThreadPoolId::RESTORE_ASYNC);
- /// process_list_element_holder is used to make an element in ProcessList live while RESTORE is working asynchronously.
- auto process_list_element = context_in_use->getProcessListElement();
+std::pair<BackupsWorker::OperationID, BackupStatus> BackupsWorker::startRestoring(const ASTPtr & query, ContextMutablePtr context)
+{
+ auto starter = std::make_shared<RestoreStarter>(*this, query, context);
- thread_pool.scheduleOrThrowOnError(
- [this,
- restore_query,
- restore_id,
- backup_name_for_logging,
- backup_info,
- restore_settings,
- restore_coordination,
- context_in_use,
- on_exception,
- process_list_element_holder = process_list_element ? process_list_element->getProcessListEntry() : nullptr]
+ try
+ {
+ auto thread_pool_id = starter->is_internal_restore ? ThreadPoolId::ASYNC_BACKGROUND_INTERNAL_RESTORE : ThreadPoolId::ASYNC_BACKGROUND_RESTORE;
+ String thread_name = starter->is_internal_restore ? "RestoreAsyncInt" : "RestoreAsync";
+ auto schedule = threadPoolCallbackRunnerUnsafe<void>(thread_pools->getThreadPool(thread_pool_id), thread_name);
+
+ schedule([starter]
+ {
+ try
{
- try
- {
- setThreadName("RestorerWorker");
- CurrentThread::QueryScope query_scope(context_in_use);
- doRestore(
- restore_query,
- restore_id,
- backup_name_for_logging,
- backup_info,
- restore_settings,
- restore_coordination,
- context_in_use);
- }
- catch (...)
- {
- on_exception(restore_id, backup_name_for_logging, restore_settings, restore_coordination);
- }
- });
- }
- else
- {
- doRestore(
- restore_query,
- restore_id,
- backup_name_for_logging,
- backup_info,
- restore_settings,
- restore_coordination,
- context_in_use);
- }
+ starter->doRestore();
+ }
+ catch (...)
+ {
+ starter->onException();
+ }
+ },
+ Priority{});
- return restore_id;
+ return {starter->restore_id, BackupStatus::RESTORING};
}
catch (...)
{
- on_exception(restore_id, backup_name_for_logging, restore_settings, restore_coordination);
+ starter->onException();
throw;
}
}
-void BackupsWorker::doRestore(
- const std::shared_ptr<ASTBackupQuery> & restore_query,
- const OperationID & restore_id,
- const String & backup_name_for_logging,
- const BackupInfo & backup_info,
- RestoreSettings restore_settings,
- std::shared_ptr<IRestoreCoordination> restore_coordination,
- ContextMutablePtr context)
+BackupPtr BackupsWorker::openBackupForReading(const BackupInfo & backup_info, const RestoreSettings & restore_settings, const ContextPtr & context) const
{
- /// Open the backup for reading.
+ LOG_TRACE(log, "Opening backup for reading");
BackupFactory::CreateParams backup_open_params;
backup_open_params.open_mode = IBackup::OpenMode::READ;
backup_open_params.context = context;
@@ -931,32 +829,35 @@ void BackupsWorker::doRestore(
backup_open_params.read_settings = getReadSettingsForRestore(context);
backup_open_params.write_settings = getWriteSettingsForRestore(context);
backup_open_params.is_internal_backup = restore_settings.internal;
- BackupPtr backup = BackupFactory::instance().createBackup(backup_open_params);
+ auto backup = BackupFactory::instance().createBackup(backup_open_params);
+ LOG_TRACE(log, "Opened backup for reading");
+ return backup;
+}
+
+
+void BackupsWorker::doRestore(
+ const std::shared_ptr<ASTBackupQuery> & restore_query,
+ const OperationID & restore_id,
+ const String & backup_name_for_logging,
+ const BackupInfo & backup_info,
+ RestoreSettings restore_settings,
+ std::shared_ptr<IRestoreCoordination> restore_coordination,
+ ContextMutablePtr context,
+ bool on_cluster,
+ const ClusterPtr & cluster)
+{
+ bool is_internal_restore = restore_settings.internal;
+
+ maybeSleepForTesting();
+
+ /// Open the backup for reading.
+ BackupPtr backup = openBackupForReading(backup_info, restore_settings, context);
String current_database = context->getCurrentDatabase();
+
/// Checks access rights if this is ON CLUSTER query.
/// (If this isn't ON CLUSTER query RestorerFromBackup will check access rights later.)
- ClusterPtr cluster;
- bool on_cluster = !restore_query->cluster.empty();
-
- if (on_cluster)
- {
- restore_query->cluster = context->getMacros()->expand(restore_query->cluster);
- cluster = context->getCluster(restore_query->cluster);
- restore_settings.cluster_host_ids = cluster->getHostIDs();
- }
-
- /// Make a restore coordination.
- if (!restore_coordination)
- restore_coordination = makeRestoreCoordination(context, restore_settings, /* remote= */ on_cluster);
-
- if (!allow_concurrent_restores && restore_coordination->hasConcurrentRestores(std::ref(num_active_restores)))
- throw Exception(
- ErrorCodes::CONCURRENT_ACCESS_NOT_SUPPORTED,
- "Concurrent restores not supported, turn on setting 'allow_concurrent_restores'");
-
-
- if (on_cluster)
+ if (on_cluster && !is_internal_restore)
{
/// We cannot just use access checking provided by the function executeDDLQueryOnCluster(): it would be incorrect
/// because different replicas can contain different set of tables and so the required access rights can differ too.
@@ -975,27 +876,21 @@ void BackupsWorker::doRestore(
}
/// Do RESTORE.
- if (on_cluster)
+ if (on_cluster && !is_internal_restore)
{
-
- DDLQueryOnClusterParams params;
- params.cluster = cluster;
- params.only_shard_num = restore_settings.shard_num;
- params.only_replica_num = restore_settings.replica_num;
+ /// Send the RESTORE query to other hosts.
restore_settings.copySettingsToQuery(*restore_query);
+ sendQueryToOtherHosts(*restore_query, cluster, restore_settings.shard_num, restore_settings.replica_num,
+ context, {}, restore_coordination->getOnClusterInitializationKeeperRetriesInfo());
+ restore_coordination->setRestoreQueryWasSentToOtherHosts();
- // executeDDLQueryOnCluster() will return without waiting for completion
- context->setSetting("distributed_ddl_task_timeout", Field{0});
- context->setSetting("distributed_ddl_output_mode", Field{"none"});
-
- executeDDLQueryOnCluster(restore_query, context, params);
-
- /// Wait until all the hosts have written their backup entries.
- restore_coordination->waitForStage(Stage::COMPLETED);
- restore_coordination->setStage(Stage::COMPLETED,"");
+ /// Wait until all the hosts have finished their restoring work.
+ restore_coordination->waitForOtherHostsToFinish();
}
else
{
+ maybeSleepForTesting();
+
restore_query->setCurrentDatabase(current_database);
auto after_task_callback = [&]
@@ -1011,11 +906,115 @@ void BackupsWorker::doRestore(
restorer.run(RestorerFromBackup::RESTORE);
}
- LOG_INFO(log, "Restored from {} {} successfully", (restore_settings.internal ? "internal backup" : "backup"), backup_name_for_logging);
+ /// The restore coordination is not needed anymore.
+ restore_coordination->finish();
+
+ LOG_INFO(log, "Restored from {} {} successfully", (is_internal_restore ? "internal backup" : "backup"), backup_name_for_logging);
setStatus(restore_id, BackupStatus::RESTORED);
}
+void BackupsWorker::sendQueryToOtherHosts(const ASTBackupQuery & backup_or_restore_query, const ClusterPtr & cluster,
+ size_t only_shard_num, size_t only_replica_num, ContextMutablePtr context, const AccessRightsElements & access_to_check,
+ const ZooKeeperRetriesInfo & retries_info) const
+{
+ chassert(cluster);
+
+ DDLQueryOnClusterParams params;
+ params.cluster = cluster;
+ params.only_shard_num = only_shard_num;
+ params.only_replica_num = only_replica_num;
+ params.access_to_check = access_to_check;
+ params.retries_info = retries_info;
+
+ context->setSetting("distributed_ddl_task_timeout", Field{0});
+ context->setSetting("distributed_ddl_output_mode", Field{"never_throw"});
+
+ // executeDDLQueryOnCluster() will return without waiting for completion
+ executeDDLQueryOnCluster(backup_or_restore_query.clone(), context, params);
+
+ maybeSleepForTesting();
+}
+
+
+std::shared_ptr<IBackupCoordination>
+BackupsWorker::makeBackupCoordination(bool on_cluster, const BackupSettings & backup_settings, const ContextPtr & context) const
+{
+ if (!on_cluster)
+ {
+ return std::make_shared<BackupCoordinationLocal>(
+ *backup_settings.backup_uuid, !backup_settings.deduplicate_files, allow_concurrent_backups, *concurrency_counters);
+ }
+
+ bool is_internal_backup = backup_settings.internal;
+
+ String root_zk_path = context->getConfigRef().getString("backups.zookeeper_path", "/clickhouse/backups");
+ auto get_zookeeper = [global_context = context->getGlobalContext()] { return global_context->getZooKeeper(); };
+ auto keeper_settings = BackupKeeperSettings::fromContext(context);
+
+ auto all_hosts = BackupSettings::Util::filterHostIDs(
+ backup_settings.cluster_host_ids, backup_settings.shard_num, backup_settings.replica_num);
+ all_hosts.emplace_back(BackupCoordinationOnCluster::kInitiator);
+
+ String current_host = is_internal_backup ? backup_settings.host_id : String{BackupCoordinationOnCluster::kInitiator};
+
+ auto thread_pool_id = is_internal_backup ? ThreadPoolId::ON_CLUSTER_COORDINATION_INTERNAL_BACKUP : ThreadPoolId::ON_CLUSTER_COORDINATION_BACKUP;
+ String thread_name = is_internal_backup ? "BackupCoordInt" : "BackupCoord";
+ auto schedule = threadPoolCallbackRunnerUnsafe<void>(thread_pools->getThreadPool(thread_pool_id), thread_name);
+
+ return std::make_shared<BackupCoordinationOnCluster>(
+ *backup_settings.backup_uuid,
+ !backup_settings.deduplicate_files,
+ root_zk_path,
+ get_zookeeper,
+ keeper_settings,
+ current_host,
+ all_hosts,
+ allow_concurrent_backups,
+ *concurrency_counters,
+ schedule,
+ context->getProcessListElement());
+}
+
+std::shared_ptr<IRestoreCoordination>
+BackupsWorker::makeRestoreCoordination(bool on_cluster, const RestoreSettings & restore_settings, const ContextPtr & context) const
+{
+ if (!on_cluster)
+ {
+ return std::make_shared<RestoreCoordinationLocal>