Merge branch 'master' of github.com:ClickHouse/ClickHouse into query_plan_for_merge

2024-09-19 16:20:50 +00:00 · 2024-09-05 20:58:47 +02:00 · 2024-09-05 20:58:47 +02:00 · 2df5edc1c1
commit 2df5edc1c1
parent 20eaecc4f3 b6572e36b4
17 changed files with 166 additions and 376 deletions
--- a/docs/en/sql-reference/operators/index.md
+++ b/docs/en/sql-reference/operators/index.md
@@ -265,8 +265,6 @@ SELECT now() AS current_date_time, current_date_time + INTERVAL '4' day + INTERV
 └─────────────────────┴────────────────────────────────────────────────────────────┘
 ```

-You can work with dates without using `INTERVAL`, just by adding or subtracting seconds, minutes, and hours. For example, an interval of one day can be set by adding `60*60*24`.
-
 :::note    
 The `INTERVAL` syntax or `addDays` function are always preferred. Simple addition or subtraction (syntax like `now() + ...`) doesn't consider time settings. For example, daylight saving time.
 :::
--- a/docs/en/sql-reference/statements/alter/partition.md
+++ b/docs/en/sql-reference/statements/alter/partition.md
@@ -351,7 +351,7 @@ ALTER TABLE mt DELETE IN PARTITION ID '2' WHERE p = 2;
 You can specify the partition expression in `ALTER ... PARTITION` queries in different ways:

 - As a value from the `partition` column of the `system.parts` table. For example, `ALTER TABLE visits DETACH PARTITION 201901`.
- Using the keyword `ALL`. It can be used only with DROP/DETACH/ATTACH. For example, `ALTER TABLE visits ATTACH PARTITION ALL`.
+- Using the keyword `ALL`. It can be used only with DROP/DETACH/ATTACH/ATTACH FROM. For example, `ALTER TABLE visits ATTACH PARTITION ALL`.
 - As a tuple of expressions or constants that matches (in types) the table partitioning keys tuple. In the case of a single element partitioning key, the expression should be wrapped in the `tuple (...)` function. For example, `ALTER TABLE visits DETACH PARTITION tuple(toYYYYMM(toDate('2019-01-25')))`.
 - Using the partition ID. Partition ID is a string identifier of the partition (human-readable, if possible) that is used as the names of partitions in the file system and in ZooKeeper. The partition ID must be specified in the `PARTITION ID` clause, in a single quotes. For example, `ALTER TABLE visits DETACH PARTITION ID '201901'`.
 - In the [ALTER ATTACH PART](#attach-partitionpart) and [DROP DETACHED PART](#drop-detached-partitionpart) query, to specify the name of a part, use string literal with a value from the `name` column of the [system.detached_parts](/docs/en/operations/system-tables/detached_parts.md/#system_tables-detached_parts) table. For example, `ALTER TABLE visits ATTACH PART '201901_1_1_0'`.
--- a/docs/en/sql-reference/statements/delete.md
+++ b/docs/en/sql-reference/statements/delete.md
@@ -13,7 +13,7 @@ The lightweight `DELETE` statement removes rows from the table `[db.]table` that
 DELETE FROM [db.]table [ON CLUSTER cluster] [IN PARTITION partition_expr] WHERE expr;
 ```

-It is called "lightweight `DELETE`" to contrast it to the [ALTER table DELETE](/en/sql-reference/statements/alter/delete) command, which is a heavyweight process.
+It is called "lightweight `DELETE`" to contrast it to the [ALTER TABLE ... DELETE](/en/sql-reference/statements/alter/delete) command, which is a heavyweight process.

 ## Examples

@@ -22,11 +22,13 @@ It is called "lightweight `DELETE`" to contrast it to the [ALTER table DELETE](/
 DELETE FROM hits WHERE Title LIKE '%hello%';
 ```

-## Lightweight `DELETE` does not delete data from storage immediately
+## Lightweight `DELETE` does not delete data immediately

-With lightweight `DELETE`, deleted rows are internally marked as deleted immediately and will be automatically filtered out of all subsequent queries. However, cleanup of data happens during the next merge. As a result, it is possible that for an unspecified period, data is not actually deleted from storage and is only marked as deleted.
+Lightweight `DELETE` is implemented as a [mutation](/en/sql-reference/statements/alter#mutations), which is executed asynchronously in the background by default. The statement returns almost immediately, but the data can remain visible to queries until the mutation finishes.

-If you need to guarantee that your data is deleted from storage in a predictable time, consider using the [ALTER table DELETE](/en/sql-reference/statements/alter/delete) command. Note that deleting data using `ALTER table DELETE` may consume significant resources as it recreates all affected parts.
+The mutation marks rows as deleted, and at that point, they will no longer show up in query results. It does not physically delete the data; that happens during the next merge. As a result, it is possible that for an unspecified period, data is not actually deleted from storage and is only marked as deleted.
+
+If you need to guarantee that your data is deleted from storage in a predictable time, consider using the table setting [`min_age_to_force_merge_seconds`](https://clickhouse.com/docs/en/operations/settings/merge-tree-settings#min_age_to_force_merge_seconds). Or you can use the [ALTER TABLE ... DELETE](/en/sql-reference/statements/alter/delete) command. Note that deleting data using `ALTER TABLE ... DELETE` may consume significant resources as it recreates all affected parts.

 ## Deleting large amounts of data

@@ -38,7 +40,7 @@ If you anticipate frequent deletes, consider using a [custom partitioning key](/

 ### Lightweight `DELETE`s with projections

-By default, `DELETE` does not work for tables with projections. This is because rows in a projection may be affected by a `DELETE` operation. But there is a [MergeTree setting](https://clickhouse.com/docs/en/operations/settings/merge-tree-settings) `lightweight_mutation_projection_mode` can change the behavior.
+By default, `DELETE` does not work for tables with projections. This is because rows in a projection may be affected by a `DELETE` operation. However, the [MergeTree setting](https://clickhouse.com/docs/en/operations/settings/merge-tree-settings) `lightweight_mutation_projection_mode` can change this behavior.

 ## Performance considerations when using lightweight `DELETE`

@@ -48,7 +50,7 @@ The following can also negatively impact lightweight `DELETE` performance:

 - A heavy `WHERE` condition in a `DELETE` query.
 - If the mutations queue is filled with many other mutations, this can possibly lead to performance issues as all mutations on a table are executed sequentially.
- The affected table having a very large number of data parts.
+- The affected table has a very large number of data parts.
 - Having a lot of data in compact parts. In a Compact part, all columns are stored in one file.

 ## Delete permissions
@@ -61,13 +63,13 @@ GRANT ALTER DELETE ON db.table to username;

 ## How lightweight DELETEs work internally in ClickHouse

-1. A "mask" is applied to affected rows
+1. **A "mask" is applied to affected rows**

-When a `DELETE FROM table ...` query is executed, ClickHouse saves a mask where each row is marked as either “existing” or as “deleted”. Those “deleted” rows are omitted for subsequent queries. However, rows are actually only removed later by subsequent merges. Writing this mask is much more lightweight than what is done by an `ALTER table DELETE` query.
+   When a `DELETE FROM table ...` query is executed, ClickHouse saves a mask where each row is marked as either “existing” or as “deleted”. Those “deleted” rows are omitted for subsequent queries. However, rows are actually only removed later by subsequent merges. Writing this mask is much more lightweight than what is done by an `ALTER TABLE ... DELETE` query.

   The mask is implemented as a hidden `_row_exists` system column that stores `True` for all visible rows and `False` for deleted ones. This column is only present in a part if some rows in the part were deleted. This column does not exist when a part has all values equal to `True`.

-2. `SELECT` queries are transformed to include the mask
+2. **`SELECT` queries are transformed to include the mask**

   When a masked column is used in a query, the `SELECT ... FROM table WHERE condition` query internally is extended by the predicate on `_row_exists` and is transformed to:
   ```sql
@@ -75,17 +77,17 @@ SELECT ... FROM table PREWHERE _row_exists WHERE condition
   ```
   At execution time, the column `_row_exists` is read to determine which rows should not be returned. If there are many deleted rows, ClickHouse can determine which granules can be fully skipped when reading the rest of the columns.

-3. `DELETE` queries are transformed to `ALTER table UPDATE` queries
+3. **`DELETE` queries are transformed to `ALTER TABLE ... UPDATE` queries**

-The `DELETE FROM table WHERE condition` is translated into an `ALTER table UPDATE _row_exists = 0 WHERE condition` mutation.
+   The `DELETE FROM table WHERE condition` is translated into an `ALTER TABLE table UPDATE _row_exists = 0 WHERE condition` mutation.

   Internally, this mutation is executed in two steps:

   1. A `SELECT count() FROM table WHERE condition` command is executed for each individual part to determine if the part is affected.

-2. Based on the commands above, affected parts are then mutated, and hardlinks are created for unaffected parts. In the case of wide parts, the `_row_exists` column for each row is updated and all other columns' files are hardlinked. For compact parts, all columns are re-written because they are all stored together in one file.
+   2. Based on the commands above, affected parts are then mutated, and hardlinks are created for unaffected parts. In the case of wide parts, the `_row_exists` column for each row is updated, and all other columns' files are hardlinked. For compact parts, all columns are re-written because they are all stored together in one file.

-From the steps above, we can see that lightweight deletes using the masking technique improves performance over traditional `ALTER table DELETE` commands because `ALTER table DELETE` reads and re-writes all the columns' files for affected parts.
+   From the steps above, we can see that lightweight `DELETE` using the masking technique improves performance over traditional `ALTER TABLE ... DELETE` because it does not re-write all the columns' files for affected parts.

 ## Related content

--- a/src/Storages/MergeTree/DataPartStorageOnDiskFull.cpp
+++ b/src/Storages/MergeTree/DataPartStorageOnDiskFull.cpp
@@ -95,22 +95,18 @@ UInt32 DataPartStorageOnDiskFull::getRefCount(const String & file_name) const
    return volume->getDisk()->getRefCount(fs::path(root_path) / part_dir / file_name);
 }

-std::string DataPartStorageOnDiskFull::getRemotePath(const std::string & file_name, bool if_exists) const
+std::vector<std::string> DataPartStorageOnDiskFull::getRemotePaths(const std::string & file_name) const
 {
    const std::string path = fs::path(root_path) / part_dir / file_name;
    auto objects = volume->getDisk()->getStorageObjects(path);

-    if (objects.empty() && if_exists)
-        return "";
+    std::vector<std::string> remote_paths;
+    remote_paths.reserve(objects.size());

-    if (objects.size() != 1)
-    {
-        throw Exception(ErrorCodes::LOGICAL_ERROR,
-                        "One file must be mapped to one object on blob storage by path {} in MergeTree tables, have {}.",
-                        path, objects.size());
-    }
+    for (const auto & object : objects)
+        remote_paths.push_back(object.remote_path);

-    return objects[0].remote_path;
+    return remote_paths;
 }

 String DataPartStorageOnDiskFull::getUniqueId() const
--- a/src/Storages/MergeTree/DataPartStorageOnDiskFull.h
+++ b/src/Storages/MergeTree/DataPartStorageOnDiskFull.h
@@ -23,7 +23,7 @@ public:
    Poco::Timestamp getFileLastModified(const String & file_name) const override;
    size_t getFileSize(const std::string & file_name) const override;
    UInt32 getRefCount(const std::string & file_name) const override;
-    std::string getRemotePath(const std::string & file_name, bool if_exists) const override;
+    std::vector<std::string> getRemotePaths(const std::string & file_name) const override;
    String getUniqueId() const override;

    std::unique_ptr<ReadBufferFromFileBase> readFile(
--- a/src/Storages/MergeTree/IDataPartStorage.h
+++ b/src/Storages/MergeTree/IDataPartStorage.h
@@ -126,7 +126,7 @@ public:
    virtual UInt32 getRefCount(const std::string & file_name) const = 0;

    /// Get path on remote filesystem from file name on local filesystem.
-    virtual std::string getRemotePath(const std::string & file_name, bool if_exists) const = 0;
+    virtual std::vector<std::string> getRemotePaths(const std::string & file_name) const = 0;

    virtual UInt64 calculateTotalSizeOnDisk() const = 0;

--- a/src/Storages/MergeTree/MergeTreeData.cpp
+++ b/src/Storages/MergeTree/MergeTreeData.cpp
@@ -5009,7 +5009,7 @@ void MergeTreeData::checkAlterPartitionIsPossible(
                const auto * partition_ast = command.partition->as<ASTPartition>();
                if (partition_ast && partition_ast->all)
                {
-                    if (command.type != PartitionCommand::DROP_PARTITION && command.type != PartitionCommand::ATTACH_PARTITION)
+                    if (command.type != PartitionCommand::DROP_PARTITION && command.type != PartitionCommand::ATTACH_PARTITION && !(command.type == PartitionCommand::REPLACE_PARTITION && !command.replace))
                        throw DB::Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only support DROP/DETACH/ATTACH PARTITION ALL currently");
                }
                else
@@ -5810,7 +5810,7 @@ String MergeTreeData::getPartitionIDFromQuery(const ASTPtr & ast, ContextPtr loc
    const auto & partition_ast = ast->as<ASTPartition &>();

    if (partition_ast.all)
-        throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only Support DETACH PARTITION ALL currently");
+        throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only Support DROP/DETACH/ATTACH PARTITION ALL currently");

    if (!partition_ast.value)
    {
--- a/src/Storages/MergeTree/checkDataPart.cpp
+++ b/src/Storages/MergeTree/checkDataPart.cpp
@@ -391,16 +391,8 @@ IMergeTreeDataPart::Checksums checkDataPart(
            auto file_name = it->name();
            if (!data_part_storage.isDirectory(file_name))
            {
-                const bool is_projection_part = data_part->isProjectionPart();
-                auto remote_path = data_part_storage.getRemotePath(file_name, /* if_exists */is_projection_part);
-                if (remote_path.empty())
-                {
-                    chassert(is_projection_part);
-                    throw Exception(
-                        ErrorCodes::BROKEN_PROJECTION,
-                        "Remote path for {} does not exist for projection path. Projection {} is broken",
-                        file_name, data_part->name);
-                }
+                auto remote_paths = data_part_storage.getRemotePaths(file_name);
+                for (const auto & remote_path : remote_paths)
                    cache.removePathIfExists(remote_path, FileCache::getCommonUser().user_id);
            }
        }
--- a/src/Storages/StorageMergeTree.cpp
+++ b/src/Storages/StorageMergeTree.cpp
@@ -2090,9 +2090,22 @@ void StorageMergeTree::replacePartitionFrom(const StoragePtr & source_table, con
    ProfileEventsScope profile_events_scope;

    MergeTreeData & src_data = checkStructureAndGetMergeTreeData(source_table, source_metadata_snapshot, my_metadata_snapshot);
-    String partition_id = getPartitionIDFromQuery(partition, local_context);
+    DataPartsVector src_parts;
+    String partition_id;
+    bool is_all = partition->as<ASTPartition>()->all;
+    if (is_all)
+    {
+        if (replace)
+            throw DB::Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only support DROP/DETACH/ATTACH PARTITION ALL currently");
+
+        src_parts = src_data.getVisibleDataPartsVector(local_context);
+    }
+    else
+    {
+        partition_id = getPartitionIDFromQuery(partition, local_context);
+        src_parts = src_data.getVisibleDataPartsVectorInPartition(local_context, partition_id);
+    }

-    DataPartsVector src_parts = src_data.getVisibleDataPartsVectorInPartition(local_context, partition_id);
    MutableDataPartsVector dst_parts;
    std::vector<scope_guard> dst_parts_locks;

@@ -2100,6 +2113,9 @@ void StorageMergeTree::replacePartitionFrom(const StoragePtr & source_table, con

    for (const DataPartPtr & src_part : src_parts)
    {
+        if (is_all)
+            partition_id = src_part->partition.getID(src_data);
+
        if (!canReplacePartition(src_part))
            throw Exception(ErrorCodes::BAD_ARGUMENTS,
                            "Cannot replace partition '{}' because part '{}' has inconsistent granularity with table",
--- a/src/Storages/StorageReplicatedMergeTree.cpp
+++ b/src/Storages/StorageReplicatedMergeTree.cpp
@@ -8033,24 +8033,77 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
    /// First argument is true, because we possibly will add new data to current table.
    auto lock1 = lockForShare(query_context->getCurrentQueryId(), query_context->getSettingsRef().lock_acquire_timeout);
    auto lock2 = source_table->lockForShare(query_context->getCurrentQueryId(), query_context->getSettingsRef().lock_acquire_timeout);
-    auto storage_settings_ptr = getSettings();

-    auto source_metadata_snapshot = source_table->getInMemoryMetadataPtr();
-    auto metadata_snapshot = getInMemoryMetadataPtr();
+    const auto storage_settings_ptr = getSettings();
+    const auto source_metadata_snapshot = source_table->getInMemoryMetadataPtr();
+    const auto metadata_snapshot = getInMemoryMetadataPtr();
+    const MergeTreeData & src_data = checkStructureAndGetMergeTreeData(source_table, source_metadata_snapshot, metadata_snapshot);

-    Stopwatch watch;
+    std::unordered_set<String> partitions;
+    if (partition->as<ASTPartition>()->all)
+    {
+        if (replace)
+            throw DB::Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only support DROP/DETACH/ATTACH PARTITION ALL currently");
+
+        partitions = src_data.getAllPartitionIds();
+    }
+    else
+    {
+        partitions = std::unordered_set<String>();
+        partitions.emplace(getPartitionIDFromQuery(partition, query_context));
+    }
+    LOG_INFO(log, "Will try to attach {} partitions", partitions.size());
+
+    const Stopwatch watch;
    ProfileEventsScope profile_events_scope;
+    const auto zookeeper = getZooKeeper();

-    MergeTreeData & src_data = checkStructureAndGetMergeTreeData(source_table, source_metadata_snapshot, metadata_snapshot);
-    String partition_id = getPartitionIDFromQuery(partition, query_context);
+    const bool zero_copy_enabled = storage_settings_ptr->allow_remote_fs_zero_copy_replication
+                || dynamic_cast<const MergeTreeData *>(source_table.get())->getSettings()->allow_remote_fs_zero_copy_replication;

+    std::unique_ptr<ReplicatedMergeTreeLogEntryData> entries[partitions.size()];
+    size_t idx = 0;
+    for (const auto & partition_id : partitions)
+    {
+        entries[idx] = replacePartitionFromImpl(watch,
+                profile_events_scope,
+                metadata_snapshot,
+                src_data,
+                partition_id,
+                zookeeper,
+                replace,
+                zero_copy_enabled,
+                storage_settings_ptr->always_use_copy_instead_of_hardlinks,
+                query_context);
+        ++idx;
+    }
+
+    for (const auto & entry : entries)
+        waitForLogEntryToBeProcessedIfNecessary(*entry, query_context);
+}
+
+std::unique_ptr<ReplicatedMergeTreeLogEntryData> StorageReplicatedMergeTree::replacePartitionFromImpl(
+    const Stopwatch & watch,
+    ProfileEventsScope & profile_events_scope,
+    const StorageMetadataPtr & metadata_snapshot,
+    const MergeTreeData & src_data,
+    const String & partition_id,
+    const ZooKeeperPtr & zookeeper,
+    bool replace,
+    const bool & zero_copy_enabled,
+    const bool & always_use_copy_instead_of_hardlinks,
+    const ContextPtr & query_context)
+{
    /// NOTE: Some covered parts may be missing in src_all_parts if corresponding log entries are not executed yet.
    DataPartsVector src_all_parts = src_data.getVisibleDataPartsVectorInPartition(query_context, partition_id);
-
    LOG_DEBUG(log, "Cloning {} parts", src_all_parts.size());

-    static const String TMP_PREFIX = "tmp_replace_from_";
-    auto zookeeper = getZooKeeper();
+    std::optional<ZooKeeperMetadataTransaction> txn;
+    if (auto query_txn = query_context->getZooKeeperMetadataTransaction())
+        txn.emplace(query_txn->getZooKeeper(),
+            query_txn->getDatabaseZooKeeperPath(),
+            query_txn->isInitialQuery(),
+            query_txn->getTaskZooKeeperPath());

    /// Retry if alter_partition_version changes
    for (size_t retry = 0; retry < 1000; ++retry)
@@ -8136,11 +8189,9 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
            UInt64 index = lock->getNumber();
            MergeTreePartInfo dst_part_info(partition_id, index, index, src_part->info.level);

-            bool zero_copy_enabled = storage_settings_ptr->allow_remote_fs_zero_copy_replication
-                || dynamic_cast<const MergeTreeData *>(source_table.get())->getSettings()->allow_remote_fs_zero_copy_replication;
            IDataPartStorage::ClonePartParams clone_params
            {
-                .copy_instead_of_hardlink = storage_settings_ptr->always_use_copy_instead_of_hardlinks || (zero_copy_enabled && src_part->isStoredOnRemoteDiskWithZeroCopySupport()),
+                .copy_instead_of_hardlink = always_use_copy_instead_of_hardlinks || (zero_copy_enabled && src_part->isStoredOnRemoteDiskWithZeroCopySupport()),
                .metadata_version_to_write = metadata_snapshot->getMetadataVersion()
            };
            if (replace)
@@ -8148,7 +8199,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
                /// Replace can only work on the same disk
                auto [dst_part, part_lock] = cloneAndLoadDataPart(
                    src_part,
-                    TMP_PREFIX,
+                    TMP_PREFIX_REPLACE_PARTITION_FROM,
                    dst_part_info,
                    metadata_snapshot,
                    clone_params,
@@ -8163,7 +8214,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
                /// Attach can work on another disk
                auto [dst_part, part_lock] = cloneAndLoadDataPart(
                    src_part,
-                    TMP_PREFIX,
+                    TMP_PREFIX_REPLACE_PARTITION_FROM,
                    dst_part_info,
                    metadata_snapshot,
                    clone_params,
@@ -8179,15 +8230,15 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
            part_checksums.emplace_back(hash_hex);
        }

-        ReplicatedMergeTreeLogEntryData entry;
+        auto entry = std::make_unique<ReplicatedMergeTreeLogEntryData>();
        {
            auto src_table_id = src_data.getStorageID();
-            entry.type = ReplicatedMergeTreeLogEntryData::REPLACE_RANGE;
-            entry.source_replica = replica_name;
-            entry.create_time = time(nullptr);
-            entry.replace_range_entry = std::make_shared<ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry>();
+            entry->type = ReplicatedMergeTreeLogEntryData::REPLACE_RANGE;
+            entry->source_replica = replica_name;
+            entry->create_time = time(nullptr);
+            entry->replace_range_entry = std::make_shared<ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry>();

-            auto & entry_replace = *entry.replace_range_entry;
+            auto & entry_replace = *entry->replace_range_entry;
            entry_replace.drop_range_part_name = drop_range_fake_part_name;
            entry_replace.from_database = src_table_id.database_name;
            entry_replace.from_table = src_table_id.table_name;
@@ -8220,7 +8271,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
                ephemeral_locks[i].getUnlockOp(ops);
            }

-            if (auto txn = query_context->getZooKeeperMetadataTransaction())
+            if (txn)
                txn->moveOpsTo(ops);

            delimiting_block_lock->getUnlockOp(ops);
@@ -8228,7 +8279,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
            ops.emplace_back(zkutil::makeSetRequest(alter_partition_version_path, "", alter_partition_version_stat.version));
            /// Just update version, because merges assignment relies on it
            ops.emplace_back(zkutil::makeSetRequest(fs::path(zookeeper_path) / "log", "", -1));
-            ops.emplace_back(zkutil::makeCreateRequest(fs::path(zookeeper_path) / "log/log-", entry.toString(), zkutil::CreateMode::PersistentSequential));
+            ops.emplace_back(zkutil::makeCreateRequest(fs::path(zookeeper_path) / "log/log-", entry->toString(), zkutil::CreateMode::PersistentSequential));

            Transaction transaction(*this, NO_TRANSACTION_RAW);
            {
@@ -8278,14 +8329,11 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
        }

        String log_znode_path = dynamic_cast<const Coordination::CreateResponse &>(*op_results.back()).path_created;
-        entry.znode_name = log_znode_path.substr(log_znode_path.find_last_of('/') + 1);
+        entry->znode_name = log_znode_path.substr(log_znode_path.find_last_of('/') + 1);

        for (auto & lock : ephemeral_locks)
            lock.assumeUnlocked();

-        lock2.reset();
-        lock1.reset();
-
        /// We need to pull the REPLACE_RANGE before cleaning the replaced parts (otherwise CHeckThread may decide that parts are lost)
        queue.pullLogsToQueue(getZooKeeperAndAssertNotReadonly(), {}, ReplicatedMergeTreeQueue::SYNC);
        // No need to block operations further, especially that in case we have to wait for mutation to finish, the intent would block
@@ -8294,10 +8342,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
        parts_holder.clear();
        cleanup_thread.wakeup();

-
-        waitForLogEntryToBeProcessedIfNecessary(entry, query_context);
-
-        return;
+        return entry;
    }

    throw Exception(
--- a/src/Storages/StorageReplicatedMergeTree.h
+++ b/src/Storages/StorageReplicatedMergeTree.h
@@ -37,6 +37,7 @@
 #include <base/defines.h>
 #include <Core/BackgroundSchedulePool.h>
 #include <QueryPipeline/Pipe.h>
+#include <Common/ProfileEventsScope.h>
 #include <Storages/MergeTree/BackgroundJobsAssignee.h>
 #include <Parsers/SyncReplicaMode.h>

@@ -1013,6 +1014,18 @@ private:
        DataPartsVector::const_iterator it;
    };

+    const String TMP_PREFIX_REPLACE_PARTITION_FROM = "tmp_replace_from_";
+    std::unique_ptr<ReplicatedMergeTreeLogEntryData> replacePartitionFromImpl(
+        const Stopwatch & watch,
+        ProfileEventsScope & profile_events_scope,
+        const StorageMetadataPtr & metadata_snapshot,
+        const MergeTreeData & src_data,
+        const String & partition_id,
+        const zkutil::ZooKeeperPtr & zookeeper,
+        bool replace,
+        const bool & zero_copy_enabled,
+        const bool & always_use_copy_instead_of_hardlinks,
+        const ContextPtr & query_context);
 };

 String getPartNamePossiblyFake(MergeTreeDataFormatVersion format_version, const MergeTreePartInfo & part_info);
--- a/tests/ci/worker/dockerhub_proxy_template.sh
+++ b/tests/ci/worker/dockerhub_proxy_template.sh
@@ -1,33 +0,0 @@
-#!/usr/bin/env bash
-set -xeuo pipefail
-
-bash /usr/local/share/scripts/init-network.sh
-
-# tune sysctl for network performance
-cat > /etc/sysctl.d/10-network-memory.conf << EOF
-net.core.netdev_max_backlog=2000
-net.core.rmem_max=1048576
-net.core.wmem_max=1048576
-net.ipv4.tcp_max_syn_backlog=1024
-net.ipv4.tcp_rmem=4096 131072  16777216
-net.ipv4.tcp_wmem=4096 87380   16777216
-net.ipv4.tcp_mem=4096  131072  16777216
-EOF
-
-sysctl -p /etc/sysctl.d/10-network-memory.conf
-
-mkdir /home/ubuntu/registrystorage
-
-sed -i 's/preserve_hostname: false/preserve_hostname: true/g' /etc/cloud/cloud.cfg
-
-REGISTRY_PROXY_USERNAME=robotclickhouse
-REGISTRY_PROXY_PASSWORD=$(aws ssm get-parameter --name dockerhub_robot_password --with-decryption | jq '.Parameter.Value' -r)
-
-docker run -d --network=host -p 5000:5000 -v /home/ubuntu/registrystorage:/var/lib/registry \
-  -e REGISTRY_STORAGE_CACHE='' \
-  -e REGISTRY_HTTP_ADDR=0.0.0.0:5000 \
-  -e REGISTRY_STORAGE_DELETE_ENABLED=true \
-  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
-  -e REGISTRY_PROXY_PASSWORD="$REGISTRY_PROXY_PASSWORD" \
-  -e REGISTRY_PROXY_USERNAME="$REGISTRY_PROXY_USERNAME" \
-  --restart=always --name registry registry:2
--- a/tests/ci/worker/prepare-ci-ami.sh
+++ b/tests/ci/worker/prepare-ci-ami.sh
@@ -1,254 +0,0 @@
-#!/usr/bin/env bash
-# The script is downloaded the AWS image builder Task Orchestrator and Executor (AWSTOE)
-# We can't use `user data script` because cloud-init does not check the exit code
-# The script is downloaded in the component named ci-infrastructure-prepare in us-east-1
-# The link there must be adjusted to a particular RAW link, e.g.
-# https://github.com/ClickHouse/ClickHouse/raw/653da5f00219c088af66d97a8f1ea3e35e798268/tests/ci/worker/prepare-ci-ami.sh
-
-set -xeuo pipefail
-
-echo "Running prepare script"
-export DEBIAN_FRONTEND=noninteractive
-export RUNNER_VERSION=2.317.0
-export RUNNER_HOME=/home/ubuntu/actions-runner
-
-deb_arch() {
-  case $(uname -m) in
-    x86_64 )
-      echo amd64;;
-    aarch64 )
-      echo arm64;;
-  esac
-}
-
-runner_arch() {
-  case $(uname -m) in
-    x86_64 )
-      echo x64;;
-    aarch64 )
-      echo arm64;;
-  esac
-}
-
-# We have test for cgroups, and it's broken with cgroups v2
-# Ubuntu 22.04 has it enabled by default
-sed -r '/GRUB_CMDLINE_LINUX=/ s/"(.*)"/"\1 systemd.unified_cgroup_hierarchy=0"/' -i /etc/default/grub
-update-grub
-
-apt-get update
-
-apt-get install --yes --no-install-recommends \
-    apt-transport-https \
-    at \
-    atop \
-    binfmt-support \
-    build-essential \
-    ca-certificates \
-    curl \
-    gnupg \
-    jq \
-    lsb-release \
-    pigz \
-    ripgrep \
-    zstd \
-    python3-dev \
-    python3-pip \
-    qemu-user-static \
-    unzip \
-    gh
-
-# Install docker
-curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
-
-echo "deb [arch=$(deb_arch) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
-
-apt-get update
-apt-get install --yes --no-install-recommends docker-ce docker-buildx-plugin docker-ce-cli containerd.io
-
-usermod -aG docker ubuntu
-
-# enable ipv6 in containers (fixed-cidr-v6 is some random network mask)
-cat <<EOT > /etc/docker/daemon.json
-{
-  "ipv6": true,
-  "fixed-cidr-v6": "2001:db8:1::/64",
-  "log-driver": "json-file",
-  "log-opts": {
-    "max-file": "5",
-    "max-size": "1000m"
-  },
-  "insecure-registries" : ["dockerhub-proxy.dockerhub-proxy-zone:5000"],
-  "registry-mirrors" : ["http://dockerhub-proxy.dockerhub-proxy-zone:5000"]
-}
-EOT
-
-# Install azure-cli
-curl -sLS https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor -o /etc/apt/keyrings/microsoft.gpg
-AZ_DIST=$(lsb_release -cs)
-echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ $AZ_DIST main" | tee /etc/apt/sources.list.d/azure-cli.list
-
-apt-get update
-apt-get install --yes --no-install-recommends azure-cli
-
-# Increase the limit on number of virtual memory mappings to aviod 'Cannot mmap' error
-echo "vm.max_map_count = 2097152" > /etc/sysctl.d/01-increase-map-counts.conf
-# Workarond for sanitizers uncompatibility with some kernels, see https://github.com/google/sanitizers/issues/856
-echo "vm.mmap_rnd_bits=28" > /etc/sysctl.d/02-vm-mmap_rnd_bits.conf
-
-systemctl restart docker
-
-# buildx builder is user-specific
-sudo -u ubuntu docker buildx version
-sudo -u ubuntu docker buildx rm default-builder || : # if it's the second attempt
-sudo -u ubuntu docker buildx create --use --name default-builder
-
-pip install boto3 pygithub requests urllib3 unidiff dohq-artifactory jwt
-
-rm -rf $RUNNER_HOME  # if it's the second attempt
-mkdir -p $RUNNER_HOME && cd $RUNNER_HOME
-
-RUNNER_ARCHIVE="actions-runner-linux-$(runner_arch)-$RUNNER_VERSION.tar.gz"
-
-curl -O -L "https://github.com/actions/runner/releases/download/v$RUNNER_VERSION/$RUNNER_ARCHIVE"
-
-tar xzf "./$RUNNER_ARCHIVE"
-rm -f "./$RUNNER_ARCHIVE"
-./bin/installdependencies.sh
-
-chown -R ubuntu:ubuntu $RUNNER_HOME
-
-cd /home/ubuntu
-curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
-unzip -q awscliv2.zip
-./aws/install
-
-rm -rf /home/ubuntu/awscliv2.zip /home/ubuntu/aws
-
-# SSH keys of core team
-mkdir -p /home/ubuntu/.ssh
-
-# ~/.ssh/authorized_keys is cleaned out, so we use deprecated but working  ~/.ssh/authorized_keys2
-TEAM_KEYS_URL=$(aws ssm get-parameter --region us-east-1 --name team-keys-url --query 'Parameter.Value' --output=text)
-curl "${TEAM_KEYS_URL}" > /home/ubuntu/.ssh/authorized_keys2
-chown ubuntu: /home/ubuntu/.ssh -R
-chmod 0700 /home/ubuntu/.ssh
-
-# Download cloudwatch agent and install config for it
-wget --directory-prefix=/tmp https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/"$(deb_arch)"/latest/amazon-cloudwatch-agent.deb{,.sig}
-gpg --recv-key --keyserver keyserver.ubuntu.com D58167303B789C72
-gpg --verify /tmp/amazon-cloudwatch-agent.deb.sig
-dpkg -i /tmp/amazon-cloudwatch-agent.deb
-aws ssm get-parameter --region us-east-1 --name AmazonCloudWatch-github-runners --query 'Parameter.Value' --output text > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
-systemctl enable amazon-cloudwatch-agent.service
-
-
-echo "Install tailscale"
-# Build get-authkey for tailscale
-docker run --rm -v /usr/local/bin/:/host-local-bin -i golang:alpine sh -ex <<'EOF'
-  CGO_ENABLED=0 go install -tags tag:svc-core-ci-github tailscale.com/cmd/get-authkey@main
-  mv /go/bin/get-authkey /host-local-bin
-EOF
-
-# install tailscale
-curl -fsSL "https://pkgs.tailscale.com/stable/ubuntu/$(lsb_release -cs).noarmor.gpg" > /usr/share/keyrings/tailscale-archive-keyring.gpg
-curl -fsSL "https://pkgs.tailscale.com/stable/ubuntu/$(lsb_release -cs).tailscale-keyring.list" > /etc/apt/sources.list.d/tailscale.list
-apt-get update
-apt-get install tailscale --yes --no-install-recommends
-
-
-# Create a common script for the instances
-mkdir /usr/local/share/scripts -p
-setup_cloudflare_dns() {
-    # Add cloudflare DNS as a fallback
-    # Get default gateway interface
-    local IFACE ETH_DNS CLOUDFLARE_NS new_dns
-    IFACE=$(ip --json route list | jq '.[]|select(.dst == "default").dev' --raw-output)
-    # `Link 2 (eth0): 172.31.0.2`
-    ETH_DNS=$(resolvectl dns "$IFACE") || :
-    CLOUDFLARE_NS=1.1.1.1
-    if [[ "$ETH_DNS" ]] && [[ "${ETH_DNS#*: }" != *"$CLOUDFLARE_NS"* ]]; then
-      # Cut the leading legend
-      ETH_DNS=${ETH_DNS#*: }
-      # shellcheck disable=SC2206
-      new_dns=(${ETH_DNS} "$CLOUDFLARE_NS")
-      resolvectl dns "$IFACE" "${new_dns[@]}"
-    fi
-}
-
-setup_tailscale() {
-    # Setup tailscale, the very first action
-    local TS_API_CLIENT_ID TS_API_CLIENT_SECRET TS_AUTHKEY RUNNER_TYPE
-    TS_API_CLIENT_ID=$(aws ssm get-parameter --region us-east-1 --name /tailscale/api-client-id --query 'Parameter.Value' --output text --with-decryption)
-    TS_API_CLIENT_SECRET=$(aws ssm get-parameter --region us-east-1 --name /tailscale/api-client-secret --query 'Parameter.Value' --output text --with-decryption)
-
-    RUNNER_TYPE=$(/usr/local/bin/aws ec2 describe-tags --filters "Name=resource-id,Values=$INSTANCE_ID" --query "Tags[?Key=='github:runner-type'].Value" --output text)
-    RUNNER_TYPE=${RUNNER_TYPE:-unknown}
-    # Clean possible garbage from the runner type
-    RUNNER_TYPE=${RUNNER_TYPE//[^0-9a-z]/-}
-    TS_AUTHKEY=$(TS_API_CLIENT_ID="$TS_API_CLIENT_ID" TS_API_CLIENT_SECRET="$TS_API_CLIENT_SECRET" \
-                 get-authkey -tags tag:svc-core-ci-github -ephemeral)
-    tailscale up --ssh --auth-key="$TS_AUTHKEY" --hostname="ci-runner-$RUNNER_TYPE-$INSTANCE_ID"
-}
-
-cat > /usr/local/share/scripts/init-network.sh << EOF
-!/usr/bin/env bash
-$(declare -f setup_cloudflare_dns)
-
-$(declare -f setup_tailscale)
-
-# If the script is sourced, it will return now and won't execute functions
-return 0 &>/dev/null || :
-
-echo Setup Cloudflare DNS
-setup_cloudflare_dns
-
-echo Setup Tailscale VPN
-setup_tailscale
-EOF
-
-chmod +x /usr/local/share/scripts/init-network.sh
-
-
-# The following line is used in aws TOE check.
-touch /var/tmp/clickhouse-ci-ami.success
-# END OF THE SCRIPT
-
-# TOE (Task Orchestrator and Executor) description
-# name: CIInfrastructurePrepare
-# description: installs the infrastructure for ClickHouse CI runners
-# schemaVersion: 1.0
-#
-# phases:
-#   - name: build
-#     steps:
-#       - name: DownloadRemoteScript
-#         maxAttempts: 3
-#         action: WebDownload
-#         onFailure: Abort
-#         inputs:
-#           - source: https://github.com/ClickHouse/ClickHouse/raw/653da5f00219c088af66d97a8f1ea3e35e798268/tests/ci/worker/prepare-ci-ami.sh
-#             destination: /tmp/prepare-ci-ami.sh
-#       - name: RunScript
-#         maxAttempts: 3
-#         action: ExecuteBash
-#         onFailure: Abort
-#         inputs:
-#           commands:
-#             - bash -x '{{build.DownloadRemoteScript.inputs[0].destination}}'
-#
-#
-#   - name: validate
-#     steps:
-#       - name: RunScript
-#         maxAttempts: 3
-#         action: ExecuteBash
-#         onFailure: Abort
-#         inputs:
-#           commands:
-#             - ls /var/tmp/clickhouse-ci-ami.success
-#       - name: Cleanup
-#         action: DeleteFile
-#         onFailure: Abort
-#         maxAttempts: 3
-#         inputs:
-#           - path: /var/tmp/clickhouse-ci-ami.success
--- a/tests/queries/0_stateless/00626_replace_partition_from_table.reference
+++ b/tests/queries/0_stateless/00626_replace_partition_from_table.reference
@ -10,11 +10,12 @@ REPLACE recursive
 4	8
 1
 ATTACH FROM
-5	8
+6	8
+10	12
 OPTIMIZE
-5	8	5
-5	8	3
+10	12	9
+10	12	5
 After restart
-5	8
+10	12
 DETACH+ATTACH PARTITION
-3	4
+7	7
--- a/tests/queries/0_stateless/00626_replace_partition_from_table.sql
+++ b/tests/queries/0_stateless/00626_replace_partition_from_table.sql
@ -53,12 +53,16 @@ DROP TABLE src;
 CREATE TABLE src (p UInt64, k String, d UInt64) ENGINE = MergeTree PARTITION BY p ORDER BY k;
 INSERT INTO src VALUES (1, '0', 1);
 INSERT INTO src VALUES (1, '1', 1);
+INSERT INTO src VALUES (2, '2', 1);
+INSERT INTO src VALUES (3, '3', 1);

 SYSTEM STOP MERGES dst;
-INSERT INTO dst VALUES (1, '1', 2);
+INSERT INTO dst VALUES (1, '1', 2), (1, '2', 0);
 ALTER TABLE dst ATTACH PARTITION 1 FROM src;
 SELECT count(), sum(d) FROM dst;

+ALTER TABLE dst ATTACH PARTITION ALL FROM src;
+SELECT count(), sum(d) FROM dst;

 SELECT 'OPTIMIZE';
 SELECT count(), sum(d), uniqExact(_part) FROM dst;
--- a/tests/queries/0_stateless/00626_replace_partition_from_table_zookeeper.reference
+++ b/tests/queries/0_stateless/00626_replace_partition_from_table_zookeeper.reference
@ -16,6 +16,7 @@ REPLACE recursive
 ATTACH FROM
 5	8
 5	8
+7	12
 REPLACE with fetch
 4	6
 4	6
--- a/tests/queries/0_stateless/00626_replace_partition_from_table_zookeeper.sh
+++ b/tests/queries/0_stateless/00626_replace_partition_from_table_zookeeper.sh
@ -1,5 +1,5 @@
 #!/usr/bin/env bash
-# Tags: zookeeper, no-object-storage
+# Tags: zookeeper, no-object-storage, long

 # Because REPLACE PARTITION does not forces immediate removal of replaced data parts from local filesystem
 # (it tries to do it as quick as possible, but it still performed in separate thread asynchronously)
@ -82,6 +82,8 @@ $CLICKHOUSE_CLIENT --query="DROP TABLE src;"
 $CLICKHOUSE_CLIENT --query="CREATE TABLE src (p UInt64, k String, d UInt64) ENGINE = MergeTree PARTITION BY p ORDER BY k;"
 $CLICKHOUSE_CLIENT --query="INSERT INTO src VALUES (1, '0', 1);"
 $CLICKHOUSE_CLIENT --query="INSERT INTO src VALUES (1, '1', 1);"
+$CLICKHOUSE_CLIENT --query="INSERT INTO src VALUES (3, '1', 2);"
+$CLICKHOUSE_CLIENT --query="INSERT INTO src VALUES (4, '1', 2);"

 $CLICKHOUSE_CLIENT --query="INSERT INTO dst_r2 VALUES (1, '1', 2);"
 query_with_retry "ALTER TABLE dst_r2 ATTACH PARTITION 1 FROM src;"
@ -90,6 +92,13 @@ $CLICKHOUSE_CLIENT --query="SYSTEM SYNC REPLICA dst_r1;"
 $CLICKHOUSE_CLIENT --query="SELECT count(), sum(d) FROM dst_r1;"
 $CLICKHOUSE_CLIENT --query="SELECT count(), sum(d) FROM dst_r2;"

+query_with_retry "ALTER TABLE dst_r2 ATTACH PARTITION ALL FROM src;"
+$CLICKHOUSE_CLIENT --query="SYSTEM SYNC REPLICA dst_r2;"
+$CLICKHOUSE_CLIENT --query="SELECT count(), sum(d) FROM dst_r2;"
+query_with_retry "ALTER TABLE dst_r2 DROP PARTITION 3;"
+$CLICKHOUSE_CLIENT --query="SYSTEM SYNC REPLICA dst_r2;"
+query_with_retry "ALTER TABLE dst_r2 DROP PARTITION 4;"
+$CLICKHOUSE_CLIENT --query="SYSTEM SYNC REPLICA dst_r2;"

 $CLICKHOUSE_CLIENT --query="SELECT 'REPLACE with fetch';"
 $CLICKHOUSE_CLIENT --query="DROP TABLE src;"