Merge branch 'master' of github.com:ClickHouse/ClickHouse into query_plan_for_merge

This commit is contained in:
Alexander Gololobov 2024-09-05 20:58:47 +02:00
commit 2df5edc1c1
17 changed files with 166 additions and 376 deletions

View File

@ -265,8 +265,6 @@ SELECT now() AS current_date_time, current_date_time + INTERVAL '4' day + INTERV
└─────────────────────┴────────────────────────────────────────────────────────────┘
```
You can work with dates without using `INTERVAL`, just by adding or subtracting seconds, minutes, and hours. For example, an interval of one day can be set by adding `60*60*24`.
:::note
The `INTERVAL` syntax or `addDays` function are always preferred. Simple addition or subtraction (syntax like `now() + ...`) doesn't consider time settings. For example, daylight saving time.
:::

View File

@ -351,7 +351,7 @@ ALTER TABLE mt DELETE IN PARTITION ID '2' WHERE p = 2;
You can specify the partition expression in `ALTER ... PARTITION` queries in different ways:
- As a value from the `partition` column of the `system.parts` table. For example, `ALTER TABLE visits DETACH PARTITION 201901`.
- Using the keyword `ALL`. It can be used only with DROP/DETACH/ATTACH. For example, `ALTER TABLE visits ATTACH PARTITION ALL`.
- Using the keyword `ALL`. It can be used only with DROP/DETACH/ATTACH/ATTACH FROM. For example, `ALTER TABLE visits ATTACH PARTITION ALL`.
- As a tuple of expressions or constants that matches (in types) the table partitioning keys tuple. In the case of a single element partitioning key, the expression should be wrapped in the `tuple (...)` function. For example, `ALTER TABLE visits DETACH PARTITION tuple(toYYYYMM(toDate('2019-01-25')))`.
- Using the partition ID. Partition ID is a string identifier of the partition (human-readable, if possible) that is used as the names of partitions in the file system and in ZooKeeper. The partition ID must be specified in the `PARTITION ID` clause, in a single quotes. For example, `ALTER TABLE visits DETACH PARTITION ID '201901'`.
- In the [ALTER ATTACH PART](#attach-partitionpart) and [DROP DETACHED PART](#drop-detached-partitionpart) query, to specify the name of a part, use string literal with a value from the `name` column of the [system.detached_parts](/docs/en/operations/system-tables/detached_parts.md/#system_tables-detached_parts) table. For example, `ALTER TABLE visits ATTACH PART '201901_1_1_0'`.

View File

@ -13,7 +13,7 @@ The lightweight `DELETE` statement removes rows from the table `[db.]table` that
DELETE FROM [db.]table [ON CLUSTER cluster] [IN PARTITION partition_expr] WHERE expr;
```
It is called "lightweight `DELETE`" to contrast it to the [ALTER table DELETE](/en/sql-reference/statements/alter/delete) command, which is a heavyweight process.
It is called "lightweight `DELETE`" to contrast it to the [ALTER TABLE ... DELETE](/en/sql-reference/statements/alter/delete) command, which is a heavyweight process.
## Examples
@ -22,11 +22,13 @@ It is called "lightweight `DELETE`" to contrast it to the [ALTER table DELETE](/
DELETE FROM hits WHERE Title LIKE '%hello%';
```
## Lightweight `DELETE` does not delete data from storage immediately
## Lightweight `DELETE` does not delete data immediately
With lightweight `DELETE`, deleted rows are internally marked as deleted immediately and will be automatically filtered out of all subsequent queries. However, cleanup of data happens during the next merge. As a result, it is possible that for an unspecified period, data is not actually deleted from storage and is only marked as deleted.
Lightweight `DELETE` is implemented as a [mutation](/en/sql-reference/statements/alter#mutations), which is executed asynchronously in the background by default. The statement is going to return almost immediately, but the data can still be visible to queries until the mutation is finished.
If you need to guarantee that your data is deleted from storage in a predictable time, consider using the [ALTER table DELETE](/en/sql-reference/statements/alter/delete) command. Note that deleting data using `ALTER table DELETE` may consume significant resources as it recreates all affected parts.
The mutation marks rows as deleted, and at that point, they will no longer show up in query results. It does not physically delete the data, this will happen during the next merge. As a result, it is possible that for an unspecified period, data is not actually deleted from storage and is only marked as deleted.
If you need to guarantee that your data is deleted from storage in a predictable time, consider using the table setting [`min_age_to_force_merge_seconds`](https://clickhouse.com/docs/en/operations/settings/merge-tree-settings#min_age_to_force_merge_seconds). Or you can use the [ALTER TABLE ... DELETE](/en/sql-reference/statements/alter/delete) command. Note that deleting data using `ALTER TABLE ... DELETE` may consume significant resources as it recreates all affected parts.
## Deleting large amounts of data
@ -38,7 +40,7 @@ If you anticipate frequent deletes, consider using a [custom partitioning key](/
### Lightweight `DELETE`s with projections
By default, `DELETE` does not work for tables with projections. This is because rows in a projection may be affected by a `DELETE` operation. But there is a [MergeTree setting](https://clickhouse.com/docs/en/operations/settings/merge-tree-settings) `lightweight_mutation_projection_mode` can change the behavior.
By default, `DELETE` does not work for tables with projections. This is because rows in a projection may be affected by a `DELETE` operation. But there is a [MergeTree setting](https://clickhouse.com/docs/en/operations/settings/merge-tree-settings) `lightweight_mutation_projection_mode` to change the behavior.
## Performance considerations when using lightweight `DELETE`
@ -48,7 +50,7 @@ The following can also negatively impact lightweight `DELETE` performance:
- A heavy `WHERE` condition in a `DELETE` query.
- If the mutations queue is filled with many other mutations, this can possibly lead to performance issues as all mutations on a table are executed sequentially.
- The affected table having a very large number of data parts.
- The affected table has a very large number of data parts.
- Having a lot of data in compact parts. In a Compact part, all columns are stored in one file.
## Delete permissions
@ -61,13 +63,13 @@ GRANT ALTER DELETE ON db.table to username;
## How lightweight DELETEs work internally in ClickHouse
1. A "mask" is applied to affected rows
1. **A "mask" is applied to affected rows**
When a `DELETE FROM table ...` query is executed, ClickHouse saves a mask where each row is marked as either “existing” or as “deleted”. Those “deleted” rows are omitted for subsequent queries. However, rows are actually only removed later by subsequent merges. Writing this mask is much more lightweight than what is done by an `ALTER table DELETE` query.
When a `DELETE FROM table ...` query is executed, ClickHouse saves a mask where each row is marked as either “existing” or as “deleted”. Those “deleted” rows are omitted for subsequent queries. However, rows are actually only removed later by subsequent merges. Writing this mask is much more lightweight than what is done by an `ALTER TABLE ... DELETE` query.
The mask is implemented as a hidden `_row_exists` system column that stores `True` for all visible rows and `False` for deleted ones. This column is only present in a part if some rows in the part were deleted. This column does not exist when a part has all values equal to `True`.
2. `SELECT` queries are transformed to include the mask
2. **`SELECT` queries are transformed to include the mask**
When a masked column is used in a query, the `SELECT ... FROM table WHERE condition` query internally is extended by the predicate on `_row_exists` and is transformed to:
```sql
@ -75,17 +77,17 @@ SELECT ... FROM table PREWHERE _row_exists WHERE condition
```
At execution time, the column `_row_exists` is read to determine which rows should not be returned. If there are many deleted rows, ClickHouse can determine which granules can be fully skipped when reading the rest of the columns.
3. `DELETE` queries are transformed to `ALTER table UPDATE` queries
3. **`DELETE` queries are transformed to `ALTER TABLE ... UPDATE` queries**
The `DELETE FROM table WHERE condition` is translated into an `ALTER table UPDATE _row_exists = 0 WHERE condition` mutation.
The `DELETE FROM table WHERE condition` is translated into an `ALTER TABLE table UPDATE _row_exists = 0 WHERE condition` mutation.
Internally, this mutation is executed in two steps:
1. A `SELECT count() FROM table WHERE condition` command is executed for each individual part to determine if the part is affected.
2. Based on the commands above, affected parts are then mutated, and hardlinks are created for unaffected parts. In the case of wide parts, the `_row_exists` column for each row is updated and all other columns' files are hardlinked. For compact parts, all columns are re-written because they are all stored together in one file.
2. Based on the commands above, affected parts are then mutated, and hardlinks are created for unaffected parts. In the case of wide parts, the `_row_exists` column for each row is updated, and all other columns' files are hardlinked. For compact parts, all columns are re-written because they are all stored together in one file.
From the steps above, we can see that lightweight deletes using the masking technique improves performance over traditional `ALTER table DELETE` commands because `ALTER table DELETE` reads and re-writes all the columns' files for affected parts.
From the steps above, we can see that lightweight `DELETE` using the masking technique improves performance over traditional `ALTER TABLE ... DELETE` because it does not re-write all the columns' files for affected parts.
## Related content

View File

@ -95,22 +95,18 @@ UInt32 DataPartStorageOnDiskFull::getRefCount(const String & file_name) const
return volume->getDisk()->getRefCount(fs::path(root_path) / part_dir / file_name);
}
std::string DataPartStorageOnDiskFull::getRemotePath(const std::string & file_name, bool if_exists) const
std::vector<std::string> DataPartStorageOnDiskFull::getRemotePaths(const std::string & file_name) const
{
const std::string path = fs::path(root_path) / part_dir / file_name;
auto objects = volume->getDisk()->getStorageObjects(path);
if (objects.empty() && if_exists)
return "";
std::vector<std::string> remote_paths;
remote_paths.reserve(objects.size());
if (objects.size() != 1)
{
throw Exception(ErrorCodes::LOGICAL_ERROR,
"One file must be mapped to one object on blob storage by path {} in MergeTree tables, have {}.",
path, objects.size());
}
for (const auto & object : objects)
remote_paths.push_back(object.remote_path);
return objects[0].remote_path;
return remote_paths;
}
String DataPartStorageOnDiskFull::getUniqueId() const

View File

@ -23,7 +23,7 @@ public:
Poco::Timestamp getFileLastModified(const String & file_name) const override;
size_t getFileSize(const std::string & file_name) const override;
UInt32 getRefCount(const std::string & file_name) const override;
std::string getRemotePath(const std::string & file_name, bool if_exists) const override;
std::vector<std::string> getRemotePaths(const std::string & file_name) const override;
String getUniqueId() const override;
std::unique_ptr<ReadBufferFromFileBase> readFile(

View File

@ -126,7 +126,7 @@ public:
virtual UInt32 getRefCount(const std::string & file_name) const = 0;
/// Get path on remote filesystem from file name on local filesystem.
virtual std::string getRemotePath(const std::string & file_name, bool if_exists) const = 0;
virtual std::vector<std::string> getRemotePaths(const std::string & file_name) const = 0;
virtual UInt64 calculateTotalSizeOnDisk() const = 0;

View File

@ -5009,7 +5009,7 @@ void MergeTreeData::checkAlterPartitionIsPossible(
const auto * partition_ast = command.partition->as<ASTPartition>();
if (partition_ast && partition_ast->all)
{
if (command.type != PartitionCommand::DROP_PARTITION && command.type != PartitionCommand::ATTACH_PARTITION)
if (command.type != PartitionCommand::DROP_PARTITION && command.type != PartitionCommand::ATTACH_PARTITION && !(command.type == PartitionCommand::REPLACE_PARTITION && !command.replace))
throw DB::Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only support DROP/DETACH/ATTACH PARTITION ALL currently");
}
else
@ -5810,7 +5810,7 @@ String MergeTreeData::getPartitionIDFromQuery(const ASTPtr & ast, ContextPtr loc
const auto & partition_ast = ast->as<ASTPartition &>();
if (partition_ast.all)
throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only Support DETACH PARTITION ALL currently");
throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only Support DROP/DETACH/ATTACH PARTITION ALL currently");
if (!partition_ast.value)
{

View File

@ -391,16 +391,8 @@ IMergeTreeDataPart::Checksums checkDataPart(
auto file_name = it->name();
if (!data_part_storage.isDirectory(file_name))
{
const bool is_projection_part = data_part->isProjectionPart();
auto remote_path = data_part_storage.getRemotePath(file_name, /* if_exists */is_projection_part);
if (remote_path.empty())
{
chassert(is_projection_part);
throw Exception(
ErrorCodes::BROKEN_PROJECTION,
"Remote path for {} does not exist for projection path. Projection {} is broken",
file_name, data_part->name);
}
auto remote_paths = data_part_storage.getRemotePaths(file_name);
for (const auto & remote_path : remote_paths)
cache.removePathIfExists(remote_path, FileCache::getCommonUser().user_id);
}
}

View File

@ -2090,9 +2090,22 @@ void StorageMergeTree::replacePartitionFrom(const StoragePtr & source_table, con
ProfileEventsScope profile_events_scope;
MergeTreeData & src_data = checkStructureAndGetMergeTreeData(source_table, source_metadata_snapshot, my_metadata_snapshot);
String partition_id = getPartitionIDFromQuery(partition, local_context);
DataPartsVector src_parts;
String partition_id;
bool is_all = partition->as<ASTPartition>()->all;
if (is_all)
{
if (replace)
throw DB::Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only support DROP/DETACH/ATTACH PARTITION ALL currently");
src_parts = src_data.getVisibleDataPartsVector(local_context);
}
else
{
partition_id = getPartitionIDFromQuery(partition, local_context);
src_parts = src_data.getVisibleDataPartsVectorInPartition(local_context, partition_id);
}
DataPartsVector src_parts = src_data.getVisibleDataPartsVectorInPartition(local_context, partition_id);
MutableDataPartsVector dst_parts;
std::vector<scope_guard> dst_parts_locks;
@ -2100,6 +2113,9 @@ void StorageMergeTree::replacePartitionFrom(const StoragePtr & source_table, con
for (const DataPartPtr & src_part : src_parts)
{
if (is_all)
partition_id = src_part->partition.getID(src_data);
if (!canReplacePartition(src_part))
throw Exception(ErrorCodes::BAD_ARGUMENTS,
"Cannot replace partition '{}' because part '{}' has inconsistent granularity with table",

View File

@ -8033,24 +8033,77 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
/// First argument is true, because we possibly will add new data to current table.
auto lock1 = lockForShare(query_context->getCurrentQueryId(), query_context->getSettingsRef().lock_acquire_timeout);
auto lock2 = source_table->lockForShare(query_context->getCurrentQueryId(), query_context->getSettingsRef().lock_acquire_timeout);
auto storage_settings_ptr = getSettings();
auto source_metadata_snapshot = source_table->getInMemoryMetadataPtr();
auto metadata_snapshot = getInMemoryMetadataPtr();
const auto storage_settings_ptr = getSettings();
const auto source_metadata_snapshot = source_table->getInMemoryMetadataPtr();
const auto metadata_snapshot = getInMemoryMetadataPtr();
const MergeTreeData & src_data = checkStructureAndGetMergeTreeData(source_table, source_metadata_snapshot, metadata_snapshot);
Stopwatch watch;
std::unordered_set<String> partitions;
if (partition->as<ASTPartition>()->all)
{
if (replace)
throw DB::Exception(ErrorCodes::SUPPORT_IS_DISABLED, "Only support DROP/DETACH/ATTACH PARTITION ALL currently");
partitions = src_data.getAllPartitionIds();
}
else
{
partitions = std::unordered_set<String>();
partitions.emplace(getPartitionIDFromQuery(partition, query_context));
}
LOG_INFO(log, "Will try to attach {} partitions", partitions.size());
const Stopwatch watch;
ProfileEventsScope profile_events_scope;
const auto zookeeper = getZooKeeper();
MergeTreeData & src_data = checkStructureAndGetMergeTreeData(source_table, source_metadata_snapshot, metadata_snapshot);
String partition_id = getPartitionIDFromQuery(partition, query_context);
const bool zero_copy_enabled = storage_settings_ptr->allow_remote_fs_zero_copy_replication
|| dynamic_cast<const MergeTreeData *>(source_table.get())->getSettings()->allow_remote_fs_zero_copy_replication;
std::unique_ptr<ReplicatedMergeTreeLogEntryData> entries[partitions.size()];
size_t idx = 0;
for (const auto & partition_id : partitions)
{
entries[idx] = replacePartitionFromImpl(watch,
profile_events_scope,
metadata_snapshot,
src_data,
partition_id,
zookeeper,
replace,
zero_copy_enabled,
storage_settings_ptr->always_use_copy_instead_of_hardlinks,
query_context);
++idx;
}
for (const auto & entry : entries)
waitForLogEntryToBeProcessedIfNecessary(*entry, query_context);
}
std::unique_ptr<ReplicatedMergeTreeLogEntryData> StorageReplicatedMergeTree::replacePartitionFromImpl(
const Stopwatch & watch,
ProfileEventsScope & profile_events_scope,
const StorageMetadataPtr & metadata_snapshot,
const MergeTreeData & src_data,
const String & partition_id,
const ZooKeeperPtr & zookeeper,
bool replace,
const bool & zero_copy_enabled,
const bool & always_use_copy_instead_of_hardlinks,
const ContextPtr & query_context)
{
/// NOTE: Some covered parts may be missing in src_all_parts if corresponding log entries are not executed yet.
DataPartsVector src_all_parts = src_data.getVisibleDataPartsVectorInPartition(query_context, partition_id);
LOG_DEBUG(log, "Cloning {} parts", src_all_parts.size());
static const String TMP_PREFIX = "tmp_replace_from_";
auto zookeeper = getZooKeeper();
std::optional<ZooKeeperMetadataTransaction> txn;
if (auto query_txn = query_context->getZooKeeperMetadataTransaction())
txn.emplace(query_txn->getZooKeeper(),
query_txn->getDatabaseZooKeeperPath(),
query_txn->isInitialQuery(),
query_txn->getTaskZooKeeperPath());
/// Retry if alter_partition_version changes
for (size_t retry = 0; retry < 1000; ++retry)
@ -8136,11 +8189,9 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
UInt64 index = lock->getNumber();
MergeTreePartInfo dst_part_info(partition_id, index, index, src_part->info.level);
bool zero_copy_enabled = storage_settings_ptr->allow_remote_fs_zero_copy_replication
|| dynamic_cast<const MergeTreeData *>(source_table.get())->getSettings()->allow_remote_fs_zero_copy_replication;
IDataPartStorage::ClonePartParams clone_params
{
.copy_instead_of_hardlink = storage_settings_ptr->always_use_copy_instead_of_hardlinks || (zero_copy_enabled && src_part->isStoredOnRemoteDiskWithZeroCopySupport()),
.copy_instead_of_hardlink = always_use_copy_instead_of_hardlinks || (zero_copy_enabled && src_part->isStoredOnRemoteDiskWithZeroCopySupport()),
.metadata_version_to_write = metadata_snapshot->getMetadataVersion()
};
if (replace)
@ -8148,7 +8199,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
/// Replace can only work on the same disk
auto [dst_part, part_lock] = cloneAndLoadDataPart(
src_part,
TMP_PREFIX,
TMP_PREFIX_REPLACE_PARTITION_FROM,
dst_part_info,
metadata_snapshot,
clone_params,
@ -8163,7 +8214,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
/// Attach can work on another disk
auto [dst_part, part_lock] = cloneAndLoadDataPart(
src_part,
TMP_PREFIX,
TMP_PREFIX_REPLACE_PARTITION_FROM,
dst_part_info,
metadata_snapshot,
clone_params,
@ -8179,15 +8230,15 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
part_checksums.emplace_back(hash_hex);
}
ReplicatedMergeTreeLogEntryData entry;
auto entry = std::make_unique<ReplicatedMergeTreeLogEntryData>();
{
auto src_table_id = src_data.getStorageID();
entry.type = ReplicatedMergeTreeLogEntryData::REPLACE_RANGE;
entry.source_replica = replica_name;
entry.create_time = time(nullptr);
entry.replace_range_entry = std::make_shared<ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry>();
entry->type = ReplicatedMergeTreeLogEntryData::REPLACE_RANGE;
entry->source_replica = replica_name;
entry->create_time = time(nullptr);
entry->replace_range_entry = std::make_shared<ReplicatedMergeTreeLogEntryData::ReplaceRangeEntry>();
auto & entry_replace = *entry.replace_range_entry;
auto & entry_replace = *entry->replace_range_entry;
entry_replace.drop_range_part_name = drop_range_fake_part_name;
entry_replace.from_database = src_table_id.database_name;
entry_replace.from_table = src_table_id.table_name;
@ -8220,7 +8271,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
ephemeral_locks[i].getUnlockOp(ops);
}
if (auto txn = query_context->getZooKeeperMetadataTransaction())
if (txn)
txn->moveOpsTo(ops);
delimiting_block_lock->getUnlockOp(ops);
@ -8228,7 +8279,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
ops.emplace_back(zkutil::makeSetRequest(alter_partition_version_path, "", alter_partition_version_stat.version));
/// Just update version, because merges assignment relies on it
ops.emplace_back(zkutil::makeSetRequest(fs::path(zookeeper_path) / "log", "", -1));
ops.emplace_back(zkutil::makeCreateRequest(fs::path(zookeeper_path) / "log/log-", entry.toString(), zkutil::CreateMode::PersistentSequential));
ops.emplace_back(zkutil::makeCreateRequest(fs::path(zookeeper_path) / "log/log-", entry->toString(), zkutil::CreateMode::PersistentSequential));
Transaction transaction(*this, NO_TRANSACTION_RAW);
{
@ -8278,14 +8329,11 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
}
String log_znode_path = dynamic_cast<const Coordination::CreateResponse &>(*op_results.back()).path_created;
entry.znode_name = log_znode_path.substr(log_znode_path.find_last_of('/') + 1);
entry->znode_name = log_znode_path.substr(log_znode_path.find_last_of('/') + 1);
for (auto & lock : ephemeral_locks)
lock.assumeUnlocked();
lock2.reset();
lock1.reset();
/// We need to pull the REPLACE_RANGE before cleaning the replaced parts (otherwise CHeckThread may decide that parts are lost)
queue.pullLogsToQueue(getZooKeeperAndAssertNotReadonly(), {}, ReplicatedMergeTreeQueue::SYNC);
// No need to block operations further, especially that in case we have to wait for mutation to finish, the intent would block
@ -8294,10 +8342,7 @@ void StorageReplicatedMergeTree::replacePartitionFrom(
parts_holder.clear();
cleanup_thread.wakeup();
waitForLogEntryToBeProcessedIfNecessary(entry, query_context);
return;
return entry;
}
throw Exception(

View File

@ -37,6 +37,7 @@
#include <base/defines.h>
#include <Core/BackgroundSchedulePool.h>
#include <QueryPipeline/Pipe.h>
#include <Common/ProfileEventsScope.h>
#include <Storages/MergeTree/BackgroundJobsAssignee.h>
#include <Parsers/SyncReplicaMode.h>
@ -1013,6 +1014,18 @@ private:
DataPartsVector::const_iterator it;
};
const String TMP_PREFIX_REPLACE_PARTITION_FROM = "tmp_replace_from_";
std::unique_ptr<ReplicatedMergeTreeLogEntryData> replacePartitionFromImpl(
const Stopwatch & watch,
ProfileEventsScope & profile_events_scope,
const StorageMetadataPtr & metadata_snapshot,
const MergeTreeData & src_data,
const String & partition_id,
const zkutil::ZooKeeperPtr & zookeeper,
bool replace,
const bool & zero_copy_enabled,
const bool & always_use_copy_instead_of_hardlinks,
const ContextPtr & query_context);
};
String getPartNamePossiblyFake(MergeTreeDataFormatVersion format_version, const MergeTreePartInfo & part_info);

View File

@ -1,33 +0,0 @@
#!/usr/bin/env bash
set -xeuo pipefail
bash /usr/local/share/scripts/init-network.sh
# tune sysctl for network performance
cat > /etc/sysctl.d/10-network-memory.conf << EOF
net.core.netdev_max_backlog=2000
net.core.rmem_max=1048576
net.core.wmem_max=1048576
net.ipv4.tcp_max_syn_backlog=1024
net.ipv4.tcp_rmem=4096 131072 16777216
net.ipv4.tcp_wmem=4096 87380 16777216
net.ipv4.tcp_mem=4096 131072 16777216
EOF
sysctl -p /etc/sysctl.d/10-network-memory.conf
mkdir /home/ubuntu/registrystorage
sed -i 's/preserve_hostname: false/preserve_hostname: true/g' /etc/cloud/cloud.cfg
REGISTRY_PROXY_USERNAME=robotclickhouse
REGISTRY_PROXY_PASSWORD=$(aws ssm get-parameter --name dockerhub_robot_password --with-decryption | jq '.Parameter.Value' -r)
docker run -d --network=host -p 5000:5000 -v /home/ubuntu/registrystorage:/var/lib/registry \
-e REGISTRY_STORAGE_CACHE='' \
-e REGISTRY_HTTP_ADDR=0.0.0.0:5000 \
-e REGISTRY_STORAGE_DELETE_ENABLED=true \
-e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
-e REGISTRY_PROXY_PASSWORD="$REGISTRY_PROXY_PASSWORD" \
-e REGISTRY_PROXY_USERNAME="$REGISTRY_PROXY_USERNAME" \
--restart=always --name registry registry:2

View File

@ -1,254 +0,0 @@
#!/usr/bin/env bash
# The script is downloaded the AWS image builder Task Orchestrator and Executor (AWSTOE)
# We can't use `user data script` because cloud-init does not check the exit code
# The script is downloaded in the component named ci-infrastructure-prepare in us-east-1
# The link there must be adjusted to a particular RAW link, e.g.
# https://github.com/ClickHouse/ClickHouse/raw/653da5f00219c088af66d97a8f1ea3e35e798268/tests/ci/worker/prepare-ci-ami.sh
set -xeuo pipefail
echo "Running prepare script"
export DEBIAN_FRONTEND=noninteractive
export RUNNER_VERSION=2.317.0
export RUNNER_HOME=/home/ubuntu/actions-runner
deb_arch() {
case $(uname -m) in
x86_64 )
echo amd64;;
aarch64 )
echo arm64;;
esac
}
runner_arch() {
case $(uname -m) in
x86_64 )
echo x64;;
aarch64 )
echo arm64;;
esac
}
# We have test for cgroups, and it's broken with cgroups v2
# Ubuntu 22.04 has it enabled by default
sed -r '/GRUB_CMDLINE_LINUX=/ s/"(.*)"/"\1 systemd.unified_cgroup_hierarchy=0"/' -i /etc/default/grub
update-grub
apt-get update
apt-get install --yes --no-install-recommends \
apt-transport-https \
at \
atop \
binfmt-support \
build-essential \
ca-certificates \
curl \
gnupg \
jq \
lsb-release \
pigz \
ripgrep \
zstd \
python3-dev \
python3-pip \
qemu-user-static \
unzip \
gh
# Install docker
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(deb_arch) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install --yes --no-install-recommends docker-ce docker-buildx-plugin docker-ce-cli containerd.io
usermod -aG docker ubuntu
# enable ipv6 in containers (fixed-cidr-v6 is some random network mask)
cat <<EOT > /etc/docker/daemon.json
{
"ipv6": true,
"fixed-cidr-v6": "2001:db8:1::/64",
"log-driver": "json-file",
"log-opts": {
"max-file": "5",
"max-size": "1000m"
},
"insecure-registries" : ["dockerhub-proxy.dockerhub-proxy-zone:5000"],
"registry-mirrors" : ["http://dockerhub-proxy.dockerhub-proxy-zone:5000"]
}
EOT
# Install azure-cli
curl -sLS https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor -o /etc/apt/keyrings/microsoft.gpg
AZ_DIST=$(lsb_release -cs)
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/microsoft.gpg] https://packages.microsoft.com/repos/azure-cli/ $AZ_DIST main" | tee /etc/apt/sources.list.d/azure-cli.list
apt-get update
apt-get install --yes --no-install-recommends azure-cli
# Increase the limit on number of virtual memory mappings to aviod 'Cannot mmap' error
echo "vm.max_map_count = 2097152" > /etc/sysctl.d/01-increase-map-counts.conf
# Workarond for sanitizers uncompatibility with some kernels, see https://github.com/google/sanitizers/issues/856
echo "vm.mmap_rnd_bits=28" > /etc/sysctl.d/02-vm-mmap_rnd_bits.conf
systemctl restart docker
# buildx builder is user-specific
sudo -u ubuntu docker buildx version
sudo -u ubuntu docker buildx rm default-builder || : # if it's the second attempt
sudo -u ubuntu docker buildx create --use --name default-builder
pip install boto3 pygithub requests urllib3 unidiff dohq-artifactory jwt
rm -rf $RUNNER_HOME # if it's the second attempt
mkdir -p $RUNNER_HOME && cd $RUNNER_HOME
RUNNER_ARCHIVE="actions-runner-linux-$(runner_arch)-$RUNNER_VERSION.tar.gz"
curl -O -L "https://github.com/actions/runner/releases/download/v$RUNNER_VERSION/$RUNNER_ARCHIVE"
tar xzf "./$RUNNER_ARCHIVE"
rm -f "./$RUNNER_ARCHIVE"
./bin/installdependencies.sh
chown -R ubuntu:ubuntu $RUNNER_HOME
cd /home/ubuntu
curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip"
unzip -q awscliv2.zip
./aws/install
rm -rf /home/ubuntu/awscliv2.zip /home/ubuntu/aws
# SSH keys of core team
mkdir -p /home/ubuntu/.ssh
# ~/.ssh/authorized_keys is cleaned out, so we use deprecated but working ~/.ssh/authorized_keys2
TEAM_KEYS_URL=$(aws ssm get-parameter --region us-east-1 --name team-keys-url --query 'Parameter.Value' --output=text)
curl "${TEAM_KEYS_URL}" > /home/ubuntu/.ssh/authorized_keys2
chown ubuntu: /home/ubuntu/.ssh -R
chmod 0700 /home/ubuntu/.ssh
# Download cloudwatch agent and install config for it
wget --directory-prefix=/tmp https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/"$(deb_arch)"/latest/amazon-cloudwatch-agent.deb{,.sig}
gpg --recv-key --keyserver keyserver.ubuntu.com D58167303B789C72
gpg --verify /tmp/amazon-cloudwatch-agent.deb.sig
dpkg -i /tmp/amazon-cloudwatch-agent.deb
aws ssm get-parameter --region us-east-1 --name AmazonCloudWatch-github-runners --query 'Parameter.Value' --output text > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
systemctl enable amazon-cloudwatch-agent.service
echo "Install tailscale"
# Build get-authkey for tailscale
docker run --rm -v /usr/local/bin/:/host-local-bin -i golang:alpine sh -ex <<'EOF'
CGO_ENABLED=0 go install -tags tag:svc-core-ci-github tailscale.com/cmd/get-authkey@main
mv /go/bin/get-authkey /host-local-bin
EOF
# install tailscale
curl -fsSL "https://pkgs.tailscale.com/stable/ubuntu/$(lsb_release -cs).noarmor.gpg" > /usr/share/keyrings/tailscale-archive-keyring.gpg
curl -fsSL "https://pkgs.tailscale.com/stable/ubuntu/$(lsb_release -cs).tailscale-keyring.list" > /etc/apt/sources.list.d/tailscale.list
apt-get update
apt-get install tailscale --yes --no-install-recommends
# Create a common script for the instances
mkdir /usr/local/share/scripts -p
setup_cloudflare_dns() {
# Add cloudflare DNS as a fallback
# Get default gateway interface
local IFACE ETH_DNS CLOUDFLARE_NS new_dns
IFACE=$(ip --json route list | jq '.[]|select(.dst == "default").dev' --raw-output)
# `Link 2 (eth0): 172.31.0.2`
ETH_DNS=$(resolvectl dns "$IFACE") || :
CLOUDFLARE_NS=1.1.1.1
if [[ "$ETH_DNS" ]] && [[ "${ETH_DNS#*: }" != *"$CLOUDFLARE_NS"* ]]; then
# Cut the leading legend
ETH_DNS=${ETH_DNS#*: }
# shellcheck disable=SC2206
new_dns=(${ETH_DNS} "$CLOUDFLARE_NS")
resolvectl dns "$IFACE" "${new_dns[@]}"
fi
}
setup_tailscale() {
# Setup tailscale, the very first action
local TS_API_CLIENT_ID TS_API_CLIENT_SECRET TS_AUTHKEY RUNNER_TYPE
TS_API_CLIENT_ID=$(aws ssm get-parameter --region us-east-1 --name /tailscale/api-client-id --query 'Parameter.Value' --output text --with-decryption)
TS_API_CLIENT_SECRET=$(aws ssm get-parameter --region us-east-1 --name /tailscale/api-client-secret --query 'Parameter.Value' --output text --with-decryption)
RUNNER_TYPE=$(/usr/local/bin/aws ec2 describe-tags --filters "Name=resource-id,Values=$INSTANCE_ID" --query "Tags[?Key=='github:runner-type'].Value" --output text)
RUNNER_TYPE=${RUNNER_TYPE:-unknown}
# Clean possible garbage from the runner type
RUNNER_TYPE=${RUNNER_TYPE//[^0-9a-z]/-}
TS_AUTHKEY=$(TS_API_CLIENT_ID="$TS_API_CLIENT_ID" TS_API_CLIENT_SECRET="$TS_API_CLIENT_SECRET" \
get-authkey -tags tag:svc-core-ci-github -ephemeral)
tailscale up --ssh --auth-key="$TS_AUTHKEY" --hostname="ci-runner-$RUNNER_TYPE-$INSTANCE_ID"
}
cat > /usr/local/share/scripts/init-network.sh << EOF
!/usr/bin/env bash
$(declare -f setup_cloudflare_dns)
$(declare -f setup_tailscale)
# If the script is sourced, it will return now and won't execute functions
return 0 &>/dev/null || :
echo Setup Cloudflare DNS
setup_cloudflare_dns
echo Setup Tailscale VPN
setup_tailscale
EOF
chmod +x /usr/local/share/scripts/init-network.sh
# The following line is used in aws TOE check.
touch /var/tmp/clickhouse-ci-ami.success
# END OF THE SCRIPT
# TOE (Task Orchestrator and Executor) description
# name: CIInfrastructurePrepare
# description: installs the infrastructure for ClickHouse CI runners
# schemaVersion: 1.0
#
# phases:
# - name: build
# steps:
# - name: DownloadRemoteScript
# maxAttempts: 3
# action: WebDownload
# onFailure: Abort
# inputs:
# - source: https://github.com/ClickHouse/ClickHouse/raw/653da5f00219c088af66d97a8f1ea3e35e798268/tests/ci/worker/prepare-ci-ami.sh
# destination: /tmp/prepare-ci-ami.sh
# - name: RunScript
# maxAttempts: 3
# action: ExecuteBash
# onFailure: Abort
# inputs:
# commands:
# - bash -x '{{build.DownloadRemoteScript.inputs[0].destination}}'
#
#
# - name: validate
# steps:
# - name: RunScript
# maxAttempts: 3
# action: ExecuteBash
# onFailure: Abort
# inputs:
# commands:
# - ls /var/tmp/clickhouse-ci-ami.success
# - name: Cleanup
# action: DeleteFile
# onFailure: Abort
# maxAttempts: 3
# inputs:
# - path: /var/tmp/clickhouse-ci-ami.success

View File

@ -10,11 +10,12 @@ REPLACE recursive
4 8
1
ATTACH FROM
5 8
6 8
10 12
OPTIMIZE
5 8 5
5 8 3
10 12 9
10 12 5
After restart
5 8
10 12
DETACH+ATTACH PARTITION
3 4
7 7

View File

@ -53,12 +53,16 @@ DROP TABLE src;
CREATE TABLE src (p UInt64, k String, d UInt64) ENGINE = MergeTree PARTITION BY p ORDER BY k;
INSERT INTO src VALUES (1, '0', 1);
INSERT INTO src VALUES (1, '1', 1);
INSERT INTO src VALUES (2, '2', 1);
INSERT INTO src VALUES (3, '3', 1);
SYSTEM STOP MERGES dst;
INSERT INTO dst VALUES (1, '1', 2);
INSERT INTO dst VALUES (1, '1', 2), (1, '2', 0);
ALTER TABLE dst ATTACH PARTITION 1 FROM src;
SELECT count(), sum(d) FROM dst;
ALTER TABLE dst ATTACH PARTITION ALL FROM src;
SELECT count(), sum(d) FROM dst;
SELECT 'OPTIMIZE';
SELECT count(), sum(d), uniqExact(_part) FROM dst;

View File

@ -16,6 +16,7 @@ REPLACE recursive
ATTACH FROM
5 8
5 8
7 12
REPLACE with fetch
4 6
4 6

View File

@ -1,5 +1,5 @@
#!/usr/bin/env bash
# Tags: zookeeper, no-object-storage
# Tags: zookeeper, no-object-storage, long
# Because REPLACE PARTITION does not forces immediate removal of replaced data parts from local filesystem
# (it tries to do it as quick as possible, but it still performed in separate thread asynchronously)
@ -82,6 +82,8 @@ $CLICKHOUSE_CLIENT --query="DROP TABLE src;"
$CLICKHOUSE_CLIENT --query="CREATE TABLE src (p UInt64, k String, d UInt64) ENGINE = MergeTree PARTITION BY p ORDER BY k;"
$CLICKHOUSE_CLIENT --query="INSERT INTO src VALUES (1, '0', 1);"
$CLICKHOUSE_CLIENT --query="INSERT INTO src VALUES (1, '1', 1);"
$CLICKHOUSE_CLIENT --query="INSERT INTO src VALUES (3, '1', 2);"
$CLICKHOUSE_CLIENT --query="INSERT INTO src VALUES (4, '1', 2);"
$CLICKHOUSE_CLIENT --query="INSERT INTO dst_r2 VALUES (1, '1', 2);"
query_with_retry "ALTER TABLE dst_r2 ATTACH PARTITION 1 FROM src;"
@ -90,6 +92,13 @@ $CLICKHOUSE_CLIENT --query="SYSTEM SYNC REPLICA dst_r1;"
$CLICKHOUSE_CLIENT --query="SELECT count(), sum(d) FROM dst_r1;"
$CLICKHOUSE_CLIENT --query="SELECT count(), sum(d) FROM dst_r2;"
query_with_retry "ALTER TABLE dst_r2 ATTACH PARTITION ALL FROM src;"
$CLICKHOUSE_CLIENT --query="SYSTEM SYNC REPLICA dst_r2;"
$CLICKHOUSE_CLIENT --query="SELECT count(), sum(d) FROM dst_r2;"
query_with_retry "ALTER TABLE dst_r2 DROP PARTITION 3;"
$CLICKHOUSE_CLIENT --query="SYSTEM SYNC REPLICA dst_r2;"
query_with_retry "ALTER TABLE dst_r2 DROP PARTITION 4;"
$CLICKHOUSE_CLIENT --query="SYSTEM SYNC REPLICA dst_r2;"
$CLICKHOUSE_CLIENT --query="SELECT 'REPLACE with fetch';"
$CLICKHOUSE_CLIENT --query="DROP TABLE src;"