Merge branch 'master' into master

Nikolay Degterinsky 2022-10-06 16:55:34 +02:00 committed by GitHub
commit c5ff73cb25
12 changed files with 272 additions and 97 deletions

View File

@ -1,13 +1,34 @@
---
sidebar_label: Installation
sidebar_position: 1
keywords: [clickhouse, install, installation, docs]
description: ClickHouse can run on any Linux, FreeBSD, or Mac OS X with x86_64, AArch64, or PowerPC64LE CPU architecture.
slug: /en/getting-started/install
title: Installation
sidebar_label: Install
keywords: [clickhouse, install, getting started, quick start]
slug: /en/install
---
## System Requirements {#system-requirements}
# Installing ClickHouse
You have two options for getting up and running with ClickHouse:
- **[ClickHouse Cloud](https://clickhouse.cloud/):** the official ClickHouse as a service, built by, maintained by, and supported by the creators of ClickHouse
- **Self-managed ClickHouse:** ClickHouse can run on any Linux, FreeBSD, or Mac OS X with x86_64, AArch64, or PowerPC64LE CPU architecture
## ClickHouse Cloud
The quickest and easiest way to get up and running with ClickHouse is to create a new service in [ClickHouse Cloud](https://clickhouse.cloud/):
<div class="eighty-percent">
![Create a ClickHouse Cloud service](@site/docs/en/_snippets/images/createservice1.png)
</div>
Once your Cloud service is provisioned, you will be able to [connect to it](/docs/en/integrations/connect-a-client.md) and start [inserting data](/docs/en/integrations/data-ingestion.md).
:::note
The [Quick Start](/docs/en/quick-start.mdx) walks through the steps to get a ClickHouse Cloud service up and running, connecting to it, and inserting data.
:::
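For example, once the service hostname and password from the Cloud console are at hand, a first connection from a terminal could look like the sketch below (the hostname and the `CLICKHOUSE_CLOUD_PASSWORD` variable are placeholders, not values from this page):

```bash
# Placeholder hostname and password: substitute the values shown for your own service.
export CLICKHOUSE_CLOUD_PASSWORD='...'
clickhouse-client \
    --host your-service.clickhouse.cloud \
    --secure \
    --user default \
    --password "$CLICKHOUSE_CLOUD_PASSWORD"
```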
## Self-Managed Requirements
### CPU Architecture
ClickHouse can run on any Linux, FreeBSD, or Mac OS X with x86_64, AArch64, or PowerPC64LE CPU architecture.
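To check which architecture a host reports before picking a package, a quick sketch (the expected values are `x86_64`, `aarch64`, or `ppc64le`):

```bash
# Print the CPU architecture reported by the kernel.
uname -m
```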
@ -19,6 +40,55 @@ $ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not
To run ClickHouse on processors that do not support SSE 4.2 or have AArch64 or PowerPC64LE architecture, you should [build ClickHouse from sources](#from-sources) with proper configuration adjustments.
ClickHouse implements parallel data processing and uses all the hardware resources available. When choosing a processor, take into account that ClickHouse works more efficiently at configurations with a large number of cores but a lower clock rate than at configurations with fewer cores and a higher clock rate. For example, 16 cores with 2600 MHz is preferable to 8 cores with 3600 MHz.
It is recommended to use **Turbo Boost** and **hyper-threading** technologies. They significantly improve performance with a typical workload.
### RAM {#ram}
We recommend using a minimum of 4GB of RAM to perform non-trivial queries. The ClickHouse server can run with a much smaller amount of RAM, but it requires memory for processing queries.
The required volume of RAM depends on:
- The complexity of queries.
- The amount of data that is processed in queries.
To calculate the required volume of RAM, you should estimate the size of temporary data for [GROUP BY](/docs/en/sql-reference/statements/select/group-by.md#select-group-by-clause), [DISTINCT](/docs/en/sql-reference/statements/select/distinct.md#select-distinct), [JOIN](/docs/en/sql-reference/statements/select/join.md#select-join) and other operations you use.
ClickHouse can use external memory for temporary data. See [GROUP BY in External Memory](/docs/en/sql-reference/statements/select/group-by.md#select-group-by-in-external-memory) for details.
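As a rough illustration of the knobs involved, the per-query settings below cap memory usage and let large `GROUP BY` states spill to disk; the table and column names are hypothetical and the byte thresholds are only examples:

```bash
# Hypothetical table `hits` and column `UserID`; the thresholds are illustrative:
# cap the query at ~20 GB of RAM and spill aggregation state to disk after ~10 GB.
clickhouse-client --query "
    SELECT UserID, count()
    FROM hits
    GROUP BY UserID
    SETTINGS max_memory_usage = 20000000000,
             max_bytes_before_external_group_by = 10000000000
"
```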
### Swap File {#swap-file}
Disable the swap file for production environments.
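On a typical Linux host this can be done as sketched below; to make it permanent, also remove or comment out the swap entry in `/etc/fstab`:

```bash
# Turn off all active swap devices until the next reboot.
sudo swapoff -a

# Confirm that no swap is in use.
free -h
```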
### Storage Subsystem {#storage-subsystem}
You need to have 2GB of free disk space to install ClickHouse.
The volume of storage required for your data should be calculated separately. Assessment should include:
- Estimation of the data volume.
You can take a sample of the data and get the average size of a row from it. Then multiply the value by the number of rows you plan to store.
- The data compression coefficient.
To estimate the data compression coefficient, load a sample of your data into ClickHouse, and compare the actual size of the data with the size of the stored table; a sample query for measuring the ratio is shown below. For example, clickstream data is usually compressed by 6-10 times.
To calculate the final volume of data to be stored, apply the compression coefficient to the estimated data volume. If you plan to store data in several replicas, then multiply the estimated volume by the number of replicas.
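One way to measure the achieved compression on a loaded sample is to compare compressed and uncompressed bytes in the `system.parts` table; a sketch, where `your_table` is a placeholder for the table holding the sample:

```bash
clickhouse-client --query "
    SELECT
        table,
        formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
        formatReadableSize(sum(data_compressed_bytes))   AS compressed,
        round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
    FROM system.parts
    WHERE active AND table = 'your_table'
    GROUP BY table
"
```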
### Network {#network}
If possible, use networks of 10G or higher class.
The network bandwidth is critical for processing distributed queries with a large amount of intermediate data. Besides, network speed affects replication processes.
### Software {#software}
ClickHouse is developed primarily for the Linux family of operating systems. The recommended Linux distribution is Ubuntu. The `tzdata` package should be installed in the system.
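On Ubuntu, making sure the time-zone database is present is a one-liner (a sketch assuming apt-based tooling):

```bash
# Install the tzdata package if it is not already present.
sudo apt-get update && sudo apt-get install --yes tzdata
```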
## Self-Managed Install
## Available Installation Options {#available-installation-options}
### From DEB Packages {#install-from-deb-packages}
@ -58,9 +128,9 @@ clickhouse-client # or "clickhouse-client --password" if you set up a password.
</details>
You can replace `stable` with `lts` to use different [release kinds](../faq/operations/production.md) based on your needs.
You can replace `stable` with `lts` to use different [release kinds](/docs/en/faq/operations/production.md) based on your needs.
You can also download and install packages manually from [here](https://packages.clickhouse.com/deb/pool/stable).
You can also download and install packages manually from [here](https://packages.clickhouse.com/deb/pool/main/c/).
#### Packages {#packages}
@ -105,7 +175,7 @@ clickhouse-client # or "clickhouse-client --password" if you set up a password.
</details>
You can replace `stable` with `lts` to use different [release kinds](../faq/operations/production.md) based on your needs.
You can replace `stable` with `lts` to use different [release kinds](/docs/en/faq/operations/production.md) based on your needs.
Then run these commands to install packages:
@ -226,7 +296,7 @@ Use the `clickhouse client` to connect to the server, or `clickhouse local` to p
### From Sources {#from-sources}
To manually compile ClickHouse, follow the instructions for [Linux](../development/build.md) or [Mac OS X](../development/build-osx.md).
To manually compile ClickHouse, follow the instructions for [Linux](/docs/en/development/build.md) or [Mac OS X](/docs/en/development/build-osx.md).
You can compile packages and install them or use programs without installing packages. Also by building manually you can disable SSE 4.2 requirement or build for AArch64 CPUs.
@ -281,7 +351,7 @@ If the configuration file is in the current directory, you do not need to specif
ClickHouse supports access restriction settings. They are located in the `users.xml` file (next to `config.xml`).
By default, access is allowed from anywhere for the `default` user, without a password. See `user/default/networks`.
For more information, see the section [“Configuration Files”](../operations/configuration-files.md).
For more information, see the section [“Configuration Files”](/docs/en/operations/configuration-files.md).
After launching the server, you can use the command-line client to connect to it:
@ -292,7 +362,7 @@ $ clickhouse-client
By default, it connects to `localhost:9000` on behalf of the user `default` without a password. It can also be used to connect to a remote server using `--host` argument.
The terminal must use UTF-8 encoding.
For more information, see the section [“Command-line client”](../interfaces/cli.md).
For more information, see the section [“Command-line client”](/docs/en/interfaces/cli.md).
Example:
@ -317,6 +387,5 @@ SELECT 1
**Congratulations, the system works!**
To continue experimenting, you can download one of the test data sets or go through [tutorial](./../tutorial.md).
To continue experimenting, you can download one of the test data sets or go through [tutorial](/docs/en/tutorial.md).
[Original article](https://clickhouse.com/docs/en/getting_started/install/) <!--hide-->

View File

@ -1,60 +0,0 @@
---
slug: /en/operations/requirements
sidebar_position: 44
sidebar_label: Requirements
---
# Requirements
## CPU
For installation from prebuilt deb packages, use a CPU with x86_64 architecture and support for SSE 4.2 instructions. To run ClickHouse with processors that do not support SSE 4.2 or have AArch64 or PowerPC64LE architecture, you should build ClickHouse from sources.
ClickHouse implements parallel data processing and uses all the hardware resources available. When choosing a processor, take into account that ClickHouse works more efficiently at configurations with a large number of cores but a lower clock rate than at configurations with fewer cores and a higher clock rate. For example, 16 cores with 2600 MHz is preferable to 8 cores with 3600 MHz.
It is recommended to use **Turbo Boost** and **hyper-threading** technologies. They significantly improve performance with a typical workload.
## RAM {#ram}
We recommend using a minimum of 4GB of RAM to perform non-trivial queries. The ClickHouse server can run with a much smaller amount of RAM, but it requires memory for processing queries.
The required volume of RAM depends on:
- The complexity of queries.
- The amount of data that is processed in queries.
To calculate the required volume of RAM, you should estimate the size of temporary data for [GROUP BY](../sql-reference/statements/select/group-by.md#select-group-by-clause), [DISTINCT](../sql-reference/statements/select/distinct.md#select-distinct), [JOIN](../sql-reference/statements/select/join.md#select-join) and other operations you use.
ClickHouse can use external memory for temporary data. See [GROUP BY in External Memory](../sql-reference/statements/select/group-by.md#select-group-by-in-external-memory) for details.
## Swap File {#swap-file}
Disable the swap file for production environments.
## Storage Subsystem {#storage-subsystem}
You need to have 2GB of free disk space to install ClickHouse.
The volume of storage required for your data should be calculated separately. Assessment should include:
- Estimation of the data volume.
You can take a sample of the data and get the average size of a row from it. Then multiply the value by the number of rows you plan to store.
- The data compression coefficient.
To estimate the data compression coefficient, load a sample of your data into ClickHouse, and compare the actual size of the data with the size of the stored table. For example, clickstream data is usually compressed by 6-10 times.
To calculate the final volume of data to be stored, apply the compression coefficient to the estimated data volume. If you plan to store data in several replicas, then multiply the estimated volume by the number of replicas.
## Network {#network}
If possible, use networks of 10G or higher class.
The network bandwidth is critical for processing distributed queries with a large amount of intermediate data. Besides, network speed affects replication processes.
## Software {#software}
ClickHouse is developed primarily for the Linux family of operating systems. The recommended Linux distribution is Ubuntu. The `tzdata` package should be installed in the system.
ClickHouse can also work in other operating system families. See details in the [install guide](../getting-started/install.md) section of the documentation.

View File

@ -35,6 +35,33 @@ public:
BZ2_bzDecompressEnd(&stream);
}
void reinitialize()
{
auto avail_out = stream.avail_out;
auto * next_out = stream.next_out;
int ret = BZ2_bzDecompressEnd(&stream);
if (ret != BZ_OK)
throw Exception(
ErrorCodes::BZIP2_STREAM_DECODER_FAILED,
"bzip2 stream encoder reinit decompress end failed: error code: {}",
ret);
memset(&stream, 0, sizeof(stream));
ret = BZ2_bzDecompressInit(&stream, 0, 0);
if (ret != BZ_OK)
throw Exception(
ErrorCodes::BZIP2_STREAM_DECODER_FAILED,
"bzip2 stream encoder reinit failed: error code: {}",
ret);
stream.avail_out = avail_out;
stream.next_out = next_out;
}
bz_stream stream;
};
@ -68,24 +95,24 @@ bool Bzip2ReadBuffer::nextImpl()
ret = BZ2_bzDecompress(&bz->stream);
in->position() = in->buffer().end() - bz->stream.avail_in;
if (ret == BZ_STREAM_END && !in->eof())
{
bz->reinitialize();
bz->stream.avail_in = in->buffer().end() - in->position();
bz->stream.next_in = in->position();
ret = BZ_OK;
}
}
while (bz->stream.avail_out == internal_buffer.size() && ret == BZ_OK && !in->eof());
working_buffer.resize(internal_buffer.size() - bz->stream.avail_out);
if (ret == BZ_STREAM_END)
if (ret == BZ_STREAM_END && in->eof())
{
if (in->eof())
{
eof_flag = true;
return !working_buffer.empty();
}
else
{
throw Exception(
ErrorCodes::BZIP2_STREAM_DECODER_FAILED,
"bzip2 decoder finished, but input stream has not exceeded: error code: {}", ret);
}
eof_flag = true;
return !working_buffer.empty();
}
if (ret != BZ_OK)

View File

@ -1,5 +1,6 @@
#include <Core/Defines.h>
#include <ranges>
#include "Common/hex.h"
#include <Common/Macros.h>
#include <Common/StringUtils/StringUtils.h>
@ -7759,14 +7760,77 @@ std::pair<bool, NameSet> StorageReplicatedMergeTree::unlockSharedData(const IMer
if (!has_metadata_in_zookeeper.has_value() && !zookeeper->exists(zookeeper_path))
return std::make_pair(true, NameSet{});
return unlockSharedDataByID(part.getUniqueId(), getTableSharedID(), part.name, replica_name, part.data_part_storage->getDiskType(), zookeeper, *getSettings(), log,
zookeeper_path);
return unlockSharedDataByID(
part.getUniqueId(), getTableSharedID(), part.name, replica_name,
part.data_part_storage->getDiskType(), zookeeper, *getSettings(), log, zookeeper_path, format_version);
}
namespace
{
/// What is going on here?
/// Actually we need this code because of flaws in hardlink tracking. When we create a child part during a mutation, we can hardlink some files from the parent part, like:
/// all_0_0_0:
///     a.bin a.mrk2 columns.txt ...
/// all_0_0_0_1:
///     a.bin a.mrk2 columns.txt      <- hardlinks to the parent's a.bin and a.mrk2
/// So when we delete all_0_0_0, the blobs for a.bin and a.mrk2 are not removed because all_0_0_0_1 still uses them.
/// But sometimes we need the opposite. When we delete all_0_0_0_1, it may not have been replicated to other replicas yet, so we are the only owner of this part.
/// In this case, dropping all_0_0_0_1 would also drop the blobs for all_0_0_0, which would lead to data loss. For such cases we need to check that other replicas
/// still need the parent part.
NameSet getParentLockedBlobs(zkutil::ZooKeeperPtr zookeeper_ptr, const std::string & zero_copy_part_path_prefix, const std::string & part_info_str, MergeTreeDataFormatVersion format_version)
{
NameSet files_not_to_remove;
MergeTreePartInfo part_info = MergeTreePartInfo::fromPartName(part_info_str, format_version);
/// No mutations -- no hardlinks -- no issues
if (part_info.mutation == 0)
return files_not_to_remove;
/// Getting all zero copy parts
Strings parts_str;
zookeeper_ptr->tryGetChildren(zero_copy_part_path_prefix, parts_str);
/// Parsing infos
std::vector<MergeTreePartInfo> parts_infos;
for (const auto & part_str : parts_str)
{
MergeTreePartInfo parent_candidate_info = MergeTreePartInfo::fromPartName(part_str, format_version);
parts_infos.push_back(parent_candidate_info);
}
/// Sorting is important. We need to find our closest parent; e.g. for part all_0_0_0_64 the candidate parents could be:
/// all_0_0_0_6   <- we need the closest parent, not the others
/// all_0_0_0_1
/// all_0_0_0
std::sort(parts_infos.begin(), parts_infos.end());
/// In reverse order to process from bigger to smaller
for (const auto & parent_candidate_info : parts_infos | std::views::reverse)
{
if (parent_candidate_info == part_info)
continue;
/// We are a mutation child of this parent
if (part_info.isMutationChildOf(parent_candidate_info))
{
/// Get hardlinked files
String files_not_to_remove_str;
zookeeper_ptr->tryGet(fs::path(zero_copy_part_path_prefix) / parent_candidate_info.getPartName(), files_not_to_remove_str);
boost::split(files_not_to_remove, files_not_to_remove_str, boost::is_any_of("\n "));
break;
}
}
return files_not_to_remove;
}
}
std::pair<bool, NameSet> StorageReplicatedMergeTree::unlockSharedDataByID(
String part_id, const String & table_uuid, const String & part_name,
const String & replica_name_, std::string disk_type, zkutil::ZooKeeperPtr zookeeper_ptr, const MergeTreeSettings & settings,
Poco::Logger * logger, const String & zookeeper_path_old)
Poco::Logger * logger, const String & zookeeper_path_old, MergeTreeDataFormatVersion data_format_version)
{
boost::replace_all(part_id, "/", "_");
@ -7784,6 +7848,9 @@ std::pair<bool, NameSet> StorageReplicatedMergeTree::unlockSharedDataByID(
if (!files_not_to_remove_str.empty())
boost::split(files_not_to_remove, files_not_to_remove_str, boost::is_any_of("\n "));
auto parent_not_to_remove = getParentLockedBlobs(zookeeper_ptr, fs::path(zc_zookeeper_path).parent_path(), part_name, data_format_version);
files_not_to_remove.insert(parent_not_to_remove.begin(), parent_not_to_remove.end());
String zookeeper_part_uniq_node = fs::path(zc_zookeeper_path) / part_id;
/// Delete our replica node for part from zookeeper (we are not interested in it anymore)
@ -7861,7 +7928,7 @@ std::pair<bool, NameSet> StorageReplicatedMergeTree::unlockSharedDataByID(
else
{
LOG_TRACE(logger, "Can't remove parent zookeeper lock {} for part {}, because children {} ({}) exists",
zookeeper_part_node, part_name, children.size(), fmt::join(children, ", "));
zookeeper_part_node, part_name, children.size(), fmt::join(children, ", "));
}
}
@ -8397,9 +8464,14 @@ bool StorageReplicatedMergeTree::removeSharedDetachedPart(DiskPtr disk, const St
{
String id = disk->getUniqueId(checksums);
bool can_remove = false;
std::tie(can_remove, files_not_to_remove) = StorageReplicatedMergeTree::unlockSharedDataByID(id, table_uuid, part_name,
detached_replica_name, toString(disk->getDataSourceDescription().type), zookeeper, local_context->getReplicatedMergeTreeSettings(), &Poco::Logger::get("StorageReplicatedMergeTree"),
detached_zookeeper_path);
std::tie(can_remove, files_not_to_remove) = StorageReplicatedMergeTree::unlockSharedDataByID(
id, table_uuid, part_name,
detached_replica_name,
toString(disk->getDataSourceDescription().type),
zookeeper, local_context->getReplicatedMergeTreeSettings(),
&Poco::Logger::get("StorageReplicatedMergeTree"),
detached_zookeeper_path,
MERGE_TREE_DATA_MIN_FORMAT_VERSION_WITH_CUSTOM_PARTITIONING);
keep_shared = !can_remove;
}

View File

@ -283,7 +283,7 @@ public:
/// Return false if data is still used by another node
static std::pair<bool, NameSet> unlockSharedDataByID(String part_id, const String & table_uuid, const String & part_name, const String & replica_name_,
std::string disk_type, zkutil::ZooKeeperPtr zookeeper_, const MergeTreeSettings & settings, Poco::Logger * logger,
const String & zookeeper_path_old);
const String & zookeeper_path_old, MergeTreeDataFormatVersion data_format_version);
/// Fetch part only if some replica has it on shared storage like S3
DataPartStoragePtr tryToFetchIfShared(const IMergeTreeDataPart & part, const DiskPtr & disk, const String & path) override;

View File

@ -3,7 +3,7 @@ set -xeuo pipefail
echo "Running prepare script"
export DEBIAN_FRONTEND=noninteractive
export RUNNER_VERSION=2.296.2
export RUNNER_VERSION=2.298.2
export RUNNER_HOME=/home/ubuntu/actions-runner
deb_arch() {
@ -33,6 +33,7 @@ apt-get update
apt-get install --yes --no-install-recommends \
apt-transport-https \
at \
atop \
binfmt-support \
build-essential \

View File

@ -123,3 +123,4 @@ Hello, world
Hello, world
0
Part1 Part2
Part1 Part2

View File

@ -51,5 +51,6 @@ echo "'Hello, world'" | bzip2 -c | ${CLICKHOUSE_CURL} -sS --data-binary @- -H 'C
${CLICKHOUSE_CURL} -sS "${CLICKHOUSE_URL}&enable_http_compression=1" -H 'Accept-Encoding: gzip' -d 'SELECT number FROM system.numbers LIMIT 0' | wc -c;
# POST multiple concatenated gzip streams.
# POST multiple concatenated gzip and bzip2 streams.
(echo -n "SELECT 'Part1" | gzip -c; echo " Part2'" | gzip -c) | ${CLICKHOUSE_CURL} -sS -H 'Content-Encoding: gzip' "${CLICKHOUSE_URL}" --data-binary @-
(echo -n "SELECT 'Part1" | bzip2 -c; echo " Part2'" | bzip2 -c) | ${CLICKHOUSE_CURL} -sS -H 'Content-Encoding: bz2' "${CLICKHOUSE_URL}" --data-binary @-

View File

@ -0,0 +1,3 @@
1 Hello
1 Hello
1 Hello

View File

@ -0,0 +1,36 @@
DROP TABLE IF EXISTS mutation_1;
DROP TABLE IF EXISTS mutation_2;
CREATE TABLE mutation_1
(
a UInt64,
b String
)
ENGINE = ReplicatedMergeTree('/clickhouse/test/{database}/t', '1')
ORDER BY tuple() SETTINGS min_bytes_for_wide_part=0, allow_remote_fs_zero_copy_replication=1;
CREATE TABLE mutation_2
(
a UInt64,
b String
)
ENGINE = ReplicatedMergeTree('/clickhouse/test/{database}/t', '2')
ORDER BY tuple() SETTINGS min_bytes_for_wide_part=0, allow_remote_fs_zero_copy_replication=1;
INSERT INTO mutation_1 VALUES (1, 'Hello');
SYSTEM SYNC REPLICA mutation_2;
SYSTEM STOP REPLICATION QUEUES mutation_2;
ALTER TABLE mutation_1 UPDATE a = 2 WHERE b = 'xxxxxx' SETTINGS mutations_sync=1;
SELECT * from mutation_1;
SELECT * from mutation_2;
DROP TABLE mutation_1 SYNC;
SELECT * from mutation_2;
DROP TABLE IF EXISTS mutation_1;
DROP TABLE IF EXISTS mutation_2;

View File

@ -0,0 +1,4 @@
0
1
2
3

View File

@ -0,0 +1,21 @@
#!/usr/bin/env bash
# Tags: no-fasttest
# Tag no-fasttest: depends on bzip2
CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
# shellcheck source=../shell_config.sh
. "$CURDIR"/../shell_config.sh
USER_FILES_PATH=$(clickhouse-client --query "select _path,_file from file('nonexist.txt', 'CSV', 'val1 char')" 2>&1 | grep Exception | awk '{gsub("/nonexist.txt","",$9); print $9}')
WORKING_FOLDER_02457="${USER_FILES_PATH}/${CLICKHOUSE_DATABASE}"
rm -rf "${WORKING_FOLDER_02457}"
mkdir "${WORKING_FOLDER_02457}"
${CLICKHOUSE_CLIENT} --query "SELECT * FROM numbers(0, 2) INTO OUTFILE '${WORKING_FOLDER_02457}/file_1.bz2'"
${CLICKHOUSE_CLIENT} --query "SELECT * FROM numbers(2, 2) INTO OUTFILE '${WORKING_FOLDER_02457}/file_2.bz2'"
cat ${WORKING_FOLDER_02457}/file_1.bz2 ${WORKING_FOLDER_02457}/file_2.bz2 > ${WORKING_FOLDER_02457}/concatenated.bz2
${CLICKHOUSE_CLIENT} --query "SELECT * FROM file('${WORKING_FOLDER_02457}/concatenated.bz2', 'TabSeparated', 'col Int64')"
rm -rf "${WORKING_FOLDER_02457}"