merge master

Guillaume Tassery 2019-09-16 09:27:38 +02:00
commit 8920f36d96
167 changed files with 5466 additions and 1155 deletions

@ -372,7 +372,7 @@ It allows to set commit mode: after every batch of messages is handled, or after
* Renamed functions `leastSqr` to `simpleLinearRegression`, `LinearRegression` to `linearRegression`, `LogisticRegression` to `logisticRegression`. [#5391](https://github.com/yandex/ClickHouse/pull/5391) ([Nikolai Kochetov](https://github.com/KochetovNicolai))
### Performance Improvements
* Paralellize processing of parts of non-replicated MergeTree tables in ALTER MODIFY query. [#4639](https://github.com/yandex/ClickHouse/pull/4639) ([Ivan Kush](https://github.com/IvanKush))
* Parallelize processing of parts of non-replicated MergeTree tables in ALTER MODIFY query. [#4639](https://github.com/yandex/ClickHouse/pull/4639) ([Ivan Kush](https://github.com/IvanKush))
* Optimizations in regular expressions extraction. [#5193](https://github.com/yandex/ClickHouse/pull/5193) [#5191](https://github.com/yandex/ClickHouse/pull/5191) ([Danila Kutenin](https://github.com/danlark1))
* Do not add right join key column to join result if it's used only in join on section. [#5260](https://github.com/yandex/ClickHouse/pull/5260) ([Artem Zuikov](https://github.com/4ertus2))
* Freeze the Kafka buffer after first empty response. It avoids multiple invokations of `ReadBuffer::next()` for empty result in some row-parsing streams. [#5283](https://github.com/yandex/ClickHouse/pull/5283) ([Ivan](https://github.com/abyss7))
@ -599,7 +599,7 @@ lee](https://github.com/neverlee))
* Fix error `Unknown log entry type: 0` after `OPTIMIZE TABLE FINAL` query. [#4683](https://github.com/yandex/ClickHouse/pull/4683) ([Amos Bird](https://github.com/amosbird))
* Wrong arguments to `hasAny` or `hasAll` functions may lead to segfault. [#4698](https://github.com/yandex/ClickHouse/pull/4698) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Deadlock may happen while executing `DROP DATABASE dictionary` query. [#4701](https://github.com/yandex/ClickHouse/pull/4701) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fix undefinied behavior in `median` and `quantile` functions. [#4702](https://github.com/yandex/ClickHouse/pull/4702) ([hcz](https://github.com/hczhcz))
* Fix undefined behavior in `median` and `quantile` functions. [#4702](https://github.com/yandex/ClickHouse/pull/4702) ([hcz](https://github.com/hczhcz))
* Fix compression level detection when `network_compression_method` in lowercase. Broken in v19.1. [#4706](https://github.com/yandex/ClickHouse/pull/4706) ([proller](https://github.com/proller))
* Fixed ignorance of `<timezone>UTC</timezone>` setting (fixes issue [#4658](https://github.com/yandex/ClickHouse/issues/4658)). [#4718](https://github.com/yandex/ClickHouse/pull/4718) ([proller](https://github.com/proller))
* Fix `histogram` function behaviour with `Distributed` tables. [#4741](https://github.com/yandex/ClickHouse/pull/4741) ([olegkv](https://github.com/olegkv))
@ -668,7 +668,7 @@ lee](https://github.com/neverlee))
* Fix error `Unknown log entry type: 0` after `OPTIMIZE TABLE FINAL` query. [#4683](https://github.com/yandex/ClickHouse/pull/4683) ([Amos Bird](https://github.com/amosbird))
* Wrong arguments to `hasAny` or `hasAll` functions may lead to segfault. [#4698](https://github.com/yandex/ClickHouse/pull/4698) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Deadlock may happen while executing `DROP DATABASE dictionary` query. [#4701](https://github.com/yandex/ClickHouse/pull/4701) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Fix undefinied behavior in `median` and `quantile` functions. [#4702](https://github.com/yandex/ClickHouse/pull/4702) ([hcz](https://github.com/hczhcz))
* Fix undefined behavior in `median` and `quantile` functions. [#4702](https://github.com/yandex/ClickHouse/pull/4702) ([hcz](https://github.com/hczhcz))
* Fix compression level detection when `network_compression_method` in lowercase. Broken in v19.1. [#4706](https://github.com/yandex/ClickHouse/pull/4706) ([proller](https://github.com/proller))
* Fixed ignorance of `<timezone>UTC</timezone>` setting (fixes issue [#4658](https://github.com/yandex/ClickHouse/issues/4658)). [#4718](https://github.com/yandex/ClickHouse/pull/4718) ([proller](https://github.com/proller))
* Fix `histogram` function behaviour with `Distributed` tables. [#4741](https://github.com/yandex/ClickHouse/pull/4741) ([olegkv](https://github.com/olegkv))
@ -850,7 +850,7 @@ lee](https://github.com/neverlee))
* Added aggregate function `entropy` which computes Shannon entropy. [#4238](https://github.com/yandex/ClickHouse/pull/4238) ([Quid37](https://github.com/Quid37))
* Added ability to send queries `INSERT INTO tbl VALUES (....` to server without splitting on `query` and `data` parts. [#4301](https://github.com/yandex/ClickHouse/pull/4301) ([alesapin](https://github.com/alesapin))
* Generic implementation of `arrayWithConstant` function was added. [#4322](https://github.com/yandex/ClickHouse/pull/4322) ([alexey-milovidov](https://github.com/alexey-milovidov))
* Implented `NOT BETWEEN` comparison operator. [#4228](https://github.com/yandex/ClickHouse/pull/4228) ([Dmitry Naumov](https://github.com/nezed))
* Implemented `NOT BETWEEN` comparison operator. [#4228](https://github.com/yandex/ClickHouse/pull/4228) ([Dmitry Naumov](https://github.com/nezed))
* Implement `sumMapFiltered` in order to be able to limit the number of keys for which values will be summed by `sumMap`. [#4129](https://github.com/yandex/ClickHouse/pull/4129) ([Léo Ercolanelli](https://github.com/ercolanelli-leo))
* Added support of `Nullable` types in `mysql` table function. [#4198](https://github.com/yandex/ClickHouse/pull/4198) ([Emmanuel Donin de Rosière](https://github.com/edonin))
* Support for arbitrary constant expressions in `LIMIT` clause. [#4246](https://github.com/yandex/ClickHouse/pull/4246) ([k3box](https://github.com/k3box))
@ -925,7 +925,7 @@ lee](https://github.com/neverlee))
* Added keyword `INDEX` in `CREATE TABLE` query. A column with name `index` must be quoted with backticks or double quotes: `` `index` ``. [#4143](https://github.com/yandex/ClickHouse/pull/4143) ([Nikita Vasilev](https://github.com/nikvas0))
* `sumMap` now promote result type instead of overflow. The old `sumMap` behavior can be obtained by using `sumMapWithOverflow` function. [#4151](https://github.com/yandex/ClickHouse/pull/4151) ([Léo Ercolanelli](https://github.com/ercolanelli-leo))
### Performance Impovements
### Performance Improvements
* `std::sort` replaced by `pdqsort` for queries without `LIMIT`. [#4236](https://github.com/yandex/ClickHouse/pull/4236) ([Evgenii Pravda](https://github.com/kvinty))
* Now server reuse threads from global thread pool. This affects performance in some corner cases. [#4150](https://github.com/yandex/ClickHouse/pull/4150) ([alexey-milovidov](https://github.com/alexey-milovidov))

@ -150,6 +150,7 @@ endif ()
if (LINKER_NAME)
message(STATUS "Using linker: ${LINKER_NAME} (selected from: LLD_PATH=${LLD_PATH}; GOLD_PATH=${GOLD_PATH}; COMPILER_POSTFIX=${COMPILER_POSTFIX})")
set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fuse-ld=${LINKER_NAME}")
set (CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -fuse-ld=${LINKER_NAME}")
endif ()
# Make sure the final executable has symbols exported
@ -229,7 +230,8 @@ endif ()
# Make this extra-checks for correct library dependencies.
if (NOT SANITIZE)
set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--no-undefined")
set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--no-undefined")
set (CMAKE_SHARED_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--no-undefined")
endif ()
include(cmake/dbms_glob_sources.cmake)

@ -12,8 +12,7 @@
# https://youtrack.jetbrains.com/issue/CPP-2659
# https://youtrack.jetbrains.com/issue/CPP-870
string(TOLOWER "${CMAKE_COMMAND}" CMAKE_COMMAND_LOWER)
if (NOT ${CMAKE_COMMAND_LOWER} MATCHES "clion")
if (NOT DEFINED ENV{CLION_IDE})
find_program(NINJA_PATH ninja)
if (NINJA_PATH)
set(CMAKE_GENERATOR "Ninja" CACHE INTERNAL "" FORCE)

@ -18,3 +18,4 @@ ClickHouse is an open-source column-oriented database management system that all
* [ClickHouse Meetup in Hong Kong](https://www.meetup.com/Hong-Kong-Machine-Learning-Meetup/events/263580542/) on October 17.
* [ClickHouse Meetup in Shenzhen](https://www.huodongxing.com/event/3483759917300) on October 20.
* [ClickHouse Meetup in Shanghai](https://www.huodongxing.com/event/4483760336000) on October 27.
* [ClickHouse Meetup in Tokyo](https://clickhouse.connpass.com/event/147001/) on November 14.

@ -1,20 +1,26 @@
option (USE_CAPNP "Enable Cap'n Proto" ON)
option (ENABLE_CAPNP "Enable Cap'n Proto" ON)
if (USE_CAPNP)
option (USE_INTERNAL_CAPNP_LIBRARY "Set to FALSE to use system capnproto library instead of bundled" ${NOT_UNBUNDLED})
if (ENABLE_CAPNP)
# FIXME: refactor to use `add_library( IMPORTED)` if possible.
if (NOT USE_INTERNAL_CAPNP_LIBRARY)
find_library (KJ kj)
find_library (CAPNP capnp)
find_library (CAPNPC capnpc)
option (USE_INTERNAL_CAPNP_LIBRARY "Set to FALSE to use system capnproto library instead of bundled" ${NOT_UNBUNDLED})
set (CAPNP_LIBRARIES ${CAPNPC} ${CAPNP} ${KJ})
else ()
add_subdirectory(contrib/capnproto-cmake)
# FIXME: refactor to use `add_library( IMPORTED)` if possible.
if (NOT USE_INTERNAL_CAPNP_LIBRARY)
find_library (KJ kj)
find_library (CAPNP capnp)
find_library (CAPNPC capnpc)
set (CAPNP_LIBRARIES capnpc)
endif ()
set (CAPNP_LIBRARIES ${CAPNPC} ${CAPNP} ${KJ})
else ()
add_subdirectory(contrib/capnproto-cmake)
message (STATUS "Using capnp: ${CAPNP_LIBRARIES}")
set (CAPNP_LIBRARIES capnpc)
endif ()
if (CAPNP_LIBRARIES)
set (USE_CAPNP 1)
endif ()
endif ()
message (STATUS "Using capnp: ${CAPNP_LIBRARIES}")

@ -143,15 +143,15 @@ add_subdirectory(src/Common/Config)
set (all_modules)
macro(add_object_library name common_path)
list (APPEND all_modules ${name})
add_glob(${name}_headers RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} ${common_path}/*.h)
add_glob(${name}_sources ${common_path}/*.cpp ${common_path}/*.c ${common_path}/*.h)
if (MAKE_STATIC_LIBRARIES OR NOT SPLIT_SHARED_LIBRARIES)
add_library(${name} OBJECT ${${name}_sources} ${${name}_headers})
add_glob(dbms_headers RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} ${common_path}/*.h)
add_glob(dbms_sources ${common_path}/*.cpp ${common_path}/*.c ${common_path}/*.h)
else ()
list (APPEND all_modules ${name})
add_glob(${name}_headers RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} ${common_path}/*.h)
add_glob(${name}_sources ${common_path}/*.cpp ${common_path}/*.c ${common_path}/*.h)
add_library(${name} SHARED ${${name}_sources} ${${name}_headers})
# force all split libs to be linked
target_link_options(${name} PUBLIC "-Wl,--no-as-needed")
target_link_libraries (${name} PRIVATE -Wl,--unresolved-symbols=ignore-all)
endif ()
endmacro()
@ -177,15 +177,15 @@ add_object_library(clickhouse_processors_transforms src/Processors/Transforms)
add_object_library(clickhouse_processors_sources src/Processors/Sources)
if (MAKE_STATIC_LIBRARIES OR NOT SPLIT_SHARED_LIBRARIES)
foreach (module ${all_modules})
list (APPEND dbms_sources $<TARGET_OBJECTS:${module}>)
endforeach ()
add_library(dbms STATIC ${dbms_headers} ${dbms_sources})
add_library (dbms STATIC ${dbms_headers} ${dbms_sources})
set (all_modules dbms)
else()
add_library(dbms SHARED ${dbms_headers} ${dbms_sources})
add_library (dbms SHARED ${dbms_headers} ${dbms_sources})
target_link_libraries (dbms PUBLIC ${all_modules})
list (APPEND all_modules dbms)
# force all split libs to be linked
set (CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,--no-as-needed")
endif ()
list (APPEND all_modules dbms)
macro (dbms_target_include_directories)
foreach (module ${all_modules})
@ -389,7 +389,8 @@ if (USE_PARQUET)
endif ()
if (OPENSSL_CRYPTO_LIBRARY)
dbms_target_link_libraries(PRIVATE ${OPENSSL_CRYPTO_LIBRARY})
dbms_target_link_libraries (PRIVATE ${OPENSSL_CRYPTO_LIBRARY})
target_link_libraries (clickhouse_common_io PRIVATE ${OPENSSL_CRYPTO_LIBRARY})
endif ()
dbms_target_include_directories (SYSTEM BEFORE PRIVATE ${DIVIDE_INCLUDE_DIR})

@ -563,9 +563,17 @@ private:
if (is_interactive)
{
std::cout << "Connected to " << server_name
<< " server version " << server_version
<< " revision " << server_revision
<< "." << std::endl << std::endl;
<< " server version " << server_version
<< " revision " << server_revision
<< "." << std::endl << std::endl;
if (std::make_tuple(VERSION_MAJOR, VERSION_MINOR, VERSION_PATCH)
< std::make_tuple(server_version_major, server_version_minor, server_version_patch))
{
std::cout << "ClickHouse client version is older than ClickHouse server. "
<< "It may lack support for new features."
<< std::endl << std::endl;
}
}
}
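
The version check added in the hunk above relies on the lexicographic ordering of `std::tuple`. A minimal standalone sketch of that comparison follows; the helper name `isClientOlder` and the plain `int` version components are assumptions made for illustration and are not part of the commit.

```cpp
#include <tuple>

/// Hypothetical helper (not in the commit): returns true when the client
/// version (major, minor, patch) is strictly older than the server version.
/// std::tuple's operator< compares element by element, left to right,
/// so the major version dominates, then minor, then patch.
bool isClientOlder(int client_major, int client_minor, int client_patch,
                   int server_major, int server_minor, int server_patch)
{
    return std::make_tuple(client_major, client_minor, client_patch)
        < std::make_tuple(server_major, server_minor, server_patch);
}
```

Equal versions compare as not older, which matches the strict `<` used in the client code, so the warning would only be printed for a genuinely newer server.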

@ -1,5 +1,5 @@
set(CLICKHOUSE_COPIER_SOURCES ${CMAKE_CURRENT_SOURCE_DIR}/ClusterCopier.cpp)
set(CLICKHOUSE_COPIER_LINK PRIVATE clickhouse_common_zookeeper clickhouse_parsers clickhouse_functions clickhouse_table_functions clickhouse_aggregate_functions clickhouse_dictionaries string_utils PUBLIC daemon)
set(CLICKHOUSE_COPIER_LINK PRIVATE clickhouse_common_zookeeper clickhouse_parsers clickhouse_functions clickhouse_table_functions clickhouse_aggregate_functions clickhouse_dictionaries string_utils ${Poco_XML_LIBRARY} PUBLIC daemon)
set(CLICKHOUSE_COPIER_INCLUDE SYSTEM PRIVATE ${PCG_RANDOM_INCLUDE_DIR})
clickhouse_program_add(copier)

@ -1,3 +1,5 @@
#include <Common/config.h>
#if USE_POCO_NETSSL
#include "MySQLHandler.h"
#include <limits>
@ -301,3 +303,4 @@ void MySQLHandler::comQuery(ReadBuffer & payload)
}
}
#endif

@ -1,4 +1,6 @@
#pragma once
#include <Common/config.h>
#if USE_POCO_NETSSL
#include <Poco/Net/TCPServerConnection.h>
#include <Poco/Net/SecureStreamSocket.h>
@ -56,3 +58,4 @@ private:
};
}
#endif

@ -1,3 +1,5 @@
#include <Common/config.h>
#if USE_POCO_NETSSL
#include <Common/OpenSSLHelpers.h>
#include <Poco/Crypto/X509Certificate.h>
#include <Poco/Net/SSLManager.h>
@ -122,3 +124,4 @@ Poco::Net::TCPServerConnection * MySQLHandlerFactory::createConnection(const Poc
}
}
#endif

@ -1,5 +1,8 @@
#pragma once
#include <Common/config.h>
#if USE_POCO_NETSSL
#include <Poco/Net/TCPServerConnectionFactory.h>
#include <atomic>
#include <openssl/rsa.h>
@ -37,3 +40,4 @@ public:
};
}
#endif

@ -88,6 +88,7 @@ namespace ErrorCodes
extern const int FAILED_TO_GETPWUID;
extern const int MISMATCHING_USERS_FOR_PROCESS_AND_DATA;
extern const int NETWORK_ERROR;
extern const int PATH_ACCESS_DENIED;
}
@ -270,6 +271,15 @@ int Server::main(const std::vector<std::string> & /*args*/)
Poco::File(path + "data/" + default_database).createDirectories();
Poco::File(path + "metadata/" + default_database).createDirectories();
/// Check that we have read and write access to all data paths
auto disk_selector = global_context->getDiskSelector();
for (const auto & [name, disk] : disk_selector.getDisksMap())
{
Poco::File disk_path(disk->getPath());
if (!disk_path.canRead() || !disk_path.canWrite())
throw Exception("There is no RW access to disk " + name + " (" + disk->getPath() + ")", ErrorCodes::PATH_ACCESS_DENIED);
}
StatusFile status{path + "status"};
SCOPE_EXIT({

@ -545,6 +545,9 @@ void TCPHandler::processOrdinaryQueryWithProcessors(size_t num_threads)
{
auto & pipeline = state.io.pipeline;
if (pipeline.getMaxThreads())
num_threads = pipeline.getMaxThreads();
/// Send header-block, to allow client to prepare output format for data to send.
{
auto & header = pipeline.getHeader();
@ -594,7 +597,15 @@ void TCPHandler::processOrdinaryQueryWithProcessors(size_t num_threads)
lazy_format->finish();
lazy_format->clearQueue();
pool.wait();
try
{
pool.wait();
}
catch (...)
{
/// If exception was thrown during pipeline execution, skip it while processing other exception.
}
pipeline = QueryPipeline()
);
@ -1000,8 +1011,7 @@ void TCPHandler::receiveUnexpectedData()
auto skip_block_in = std::make_shared<NativeBlockInputStream>(
*maybe_compressed_in,
last_block_in.header,
client_revision,
!connection_context.getSettingsRef().low_cardinality_allow_in_native_format);
client_revision);
Block skip_block = skip_block_in->read();
throw NetException("Unexpected packet Data received from client", ErrorCodes::UNEXPECTED_PACKET_FROM_CLIENT);
@ -1028,8 +1038,7 @@ void TCPHandler::initBlockInput()
state.block_in = std::make_shared<NativeBlockInputStream>(
*state.maybe_compressed_in,
header,
client_revision,
!connection_context.getSettingsRef().low_cardinality_allow_in_native_format);
client_revision);
}
}

@ -88,10 +88,11 @@ public:
if (is_large)
{
toLarge();
UInt32 cardinality;
readBinary(cardinality, in);
db_roaring_bitmap_add_many(in, rb, cardinality);
std::string s;
readStringBinary(s,in);
rb = roaring_bitmap_portable_deserialize(s.c_str());
for (const auto & x : small) //merge from small
roaring_bitmap_add(rb, x.getValue());
}
else
small.read(in);
@ -103,9 +104,10 @@ public:
if (isLarge())
{
UInt32 cardinality = roaring_bitmap_get_cardinality(rb);
writePODBinary(cardinality, out);
db_ra_to_uint32_array(out, &rb->high_low_container);
uint32_t expectedsize = roaring_bitmap_portable_size_in_bytes(rb);
std::string s(expectedsize,0);
roaring_bitmap_portable_serialize(rb, const_cast<char*>(s.data()));
writeStringBinary(s,out);
}
else
small.write(out);

@ -238,6 +238,10 @@ void Adam::read(ReadBuffer & buf)
void Adam::merge(const IWeightsUpdater & rhs, Float64 frac, Float64 rhs_frac)
{
auto & adam_rhs = static_cast<const Adam &>(rhs);
if (adam_rhs.average_gradient.empty())
return;
if (average_gradient.empty())
{
if (!average_squared_gradient.empty() ||

@ -0,0 +1,538 @@
#include <Common/DiskSpaceMonitor.h>
#include <set>
#include <Common/escapeForFileName.h>
#include <Poco/File.h>
namespace DB
{
namespace DiskSpace
{
std::mutex Disk::mutex;
std::filesystem::path getMountPoint(std::filesystem::path absolute_path)
{
if (absolute_path.is_relative())
throw Exception("Path is relative. It's a bug.", ErrorCodes::LOGICAL_ERROR);
absolute_path = std::filesystem::canonical(absolute_path);
const auto get_device_id = [](const std::filesystem::path & p)
{
struct stat st;
if (stat(p.c_str(), &st))
throwFromErrnoWithPath("Cannot stat " + p.string(), p.string(), ErrorCodes::SYSTEM_ERROR);
return st.st_dev;
};
/// If /some/path/to/dir/ and /some/path/to/ have different device id,
/// then device which contains /some/path/to/dir/filename is mounted to /some/path/to/dir/
auto device_id = get_device_id(absolute_path);
while (absolute_path.has_relative_path())
{
auto parent = absolute_path.parent_path();
auto parent_device_id = get_device_id(parent);
if (device_id != parent_device_id)
return absolute_path;
absolute_path = parent;
device_id = parent_device_id;
}
return absolute_path;
}
/// Returns name of filesystem mounted to mount_point
#if !defined(__linux__)
[[noreturn]]
#endif
std::string getFilesystemName([[maybe_unused]] const std::string & mount_point)
{
#if defined(__linux__)
auto mounted_filesystems = setmntent("/etc/mtab", "r");
if (!mounted_filesystems)
throw DB::Exception("Cannot open /etc/mtab to get name of filesystem", ErrorCodes::SYSTEM_ERROR);
mntent fs_info;
constexpr size_t buf_size = 4096; /// The same as buffer used for getmntent in glibc. It can happen that it's not enough
char buf[buf_size];
while (getmntent_r(mounted_filesystems, &fs_info, buf, buf_size) && fs_info.mnt_dir != mount_point)
;
endmntent(mounted_filesystems);
if (fs_info.mnt_dir != mount_point)
throw DB::Exception("Cannot find name of filesystem by mount point " + mount_point, ErrorCodes::SYSTEM_ERROR);
return fs_info.mnt_fsname;
#else
throw DB::Exception("Supported on linux only", ErrorCodes::NOT_IMPLEMENTED);
#endif
}
ReservationPtr Disk::reserve(UInt64 bytes) const
{
if (!tryReserve(bytes))
return {};
return std::make_unique<Reservation>(bytes, std::static_pointer_cast<const Disk>(shared_from_this()));
}
bool Disk::tryReserve(UInt64 bytes) const
{
std::lock_guard lock(mutex);
if (bytes == 0)
{
LOG_DEBUG(&Logger::get("DiskSpaceMonitor"), "Reserving 0 bytes on disk " << name);
++reservation_count;
return true;
}
auto available_space = getAvailableSpace();
UInt64 unreserved_space = available_space - std::min(available_space, reserved_bytes);
if (unreserved_space >= bytes)
{
LOG_DEBUG(
&Logger::get("DiskSpaceMonitor"),
"Reserving " << bytes << " bytes on disk " << name << " having unreserved " << unreserved_space << " bytes.");
++reservation_count;
reserved_bytes += bytes;
return true;
}
return false;
}
UInt64 Disk::getUnreservedSpace() const
{
std::lock_guard lock(mutex);
auto available_space = getSpaceInformation().getAvailableSpace();
available_space -= std::min(available_space, reserved_bytes);
return available_space;
}
UInt64 Disk::Stat::getTotalSpace() const
{
UInt64 total_size = fs.f_blocks * fs.f_bsize;
if (total_size < keep_free_space_bytes)
return 0;
return total_size - keep_free_space_bytes;
}
UInt64 Disk::Stat::getAvailableSpace() const
{
/// we use f_bavail, because part of b_free space is
/// available for superuser only and for system purposes
UInt64 total_size = fs.f_bavail * fs.f_bsize;
if (total_size < keep_free_space_bytes)
return 0;
return total_size - keep_free_space_bytes;
}
Reservation::~Reservation()
{
try
{
std::lock_guard lock(Disk::mutex);
if (disk_ptr->reserved_bytes < size)
{
disk_ptr->reserved_bytes = 0;
LOG_ERROR(&Logger::get("DiskSpaceMonitor"), "Unbalanced reservations size for disk '" + disk_ptr->getName() + "'.");
}
else
{
disk_ptr->reserved_bytes -= size;
}
if (disk_ptr->reservation_count == 0)
LOG_ERROR(&Logger::get("DiskSpaceMonitor"), "Unbalanced reservation count for disk '" + disk_ptr->getName() + "'.");
else
--disk_ptr->reservation_count;
}
catch (...)
{
tryLogCurrentException("~DiskSpaceMonitor");
}
}
void Reservation::update(UInt64 new_size)
{
std::lock_guard lock(Disk::mutex);
disk_ptr->reserved_bytes -= size;
size = new_size;
disk_ptr->reserved_bytes += size;
}
DiskSelector::DiskSelector(const Poco::Util::AbstractConfiguration & config, const std::string & config_prefix, const String & default_path)
{
Poco::Util::AbstractConfiguration::Keys keys;
config.keys(config_prefix, keys);
constexpr auto default_disk_name = "default";
bool has_default_disk = false;
for (const auto & disk_name : keys)
{
if (!std::all_of(disk_name.begin(), disk_name.end(), isWordCharASCII))
throw Exception("Disk name can contain only alphanumeric and '_' (" + disk_name + ")",
ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
auto disk_config_prefix = config_prefix + "." + disk_name;
bool has_space_ratio = config.has(disk_config_prefix + ".keep_free_space_ratio");
if (config.has(disk_config_prefix + ".keep_free_space_bytes") && has_space_ratio)
throw Exception("Only one of 'keep_free_space_bytes' and 'keep_free_space_ratio' can be specified",
ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
UInt64 keep_free_space_bytes = config.getUInt64(disk_config_prefix + ".keep_free_space_bytes", 0);
String path;
if (config.has(disk_config_prefix + ".path"))
path = config.getString(disk_config_prefix + ".path");
if (has_space_ratio)
{
auto ratio = config.getDouble(config_prefix + ".keep_free_space_ratio");
if (ratio < 0 || ratio > 1)
throw Exception("'keep_free_space_ratio' have to be between 0 and 1",
ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
String tmp_path = path;
if (tmp_path.empty())
tmp_path = default_path;
// Create tmp disk for getting total disk space.
keep_free_space_bytes = static_cast<UInt64>(Disk("tmp", tmp_path, 0).getTotalSpace() * ratio);
}
if (disk_name == default_disk_name)
{
has_default_disk = true;
if (!path.empty())
throw Exception("\"default\" disk path should be provided in <path> not it <storage_configuration>",
ErrorCodes::UNKNOWN_ELEMENT_IN_CONFIG);
disks.emplace(disk_name, std::make_shared<const Disk>(disk_name, default_path, keep_free_space_bytes));
}
else
{
if (path.empty())
throw Exception("Disk path can not be empty. Disk " + disk_name, ErrorCodes::UNKNOWN_ELEMENT_IN_CONFIG);
if (path.back() != '/')
throw Exception("Disk path must end with /. Disk " + disk_name, ErrorCodes::UNKNOWN_ELEMENT_IN_CONFIG);
disks.emplace(disk_name, std::make_shared<const Disk>(disk_name, path, keep_free_space_bytes));
}
}
if (!has_default_disk)
disks.emplace(default_disk_name, std::make_shared<const Disk>(default_disk_name, default_path, 0));
}
const DiskPtr & DiskSelector::operator[](const String & name) const
{
auto it = disks.find(name);
if (it == disks.end())
throw Exception("Unknown disk " + name, ErrorCodes::UNKNOWN_DISK);
return it->second;
}
Volume::Volume(
String name_,
const Poco::Util::AbstractConfiguration & config,
const std::string & config_prefix,
const DiskSelector & disk_selector)
: name(std::move(name_))
{
Poco::Util::AbstractConfiguration::Keys keys;
config.keys(config_prefix, keys);
Logger * logger = &Logger::get("StorageConfiguration");
for (const auto & disk : keys)
{
if (startsWith(disk, "disk"))
{
auto disk_name = config.getString(config_prefix + "." + disk);
disks.push_back(disk_selector[disk_name]);
}
}
if (disks.empty())
throw Exception("Volume must contain at least one disk.", ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
auto has_max_bytes = config.has(config_prefix + ".max_data_part_size_bytes");
auto has_max_ratio = config.has(config_prefix + ".max_data_part_size_ratio");
if (has_max_bytes && has_max_ratio)
throw Exception("Only one of 'max_data_part_size_bytes' and 'max_data_part_size_ratio' should be specified.",
ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
if (has_max_bytes)
{
max_data_part_size = config.getUInt64(config_prefix + ".max_data_part_size_bytes", 0);
}
else if (has_max_ratio)
{
auto ratio = config.getDouble(config_prefix + ".max_data_part_size_ratio");
if (ratio < 0)
throw Exception("'max_data_part_size_ratio' have to be not less then 0.",
ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
UInt64 sum_size = 0;
std::vector<UInt64> sizes;
for (const auto & disk : disks)
{
sizes.push_back(disk->getTotalSpace());
sum_size += sizes.back();
}
max_data_part_size = static_cast<decltype(max_data_part_size)>(sum_size * ratio / disks.size());
for (size_t i = 0; i < disks.size(); ++i)
if (sizes[i] < max_data_part_size)
LOG_WARNING(logger, "Disk " << disks[i]->getName() << " on volume " << config_prefix <<
" have not enough space (" << sizes[i] <<
") for containing part the size of max_data_part_size (" <<
max_data_part_size << ")");
}
constexpr UInt64 MIN_PART_SIZE = 8u * 1024u * 1024u;
if (max_data_part_size < MIN_PART_SIZE)
LOG_WARNING(logger, "Volume '" << name << "' max_data_part_size is too low ("
<< formatReadableSizeWithBinarySuffix(max_data_part_size) << " < "
<< formatReadableSizeWithBinarySuffix(MIN_PART_SIZE) << ")");
}
ReservationPtr Volume::reserve(UInt64 expected_size) const
{
/// This volume can not store files which size greater than max_data_part_size
if (max_data_part_size != 0 && expected_size > max_data_part_size)
return {};
size_t start_from = last_used.fetch_add(1u, std::memory_order_relaxed);
size_t disks_num = disks.size();
for (size_t i = 0; i < disks_num; ++i)
{
size_t index = (start_from + i) % disks_num;
auto reservation = disks[index]->reserve(expected_size);
if (reservation)
return reservation;
}
return {};
}
UInt64 Volume::getMaxUnreservedFreeSpace() const
{
UInt64 res = 0;
for (const auto & disk : disks)
res = std::max(res, disk->getUnreservedSpace());
return res;
}
StoragePolicy::StoragePolicy(
String name_,
const Poco::Util::AbstractConfiguration & config,
const std::string & config_prefix,
const DiskSelector & disks)
: name(std::move(name_))
{
String volumes_prefix = config_prefix + ".volumes";
if (!config.has(volumes_prefix))
throw Exception("StoragePolicy must contain at least one volume (.volumes)", ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
Poco::Util::AbstractConfiguration::Keys keys;
config.keys(volumes_prefix, keys);
for (const auto & attr_name : keys)
{
if (!std::all_of(attr_name.begin(), attr_name.end(), isWordCharASCII))
throw Exception("Volume name can contain only alphanumeric and '_' (" + attr_name + ")", ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
volumes.push_back(std::make_shared<Volume>(attr_name, config, volumes_prefix + "." + attr_name, disks));
if (volumes_names.find(attr_name) != volumes_names.end())
throw Exception("Volumes names must be unique (" + attr_name + " duplicated)", ErrorCodes::UNKNOWN_POLICY);
volumes_names[attr_name] = volumes.size() - 1;
}
if (volumes.empty())
throw Exception("StoragePolicy must contain at least one volume.", ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
/// Check that disks are unique in Policy
std::set<String> disk_names;
for (const auto & volume : volumes)
{
for (const auto & disk : volume->disks)
{
if (disk_names.find(disk->getName()) != disk_names.end())
throw Exception("Duplicate disk '" + disk->getName() + "' in storage policy '" + name + "'", ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
disk_names.insert(disk->getName());
}
}
move_factor = config.getDouble(config_prefix + ".move_factor", 0.1);
if (move_factor > 1)
throw Exception("Disk move factor have to be in [0., 1.] interval, but set to " + toString(move_factor),
ErrorCodes::LOGICAL_ERROR);
}
StoragePolicy::StoragePolicy(String name_, Volumes volumes_, double move_factor_)
: volumes(std::move(volumes_))
, name(std::move(name_))
, move_factor(move_factor_)
{
if (volumes.empty())
throw Exception("StoragePolicy must contain at least one Volume.", ErrorCodes::UNKNOWN_POLICY);
if (move_factor > 1)
throw Exception("Disk move factor have to be in [0., 1.] interval, but set to " + toString(move_factor),
ErrorCodes::LOGICAL_ERROR);
for (size_t i = 0; i < volumes.size(); ++i)
{
if (volumes_names.find(volumes[i]->getName()) != volumes_names.end())
throw Exception("Volumes names must be unique (" + volumes[i]->getName() + " duplicated).", ErrorCodes::UNKNOWN_POLICY);
volumes_names[volumes[i]->getName()] = i;
}
}
Disks StoragePolicy::getDisks() const
{
Disks res;
for (const auto & volume : volumes)
for (const auto & disk : volume->disks)
res.push_back(disk);
return res;
}
DiskPtr StoragePolicy::getAnyDisk() const
{
/// StoragePolicy must contain at least one Volume
/// Volume must contain at least one Disk
if (volumes.empty())
throw Exception("StoragePolicy has no volumes. It's a bug.", ErrorCodes::NOT_ENOUGH_SPACE);
if (volumes[0]->disks.empty())
throw Exception("Volume '" + volumes[0]->getName() + "' has no disks. It's a bug.", ErrorCodes::NOT_ENOUGH_SPACE);
return volumes[0]->disks[0];
}
DiskPtr StoragePolicy::getDiskByName(const String & disk_name) const
{
for (auto && volume : volumes)
for (auto && disk : volume->disks)
if (disk->getName() == disk_name)
return disk;
return {};
}
UInt64 StoragePolicy::getMaxUnreservedFreeSpace() const
{
UInt64 res = 0;
for (const auto & volume : volumes)
res = std::max(res, volume->getMaxUnreservedFreeSpace());
return res;
}
ReservationPtr StoragePolicy::reserve(UInt64 expected_size, size_t min_volume_index) const
{
for (size_t i = min_volume_index; i < volumes.size(); ++i)
{
const auto & volume = volumes[i];
auto reservation = volume->reserve(expected_size);
if (reservation)
return reservation;
}
return {};
}
ReservationPtr StoragePolicy::reserve(UInt64 expected_size) const
{
return reserve(expected_size, 0);
}
ReservationPtr StoragePolicy::makeEmptyReservationOnLargestDisk() const
{
UInt64 max_space = 0;
DiskPtr max_disk;
for (const auto & volume : volumes)
{
for (const auto & disk : volume->disks)
{
auto avail_space = disk->getAvailableSpace();
if (avail_space > max_space)
{
max_space = avail_space;
max_disk = disk;
}
}
}
return max_disk->reserve(0);
}
size_t StoragePolicy::getVolumeIndexByDisk(const DiskPtr & disk_ptr) const
{
for (size_t i = 0; i < volumes.size(); ++i)
{
const auto & volume = volumes[i];
for (const auto & disk : volume->disks)
if (disk->getName() == disk_ptr->getName())
return i;
}
throw Exception("No disk " + disk_ptr->getName() + " in policy " + name, ErrorCodes::UNKNOWN_DISK);
}
StoragePolicySelector::StoragePolicySelector(
const Poco::Util::AbstractConfiguration & config,
const String & config_prefix,
const DiskSelector & disks)
{
Poco::Util::AbstractConfiguration::Keys keys;
config.keys(config_prefix, keys);
for (const auto & name : keys)
{
if (!std::all_of(name.begin(), name.end(), isWordCharASCII))
throw Exception("StoragePolicy name can contain only alphanumeric and '_' (" + name + ")",
ErrorCodes::EXCESSIVE_ELEMENT_IN_CONFIG);
policies.emplace(name, std::make_shared<StoragePolicy>(name, config, config_prefix + "." + name, disks));
LOG_INFO(&Logger::get("StoragePolicySelector"), "Storage policy " << name << " loaded");
}
constexpr auto default_storage_policy_name = "default";
constexpr auto default_volume_name = "default";
constexpr auto default_disk_name = "default";
/// Add default policy if it's not specified explicitly
if (policies.find(default_storage_policy_name) == policies.end())
{
auto default_volume = std::make_shared<Volume>(
default_volume_name,
std::vector<DiskPtr>{disks[default_disk_name]},
0);
auto default_policy = std::make_shared<StoragePolicy>(default_storage_policy_name, Volumes{default_volume}, 0.0);
policies.emplace(default_storage_policy_name, default_policy);
}
}
const StoragePolicyPtr & StoragePolicySelector::operator[](const String & name) const
{
auto it = policies.find(name);
if (it == policies.end())
throw Exception("Unknown StoragePolicy " + name, ErrorCodes::UNKNOWN_POLICY);
return it->second;
}
}
}
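The volume fallback in `StoragePolicy::reserve` above can be sketched in isolation. This is a minimal illustration, not the ClickHouse implementation: `SketchVolume` and `reserveFrom` are hypothetical names, and a volume is reduced to a single free-bytes counter. It shows that the search starts at `min_volume_index` inclusive and stops at the first volume that accepts the request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a volume: it only tracks how many bytes are still free.
struct SketchVolume
{
    uint64_t free_bytes;

    // Consumes the space and returns true if the request fits, false otherwise.
    bool tryReserve(uint64_t bytes)
    {
        if (bytes > free_bytes)
            return false;
        free_bytes -= bytes;
        return true;
    }
};

// Tries volumes in priority order, starting at min_volume_index (inclusive).
// Returns the index of the first volume that accepted the reservation.
std::optional<size_t> reserveFrom(std::vector<SketchVolume> & volumes, uint64_t bytes, size_t min_volume_index = 0)
{
    for (size_t i = min_volume_index; i < volumes.size(); ++i)
        if (volumes[i].tryReserve(bytes))
            return i;
    return std::nullopt; // no volume could fit the request
}
```

Returning an empty result instead of throwing mirrors the null `ReservationPtr` above: the caller decides whether a failed reservation is an error.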

View File

@ -0,0 +1,359 @@
#pragma once
#include <mutex>
#include <sys/statvfs.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#if defined(__linux__)
#include <cstdio>
#include <mntent.h>
#endif
#include <memory>
#include <filesystem>
#include <boost/noncopyable.hpp>
#include <Poco/Util/AbstractConfiguration.h>
#include <common/logger_useful.h>
#include <Common/Exception.h>
#include <IO/WriteHelpers.h>
#include <Common/formatReadable.h>
#include <Common/CurrentMetrics.h>
namespace CurrentMetrics
{
extern const Metric DiskSpaceReservedForMerge;
}
namespace DB
{
namespace ErrorCodes
{
extern const int LOGICAL_ERROR;
extern const int CANNOT_STATVFS;
extern const int NOT_ENOUGH_SPACE;
extern const int SYSTEM_ERROR;
extern const int UNKNOWN_ELEMENT_IN_CONFIG;
extern const int EXCESSIVE_ELEMENT_IN_CONFIG;
extern const int UNKNOWN_POLICY;
extern const int UNKNOWN_DISK;
}
namespace DiskSpace
{
class Reservation;
using ReservationPtr = std::unique_ptr<Reservation>;
/// Returns mount point of filesystem where absolute_path (must exist) is located
std::filesystem::path getMountPoint(std::filesystem::path absolute_path);
/// Returns name of filesystem mounted to mount_point
#if !defined(__linux__)
[[noreturn]]
#endif
std::string getFilesystemName([[maybe_unused]] const std::string & mount_point);
inline struct statvfs getStatVFS(const std::string & path)
{
struct statvfs fs;
if (statvfs(path.c_str(), &fs) != 0)
throwFromErrnoWithPath(
"Could not calculate available disk space (statvfs)", path, ErrorCodes::CANNOT_STATVFS);
return fs;
}
/**
* Provide interface for reservation
*/
class Space : public std::enable_shared_from_this<Space>
{
public:
virtual ReservationPtr reserve(UInt64 bytes) const = 0;
virtual const String & getName() const = 0;
virtual ~Space() = default;
};
using SpacePtr = std::shared_ptr<const Space>;
/** Disk - Smallest space unit.
* path - Path to space. Ends with /
* name - Unique key used for disk space reservation.
*/
class Disk : public Space
{
public:
friend class Reservation;
/// Snapshot of disk space state (free and total space)
class Stat
{
struct statvfs fs{};
UInt64 keep_free_space_bytes;
public:
explicit Stat(const Disk & disk)
{
if (statvfs(disk.path.c_str(), &fs) != 0)
throwFromErrno("Could not calculate available disk space (statvfs)", ErrorCodes::CANNOT_STATVFS);
keep_free_space_bytes = disk.keep_free_space_bytes;
}
/// Total space on disk using information from statvfs
UInt64 getTotalSpace() const;
/// Available space on disk using information from statvfs
UInt64 getAvailableSpace() const;
};
Disk(const String & name_, const String & path_, UInt64 keep_free_space_bytes_)
: name(name_)
, path(path_)
, keep_free_space_bytes(keep_free_space_bytes_)
{
if (path.back() != '/')
throw Exception("Disk path must end with '/', but '" + path + "' doesn't.", ErrorCodes::LOGICAL_ERROR);
}
/// Reserves bytes on disk, if not possible returns nullptr.
ReservationPtr reserve(UInt64 bytes) const override;
/// Disk name from configuration;
const String & getName() const override { return name; }
/// Path on fs to disk
const String & getPath() const { return path; }
/// Path to clickhouse data directory on this disk
String getClickHouseDataPath() const { return path + "data/"; }
/// Amount of bytes which should be kept free on this disk
UInt64 getKeepingFreeSpace() const { return keep_free_space_bytes; }
/// Snapshot of disk space state (free and total space)
Stat getSpaceInformation() const { return Stat(*this); }
/// Total available space on disk
UInt64 getTotalSpace() const { return getSpaceInformation().getTotalSpace(); }
/// Space currently available on disk, take information from statvfs call
UInt64 getAvailableSpace() const { return getSpaceInformation().getAvailableSpace(); }
/// Currently available (prev method) minus already reserved space
UInt64 getUnreservedSpace() const;
private:
const String name;
const String path;
const UInt64 keep_free_space_bytes;
/// Used for reservation counters modification
static std::mutex mutex;
mutable UInt64 reserved_bytes = 0;
mutable UInt64 reservation_count = 0;
private:
/// Reserves bytes on disk, if not possible returns false
bool tryReserve(UInt64 bytes) const;
};
/// It is not possible to change a disk at runtime.
using DiskPtr = std::shared_ptr<const Disk>;
using Disks = std::vector<DiskPtr>;
/** Information about reserved size on concrete disk.
* Unreserve on destroy. Doesn't reserve bytes in constructor.
*/
class Reservation final : private boost::noncopyable
{
public:
Reservation(UInt64 size_, DiskPtr disk_ptr_)
: size(size_)
, metric_increment(CurrentMetrics::DiskSpaceReservedForMerge, size)
, disk_ptr(disk_ptr_)
{
}
/// Unreserves reserved space and decrement reservations count on disk
~Reservation();
/// Changes amount of reserved space. When new_size is greater than before,
/// availability of free space is not checked.
void update(UInt64 new_size);
/// Get reservation size
UInt64 getSize() const { return size; }
/// Get disk where reservation takes place
const DiskPtr & getDisk() const { return disk_ptr; }
private:
UInt64 size;
CurrentMetrics::Increment metric_increment;
DiskPtr disk_ptr;
};
/// Parse .xml configuration and store information about disks
/// Mostly used for introspection.
class DiskSelector
{
public:
DiskSelector(const Poco::Util::AbstractConfiguration & config,
const std::string & config_prefix, const String & default_path);
/// Get disk by name
const DiskPtr & operator[](const String & name) const;
/// Get map of all disks, keyed by name
const auto & getDisksMap() const { return disks; }
private:
std::map<String, DiskPtr> disks;
};
/**
* Disks group by some (user) criteria. For example,
* - Volume("slow_disks", [d1, d2], 100)
* - Volume("fast_disks", [d3, d4], 200)
* Cannot store parts larger than max_data_part_size.
*/
class Volume : public Space
{
friend class StoragePolicy;
public:
Volume(String name_, std::vector<DiskPtr> disks_, UInt64 max_data_part_size_)
: max_data_part_size(max_data_part_size_)
, disks(std::move(disks_))
, name(std::move(name_))
{
}
Volume(String name_, const Poco::Util::AbstractConfiguration & config,
const std::string & config_prefix, const DiskSelector & disk_selector);
/// Uses Round-robin to choose disk for reservation.
/// Returns valid reservation or nullptr if there is no space left on any disk.
ReservationPtr reserve(UInt64 bytes) const override;
/// Return biggest unreserved space across all disks
UInt64 getMaxUnreservedFreeSpace() const;
/// Volume name from config
const String & getName() const override { return name; }
/// Max size of reservation
UInt64 max_data_part_size = 0;
/// Disks in volume
Disks disks;
private:
mutable std::atomic<size_t> last_used = 0;
const String name;
};
using VolumePtr = std::shared_ptr<const Volume>;
using Volumes = std::vector<VolumePtr>;
/**
* Contains all information about volumes configuration for Storage.
* Can determine appropriate Volume and Disk for each reservation.
*/
class StoragePolicy : public Space
{
public:
StoragePolicy(String name_, const Poco::Util::AbstractConfiguration & config,
const std::string & config_prefix, const DiskSelector & disks);
StoragePolicy(String name_, Volumes volumes_, double move_factor_);
/// Returns disks ordered by volumes priority
Disks getDisks() const;
/// Returns any disk
/// Used when it's not important, for example for
/// mutations files
DiskPtr getAnyDisk() const;
DiskPtr getDiskByName(const String & disk_name) const;
/// Get free space from most free disk
UInt64 getMaxUnreservedFreeSpace() const;
const String & getName() const override { return name; }
/// Returns valid reservation or null
ReservationPtr reserve(UInt64 bytes) const override;
/// Reserve space on any volume with index >= min_volume_index
ReservationPtr reserve(UInt64 bytes, size_t min_volume_index) const;
/// Find volume index, which contains disk
size_t getVolumeIndexByDisk(const DiskPtr & disk_ptr) const;
/// Reserves 0 bytes on disk with max available space
/// Do not use this function when it is possible to predict size.
ReservationPtr makeEmptyReservationOnLargestDisk() const;
const Volumes & getVolumes() const { return volumes; }
/// Returns number [0., 1.] -- fraction of free space on disk
/// which should be kept with help of background moves
double getMoveFactor() const { return move_factor; }
/// Get volume by index from storage_policy
VolumePtr getVolume(size_t i) const { return (i < volumes_names.size() ? volumes[i] : VolumePtr()); }
VolumePtr getVolumeByName(const String & volume_name) const
{
auto it = volumes_names.find(volume_name);
if (it == volumes_names.end())
return {};
return getVolume(it->second);
}
private:
Volumes volumes;
const String name;
std::map<String, size_t> volumes_names;
/// move_factor from interval [0., 1.]
/// We move something if disk from this policy
/// filled more than total_size * move_factor
double move_factor = 0.1; /// by default move factor is 10%
};
using StoragePolicyPtr = std::shared_ptr<const StoragePolicy>;
/// Parse .xml configuration and store information about policies
/// Mostly used for introspection.
class StoragePolicySelector
{
public:
StoragePolicySelector(const Poco::Util::AbstractConfiguration & config,
const String & config_prefix, const DiskSelector & disks);
/// Policy by name
const StoragePolicyPtr & operator[](const String & name) const;
/// All policies
const std::map<String, StoragePolicyPtr> & getPoliciesMap() const { return policies; }
private:
std::map<String, StoragePolicyPtr> policies;
};
}
}
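The "unreserve on destroy" contract of `Reservation` can be sketched with a simplified disk. This is an illustration under stated assumptions, not the code from this commit: `SketchDisk`, `SketchReservation` and `sketchReserve` are hypothetical names, capacity is a fixed number rather than a `statvfs` result, and each disk has its own mutex.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical, simplified disk: a fixed capacity plus a counter of reserved bytes.
class SketchDisk
{
public:
    explicit SketchDisk(uint64_t capacity_) : capacity(capacity_) {}

    uint64_t getUnreservedSpace() const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return capacity - reserved_bytes;
    }

    // Returns false instead of over-committing the disk.
    bool tryReserve(uint64_t bytes) const
    {
        std::lock_guard<std::mutex> lock(mutex);
        if (bytes > capacity - reserved_bytes)
            return false;
        reserved_bytes += bytes;
        return true;
    }

    void unreserve(uint64_t bytes) const
    {
        std::lock_guard<std::mutex> lock(mutex);
        reserved_bytes -= bytes;
    }

private:
    const uint64_t capacity;
    mutable std::mutex mutex;
    mutable uint64_t reserved_bytes = 0;
};

// RAII handle: the constructor only records the size, the destructor returns the bytes.
class SketchReservation
{
public:
    SketchReservation(uint64_t size_, std::shared_ptr<const SketchDisk> disk_)
        : size(size_), disk(std::move(disk_)) {}
    ~SketchReservation() { disk->unreserve(size); }
    SketchReservation(const SketchReservation &) = delete;
    SketchReservation & operator=(const SketchReservation &) = delete;
    uint64_t getSize() const { return size; }

private:
    uint64_t size;
    std::shared_ptr<const SketchDisk> disk;
};

// Reserves first, then wraps the result; returns nullptr when the disk is too full.
std::unique_ptr<SketchReservation> sketchReserve(const std::shared_ptr<const SketchDisk> & disk, uint64_t bytes)
{
    if (!disk->tryReserve(bytes))
        return nullptr;
    try
    {
        return std::make_unique<SketchReservation>(bytes, disk);
    }
    catch (...)
    {
        disk->unreserve(bytes); // allocation failed: undo the counter update
        throw;
    }
}
```

Holding the disk through a `shared_ptr` keeps it alive for as long as any reservation refers to it, which is why the destructor can update the counter unconditionally.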

View File

@ -452,6 +452,11 @@ namespace ErrorCodes
extern const int INVALID_WITH_FILL_EXPRESSION = 475;
extern const int WITH_TIES_WITHOUT_ORDER_BY = 476;
extern const int INVALID_USAGE_OF_INPUT = 477;
extern const int UNKNOWN_POLICY = 478;
extern const int UNKNOWN_DISK = 479;
extern const int UNKNOWN_PROTOCOL = 480;
extern const int PATH_ACCESS_DENIED = 481;
extern const int KEEPER_EXCEPTION = 999;
extern const int POCO_EXCEPTION = 1000;

View File

@ -10,7 +10,7 @@
#include <common/demangle.h>
#include <Common/config_version.h>
#include <Common/formatReadable.h>
#include <Storages/MergeTree/DiskSpaceMonitor.h>
#include <Common/DiskSpaceMonitor.h>
#include <filesystem>
namespace DB
@ -84,16 +84,16 @@ void getNoSpaceLeftInfoMessage(std::filesystem::path path, std::string & msg)
while (!std::filesystem::exists(path) && path.has_relative_path())
path = path.parent_path();
auto fs = DiskSpaceMonitor::getStatVFS(path);
auto fs = DiskSpace::getStatVFS(path);
msg += "\nTotal space: " + formatReadableSizeWithBinarySuffix(fs.f_blocks * fs.f_bsize)
+ "\nAvailable space: " + formatReadableSizeWithBinarySuffix(fs.f_bavail * fs.f_bsize)
+ "\nTotal inodes: " + formatReadableQuantity(fs.f_files)
+ "\nAvailable inodes: " + formatReadableQuantity(fs.f_favail);
auto mount_point = DiskSpaceMonitor::getMountPoint(path).string();
auto mount_point = DiskSpace::getMountPoint(path).string();
msg += "\nMount point: " + mount_point;
#if defined(__linux__)
msg += "\nFilesystem: " + DiskSpaceMonitor::getFilesystemName(mount_point);
msg += "\nFilesystem: " + DiskSpace::getFilesystemName(mount_point);
#endif
}

View File

@ -0,0 +1,9 @@
#pragma once
#define __msan_unpoison(X, Y)
#if defined(__has_feature)
# if __has_feature(memory_sanitizer)
# undef __msan_unpoison
# include <sanitizer/msan_interface.h>
# endif
#endif

View File

@ -1,3 +1,5 @@
#include <Common/config.h>
#if USE_POCO_NETSSL
#include "OpenSSLHelpers.h"
#include <ext/scope_guard.h>
#include <openssl/err.h>
@ -16,3 +18,4 @@ String getOpenSSLErrors()
}
}
#endif

View File

@ -1,4 +1,6 @@
#pragma once
#include <Common/config.h>
#if USE_POCO_NETSSL
#include <Core/Types.h>
@ -10,3 +12,4 @@ namespace DB
String getOpenSSLErrors();
}
#endif

View File

@ -4,8 +4,6 @@
#include <Common/CurrentMetrics.h>
#include <Common/ProfileEvents.h>
#include <cassert>
namespace ProfileEvents
{
@ -35,22 +33,48 @@ namespace ErrorCodes
}
/** A single-use object that represents lock's ownership
* For the purpose of exception safety guarantees LockHolder is to be used in two steps:
* 1. Create an instance (allocating all the memory needed)
* 2. Associate the instance with the lock (attach to the lock and locking request group)
*/
class RWLockImpl::LockHolderImpl
{
bool bound{false};
Type lock_type;
String query_id;
CurrentMetrics::Increment active_client_increment;
RWLock parent;
GroupsContainer::iterator it_group;
ClientsContainer::iterator it_client;
QueryIdToHolder::key_type query_id;
CurrentMetrics::Increment active_client_increment;
LockHolderImpl(RWLock && parent, GroupsContainer::iterator it_group, ClientsContainer::iterator it_client);
public:
LockHolderImpl(const LockHolderImpl & other) = delete;
LockHolderImpl& operator=(const LockHolderImpl & other) = delete;
/// Implicit memory allocation for query_id is done here
LockHolderImpl(const String & query_id_, Type type)
: lock_type{type}, query_id{query_id_},
active_client_increment{
type == Type::Read ? CurrentMetrics::RWLockActiveReaders : CurrentMetrics::RWLockActiveWriters}
{
}
~LockHolderImpl();
private:
/// A separate method which binds the lock holder to the owned lock
/// N.B. It is very important that this method produces no allocations
bool bind_with(RWLock && parent_, GroupsContainer::iterator it_group_) noexcept
{
if (bound)
return false;
it_group = it_group_;
parent = std::move(parent_);
++it_group->refererrs;
bound = true;
return true;
}
friend class RWLockImpl;
};
@ -62,29 +86,33 @@ namespace
class QueryLockInfo
{
private:
std::mutex mutex;
mutable std::mutex mutex;
std::map<std::string, size_t> queries;
public:
void add(const String & query_id)
{
std::lock_guard lock(mutex);
++queries[query_id];
const auto res = queries.emplace(query_id, 1); // may throw
if (!res.second)
++res.first->second;
}
void remove(const String & query_id)
void remove(const String & query_id) noexcept
{
std::lock_guard lock(mutex);
auto it = queries.find(query_id);
assert(it != queries.end());
if (--it->second == 0)
queries.erase(it);
const auto query_it = queries.find(query_id);
if (query_it != queries.cend() && --query_it->second == 0)
queries.erase(query_it);
}
void check(const String & query_id)
void check(const String & query_id) const
{
std::lock_guard lock(mutex);
if (queries.count(query_id))
if (queries.find(query_id) != queries.cend())
throw Exception("Possible deadlock avoided. Client should retry.", ErrorCodes::DEADLOCK_AVOIDED);
}
};
@ -93,8 +121,16 @@ namespace
}
/** To guarantee that we do not get any piece of our data corrupted:
* 1. Perform all actions that include allocations before changing lock's internal state
* 2. Roll back any changes that make the state inconsistent
*
* Note: "SM" in the comments below stands for STATE MODIFICATION
*/
RWLockImpl::LockHolder RWLockImpl::getLock(RWLockImpl::Type type, const String & query_id)
{
const bool request_has_query_id = query_id != NO_QUERY;
Stopwatch watch(CLOCK_MONOTONIC_COARSE);
CurrentMetrics::Increment waiting_client_increment((type == Read) ? CurrentMetrics::RWLockWaitingReaders
: CurrentMetrics::RWLockWaitingWriters);
@ -106,29 +142,39 @@ RWLockImpl::LockHolder RWLockImpl::getLock(RWLockImpl::Type type, const String &
: ProfileEvents::RWLockWritersWaitMilliseconds, watch.elapsedMilliseconds());
};
GroupsContainer::iterator it_group;
ClientsContainer::iterator it_client;
/// This object is placed above unique_lock, because it may lock in destructor.
LockHolder res;
auto lock_holder = std::make_shared<LockHolderImpl>(query_id, type);
std::unique_lock lock(mutex);
/// Check if the same query is acquiring previously acquired lock
if (query_id != RWLockImpl::NO_QUERY)
/// The FastPath:
/// Check if the same query_id already holds the required lock in which case we can proceed without waiting
if (request_has_query_id)
{
auto it_query = query_id_to_holder.find(query_id);
if (it_query != query_id_to_holder.end())
res = it_query->second.lock();
}
const auto it_query = owner_queries.find(query_id);
if (it_query != owner_queries.end())
{
const auto current_owner_group = queue.begin();
if (res)
{
/// XXX: it means we can't upgrade lock from read to write - with proper waiting!
if (type != Read || res->it_group->type != Read)
throw Exception("Attempt to acquire exclusive lock recursively", ErrorCodes::LOGICAL_ERROR);
else
return res;
/// XXX: it means we can't upgrade lock from read to write!
if (type == Write)
throw Exception(
"RWLockImpl::getLock(): Cannot acquire exclusive lock while RWLock is already locked",
ErrorCodes::LOGICAL_ERROR);
if (current_owner_group->type == Write)
throw Exception(
"RWLockImpl::getLock(): RWLock is already locked in exclusive mode",
ErrorCodes::LOGICAL_ERROR);
/// N.B. Type is Read here, query_id is not empty and it_query is a valid iterator
all_read_locks.add(query_id); /// SM1: may throw on insertion (nothing to roll back)
++it_query->second; /// SM2: nothrow
lock_holder->bind_with(shared_from_this(), current_owner_group); /// SM3: nothrow
finalize_metrics();
return lock_holder;
}
}
/** If the query already has any active read lock and tries to acquire another read lock
@ -148,86 +194,106 @@ RWLockImpl::LockHolder RWLockImpl::getLock(RWLockImpl::Type type, const String &
if (type == Type::Write || queue.empty() || queue.back().type == Type::Write)
{
if (type == Type::Read && !queue.empty() && queue.back().type == Type::Write && query_id != RWLockImpl::NO_QUERY)
if (type == Type::Read && request_has_query_id && !queue.empty())
all_read_locks.check(query_id);
/// Create new group of clients
it_group = queue.emplace(queue.end(), type);
/// Create a new group of locking requests
queue.emplace_back(type); /// SM1: may throw (nothing to roll back)
}
else
{
/// Will append myself to last group
it_group = std::prev(queue.end());
else if (request_has_query_id && queue.size() > 1)
all_read_locks.check(query_id);
if (it_group != queue.begin() && query_id != RWLockImpl::NO_QUERY)
all_read_locks.check(query_id);
}
GroupsContainer::iterator it_group = std::prev(queue.end());
/// Append myself to the end of chosen group
auto & clients = it_group->clients;
try
{
it_client = clients.emplace(clients.end(), type);
}
catch (...)
{
/// Remove group if it was the first client in the group and an error occurred
if (clients.empty())
queue.erase(it_group);
throw;
}
res.reset(new LockHolderImpl(shared_from_this(), it_group, it_client));
/// We need to reference the associated group before waiting to guarantee
/// that this group does not get deleted prematurely
++it_group->refererrs;
/// Wait a notification until we will be the only in the group.
it_group->cv.wait(lock, [&] () { return it_group == queue.begin(); });
/// Insert myself (weak_ptr to the holder) to queries set to implement recursive lock
if (query_id != RWLockImpl::NO_QUERY)
{
query_id_to_holder.emplace(query_id, res);
--it_group->refererrs;
if (type == Type::Read)
all_read_locks.add(query_id);
if (request_has_query_id)
{
try
{
if (type == Type::Read)
all_read_locks.add(query_id); /// SM2: may throw on insertion
/// and is safe to roll back unconditionally
const auto emplace_res =
owner_queries.emplace(query_id, 1); /// SM3: may throw on insertion
if (!emplace_res.second)
++emplace_res.first->second; /// SM4: nothrow
}
catch (...)
{
/// Methods std::list<>::emplace_back() and std::unordered_map<>::emplace() provide strong exception safety
/// We only need to roll back the changes to these objects: all_read_locks and the locking queue
if (type == Type::Read)
all_read_locks.remove(query_id); /// Rollback(SM2): nothrow
if (it_group->refererrs == 0)
{
const auto next = queue.erase(it_group); /// Rollback(SM1): nothrow
if (next != queue.end())
next->cv.notify_all();
}
throw;
}
}
res->query_id = query_id;
lock_holder->bind_with(shared_from_this(), it_group); /// SM: nothrow
finalize_metrics();
return res;
return lock_holder;
}
/** The sequence points of acquiring lock's ownership by an instance of LockHolderImpl:
* 1. all_read_locks is updated
* 2. owner_queries is updated
* 3. request group is updated by LockHolderImpl which in turn becomes "bound"
*
* If by the time when destructor of LockHolderImpl is called the instance has been "bound",
* it is guaranteed that all three steps have been executed successfully and the resulting state is consistent.
* With the mutex locked the order of steps to restore the lock's state can be arbitrary
*
* We do not employ try-catch: if something bad happens, there is nothing we can do =(
*/
RWLockImpl::LockHolderImpl::~LockHolderImpl()
{
if (!bound || parent == nullptr)
return;
std::lock_guard lock(parent->mutex);
/// Remove weak_ptrs to the holder, since there are no owners of the current lock
parent->query_id_to_holder.erase(query_id);
/// The associated group must exist (and be the beginning of the queue?)
if (parent->queue.empty() || it_group != parent->queue.begin())
return;
if (*it_client == RWLockImpl::Read && query_id != RWLockImpl::NO_QUERY)
all_read_locks.remove(query_id);
/// Removes myself from client list of our group
it_group->clients.erase(it_client);
/// Remove the group if we were the last client and notify the next group
if (it_group->clients.empty())
/// If query_id is not empty it must be listed in parent->owner_queries
if (query_id != RWLockImpl::NO_QUERY)
{
auto & parent_queue = parent->queue;
parent_queue.erase(it_group);
const auto owner_it = parent->owner_queries.find(query_id);
if (owner_it != parent->owner_queries.end())
{
if (--owner_it->second == 0) /// SM: nothrow
parent->owner_queries.erase(owner_it); /// SM: nothrow
if (!parent_queue.empty())
parent_queue.front().cv.notify_all();
if (lock_type == RWLockImpl::Read)
all_read_locks.remove(query_id); /// SM: nothrow
}
}
/// If we are the last remaining referrer, remove the group and notify the next group
if (--it_group->refererrs == 0) /// SM: nothrow
{
const auto next = parent->queue.erase(it_group); /// SM: nothrow
if (next != parent->queue.end())
next->cv.notify_all();
}
}
RWLockImpl::LockHolderImpl::LockHolderImpl(RWLock && parent_, RWLockImpl::GroupsContainer::iterator it_group_,
RWLockImpl::ClientsContainer::iterator it_client_)
: parent{std::move(parent_)}, it_group{it_group_}, it_client{it_client_},
active_client_increment{(*it_client == RWLockImpl::Read) ? CurrentMetrics::RWLockActiveReaders
: CurrentMetrics::RWLockActiveWriters}
{
}
}
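The rule stated in the comments above (do everything that may throw before or during a state modification, then roll back with nothrow operations) can be sketched with two counter maps. This is a simplified illustration, not the `RWLockImpl` code: `SketchRegistry`, `increment` and `decrement` are hypothetical names, and the queue, condition variables and mutex are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Adds one reference for id. emplace() has the strong guarantee:
// if it throws, the container is left unchanged.
template <typename Map>
void increment(Map & counts, const std::string & id)
{
    const auto res = counts.emplace(id, 1);
    if (!res.second)
        ++res.first->second;
}

// Drops one reference for id; erases the entry when the count reaches zero.
// A missing id is ignored, so this is safe to call during rollback.
template <typename Map>
void decrement(Map & counts, const std::string & id) noexcept
{
    const auto it = counts.find(id);
    if (it != counts.end() && --it->second == 0)
        counts.erase(it);
}

struct SketchRegistry
{
    std::map<std::string, size_t> readers;           // plays the role of all_read_locks
    std::unordered_map<std::string, size_t> owners;  // plays the role of owner_queries

    // Either both containers are updated or neither is.
    void acquire(const std::string & id)
    {
        increment(readers, id); // SM1: may throw, nothing to roll back yet
        try
        {
            increment(owners, id); // SM2: may throw
        }
        catch (...)
        {
            decrement(readers, id); // Rollback(SM1): nothrow
            throw;
        }
    }

    // Release only uses nothrow operations, so it can run from a destructor.
    void release(const std::string & id) noexcept
    {
        decrement(owners, id);
        decrement(readers, id);
    }
};
```

Keeping the release path free of allocations is what lets a destructor restore the state without a try-catch block.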

View File

@ -8,6 +8,7 @@
#include <condition_variable>
#include <map>
#include <string>
#include <unordered_map>
namespace DB
@ -53,25 +54,24 @@ private:
struct Group;
using GroupsContainer = std::list<Group>;
using ClientsContainer = std::list<Type>;
using QueryIdToHolder = std::map<String, std::weak_ptr<LockHolderImpl>>;
using OwnerQueryIds = std::unordered_map<String, size_t>;
/// Group of clients that should be executed concurrently
/// i.e. a group could contain several readers, but only one writer
/// Group of locking requests that should be granted concurrently
/// i.e. a group can contain several readers, but only one writer
struct Group
{
// FIXME: there is only redundant |type| information inside |clients|.
const Type type;
ClientsContainer clients;
size_t refererrs;
std::condition_variable cv; /// all clients of the group wait group condvar
std::condition_variable cv; /// all locking requests of the group wait on this condvar
explicit Group(Type type_) : type{type_} {}
explicit Group(Type type_) : type{type_}, refererrs{0} {}
};
mutable std::mutex mutex;
GroupsContainer queue;
QueryIdToHolder query_id_to_holder;
OwnerQueryIds owner_queries;
mutable std::mutex mutex;
};

View File

@ -11,6 +11,7 @@
#endif
#include <gtest/gtest.h>
#include <chrono>
namespace DB

View File

@ -1,3 +1,5 @@
#include <Common/config.h>
#if USE_POCO_NETSSL
#include <IO/WriteBuffer.h>
#include <IO/ReadBufferFromString.h>
#include <IO/WriteBufferFromString.h>
@ -100,3 +102,4 @@ size_t getLengthEncodedStringSize(const String & s)
}
}
#endif

View File

@ -1,4 +1,6 @@
#pragma once
#include <Common/config.h>
#if USE_POCO_NETSSL
#include <ext/scope_guard.h>
#include <openssl/pem.h>
@ -1075,3 +1077,4 @@ private:
}
}
#endif

View File

@ -74,7 +74,8 @@ struct Settings : public SettingsCollection<Settings>
M(SettingUInt64, background_pool_size, 16, "Number of threads performing background work for tables (for example, merging in merge tree). Only has meaning at server startup.") \
M(SettingUInt64, background_schedule_pool_size, 16, "Number of threads performing background tasks for replicated tables. Only has meaning at server startup.") \
\
M(SettingMilliseconds, distributed_directory_monitor_sleep_time_ms, 100, "Sleep time for StorageDistributed DirectoryMonitors in case there is no work or exception has been thrown.") \
M(SettingMilliseconds, distributed_directory_monitor_sleep_time_ms, 100, "Sleep time for StorageDistributed DirectoryMonitors, in case of any errors delay grows exponentially.") \
M(SettingMilliseconds, distributed_directory_monitor_max_sleep_time_ms, 30000, "Maximum sleep time for StorageDistributed DirectoryMonitors, it limits exponential growth too.") \
\
M(SettingBool, distributed_directory_monitor_batch_inserts, false, "Should StorageDistributed DirectoryMonitors try to batch individual inserts into bigger ones.") \
\
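The two `distributed_directory_monitor_*sleep_time_ms` descriptions above say that the delay grows exponentially on errors and is capped by the maximum. The growth formula itself is not part of this diff, so the sketch below assumes the base delay doubles once per consecutive error; `sketchSleepTime` is a hypothetical helper, not a ClickHouse function.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Assumption: the delay doubles per consecutive error and never exceeds max_sleep_time.
std::chrono::milliseconds sketchSleepTime(
    std::chrono::milliseconds base,
    std::chrono::milliseconds max_sleep_time,
    size_t error_count)
{
    // Cap the exponent so the multiplication stays far from overflow for small base delays.
    const size_t shift = std::min<size_t>(error_count, 20);
    const std::chrono::milliseconds grown = base * (std::chrono::milliseconds::rep{1} << shift);
    return std::min(grown, max_sleep_time);
}
```

With the defaults shown above (100 ms base, 30000 ms maximum), this sketch reaches the cap after nine consecutive errors.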

View File

@ -29,8 +29,8 @@ NativeBlockInputStream::NativeBlockInputStream(ReadBuffer & istr_, UInt64 server
{
}
NativeBlockInputStream::NativeBlockInputStream(ReadBuffer & istr_, const Block & header_, UInt64 server_revision_, bool convert_types_to_low_cardinality_)
: istr(istr_), header(header_), server_revision(server_revision_), convert_types_to_low_cardinality(convert_types_to_low_cardinality_)
NativeBlockInputStream::NativeBlockInputStream(ReadBuffer & istr_, const Block & header_, UInt64 server_revision_)
: istr(istr_), header(header_), server_revision(server_revision_)
{
}
@ -153,12 +153,15 @@ Block NativeBlockInputStream::readImpl()
column.column = std::move(read_column);
/// Support insert from old clients without low cardinality type.
bool revision_without_low_cardinality = server_revision && server_revision < DBMS_MIN_REVISION_WITH_LOW_CARDINALITY_TYPE;
if (header && (convert_types_to_low_cardinality || revision_without_low_cardinality))
if (header)
{
column.column = recursiveLowCardinalityConversion(column.column, column.type, header.getByPosition(i).type);
column.type = header.getByPosition(i).type;
/// Support insert from old clients without low cardinality type.
auto & header_column = header.getByName(column.name);
if (!header_column.type->equals(*column.type))
{
column.column = recursiveLowCardinalityConversion(column.column, column.type, header.getByPosition(i).type);
column.type = header.getByPosition(i).type;
}
}
res.insert(std::move(column));
@ -177,6 +180,22 @@ Block NativeBlockInputStream::readImpl()
index_column_it = index_block_it->columns.begin();
}
if (rows && header)
{
/// Allow to skip columns. Fill them with default values.
Block tmp_res;
for (auto & col : header)
{
if (res.has(col.name))
tmp_res.insert(std::move(res.getByName(col.name)));
else
tmp_res.insert({col.type->createColumn()->cloneResized(rows), col.type, col.name});
}
res.swap(tmp_res);
}
return res;
}

View File

@ -65,7 +65,7 @@ public:
/// For cases when data structure (header) is known in advance.
/// NOTE We may use header for data validation and/or type conversions. It is not implemented.
NativeBlockInputStream(ReadBuffer & istr_, const Block & header_, UInt64 server_revision_, bool convert_types_to_low_cardinality_ = false);
NativeBlockInputStream(ReadBuffer & istr_, const Block & header_, UInt64 server_revision_);
/// For cases when we have an index. It allows to skip columns. Only columns specified in the index will be read.
NativeBlockInputStream(ReadBuffer & istr_, UInt64 server_revision_,
@ -91,8 +91,6 @@ private:
IndexForNativeFormat::Blocks::const_iterator index_block_end;
IndexOfBlockForNativeFormat::Columns::const_iterator index_column_it;
bool convert_types_to_low_cardinality = false;
/// If an index is specified, then `istr` must be CompressedReadBufferFromFile. Unused otherwise.
CompressedReadBufferFromFile * istr_concrete = nullptr;

View File

@ -28,8 +28,6 @@ void copyDataImpl(IBlockInputStream & from, IBlockOutputStream & to, TCancelCall
break;
to.write(block);
if (!block.rows())
to.flush();
progress(block);
}

View File

@ -1,3 +1,4 @@
#include <Common/config.h>
#include <Common/Exception.h>
#include <Interpreters/Context.h>
#include <Core/Settings.h>
@ -312,7 +313,9 @@ FormatFactory::FormatFactory()
registerOutputFormatProcessorODBCDriver(*this);
registerOutputFormatProcessorODBCDriver2(*this);
registerOutputFormatProcessorNull(*this);
#if USE_POCO_NETSSL
registerOutputFormatProcessorMySQLWrite(*this);
#endif
}
FormatFactory & FormatFactory::instance()

View File

@ -254,6 +254,9 @@ void geohashDecode(const char * encoded_string, size_t encoded_len, Float64 * lo
const UInt8 precision = std::min(encoded_len, static_cast<size_t>(MAX_PRECISION));
if (precision == 0)
{
// Empty string is converted to (0, 0)
*longitude = 0;
*latitude = 0;
return;
}

View File

@ -15,7 +15,6 @@ if(USE_HYPERSCAN)
target_include_directories(clickhouse_functions_url SYSTEM PRIVATE ${HYPERSCAN_INCLUDE_DIR})
endif()
include(${ClickHouse_SOURCE_DIR}/cmake/find_gperf.cmake)
if (USE_GPERF)
# Only for regenerate
add_custom_target(generate-tldlookup-gperf ./tldLookup.sh

View File

@ -336,10 +336,6 @@ void FunctionArrayEnumerateRankedExtended<Derived>::executeMethodImpl(
/// Skipping offsets if no data in this array
if (prev_off == off)
{
if (depth_to_look > 2)
want_clear = true;
if (depth_to_look >= 2)
{
/// Advance to the next element of the parent array.
@ -357,6 +353,7 @@ void FunctionArrayEnumerateRankedExtended<Derived>::executeMethodImpl(
{
last_offset_by_depth[depth] = (*offsets_by_depth[depth])[current_offset_n_by_depth[depth]];
++current_offset_n_by_depth[depth];
want_clear = true;
}
else
{

View File

@ -2,6 +2,7 @@
#include <Common/Exception.h>
#include <common/logger_useful.h>
#include <Common/MemorySanitizer.h>
#include <Poco/Logger.h>
#include <boost/range/iterator_range.hpp>
#include <errno.h>
@ -69,6 +70,9 @@ int AIOContextPool::getCompletionEvents(io_event events[], const int max_events)
if (errno != EINTR)
throwFromErrno("io_getevents: Failed to wait for asynchronous IO completion", ErrorCodes::CANNOT_IO_GETEVENTS, errno);
/// Unpoison the memory returned from a non-instrumented system call.
__msan_unpoison(events, sizeof(*events) * num_events);
return num_events;
}

View File

@ -38,20 +38,22 @@ namespace detail
SessionPtr session;
std::istream * istr; /// owned by session
std::unique_ptr<ReadBuffer> impl;
std::vector<Poco::Net::HTTPCookie> cookies;
public:
using OutStreamCallback = std::function<void(std::ostream &)>;
explicit ReadWriteBufferFromHTTPBase(SessionPtr session_,
explicit ReadWriteBufferFromHTTPBase(
SessionPtr session_,
Poco::URI uri_,
const std::string & method_ = {},
OutStreamCallback out_stream_callback = {},
const Poco::Net::HTTPBasicCredentials & credentials = {},
size_t buffer_size_ = DBMS_DEFAULT_BUFFER_SIZE)
: ReadBuffer(nullptr, 0)
, uri {uri_}
, method {!method_.empty() ? method_ : out_stream_callback ? Poco::Net::HTTPRequest::HTTP_POST : Poco::Net::HTTPRequest::HTTP_GET}
, session {session_}
, uri{uri_}
, method{!method_.empty() ? method_ : out_stream_callback ? Poco::Net::HTTPRequest::HTTP_POST : Poco::Net::HTTPRequest::HTTP_GET}
, session{session_}
{
// With an empty path, Poco will send "POST HTTP/1.1"; this is a Poco bug.
if (uri.getPath().empty())
@ -78,6 +80,7 @@ namespace detail
out_stream_callback(stream_out);
istr = receiveResponse(*session, request, response);
response.getCookies(cookies);
impl = std::make_unique<ReadBufferFromIStream>(*istr, buffer_size_);
}
@ -90,7 +93,6 @@ namespace detail
}
}
bool nextImpl() override
{
if (!impl->next())
@ -99,6 +101,14 @@ namespace detail
working_buffer = internal_buffer;
return true;
}
std::string getResponseCookie(const std::string & name, const std::string & def) const
{
for (const auto & cookie : cookies)
if (cookie.getName() == name)
return cookie.getValue();
return def;
}
};
}
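
The lookup added in `getResponseCookie` is a linear scan with a fallback value. Below is a minimal sketch of the same idea without Poco: the `Cookie` struct and `findCookie` are hypothetical stand-ins for `Poco::Net::HTTPCookie` and the member function in the diff.

```cpp
#include <string>
#include <vector>

// Hypothetical minimal cookie type standing in for Poco::Net::HTTPCookie.
struct Cookie
{
    std::string name;
    std::string value;
};

// Returns the value of the first cookie whose name matches,
// or `def` when no cookie with that name exists.
std::string findCookie(const std::vector<Cookie> & cookies, const std::string & name, const std::string & def)
{
    for (const auto & cookie : cookies)
        if (cookie.name == name)
            return cookie.value;
    return def;
}
```

If a response carries several cookies with the same name, this form returns the first one in the order they were stored.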

View File

@ -1,6 +1,7 @@
#if defined(__linux__) || defined(__FreeBSD__)
#include <IO/WriteBufferAIO.h>
#include <Common/MemorySanitizer.h>
#include <Common/ProfileEvents.h>
#include <limits>
@ -200,6 +201,9 @@ bool WriteBufferAIO::waitForAIOCompletion()
}
}
// Unpoison the memory returned from an uninstrumented system function.
__msan_unpoison(&event, sizeof(event));
is_pending_write = false;
#if defined(__FreeBSD__)
bytes_written = aio_return(reinterpret_cast<struct aiocb *>(event.udata));

View File

@ -1,4 +1,5 @@
#include <IO/ZlibDeflatingWriteBuffer.h>
#include <Common/MemorySanitizer.h>
namespace DB
@ -70,6 +71,12 @@ void ZlibDeflatingWriteBuffer::nextImpl()
int rc = deflate(&zstr, Z_NO_FLUSH);
out.position() = out.buffer().end() - zstr.avail_out;
// Unpoison the result of deflate explicitly. It uses some custom SSE algo
// for computing CRC32, and it looks like msan is unable to comprehend
// it fully, so it complains about the resulting value depending on the
// uninitialized padding of the input buffer.
__msan_unpoison(out.position(), zstr.avail_out);
if (rc != Z_OK)
throw Exception(std::string("deflate failed: ") + zError(rc), ErrorCodes::ZLIB_DEFLATE_FAILED);
}
@ -92,6 +99,12 @@ void ZlibDeflatingWriteBuffer::finish()
int rc = deflate(&zstr, Z_FINISH);
out.position() = out.buffer().end() - zstr.avail_out;
// Unpoison the result of deflate explicitly. It uses some custom SSE algo
// for computing CRC32, and it looks like msan is unable to comprehend
// it fully, so it complains about the resulting value depending on the
// uninitialized padding of the input buffer.
__msan_unpoison(out.position(), zstr.avail_out);
if (rc == Z_STREAM_END)
{
finished = true;

View File

@ -15,6 +15,7 @@ namespace ActionLocks
extern const StorageActionBlockType ReplicationQueue = 4;
extern const StorageActionBlockType DistributedSend = 5;
extern const StorageActionBlockType PartsTTLMerge = 6;
extern const StorageActionBlockType PartsMove = 7;
}

View File

@ -508,6 +508,13 @@ void NO_INLINE Aggregator::executeWithoutKeyImpl(
bool Aggregator::executeOnBlock(const Block & block, AggregatedDataVariants & result,
ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, bool & no_more_keys)
{
UInt64 num_rows = block.rows();
return executeOnBlock(block.getColumns(), num_rows, result, key_columns, aggregate_columns, no_more_keys);
}
bool Aggregator::executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, bool & no_more_keys)
{
if (isCancelled())
return true;
@ -538,7 +545,7 @@ bool Aggregator::executeOnBlock(const Block & block, AggregatedDataVariants & re
/// Remember the columns we will work with
for (size_t i = 0; i < params.keys_size; ++i)
{
materialized_columns.push_back(block.safeGetByPosition(params.keys[i]).column->convertToFullColumnIfConst());
materialized_columns.push_back(columns.at(params.keys[i])->convertToFullColumnIfConst());
key_columns[i] = materialized_columns.back().get();
if (!result.isLowCardinality())
@ -559,7 +566,7 @@ bool Aggregator::executeOnBlock(const Block & block, AggregatedDataVariants & re
{
for (size_t j = 0; j < aggregate_columns[i].size(); ++j)
{
materialized_columns.push_back(block.safeGetByPosition(params.aggregates[i].arguments[j]).column->convertToFullColumnIfConst());
materialized_columns.push_back(columns.at(params.aggregates[i].arguments[j])->convertToFullColumnIfConst());
aggregate_columns[i][j] = materialized_columns.back().get();
auto column_no_lc = recursiveRemoveLowCardinality(aggregate_columns[i][j]->getPtr());
@ -579,8 +586,6 @@ bool Aggregator::executeOnBlock(const Block & block, AggregatedDataVariants & re
if (isCancelled())
return true;
size_t rows = block.rows();
if ((params.overflow_row || result.type == AggregatedDataVariants::Type::without_key) && !result.without_key)
{
AggregateDataPtr place = result.aggregates_pool->alignedAlloc(total_size_of_aggregate_states, align_aggregate_states);
@ -593,7 +598,7 @@ bool Aggregator::executeOnBlock(const Block & block, AggregatedDataVariants & re
/// For the case when there are no keys (all aggregate into one row).
if (result.type == AggregatedDataVariants::Type::without_key)
{
executeWithoutKeyImpl(result.without_key, rows, aggregate_functions_instructions.data(), result.aggregates_pool);
executeWithoutKeyImpl(result.without_key, num_rows, aggregate_functions_instructions.data(), result.aggregates_pool);
}
else
{
@ -602,7 +607,7 @@ bool Aggregator::executeOnBlock(const Block & block, AggregatedDataVariants & re
#define M(NAME, IS_TWO_LEVEL) \
else if (result.type == AggregatedDataVariants::Type::NAME) \
executeImpl(*result.NAME, result.aggregates_pool, rows, key_columns, aggregate_functions_instructions.data(), \
executeImpl(*result.NAME, result.aggregates_pool, num_rows, key_columns, aggregate_functions_instructions.data(), \
no_more_keys, overflow_row_ptr);
if (false) {}
@ -740,6 +745,30 @@ Block Aggregator::convertOneBucketToBlock(
return block;
}
Block Aggregator::mergeAndConvertOneBucketToBlock(
ManyAggregatedDataVariants & variants,
Arena * arena,
bool final,
size_t bucket) const
{
auto & merged_data = *variants[0];
auto method = merged_data.type;
Block block;
if (false) {}
#define M(NAME) \
else if (method == AggregatedDataVariants::Type::NAME) \
{ \
mergeBucketImpl<decltype(merged_data.NAME)::element_type>(variants, bucket, arena); \
block = convertOneBucketToBlock(merged_data, *merged_data.NAME, final, bucket); \
}
APPLY_FOR_VARIANTS_TWO_LEVEL(M)
#undef M
return block;
}
template <typename Method>
void Aggregator::writeToTemporaryFileImpl(
@ -1635,9 +1664,7 @@ private:
}
};
std::unique_ptr<IBlockInputStream> Aggregator::mergeAndConvertToBlocks(
ManyAggregatedDataVariants & data_variants, bool final, size_t max_threads) const
ManyAggregatedDataVariants Aggregator::prepareVariantsToMerge(ManyAggregatedDataVariants & data_variants) const
{
if (data_variants.empty())
throw Exception("Empty data passed to Aggregator::mergeAndConvertToBlocks.", ErrorCodes::EMPTY_DATA_PASSED);
@ -1651,7 +1678,7 @@ std::unique_ptr<IBlockInputStream> Aggregator::mergeAndConvertToBlocks(
non_empty_data.push_back(data);
if (non_empty_data.empty())
return std::make_unique<NullBlockInputStream>(getHeader(final));
return {};
if (non_empty_data.size() > 1)
{
@ -1695,6 +1722,17 @@ std::unique_ptr<IBlockInputStream> Aggregator::mergeAndConvertToBlocks(
non_empty_data[i]->aggregates_pools.begin(), non_empty_data[i]->aggregates_pools.end());
}
return non_empty_data;
}
std::unique_ptr<IBlockInputStream> Aggregator::mergeAndConvertToBlocks(
ManyAggregatedDataVariants & data_variants, bool final, size_t max_threads) const
{
ManyAggregatedDataVariants non_empty_data = prepareVariantsToMerge(data_variants);
if (non_empty_data.empty())
return std::make_unique<NullBlockInputStream>(getHeader(final));
return std::make_unique<MergingAndConvertingBlockInputStream>(*this, non_empty_data, final, max_threads);
}

View File

@ -744,6 +744,7 @@ struct AggregatedDataVariants : private boost::noncopyable
using AggregatedDataVariantsPtr = std::shared_ptr<AggregatedDataVariants>;
using ManyAggregatedDataVariants = std::vector<AggregatedDataVariantsPtr>;
using ManyAggregatedDataVariantsPtr = std::shared_ptr<ManyAggregatedDataVariants>;
/** How are "total" values calculated with WITH TOTALS?
* (For more details, see TotalsHavingBlockInputStream.)
@ -845,6 +846,10 @@ public:
ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, /// Passed to not create them anew for each block
bool & no_more_keys);
bool executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, /// Passed to not create them anew for each block
bool & no_more_keys);
/** Convert the aggregation data structure into a block.
* If overflow_row = true, then aggregates for rows that are not included in max_rows_to_group_by are put in the first block.
*
@ -857,6 +862,7 @@ public:
/** Merge several aggregation data structures and output the result as a block stream.
*/
std::unique_ptr<IBlockInputStream> mergeAndConvertToBlocks(ManyAggregatedDataVariants & data_variants, bool final, size_t max_threads) const;
ManyAggregatedDataVariants prepareVariantsToMerge(ManyAggregatedDataVariants & data_variants) const;
/** Merge the stream of partially aggregated blocks into one data structure.
* (Pre-aggregate several blocks that represent the result of independent aggregations from remote servers.)
@ -911,6 +917,8 @@ public:
protected:
friend struct AggregatedDataVariants;
friend class MergingAndConvertingBlockInputStream;
friend class ConvertingAggregatedToChunksTransform;
friend class ConvertingAggregatedToChunksSource;
Params params;
@ -1091,6 +1099,12 @@ protected:
bool final,
size_t bucket) const;
Block mergeAndConvertOneBucketToBlock(
ManyAggregatedDataVariants & variants,
Arena * arena,
bool final,
size_t bucket) const;
Block prepareBlockAndFillWithoutKey(AggregatedDataVariants & data_variants, bool final, bool is_overflows) const;
Block prepareBlockAndFillSingleLevel(AggregatedDataVariants & data_variants, bool final) const;
BlocksList prepareBlocksAndFillTwoLevel(AggregatedDataVariants & data_variants, bool final, ThreadPool * thread_pool) const;
@ -1162,5 +1176,4 @@ APPLY_FOR_AGGREGATED_VARIANTS(M)
#undef M
}

View File

@ -143,6 +143,11 @@ struct ContextShared
std::unique_ptr<DDLWorker> ddl_worker; /// Process ddl commands from zk.
/// Rules for selecting the compression settings, depending on the size of the part.
mutable std::unique_ptr<CompressionCodecSelector> compression_codec_selector;
/// Storage disk chooser
mutable std::unique_ptr<DiskSpace::DiskSelector> merge_tree_disk_selector;
/// Storage policy chooser
mutable std::unique_ptr<DiskSpace::StoragePolicySelector> merge_tree_storage_policy_selector;
std::optional<MergeTreeSettings> merge_tree_settings; /// Settings of MergeTree* engines.
size_t max_table_size_to_drop = 50000000000lu; /// Protects MergeTree tables from accidental DROP (50GB by default)
size_t max_partition_size_to_drop = 50000000000lu; /// Protects MergeTree partitions from accidental DROP (50GB by default)
@ -1759,6 +1764,56 @@ CompressionCodecPtr Context::chooseCompressionCodec(size_t part_size, double par
}
const DiskSpace::DiskPtr & Context::getDisk(const String & name) const
{
auto lock = getLock();
const auto & disk_selector = getDiskSelector();
return disk_selector[name];
}
DiskSpace::DiskSelector & Context::getDiskSelector() const
{
auto lock = getLock();
if (!shared->merge_tree_disk_selector)
{
constexpr auto config_name = "storage_configuration.disks";
auto & config = getConfigRef();
shared->merge_tree_disk_selector = std::make_unique<DiskSpace::DiskSelector>(config, config_name, getPath());
}
return *shared->merge_tree_disk_selector;
}
const DiskSpace::StoragePolicyPtr & Context::getStoragePolicy(const String & name) const
{
auto lock = getLock();
auto & policy_selector = getStoragePolicySelector();
return policy_selector[name];
}
DiskSpace::StoragePolicySelector & Context::getStoragePolicySelector() const
{
auto lock = getLock();
if (!shared->merge_tree_storage_policy_selector)
{
constexpr auto config_name = "storage_configuration.policies";
auto & config = getConfigRef();
shared->merge_tree_storage_policy_selector = std::make_unique<DiskSpace::StoragePolicySelector>(config, config_name, getDiskSelector());
}
return *shared->merge_tree_storage_policy_selector;
}
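
The selector getters above create their object on first use while holding the context lock, and one getter calls another that takes the same lock again. A minimal sketch of that pattern follows, assuming the lock is recursive; `Registry`, `Selector`, and `getConfigName` are hypothetical names and not part of the ClickHouse API.

```cpp
#include <memory>
#include <mutex>
#include <string>

// Hypothetical selector type; it only remembers the config key it was built from.
struct Selector
{
    std::string config_name;
};

class Registry
{
public:
    // Creates the selector on first call and returns the same object afterwards.
    Selector & getSelector() const
    {
        std::lock_guard<std::recursive_mutex> lock(mutex);
        if (!selector)
            selector = std::make_unique<Selector>(Selector{"storage_configuration.disks"});
        return *selector;
    }

    // Takes the lock and then calls getSelector(), which locks again.
    // This is only safe because the mutex is recursive (an assumption here).
    const std::string & getConfigName() const
    {
        std::lock_guard<std::recursive_mutex> lock(mutex);
        return getSelector().config_name;
    }

private:
    mutable std::recursive_mutex mutex;
    mutable std::unique_ptr<Selector> selector;
};
```

With a non-recursive `std::mutex`, the nested call in `getConfigName` would deadlock, so the choice of mutex type matters for this layout.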
const MergeTreeSettings & Context::getMergeTreeSettings() const
{
auto lock = getLock();

View File

@ -13,6 +13,7 @@
#include <Common/ThreadPool.h>
#include "config_core.h"
#include <Storages/IStorage_fwd.h>
#include <Common/DiskSpaceMonitor.h>
#include <atomic>
#include <chrono>
#include <condition_variable>
@ -485,6 +486,16 @@ public:
/// Lets you select the compression codec according to the conditions described in the configuration file.
std::shared_ptr<ICompressionCodec> chooseCompressionCodec(size_t part_size, double part_size_ratio) const;
DiskSpace::DiskSelector & getDiskSelector() const;
/// Provides storage disks
const DiskSpace::DiskPtr & getDisk(const String & name) const;
DiskSpace::StoragePolicySelector & getStoragePolicySelector() const;
/// Provides storage policies
const DiskSpace::StoragePolicyPtr & getStoragePolicy(const String &name) const;
/// Get the server uptime in seconds.
time_t getUptimeSeconds() const;

View File

@ -78,12 +78,16 @@ public:
return;
}
/// Generate the name for the external table.
String external_table_name = "_data" + toString(external_table_id);
while (external_tables.count(external_table_name))
String external_table_name = subquery_or_table_name->tryGetAlias();
if (external_table_name.empty())
{
++external_table_id;
/// Generate the name for the external table.
external_table_name = "_data" + toString(external_table_id);
while (external_tables.count(external_table_name))
{
++external_table_id;
external_table_name = "_data" + toString(external_table_id);
}
}
auto interpreter = interpretSubquery(subquery_or_table_name, context, subquery_depth, {});
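
The naming rule in this hunk prefers the alias and only falls back to a generated `_data<N>` name when no alias is given. A minimal sketch of that rule is shown below; `chooseExternalTableName` and its parameters are hypothetical, and a `std::set` stands in for the real collection of external tables.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Returns the alias when present. Otherwise generates "_data<N>", advancing
// next_id until the generated name is not already taken.
std::string chooseExternalTableName(
    const std::string & alias, const std::set<std::string> & existing, std::size_t & next_id)
{
    if (!alias.empty())
        return alias;

    std::string name = "_data" + std::to_string(next_id);
    while (existing.count(name))
    {
        ++next_id;
        name = "_data" + std::to_string(next_id);
    }
    return name;
}
```

Note that, as in the diff, an alias is used as-is and is not checked against the existing names in this sketch.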

View File

@ -65,10 +65,7 @@ Block InterpreterInsertQuery::getSampleBlock(const ASTInsertQuery & query, const
/// If the query does not include information about columns
if (!query.columns)
{
/// Format Native ignores header and write blocks as is.
if (query.format == "Native")
return {};
else if (query.no_destination)
if (query.no_destination)
return table->getSampleBlockWithVirtuals();
else
return table_sample_non_materialized;

View File

@ -265,7 +265,7 @@ Block InterpreterKillQueryQuery::getSelectResult(const String & columns, const S
if (where_expression)
select_query += " WHERE " + queryToString(where_expression);
BlockIO block_io = executeQuery(select_query, context, true);
BlockIO block_io = executeQuery(select_query, context, true, QueryProcessingStage::Complete, false, false);
Block res = block_io.in->read();
if (res && block_io.in->read())

View File

@ -411,6 +411,7 @@ BlockInputStreams InterpreterSelectQuery::executeWithMultipleStreams()
QueryPipeline InterpreterSelectQuery::executeWithProcessors()
{
QueryPipeline query_pipeline;
query_pipeline.setMaxThreads(context.getSettingsRef().max_threads);
executeImpl(query_pipeline, input);
return query_pipeline;
}
@ -1635,6 +1636,9 @@ void InterpreterSelectQuery::executeFetchColumns(
if constexpr (pipeline_with_processors)
{
if (streams.size() == 1)
pipeline.setMaxThreads(streams.size());
/// Unify streams. They must have same headers.
if (streams.size() > 1)
{
@ -1657,6 +1661,9 @@ void InterpreterSelectQuery::executeFetchColumns(
Processors sources;
sources.reserve(streams.size());
/// Pin sources for merge tree tables.
bool pin_sources = dynamic_cast<const MergeTreeData *>(storage.get()) != nullptr;
for (auto & stream : streams)
{
bool force_add_agg_info = processing_stage == QueryProcessingStage::WithMergeableState;
@ -1665,8 +1672,10 @@ void InterpreterSelectQuery::executeFetchColumns(
if (processing_stage == QueryProcessingStage::Complete)
source->addTotalsPort();
sources.emplace_back(std::move(source));
if (pin_sources)
source->setStream(sources.size());
sources.emplace_back(std::move(source));
}
pipeline.init(std::move(sources));
@ -1822,9 +1831,10 @@ void InterpreterSelectQuery::executeAggregation(QueryPipeline & pipeline, const
/// If there are several sources, then we perform parallel aggregation
if (pipeline.getNumMainStreams() > 1)
{
pipeline.resize(max_streams);
/// Add resize transform to uniformly distribute data between aggregating streams.
pipeline.resize(pipeline.getNumMainStreams(), true);
auto many_data = std::make_shared<ManyAggregatedData>(max_streams);
auto many_data = std::make_shared<ManyAggregatedData>(pipeline.getNumMainStreams());
auto merge_threads = settings.aggregation_memory_efficient_merge_threads
? static_cast<size_t>(settings.aggregation_memory_efficient_merge_threads)
: static_cast<size_t>(settings.max_threads);

View File

@ -49,6 +49,7 @@ namespace ActionLocks
extern StorageActionBlockType ReplicationQueue;
extern StorageActionBlockType DistributedSend;
extern StorageActionBlockType PartsTTLMerge;
extern StorageActionBlockType PartsMove;
}
@ -189,6 +190,12 @@ BlockIO InterpreterSystemQuery::execute()
case Type::START_TTL_MERGES:
startStopAction(context, query, ActionLocks::PartsTTLMerge, true);
break;
case Type::STOP_MOVES:
startStopAction(context, query, ActionLocks::PartsMove, false);
break;
case Type::START_MOVES:
startStopAction(context, query, ActionLocks::PartsMove, true);
break;
case Type::STOP_FETCHES:
startStopAction(context, query, ActionLocks::PartsFetch, false);
break;

View File

@ -27,7 +27,9 @@ Block PartLogElement::createBlock()
{"DownloadPart", static_cast<Int8>(DOWNLOAD_PART)},
{"RemovePart", static_cast<Int8>(REMOVE_PART)},
{"MutatePart", static_cast<Int8>(MUTATE_PART)},
});
{"MovePart", static_cast<Int8>(MOVE_PART)},
}
);
return
{
@ -40,6 +42,7 @@ Block PartLogElement::createBlock()
{ColumnString::create(), std::make_shared<DataTypeString>(), "table"},
{ColumnString::create(), std::make_shared<DataTypeString>(), "part_name"},
{ColumnString::create(), std::make_shared<DataTypeString>(), "partition_id"},
{ColumnString::create(), std::make_shared<DataTypeString>(), "path_on_disk"},
{ColumnUInt64::create(), std::make_shared<DataTypeUInt64>(), "rows"},
{ColumnUInt64::create(), std::make_shared<DataTypeUInt64>(), "size_in_bytes"}, // On disk
@ -71,6 +74,7 @@ void PartLogElement::appendToBlock(Block & block) const
columns[i++]->insert(table_name);
columns[i++]->insert(part_name);
columns[i++]->insert(partition_id);
columns[i++]->insert(path_on_disk);
columns[i++]->insert(rows);
columns[i++]->insert(bytes_compressed_on_disk);
@ -124,6 +128,7 @@ bool PartLog::addNewParts(Context & current_context, const PartLog::MutableDataP
elem.table_name = part->storage.getTableName();
elem.partition_id = part->info.partition_id;
elem.part_name = part->name;
elem.path_on_disk = part->getFullPath();
elem.bytes_compressed_on_disk = part->bytes_on_disk;
elem.rows = part->rows_count;

View File

@ -15,6 +15,7 @@ struct PartLogElement
DOWNLOAD_PART = 3,
REMOVE_PART = 4,
MUTATE_PART = 5,
MOVE_PART = 6,
};
Type event_type = NEW_PART;
@ -26,6 +27,7 @@ struct PartLogElement
String table_name;
String part_name;
String partition_id;
String path_on_disk;
/// Size of the part
UInt64 rows = 0;

View File

@ -38,6 +38,7 @@
#include <Processors/Transforms/LimitsCheckingTransform.h>
#include <Processors/Transforms/MaterializingTransform.h>
#include <Processors/Formats/IOutputFormat.h>
#include <Parsers/ASTWatchQuery.h>
namespace ProfileEvents
{
@ -181,7 +182,8 @@ static std::tuple<ASTPtr, BlockIO> executeQueryImpl(
bool internal,
QueryProcessingStage::Enum stage,
bool has_query_tail,
ReadBuffer * istr)
ReadBuffer * istr,
bool allow_processors)
{
time_t current_time = time(nullptr);
@ -299,7 +301,7 @@ static std::tuple<ASTPtr, BlockIO> executeQueryImpl(
context.resetInputCallbacks();
auto interpreter = InterpreterFactory::get(ast, context, stage);
bool use_processors = settings.experimental_use_processors && interpreter->canExecuteWithProcessors();
bool use_processors = settings.experimental_use_processors && allow_processors && interpreter->canExecuteWithProcessors();
if (use_processors)
pipeline = interpreter->executeWithProcessors();
@ -548,11 +550,12 @@ BlockIO executeQuery(
Context & context,
bool internal,
QueryProcessingStage::Enum stage,
bool may_have_embedded_data)
bool may_have_embedded_data,
bool allow_processors)
{
BlockIO streams;
std::tie(std::ignore, streams) = executeQueryImpl(query.data(), query.data() + query.size(), context,
internal, stage, !may_have_embedded_data, nullptr);
internal, stage, !may_have_embedded_data, nullptr, allow_processors);
return streams;
}
@ -603,7 +606,7 @@ void executeQuery(
ASTPtr ast;
BlockIO streams;
std::tie(ast, streams) = executeQueryImpl(begin, end, context, false, QueryProcessingStage::Complete, may_have_tail, &istr);
std::tie(ast, streams) = executeQueryImpl(begin, end, context, false, QueryProcessingStage::Complete, may_have_tail, &istr, true);
auto & pipeline = streams.pipeline;
@ -658,7 +661,19 @@ void executeQuery(
if (set_query_id)
set_query_id(context.getClientInfo().current_query_id);
copyData(*streams.in, *out);
if (ast->as<ASTWatchQuery>())
{
/// For Watch query, flush data if block is empty (to send data to client).
auto flush_callback = [&out](const Block & block)
{
if (block.rows() == 0)
out->flush();
};
copyData(*streams.in, *out, [](){ return false; }, std::move(flush_callback));
}
else
copyData(*streams.in, *out);
}
if (pipeline.initialized())

View File

@ -43,7 +43,8 @@ BlockIO executeQuery(
Context & context, /// DB, tables, data types, storage engines, functions, aggregate functions...
bool internal = false, /// If true, this query is caused by another query and thus needn't be registered in the ProcessList.
QueryProcessingStage::Enum stage = QueryProcessingStage::Complete, /// To which stage the query must be executed.
bool may_have_embedded_data = false /// If insert query may have embedded data
bool may_have_embedded_data = false, /// If insert query may have embedded data
bool allow_processors = true /// If true, the processors pipeline may be used
);

View File

@ -1,5 +1,6 @@
#include <Parsers/ASTAlterQuery.h>
#include <iomanip>
#include <IO/WriteHelpers.h>
namespace DB
@ -167,31 +168,37 @@ void ASTAlterCommand::formatImpl(
<< (part ? "PART " : "PARTITION ") << (settings.hilite ? hilite_none : "");
partition->formatImpl(settings, state, frame);
}
else if (type == ASTAlterCommand::REPLACE_PARTITION)
{
settings.ostr << (settings.hilite ? hilite_keyword : "") << indent_str << (replace ? "REPLACE" : "ATTACH") << " PARTITION "
<< (settings.hilite ? hilite_none : "");
partition->formatImpl(settings, state, frame);
settings.ostr << (settings.hilite ? hilite_keyword : "") << " FROM " << (settings.hilite ? hilite_none : "");
if (!from_database.empty())
{
settings.ostr << (settings.hilite ? hilite_identifier : "") << backQuoteIfNeed(from_database)
<< (settings.hilite ? hilite_none : "") << ".";
}
settings.ostr << (settings.hilite ? hilite_identifier : "") << backQuoteIfNeed(from_table) << (settings.hilite ? hilite_none : "");
}
else if (type == ASTAlterCommand::MOVE_PARTITION)
{
settings.ostr << (settings.hilite ? hilite_keyword : "") << indent_str << "MOVE PARTITION "
<< (settings.hilite ? hilite_none : "");
settings.ostr << (settings.hilite ? hilite_keyword : "") << indent_str << "MOVE "
<< (part ? "PART " : "PARTITION ") << (settings.hilite ? hilite_none : "");
partition->formatImpl(settings, state, frame);
settings.ostr << (settings.hilite ? hilite_keyword : "") << " TO " << (settings.hilite ? hilite_none : "");
if (!to_database.empty())
settings.ostr << " TO ";
switch (move_destination_type)
{
settings.ostr << (settings.hilite ? hilite_identifier : "") << backQuoteIfNeed(to_database)
<< (settings.hilite ? hilite_none : "") << ".";
case MoveDestinationType::DISK:
settings.ostr << "DISK ";
break;
case MoveDestinationType::VOLUME:
settings.ostr << "VOLUME ";
break;
case MoveDestinationType::PARTITION:
if (!to_database.empty())
{
settings.ostr << (settings.hilite ? hilite_identifier : "") << backQuoteIfNeed(to_database)
<< (settings.hilite ? hilite_none : "") << ".";
}
settings.ostr << (settings.hilite ? hilite_identifier : "")
<< backQuoteIfNeed(to_table)
<< (settings.hilite ? hilite_none : "");
return;
}
if (move_destination_type != MoveDestinationType::PARTITION)
{
WriteBufferFromOwnString move_destination_name_buf;
writeQuoted(move_destination_name, move_destination_name_buf);
settings.ostr << move_destination_name_buf.str();
}
settings.ostr << (settings.hilite ? hilite_identifier : "") << backQuoteIfNeed(to_table) << (settings.hilite ? hilite_none : "");
}
else if (type == ASTAlterCommand::FETCH_PARTITION)
{

View File

@ -42,8 +42,8 @@ public:
DROP_PARTITION,
DROP_DETACHED_PARTITION,
ATTACH_PARTITION,
REPLACE_PARTITION,
MOVE_PARTITION,
REPLACE_PARTITION,
FETCH_PARTITION,
FREEZE_PARTITION,
FREEZE_ALL,
@ -118,7 +118,7 @@ public:
bool detach = false; /// true for DETACH PARTITION
bool part = false; /// true for ATTACH PART and DROP DETACHED PART
bool part = false; /// true for ATTACH PART, DROP DETACHED PART and MOVE
bool clear_column = false; /// for CLEAR COLUMN (do not drop column from metadata)
@ -128,6 +128,17 @@ public:
bool if_exists = false; /// option for DROP_COLUMN, MODIFY_COLUMN, COMMENT_COLUMN
enum MoveDestinationType
{
DISK,
VOLUME,
PARTITION,
};
MoveDestinationType move_destination_type;
String move_destination_name;
/** For FETCH PARTITION - the path in ZK to the shard, from which to download the partition.
*/
String from;

View File

@ -59,6 +59,10 @@ const char * ASTSystemQuery::typeToString(Type type)
return "STOP TTL MERGES";
case Type::START_TTL_MERGES:
return "START TTL MERGES";
case Type::STOP_MOVES:
return "STOP MOVES";
case Type::START_MOVES:
return "START MOVES";
case Type::STOP_FETCHES:
return "STOP FETCHES";
case Type::START_FETCHES:
@ -106,6 +110,8 @@ void ASTSystemQuery::formatImpl(const FormatSettings & settings, FormatState &,
|| type == Type::START_MERGES
|| type == Type::STOP_TTL_MERGES
|| type == Type::START_TTL_MERGES
|| type == Type::STOP_MOVES
|| type == Type::START_MOVES
|| type == Type::STOP_FETCHES
|| type == Type::START_FETCHES
|| type == Type::STOP_REPLICATED_SENDS

View File

@ -37,6 +37,8 @@ public:
START_TTL_MERGES,
STOP_FETCHES,
START_FETCHES,
STOP_MOVES,
START_MOVES,
STOP_REPLICATED_SENDS,
START_REPLICATED_SENDS,
STOP_REPLICATION_QUEUES,

View File

@ -49,12 +49,13 @@ bool ParserAlterCommand::parseImpl(Pos & pos, ASTPtr & node, Expected & expected
ParserKeyword s_attach_partition("ATTACH PARTITION");
ParserKeyword s_detach_partition("DETACH PARTITION");
ParserKeyword s_drop_partition("DROP PARTITION");
ParserKeyword s_move_partition("MOVE PARTITION");
ParserKeyword s_drop_detached_partition("DROP DETACHED PARTITION");
ParserKeyword s_drop_detached_part("DROP DETACHED PART");
ParserKeyword s_attach_part("ATTACH PART");
ParserKeyword s_move_part("MOVE PART");
ParserKeyword s_fetch_partition("FETCH PARTITION");
ParserKeyword s_replace_partition("REPLACE PARTITION");
ParserKeyword s_move_partition("MOVE PARTITION");
ParserKeyword s_freeze("FREEZE");
ParserKeyword s_partition("PARTITION");
@ -67,6 +68,9 @@ bool ParserAlterCommand::parseImpl(Pos & pos, ASTPtr & node, Expected & expected
ParserKeyword s_with("WITH");
ParserKeyword s_name("NAME");
ParserKeyword s_to_disk("TO DISK");
ParserKeyword s_to_volume("TO VOLUME");
ParserKeyword s_delete_where("DELETE WHERE");
ParserKeyword s_update("UPDATE");
ParserKeyword s_where("WHERE");
@ -224,17 +228,58 @@ bool ParserAlterCommand::parseImpl(Pos & pos, ASTPtr & node, Expected & expected
return false;
}
}
else if (s_move_part.ignore(pos, expected))
{
if (!parser_string_literal.parse(pos, command->partition, expected))
return false;
command->type = ASTAlterCommand::MOVE_PARTITION;
command->part = true;
if (s_to_disk.ignore(pos))
command->move_destination_type = ASTAlterCommand::MoveDestinationType::DISK;
else if (s_to_volume.ignore(pos))
command->move_destination_type = ASTAlterCommand::MoveDestinationType::VOLUME;
else if (s_to.ignore(pos) && parseDatabaseAndTableName(pos, expected, command->to_database, command->to_table))
command->move_destination_type = ASTAlterCommand::MoveDestinationType::PARTITION;
else
return false;
if (command->move_destination_type != ASTAlterCommand::MoveDestinationType::PARTITION)
{
ASTPtr ast_space_name;
if (!parser_string_literal.parse(pos, ast_space_name, expected))
return false;
command->move_destination_name = ast_space_name->as<ASTLiteral &>().value.get<const String &>();
}
}
else if (s_move_partition.ignore(pos, expected))
{
if (!parser_partition.parse(pos, command->partition, expected))
return false;
if (!s_to.ignore(pos, expected))
command->type = ASTAlterCommand::MOVE_PARTITION;
if (s_to_disk.ignore(pos))
command->move_destination_type = ASTAlterCommand::MoveDestinationType::DISK;
else if (s_to_volume.ignore(pos))
command->move_destination_type = ASTAlterCommand::MoveDestinationType::VOLUME;
else if (s_to.ignore(pos) && parseDatabaseAndTableName(pos, expected, command->to_database, command->to_table))
{
command->move_destination_type = ASTAlterCommand::MoveDestinationType::PARTITION;
}
else
return false;
if (!parseDatabaseAndTableName(pos, expected, command->to_database, command->to_table))
return false;
command->type = ASTAlterCommand::MOVE_PARTITION;
if (command->move_destination_type != ASTAlterCommand::MoveDestinationType::PARTITION)
{
ASTPtr ast_space_name;
if (!parser_string_literal.parse(pos, ast_space_name, expected))
return false;
command->move_destination_name = ast_space_name->as<ASTLiteral &>().value.get<const String &>();
}
}
else if (s_add_constraint.ignore(pos, expected))
{

View File

@ -58,6 +58,8 @@ bool ParserSystemQuery::parseImpl(IParser::Pos & pos, ASTPtr & node, Expected &
case Type::START_MERGES:
case Type::STOP_TTL_MERGES:
case Type::START_TTL_MERGES:
case Type::STOP_MOVES:
case Type::START_MOVES:
case Type::STOP_FETCHES:
case Type::START_FETCHES:
case Type::STOP_REPLICATED_SENDS:

View File

@ -9,6 +9,7 @@
#include <boost/lockfree/queue.hpp>
#include <Common/Stopwatch.h>
#include <Processors/ISource.h>
namespace DB
{
@ -155,9 +156,9 @@ void PipelineExecutor::addJob(ExecutionState * execution_state)
{
try
{
Stopwatch watch;
// Stopwatch watch;
executeJob(execution_state->processor);
execution_state->execution_time_ns += watch.elapsed();
// execution_state->execution_time_ns += watch.elapsed();
++execution_state->num_executed_jobs;
}
@ -388,7 +389,17 @@ void PipelineExecutor::finish()
finished = true;
}
task_queue_condvar.notify_all();
std::lock_guard guard(executor_contexts_mutex);
for (auto & context : executor_contexts)
{
{
std::lock_guard lock(context->mutex);
context->wake_flag = true;
}
context->condvar.notify_one();
}
}
void PipelineExecutor::execute(size_t num_threads)
@ -461,38 +472,122 @@ void PipelineExecutor::executeSingleThread(size_t thread_num, size_t num_threads
}
};
auto wake_up_executor = [&](size_t executor)
{
std::lock_guard guard(executor_contexts[executor]->mutex);
executor_contexts[executor]->wake_flag = true;
executor_contexts[executor]->condvar.notify_one();
};
auto process_pinned_tasks = [&](Queue & queue)
{
Queue tmp_queue;
struct PinnedTask
{
ExecutionState * task;
size_t thread_num;
};
std::stack<PinnedTask> pinned_tasks;
while (!queue.empty())
{
auto task = queue.front();
queue.pop();
auto stream = task->processor->getStream();
if (stream != IProcessor::NO_STREAM)
pinned_tasks.push({.task = task, .thread_num = stream % num_threads});
else
tmp_queue.push(task);
}
if (!pinned_tasks.empty())
{
std::stack<size_t> threads_to_wake;
{
std::lock_guard lock(task_queue_mutex);
while (!pinned_tasks.empty())
{
auto & pinned_task = pinned_tasks.top();
auto thread = pinned_task.thread_num;
executor_contexts[thread]->pinned_tasks.push(pinned_task.task);
pinned_tasks.pop();
if (threads_queue.has(thread))
{
threads_queue.pop(thread);
threads_to_wake.push(thread);
}
}
}
while (!threads_to_wake.empty())
{
wake_up_executor(threads_to_wake.top());
threads_to_wake.pop();
}
}
queue.swap(tmp_queue);
};
while (!finished)
{
/// First, find any processor to execute.
/// Just traverse the graph and prepare any processor.
while (!finished)
{
std::unique_lock lock(task_queue_mutex);
if (!task_queue.empty())
{
state = task_queue.front();
task_queue.pop();
break;
std::unique_lock lock(task_queue_mutex);
if (!executor_contexts[thread_num]->pinned_tasks.empty())
{
state = executor_contexts[thread_num]->pinned_tasks.front();
executor_contexts[thread_num]->pinned_tasks.pop();
break;
}
if (!task_queue.empty())
{
state = task_queue.front();
task_queue.pop();
if (!task_queue.empty() && !threads_queue.empty())
{
auto thread_to_wake = threads_queue.pop_any();
lock.unlock();
wake_up_executor(thread_to_wake);
}
break;
}
if (threads_queue.size() + 1 == num_threads)
{
lock.unlock();
finish();
break;
}
threads_queue.push(thread_num);
}
++num_waiting_threads;
if (num_waiting_threads == num_threads)
{
finished = true;
lock.unlock();
task_queue_condvar.notify_all();
break;
std::unique_lock lock(executor_contexts[thread_num]->mutex);
executor_contexts[thread_num]->condvar.wait(lock, [&]
{
return finished || executor_contexts[thread_num]->wake_flag;
});
executor_contexts[thread_num]->wake_flag = false;
}
task_queue_condvar.wait(lock, [&]()
{
return finished || !task_queue.empty();
});
--num_waiting_threads;
}
if (finished)
@ -506,9 +601,9 @@ void PipelineExecutor::executeSingleThread(size_t thread_num, size_t num_threads
addJob(state);
{
Stopwatch execution_time_watch;
// Stopwatch execution_time_watch;
state->job();
execution_time_ns += execution_time_watch.elapsed();
// execution_time_ns += execution_time_watch.elapsed();
}
if (state->exception)
@ -517,7 +612,7 @@ void PipelineExecutor::executeSingleThread(size_t thread_num, size_t num_threads
if (finished)
break;
Stopwatch processing_time_watch;
// Stopwatch processing_time_watch;
/// Try to execute neighbour processor.
{
@ -535,7 +630,9 @@ void PipelineExecutor::executeSingleThread(size_t thread_num, size_t num_threads
/// Process all neighbours. Children will be on the top of stack, then parents.
prepare_all_processors(queue, children, children, parents);
process_pinned_tasks(queue);
/// Take a local task from the queue if there is one.
if (!state && !queue.empty())
{
state = queue.front();
@ -543,10 +640,25 @@ void PipelineExecutor::executeSingleThread(size_t thread_num, size_t num_threads
}
prepare_all_processors(queue, parents, parents, parents);
process_pinned_tasks(queue);
/// Take a pinned task if there is one.
{
std::lock_guard guard(task_queue_mutex);
if (!executor_contexts[thread_num]->pinned_tasks.empty())
{
if (state)
queue.push(state);
state = executor_contexts[thread_num]->pinned_tasks.front();
executor_contexts[thread_num]->pinned_tasks.pop();
}
}
/// Push other tasks to global queue.
if (!queue.empty())
{
std::lock_guard lock(task_queue_mutex);
std::unique_lock lock(task_queue_mutex);
while (!queue.empty() && !finished)
{
@ -554,7 +666,12 @@ void PipelineExecutor::executeSingleThread(size_t thread_num, size_t num_threads
queue.pop();
}
task_queue_condvar.notify_all();
if (!threads_queue.empty())
{
auto thread_to_wake = threads_queue.pop_any();
lock.unlock();
wake_up_executor(thread_to_wake);
}
}
--num_processing_executors;
@ -562,7 +679,7 @@ void PipelineExecutor::executeSingleThread(size_t thread_num, size_t num_threads
doExpandPipeline(task, false);
}
processing_time_ns += processing_time_watch.elapsed();
// processing_time_ns += processing_time_watch.elapsed();
}
}
@ -580,9 +697,15 @@ void PipelineExecutor::executeImpl(size_t num_threads)
{
Stack stack;
executor_contexts.reserve(num_threads);
for (size_t i = 0; i < num_threads; ++i)
executor_contexts.emplace_back(std::make_unique<ExecutorContext>());
threads_queue.init(num_threads);
{
std::lock_guard guard(executor_contexts_mutex);
executor_contexts.reserve(num_threads);
for (size_t i = 0; i < num_threads; ++i)
executor_contexts.emplace_back(std::make_unique<ExecutorContext>());
}
auto thread_group = CurrentThread::getGroup();
@ -598,7 +721,8 @@ void PipelineExecutor::executeImpl(size_t num_threads)
finish();
for (auto & thread : threads)
thread.join();
if (thread.joinable())
thread.join();
}
);
@ -639,7 +763,8 @@ void PipelineExecutor::executeImpl(size_t num_threads)
}
for (auto & thread : threads)
thread.join();
if (thread.joinable())
thread.join();
finished_flag = true;
}
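The `process_pinned_tasks` lambda above routes every task whose processor has a stream number to the executor thread `stream % num_threads`, and leaves the remaining tasks in the local queue. A minimal single-threaded sketch of that routing rule follows. `TaskSketch` and `routePinnedTasks` are hypothetical names, the locking and thread wake-up from the diff are left out, and `pinned_per_thread` is assumed to be non-empty.

```cpp
#include <cstddef>
#include <limits>
#include <queue>
#include <vector>

/// Hypothetical, simplified task: only the stream number matters here.
struct TaskSketch
{
    int id;
    std::size_t stream;
};

/// Same sentinel idea as IProcessor::NO_STREAM in the diff.
constexpr std::size_t NO_STREAM = std::numeric_limits<std::size_t>::max();

/// Move tasks that carry a stream number into the queue of thread
/// `stream % num_threads`; keep the rest in `queue` in their original order.
/// Assumes pinned_per_thread is not empty.
void routePinnedTasks(std::queue<TaskSketch> & queue,
                      std::vector<std::queue<TaskSketch>> & pinned_per_thread)
{
    const std::size_t num_threads = pinned_per_thread.size();
    std::queue<TaskSketch> unpinned;

    while (!queue.empty())
    {
        TaskSketch task = queue.front();
        queue.pop();

        if (task.stream != NO_STREAM)
            pinned_per_thread[task.stream % num_threads].push(task);
        else
            unpinned.push(task);
    }

    queue.swap(unpinned);
}
```

In the diff the pinned tasks are first collected in a `std::stack` and pushed under `task_queue_mutex`, so their relative order differs from this sketch, which keeps FIFO order for simplicity.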

View File

@ -1,14 +1,14 @@
#pragma once
#include <queue>
#include <stack>
#include <Processors/IProcessor.h>
#include <mutex>
#include <Processors/Executors/ThreadsQueue.h>
#include <Common/ThreadPool.h>
#include <Common/EventCounter.h>
#include <common/logger_useful.h>
#include <boost/lockfree/stack.hpp>
#include <queue>
#include <stack>
#include <mutex>
namespace DB
{
@ -122,17 +122,15 @@ private:
/// Queue with pointers to tasks. Each thread will concurrently read from it until finished flag is set.
/// Stores processors that need to be prepared. The preparing status is already set for them.
TaskQueue task_queue;
ThreadsQueue threads_queue;
std::mutex task_queue_mutex;
std::condition_variable task_queue_condvar;
std::atomic_bool cancelled;
std::atomic_bool finished;
Poco::Logger * log = &Poco::Logger::get("PipelineExecutor");
/// Num threads waiting condvar. Last thread finish execution if task_queue is empty.
size_t num_waiting_threads = 0;
/// Things to stop execution to expand pipeline.
struct ExpandPipelineTask
{
@ -155,9 +153,16 @@ private:
/// Will store context for all expand pipeline tasks (it's easy and we don't expect many).
/// This can be solved by using atomic shared ptr.
std::list<ExpandPipelineTask> task_list;
std::condition_variable condvar;
std::mutex mutex;
bool wake_flag = false;
std::queue<ExecutionState *> pinned_tasks;
};
std::vector<std::unique_ptr<ExecutorContext>> executor_contexts;
std::mutex executor_contexts_mutex;
/// Processor ptr -> node number
using ProcessorsMap = std::unordered_map<const IProcessor *, UInt64>;
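The `condvar`, `mutex` and `wake_flag` fields added to `ExecutorContext` replace the shared `task_queue_condvar` with one wake-up slot per executor thread. The sketch below shows that pattern in isolation. `WakeSlot` and `demoWake` are hypothetical names, and the `finished` check from the diff's wait predicate is omitted.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

/// Per-thread wake-up slot, modelled on the ExecutorContext fields in the diff.
struct WakeSlot
{
    std::mutex mutex;
    std::condition_variable condvar;
    bool wake_flag = false;

    /// Called by another thread: set the flag under the mutex, then notify.
    void wake()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            wake_flag = true;
        }
        condvar.notify_one();
    }

    /// Called by the owning thread: block until woken, then reset the flag.
    /// The predicate keeps the wait correct if wake() ran before the wait
    /// started or if the condition variable wakes up spuriously.
    void waitAndReset()
    {
        std::unique_lock<std::mutex> lock(mutex);
        condvar.wait(lock, [this] { return wake_flag; });
        wake_flag = false;
    }
};

/// Returns true once a worker blocked in waitAndReset() has been released.
bool demoWake()
{
    WakeSlot slot;
    bool resumed = false;
    std::thread worker([&] { slot.waitAndReset(); resumed = true; });
    slot.wake();
    worker.join();
    return resumed;
}
```

Because each thread waits on its own slot, a producer can wake exactly one chosen executor instead of calling `notify_all()` on a shared condition variable.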

View File

@ -0,0 +1,70 @@
#pragma once
namespace DB
{
/// Simple struct which stores threads with numbers [0 .. num_threads - 1].
/// Allows pushing and popping a specified thread, or popping any thread if there is one.
/// All operations (except init) are O(1). No memory allocations happen after init.
struct ThreadsQueue
{
void init(size_t num_threads)
{
stack_size = 0;
stack.clear();
thread_pos_in_stack.clear();
stack.reserve(num_threads);
thread_pos_in_stack.reserve(num_threads);
for (size_t thread = 0; thread < num_threads; ++thread)
{
stack.emplace_back(thread);
thread_pos_in_stack.emplace_back(thread);
}
}
bool has(size_t thread) const { return thread_pos_in_stack[thread] < stack_size; }
size_t size() const { return stack_size; }
bool empty() const { return stack_size == 0; }
void push(size_t thread)
{
if (unlikely(has(thread)))
throw Exception("Can't push thread because it is already in threads queue.", ErrorCodes::LOGICAL_ERROR);
swap_threads(thread, stack[stack_size]);
++stack_size;
}
void pop(size_t thread)
{
if (unlikely(!has(thread)))
throw Exception("Can't pop thread because it is not in threads queue.", ErrorCodes::LOGICAL_ERROR);
--stack_size;
swap_threads(thread, stack[stack_size]);
}
size_t pop_any()
{
if (unlikely(stack_size == 0))
throw Exception("Can't pop from empty queue.", ErrorCodes::LOGICAL_ERROR);
--stack_size;
return stack[stack_size];
}
private:
std::vector<size_t> stack;
std::vector<size_t> thread_pos_in_stack;
size_t stack_size = 0;
void swap_threads(size_t first, size_t second)
{
std::swap(thread_pos_in_stack[first], thread_pos_in_stack[second]);
std::swap(stack[thread_pos_in_stack[first]], stack[thread_pos_in_stack[second]]);
}
};
}
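The `ThreadsQueue` struct above keeps two arrays that are inverse permutations of each other, so a thread can be added, removed by number, or popped arbitrarily in constant time. A standalone sketch of the same idea follows. `ThreadSetSketch` is a hypothetical name, and `std::logic_error` stands in for `DB::Exception` so that the example compiles on its own.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

/// Set of thread numbers in [0, n) with O(1) push, pop-by-number and pop-any.
class ThreadSetSketch
{
public:
    explicit ThreadSetSketch(std::size_t num_threads)
        : stack(num_threads), pos(num_threads)
    {
        for (std::size_t t = 0; t < num_threads; ++t)
            stack[t] = pos[t] = t;
    }

    bool has(std::size_t thread) const { return pos[thread] < size_; }
    std::size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }

    void push(std::size_t thread)
    {
        if (has(thread))
            throw std::logic_error("thread is already in the set");
        moveTo(thread, size_);
        ++size_;
    }

    void pop(std::size_t thread)
    {
        if (!has(thread))
            throw std::logic_error("thread is not in the set");
        --size_;
        moveTo(thread, size_);
    }

    std::size_t popAny()
    {
        if (empty())
            throw std::logic_error("set is empty");
        --size_;
        return stack[size_];
    }

private:
    /// Swap `thread` with whatever currently occupies `slot`, keeping
    /// `stack` and `pos` inverse permutations of each other.
    void moveTo(std::size_t thread, std::size_t slot)
    {
        std::size_t other = stack[slot];
        std::swap(stack[pos[thread]], stack[slot]);
        std::swap(pos[thread], pos[other]);
    }

    std::vector<std::size_t> stack; /// stack[0 .. size_) holds the members
    std::vector<std::size_t> pos;   /// pos[t] is the index of t in `stack`
    std::size_t size_ = 0;
};
```

Membership is the single comparison `pos[t] < size_`, which is why no per-thread flag and no allocation are needed after construction.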

View File

@ -1,3 +1,5 @@
#include <Common/config.h>
#if USE_POCO_NETSSL
#include <Processors/Formats/Impl/MySQLOutputFormat.h>
#include <Core/MySQLProtocol.h>
@ -116,3 +118,4 @@ void registerOutputFormatProcessorMySQLWrite(FormatFactory & factory)
}
}
#endif

View File

@ -1,4 +1,6 @@
#pragma once
#include <Common/config.h>
#if USE_POCO_NETSSL
#include <Processors/Formats/IRowOutputFormat.h>
#include <Core/Block.h>
@ -40,3 +42,4 @@ private:
};
}
#endif

View File

@ -183,6 +183,11 @@ public:
throw Exception("Method 'work' is not implemented for " + getName() + " processor", ErrorCodes::NOT_IMPLEMENTED);
}
virtual void work(size_t /*thread_num*/)
{
work();
}
/** You may call this method if 'prepare' returned Async.
* This method cannot access any ports. It should use only data that was prepared by 'prepare' method.
*
@ -228,10 +233,17 @@ public:
void setDescription(const std::string & description_) { processor_description = description_; }
const std::string & getDescription() const { return processor_description; }
/// Helpers for pipeline executor.
void setStream(size_t value) { stream_number = value; }
size_t getStream() const { return stream_number; }
constexpr static size_t NO_STREAM = std::numeric_limits<size_t>::max();
private:
std::atomic<bool> is_cancelled{false};
std::string processor_description;
size_t stream_number = NO_STREAM;
};

View File

@ -75,6 +75,7 @@ void QueryPipeline::init(Processors sources)
totals.emplace_back(&source->getOutputs().back());
}
/// source->setStream(streams.size());
streams.emplace_back(&source->getOutputs().front());
processors.emplace_back(std::move(source));
}
@ -115,7 +116,7 @@ void QueryPipeline::addSimpleTransformImpl(const TProcessorGetter & getter)
Block header;
auto add_transform = [&](OutputPort *& stream, StreamType stream_type)
auto add_transform = [&](OutputPort *& stream, StreamType stream_type, size_t stream_num [[maybe_unused]] = IProcessor::NO_STREAM)
{
if (!stream)
return;
@ -148,14 +149,17 @@ void QueryPipeline::addSimpleTransformImpl(const TProcessorGetter & getter)
if (transform)
{
// if (stream_type == StreamType::Main)
// transform->setStream(stream_num);
connect(*stream, transform->getInputs().front());
stream = &transform->getOutputs().front();
processors.emplace_back(std::move(transform));
}
};
for (auto & stream : streams)
add_transform(stream, StreamType::Main);
for (size_t stream_num = 0; stream_num < streams.size(); ++stream_num)
add_transform(streams[stream_num], StreamType::Main, stream_num);
add_transform(delayed_stream_port, StreamType::Main);
add_transform(totals_having_port, StreamType::Totals);
@ -247,12 +251,12 @@ void QueryPipeline::concatDelayedStream()
delayed_stream_port = nullptr;
}
void QueryPipeline::resize(size_t num_streams)
void QueryPipeline::resize(size_t num_streams, bool force)
{
checkInitialized();
concatDelayedStream();
if (num_streams == getNumStreams())
if (!force && num_streams == getNumStreams())
return;
has_resize = true;

View File

@ -58,7 +58,7 @@ public:
/// Check if resize transform was used. (In that case another distinct transform will be added).
bool hasMixedStreams() const { return has_resize || hasMoreThanOneStream(); }
void resize(size_t num_streams);
void resize(size_t num_streams, bool force = false);
void unitePipelines(std::vector<QueryPipeline> && pipelines, const Block & common_header, const Context & context);
@ -81,6 +81,9 @@ public:
/// Call after execution.
void finalize();
void setMaxThreads(size_t max_threads_) { max_threads = max_threads_; }
size_t getMaxThreads() const { return max_threads; }
private:
/// All added processors.
@ -106,6 +109,8 @@ private:
IOutputFormat * output_format = nullptr;
size_t max_threads = 0;
void checkInitialized();
void checkSource(const ProcessorPtr & source, bool can_have_totals);
void concatDelayedStream();

View File

@ -123,7 +123,9 @@ Chunk SourceFromInputStream::generate()
return {};
}
#ifndef NDEBUG
assertBlocksHaveEqualStructure(getPort().getHeader(), block, "SourceFromInputStream");
#endif
UInt64 num_rows = block.rows();
Chunk chunk(block.getColumns(), num_rows);

View File

@ -14,14 +14,30 @@ namespace ProfileEvents
namespace DB
{
/// Convert block to chunk.
/// Adds additional info about aggregation.
static Chunk convertToChunk(const Block & block)
{
auto info = std::make_shared<AggregatedChunkInfo>();
info->bucket_num = block.info.bucket_num;
info->is_overflows = block.info.is_overflows;
UInt64 num_rows = block.rows();
Chunk chunk(block.getColumns(), num_rows);
chunk.setChunkInfo(std::move(info));
return chunk;
}
namespace
{
/// Reads chunks from file in native format. Provide chunks with aggregation info.
class SourceFromNativeStream : public ISource
{
public:
SourceFromNativeStream(const Block & header, const std::string & path)
: ISource(header), file_in(path), compressed_in(file_in)
, block_in(std::make_shared<NativeBlockInputStream>(compressed_in, ClickHouseRevision::get()))
: ISource(header), file_in(path), compressed_in(file_in),
block_in(std::make_shared<NativeBlockInputStream>(compressed_in, ClickHouseRevision::get()))
{
block_in->readPrefix();
}
@ -41,15 +57,7 @@ namespace
return {};
}
auto info = std::make_shared<AggregatedChunkInfo>();
info->bucket_num = block.info.bucket_num;
info->is_overflows = block.info.is_overflows;
UInt64 num_rows = block.rows();
Chunk chunk(block.getColumns(), num_rows);
chunk.setChunkInfo(std::move(info));
return chunk;
return convertToChunk(block);
}
private:
@ -57,39 +65,331 @@ namespace
CompressedReadBuffer compressed_in;
BlockInputStreamPtr block_in;
};
}
class ConvertingAggregatedToBlocksTransform : public ISource
/// Worker which merges buckets for two-level aggregation.
/// Atomically increments bucket counter and returns merged result.
class ConvertingAggregatedToChunksSource : public ISource
{
public:
ConvertingAggregatedToChunksSource(
AggregatingTransformParamsPtr params_,
ManyAggregatedDataVariantsPtr data_,
Arena * arena_,
std::shared_ptr<std::atomic<UInt32>> next_bucket_to_merge_)
: ISource(params_->getHeader())
, params(std::move(params_))
, data(std::move(data_))
, next_bucket_to_merge(std::move(next_bucket_to_merge_))
, arena(arena_)
{}
String getName() const override { return "ConvertingAggregatedToChunksSource"; }
protected:
Chunk generate() override
{
public:
ConvertingAggregatedToBlocksTransform(Block header, AggregatingTransformParamsPtr params_, BlockInputStreamPtr stream_)
: ISource(std::move(header)), params(std::move(params_)), stream(std::move(stream_)) {}
UInt32 bucket_num = next_bucket_to_merge->fetch_add(1);
String getName() const override { return "ConvertingAggregatedToBlocksTransform"; }
if (bucket_num >= NUM_BUCKETS)
return {};
protected:
Chunk generate() override
Block block = params->aggregator.mergeAndConvertOneBucketToBlock(*data, arena, params->final, bucket_num);
return convertToChunk(block);
}
private:
AggregatingTransformParamsPtr params;
ManyAggregatedDataVariantsPtr data;
std::shared_ptr<std::atomic<UInt32>> next_bucket_to_merge;
Arena * arena;
static constexpr UInt32 NUM_BUCKETS = 256;
};
/// Generates chunks with aggregated data.
/// In single level case, aggregates data itself.
/// In two-level case, creates `ConvertingAggregatedToChunksSource` workers:
///
/// ConvertingAggregatedToChunksSource ->
/// ConvertingAggregatedToChunksSource -> ConvertingAggregatedToChunksTransform -> AggregatingTransform
/// ConvertingAggregatedToChunksSource ->
///
/// Result chunks are guaranteed to be sorted by bucket number.
class ConvertingAggregatedToChunksTransform : public IProcessor
{
public:
ConvertingAggregatedToChunksTransform(AggregatingTransformParamsPtr params_, ManyAggregatedDataVariantsPtr data_, size_t num_threads_)
: IProcessor({}, {params_->getHeader()})
, params(std::move(params_)), data(std::move(data_)), num_threads(num_threads_) {}
String getName() const override { return "ConvertingAggregatedToChunksTransform"; }
void work() override
{
if (data->empty())
{
auto block = stream->read();
if (!block)
return {};
auto info = std::make_shared<AggregatedChunkInfo>();
info->bucket_num = block.info.bucket_num;
info->is_overflows = block.info.is_overflows;
UInt64 num_rows = block.rows();
Chunk chunk(block.getColumns(), num_rows);
chunk.setChunkInfo(std::move(info));
return chunk;
finished = true;
return;
}
private:
/// Store params because aggregator must be destroyed after stream. Order is important.
AggregatingTransformParamsPtr params;
BlockInputStreamPtr stream;
};
}
if (!is_initialized)
{
initialize();
return;
}
if (data->at(0)->isTwoLevel())
{
/// In two-level case will only create sources.
if (inputs.empty())
createSources();
}
else
{
mergeSingleLevel();
}
}
Processors expandPipeline() override
{
for (auto & source : processors)
{
auto & out = source->getOutputs().front();
inputs.emplace_back(out.getHeader(), this);
connect(out, inputs.back());
}
return std::move(processors);
}
IProcessor::Status prepare() override
{
auto & output = outputs.front();
if (finished && !has_input)
{
output.finish();
return Status::Finished;
}
/// Check whether we can output.
if (output.isFinished())
return Status::Finished;
if (!output.canPush())
return Status::PortFull;
if (!is_initialized)
return Status::Ready;
if (!processors.empty())
return Status::ExpandPipeline;
if (has_input)
return preparePushToOutput();
/// Single level case.
if (inputs.empty())
return Status::Ready;
/// Two-level case.
return preparePullFromInputs();
}
private:
IProcessor::Status preparePushToOutput()
{
auto & output = outputs.front();
output.push(std::move(current_chunk));
has_input = false;
if (finished)
{
output.finish();
return Status::Finished;
}
return Status::PortFull;
}
/// Read all sources and try to push current bucket.
IProcessor::Status preparePullFromInputs()
{
bool all_inputs_are_finished = true;
for (auto & input : inputs)
{
if (input.isFinished())
continue;
all_inputs_are_finished = false;
input.setNeeded();
if (input.hasData())
ready_chunks.emplace_back(input.pull());
}
moveReadyChunksToMap();
if (trySetCurrentChunkFromCurrentBucket())
return preparePushToOutput();
if (all_inputs_are_finished)
throw Exception("All sources have finished before getting enough data in "
"ConvertingAggregatedToChunksTransform.", ErrorCodes::LOGICAL_ERROR);
return Status::NeedData;
}
private:
AggregatingTransformParamsPtr params;
ManyAggregatedDataVariantsPtr data;
size_t num_threads;
bool is_initialized = false;
bool has_input = false;
bool finished = false;
Chunk current_chunk;
Chunks ready_chunks;
UInt32 current_bucket_num = 0;
static constexpr Int32 NUM_BUCKETS = 256;
std::map<UInt32, Chunk> bucket_to_chunk;
Processors processors;
static Int32 getBucketFromChunk(const Chunk & chunk)
{
auto & info = chunk.getChunkInfo();
if (!info)
throw Exception("Chunk info was not set for chunk in "
"ConvertingAggregatedToChunksTransform.", ErrorCodes::LOGICAL_ERROR);
auto * agg_info = typeid_cast<const AggregatedChunkInfo *>(info.get());
if (!agg_info)
throw Exception("Chunk should have AggregatedChunkInfo in "
"ConvertingAggregatedToChunksTransform.", ErrorCodes::LOGICAL_ERROR);
return agg_info->bucket_num;
}
void moveReadyChunksToMap()
{
for (auto & chunk : ready_chunks)
{
auto bucket = getBucketFromChunk(chunk);
if (bucket < 0 || bucket >= NUM_BUCKETS)
throw Exception("Invalid bucket number " + toString(bucket) + " in "
"ConvertingAggregatedToChunksTransform.", ErrorCodes::LOGICAL_ERROR);
if (bucket_to_chunk.count(bucket))
throw Exception("Found several chunks with the same bucket number in "
"ConvertingAggregatedToChunksTransform.", ErrorCodes::LOGICAL_ERROR);
bucket_to_chunk[bucket] = std::move(chunk);
}
ready_chunks.clear();
}
void setCurrentChunk(Chunk chunk)
{
if (has_input)
throw Exception("Current chunk was already set in "
"ConvertingAggregatedToChunksTransform.", ErrorCodes::LOGICAL_ERROR);
has_input = true;
current_chunk = std::move(chunk);
}
void initialize()
{
is_initialized = true;
AggregatedDataVariantsPtr & first = data->at(0);
/// We need at least one arena per thread in the first data item.
if (num_threads > first->aggregates_pools.size())
{
Arenas & first_pool = first->aggregates_pools;
for (size_t j = first_pool.size(); j < num_threads; j++)
first_pool.emplace_back(std::make_shared<Arena>());
}
if (first->type == AggregatedDataVariants::Type::without_key || params->params.overflow_row)
{
params->aggregator.mergeWithoutKeyDataImpl(*data);
auto block = params->aggregator.prepareBlockAndFillWithoutKey(
*first, params->final, first->type != AggregatedDataVariants::Type::without_key);
setCurrentChunk(convertToChunk(block));
}
}
void mergeSingleLevel()
{
AggregatedDataVariantsPtr & first = data->at(0);
if (current_bucket_num > 0 || first->type == AggregatedDataVariants::Type::without_key)
{
finished = true;
return;
}
++current_bucket_num;
#define M(NAME) \
else if (first->type == AggregatedDataVariants::Type::NAME) \
params->aggregator.mergeSingleLevelDataImpl<decltype(first->NAME)::element_type>(*data);
if (false) {}
APPLY_FOR_VARIANTS_SINGLE_LEVEL(M)
#undef M
else
throw Exception("Unknown aggregated data variant.", ErrorCodes::UNKNOWN_AGGREGATED_DATA_VARIANT);
auto block = params->aggregator.prepareBlockAndFillSingleLevel(*first, params->final);
setCurrentChunk(convertToChunk(block));
finished = true;
}
void createSources()
{
AggregatedDataVariantsPtr & first = data->at(0);
auto next_bucket_to_merge = std::make_shared<std::atomic<UInt32>>(0);
for (size_t thread = 0; thread < num_threads; ++thread)
{
Arena * arena = first->aggregates_pools.at(thread).get();
auto source = std::make_shared<ConvertingAggregatedToChunksSource>(
params, data, arena, next_bucket_to_merge);
processors.emplace_back(std::move(source));
}
}
bool trySetCurrentChunkFromCurrentBucket()
{
auto it = bucket_to_chunk.find(current_bucket_num);
if (it != bucket_to_chunk.end())
{
setCurrentChunk(std::move(it->second));
++current_bucket_num;
if (current_bucket_num == NUM_BUCKETS)
finished = true;
return true;
}
return false;
}
};
AggregatingTransform::AggregatingTransform(Block header, AggregatingTransformParamsPtr params_)
: AggregatingTransform(std::move(header), std::move(params_)
@ -197,7 +497,9 @@ Processors AggregatingTransform::expandPipeline()
void AggregatingTransform::consume(Chunk chunk)
{
if (chunk.getNumRows() == 0 && params->params.empty_result_for_aggregation_by_empty_set)
UInt64 num_rows = chunk.getNumRows();
if (num_rows == 0 && params->params.empty_result_for_aggregation_by_empty_set)
return;
if (!is_consume_started)
@ -209,9 +511,7 @@ void AggregatingTransform::consume(Chunk chunk)
src_rows += chunk.getNumRows();
src_bytes += chunk.bytes();
auto block = getInputs().front().getHeader().cloneWithColumns(chunk.detachColumns());
if (!params->aggregator.executeOnBlock(block, variants, key_columns, aggregate_columns, no_more_keys))
if (!params->aggregator.executeOnBlock(chunk.detachColumns(), num_rows, variants, key_columns, aggregate_columns, no_more_keys))
is_consume_finished = true;
}
@ -249,8 +549,9 @@ void AggregatingTransform::initGenerate()
if (!params->aggregator.hasTemporaryFiles())
{
auto stream = params->aggregator.mergeAndConvertToBlocks(many_data->variants, params->final, max_threads);
processors.emplace_back(std::make_shared<ConvertingAggregatedToBlocksTransform>(stream->getHeader(), params, std::move(stream)));
auto prepared_data = params->aggregator.prepareVariantsToMerge(many_data->variants);
auto prepared_data_ptr = std::make_shared<ManyAggregatedDataVariants>(std::move(prepared_data));
processors.emplace_back(std::make_shared<ConvertingAggregatedToChunksTransform>(params, std::move(prepared_data_ptr), max_threads));
}
else
{
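In the two-level case, each `ConvertingAggregatedToChunksSource` claims the next bucket with `fetch_add(1)`, and `ConvertingAggregatedToChunksTransform` buffers finished buckets in a map so that they leave in bucket order. The sketch below reproduces those two ideas with plain threads and a mutex instead of processors and ports, which is an assumption made to keep it self-contained. `mergeBucketsInOrder` is a hypothetical name and `bucket * 10` is a placeholder for the real merge.

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

/// Workers claim bucket numbers with fetch_add; results are released strictly
/// in bucket order, and results that arrive early are buffered in a map.
std::vector<std::uint32_t> mergeBucketsInOrder(std::uint32_t num_buckets, unsigned num_workers)
{
    std::atomic<std::uint32_t> next_bucket{0};
    std::mutex mutex;
    std::map<std::uint32_t, std::uint32_t> ready;   /// bucket -> "merged" payload
    std::vector<std::uint32_t> output;
    std::uint32_t current_bucket = 0;

    auto worker = [&]
    {
        while (true)
        {
            std::uint32_t bucket = next_bucket.fetch_add(1);
            if (bucket >= num_buckets)
                return;

            std::uint32_t payload = bucket * 10;   /// placeholder for the real merge

            std::lock_guard<std::mutex> lock(mutex);
            ready.emplace(bucket, payload);

            /// Drain every buffered result that is next in line.
            for (auto it = ready.find(current_bucket); it != ready.end(); it = ready.find(current_bucket))
            {
                output.push_back(it->second);
                ready.erase(it);
                ++current_bucket;
            }
        }
    };

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < num_workers; ++i)
        threads.emplace_back(worker);
    for (auto & thread : threads)
        thread.join();

    return output;
}
```

The atomic counter guarantees that every bucket is merged exactly once without a shared work queue, and the map restores the ordering that parallel completion would otherwise lose.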

View File

@ -33,12 +33,16 @@ struct AggregatingTransformParams
struct ManyAggregatedData
{
ManyAggregatedDataVariants variants;
std::vector<std::unique_ptr<std::mutex>> mutexes;
std::atomic<UInt32> num_finished = 0;
explicit ManyAggregatedData(size_t num_threads = 0) : variants(num_threads)
explicit ManyAggregatedData(size_t num_threads = 0) : variants(num_threads), mutexes(num_threads)
{
for (auto & elem : variants)
elem = std::make_shared<AggregatedDataVariants>();
for (auto & mut : mutexes)
mut = std::make_unique<std::mutex>();
}
};

View File

@ -507,6 +507,7 @@ Processors createMergingAggregatedMemoryEfficientPipe(
for (size_t i = 0; i < num_merging_processors; ++i, ++in, ++out)
{
auto transform = std::make_shared<MergingAggregatedBucketTransform>(params);
transform->setStream(i);
connect(*out, transform->getInputPort());
connect(transform->getOutputPort(), *in);
processors.emplace_back(std::move(transform));

View File

@ -41,7 +41,6 @@ namespace ErrorCodes
namespace
{
static constexpr const std::chrono::seconds max_sleep_time{30};
static constexpr const std::chrono::minutes decrease_error_count_period{5};
template <typename PoolFactory>
@ -66,6 +65,7 @@ StorageDistributedDirectoryMonitor::StorageDistributedDirectoryMonitor(
, current_batch_file_path{path + "current_batch.txt"}
, default_sleep_time{storage.global_context.getSettingsRef().distributed_directory_monitor_sleep_time_ms.totalMilliseconds()}
, sleep_time{default_sleep_time}
, max_sleep_time{storage.global_context.getSettingsRef().distributed_directory_monitor_max_sleep_time_ms.totalMilliseconds()}
, log{&Logger::get(getLoggerName())}
, monitor_blocker(monitor_blocker_)
{
@ -138,7 +138,7 @@ void StorageDistributedDirectoryMonitor::run()
++error_count;
sleep_time = std::min(
std::chrono::milliseconds{Int64(default_sleep_time.count() * std::exp2(error_count))},
std::chrono::milliseconds{max_sleep_time});
max_sleep_time);
tryLogCurrentException(getLoggerName().data());
}
}
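The hunk above caps the monitor's retry delay at the new `max_sleep_time` member: the delay doubles with every error until it reaches the cap. A minimal sketch of that capped exponential backoff follows. `backoffDelay` is a hypothetical helper, and unlike the diff it clamps in floating point before converting back to an integer.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>

/// delay = min(base * 2^error_count, cap)
std::chrono::milliseconds backoffDelay(
    std::chrono::milliseconds base,
    std::chrono::milliseconds cap,
    std::size_t error_count)
{
    /// Scale and clamp in floating point, then convert back to a tick count.
    double scaled = static_cast<double>(base.count()) * std::exp2(static_cast<double>(error_count));
    double capped = std::min(scaled, static_cast<double>(cap.count()));
    return std::chrono::milliseconds{static_cast<std::int64_t>(capped)};
}
```

With an illustrative base of 100 ms and a cap of 30 s (values chosen for the example, not taken from the settings), three consecutive errors give 800 ms and twenty errors are already clamped to 30 s.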

View File

@ -56,6 +56,7 @@ private:
size_t error_count{};
std::chrono::milliseconds default_sleep_time;
std::chrono::milliseconds sleep_time;
std::chrono::milliseconds max_sleep_time;
std::chrono::time_point<std::chrono::system_clock> last_decrease_time {std::chrono::system_clock::now()};
std::atomic<bool> quit {false};
std::mutex mutex;

View File

@ -365,8 +365,8 @@ public:
/** Notify engine about updated dependencies for this storage. */
virtual void updateDependencies() {}
/// Returns data path if storage supports it, empty string otherwise.
virtual String getDataPath() const { return {}; }
/// Returns data paths if storage supports it, empty vector otherwise.
virtual Strings getDataPaths() const { return {}; }
/// Returns ASTExpressionList of partition key expression for storage or nullptr if there is none.
virtual ASTPtr getPartitionKeyAST() const { return nullptr; }
@ -398,6 +398,8 @@ public:
/// Returns names of primary key + secondary sorting columns
virtual Names getSortingKeyColumns() const { return {}; }
/// Returns storage policy if storage supports it
virtual DiskSpace::StoragePolicyPtr getStoragePolicy() const { return {}; }
private:
/// You always need to take the next three locks in this order.

View File

@ -150,8 +150,6 @@ bool ReadBufferFromKafkaConsumer::nextImpl()
return true;
}
put_delimiter = (delimiter != 0);
if (current == messages.end())
{
if (intermediate_commit)
@ -183,6 +181,7 @@ bool ReadBufferFromKafkaConsumer::nextImpl()
// XXX: very fishy place with const casting.
auto new_position = reinterpret_cast<char *>(const_cast<unsigned char *>(current->get_payload().get_data()));
BufferBase::set(new_position, current->get_payload().get_size(), 0);
put_delimiter = (delimiter != 0);
/// Since we can poll more messages than we already processed - commit only processed messages.
consumer->store_offset(*current);

View File

@ -4,7 +4,6 @@
#include <Common/NetException.h>
#include <Common/typeid_cast.h>
#include <IO/HTTPCommon.h>
#include <IO/ReadWriteBufferFromHTTP.h>
#include <Poco/File.h>
#include <ext/scope_guard.h>
#include <Poco/Net/HTTPServerResponse.h>
@ -27,6 +26,7 @@ namespace ErrorCodes
extern const int CANNOT_WRITE_TO_OSTREAM;
extern const int CHECKSUM_DOESNT_MATCH;
extern const int UNKNOWN_TABLE;
extern const int UNKNOWN_PROTOCOL;
extern const int INSECURE_PATH;
}
@ -36,6 +36,9 @@ namespace DataPartsExchange
namespace
{
static constexpr auto REPLICATION_PROTOCOL_VERSION_WITHOUT_PARTS_SIZE = "0";
static constexpr auto REPLICATION_PROTOCOL_VERSION_WITH_PARTS_SIZE = "1";
std::string getEndpointId(const std::string & node_id)
{
return "DataPartsExchange:" + node_id;
@ -53,7 +56,14 @@ void Service::processQuery(const Poco::Net::HTMLForm & params, ReadBuffer & /*bo
if (blocker.isCancelled())
throw Exception("Transferring part to replica was cancelled", ErrorCodes::ABORTED);
String client_protocol_version = params.get("client_protocol_version", REPLICATION_PROTOCOL_VERSION_WITHOUT_PARTS_SIZE);
String part_name = params.get("part");
if (client_protocol_version != REPLICATION_PROTOCOL_VERSION_WITH_PARTS_SIZE && client_protocol_version != REPLICATION_PROTOCOL_VERSION_WITHOUT_PARTS_SIZE)
throw Exception("Unsupported fetch protocol version", ErrorCodes::UNKNOWN_PROTOCOL);
const auto data_settings = data.getSettings();
/// Validation of the input that may come from malicious replica.
@ -70,6 +80,8 @@ void Service::processQuery(const Poco::Net::HTMLForm & params, ReadBuffer & /*bo
response.setChunkedTransferEncoding(false);
return;
}
response.addCookie({"server_protocol_version", REPLICATION_PROTOCOL_VERSION_WITH_PARTS_SIZE});
++total_sends;
SCOPE_EXIT({--total_sends;});
@ -100,12 +112,16 @@ void Service::processQuery(const Poco::Net::HTMLForm & params, ReadBuffer & /*bo
MergeTreeData::DataPart::Checksums data_checksums;
if (client_protocol_version == REPLICATION_PROTOCOL_VERSION_WITH_PARTS_SIZE)
writeBinary(checksums.getTotalSizeOnDisk(), out);
writeBinary(checksums.files.size(), out);
for (const auto & it : checksums.files)
{
String file_name = it.first;
String path = data.getFullPath() + part_name + "/" + file_name;
String path = part->getFullPath() + file_name;
UInt64 size = Poco::File(path).getSize();
@ -183,9 +199,10 @@ MergeTreeData::MutableDataPartPtr Fetcher::fetchPart(
uri.setPort(port);
uri.setQueryParameters(
{
{"endpoint", getEndpointId(replica_path)},
{"part", part_name},
{"compress", "false"}
{"endpoint", getEndpointId(replica_path)},
{"part", part_name},
{"client_protocol_version", REPLICATION_PROTOCOL_VERSION_WITH_PARTS_SIZE},
{"compress", "false"}
});
Poco::Net::HTTPBasicCredentials creds{};
@ -205,11 +222,42 @@ MergeTreeData::MutableDataPartPtr Fetcher::fetchPart(
data_settings->replicated_max_parallel_fetches_for_host
};
auto server_protocol_version = in.getResponseCookie("server_protocol_version", REPLICATION_PROTOCOL_VERSION_WITHOUT_PARTS_SIZE);
DiskSpace::ReservationPtr reservation;
if (server_protocol_version == REPLICATION_PROTOCOL_VERSION_WITH_PARTS_SIZE)
{
size_t sum_files_size;
readBinary(sum_files_size, in);
reservation = data.reserveSpace(sum_files_size);
}
else
{
/// We don't know the real size of the part because the sender's server version is too old.
reservation = data.makeEmptyReservationOnLargestDisk();
}
return downloadPart(part_name, replica_path, to_detached, tmp_prefix_, std::move(reservation), in);
}
MergeTreeData::MutableDataPartPtr Fetcher::downloadPart(
const String & part_name,
const String & replica_path,
bool to_detached,
const String & tmp_prefix_,
const DiskSpace::ReservationPtr reservation,
PooledReadWriteBufferFromHTTP & in)
{
size_t files;
readBinary(files, in);
static const String TMP_PREFIX = "tmp_fetch_";
String tmp_prefix = tmp_prefix_.empty() ? TMP_PREFIX : tmp_prefix_;
String relative_part_path = String(to_detached ? "detached/" : "") + tmp_prefix + part_name;
String absolute_part_path = Poco::Path(data.getFullPath() + relative_part_path + "/").absolute().toString();
String absolute_part_path = Poco::Path(data.getFullPathOnDisk(reservation->getDisk()) + relative_part_path + "/").absolute().toString();
Poco::File part_file(absolute_part_path);
if (part_file.exists())
@ -219,12 +267,11 @@ MergeTreeData::MutableDataPartPtr Fetcher::fetchPart(
part_file.createDirectory();
MergeTreeData::MutableDataPartPtr new_data_part = std::make_shared<MergeTreeData::DataPart>(data, part_name);
MergeTreeData::MutableDataPartPtr new_data_part = std::make_shared<MergeTreeData::DataPart>(data, reservation->getDisk(), part_name);
new_data_part->relative_path = relative_part_path;
new_data_part->is_temp = true;
size_t files;
readBinary(files, in);
MergeTreeData::DataPart::Checksums checksums;
for (size_t i = 0; i < files; ++i)
{
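The version handshake in the hunks above (the client sends `client_protocol_version`, reads `server_protocol_version` from the response cookies with an old-version default, then decides how much space to reserve) can be sketched in isolation. This is a minimal illustration, not the actual ClickHouse code: the constant values, the `Cookies` alias, and the helpers `getCookie` and `bytesToReserve` are hypothetical, and a result of 0 stands in for the "empty reservation" fallback.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical values; the real constants are not shown in this diff.
const std::string VERSION_WITHOUT_PARTS_SIZE = "0";
const std::string VERSION_WITH_PARTS_SIZE = "1";

// Hypothetical stand-in for the HTTP response cookies.
using Cookies = std::map<std::string, std::string>;

// Return the cookie value, or the default when the server did not set it.
std::string getCookie(const Cookies & cookies, const std::string & name, const std::string & def)
{
    auto it = cookies.find(name);
    return it == cookies.end() ? def : it->second;
}

// Decide how many bytes to reserve before downloading a part.
// A new server announces the total size of the part files; an old server
// sends no cookie, so the size is unknown and 0 (an empty reservation) is returned.
uint64_t bytesToReserve(const Cookies & response_cookies, uint64_t announced_size)
{
    const std::string version = getCookie(response_cookies, "server_protocol_version", VERSION_WITHOUT_PARTS_SIZE);
    if (version == VERSION_WITH_PARTS_SIZE)
        return announced_size;
    return 0;
}
```

Defaulting the missing cookie to the old version keeps new clients compatible with old senders: the only cost is that the reservation cannot reflect the real part size.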


@ -6,6 +6,7 @@
#include <IO/HashingWriteBuffer.h>
#include <IO/copyData.h>
#include <IO/ConnectionTimeouts.h>
#include <IO/ReadWriteBufferFromHTTP.h>
namespace DB
@ -64,6 +65,14 @@ public:
ActionBlocker blocker;
private:
MergeTreeData::MutableDataPartPtr downloadPart(
const String & part_name,
const String & replica_path,
bool to_detached,
const String & tmp_prefix_,
const DiskSpace::ReservationPtr reservation,
PooledReadWriteBufferFromHTTP & in);
MergeTreeData & data;
Logger * log;
};


@ -1,10 +0,0 @@
#include <Storages/MergeTree/DiskSpaceMonitor.h>
namespace DB
{
UInt64 DiskSpaceMonitor::reserved_bytes;
UInt64 DiskSpaceMonitor::reservation_count;
std::mutex DiskSpaceMonitor::mutex;
}


@ -1,220 +0,0 @@
#pragma once
#include <mutex>
#include <sys/statvfs.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#if defined(__linux__)
#include <cstdio>
#include <mntent.h>
#endif
#include <memory>
#include <filesystem>
#include <boost/noncopyable.hpp>
#include <common/logger_useful.h>
#include <Common/Exception.h>
#include <IO/WriteHelpers.h>
#include <Common/formatReadable.h>
#include <Common/CurrentMetrics.h>
namespace CurrentMetrics
{
extern const Metric DiskSpaceReservedForMerge;
}
namespace DB
{
namespace ErrorCodes
{
extern const int CANNOT_STATVFS;
extern const int NOT_ENOUGH_SPACE;
extern const int SYSTEM_ERROR;
}
/** Determines amount of free space in filesystem.
* Could "reserve" space, for different operations to plan disk space usage.
* Reservations are not separated for different filesystems,
* instead it is assumed, that all reservations are done within same filesystem.
*/
class DiskSpaceMonitor
{
public:
class Reservation : private boost::noncopyable
{
public:
~Reservation()
{
try
{
std::lock_guard lock(DiskSpaceMonitor::mutex);
if (DiskSpaceMonitor::reserved_bytes < size)
{
DiskSpaceMonitor::reserved_bytes = 0;
LOG_ERROR(&Logger::get("DiskSpaceMonitor"), "Unbalanced reservations size; it's a bug");
}
else
{
DiskSpaceMonitor::reserved_bytes -= size;
}
if (DiskSpaceMonitor::reservation_count == 0)
{
LOG_ERROR(&Logger::get("DiskSpaceMonitor"), "Unbalanced reservation count; it's a bug");
}
else
{
--DiskSpaceMonitor::reservation_count;
}
}
catch (...)
{
tryLogCurrentException("~DiskSpaceMonitor");
}
}
/// Change amount of reserved space. When new_size is greater than before, availability of free space is not checked.
void update(UInt64 new_size)
{
std::lock_guard lock(DiskSpaceMonitor::mutex);
DiskSpaceMonitor::reserved_bytes -= size;
size = new_size;
DiskSpaceMonitor::reserved_bytes += size;
}
UInt64 getSize() const
{
return size;
}
Reservation(UInt64 size_)
: size(size_), metric_increment(CurrentMetrics::DiskSpaceReservedForMerge, size)
{
std::lock_guard lock(DiskSpaceMonitor::mutex);
DiskSpaceMonitor::reserved_bytes += size;
++DiskSpaceMonitor::reservation_count;
}
private:
UInt64 size;
CurrentMetrics::Increment metric_increment;
};
using ReservationPtr = std::unique_ptr<Reservation>;
inline static struct statvfs getStatVFS(const std::string & path)
{
struct statvfs fs;
if (statvfs(path.c_str(), &fs) != 0)
throwFromErrnoWithPath("Could not calculate available disk space (statvfs)", path,
ErrorCodes::CANNOT_STATVFS);
return fs;
}
static UInt64 getUnreservedFreeSpace(const std::string & path)
{
struct statvfs fs = getStatVFS(path);
UInt64 res = fs.f_bfree * fs.f_bsize;
/// Heuristic by Michael Kolupaev: reserve 30 MB more, because statvfs shows few megabytes more space than df.
res -= std::min(res, static_cast<UInt64>(30 * (1ul << 20)));
std::lock_guard lock(mutex);
if (reserved_bytes > res)
res = 0;
else
res -= reserved_bytes;
return res;
}
static UInt64 getReservedSpace()
{
std::lock_guard lock(mutex);
return reserved_bytes;
}
static UInt64 getReservationCount()
{
std::lock_guard lock(mutex);
return reservation_count;
}
/// If not enough (approximately) space, throw an exception.
static ReservationPtr reserve(const std::string & path, UInt64 size)
{
UInt64 free_bytes = getUnreservedFreeSpace(path);
if (free_bytes < size)
throw Exception("Not enough free disk space to reserve: " + formatReadableSizeWithBinarySuffix(free_bytes) + " available, "
+ formatReadableSizeWithBinarySuffix(size) + " requested", ErrorCodes::NOT_ENOUGH_SPACE);
return std::make_unique<Reservation>(size);
}
/// Returns mount point of filesystem where absolute_path (must exist) is located
static std::filesystem::path getMountPoint(std::filesystem::path absolute_path)
{
if (absolute_path.is_relative())
throw Exception("Path is relative. It's a bug.", ErrorCodes::LOGICAL_ERROR);
absolute_path = std::filesystem::canonical(absolute_path);
const auto get_device_id = [](const std::filesystem::path & p)
{
struct stat st;
if (stat(p.c_str(), &st))
throwFromErrnoWithPath("Cannot stat " + p.string(), p.string(), ErrorCodes::SYSTEM_ERROR);
return st.st_dev;
};
/// If /some/path/to/dir/ and /some/path/to/ have different device id,
/// then device which contains /some/path/to/dir/filename is mounted to /some/path/to/dir/
auto device_id = get_device_id(absolute_path);
while (absolute_path.has_relative_path())
{
auto parent = absolute_path.parent_path();
auto parent_device_id = get_device_id(parent);
if (device_id != parent_device_id)
return absolute_path;
absolute_path = parent;
device_id = parent_device_id;
}
return absolute_path;
}
/// Returns name of filesystem mounted to mount_point
#if !defined(__linux__)
[[noreturn]]
#endif
static std::string getFilesystemName([[maybe_unused]] const std::string & mount_point)
{
#if defined(__linux__)
auto mounted_filesystems = setmntent("/etc/mtab", "r");
if (!mounted_filesystems)
throw DB::Exception("Cannot open /etc/mtab to get name of filesystem", ErrorCodes::SYSTEM_ERROR);
mntent fs_info;
constexpr size_t buf_size = 4096; /// The same as buffer used for getmntent in glibc. It can happen that it's not enough
char buf[buf_size];
while (getmntent_r(mounted_filesystems, &fs_info, buf, buf_size) && fs_info.mnt_dir != mount_point)
;
endmntent(mounted_filesystems);
if (fs_info.mnt_dir != mount_point)
throw DB::Exception("Cannot find name of filesystem by mount point " + mount_point, ErrorCodes::SYSTEM_ERROR);
return fs_info.mnt_fsname;
#else
throw DB::Exception("Supported on linux only", ErrorCodes::NOT_IMPLEMENTED);
#endif
}
private:
static UInt64 reserved_bytes;
static UInt64 reservation_count;
static std::mutex mutex;
};
}
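The removed `DiskSpaceMonitor` header above follows an RAII reservation pattern: constructing a reservation adds its size to a shared counter under a mutex, and destroying it subtracts the size again. The sketch below shows only that pattern under simplifying assumptions. `SpaceTracker` is a hypothetical name, the free space is passed in rather than read with `statvfs`, the counter is per-object rather than static, and a failed reservation returns `nullptr` where the original throws `NOT_ENOUGH_SPACE`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>

// Hypothetical simplified tracker: one counter of reserved bytes guarded by a mutex.
class SpaceTracker
{
public:
    class Reservation
    {
    public:
        Reservation(SpaceTracker & tracker_, uint64_t size_) : tracker(tracker_), size(size_)
        {
            std::lock_guard<std::mutex> lock(tracker.mutex);
            tracker.reserved_bytes += size;
        }

        Reservation(const Reservation &) = delete;
        Reservation & operator=(const Reservation &) = delete;

        ~Reservation()
        {
            std::lock_guard<std::mutex> lock(tracker.mutex);
            // Clamp at zero instead of underflowing, as the original destructor does.
            tracker.reserved_bytes -= std::min(tracker.reserved_bytes, size);
        }

    private:
        SpaceTracker & tracker;
        uint64_t size;
    };

    explicit SpaceTracker(uint64_t free_bytes_) : free_bytes(free_bytes_) {}

    // Free space minus everything currently reserved, never below zero.
    uint64_t unreserved()
    {
        std::lock_guard<std::mutex> lock(mutex);
        return reserved_bytes > free_bytes ? 0 : free_bytes - reserved_bytes;
    }

    // The check and the reservation are two separate critical sections,
    // so the check is approximate (the original has the same property).
    std::unique_ptr<Reservation> reserve(uint64_t size)
    {
        if (unreserved() < size)
            return nullptr;
        return std::make_unique<Reservation>(*this, size);
    }

private:
    std::mutex mutex;
    uint64_t free_bytes;
    uint64_t reserved_bytes = 0;
};
```

Holding the reservation in a `std::unique_ptr` ties the reserved bytes to the lifetime of the operation that needs them, which is why the new code in this commit can hand a `ReservationPtr` to `downloadPart` with `std::move`.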


@ -27,8 +27,8 @@ void MergeTreeBlockOutputStream::write(const Block & block)
PartLog::addNewPart(storage.global_context, part, watch.elapsed());
/// Initiate async merge - it will be done if it's a good time for a merge and if there is space in 'background_pool'.
if (storage.background_task_handle)
storage.background_task_handle->wake();
if (storage.merging_mutating_task_handle)
storage.merging_mutating_task_handle->wake();
}
}


@ -92,12 +92,14 @@ namespace ErrorCodes
extern const int BAD_DATA_PART_NAME;
extern const int UNKNOWN_SETTING;
extern const int READONLY_SETTING;
extern const int ABORTED;
}
MergeTreeData::MergeTreeData(
const String & database_, const String & table_,
const String & full_path_, const ColumnsDescription & columns_,
const String & database_,
const String & table_,
const ColumnsDescription & columns_,
const IndicesDescription & indices_,
const ConstraintsDescription & constraints_,
Context & context_,
@ -112,19 +114,22 @@ MergeTreeData::MergeTreeData(
bool require_part_metadata_,
bool attach,
BrokenPartCallback broken_part_callback_)
: global_context(context_),
merging_params(merging_params_),
partition_by_ast(partition_by_ast_),
sample_by_ast(sample_by_ast_),
ttl_table_ast(ttl_table_ast_),
require_part_metadata(require_part_metadata_),
database_name(database_), table_name(table_),
full_path(full_path_),
broken_part_callback(broken_part_callback_),
log_name(database_name + "." + table_name), log(&Logger::get(log_name)),
storage_settings(std::move(storage_settings_)),
data_parts_by_info(data_parts_indexes.get<TagByInfo>()),
data_parts_by_state_and_info(data_parts_indexes.get<TagByStateAndInfo>())
: global_context(context_)
, merging_params(merging_params_)
, partition_by_ast(partition_by_ast_)
, sample_by_ast(sample_by_ast_)
, ttl_table_ast(ttl_table_ast_)
, require_part_metadata(require_part_metadata_)
, database_name(database_)
, table_name(table_)
, broken_part_callback(broken_part_callback_)
, log_name(database_name + "." + table_name)
, log(&Logger::get(log_name))
, storage_settings(std::move(storage_settings_))
, storage_policy(context_.getStoragePolicy(getSettings()->storage_policy_name))
, data_parts_by_info(data_parts_indexes.get<TagByInfo>())
, data_parts_by_state_and_info(data_parts_indexes.get<TagByStateAndInfo>())
, parts_mover(this)
{
const auto settings = getSettings();
setProperties(order_by_ast_, primary_key_ast_, columns_, indices_, constraints_);
@ -143,6 +148,7 @@ MergeTreeData::MergeTreeData(
auto syntax = SyntaxAnalyzer(global_context).analyze(sample_by_ast, getColumns().getAllPhysical());
columns_required_for_sampling = syntax->requiredSourceColumns();
}
MergeTreeDataFormatVersion min_format_version(0);
if (!date_column_name.empty())
{
@ -170,22 +176,40 @@ MergeTreeData::MergeTreeData(
setTTLExpressions(columns_.getColumnTTLs(), ttl_table_ast_);
auto path_exists = Poco::File(full_path).exists();
// format_file always contained on any data path
String version_file_path;
/// Creating directories, if not exist.
Poco::File(full_path).createDirectories();
auto paths = getDataPaths();
for (const String & path : paths)
{
Poco::File(path).createDirectories();
Poco::File(path + "detached").createDirectory();
if (Poco::File{path + "format_version.txt"}.exists())
{
if (!version_file_path.empty())
{
LOG_ERROR(log, "Duplication of version file " << version_file_path << " and " << path << "format_version.txt");
throw Exception("Multiple format_version.txt files", ErrorCodes::CORRUPTED_DATA);
}
version_file_path = path + "format_version.txt";
}
}
Poco::File(full_path + "detached").createDirectory();
/// If no version file was found, place it on any disk of the policy
if (version_file_path.empty())
version_file_path = getFullPathOnDisk(storage_policy->getAnyDisk()) + "format_version.txt";
bool version_file_exists = Poco::File(version_file_path).exists();
String version_file_path = full_path + "format_version.txt";
auto version_file_exists = Poco::File(version_file_path).exists();
// When data path or file not exists, ignore the format_version check
if (!attach || !path_exists || !version_file_exists)
if (!attach || !version_file_exists)
{
format_version = min_format_version;
WriteBufferFromFile buf(version_file_path);
writeIntText(format_version.toUnderType(), buf);
}
else if (Poco::File(version_file_path).exists())
else
{
ReadBufferFromFile buf(version_file_path);
UInt32 read_format_version;
@ -194,8 +218,6 @@ MergeTreeData::MergeTreeData(
if (!buf.eof())
throw Exception("Bad version file: " + version_file_path, ErrorCodes::CORRUPTED_DATA);
}
else
format_version = 0;
if (format_version < min_format_version)
{
@ -734,28 +756,38 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
LOG_DEBUG(log, "Loading data parts");
const auto settings = getSettings();
std::vector<std::pair<String, DiskSpace::DiskPtr>> part_names_with_disks;
Strings part_file_names;
Poco::DirectoryIterator end;
for (Poco::DirectoryIterator it(full_path); it != end; ++it)
{
/// Skip temporary directories.
if (startsWith(it.name(), "tmp"))
continue;
part_file_names.push_back(it.name());
auto disks = storage_policy->getDisks();
/// Iterate in reverse order so that parts on low-priority disks are loaded first.
/// This keeps the part on the low-priority disk if a duplicate is found.
for (auto disk_it = disks.rbegin(); disk_it != disks.rend(); ++disk_it)
{
auto disk_ptr = *disk_it;
for (Poco::DirectoryIterator it(getFullPathOnDisk(disk_ptr)); it != end; ++it)
{
/// Skip temporary directories.
if (startsWith(it.name(), "tmp"))
continue;
part_names_with_disks.emplace_back(it.name(), disk_ptr);
}
}
auto part_lock = lockParts();
data_parts_indexes.clear();
if (part_file_names.empty())
if (part_names_with_disks.empty())
{
LOG_DEBUG(log, "There is no data parts");
return;
}
/// Parallel loading of data parts.
size_t num_threads = std::min(size_t(settings->max_part_loading_threads), part_file_names.size());
size_t num_threads = std::min(size_t(settings->max_part_loading_threads), part_names_with_disks.size());
std::mutex mutex;
@ -768,16 +800,18 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
ThreadPool pool(num_threads);
for (const String & file_name : part_file_names)
for (size_t i = 0; i < part_names_with_disks.size(); ++i)
{
pool.schedule([&]
pool.schedule([&, i]
{
const auto & part_name = part_names_with_disks[i].first;
const auto part_disk_ptr = part_names_with_disks[i].second;
MergeTreePartInfo part_info;
if (!MergeTreePartInfo::tryParsePartName(file_name, &part_info, format_version))
if (!MergeTreePartInfo::tryParsePartName(part_name, &part_info, format_version))
return;
MutableDataPartPtr part = std::make_shared<DataPart>(*this, file_name, part_info);
part->relative_path = file_name;
MutableDataPartPtr part = std::make_shared<DataPart>(*this, part_disk_ptr, part_name, part_info);
part->relative_path = part_name;
bool broken = false;
try
@ -810,7 +844,7 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
if (part->info.level == 0)
{
/// It is impossible to restore level 0 parts.
LOG_ERROR(log, "Considering to remove broken part " << full_path << file_name << " because it's impossible to repair.");
LOG_ERROR(log, "Considering to remove broken part " << getFullPathOnDisk(part_disk_ptr) << part_name << " because it's impossible to repair.");
std::lock_guard loading_lock(mutex);
broken_parts_to_remove.push_back(part);
}
@ -821,11 +855,11 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
/// delete it.
size_t contained_parts = 0;
LOG_ERROR(log, "Part " << full_path << file_name << " is broken. Looking for parts to replace it.");
LOG_ERROR(log, "Part " << getFullPathOnDisk(part_disk_ptr) << part_name << " is broken. Looking for parts to replace it.");
for (const String & contained_name : part_file_names)
for (const auto & [contained_name, contained_disk_ptr] : part_names_with_disks)
{
if (contained_name == file_name)
if (contained_name == part_name)
continue;
MergeTreePartInfo contained_part_info;
@ -834,20 +868,20 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
if (part->info.contains(contained_part_info))
{
LOG_ERROR(log, "Found part " << full_path << contained_name);
LOG_ERROR(log, "Found part " << getFullPathOnDisk(contained_disk_ptr) << contained_name);
++contained_parts;
}
}
if (contained_parts >= 2)
{
LOG_ERROR(log, "Considering to remove broken part " << full_path << file_name << " because it covers at least 2 other parts");
LOG_ERROR(log, "Considering to remove broken part " << getFullPathOnDisk(part_disk_ptr) << part_name << " because it covers at least 2 other parts");
std::lock_guard loading_lock(mutex);
broken_parts_to_remove.push_back(part);
}
else
{
LOG_ERROR(log, "Detaching broken part " << full_path << file_name
LOG_ERROR(log, "Detaching broken part " << getFullPathOnDisk(part_disk_ptr) << part_name
<< " because it covers less than 2 parts. You need to resolve this manually");
std::lock_guard loading_lock(mutex);
broken_parts_to_detach.push_back(part);
@ -862,7 +896,7 @@ void MergeTreeData::loadDataParts(bool skip_sanity_checks)
else
has_adaptive_parts.store(true, std::memory_order_relaxed);
part->modification_time = Poco::File(full_path + file_name).getLastModified().epochTime();
part->modification_time = Poco::File(getFullPathOnDisk(part_disk_ptr) + part_name).getLastModified().epochTime();
/// Assume that all parts are Committed, covered parts will be detected and marked as Outdated later
part->state = DataPartState::Committed;
@ -972,25 +1006,30 @@ void MergeTreeData::clearOldTemporaryDirectories(ssize_t custom_directories_life
? current_time - custom_directories_lifetime_seconds
: current_time - settings->temporary_directories_lifetime.totalSeconds();
const auto full_paths = getDataPaths();
/// Delete temporary directories older than a day.
Poco::DirectoryIterator end;
for (Poco::DirectoryIterator it{full_path}; it != end; ++it)
for (auto && full_data_path : full_paths)
{
if (startsWith(it.name(), "tmp_"))
for (Poco::DirectoryIterator it{full_data_path}; it != end; ++it)
{
Poco::File tmp_dir(full_path + it.name());
if (startsWith(it.name(), "tmp_"))
{
Poco::File tmp_dir(full_data_path + it.name());
try
{
if (tmp_dir.isDirectory() && isOldPartDirectory(tmp_dir, deadline))
try
{
LOG_WARNING(log, "Removing temporary directory " << full_path << it.name());
Poco::File(full_path + it.name()).remove(true);
if (tmp_dir.isDirectory() && isOldPartDirectory(tmp_dir, deadline))
{
LOG_WARNING(log, "Removing temporary directory " << full_data_path << it.name());
Poco::File(full_data_path + it.name()).remove(true);
}
}
catch (const Poco::FileNotFoundException &)
{
/// If the file is already deleted, do nothing.
}
}
catch (const Poco::FileNotFoundException &)
{
/// If the file is already deleted, do nothing.
}
}
}
@ -1135,15 +1174,42 @@ void MergeTreeData::clearPartsFromFilesystem(const DataPartsVector & parts_to_re
}
}
void MergeTreeData::setPath(const String & new_full_path)
void MergeTreeData::rename(
const String & /*new_path_to_db*/, const String & new_database_name,
const String & new_table_name, TableStructureWriteLockHolder &)
{
if (Poco::File{new_full_path}.exists())
throw Exception{"Target path already exists: " + new_full_path, ErrorCodes::DIRECTORY_ALREADY_EXISTS};
auto old_file_db_name = escapeForFileName(database_name);
auto new_file_db_name = escapeForFileName(new_database_name);
auto old_file_table_name = escapeForFileName(table_name);
auto new_file_table_name = escapeForFileName(new_table_name);
Poco::File(full_path).renameTo(new_full_path);
auto disks = storage_policy->getDisks();
for (const auto & disk : disks)
{
auto new_full_path = disk->getClickHouseDataPath() + new_file_db_name + '/' + new_file_table_name + '/';
if (Poco::File{new_full_path}.exists())
throw Exception{"Target path already exists: " + new_full_path, ErrorCodes::DIRECTORY_ALREADY_EXISTS};
}
for (const auto & disk : disks)
{
auto full_path = disk->getClickHouseDataPath() + old_file_db_name + '/' + old_file_table_name + '/';
auto new_db_path = disk->getClickHouseDataPath() + new_file_db_name + '/';
Poco::File db_file{new_db_path};
if (!db_file.exists())
db_file.createDirectory();
auto new_full_path = new_db_path + new_file_table_name + '/';
Poco::File{full_path}.renameTo(new_full_path);
}
global_context.dropCaches();
full_path = new_full_path;
database_name = new_database_name;
table_name = new_table_name;
}
void MergeTreeData::dropAllData()
@ -1166,7 +1232,10 @@ void MergeTreeData::dropAllData()
/// Removing of each data part before recursive removal of directory is to speed-up removal, because there will be less number of syscalls.
clearPartsFromFilesystem(all_parts);
Poco::File(full_path).remove(true);
auto full_paths = getDataPaths();
for (auto && full_data_path : full_paths)
Poco::File(full_data_path).remove(true);
LOG_TRACE(log, "dropAllData: done.");
}
@ -1554,7 +1623,7 @@ void MergeTreeData::alterDataPart(
exception_message
<< ") need to be "
<< (forbidden_because_of_modify ? "modified" : "removed")
<< " in part " << part->name << " of table at " << full_path << ". Aborting just in case."
<< " in part " << part->name << " of table at " << part->getFullPath() << ". Aborting just in case."
<< " If it is not an error, you could increase merge_tree/"
<< (forbidden_because_of_modify ? "max_files_to_modify_in_alter_columns" : "max_files_to_remove_in_alter_columns")
<< " parameter in configuration file (current value: "
@ -1592,10 +1661,11 @@ void MergeTreeData::alterDataPart(
* will have old name of shared offsets for arrays.
*/
IMergedBlockOutputStream::WrittenOffsetColumns unused_written_offsets;
MergedColumnOnlyOutputStream out(
*this,
in.getHeader(),
full_path + part->name + '/',
part->getFullPath(),
true /* sync */,
compression_codec,
true /* skip_offsets */,
@ -1629,7 +1699,7 @@ void MergeTreeData::alterDataPart(
if (!part->checksums.empty())
{
transaction->new_checksums = new_checksums;
WriteBufferFromFile checksums_file(full_path + part->name + "/checksums.txt.tmp", 4096);
WriteBufferFromFile checksums_file(part->getFullPath() + "checksums.txt.tmp", 4096);
new_checksums.write(checksums_file);
transaction->rename_map["checksums.txt.tmp"] = "checksums.txt";
}
@ -1637,7 +1707,7 @@ void MergeTreeData::alterDataPart(
/// Write the new column list to the temporary file.
{
transaction->new_columns = new_columns.filter(part->columns.getNames());
WriteBufferFromFile columns_file(full_path + part->name + "/columns.txt.tmp", 4096);
WriteBufferFromFile columns_file(part->getFullPath() + "columns.txt.tmp", 4096);
transaction->new_columns.writeText(columns_file);
transaction->rename_map["columns.txt.tmp"] = "columns.txt";
}
@ -1721,7 +1791,7 @@ void MergeTreeData::AlterDataPartTransaction::commit()
{
std::unique_lock<std::shared_mutex> lock(data_part->columns_lock);
String path = data_part->storage.full_path + data_part->name + "/";
String path = data_part->getFullPath();
/// NOTE: checking that a file exists before renaming or deleting it
/// is justified by the fact that, when converting an ordinary column
@ -1811,6 +1881,19 @@ MergeTreeData::AlterDataPartTransaction::~AlterDataPartTransaction()
void MergeTreeData::PartsTemporaryRename::addPart(const String & old_name, const String & new_name)
{
old_and_new_names.push_back({old_name, new_name});
const auto paths = storage.getDataPaths();
for (const auto & full_path : paths)
{
for (Poco::DirectoryIterator it = Poco::DirectoryIterator(full_path + source_dir); it != Poco::DirectoryIterator(); ++it)
{
String name = it.name();
if (name == old_name)
{
old_part_name_to_full_path[old_name] = full_path;
break;
}
}
}
}
void MergeTreeData::PartsTemporaryRename::tryRenameAll()
@ -1823,7 +1906,8 @@ void MergeTreeData::PartsTemporaryRename::tryRenameAll()
const auto & names = old_and_new_names[i];
if (names.first.empty() || names.second.empty())
throw DB::Exception("Empty part name. Most likely it's a bug.", ErrorCodes::INCORRECT_FILE_NAME);
Poco::File(base_dir + names.first).renameTo(base_dir + names.second);
const auto full_path = old_part_name_to_full_path[names.first] + source_dir; /// old_name
Poco::File(full_path + names.first).renameTo(full_path + names.second);
}
catch (...)
{
@ -1843,9 +1927,11 @@ MergeTreeData::PartsTemporaryRename::~PartsTemporaryRename()
{
if (names.first.empty())
continue;
try
{
Poco::File(base_dir + names.second).renameTo(base_dir + names.first);
const auto full_path = old_part_name_to_full_path[names.first] + source_dir; /// old_name
Poco::File(full_path + names.second).renameTo(full_path + names.first);
}
catch (...)
{
@ -2425,6 +2511,29 @@ MergeTreeData::DataPartPtr MergeTreeData::getActiveContainingPart(
return nullptr;
}
void MergeTreeData::swapActivePart(MergeTreeData::DataPartPtr part_copy)
{
auto lock = lockParts();
for (const auto & original_active_part : getDataPartsStateRange(DataPartState::Committed))
{
if (part_copy->name == original_active_part->name)
{
auto active_part_it = data_parts_by_info.find(original_active_part->info);
if (active_part_it == data_parts_by_info.end())
throw Exception("Cannot swap part '" + part_copy->name + "', no such active part.", ErrorCodes::NO_SUCH_DATA_PART);
modifyPartState(original_active_part, DataPartState::DeleteOnDestroy);
data_parts_indexes.erase(active_part_it);
auto part_it = data_parts_indexes.insert(part_copy).first;
modifyPartState(part_it, DataPartState::Committed);
return;
}
}
throw Exception("Cannot swap part '" + part_copy->name + "', no such active part.", ErrorCodes::NO_SUCH_DATA_PART);
}
MergeTreeData::DataPartPtr MergeTreeData::getActiveContainingPart(const MergeTreePartInfo & part_info)
{
auto lock = lockParts();
@ -2472,9 +2581,9 @@ MergeTreeData::DataPartPtr MergeTreeData::getPartIfExists(const String & part_na
}
MergeTreeData::MutableDataPartPtr MergeTreeData::loadPartAndFixMetadata(const String & relative_path)
MergeTreeData::MutableDataPartPtr MergeTreeData::loadPartAndFixMetadata(const DiskSpace::DiskPtr & disk, const String & relative_path)
{
MutableDataPartPtr part = std::make_shared<DataPart>(*this, Poco::Path(relative_path).getFileName());
MutableDataPartPtr part = std::make_shared<DataPart>(*this, disk, Poco::Path(relative_path).getFileName());
part->relative_path = relative_path;
loadPartAndFixMetadata(part);
return part;
@ -2591,6 +2700,74 @@ void MergeTreeData::freezePartition(const ASTPtr & partition_ast, const String &
}
void MergeTreeData::movePartitionToDisk(const ASTPtr & partition, const String & name, bool moving_part, const Context & context)
{
String partition_id;
if (moving_part)
partition_id = partition->as<ASTLiteral &>().value.safeGet<String>();
else
partition_id = getPartitionIDFromQuery(partition, context);
DataPartsVector parts;
if (moving_part)
{
parts.push_back(getActiveContainingPart(partition_id));
if (!parts.back())
throw Exception("Part " + partition_id + " does not exist or is not active", ErrorCodes::NO_SUCH_DATA_PART);
}
else
parts = getDataPartsVectorInPartition(MergeTreeDataPartState::Committed, partition_id);
auto disk = storage_policy->getDiskByName(name);
if (!disk)
throw Exception("Disk " + name + " does not exist in policy " + storage_policy->getName(), ErrorCodes::UNKNOWN_DISK);
for (const auto & part : parts)
{
if (part->disk->getName() == disk->getName())
throw Exception("Part " + part->name + " already on disk " + name, ErrorCodes::UNKNOWN_DISK);
}
if (!movePartsToSpace(parts, std::static_pointer_cast<const DiskSpace::Space>(disk)))
throw Exception("Cannot move parts because moves are manually disabled.", ErrorCodes::ABORTED);
}
void MergeTreeData::movePartitionToVolume(const ASTPtr & partition, const String & name, bool moving_part, const Context & context)
{
String partition_id;
if (moving_part)
partition_id = partition->as<ASTLiteral &>().value.safeGet<String>();
else
partition_id = getPartitionIDFromQuery(partition, context);
DataPartsVector parts;
if (moving_part)
{
parts.push_back(getActiveContainingPart(partition_id));
if (!parts.back())
throw Exception("Part " + partition_id + " does not exist or is not active", ErrorCodes::NO_SUCH_DATA_PART);
}
else
parts = getDataPartsVectorInPartition(MergeTreeDataPartState::Committed, partition_id);
auto volume = storage_policy->getVolumeByName(name);
if (!volume)
throw Exception("Volume " + name + " does not exist in policy " + storage_policy->getName(), ErrorCodes::UNKNOWN_DISK);
for (const auto & part : parts)
for (const auto & disk : volume->disks)
if (part->disk->getName() == disk->getName())
throw Exception("Part " + part->name + " already on volume '" + name + "'", ErrorCodes::UNKNOWN_DISK);
if (!movePartsToSpace(parts, std::static_pointer_cast<const DiskSpace::Space>(volume)))
throw Exception("Cannot move parts because moves are manually disabled.", ErrorCodes::ABORTED);
}
String MergeTreeData::getPartitionIDFromQuery(const ASTPtr & ast, const Context & context)
{
const auto & partition_ast = ast->as<ASTPartition &>();
@ -2714,15 +2891,18 @@ MergeTreeData::getDetachedParts() const
{
std::vector<DetachedPartInfo> res;
for (Poco::DirectoryIterator it(full_path + "detached");
it != Poco::DirectoryIterator(); ++it)
for (const String & path : getDataPaths())
{
auto dir_name = it.name();
for (Poco::DirectoryIterator it(path + "detached");
it != Poco::DirectoryIterator(); ++it)
{
auto dir_name = it.name();
res.emplace_back();
auto & part = res.back();
res.emplace_back();
auto & part = res.back();
DetachedPartInfo::tryParseDetachedPartName(dir_name, part, format_version);
DetachedPartInfo::tryParseDetachedPartName(dir_name, part, format_version);
}
}
return res;
}
@ -2730,10 +2910,11 @@ MergeTreeData::getDetachedParts() const
void MergeTreeData::validateDetachedPartName(const String & name) const
{
if (name.find('/') != std::string::npos || name == "." || name == "..")
throw DB::Exception("Invalid part name", ErrorCodes::INCORRECT_FILE_NAME);
throw DB::Exception("Invalid part name '" + name + "'", ErrorCodes::INCORRECT_FILE_NAME);
Poco::File detached_part_dir(full_path + "detached/" + name);
if (!detached_part_dir.exists())
String full_path = getFullPathForPart(name, "detached/");
if (full_path.empty() || !Poco::File(full_path + name).exists())
throw DB::Exception("Detached part \"" + name + "\" not found" , ErrorCodes::BAD_DATA_PART_NAME);
if (startsWith(name, "attaching_") || startsWith(name, "deleting_"))
@ -2744,7 +2925,7 @@ void MergeTreeData::validateDetachedPartName(const String & name) const
void MergeTreeData::dropDetached(const ASTPtr & partition, bool part, const Context & context)
{
PartsTemporaryRename renamed_parts(*this, full_path + "detached/");
PartsTemporaryRename renamed_parts(*this, "detached/");
if (part)
{
@ -2766,11 +2947,11 @@ void MergeTreeData::dropDetached(const ASTPtr & partition, bool part, const Cont
renamed_parts.tryRenameAll();
for (auto & names : renamed_parts.old_and_new_names)
for (auto & [old_name, new_name] : renamed_parts.old_and_new_names)
{
Poco::File(renamed_parts.base_dir + names.second).remove(true);
LOG_DEBUG(log, "Dropped detached part " << names.first);
names.first.clear();
Poco::File(renamed_parts.old_part_name_to_full_path[old_name] + "detached/" + new_name).remove(true);
LOG_DEBUG(log, "Dropped detached part " << old_name);
old_name.clear();
}
}
@ -2779,6 +2960,7 @@ MergeTreeData::MutableDataPartsVector MergeTreeData::tryLoadPartsToAttach(const
{
String source_dir = "detached/";
std::map<String, DiskSpace::DiskPtr> name_to_disk;
/// Let's compose a list of parts that should be added.
if (attach_part)
{
@ -2792,35 +2974,44 @@ MergeTreeData::MutableDataPartsVector MergeTreeData::tryLoadPartsToAttach(const
LOG_DEBUG(log, "Looking for parts for partition " << partition_id << " in " << source_dir);
ActiveDataPartSet active_parts(format_version);
std::set<String> part_names;
for (Poco::DirectoryIterator it = Poco::DirectoryIterator(full_path + source_dir); it != Poco::DirectoryIterator(); ++it)
const auto disks = storage_policy->getDisks();
for (const DiskSpace::DiskPtr & disk : disks)
{
String name = it.name();
MergeTreePartInfo part_info;
// TODO what if name contains "_tryN" suffix?
/// Parts with prefix in name (e.g. attaching_1_3_3_0, deleting_1_3_3_0) will be ignored
if (!MergeTreePartInfo::tryParsePartName(name, &part_info, format_version))
continue;
if (part_info.partition_id != partition_id)
continue;
LOG_DEBUG(log, "Found part " << name);
active_parts.add(name);
part_names.insert(name);
const auto full_path = getFullPathOnDisk(disk);
for (Poco::DirectoryIterator it = Poco::DirectoryIterator(full_path + source_dir); it != Poco::DirectoryIterator(); ++it)
{
const String & name = it.name();
MergeTreePartInfo part_info;
// TODO what if name contains "_tryN" suffix?
/// Parts with prefix in name (e.g. attaching_1_3_3_0, deleting_1_3_3_0) will be ignored
if (!MergeTreePartInfo::tryParsePartName(name, &part_info, format_version)
|| part_info.partition_id != partition_id)
{
continue;
}
LOG_DEBUG(log, "Found part " << name);
active_parts.add(name);
name_to_disk[name] = disk;
}
}
LOG_DEBUG(log, active_parts.size() << " of them are active");
/// Rename inactive parts so they cannot be attached in case of repeated ATTACH.
for (const auto & name : part_names)
for (const auto & [name, disk] : name_to_disk)
{
String containing_part = active_parts.getContainingPart(name);
if (!containing_part.empty() && containing_part != name)
{
auto full_path = getFullPathOnDisk(disk);
// TODO maybe use PartsTemporaryRename here?
Poco::File(full_path + source_dir + name).renameTo(full_path + source_dir + "inactive_" + name);
Poco::File(full_path + source_dir + name)
.renameTo(full_path + source_dir + "inactive_" + name);
}
else
renamed_parts.addPart(name, "attaching_" + name);
}
}
/// Try to rename all parts before attaching to prevent race with DROP DETACHED and another ATTACH.
renamed_parts.tryRenameAll();
@ -2831,7 +3022,7 @@ MergeTreeData::MutableDataPartsVector MergeTreeData::tryLoadPartsToAttach(const
for (const auto & part_names : renamed_parts.old_and_new_names)
{
LOG_DEBUG(log, "Checking part " << part_names.second);
MutableDataPartPtr part = std::make_shared<DataPart>(*this, part_names.first);
MutableDataPartPtr part = std::make_shared<DataPart>(*this, name_to_disk[part_names.first], part_names.first);
part->relative_path = source_dir + part_names.second;
loadPartAndFixMetadata(part);
loaded_parts.push_back(part);
@ -2840,6 +3031,20 @@ MergeTreeData::MutableDataPartsVector MergeTreeData::tryLoadPartsToAttach(const
return loaded_parts;
}
DiskSpace::ReservationPtr MergeTreeData::reserveSpace(UInt64 expected_size)
{
constexpr UInt64 RESERVATION_MIN_ESTIMATION_SIZE = 1u * 1024u * 1024u; /// 1MB
expected_size = std::max(RESERVATION_MIN_ESTIMATION_SIZE, expected_size);
auto reservation = storage_policy->reserve(expected_size);
if (reservation)
return reservation;
throw Exception("Cannot reserve " + formatReadableSizeWithBinarySuffix(expected_size) + ", not enough space.",
ErrorCodes::NOT_ENOUGH_SPACE);
}
MergeTreeData::DataParts MergeTreeData::getDataParts(const DataPartStates & affordable_states) const
{
DataParts res;
@ -3024,7 +3229,9 @@ MergeTreeData::MutableDataPartPtr MergeTreeData::cloneAndLoadDataPart(const Merg
String dst_part_name = src_part->getNewName(dst_part_info);
String tmp_dst_part_name = tmp_part_prefix + dst_part_name;
Poco::Path dst_part_absolute_path = Poco::Path(full_path + tmp_dst_part_name).absolute();
auto reservation = reserveSpace(src_part->bytes_on_disk);
String dst_part_path = getFullPathOnDisk(reservation->getDisk());
Poco::Path dst_part_absolute_path = Poco::Path(dst_part_path + tmp_dst_part_name).absolute();
Poco::Path src_part_absolute_path = Poco::Path(src_part->getFullPath()).absolute();
if (Poco::File(dst_part_absolute_path).exists())
@ -3033,7 +3240,9 @@ MergeTreeData::MutableDataPartPtr MergeTreeData::cloneAndLoadDataPart(const Merg
LOG_DEBUG(log, "Cloning part " << src_part_absolute_path.toString() << " to " << dst_part_absolute_path.toString());
localBackup(src_part_absolute_path, dst_part_absolute_path);
MergeTreeData::MutableDataPartPtr dst_data_part = std::make_shared<MergeTreeData::DataPart>(*this, dst_part_name, dst_part_info);
MergeTreeData::MutableDataPartPtr dst_data_part = std::make_shared<MergeTreeData::DataPart>(
*this, reservation->getDisk(), dst_part_name, dst_part_info);
dst_data_part->relative_path = tmp_dst_part_name;
dst_data_part->is_temp = true;
@ -3042,18 +3251,49 @@ MergeTreeData::MutableDataPartPtr MergeTreeData::cloneAndLoadDataPart(const Merg
return dst_data_part;
}
String MergeTreeData::getFullPathOnDisk(const DiskSpace::DiskPtr & disk) const
{
return disk->getClickHouseDataPath() + escapeForFileName(database_name) + '/' + escapeForFileName(table_name) + '/';
}
DiskSpace::DiskPtr MergeTreeData::getDiskForPart(const String & part_name, const String & relative_path) const
{
const auto disks = storage_policy->getDisks();
for (const DiskSpace::DiskPtr & disk : disks)
{
const auto disk_path = getFullPathOnDisk(disk);
for (Poco::DirectoryIterator it = Poco::DirectoryIterator(disk_path + relative_path); it != Poco::DirectoryIterator(); ++it)
if (it.name() == part_name)
return disk;
}
return nullptr;
}
String MergeTreeData::getFullPathForPart(const String & part_name, const String & relative_path) const
{
auto disk = getDiskForPart(part_name, relative_path);
if (disk)
return getFullPathOnDisk(disk) + relative_path;
return "";
}
Strings MergeTreeData::getDataPaths() const
{
Strings res;
auto disks = storage_policy->getDisks();
for (const auto & disk : disks)
res.push_back(getFullPathOnDisk(disk));
return res;
}
void MergeTreeData::freezePartitionsByMatcher(MatcherFn matcher, const String & with_name, const Context & context)
{
String clickhouse_path = Poco::Path(context.getPath()).makeAbsolute().toString();
String shadow_path = clickhouse_path + "shadow/";
Poco::File(shadow_path).createDirectories();
String backup_path = shadow_path
+ (!with_name.empty()
? escapeForFileName(with_name)
: toString(Increment(shadow_path + "increment.txt").get(true)))
+ "/";
LOG_DEBUG(log, "Snapshot will be placed at " + backup_path);
String default_shadow_path = clickhouse_path + "shadow/";
Poco::File(default_shadow_path).createDirectories();
auto increment = Increment(default_shadow_path + "increment.txt").get(true);
/// Acquire a snapshot of active data parts to prevent removing while doing backup.
const auto data_parts = getDataParts();
@ -3064,14 +3304,19 @@ void MergeTreeData::freezePartitionsByMatcher(MatcherFn matcher, const String &
if (!matcher(part))
continue;
LOG_DEBUG(log, "Freezing part " << part->name);
String shadow_path = part->disk->getPath() + "shadow/";
Poco::File(shadow_path).createDirectories();
String backup_path = shadow_path
+ (!with_name.empty()
? escapeForFileName(with_name)
: toString(increment))
+ "/";
LOG_DEBUG(log, "Freezing part " << part->name << " snapshot will be placed at " + backup_path);
String part_absolute_path = Poco::Path(part->getFullPath()).absolute().toString();
if (!startsWith(part_absolute_path, clickhouse_path))
throw Exception("Part path " + part_absolute_path + " is not inside " + clickhouse_path, ErrorCodes::LOGICAL_ERROR);
String backup_part_absolute_path = part_absolute_path;
backup_part_absolute_path.replace(0, clickhouse_path.size(), backup_path);
String backup_part_absolute_path = backup_path + "data/" + getDatabaseName() + "/" + getTableName() + "/" + part->relative_path;
localBackup(part_absolute_path, backup_part_absolute_path);
part->is_frozen.store(true, std::memory_order_relaxed);
++parts_processed;
@ -3094,4 +3339,199 @@ bool MergeTreeData::canReplacePartition(const DataPartPtr & src_part) const
return true;
}
void MergeTreeData::writePartLog(
PartLogElement::Type type,
const ExecutionStatus & execution_status,
UInt64 elapsed_ns,
const String & new_part_name,
const DataPartPtr & result_part,
const DataPartsVector & source_parts,
const MergeListEntry * merge_entry)
try
{
auto part_log = global_context.getPartLog(database_name);
if (!part_log)
return;
PartLogElement part_log_elem;
part_log_elem.event_type = type;
part_log_elem.error = static_cast<UInt16>(execution_status.code);
part_log_elem.exception = execution_status.message;
part_log_elem.event_time = time(nullptr);
/// TODO: Stop stopwatch in outer code to exclude ZK timings and so on
part_log_elem.duration_ms = elapsed_ns / 1000000; /// 1 ms = 1,000,000 ns
part_log_elem.database_name = database_name;
part_log_elem.table_name = table_name;
part_log_elem.partition_id = MergeTreePartInfo::fromPartName(new_part_name, format_version).partition_id;
part_log_elem.part_name = new_part_name;
if (result_part)
{
part_log_elem.path_on_disk = result_part->getFullPath();
part_log_elem.bytes_compressed_on_disk = result_part->bytes_on_disk;
part_log_elem.rows = result_part->rows_count;
}
part_log_elem.source_part_names.reserve(source_parts.size());
for (const auto & source_part : source_parts)
part_log_elem.source_part_names.push_back(source_part->name);
if (merge_entry)
{
part_log_elem.rows_read = (*merge_entry)->rows_read;
part_log_elem.bytes_read_uncompressed = (*merge_entry)->bytes_read_uncompressed;
part_log_elem.rows = (*merge_entry)->rows_written;
part_log_elem.bytes_uncompressed = (*merge_entry)->bytes_written_uncompressed;
}
part_log->add(part_log_elem);
}
catch (...)
{
tryLogCurrentException(log, __PRETTY_FUNCTION__);
}
MergeTreeData::CurrentlyMovingPartsTagger::CurrentlyMovingPartsTagger(MergeTreeMovingParts && moving_parts_, MergeTreeData & data_)
: parts_to_move(std::move(moving_parts_)), data(data_)
{
for (const auto & moving_part : parts_to_move)
if (!data.currently_moving_parts.emplace(moving_part.part).second)
throw Exception("Cannot move part '" + moving_part.part->name + "'. It's already moving.", ErrorCodes::LOGICAL_ERROR);
}
MergeTreeData::CurrentlyMovingPartsTagger::~CurrentlyMovingPartsTagger()
{
std::lock_guard lock(data.moving_parts_mutex);
for (const auto & moving_part : parts_to_move)
{
/// Something went completely wrong
if (!data.currently_moving_parts.count(moving_part.part))
std::terminate();
data.currently_moving_parts.erase(moving_part.part);
}
}
bool MergeTreeData::selectPartsAndMove()
{
if (parts_mover.moves_blocker.isCancelled())
return false;
auto moving_tagger = selectPartsForMove();
if (moving_tagger.parts_to_move.empty())
return false;
return moveParts(std::move(moving_tagger));
}
bool MergeTreeData::movePartsToSpace(const DataPartsVector & parts, DiskSpace::SpacePtr space)
{
if (parts_mover.moves_blocker.isCancelled())
return false;
auto moving_tagger = checkPartsForMove(parts, space);
if (moving_tagger.parts_to_move.empty())
return false;
return moveParts(std::move(moving_tagger));
}
MergeTreeData::CurrentlyMovingPartsTagger MergeTreeData::selectPartsForMove()
{
MergeTreeMovingParts parts_to_move;
auto can_move = [this](const DataPartPtr & part, String * reason) -> bool
{
if (partIsAssignedToBackgroundOperation(part))
{
*reason = "part already assigned to background operation.";
return false;
}
if (currently_moving_parts.count(part))
{
*reason = "part is already moving.";
return false;
}
return true;
};
std::lock_guard moving_lock(moving_parts_mutex);
parts_mover.selectPartsForMove(parts_to_move, can_move, moving_lock);
return CurrentlyMovingPartsTagger(std::move(parts_to_move), *this);
}
MergeTreeData::CurrentlyMovingPartsTagger MergeTreeData::checkPartsForMove(const DataPartsVector & parts, DiskSpace::SpacePtr space)
{
std::lock_guard moving_lock(moving_parts_mutex);
MergeTreeMovingParts parts_to_move;
for (const auto & part : parts)
{
auto reservation = space->reserve(part->bytes_on_disk);
if (!reservation)
throw Exception("Move is not possible. Not enough space on '" + space->getName() + "'", ErrorCodes::NOT_ENOUGH_SPACE);
auto & reserved_disk = reservation->getDisk();
String path_to_clone = getFullPathOnDisk(reserved_disk);
if (Poco::File(path_to_clone + part->name).exists())
throw Exception(
"Move is not possible: " + path_to_clone + part->name + " already exists",
ErrorCodes::DIRECTORY_ALREADY_EXISTS);
if (currently_moving_parts.count(part) || partIsAssignedToBackgroundOperation(part))
throw Exception(
"Cannot move part '" + part->name + "' because it's participating in background process",
ErrorCodes::PART_IS_TEMPORARILY_LOCKED);
parts_to_move.emplace_back(part, std::move(reservation));
}
return CurrentlyMovingPartsTagger(std::move(parts_to_move), *this);
}
bool MergeTreeData::moveParts(CurrentlyMovingPartsTagger && moving_tagger)
{
LOG_INFO(log, "Got " << moving_tagger.parts_to_move.size() << " parts to move.");
for (const auto & moving_part : moving_tagger.parts_to_move)
{
Stopwatch stopwatch;
DataPartPtr cloned_part;
auto write_part_log = [&](const ExecutionStatus & execution_status)
{
writePartLog(
PartLogElement::Type::MOVE_PART,
execution_status,
stopwatch.elapsed(),
moving_part.part->name,
cloned_part,
{moving_part.part},
nullptr);
};
try
{
cloned_part = parts_mover.clonePart(moving_part);
parts_mover.swapClonedPart(cloned_part);
write_part_log({});
}
catch (...)
{
write_part_log(ExecutionStatus::fromCurrentException());
if (cloned_part)
cloned_part->remove();
throw;
}
}
return true;
}
}
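
The `CurrentlyMovingPartsTagger` above follows the RAII pattern: parts are registered in a shared set on construction and unregistered on destruction, so a failed move cannot leave a part marked as moving. A minimal standalone sketch of that pattern follows. `Storage` and `MovingTagger` are hypothetical stand-ins rather than names from this commit, and unlike the code above the sketch takes the mutex inside the constructor and rolls back partial registrations before throwing.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for MergeTreeData: only the moving set and its mutex.
struct Storage
{
    std::set<std::string> currently_moving;
    std::mutex moving_mutex;
};

// RAII guard: registers part names on construction, unregisters them on destruction.
class MovingTagger
{
public:
    MovingTagger(std::vector<std::string> parts_, Storage & storage_)
        : parts(std::move(parts_)), storage(storage_)
    {
        // Assumption of this sketch: the guard locks by itself.
        std::lock_guard<std::mutex> lock(storage.moving_mutex);
        for (std::size_t i = 0; i < parts.size(); ++i)
        {
            if (!storage.currently_moving.insert(parts[i]).second)
            {
                // The destructor will not run after a throw, so undo earlier registrations here.
                for (std::size_t j = 0; j < i; ++j)
                    storage.currently_moving.erase(parts[j]);
                throw std::logic_error("Cannot move part '" + parts[i] + "': it is already moving");
            }
        }
    }

    MovingTagger(const MovingTagger &) = delete;
    MovingTagger & operator=(const MovingTagger &) = delete;

    ~MovingTagger()
    {
        std::lock_guard<std::mutex> lock(storage.moving_mutex);
        for (const auto & name : parts)
        {
            // Every name registered by the constructor must still be present.
            assert(storage.currently_moving.count(name) == 1);
            storage.currently_moving.erase(name);
        }
    }

private:
    std::vector<std::string> parts;
    Storage & storage;
};
```

Because the guard is non-copyable, each registration has exactly one owner, and leaving the scope (normally or through an exception) always releases the parts.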

View File

@ -8,6 +8,7 @@
#include <Storages/MergeTree/MergeTreePartInfo.h>
#include <Storages/MergeTree/MergeTreeSettings.h>
#include <Storages/MergeTree/MergeTreeMutationStatus.h>
#include <Storages/MergeTree/MergeList.h>
#include <IO/ReadBufferFromString.h>
#include <IO/WriteBufferFromFile.h>
#include <IO/ReadBufferFromFile.h>
@ -16,6 +17,9 @@
#include <DataStreams/GraphiteRollupSortedBlockInputStream.h>
#include <Storages/MergeTree/MergeTreeDataPart.h>
#include <Storages/IndicesDescription.h>
#include <Storages/MergeTree/MergeTreePartsMover.h>
#include <Interpreters/PartLog.h>
#include <Common/DiskSpaceMonitor.h>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
@ -26,7 +30,9 @@
namespace DB
{
class MergeListEntry;
class AlterCommands;
class MergeTreePartsMover;
namespace ErrorCodes
{
@ -251,7 +257,13 @@ public:
struct PartsTemporaryRename : private boost::noncopyable
{
PartsTemporaryRename(const MergeTreeData & storage_, const String & base_dir_) : storage(storage_), base_dir(base_dir_) {}
PartsTemporaryRename(
const MergeTreeData & storage_,
const String & source_dir_)
: storage(storage_)
, source_dir(source_dir_)
{
}
void addPart(const String & old_name, const String & new_name);
@ -262,8 +274,9 @@ public:
~PartsTemporaryRename();
const MergeTreeData & storage;
String base_dir;
const String source_dir;
std::vector<std::pair<String, String>> old_and_new_names;
std::unordered_map<String, String> old_part_name_to_full_path;
bool renamed = false;
};
@ -302,7 +315,7 @@ public:
String getModeName() const;
};
/// Attach the table corresponding to the directory in full_path (must end with /), with the given columns.
/// Attach the table corresponding to the directory in full_path inside policy (must end with /), with the given columns.
/// Correctness of names and paths is not checked.
///
/// date_column_name - if not empty, the name of the Date column used for partitioning by month.
@ -319,7 +332,6 @@ public:
/// require_part_metadata - should checksums.txt and columns.txt exist in the part directory.
/// attach - whether the existing table is attached or the new table is created.
MergeTreeData(const String & database_, const String & table_,
const String & full_path_,
const ColumnsDescription & columns_,
const IndicesDescription & indices_,
const ConstraintsDescription & constraints_,
@ -348,6 +360,8 @@ public:
Names getColumnsRequiredForFinal() const override { return sorting_key_expr->getRequiredColumns(); }
Names getSortingKeyColumns() const override { return sorting_key_columns; }
DiskSpace::StoragePolicyPtr getStoragePolicy() const override { return storage_policy; }
bool supportsPrewhere() const override { return true; }
bool supportsSampling() const override { return sample_by_ast != nullptr; }
@ -393,7 +407,6 @@ public:
/// Load the set of data parts from disk. Call once - immediately after the object is created.
void loadDataParts(bool skip_sanity_checks);
String getFullPath() const { return full_path; }
String getLogName() const { return log_name; }
Int64 getMaxBlockNumber() const;
@ -424,7 +437,11 @@ public:
/// Returns a committed part with the given name or a part containing it. If there is no such part, returns nullptr.
DataPartPtr getActiveContainingPart(const String & part_name);
DataPartPtr getActiveContainingPart(const MergeTreePartInfo & part_info);
DataPartPtr getActiveContainingPart(const MergeTreePartInfo & part_info, DataPartState state, DataPartsLock &lock);
DataPartPtr getActiveContainingPart(const MergeTreePartInfo & part_info, DataPartState state, DataPartsLock & lock);
/// Swap part with its identical copy (possibly with another path on another disk).
/// If the original part is not active or doesn't exist, an exception will be thrown.
void swapActivePart(MergeTreeData::DataPartPtr part_copy);
/// Returns all parts in specified partition
DataPartsVector getDataPartsVectorInPartition(DataPartState state, const String & partition_id);
@ -518,7 +535,8 @@ public:
/// Moves the entire data directory.
/// Flushes the uncompressed blocks cache and the marks cache.
/// Must be called with locked lockStructureForAlter().
void setPath(const String & full_path);
void rename(const String & new_path_to_db, const String & new_database_name,
const String & new_table_name, TableStructureWriteLockHolder &) override;
/// Check if the ALTER can be performed:
/// - all needed columns are present.
@ -570,7 +588,7 @@ public:
bool hasAnyColumnTTL() const { return !ttl_entries_by_name.empty(); }
/// Check that the part is not broken and calculate the checksums for it if they are not present.
MutableDataPartPtr loadPartAndFixMetadata(const String & relative_path);
MutableDataPartPtr loadPartAndFixMetadata(const DiskSpace::DiskPtr & disk, const String & relative_path);
void loadPartAndFixMetadata(MutableDataPartPtr part);
/** Create local backup (snapshot) for parts with specified prefix.
@ -579,6 +597,14 @@ public:
*/
void freezePartition(const ASTPtr & partition, const String & with_name, const Context & context, TableStructureReadLockHolder & table_lock_holder);
public:
/// Moves partition to specified Disk
void movePartitionToDisk(const ASTPtr & partition, const String & name, bool moving_part, const Context & context);
/// Moves partition to specified Volume
void movePartitionToVolume(const ASTPtr & partition, const String & name, bool moving_part, const Context & context);
size_t getColumnCompressedSize(const std::string & name) const
{
auto lock = lockParts();
@ -608,8 +634,8 @@ public:
MergeTreeData & checkStructureAndGetMergeTreeData(const StoragePtr & source_table) const;
MergeTreeData & checkStructureAndGetMergeTreeData(IStorage * source_table) const;
MergeTreeData::MutableDataPartPtr cloneAndLoadDataPart(const MergeTreeData::DataPartPtr & src_part, const String & tmp_part_prefix,
const MergeTreePartInfo & dst_part_info);
MergeTreeData::MutableDataPartPtr cloneAndLoadDataPart(
const MergeTreeData::DataPartPtr & src_part, const String & tmp_part_prefix, const MergeTreePartInfo & dst_part_info);
virtual std::vector<MergeTreeMutationStatus> getMutationsStatus() const = 0;
@ -630,6 +656,25 @@ public:
return storage_settings.get();
}
/// Get table path on disk
String getFullPathOnDisk(const DiskSpace::DiskPtr & disk) const;
/// Get disk for part. Loops through directories on the FS because some parts may not be in
/// the active data parts set (detached)
DiskSpace::DiskPtr getDiskForPart(const String & part_name, const String & relative_path = "") const;
/// Get full path for part. Uses getDiskForPart and returns the full path
String getFullPathForPart(const String & part_name, const String & relative_path = "") const;
Strings getDataPaths() const override;
/// Reserves at least 1 MB of space
DiskSpace::ReservationPtr reserveSpace(UInt64 expected_size);
/// Choose disk with max available free space
/// Reserves 0 bytes
DiskSpace::ReservationPtr makeEmptyReservationOnLargestDisk() { return storage_policy->makeEmptyReservationOnLargestDisk(); }
MergeTreeDataFormatVersion format_version;
Context global_context;
@ -687,6 +732,16 @@ public:
bool has_non_adaptive_index_granularity_parts = false;
/// Parts that are currently moving from one disk/volume to another.
/// This set has to be used with `currently_processing_in_background_mutex`.
/// Moving may conflict with merges and mutations, but this is OK, because
/// if we decide to move some part to another disk, then we
/// assuredly will choose this disk for the containing part, which will appear
/// as a result of the merge or mutation.
DataParts currently_moving_parts;
/// Mutex for currently_moving_parts
mutable std::mutex moving_parts_mutex;
protected:
@ -706,7 +761,7 @@ protected:
String database_name;
String table_name;
String full_path;
/// Current column sizes in compressed and uncompressed form.
ColumnSizeByName column_sizes;
@ -721,6 +776,8 @@ protected:
/// Use get and set to receive readonly versions.
MultiVersion<MergeTreeSettings> storage_settings;
DiskSpace::StoragePolicyPtr storage_policy;
/// Work with data parts
struct TagByInfo{};
@ -758,6 +815,8 @@ protected:
DataPartsIndexes::index<TagByInfo>::type & data_parts_by_info;
DataPartsIndexes::index<TagByStateAndInfo>::type & data_parts_by_state_and_info;
MergeTreePartsMover parts_mover;
using DataPartIteratorByInfo = DataPartsIndexes::index<TagByInfo>::type::iterator;
using DataPartIteratorByStateAndInfo = DataPartsIndexes::index<TagByStateAndInfo>::type::iterator;
@ -802,7 +861,6 @@ protected:
throw Exception("Can't modify " + (*it)->getNameWithState(), ErrorCodes::LOGICAL_ERROR);
}
/// Used to serialize calls to grabOldParts.
std::mutex grab_old_parts_mutex;
/// The same for clearOldTemporaryDirectories.
@ -856,6 +914,48 @@ protected:
bool canReplacePartition(const DataPartPtr & data_part) const;
void writePartLog(
PartLogElement::Type type,
const ExecutionStatus & execution_status,
UInt64 elapsed_ns,
const String & new_part_name,
const DataPartPtr & result_part,
const DataPartsVector & source_parts,
const MergeListEntry * merge_entry);
/// If part is assigned to merge or mutation (possibly replicated)
/// Should be overridden by child classes, because they can have different
/// mechanisms for parts locking
virtual bool partIsAssignedToBackgroundOperation(const DataPartPtr & part) const = 0;
/// Moves part to specified space, used in ALTER ... MOVE ... queries
bool movePartsToSpace(const DataPartsVector & parts, DiskSpace::SpacePtr space);
/// Selects parts for move and moves them, used in background process
bool selectPartsAndMove();
private:
/// RAII Wrapper for atomic work with currently moving parts
/// Acquire them in constructor and remove them in destructor
/// Uses data.moving_parts_mutex
struct CurrentlyMovingPartsTagger
{
MergeTreeMovingParts parts_to_move;
MergeTreeData & data;
CurrentlyMovingPartsTagger(MergeTreeMovingParts && moving_parts_, MergeTreeData & data_);
CurrentlyMovingPartsTagger(const CurrentlyMovingPartsTagger & other) = delete;
~CurrentlyMovingPartsTagger();
};
/// Move selected parts to corresponding disks
bool moveParts(CurrentlyMovingPartsTagger && parts_to_move);
/// Select parts for move and disks for them. Used in background moving processes.
CurrentlyMovingPartsTagger selectPartsForMove();
/// Check selected parts for movements. Used by ALTER ... MOVE queries.
CurrentlyMovingPartsTagger checkPartsForMove(const DataPartsVector & parts, DiskSpace::SpacePtr space);
};
}

View File

@ -3,13 +3,12 @@
#include <Storages/MergeTree/MergeTreeSequentialBlockInputStream.h>
#include <Storages/MergeTree/MergedBlockOutputStream.h>
#include <Storages/MergeTree/MergedColumnOnlyOutputStream.h>
#include <Storages/MergeTree/DiskSpaceMonitor.h>
#include <Common/DiskSpaceMonitor.h>
#include <Storages/MergeTree/SimpleMergeSelector.h>
#include <Storages/MergeTree/AllMergeSelector.h>
#include <Storages/MergeTree/TTLMergeSelector.h>
#include <Storages/MergeTree/MergeList.h>
#include <Storages/MergeTree/StorageFromMergeTreeDataPart.h>
#include <Storages/MergeTree/BackgroundProcessingPool.h>
#include <DataStreams/TTLBlockInputStream.h>
#include <DataStreams/DistinctSortedBlockInputStream.h>
#include <DataStreams/ExpressionBlockInputStream.h>
@ -120,18 +119,17 @@ void FutureMergedMutatedPart::assign(MergeTreeData::DataPartsVector parts_)
name = part_info.getPartName();
}
MergeTreeDataMergerMutator::MergeTreeDataMergerMutator(MergeTreeData & data_, const BackgroundProcessingPool & pool_)
: data(data_), pool(pool_), log(&Logger::get(data.getLogName() + " (MergerMutator)"))
MergeTreeDataMergerMutator::MergeTreeDataMergerMutator(MergeTreeData & data_, size_t background_pool_size_)
: data(data_), background_pool_size(background_pool_size_), log(&Logger::get(data.getLogName() + " (MergerMutator)"))
{
}
UInt64 MergeTreeDataMergerMutator::getMaxSourcePartsSizeForMerge()
{
size_t total_threads_in_pool = pool.getNumberOfThreads();
size_t busy_threads_in_pool = CurrentMetrics::values[CurrentMetrics::BackgroundPoolTask].load(std::memory_order_relaxed);
return getMaxSourcePartsSizeForMerge(total_threads_in_pool, busy_threads_in_pool == 0 ? 0 : busy_threads_in_pool - 1); /// 1 is current thread
return getMaxSourcePartsSizeForMerge(background_pool_size, busy_threads_in_pool == 0 ? 0 : busy_threads_in_pool - 1); /// 1 is current thread
}
@ -152,20 +150,18 @@ UInt64 MergeTreeDataMergerMutator::getMaxSourcePartsSizeForMerge(size_t pool_siz
data_settings->max_bytes_to_merge_at_max_space_in_pool,
static_cast<double>(free_entries) / data_settings->number_of_free_entries_in_pool_to_lower_max_size_of_merge);
return std::min(max_size, static_cast<UInt64>(DiskSpaceMonitor::getUnreservedFreeSpace(data.full_path) / DISK_USAGE_COEFFICIENT_TO_SELECT));
return std::min(max_size, static_cast<UInt64>(data.storage_policy->getMaxUnreservedFreeSpace() / DISK_USAGE_COEFFICIENT_TO_SELECT));
}
UInt64 MergeTreeDataMergerMutator::getMaxSourcePartSizeForMutation()
{
const auto data_settings = data.getSettings();
size_t total_threads_in_pool = pool.getNumberOfThreads();
size_t busy_threads_in_pool = CurrentMetrics::values[CurrentMetrics::BackgroundPoolTask].load(std::memory_order_relaxed);
/// Allow mutations only if there are enough threads; otherwise leave free threads for merges
if (total_threads_in_pool - busy_threads_in_pool >= data_settings->number_of_free_entries_in_pool_to_execute_mutation)
return static_cast<UInt64>(DiskSpaceMonitor::getUnreservedFreeSpace(data.full_path) / DISK_USAGE_COEFFICIENT_TO_RESERVE);
if (background_pool_size - busy_threads_in_pool >= data_settings->number_of_free_entries_in_pool_to_execute_mutation)
return static_cast<UInt64>(data.storage_policy->getMaxUnreservedFreeSpace() / DISK_USAGE_COEFFICIENT_TO_RESERVE);
return 0;
}
@ -276,7 +272,6 @@ bool MergeTreeDataMergerMutator::selectPartsToMerge(
return true;
}
bool MergeTreeDataMergerMutator::selectAllPartsToMergeWithinPartition(
FutureMergedMutatedPart & future_part,
UInt64 & available_disk_space,
@ -325,9 +320,7 @@ bool MergeTreeDataMergerMutator::selectAllPartsToMergeWithinPartition(
disk_space_warning_time = now;
LOG_WARNING(log, "Won't merge parts from " << parts.front()->name << " to " << (*prev_it)->name
<< " because not enough free space: "
<< formatReadableSizeWithBinarySuffix(available_disk_space) << " free and unreserved "
<< "(" << formatReadableSizeWithBinarySuffix(DiskSpaceMonitor::getReservedSpace()) << " reserved in "
<< DiskSpaceMonitor::getReservationCount() << " chunks), "
<< formatReadableSizeWithBinarySuffix(available_disk_space) << " free and unreserved, "
<< formatReadableSizeWithBinarySuffix(sum_bytes)
<< " required now (+" << static_cast<int>((DISK_USAGE_COEFFICIENT_TO_SELECT - 1.0) * 100)
<< "% on overhead); suppressing similar warnings for the next hour");
@ -536,7 +529,7 @@ public:
/// parts should be sorted.
MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTemporaryPart(
const FutureMergedMutatedPart & future_part, MergeList::Entry & merge_entry, TableStructureReadLockHolder &,
time_t time_of_merge, DiskSpaceMonitor::Reservation * disk_reservation, bool deduplicate, bool force_ttl)
time_t time_of_merge, DiskSpace::Reservation * space_reservation, bool deduplicate, bool force_ttl)
{
static const String TMP_PREFIX = "tmp_merge_";
@ -549,7 +542,8 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
<< parts.front()->name << " to " << parts.back()->name
<< " into " << TMP_PREFIX + future_part.name);
String new_part_tmp_path = data.getFullPath() + TMP_PREFIX + future_part.name + "/";
String part_path = data.getFullPathOnDisk(space_reservation->getDisk());
String new_part_tmp_path = part_path + TMP_PREFIX + future_part.name + "/";
if (Poco::File(new_part_tmp_path).exists())
throw Exception("Directory " + new_part_tmp_path + " already exists", ErrorCodes::DIRECTORY_ALREADY_EXISTS);
@ -569,7 +563,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
data.merging_params, gathering_columns, gathering_column_names, merging_columns, merging_column_names);
MergeTreeData::MutableDataPartPtr new_data_part = std::make_shared<MergeTreeData::DataPart>(
data, future_part.name, future_part.part_info);
data, space_reservation->getDisk(), future_part.name, future_part.part_info);
new_data_part->partition.assign(future_part.getPartition());
new_data_part->relative_path = TMP_PREFIX + future_part.name;
new_data_part->is_temp = true;
@ -743,7 +737,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
to.writePrefix();
size_t rows_written = 0;
const size_t initial_reservation = disk_reservation ? disk_reservation->getSize() : 0;
const size_t initial_reservation = space_reservation ? space_reservation->getSize() : 0;
auto is_cancelled = [&]() { return merges_blocker.isCancelled()
|| (need_remove_expired_values && ttl_merges_blocker.isCancelled()); };
@ -759,7 +753,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
merge_entry->bytes_written_uncompressed = merged_stream->getProfileInfo().bytes;
/// Reservation updates are not performed yet; during the merge this may lead to higher free space requirements
if (disk_reservation && sum_input_rows_upper_bound)
if (space_reservation && sum_input_rows_upper_bound)
{
/// The same progress from merge_entry could be used for both algorithms (it should be more accurate)
/// But now we are using inaccurate row-based estimation in Horizontal case for backward compatibility
@ -767,9 +761,10 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mergePartsToTempor
? std::min(1., 1. * rows_written / sum_input_rows_upper_bound)
: std::min(1., merge_entry->progress.load(std::memory_order_relaxed));
disk_reservation->update(static_cast<size_t>((1. - progress) * initial_reservation));
space_reservation->update(static_cast<size_t>((1. - progress) * initial_reservation));
}
}
merged_stream->readSuffix();
merged_stream.reset();
@ -904,6 +899,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mutatePartToTempor
const std::vector<MutationCommand> & commands,
MergeListEntry & merge_entry,
const Context & context,
DiskSpace::Reservation * space_reservation,
TableStructureReadLockHolder & table_lock_holder)
{
auto check_not_cancelled = [&]()
@ -950,7 +946,7 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataMergerMutator::mutatePartToTempor
LOG_TRACE(log, "Mutating part " << source_part->name << " to mutation version " << future_part.part_info.mutation);
MergeTreeData::MutableDataPartPtr new_data_part = std::make_shared<MergeTreeData::DataPart>(
data, future_part.name, future_part.part_info);
data, space_reservation->getDisk(), future_part.name, future_part.part_info);
new_data_part->relative_path = "tmp_mut_" + future_part.name;
new_data_part->is_temp = true;
new_data_part->ttl_infos = source_part->ttl_infos;
@ -1239,6 +1235,7 @@ MergeTreeData::DataPartPtr MergeTreeDataMergerMutator::renameMergedTemporaryPart
return new_data_part;
}
size_t MergeTreeDataMergerMutator::estimateNeededDiskSpace(const MergeTreeData::DataPartsVector & source_parts)
{
size_t res = 0;
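
The merge loop earlier in this file shrinks the space reservation as the merge progresses, using `(1. - progress) * initial_reservation`. The arithmetic can be sketched in isolation as below. `remainingReservation` and `rowProgress` are hypothetical helper names introduced only for illustration, and the extra lower clamp at 0 is an assumption of this sketch that the hunk itself does not contain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes that should stay reserved once a fraction
// `progress` of the merge has been written.
std::size_t remainingReservation(std::size_t initial_reservation, double progress)
{
    // Clamp to [0, 1] so a noisy estimate can neither grow the reservation nor go negative.
    progress = std::min(1.0, std::max(0.0, progress));
    return static_cast<std::size_t>((1.0 - progress) * static_cast<double>(initial_reservation));
}

// Row-based progress estimate, mirroring the horizontal-algorithm branch:
// rows written so far divided by the upper bound of input rows, capped at 1.
double rowProgress(std::size_t rows_written, std::size_t rows_upper_bound)
{
    assert(rows_upper_bound > 0);
    return std::min(1.0, 1.0 * rows_written / rows_upper_bound);
}
```

With an initial reservation of 1000 bytes and a quarter of the rows written, 750 bytes would stay reserved; once the row count reaches the upper bound, the reservation drops to zero.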

View File

@ -1,7 +1,6 @@
#pragma once
#include <Storages/MergeTree/MergeTreeData.h>
#include <Storages/MergeTree/DiskSpaceMonitor.h>
#include <Storages/MutationCommands.h>
#include <atomic>
#include <functional>
@ -32,15 +31,17 @@ struct FutureMergedMutatedPart
void assign(MergeTreeData::DataPartsVector parts_);
};
/** Can select the parts to merge and merge them.
*/
/** Can select parts for background processes and do them.
* Currently helps with merges, mutations and moves
*/
class MergeTreeDataMergerMutator
{
public:
using AllowedMergingPredicate = std::function<bool (const MergeTreeData::DataPartPtr &, const MergeTreeData::DataPartPtr &, String * reason)>;
public:
MergeTreeDataMergerMutator(MergeTreeData & data_, const BackgroundProcessingPool & pool_);
MergeTreeDataMergerMutator(MergeTreeData & data_, size_t background_pool_size);
/** Get maximum total size of parts to do merge, at current moment of time.
* It depends on number of free threads in background_pool and amount of free space in disk.
@ -71,6 +72,7 @@ public:
const AllowedMergingPredicate & can_merge,
String * out_disable_reason = nullptr);
/** Select all the parts in the specified partition for merge, if possible.
* final - choose to merge even a single part - that is, allow to merge one part "with itself".
*/
@ -95,19 +97,22 @@ public:
MergeTreeData::MutableDataPartPtr mergePartsToTemporaryPart(
const FutureMergedMutatedPart & future_part,
MergeListEntry & merge_entry, TableStructureReadLockHolder & table_lock_holder, time_t time_of_merge,
DiskSpaceMonitor::Reservation * disk_reservation, bool deduplication, bool force_ttl);
DiskSpace::Reservation * disk_reservation, bool deduplication, bool force_ttl);
/// Mutate a single data part with the specified commands. Will create and return a temporary part.
MergeTreeData::MutableDataPartPtr mutatePartToTemporaryPart(
const FutureMergedMutatedPart & future_part,
const std::vector<MutationCommand> & commands,
MergeListEntry & merge_entry, const Context & context, TableStructureReadLockHolder & table_lock_holder);
MergeListEntry & merge_entry, const Context & context,
DiskSpace::Reservation * disk_reservation,
TableStructureReadLockHolder & table_lock_holder);
MergeTreeData::DataPartPtr renameMergedTemporaryPart(
MergeTreeData::MutableDataPartPtr & new_data_part,
const MergeTreeData::DataPartsVector & parts,
MergeTreeData::Transaction * out_transaction = nullptr);
/// The approximate amount of disk space needed for merge or mutation. With a surplus.
static size_t estimateNeededDiskSpace(const MergeTreeData::DataPartsVector & source_parts);
@ -137,7 +142,7 @@ private:
private:
MergeTreeData & data;
const BackgroundProcessingPool & pool;
const size_t background_pool_size;
Logger * log;

View File

@ -134,16 +134,22 @@ void MergeTreeDataPart::MinMaxIndex::merge(const MinMaxIndex & other)
}
MergeTreeDataPart::MergeTreeDataPart(MergeTreeData & storage_, const String & name_)
MergeTreeDataPart::MergeTreeDataPart(MergeTreeData & storage_, const DiskSpace::DiskPtr & disk_, const String & name_)
: storage(storage_)
, disk(disk_)
, name(name_)
, info(MergeTreePartInfo::fromPartName(name_, storage.format_version))
, index_granularity_info(storage)
{
}
MergeTreeDataPart::MergeTreeDataPart(const MergeTreeData & storage_, const String & name_, const MergeTreePartInfo & info_)
MergeTreeDataPart::MergeTreeDataPart(
const MergeTreeData & storage_,
const DiskSpace::DiskPtr & disk_,
const String & name_,
const MergeTreePartInfo & info_)
: storage(storage_)
, disk(disk_)
, name(name_)
, info(info_)
, index_granularity_info(storage)
@ -240,9 +246,9 @@ String MergeTreeDataPart::getColumnNameWithMinumumCompressedSize() const
String MergeTreeDataPart::getFullPath() const
{
if (relative_path.empty())
throw Exception("Part relative_path cannot be empty. This is bug.", ErrorCodes::LOGICAL_ERROR);
throw Exception("Part relative_path cannot be empty. It's bug.", ErrorCodes::LOGICAL_ERROR);
return storage.full_path + relative_path + "/";
return storage.getFullPathOnDisk(disk) + relative_path + "/";
}
String MergeTreeDataPart::getNameWithPrefix() const
@ -308,7 +314,7 @@ time_t MergeTreeDataPart::getMaxTime() const
MergeTreeDataPart::~MergeTreeDataPart()
{
if (is_temp)
if (state == State::DeleteOnDestroy || is_temp)
{
try
{
@ -318,11 +324,14 @@ MergeTreeDataPart::~MergeTreeDataPart()
if (!dir.exists())
return;
if (!startsWith(getNameWithPrefix(), "tmp"))
if (is_temp)
{
LOG_ERROR(storage.log, "~DataPart() should remove part " << path
<< " but its name doesn't start with tmp. Too suspicious, keeping the part.");
return;
if (!startsWith(getNameWithPrefix(), "tmp"))
{
LOG_ERROR(storage.log, "~DataPart() should remove part " << path
<< " but its name doesn't start with tmp. Too suspicious, keeping the part.");
return;
}
}
dir.remove(true);
@ -364,10 +373,11 @@ void MergeTreeDataPart::remove() const
* And a race condition can happen that will lead to "File not found" error here.
*/
String full_path = storage.getFullPathOnDisk(disk);
String from = full_path + relative_path;
String to = full_path + "delete_tmp_" + name;
// TODO directory delete_tmp_<name> is never removed if server crashes before returning from this function
String from = storage.full_path + relative_path;
String to = storage.full_path + "delete_tmp_" + name;
Poco::File from_dir{from};
Poco::File to_dir{to};
@ -447,7 +457,7 @@ void MergeTreeDataPart::remove() const
void MergeTreeDataPart::renameTo(const String & new_relative_path, bool remove_new_dir_if_exists) const
{
String from = getFullPath();
String to = storage.full_path + new_relative_path + "/";
String to = storage.getFullPathOnDisk(disk) + new_relative_path + "/";
Poco::File from_file(from);
if (!from_file.exists())
@ -468,7 +478,7 @@ void MergeTreeDataPart::renameTo(const String & new_relative_path, bool remove_n
}
else
{
throw Exception("part directory " + to + " already exists", ErrorCodes::DIRECTORY_ALREADY_EXISTS);
throw Exception("Part directory " + to + " already exists", ErrorCodes::DIRECTORY_ALREADY_EXISTS);
}
}
@ -495,7 +505,7 @@ String MergeTreeDataPart::getRelativePathForDetachedPart(const String & prefix)
res = "detached/" + (prefix.empty() ? "" : prefix + "_")
+ name + (try_no ? "_try" + DB::toString(try_no) : "");
if (!Poco::File(storage.full_path + res).exists())
if (!Poco::File(storage.getFullPathOnDisk(disk) + res).exists())
return res;
LOG_WARNING(storage.log, "Directory " << res << " (to detach to) already exists."
@ -519,11 +529,27 @@ UInt64 MergeTreeDataPart::getMarksCount() const
void MergeTreeDataPart::makeCloneInDetached(const String & prefix) const
{
Poco::Path src(getFullPath());
Poco::Path dst(storage.full_path + getRelativePathForDetachedPart(prefix));
Poco::Path dst(storage.getFullPathOnDisk(disk) + getRelativePathForDetachedPart(prefix));
/// Backup is not recursive (max_level is 0), so do not copy inner directories
localBackup(src, dst, 0);
}
void MergeTreeDataPart::makeCloneOnDiskDetached(const DiskSpace::ReservationPtr & reservation) const
{
auto & reserved_disk = reservation->getDisk();
if (reserved_disk->getName() == disk->getName())
throw Exception("Can not clone data part " + name + " to same disk " + disk->getName(), ErrorCodes::LOGICAL_ERROR);
String path_to_clone = storage.getFullPathOnDisk(reserved_disk) + "detached/";
if (Poco::File(path_to_clone + relative_path).exists())
throw Exception("Path " + path_to_clone + relative_path + " already exists. Can not clone ", ErrorCodes::DIRECTORY_ALREADY_EXISTS);
Poco::File(path_to_clone).createDirectory();
Poco::File cloning_directory(getFullPath());
cloning_directory.copyTo(path_to_clone);
}
void MergeTreeDataPart::loadColumnsChecksumsIndexes(bool require_columns_checksums, bool check_consistency)
{
/// Memory should not be limited during ATTACH TABLE query.
@ -636,10 +662,10 @@ void MergeTreeDataPart::loadPartitionAndMinMaxIndex()
}
else
{
String full_path = getFullPath();
partition.load(storage, full_path);
String path = getFullPath();
partition.load(storage, path);
if (!isEmpty())
minmax_idx.load(storage, full_path);
minmax_idx.load(storage, path);
}
String calculated_partition_id = partition.getID(storage.partition_key_sample);
@ -955,6 +981,8 @@ String MergeTreeDataPart::stateToString(MergeTreeDataPart::State state)
return "Outdated";
case State::Deleting:
return "Deleting";
case State::DeleteOnDestroy:
return "DeleteOnDestroy";
}
__builtin_unreachable();

View File

@ -32,9 +32,9 @@ struct MergeTreeDataPart
using Checksums = MergeTreeDataPartChecksums;
using Checksum = MergeTreeDataPartChecksums::Checksum;
MergeTreeDataPart(const MergeTreeData & storage_, const String & name_, const MergeTreePartInfo & info_);
MergeTreeDataPart(const MergeTreeData & storage_, const DiskSpace::DiskPtr & disk_, const String & name_, const MergeTreePartInfo & info_);
MergeTreeDataPart(MergeTreeData & storage_, const String & name_);
MergeTreeDataPart(MergeTreeData & storage_, const DiskSpace::DiskPtr & disk_, const String & name_);
/// Returns the name of a column with minimum compressed size (as returned by getColumnSize()).
/// If no checksums are present returns the name of the first physically existing column.
@ -73,6 +73,7 @@ struct MergeTreeDataPart
const MergeTreeData & storage;
DiskSpace::DiskPtr disk;
String name;
MergeTreePartInfo info;
@ -102,20 +103,22 @@ struct MergeTreeDataPart
* Part state should be modified under data_parts mutex.
*
* Possible state transitions:
* Temporary -> Precommitted: we are trying to commit a fetched, inserted or merged part to active set
* Precommitted -> Outdated: we could not to add a part to active set and doing a rollback (for example it is duplicated part)
* Precommitted -> Commited: we successfully committed a part to active dataset
* Precommitted -> Outdated: a part was replaced by a covering part or DROP PARTITION
* Outdated -> Deleting: a cleaner selected this part for deletion
* Deleting -> Outdated: if an ZooKeeper error occurred during the deletion, we will retry deletion
* Temporary -> Precommitted: we are trying to commit a fetched, inserted or merged part to active set
* Precommitted -> Outdated: we could not add a part to active set and are doing a rollback (for example it is a duplicated part)
* Precommitted -> Committed: we successfully committed a part to active dataset
* Committed -> Outdated: a part was replaced by a covering part or DROP PARTITION
* Outdated -> Deleting: a cleaner selected this part for deletion
* Deleting -> Outdated: if a ZooKeeper error occurred during the deletion, we will retry deletion
* Committed -> DeleteOnDestroy if part was moved to another disk
*/
enum class State
{
Temporary, /// the part is generating now, it is not in data_parts list
PreCommitted, /// the part is in data_parts, but not used for SELECTs
Committed, /// active data part, used by current and upcoming SELECTs
Outdated, /// not active data part, but could be used by only current SELECTs, could be deleted after SELECTs finishes
Deleting /// not active data part with identity refcounter, it is deleting right now by a cleaner
Temporary, /// the part is generating now, it is not in data_parts list
PreCommitted, /// the part is in data_parts, but not used for SELECTs
Committed, /// active data part, used by current and upcoming SELECTs
Outdated, /// not active data part, but could be used by only current SELECTs, could be deleted after SELECTs finishes
Deleting, /// not active data part with identity refcounter, it is deleting right now by a cleaner
DeleteOnDestroy, /// part was moved to another disk and should be deleted in own destructor
};
using TTLInfo = MergeTreeDataPartTTLInfo;
@ -256,6 +259,9 @@ struct MergeTreeDataPart
/// Makes clone of a part in detached/ directory via hard links
void makeCloneInDetached(const String & prefix) const;
/// Makes full clone of part in detached/ on another disk
void makeCloneOnDiskDetached(const DiskSpace::ReservationPtr & reservation) const;
/// Populates columns_to_size map (compressed size).
void accumulateColumnSizes(ColumnToSize & column_to_size) const;

View File

@ -198,7 +198,14 @@ MergeTreeData::MutableDataPartPtr MergeTreeDataWriter::writeTempPart(BlockWithPa
else
part_name = new_part_info.getPartName();
MergeTreeData::MutableDataPartPtr new_data_part = std::make_shared<MergeTreeData::DataPart>(data, part_name, new_part_info);
/// Size of part would not be greater than block.bytes() + epsilon
size_t expected_size = block.bytes();
auto reservation = data.reserveSpace(expected_size);
MergeTreeData::MutableDataPartPtr new_data_part =
std::make_shared<MergeTreeData::DataPart>(data, reservation->getDisk(), part_name, new_part_info);
new_data_part->partition = std::move(partition);
new_data_part->minmax_idx = std::move(minmax_idx);
new_data_part->relative_path = TMP_PREFIX + part_name;

View File

@ -9,6 +9,7 @@ namespace DB
{
/// A mutation entry for non-replicated MergeTree storage engines.
/// Stores information about mutation in file mutation_*.txt.
struct MergeTreeMutationEntry
{
time_t create_time = 0;

View File

@ -0,0 +1,178 @@
#include <Storages/MergeTree/MergeTreePartsMover.h>
#include <Storages/MergeTree/MergeTreeData.h>
#include <set>
#include <boost/algorithm/string/join.hpp>
namespace DB
{
namespace ErrorCodes
{
extern const int ABORTED;
extern const int NO_SUCH_DATA_PART;
extern const int LOGICAL_ERROR;
}
namespace
{
/// Contains the minimal number of heaviest parts whose total size on disk is at least the required size.
/// If the total size of all parts is not enough, contains all parts.
class LargestPartsWithRequiredSize
{
struct PartsSizeOnDiskComparator
{
bool operator()(const MergeTreeData::DataPartPtr & f, const MergeTreeData::DataPartPtr & s) const
{
/// If parts have equal sizes, then order them by names (names are unique)
return std::tie(f->bytes_on_disk, f->name) < std::tie(s->bytes_on_disk, s->name);
}
};
std::set<MergeTreeData::DataPartPtr, PartsSizeOnDiskComparator> elems;
UInt64 required_size_sum;
UInt64 current_size_sum = 0;
public:
LargestPartsWithRequiredSize(UInt64 required_sum_size_) : required_size_sum(required_sum_size_) {}
void add(MergeTreeData::DataPartPtr part)
{
if (current_size_sum < required_size_sum)
{
elems.emplace(part);
current_size_sum += part->bytes_on_disk;
return;
}
/// Adding smaller element
if (!elems.empty() && (*elems.begin())->bytes_on_disk >= part->bytes_on_disk)
return;
elems.emplace(part);
current_size_sum += part->bytes_on_disk;
while (!elems.empty() && (current_size_sum - (*elems.begin())->bytes_on_disk >= required_size_sum))
{
current_size_sum -= (*elems.begin())->bytes_on_disk;
elems.erase(elems.begin());
}
}
/// Returns parts ordered by size
MergeTreeData::DataPartsVector getAccumulatedParts()
{
MergeTreeData::DataPartsVector res;
for (const auto & elem : elems)
res.push_back(elem);
return res;
}
};
}
bool MergeTreePartsMover::selectPartsForMove(
MergeTreeMovingParts & parts_to_move,
const AllowedMovingPredicate & can_move,
const std::lock_guard<std::mutex> & /* moving_parts_lock */)
{
MergeTreeData::DataPartsVector data_parts = data->getDataPartsVector();
if (data_parts.empty())
return false;
std::unordered_map<DiskSpace::DiskPtr, LargestPartsWithRequiredSize> need_to_move;
const auto & policy = data->getStoragePolicy();
const auto & volumes = policy->getVolumes();
/// Do not check if policy has one volume
if (volumes.size() == 1)
return false;
/// Do not check last volume
for (size_t i = 0; i != volumes.size() - 1; ++i)
{
for (const auto & disk : volumes[i]->disks)
{
UInt64 required_available_space = disk->getTotalSpace() * policy->getMoveFactor();
UInt64 unreserved_space = disk->getUnreservedSpace();
if (required_available_space > unreserved_space)
need_to_move.emplace(disk, required_available_space - unreserved_space);
}
}
for (const auto & part : data_parts)
{
String reason;
/// Don't report message to log, because logging is excessive
if (!can_move(part, &reason))
continue;
auto to_insert = need_to_move.find(part->disk);
if (to_insert != need_to_move.end())
to_insert->second.add(part);
}
for (auto && move : need_to_move)
{
auto min_volume_priority = policy->getVolumeIndexByDisk(move.first) + 1;
for (auto && part : move.second.getAccumulatedParts())
{
auto reservation = policy->reserve(part->bytes_on_disk, min_volume_priority);
if (!reservation)
{
/// The next parts to move from this disk have greater size and the same min volume priority.
/// There is no space for them.
/// But it may still be possible to move data from other disks.
break;
}
parts_to_move.emplace_back(part, std::move(reservation));
}
}
return !parts_to_move.empty();
}
MergeTreeData::DataPartPtr MergeTreePartsMover::clonePart(const MergeTreeMoveEntry & moving_part) const
{
if (moves_blocker.isCancelled())
throw Exception("Cancelled moving parts.", ErrorCodes::ABORTED);
LOG_TRACE(log, "Cloning part " << moving_part.part->name);
moving_part.part->makeCloneOnDiskDetached(moving_part.reserved_space);
MergeTreeData::MutableDataPartPtr cloned_part =
std::make_shared<MergeTreeData::DataPart>(*data, moving_part.reserved_space->getDisk(), moving_part.part->name);
cloned_part->relative_path = "detached/" + moving_part.part->name;
LOG_TRACE(log, "Part " << moving_part.part->name << " was cloned to " << cloned_part->getFullPath());
cloned_part->loadColumnsChecksumsIndexes(true, true);
return cloned_part;
}
void MergeTreePartsMover::swapClonedPart(const MergeTreeData::DataPartPtr & cloned_part) const
{
if (moves_blocker.isCancelled())
throw Exception("Cancelled moving parts.", ErrorCodes::ABORTED);
auto active_part = data->getActiveContainingPart(cloned_part->name);
/// It's ok, because we don't block moving parts for merges or mutations
if (!active_part || active_part->name != cloned_part->name)
{
LOG_INFO(log, "Failed to swap " << cloned_part->name << ". Active part doesn't exist."
<< " Possible it was merged or mutated. Will remove copy on path '" << cloned_part->getFullPath() << "'.");
return;
}
cloned_part->renameTo(active_part->name);
/// TODO what happens if the server goes down here?
data->swapActivePart(cloned_part);
LOG_TRACE(log, "Part " << cloned_part->name << " was moved to " << cloned_part->getFullPath());
}
}
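A minimal standalone sketch of the selection logic in `LargestPartsWithRequiredSize`, for readers who want to try it outside the ClickHouse tree. `Part`, `PartSizeLess` and `LargestPartsSketch` are hypothetical stand-ins, not names from this commit; the sketch keeps only the two fields the comparator reads and mirrors the accept-then-evict flow of `add()` above.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a data part: only the fields the selector reads.
struct Part
{
    std::string name;
    uint64_t bytes_on_disk;
};

// Orders parts by size; ties are broken by name (names are assumed unique).
struct PartSizeLess
{
    bool operator()(const Part & a, const Part & b) const
    {
        return std::tie(a.bytes_on_disk, a.name) < std::tie(b.bytes_on_disk, b.name);
    }
};

class LargestPartsSketch
{
public:
    explicit LargestPartsSketch(uint64_t required_size_sum_) : required_size_sum(required_size_sum_) {}

    void add(const Part & part)
    {
        // Until the target is reached, accept every part.
        if (current_size_sum < required_size_sum)
        {
            if (elems.insert(part).second)
                current_size_sum += part.bytes_on_disk;
            return;
        }

        // Target already covered: a part no larger than the smallest kept one is skipped.
        if (!elems.empty() && elems.begin()->bytes_on_disk >= part.bytes_on_disk)
            return;

        if (!elems.insert(part).second)
            return;
        current_size_sum += part.bytes_on_disk;

        // Evict the smallest parts while the remaining ones still cover the target.
        while (!elems.empty() && current_size_sum - elems.begin()->bytes_on_disk >= required_size_sum)
        {
            current_size_sum -= elems.begin()->bytes_on_disk;
            elems.erase(elems.begin());
        }
    }

    // Kept parts in ascending size order.
    std::vector<Part> parts() const { return std::vector<Part>(elems.begin(), elems.end()); }

    uint64_t total() const { return current_size_sum; }

private:
    std::set<Part, PartSizeLess> elems;
    uint64_t required_size_sum;
    uint64_t current_size_sum = 0;
};
```

Because the set is ordered by size, the result comes back smallest first, which is what lets the caller stop at the first failed reservation: every later part from the same disk is at least as large.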

View File

@ -0,0 +1,73 @@
#pragma once
#include <functional>
#include <vector>
#include <optional>
#include <Storages/MergeTree/MergeTreeDataPart.h>
#include <Common/ActionBlocker.h>
#include <Common/DiskSpaceMonitor.h>
namespace DB
{
/// Active part from storage and the destination reservation
/// it has to be moved to.
struct MergeTreeMoveEntry
{
std::shared_ptr<const MergeTreeDataPart> part;
DiskSpace::ReservationPtr reserved_space;
MergeTreeMoveEntry(const std::shared_ptr<const MergeTreeDataPart> & part_, DiskSpace::ReservationPtr reservation_)
: part(part_), reserved_space(std::move(reservation_))
{
}
};
using MergeTreeMovingParts = std::vector<MergeTreeMoveEntry>;
/** Can select parts for background moves, clone them to appropriate disks into
* /detached directory and replace them into active parts set
*/
class MergeTreePartsMover
{
private:
/// Callback tells that part is not participating in background process
using AllowedMovingPredicate = std::function<bool(const std::shared_ptr<const MergeTreeDataPart> &, String * reason)>;
public:
MergeTreePartsMover(MergeTreeData * data_)
: data(data_)
, log(&Poco::Logger::get("MergeTreePartsMover"))
{
}
/// Select parts for background moves according to storage_policy configuration.
/// Returns true if at least one part was selected for move.
bool selectPartsForMove(
MergeTreeMovingParts & parts_to_move,
const AllowedMovingPredicate & can_move,
const std::lock_guard<std::mutex> & moving_parts_lock);
/// Copies part to selected reservation in detached folder. Throws exception if part already exists.
std::shared_ptr<const MergeTreeDataPart> clonePart(const MergeTreeMoveEntry & moving_part) const;
/// Replaces cloned part from detached directory into active data parts set.
/// The replaced part changes state to DeleteOnDestroy and will be removed from disk when the destructor of
/// MergeTreeDataPart is called. If the part to replace doesn't exist or is not active (committed), then
/// the cloned part will be removed and a log message will be reported. It may happen in case of a concurrent
/// merge or mutation.
void swapClonedPart(const std::shared_ptr<const MergeTreeDataPart> & cloned_parts) const;
public:
/// Can stop background moves and moves from queries
ActionBlocker moves_blocker;
private:
MergeTreeData * data;
Logger * log;
};
}
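A small sketch of the per-disk check that `selectPartsForMove` performs before collecting parts: a disk needs relief when its unreserved space falls below `total_space * move_factor`, and the shortfall is the number of bytes to move away. `bytesToFree` is a hypothetical helper, not part of this commit, and the assumption that the move factor lies in [0, 1] is mine rather than something the diff states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes that would have to leave a disk so that its
// unreserved space reaches total_space * move_factor.
// Returns 0 when the disk already has enough unreserved space.
inline uint64_t bytesToFree(uint64_t total_space, uint64_t unreserved_space, double move_factor)
{
    // Assumption of this sketch: the factor is a fraction of the disk size.
    assert(move_factor >= 0.0 && move_factor <= 1.0);

    // Same arithmetic as in the diff: the double product is truncated to an integer.
    const auto required_available = static_cast<uint64_t>(total_space * move_factor);

    return required_available > unreserved_space ? required_available - unreserved_space : 0;
}
```

In the commit the shortfall becomes the `required_size_sum` passed to `LargestPartsWithRequiredSize`, and disks on the last volume are skipped because there is no lower-priority volume to move to.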

View File

@ -88,6 +88,7 @@ struct MergeTreeSettings : public SettingsCollection<MergeTreeSettings>
M(SettingMaxThreads, max_part_loading_threads, 0, "The number of theads to load data parts at startup.") \
M(SettingMaxThreads, max_part_removal_threads, 0, "The number of theads for concurrent removal of inactive data parts. One is usually enough, but in 'Google Compute Environment SSD Persistent Disks' file removal (unlink) operation is extraordinarily slow and you probably have to increase this number (recommended is up to 16).") \
M(SettingUInt64, concurrent_part_removal_threshold, 100, "Activate concurrent part removal (see 'max_part_removal_threads') only if the number of inactive data parts is at least this.") \
M(SettingString, storage_policy_name, "default", "Name of storage disk policy")
DECLARE_SETTINGS_COLLECTION(LIST_OF_MERGE_TREE_SETTINGS)
@ -103,7 +104,7 @@ struct MergeTreeSettings : public SettingsCollection<MergeTreeSettings>
/// We check settings after storage creation
static bool isReadonlySetting(const String & name)
{
return name == "index_granularity" || name == "index_granularity_bytes";
return name == "index_granularity" || name == "index_granularity_bytes" || name == "storage_policy_name";
}
};

View File

@ -85,8 +85,9 @@ void ReplicatedMergeTreeAlterThread::run()
auto metadata_in_zk = ReplicatedMergeTreeTableMetadata::parse(metadata_str);
auto metadata_diff = ReplicatedMergeTreeTableMetadata(storage).checkAndFindDiff(metadata_in_zk, /* allow_alter = */ true);
/// If you need to lock table structure, then suspend merges.
/// If you need to lock table structure, then suspend merges and moves.
ActionLock merge_blocker = storage.merger_mutator.merges_blocker.cancel();
ActionLock moves_blocker = storage.parts_mover.moves_blocker.cancel();
MergeTreeData::DataParts parts;

View File

@ -15,6 +15,7 @@ namespace ErrorCodes
{
extern const int UNEXPECTED_NODE_IN_ZOOKEEPER;
extern const int UNFINISHED;
extern const int PART_IS_TEMPORARILY_LOCKED;
}
@ -30,7 +31,7 @@ void ReplicatedMergeTreeQueue::addVirtualParts(const MergeTreeData::DataParts &
{
std::lock_guard lock(state_mutex);
for (const auto & part : parts)
for (auto part : parts)
{
current_parts.add(part->name);
virtual_parts.add(part->name);
@ -38,6 +39,12 @@ void ReplicatedMergeTreeQueue::addVirtualParts(const MergeTreeData::DataParts &
}
bool ReplicatedMergeTreeQueue::isVirtualPart(const MergeTreeData::DataPartPtr & data_part) const
{
std::lock_guard lock(state_mutex);
return virtual_parts.getContainingPart(data_part->info) != data_part->name;
}
bool ReplicatedMergeTreeQueue::load(zkutil::ZooKeeperPtr zookeeper)
{
auto queue_path = replica_path + "/queue";
@ -379,11 +386,10 @@ bool ReplicatedMergeTreeQueue::remove(zkutil::ZooKeeperPtr zookeeper, const Stri
bool ReplicatedMergeTreeQueue::removeFromVirtualParts(const MergeTreePartInfo & part_info)
{
std::unique_lock lock(state_mutex);
std::lock_guard lock(state_mutex);
return virtual_parts.remove(part_info);
}
void ReplicatedMergeTreeQueue::pullLogsToQueue(zkutil::ZooKeeperPtr zookeeper, Coordination::WatchCallback watch_callback)
{
std::lock_guard lock(pull_logs_to_queue_mutex);
@ -763,7 +769,10 @@ bool ReplicatedMergeTreeQueue::checkReplaceRangeCanBeRemoved(const MergeTreePart
return true;
}
void ReplicatedMergeTreeQueue::removePartProducingOpsInRange(zkutil::ZooKeeperPtr zookeeper, const MergeTreePartInfo & part_info, const ReplicatedMergeTreeLogEntryData & current)
void ReplicatedMergeTreeQueue::removePartProducingOpsInRange(
zkutil::ZooKeeperPtr zookeeper,
const MergeTreePartInfo & part_info,
const ReplicatedMergeTreeLogEntryData & current)
{
Queue to_wait;
size_t removed_entries = 0;
@ -1312,7 +1321,7 @@ bool ReplicatedMergeTreeQueue::tryFinalizeMutations(zkutil::ZooKeeperPtr zookeep
}
void ReplicatedMergeTreeQueue::disableMergesInRange(const String & part_name)
void ReplicatedMergeTreeQueue::disableMergesInBlockRange(const String & part_name)
{
std::lock_guard lock(state_mutex);
virtual_parts.add(part_name);
@ -1630,25 +1639,30 @@ bool ReplicatedMergeTreeMergePredicate::operator()(
if (left_max_block + 1 < right_min_block)
{
/// Fake part which will appear as merge result
MergeTreePartInfo gap_part_info(
left->info.partition_id, left_max_block + 1, right_min_block - 1,
MergeTreePartInfo::MAX_LEVEL, MergeTreePartInfo::MAX_BLOCK_NUMBER);
/// We don't select parts if any smaller part covered by our merge must exist after
/// processing replication log up to log_pointer.
Strings covered = queue.virtual_parts.getPartsCoveredBy(gap_part_info);
if (!covered.empty())
{
if (out_reason)
*out_reason = "There are " + toString(covered.size()) + " parts (from " + covered.front()
+ " to " + covered.back() + ") that are still not present on this replica between "
+ left->name + " and " + right->name;
+ " to " + covered.back() + ") that are still not present or beeing processed by "
+ " other background process on this replica between " + left->name + " and " + right->name;
return false;
}
}
Int64 left_mutation_ver = queue.getCurrentMutationVersionImpl(
left->info.partition_id, left->info.getDataVersion(), lock);
Int64 right_mutation_ver = queue.getCurrentMutationVersionImpl(
left->info.partition_id, right->info.getDataVersion(), lock);
if (left_mutation_ver != right_mutation_ver)
{
if (out_reason)

View File

@ -229,6 +229,7 @@ public:
~ReplicatedMergeTreeQueue();
void initialize(const String & zookeeper_path_, const String & replica_path_, const String & logger_name_,
const MergeTreeData::DataParts & parts);
@ -304,6 +305,7 @@ public:
/// Count the total number of active mutations that are finished (is_done = true).
size_t countFinishedMutations() const;
/// Returns functor which is used by MergeTreeMergerMutator to select parts for merge
ReplicatedMergeTreeMergePredicate getMergePredicate(zkutil::ZooKeeperPtr & zookeeper);
/// Return the version (block number) of the last mutation that we don't need to apply to the part
@ -318,12 +320,17 @@ public:
/// (because some mutations are probably done but we are not sure yet), returns true.
bool tryFinalizeMutations(zkutil::ZooKeeperPtr zookeeper);
/// Prohibit merges in the specified range.
void disableMergesInRange(const String & part_name);
/// Prohibit merges in the specified blocks range.
/// Add part to virtual_parts, which means that part must exist
/// after processing replication log up to log_pointer.
/// Part may be fake (look at ReplicatedMergeTreeMergePredicate).
void disableMergesInBlockRange(const String & part_name);
/** Check that part isn't in currently generating parts and isn't covered by them and add it to future_parts.
* Locks queue's mutex.
*/
/// Checks that part is already in virtual parts
bool isVirtualPart(const MergeTreeData::DataPartPtr & data_part) const;
/// Check that part isn't in currently generating parts and isn't covered by them and add it to future_parts.
/// Locks queue's mutex.
bool addFuturePartIfNotCoveredByThem(const String & part_name, LogEntry & entry, String & reject_reason);
/// A blocker that stops selects from the queue

View File

@ -639,14 +639,14 @@ static StoragePtr create(const StorageFactory::Arguments & args)
if (replicated)
return StorageReplicatedMergeTree::create(
zookeeper_path, replica_name, args.attach, args.data_path, args.database_name, args.table_name,
zookeeper_path, replica_name, args.attach, args.database_name, args.table_name,
args.columns, indices_description, args.constraints,
args.context, date_column_name, partition_by_ast, order_by_ast, primary_key_ast,
sample_by_ast, ttl_table_ast, merging_params, std::move(storage_settings),
args.has_force_restore_data_flag);
else
return StorageMergeTree::create(
args.data_path, args.database_name, args.table_name, args.columns, indices_description,
args.database_name, args.table_name, args.columns, indices_description,
args.constraints, args.attach, args.context, date_column_name, partition_by_ast, order_by_ast,
primary_key_ast, sample_by_ast, ttl_table_ast, merging_params, std::move(storage_settings),
args.has_force_restore_data_flag);

View File

@ -39,6 +39,30 @@ std::optional<PartitionCommand> PartitionCommand::parse(const ASTAlterCommand *
res.part = command_ast->part;
return res;
}
else if (command_ast->type == ASTAlterCommand::MOVE_PARTITION)
{
PartitionCommand res;
res.type = MOVE_PARTITION;
res.partition = command_ast->partition;
res.part = command_ast->part;
switch (command_ast->move_destination_type)
{
case ASTAlterCommand::MoveDestinationType::DISK:
res.move_destination_type = PartitionCommand::MoveDestinationType::DISK;
break;
case ASTAlterCommand::MoveDestinationType::VOLUME:
res.move_destination_type = PartitionCommand::MoveDestinationType::VOLUME;
break;
case ASTAlterCommand::MoveDestinationType::PARTITION:
res.move_destination_type = PartitionCommand::MoveDestinationType::PARTITION;
res.to_database = command_ast->to_database;
res.to_table = command_ast->to_table;
break;
}
if (res.move_destination_type != PartitionCommand::MoveDestinationType::PARTITION)
res.move_destination_name = command_ast->move_destination_name;
return res;
}
else if (command_ast->type == ASTAlterCommand::REPLACE_PARTITION)
{
PartitionCommand res;
@ -49,15 +73,6 @@ std::optional<PartitionCommand> PartitionCommand::parse(const ASTAlterCommand *
res.from_table = command_ast->from_table;
return res;
}
else if (command_ast->type == ASTAlterCommand::MOVE_PARTITION)
{
PartitionCommand res;
res.type = MOVE_PARTITION;
res.partition = command_ast->partition;
res.to_database = command_ast->to_database;
res.to_table = command_ast->to_table;
return res;
}
else if (command_ast->type == ASTAlterCommand::FETCH_PARTITION)
{
PartitionCommand res;

View File

@ -19,6 +19,7 @@ struct PartitionCommand
enum Type
{
ATTACH_PARTITION,
MOVE_PARTITION,
CLEAR_COLUMN,
CLEAR_INDEX,
DROP_PARTITION,
@ -27,7 +28,6 @@ struct PartitionCommand
FREEZE_ALL_PARTITIONS,
FREEZE_PARTITION,
REPLACE_PARTITION,
MOVE_PARTITION,
};
Type type;
@ -57,6 +57,17 @@ struct PartitionCommand
/// For FREEZE PARTITION
String with_name;
enum MoveDestinationType
{
DISK,
VOLUME,
PARTITION,
};
MoveDestinationType move_destination_type;
String move_destination_name;
static std::optional<PartitionCommand> parse(const ASTAlterCommand * command);
};

View File

@ -96,7 +96,7 @@ public:
void startup() override;
void shutdown() override;
String getDataPath() const override { return path; }
Strings getDataPaths() const override { return {path}; }
const ExpressionActionsPtr & getShardingKeyExpr() const { return sharding_key_expr; }
const String & getShardingKeyColumnName() const { return sharding_key_column_name; }

View File

@ -325,11 +325,11 @@ BlockOutputStreamPtr StorageFile::write(
return std::make_shared<StorageFileBlockOutputStream>(*this);
}
String StorageFile::getDataPath() const
Strings StorageFile::getDataPaths() const
{
if (paths.empty())
throw Exception("Table '" + table_name + "' is in readonly mode", ErrorCodes::DATABASE_ACCESS_DENIED);
return paths[0];
return paths;
}
void StorageFile::rename(const String & new_path_to_db, const String & new_database_name, const String & new_table_name, TableStructureWriteLockHolder &)
