ClickHouse/src/Storages/MergeTree/MergeTreeDataPartUUID.h
Aleksei Semiglazov 921518db0a CLICKHOUSE-606: query deduplication based on parts' UUID
* add the query data deduplication excluding duplicated parts in MergeTree family engines.

query deduplication is based on parts' UUID which should be enabled first with merge_tree setting
assign_part_uuids=1

The allow_experimental_query_deduplication setting enables part deduplication; it defaults to false.

data part UUID is a mechanism of giving a data part a unique identifier.
Having UUID and deduplication mechanism provides a potential of moving parts
between shards preserving data consistency on a read path:
duplicated UUIDs will cause the root executor to retry the query against one of the replicas, explicitly
asking it to exclude the duplicated fingerprints encountered during distributed query execution.

NOTE: this implementation does not provide any knobs to lock a part and hence its UUID. Any mutation or merge will
update the part's UUID.

* add _part_uuid virtual column, allowing to use UUIDs in predicates.

Signed-off-by: Aleksei Semiglazov <asemiglazov@cloudflare.com>

address comments
2021-02-02 16:53:39 +00:00


#pragma once
#include <memory>
#include <mutex>
#include <unordered_set>
#include <vector>
#include <Core/UUID.h>
namespace DB
{
/** PartUUIDs is a set of UUIDs used to control query deduplication.
* The object is used in the query context in both directions:
* Server->Client to send the UUIDs of all parts that have been read during the query
* Client->Server to exclude the specified parts from being processed.
*
* The current implementation assumes the user setting allow_experimental_query_deduplication=1 is set.
*/
struct PartUUIDs
{
public:
    /// Add new UUIDs if no duplicates are found; otherwise return the duplicated UUIDs
    std::vector<UUID> add(const std::vector<UUID> & uuids);
    /// Get accumulated UUIDs
    std::vector<UUID> get() const;
    /// Check whether the given UUID has already been accumulated
    bool has(const UUID & uuid) const;
private:
    mutable std::mutex mutex;
    std::unordered_set<UUID> uuids;
};
using PartUUIDsPtr = std::shared_ptr<PartUUIDs>;
}
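
The header only declares the interface, so the following is a minimal, standalone sketch of how the three methods could behave based on the comments above. It is not taken from the ClickHouse sources. It assumes that `add` is all-or-nothing (the set is left unchanged when any duplicate is found), and it swaps `DB::UUID` for a plain 64-bit integer so it compiles without `Core/UUID.h`.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_set>
#include <vector>

// Assumption: a stand-in for DB::UUID so the sketch builds outside ClickHouse.
using UUID = std::uint64_t;

struct PartUUIDs
{
    /// Adds new_uuids only if none of them is already known.
    /// Otherwise returns the duplicates and leaves the set unchanged (assumed behaviour).
    std::vector<UUID> add(const std::vector<UUID> & new_uuids)
    {
        std::lock_guard<std::mutex> lock(mutex);

        std::vector<UUID> duplicates;
        for (const auto & uuid : new_uuids)
            if (uuids.count(uuid) != 0)
                duplicates.push_back(uuid);

        if (duplicates.empty())
            uuids.insert(new_uuids.begin(), new_uuids.end());

        return duplicates;
    }

    /// Returns a snapshot of all accumulated UUIDs (order is unspecified).
    std::vector<UUID> get() const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return std::vector<UUID>(uuids.begin(), uuids.end());
    }

    /// Returns true if the UUID has already been accumulated.
    bool has(const UUID & uuid) const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return uuids.count(uuid) != 0;
    }

private:
    mutable std::mutex mutex;
    std::unordered_set<UUID> uuids;
};

using PartUUIDsPtr = std::shared_ptr<PartUUIDs>;
```

In this sketch, a caller that collects part UUIDs from several shards would pass each shard's list to `add`. A non-empty return value is the signal that the same part was read twice, and the returned UUIDs are the ones to exclude when the query is retried.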