#pragma once
#include <Storages/MergeTree/MergeSelector.h>
/**
We have a set of data parts that is dynamically changing: new data parts are added and there is a background merging process.
The background merging process periodically selects a continuous range of data parts to merge.

It tries to optimize the following metrics:
1. Write amplification: the total amount of data written to disk (source data + merges) relative to the amount of source data.
It can also be considered as the total amount of work done by merges.
2. The number of data parts in the set at a random moment in time (average + quantiles).
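
For example (an illustrative calculation): if 10 GiB of source data has been inserted and merges have written another 25 GiB,
the write amplification is (10 + 25) / 10 = 3.5.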

We also take the following considerations into account:
1. Older data parts should be merged less frequently than newer data parts.
2. Larger data parts should be merged less frequently than smaller data parts.
3. If no new parts arrive, we should continue to merge existing data parts to eventually optimize the table.
4. Never allow too many parts, because that would slow down SELECT queries significantly.
5. Multiple background merges can run concurrently, but not too many.

It is not possible to optimize both metrics, because they contradict each other.
To lower the number of parts we can merge more eagerly, but write amplification will increase.
So we need to strike a balance between these two metrics.

But some optimizations may improve both metrics.

For example, we can look at the "merge tree": the tree of data parts that were merged together.
If the tree is perfectly balanced, then its depth is proportional to log(data size),
the total amount of work is proportional to data_size * log(data_size),
and the write amplification is proportional to log(data_size).
If it is not balanced (e.g. every new data part is always merged with the existing data parts),
then its depth is proportional to the data size and the total amount of work is proportional to data_size^2.
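
As an illustration (with made-up but convenient numbers): take 1024 inserted parts of 1 MiB each.
With balanced pairwise merging the tree has log2(1024) = 10 levels and every byte is rewritten once per level,
so merges write about 10 GiB on top of 1 GiB of source data (write amplification ~11).
If instead every new part is immediately merged into one ever-growing part, the i-th merge rewrites about i MiB,
so merges write about 1024 * 1025 / 2 MiB ~ 512 GiB (write amplification ~513).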

We can also control the "base of the logarithm": you can think of it as the number of data parts
that are merged at once (the tree "arity"). But as the data parts have different sizes, we generalize it:
calculate the ratio of the total size of the merged parts to the size of the largest part participating in the merge.
For example, if we merge 4 parts of sizes 5, 3, 1, 1, then the "base" will be 2 (10 / 5).
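
A minimal sketch of this calculation (the function name is made up for illustration and this is not the actual implementation;
it assumes <vector> and <algorithm> are included):

    double mergeBase(const std::vector<size_t> & part_sizes)
    {
        size_t sum = 0;
        size_t max = 0;
        for (size_t size : part_sizes)
        {
            sum += size;
            max = std::max(max, size);
        }
        /// Ratio of the total size of the range to its largest part.
        return max ? static_cast<double>(sum) / max : 0.0;
    }

    /// mergeBase({5, 3, 1, 1}) == 2.0 - the example above.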

The base of the logarithm (simply called `base` in `SimpleMergeSelector`) is the main knob to control the write amplification.
The higher it is, the lower the write amplification, but the more data parts we will have on average.

To fit all the considerations above, we also adjust `base` depending on the total parts count,
parts size and parts age, with linear interpolation (so `base` is not a constant but a function of multiple variables,
looking like a section of a hyperplane).

Then we apply the algorithm to select the optimal range of data parts to merge.
There is a proof that this algorithm is optimal if we look only a single step into the future.

The best range of data parts is selected.

We also apply a few tweaks:
- there is a fixed cost of merging small parts (it is added to the size of each data part before all estimations);
- there are some heuristics to "stick" ranges to large data parts.

It is still unclear whether this algorithm is good or optimal at all, and whether it uses the optimal coefficients.

To test and optimize SimpleMergeSelector, we apply the following methods:
- an insert/merge simulator: a model simulating parts insertion and merging;
the merge selection algorithm is applied and the relevant metrics are calculated to allow tuning the algorithm
(a toy version is sketched below);
- an insert/merge simulator on a real `system.part_log` from production - it gives realistic information about inserted data parts:
their sizes and the time intervals at which they are inserted.
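
For orientation, here is a toy simulator in the spirit of the first method (entirely illustrative: it is not the tool we actually use,
and its greedy selection rule is only a stand-in for SimpleMergeSelector):

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main()
    {
        const double base = 5;          /// analogous to Settings::base below
        const size_t inserts = 100000;  /// number of inserted parts, all of size 1

        std::vector<size_t> parts;
        double written = 0;             /// source data + data written by merges
        double parts_count_sum = 0;     /// to compute the average number of parts over time

        for (size_t i = 0; i < inserts; ++i)
        {
            parts.push_back(1);
            written += 1;

            /// Greedy stand-in for the selector: merge the largest prefix of the smallest
            /// parts whose total size is at least `base` times its largest part.
            std::sort(parts.begin(), parts.end());
            size_t sum = 0;
            size_t k = 0;
            for (size_t j = 0; j < parts.size(); ++j)
            {
                sum += parts[j];
                if (j >= 1 && sum >= base * parts[j])
                    k = j + 1;
            }
            if (k >= 2)
            {
                size_t merged = 0;
                for (size_t j = 0; j < k; ++j)
                    merged += parts[j];
                parts.erase(parts.begin(), parts.begin() + k);
                parts.push_back(merged);
                written += merged;
            }

            parts_count_sum += parts.size();
        }

        std::cout << "write amplification: " << written / inserts << "\n";
        std::cout << "average parts count: " << parts_count_sum / inserts << "\n";
    }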

There is a research thesis dedicated to the optimization of the merge algorithm:
https://presentations.clickhouse.com/hse_2019/merge_algorithm.pptx

This work attempted to vary the coefficients in SimpleMergeSelector and to solve the optimization task:
maybe some change in the coefficients would give a clear win on all metrics. Surprisingly enough, it found
that our selection of coefficients is near optimal. It found slightly better coefficients,
but I decided not to use them, because the representativeness of the test data is in question.

This work did not make any attempt to propose any other algorithm.
This work did not make any attempt to analyze the task with analytical methods.
That's why I still believe that there are many opportunities to optimize the merge selection algorithm.

Please do not confuse this task with the similar task in other LSM-based systems (like RocksDB).
Their problem statement is subtly different. Our set consists of data parts
that are completely independent in the data they store. Ranges of primary keys in data parts can intersect.
When doing a SELECT, we read from all data parts. INSERTed data parts come with unknown sizes...
*/

namespace DB
{

class SimpleMergeSelector final : public IMergeSelector
{
public:
    struct Settings
    {
        /// Zero means unlimited. Can be overridden by the same merge tree setting.
        size_t max_parts_to_merge_at_once = 100;

        /** Minimum required ratio of the sum of sizes of all parts in a merge to the size of the largest part participating in it (for usual cases).
          * For example, if all parts have equal size, it means that at least 'base' parts should be merged.
          * If parts have non-uniform sizes, the minimum number of parts to merge is effectively increased.
          * This behaviour balances the merge-tree workload.
          * It is called 'base', because the merge-tree depth can be estimated as a logarithm with that base.
          *
          * If base is higher, the tree gets wider and shallower, lowering write amplification.
          * If base is lower, merges occur more frequently, lowering the number of parts on average.
          *
          * We need some balance between write amplification and the number of parts.
          */
        double base = 5;

        /** Base is lowered down to 1 (which effectively means "merge any two parts") depending on several variables:
          *
          * 1. Total number of parts in the partition. If there are too many, base is lowered.
          * It means: when there are too many parts, do merges more urgently.
          *
          * 2. Minimum age of the parts participating in the merge. The higher the age, the more base is lowered.
          * It means: narrower (less wide) merges are allowed, but only rarely, for old enough parts.
          *
          * 3. Sum size of the parts participating in the merge. The higher it is, the more age is required to lower base, so base is lowered more slowly.
          * It means: small parts are worth merging sooner, even if the merge is not so wide or balanced.
          *
          * We have a multivariate dependency. Let it be logarithmic in size and roughly multi-linear in the other variables,
          * between some boundary points, and constant outside.
          */

        size_t min_size_to_lower_base = 1024 * 1024;
        size_t max_size_to_lower_base = 100ULL * 1024 * 1024 * 1024;

        time_t min_age_to_lower_base_at_min_size = 10;
        time_t min_age_to_lower_base_at_max_size = 10;
        time_t max_age_to_lower_base_at_min_size = 3600;
        time_t max_age_to_lower_base_at_max_size = 30 * 86400;

        size_t min_parts_to_lower_base = 10;
        size_t max_parts_to_lower_base = 50;
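
        /// For illustration only: a sketch of how `base` could be lowered from the boundary settings above.
        /// The name `mapToUnit` is made up here, and `sum_size`, `min_age` and `parts_count` describe the candidate
        /// merge and its partition; the actual logic lives in SimpleMergeSelector.cpp and may differ in details.
        ///
        ///     /// Map `value` from [min, max] to [0, 1], clamped outside the bounds.
        ///     double mapToUnit(double value, double min, double max)
        ///     {
        ///         if (value <= min) return 0;
        ///         if (value >= max) return 1;
        ///         return (value - min) / (max - min);
        ///     }
        ///
        ///     /// Size decides which age boundaries apply, then age and parts count jointly lower base towards 1.
        ///     double size_factor = mapToUnit(sum_size, min_size_to_lower_base, max_size_to_lower_base);
        ///     double min_age_boundary = min_age_to_lower_base_at_min_size + size_factor * (min_age_to_lower_base_at_max_size - min_age_to_lower_base_at_min_size);
        ///     double max_age_boundary = max_age_to_lower_base_at_min_size + size_factor * (max_age_to_lower_base_at_max_size - max_age_to_lower_base_at_min_size);
        ///     double age_factor = mapToUnit(min_age, min_age_boundary, max_age_boundary);
        ///     double parts_factor = mapToUnit(parts_count, min_parts_to_lower_base, max_parts_to_lower_base);
        ///     double lowered_base = base - (base - 1) * std::min(1.0, age_factor + parts_factor);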

        /// Add this to the size before all calculations. It means: merging even very small parts has its fixed cost.
        size_t size_fixed_cost_to_add = 5 * 1024 * 1024;

        /** Heuristic:
          * Give some preference to ranges whose sum_size is similar (in terms of ratio) to the part immediately at their left.
          */
        bool enable_heuristic_to_align_parts = true;
        double heuristic_to_align_parts_min_ratio_of_sum_size_to_prev_part = 0.9;
        double heuristic_to_align_parts_max_absolute_difference_in_powers_of_two = 0.5;
        double heuristic_to_align_parts_max_score_adjustment = 0.75;

        /** If it's not 0, all part ranges that have min_age larger than min_age_to_force_merge
          * will be considered for merging.
          */
        size_t min_age_to_force_merge = 0;

        /** Heuristic:
          * From the right side of the range, remove all parts whose size is less than the specified ratio of sum_size.
          */
        bool enable_heuristic_to_remove_small_parts_at_right = true;
        double heuristic_to_remove_small_parts_at_right_max_ratio = 0.01;
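
        /// For illustration only (a sketch, not the actual implementation; it assumes the Part::size field from MergeSelector.h):
        ///
        ///     /// Drop parts at the right end of the selected range that are tiny compared to the whole range.
        ///     while (range.size() > 2
        ///         && range.back().size < heuristic_to_remove_small_parts_at_right_max_ratio * sum_size)
        ///         range.pop_back();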
    };

    explicit SimpleMergeSelector(const Settings & settings_) : settings(settings_) {}

    PartsRange select(
        const PartsRanges & parts_ranges,
        size_t max_total_size_to_merge) override;

private:
    const Settings settings;
};
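
/// Usage sketch (for illustration; `parts_ranges` has to be built by the caller from the current set of data parts,
/// see the IMergeSelector interface in MergeSelector.h):
///
///     SimpleMergeSelector::Settings settings;
///     SimpleMergeSelector selector(settings);
///
///     /// An example limit on the total size of a single merge; the selected range is then merged by the background process.
///     size_t max_total_size_to_merge = 150ULL * 1024 * 1024 * 1024;
///     PartsRange range_to_merge = selector.select(parts_ranges, max_total_size_to_merge);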

}