Mirror of https://github.com/ClickHouse/ClickHouse.git (synced 2024-11-21 23:21:59 +00:00)

Commit 35231662b3:
While reading from AggregatingMergeTree with SimpleAggregateFunction(String) in primary key and optimize_aggregation_in_order, perf top shows:

```
Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 287759760270 lost: 0/0 drop: 0/0
  Children      Self  Shared Object  Symbol
+   12.64%    11.39%  clickhouse     [.] memcpy
+    9.08%     0.23%  [unknown]      [.] 0000000000000000
+    8.45%     8.40%  clickhouse     [.] ProfileEvents::increment  # <-- this, and in debug it has not 0.08x overhead, but 5.8x overhead
+    7.68%     7.67%  clickhouse     [.] LZ4_compress_fast_extState
+    5.29%     5.22%  clickhouse     [.] DB::IAggregateFunctionHelper<DB::AggregateFunctionNullUnary<true, true> >::addFree
```

The reason is obvious: ProfileEvents are atomic counters (and they are also nested):

<details>

```
Samples: 7M of event 'cycles', 4000 Hz, Event count (approx.): 450726149337
ProfileEvents::increment  /usr/bin/clickhouse [Percent: local period]
Percent│
       │
       │
       │    Disassembly of section .text:
       │
       │    00000000078d8900 <ProfileEvents::increment(unsigned long, unsigned long)@@Base>:
       │    ProfileEvents::increment(unsigned long, unsigned long):
  0.17 │      push  %rbp
  0.00 │      mov   %rsi,%rbp
  0.04 │      push  %rbx
  0.20 │      mov   %rdi,%rbx
  0.17 │      sub   $0x8,%rsp
  0.26 │    → callq DB::CurrentThread::getProfileEvents
       │    ProfileEvents::Counters::increment(unsigned long, unsigned long):
  0.00 │      lea   0x0(,%rbx,8),%rdi
  0.05 │      nop
       │    unsigned long std::__1::__cxx_atomic_fetch_add<unsigned long, unsigned long>(std::__1::__cxx_atomic_base_impl<unsigned long>*, unsigned long, std::__1::memory_order):
  1.02 │      mov   (%rax),%rdx
 97.04 │      lock  add %rbp,(%rdx,%rdi,1)
       │    ProfileEvents::Counters::increment(unsigned long, unsigned long):
  0.21 │      mov   0x10(%rax),%rax
  0.04 │      test  %rax,%rax
  0.00 │    → jne   78d8920 <ProfileEvents::increment(unsigned long, unsigned long)@@Base+0x20>
       │    ProfileEvents::increment(unsigned long, unsigned long):
  0.38 │      add   $0x8,%rsp
  0.00 │      pop   %rbx
  0.04 │      pop   %rbp
  0.38 │    ← retq
```

</details>

These ProfileEvents were ArenaAllocChunks (at ~1.5M events per second), and the reason is that the table has SimpleAggregateFunction(String) in the PK, which requires an Arena. But most of the time the Arena wasn't even used, so avoid this cost by re-creating the Arena only if it was "used" (i.e. has new chunks). Another possibility is to avoid populating Arena::head in the ctor, but that would make the Arena code more complex, so for now this approach was preferred. Also, as a long-term solution, it is worth looking at implementing these counters via RCU (to move the extra overhead out of the write code path into the read side).
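The fix described above (re-create the Arena only if it actually allocated new chunks) can be sketched as follows. This is a simplified, standalone illustration; the `Arena` class and `refreshArenaIfUsed` helper here are hypothetical and much simpler than ClickHouse's real Arena:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical minimal arena: tracks how many chunks it has allocated.
// (ClickHouse's real Arena is far more elaborate; names are illustrative.)
class Arena
{
public:
    char * alloc(size_t size)
    {
        chunks.push_back(std::make_unique<char[]>(size));
        return chunks.back().get();
    }

    size_t chunkCount() const { return chunks.size(); }

private:
    std::vector<std::unique_ptr<char[]>> chunks;
};

/// The idea from the commit: instead of unconditionally re-creating the
/// arena between blocks (which allocates a fresh head chunk and bumps the
/// ArenaAllocChunks profile counter every time), re-create it only when
/// the previous arena was actually used.
std::unique_ptr<Arena> refreshArenaIfUsed(std::unique_ptr<Arena> arena, size_t initial_chunks)
{
    if (arena->chunkCount() > initial_chunks)
        return std::make_unique<Arena>();  // was used: start fresh
    return arena;                          // untouched: keep it, skip the allocation cost
}
```

If the workload never touches the arena (e.g. fixed-size aggregation states), `refreshArenaIfUsed` keeps returning the same untouched object, and no per-block allocation or counter increment happens.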
Files in this directory:

agg_functions_min_max_any.xml
aggregate_functions_of_group_by_keys.xml
aggregating_merge_tree_simple_aggregate_function_string.xml
aggregating_merge_tree.xml
aggregation_in_order.xml
analyze_array_tuples.xml
and_function.xml
any_anyLast.xml
arithmetic_operations_in_aggr_func.xml
arithmetic.xml
array_auc.xml
array_element.xml
array_fill.xml
array_index_low_cardinality_numbers.xml
array_index_low_cardinality_strings.xml
array_join.xml
array_reduce.xml
base64_hits.xml
base64.xml
basename.xml
bigint_arithm.xml
bit_operations_fixed_string_numbers.xml
bit_operations_fixed_string.xml
bitCount.xml
bloom_filter_insert.xml
bloom_filter_select.xml
bounding_ratio.xml
cidr.xml
codecs_float_insert.xml
codecs_float_select.xml
codecs_int_insert.xml
codecs_int_select.xml
collations.xml
column_column_comparison.xml
columns_hashing.xml
complex_array_creation.xml
concat_hits.xml
conditional.xml
consistent_hashes.xml
constant_column_comparison.xml
constant_column_search.xml
count.xml
cpu_synthetic.xml
cryptographic_hashes.xml
date_parsing.xml
date_time_64.xml
date_time_long.xml
date_time_short.xml
datetime_comparison.xml
decimal_aggregates.xml
decimal_casts.xml
decimal_parse.xml
distinct_combinator.xml
distributed_aggregation_memory_efficient.xml
distributed_aggregation.xml
duplicate_order_by_and_distinct.xml
early_constant_folding.xml
empty_string_deserialization.xml
empty_string_serialization.xml
encrypt_decrypt_empty_string_slow.xml
encrypt_decrypt_empty_string.xml
encrypt_decrypt_slow.xml
encrypt_decrypt.xml
entropy.xml
extract.xml
first_significant_subdomain.xml
fixed_string16.xml
float_formatting.xml
float_mod.xml
float_parsing.xml
format_date_time.xml
format_readable.xml
functions_coding.xml
functions_geo.xml
functions_with_hash_tables.xml
fuzz_bits.xml
general_purpose_hashes_on_UUID.xml
general_purpose_hashes.xml
generate_table_function.xml
great_circle_dist.xml
group_array_moving_sum.xml
group_by_sundy_li.xml
h3.xml
if_array_num.xml
if_array_string.xml
if_string_const.xml
if_string_hits.xml
if_to_multiif.xml
if_transform_strings_to_enum.xml
information_value.xml
injective_functions_inside_uniq.xml
insert_parallel.xml
insert_select_default_small_block.xml
insert_sequential_and_background_merges.xml
insert_values_with_expressions.xml
inserts_arrays_lowcardinality.xml
int_parsing.xml
IPv4.xml
IPv6.xml
jit_large_requests.xml
jit_small_requests.xml
joins_in_memory_pmj.xml
joins_in_memory.xml
json_extract_rapidjson.xml
json_extract_simdjson.xml
least_greatest_hits.xml
leftpad.xml
linear_regression.xml
local_replica.xml
logical_functions_large.xml
logical_functions_medium.xml
logical_functions_small.xml
materialized_view_parallel_insert.xml
math.xml
merge_table_streams.xml
merge_tree_huge_pk.xml
merge_tree_many_partitions_2.xml
merge_tree_many_partitions.xml
merge_tree_simple_select.xml
mingroupby-orderbylimit1.xml
modulo.xml
monotonous_order_by.xml
ngram_distance.xml
number_formatting_formats.xml
nyc_taxi.xml
optimized_select_final.xml
or_null_default.xml
order_by_decimals.xml
order_by_read_in_order.xml
order_by_single_column.xml
order_with_limit.xml
parallel_final.xml
parallel_index.xml
parallel_insert.xml
parallel_mv.xml
parse_engine_file.xml
point_in_polygon_const.xml
point_in_polygon.xml
polymorphic_parts_l.xml
polymorphic_parts_m.xml
polymorphic_parts_s.xml
pre_limit_no_sorting.xml
prewhere.xml
push_down_limit.xml
quantile_merge.xml
questdb_sum_float32.xml
questdb_sum_float64.xml
questdb_sum_int32.xml
rand.xml
random_fixed_string.xml
random_printable_ascii.xml
random_string_utf8.xml
random_string.xml
range.xml
read_from_comp_parts.xml
read_hits_with_aio.xml
read_in_order_many_parts.xml
README.md
redundant_functions_in_order_by.xml
removing_group_by_keys.xml
right.xml
round_down.xml
round_methods.xml
scalar.xml
select_format.xml
set_hits.xml
set_index.xml
set.xml
simple_join_query.xml
single_fixed_string_groupby.xml
slices_hits.xml
sort_radix_trivial.xml
sort.xml
string_join.xml
string_set.xml
string_sort.xml
sum_map.xml
sum.xml
synthetic_hardware_benchmark.xml
trim_numbers.xml
trim_urls.xml
trim_whitespace.xml
uniq.xml
url_hits.xml
vectorize_aggregation_combinators.xml
visit_param_extract_raw.xml
website.xml
# ClickHouse performance tests

This directory contains `.xml`-files with performance tests for the @akuzm tool.
## How to write performance test

First of all you should check that existing tests don't cover your case. If there are no such tests, then you should write your own.

You have to specify `preconditions`. They contain table names. Only `hits_100m_single`, `hits_10m_single` and `test.hits` are available in CI.
You can use `substitutions`, `create`, `fill` and `drop` queries to prepare a test. You can find examples in this folder.
Take into account that these tests will run in CI, which consists of machines with 56 cores and 512 GB of RAM. Queries will be executed much faster than on a local laptop.

If your test runs longer than 10 minutes, please add the tag `long`, so that it is possible to run all tests while skipping the long ones.
## How to run performance test

TODO @akuzm
## How to validate single test

```
pip3 install clickhouse_driver
../../docker/test/performance-comparison/perf.py --runs 1 insert_parallel.xml
```