Merge remote-tracking branch 'origin/master' into HEAD

2024-09-20 08:40:50 +00:00 · 2020-09-18 18:06:41 +03:00 · 2020-09-18 18:06:41 +03:00 · 8995140ffb
commit 8995140ffb
parent e7b9319e12 eae9950abe
15 changed files with 639 additions and 376 deletions
--- a/docker/test/performance-comparison/README.md
+++ b/docker/test/performance-comparison/README.md
@ -16,7 +16,7 @@ We also consider the test to be unstable, if the observed difference is less tha
 performance differences above 5% more often than in 5% runs, so the test is likely
 to have false positives.

-### How to read the report
+### How to Read the Report

 The check status summarizes the report in a short text message like `1 faster, 10 unstable`:
 * `1 faster` -- how many queries became faster,
@ -27,28 +27,50 @@ The check status summarizes the report in a short text message like `1 faster, 1

 The report page itself constists of a several tables. Some of them always signify errors, e.g. "Run errors" -- the very presence of this table indicates that there were errors during the test, that are not normal and must be fixed. Some tables are mostly informational, e.g. "Test times" -- they reflect normal test results. But if a cell in such table is marked in red, this also means an error, e.g., a test is taking too long to run.

-#### Tested commits
+#### Tested Commits
 Informational, no action required. Log messages for the commits that are tested. Note that for the right commit, we show nominal tested commit `pull/*/head` and real tested commit `pull/*/merge`, which is generated by GitHub by merging latest master to the `pull/*/head` and which we actually build and test in CI.

-#### Run errors
-Action required for every item -- these are errors that must be fixed. The errors that ocurred when running some test queries. For more information about the error, download test output archive and see `test-name-err.log`. To reproduce, see 'How to run' below.
+#### Error Summary
+Action required for every item.

-#### Slow on client
-Action required for every item -- these are errors that must be fixed. This table shows queries that take significantly longer to process on the client than on the server. A possible reason might be sending too much data to the client, e.g., a forgotten `format Null`.
+This table summarizes all errors that ocurred during the test. Click the links to go to the description of a particular error.

-#### Short queries not marked as short
-Action required for every item -- these are errors that must be fixed. This table shows queries that are "short" but not explicitly marked as such. "Short" queries are too fast to meaningfully compare performance, because the changes are drowned by the noise. We consider all queries that run faster than 0.02 s to be "short", and only check the performance if they became slower than this threshold. Probably this mode is not what you want, so you have to increase the query run time to be between 1 and 0.1 s, so that the performance can be compared. You do want this "short" mode for queries that complete "immediately", such as some varieties of `select count(*)`. You have to mark them as "short" explicitly by writing `<query short="1">...`. The value of "short" attribute is evaluated as a python expression, and substitutions are performed, so you can write something like `<query short="{column1} = {column2}">select count(*) from table where {column1} > {column2}</query>`, to mark only a particular combination of variables as short.
+#### Run Errors
+Action required for every item -- these are errors that must be fixed.

-#### Partial queries
-Action required for the cells marked in red. Shows the queries we are unable to run on an old server -- probably because they contain a new function. You should see this table when you add a new function and a performance test for it. Check that the run time and variance are acceptable (run time between 0.1 and 1 seconds, variance below 10%). If not, they will be highlighted in red.
+The errors that ocurred when running some test queries. For more information about the error, download test output archive and see `test-name-err.log`. To reproduce, see 'How to run' below.

-#### Changes in performance
-Action required for the cells marked in red, and some cheering is appropriate for the cells marked in green. These are the queries for which we observe a statistically significant change in performance. Note that there will always be some false positives -- we try to filter by p < 0.001, and have 2000 queries, so two false positives per run are expected. In practice we have more -- e.g. code layout changed because of some unknowable jitter in compiler internals, so the change we observe is real, but it is a 'false positive' in the sense that it is not directly caused by your changes. If, based on your knowledge of ClickHouse internals, you can decide that the observed test changes are not relevant to the changes made in the tested PR, you can ignore them.
+#### Slow on Client
+Action required for every item -- these are errors that must be fixed.
+
+This table shows queries that take significantly longer to process on the client than on the server. A possible reason might be sending too much data to the client, e.g., a forgotten `format Null`.
+
+#### Inconsistent Short Marking
+Action required for every item -- these are errors that must be fixed.
+
+Queries that have "short" duration (on the order of 0.1 s) can't be reliably tested in a normal way, where we perform a small (about ten) measurements for each server, because the signal-to-noise ratio is much smaller. There is a special mode for such queries that instead runs them for a fixed amount of time, normally with much higher number of measurements (up to thousands). This mode must be explicitly enabled by the test author to avoid accidental errors. It must be used only for queries that are meant to complete "immediately", such as `select count(*)`. If your query is not supposed to be "immediate", try to make it run longer, by e.g. processing more data.
+
+This table shows queries for which the "short" marking is not consistent with the actual query run time -- i.e., a query runs for a long time but is marked as short, or it runs very fast but is not marked as short.
+
+If your query is really supposed to complete "immediately" and can't be made to run longer, you have to mark it as "short". To do so, write `<query short="1">...` in the test file. The value of "short" attribute is evaluated as a python expression, and substitutions are performed, so you can write something like `<query short="{column1} = {column2}">select count(*) from table where {column1} > {column2}</query>`, to mark only a particular combination of variables as short.
+
+
+#### Partial Queries
+Action required for the cells marked in red.
+
+Shows the queries we are unable to run on an old server -- probably because they contain a new function. You should see this table when you add a new function and a performance test for it. Check that the run time and variance are acceptable (run time between 0.1 and 1 seconds, variance below 10%). If not, they will be highlighted in red.
+
+#### Changes in Performance
+Action required for the cells marked in red, and some cheering is appropriate for the cells marked in green.
+
+These are the queries for which we observe a statistically significant change in performance. Note that there will always be some false positives -- we try to filter by p < 0.001, and have 2000 queries, so two false positives per run are expected. In practice we have more -- e.g. code layout changed because of some unknowable jitter in compiler internals, so the change we observe is real, but it is a 'false positive' in the sense that it is not directly caused by your changes. If, based on your knowledge of ClickHouse internals, you can decide that the observed test changes are not relevant to the changes made in the tested PR, you can ignore them.

 You can find flame graphs for queries with performance changes in the test output archive, in files named as 'my_test_0_Cpu_SELECT 1 FROM....FORMAT Null.left.svg'. First goes the test name, then the query number in the test, then the trace type (same as in `system.trace_log`), and then the server version (left is old and right is new).

-#### Unstable queries
-Action required for the cells marked in red. These are queries for which we did not observe a statistically significant change in performance, but for which the variance in query performance is very high. This means that we are likely to observe big changes in performance even in the absence of real changes, e.g. when comparing the server to itself. Such queries are going to have bad sensitivity as performance tests -- if a query has, say, 50% expected variability, this means we are going to see changes in performance up to 50%, even when there were no real changes in the code. And because of this, we won't be able to detect changes less than 50% with such a query, which is pretty bad. The reasons for the high variability must be investigated and fixed; ideally, the variability should be brought under 5-10%. 
+#### Unstable Queries
+Action required for the cells marked in red.
+
+These are the queries for which we did not observe a statistically significant change in performance, but for which the variance in query performance is very high. This means that we are likely to observe big changes in performance even in the absence of real changes, e.g. when comparing the server to itself. Such queries are going to have bad sensitivity as performance tests -- if a query has, say, 50% expected variability, this means we are going to see changes in performance up to 50%, even when there were no real changes in the code. And because of this, we won't be able to detect changes less than 50% with such a query, which is pretty bad. The reasons for the high variability must be investigated and fixed; ideally, the variability should be brought under 5-10%. 

 The most frequent reason for instability is that the query is just too short -- e.g. below 0.1 seconds. Bringing query time to 0.2 seconds or above usually helps.
 Other reasons may include:
@ -57,24 +79,33 @@ Other reasons may include:

 Investigating the instablility is the hardest problem in performance testing, and we still have not been able to understand the reasons behind the instability of some queries. There are some data that can help you in the performance test output archive. Look for files named 'my_unstable_test_0_SELECT 1...FORMAT Null.{left,right}.metrics.rep'. They contain metrics from `system.query_log.ProfileEvents` and functions from stack traces from `system.trace_log`, that vary significantly between query runs. The second column is array of \[min, med, max] values for the metric. Say, if you see `PerfCacheMisses` there, it may mean that the code being tested has not-so-cache-local memory access pattern that is sensitive to memory layout.

-#### Skipped tests
-Informational, no action required. Shows the tests that were skipped, and the reason for it. Normally it is because the data set required for the test was not loaded, or the test is marked as 'long' -- both cases mean that the test is too big to be ran per-commit.
+#### Skipped Tests
+Informational, no action required.

-#### Test performance changes
-Informational, no action required. This table summarizes the changes in performance of queries in each test -- how many queries have changed, how many are unstable, and what is the magnitude of the changes.
+Shows the tests that were skipped, and the reason for it. Normally it is because the data set required for the test was not loaded, or the test is marked as 'long' -- both cases mean that the test is too big to be ran per-commit.

-#### Test times
-Action required for the cells marked in red. This table shows the run times for all the tests. You may have to fix two kinds of errors in this table:
+#### Test Performance Changes
+Informational, no action required.
+
+This table summarizes the changes in performance of queries in each test -- how many queries have changed, how many are unstable, and what is the magnitude of the changes.
+
+#### Test Times
+Action required for the cells marked in red.
+
+This table shows the run times for all the tests. You may have to fix two kinds of errors in this table:
 1) Average query run time is too long -- probalby means that the preparatory steps such as creating the table and filling them with data are taking too long. Try to make them faster.
 2) Longest query run time is too long -- some particular queries are taking too long, try to make them faster. The ideal query run time is between 0.1 and 1 s.

-#### Concurrent benchmarks
-No action required. This table shows the results of a concurrent behcmark where queries from `website` are ran in parallel using `clickhouse-benchmark`, and requests per second values are compared for old and new servers. It shows variability up to 20% for no apparent reason, so it's probably safe to disregard it. We have it for special cases like investigating concurrency effects in memory allocators, where it may be important.
+#### Metric Changes
+No action required.

-#### Metric changes
-No action required. These are changes in median values of metrics from `system.asynchronous_metrics_log`. Again, they are prone to unexplained variation and you can safely ignore this table unless it's interesting to you for some particular reason (e.g. you want to compare memory usage). There are also graphs of these metrics in the performance test output archive, in the `metrics` folder.
+These are changes in median values of metrics from `system.asynchronous_metrics_log`. These metrics are prone to unexplained variation and you can safely ignore this table unless it's interesting to you for some particular reason (e.g. you want to compare memory usage). There are also graphs of these metrics in the performance test output archive, in the `metrics` folder.

-### How to run
+#### Errors while Building the Report
+Ask a maintainer for help. These errors normally indicate a problem with testing infrastructure.
+
+
+### How to Run
 Run the entire docker container, specifying PR number (0 for master)
 and SHA of the commit to test. The reference revision is determined as a nearest
 ancestor testing release tag. It is possible to specify the reference revision and
--- a/docker/test/performance-comparison/compare.sh
+++ b/docker/test/performance-comparison/compare.sh
@ -210,33 +210,9 @@ function run_tests
    wait
 }

-# Run some queries concurrently and report the resulting TPS. This additional
-# (relatively) short test helps detect concurrency-related effects, because the
-# main performance comparison testing is done query-by-query.
-function run_benchmark
-{
-    rm -rf benchmark ||:
-    mkdir benchmark ||:
-
-    # The list is built by run_tests.
-    while IFS= read -r file
-    do
-        name=$(basename "$file" ".xml")
-
-        "$script_dir/perf.py" --print-queries "$file" > "benchmark/$name-queries.txt"
-        "$script_dir/perf.py" --print-settings "$file" > "benchmark/$name-settings.txt"
-
-        readarray -t settings < "benchmark/$name-settings.txt"
-        command=(clickhouse-benchmark --concurrency 6 --cumulative --iterations 1000 --randomize 1 --delay 0 --continue_on_errors "${settings[@]}")
-
-        "${command[@]}" --port 9001 --json "benchmark/$name-left.json" < "benchmark/$name-queries.txt"
-        "${command[@]}" --port 9002 --json "benchmark/$name-right.json" < "benchmark/$name-queries.txt"
-    done < benchmarks-to-run.txt
-}
-
 function get_profiles_watchdog
 {
-    sleep 6000
+    sleep 600

    echo "The trace collection did not finish in time." >> profile-errors.log

@ -503,8 +479,6 @@ build_log_column_definitions
 cat analyze/errors.log >> report/errors.log ||:
 cat profile-errors.log >> report/errors.log ||:

-short_query_threshold="0.02"
-
 clickhouse-local --query "
 create view query_display_names as select * from
    file('analyze/query-display-names.tsv', TSV,
@ -537,18 +511,11 @@ create view query_metric_stats as
 -- Main statistics for queries -- query time as reported in query log.
 create table queries engine File(TSVWithNamesAndTypes, 'report/queries.tsv')
    as select
-        -- Comparison mode doesn't make sense for queries that complete
-        -- immediately (on the same order of time as noise). If query duration is
-        -- less that some threshold, we just skip it. If there is a significant
-        -- regression in such query, the time will exceed the threshold, and we
-        -- well process it normally and detect the regression.
-        right < $short_query_threshold as short,
-
-        not short and abs(diff) > report_threshold        and abs(diff) > stat_threshold as changed_fail,
-        not short and abs(diff) > report_threshold - 0.05 and abs(diff) > stat_threshold as changed_show,
+        abs(diff) > report_threshold        and abs(diff) > stat_threshold as changed_fail,
+        abs(diff) > report_threshold - 0.05 and abs(diff) > stat_threshold as changed_show,
        
-        not short and not changed_fail and stat_threshold > report_threshold + 0.10 as unstable_fail,
-        not short and not changed_show and stat_threshold > report_threshold - 0.05 as unstable_show,
+        not changed_fail and stat_threshold > report_threshold + 0.10 as unstable_fail,
+        not changed_show and stat_threshold > report_threshold - 0.05 as unstable_show,
        
        left, right, diff, stat_threshold,
        if(report_threshold > 0, report_threshold, 0.10) as report_threshold,
@ -653,9 +620,9 @@ create table wall_clock_time_per_test engine Memory as select *

 create table test_time engine Memory as
    select test, sum(client) total_client_time,
-        maxIf(client, not short) query_max,
-        minIf(client, not short) query_min,
-        count(*) queries, sum(short) short_queries
+        max(client) query_max,
+        min(client) query_min,
+        count(*) queries
    from total_client_time_per_query full join queries using (test, query_index)
    group by test;

@ -695,7 +662,6 @@ create table test_times_report engine File(TSV, 'report/test-times.tsv') as
    select wall_clock_time_per_test.test, real,
        toDecimal64(total_client_time, 3),
        queries,
-        short_queries,
        toDecimal64(query_max, 3),
        toDecimal64(real / queries, 3) avg_real_per_query,
        toDecimal64(query_min, 3),
@ -734,32 +700,47 @@ create table queries_for_flamegraph engine File(TSVWithNamesAndTypes,
    select test, query_index from queries where unstable_show or changed_show
    ;

-- List of queries that have 'short' duration, but are not marked as 'short' by
-- the test author (we report them).
-create table unmarked_short_queries_report
-    engine File(TSV, 'report/unmarked-short-queries.tsv')
-    as select time, test, query_index, query_display_name
+
+create view shortness
+    as select 
+        (test, query_index) in
+            (select * from file('analyze/marked-short-queries.tsv', TSV,
+            'test text, query_index int'))
+            as marked_short,
+        time, test, query_index, query_display_name
    from (
-            select right time, test, query_index from queries where short
+            select right time, test, query_index from queries
            union all
            select time_median, test, query_index from partial_query_times
-                where time_median < $short_query_threshold
        ) times
        left join query_display_names
            on times.test = query_display_names.test
                and times.query_index = query_display_names.query_index
-    where (test, query_index) not in
-        (select * from file('analyze/marked-short-queries.tsv', TSV,
-            'test text, query_index int'))
-    order by test, query_index
    ;

+-- Report of queries that have inconsistent 'short' markings:
+-- 1) have short duration, but are not marked as 'short'
+-- 2) the reverse -- marked 'short' but take too long.
+-- The threshold for 2) is twice the threshold for 1), to avoid jitter.
+create table inconsistent_short_marking_report
+    engine File(TSV, 'report/inconsistent-short-marking.tsv')
+    as select
+        multiIf(marked_short and time > 0.1, 'marked as short but is too long',
+                not marked_short and time < 0.02, 'is short but not marked as such',
+                '') problem,
+        marked_short, time,
+        test, query_index, query_display_name
+    from shortness
+    where problem != ''
+    ;
+
+
 --------------------------------------------------------------------------------
 -- various compatibility data formats follow, not related to the main report

 -- keep the table in old format so that we can analyze new and old data together
 create table queries_old_format engine File(TSVWithNamesAndTypes, 'queries.rep')
-    as select short, changed_fail, unstable_fail, left, right, diff,
+    as select 0 short, changed_fail, unstable_fail, left, right, diff,
        stat_threshold, test, query_display_name query
    from queries
    ;
@ -1057,9 +1038,6 @@ case "$stage" in
    # Ignore the errors to collect the log and build at least some report, anyway
    time run_tests ||:
    ;&
-"run_benchmark")
-    time run_benchmark 2> >(tee -a run-errors.tsv 1>&2) ||:
-    ;&
 "get_profiles")
    # Check for huge pages.
    cat /sys/kernel/mm/transparent_hugepage/enabled > thp-enabled.txt ||:
--- a/docker/test/performance-comparison/perf.py
+++ b/docker/test/performance-comparison/perf.py
@ -1,17 +1,20 @@
 #!/usr/bin/python3

-import os
-import sys
-import itertools
-import clickhouse_driver
-import xml.etree.ElementTree as et
 import argparse
+import clickhouse_driver
+import itertools
+import functools
+import math
+import os
 import pprint
 import random
 import re
+import statistics
 import string
+import sys
 import time
 import traceback
+import xml.etree.ElementTree as et

 def tsv_escape(s):
    return s.replace('\\', '\\\\').replace('\t', '\\t').replace('\n', '\\n').replace('\r','')
@ -64,18 +67,13 @@ def substitute_parameters(query_templates, other_templates = []):
 # Build a list of test queries, substituting parameters to query templates,
 # and reporting the queries marked as short.
 test_queries = []
+is_short = []
 for e in root.findall('query'):
-    new_queries = []
-    if 'short' in e.attrib:
-        new_queries, [is_short] = substitute_parameters([e.text], [[e.attrib['short']]])
-        for i, s in enumerate(is_short):
-            # Don't print this if we only need to print the queries.
-            if eval(s) and not args.print_queries:
-                print(f'short\t{i + len(test_queries)}')
-    else:
-        new_queries = substitute_parameters([e.text])
-
+    new_queries, [new_is_short] = substitute_parameters([e.text], [[e.attrib.get('short', '0')]])
    test_queries += new_queries
+    is_short += [eval(s) for s in new_is_short]
+
+assert(len(test_queries) == len(is_short))


 # If we're only asked to print the queries, do that and exit
@ -84,6 +82,11 @@ if args.print_queries:
        print(q)
    exit(0)

+# Print short queries
+for i, s in enumerate(is_short):
+    if s:
+        print(f'short\t{i}')
+
 # If we're only asked to print the settings, do that and exit. These are settings
 # for clickhouse-benchmark, so we print them as command line arguments, e.g.
 # '--max_memory_usage=10000000'.
@ -118,7 +121,7 @@ if 'max_ignored_relative_change' in root.attrib:

 # Open connections
 servers = [{'host': host, 'port': port} for (host, port) in zip(args.host, args.port)]
-connections = [clickhouse_driver.Client(**server) for server in servers]
+all_connections = [clickhouse_driver.Client(**server) for server in servers]

 for s in servers:
    print('server\t{}\t{}'.format(s['host'], s['port']))
@ -128,7 +131,7 @@ for s in servers:
 # connection loses the changes in settings.
 drop_query_templates = [q.text for q in root.findall('drop_query')]
 drop_queries = substitute_parameters(drop_query_templates)
-for conn_index, c in enumerate(connections):
+for conn_index, c in enumerate(all_connections):
    for q in drop_queries:
        try:
            c.execute(q)
@ -144,7 +147,7 @@ for conn_index, c in enumerate(connections):
 # configurable). So the end result is uncertain, but hopefully we'll be able to
 # run at least some queries.
 settings = root.findall('settings/*')
-for conn_index, c in enumerate(connections):
+for conn_index, c in enumerate(all_connections):
    for s in settings:
        try:
            q = f"set {s.tag} = '{s.text}'"
@ -156,7 +159,7 @@ for conn_index, c in enumerate(connections):
 # Check tables that should exist. If they don't exist, just skip this test.
 tables = [e.text for e in root.findall('preconditions/table_exists')]
 for t in tables:
-    for c in connections:
+    for c in all_connections:
        try:
            res = c.execute("select 1 from {} limit 1".format(t))
        except:
@ -178,7 +181,7 @@ for q in create_queries:
            file = sys.stderr)
        sys.exit(1)

-for conn_index, c in enumerate(connections):
+for conn_index, c in enumerate(all_connections):
    for q in create_queries:
        c.execute(q)
        print(f'create\t{conn_index}\t{c.last_query.elapsed}\t{tsv_escape(q)}')
@ -186,7 +189,7 @@ for conn_index, c in enumerate(connections):
 # Run fill queries
 fill_query_templates = [q.text for q in root.findall('fill_query')]
 fill_queries = substitute_parameters(fill_query_templates)
-for conn_index, c in enumerate(connections):
+for conn_index, c in enumerate(all_connections):
    for q in fill_queries:
        c.execute(q)
        print(f'fill\t{conn_index}\t{c.last_query.elapsed}\t{tsv_escape(q)}')
@ -216,8 +219,8 @@ for query_index in queries_to_run:
    # new one. We want to run them on the new server only, so that the PR author
    # can ensure that the test works properly. Remember the errors we had on
    # each server.
-    query_error_on_connection = [None] * len(connections);
-    for conn_index, c in enumerate(connections):
+    query_error_on_connection = [None] * len(all_connections);
+    for conn_index, c in enumerate(all_connections):
        try:
            prewarm_id = f'{query_prefix}.prewarm0'
            res = c.execute(q, query_id = prewarm_id)
@ -244,21 +247,22 @@ for query_index in queries_to_run:

    if len(no_errors) == 0:
        continue
-    elif len(no_errors) < len(connections):
+    elif len(no_errors) < len(all_connections):
        print(f'partial\t{query_index}\t{no_errors}')

+    this_query_connections = [all_connections[index] for index in no_errors]
+
    # Now, perform measured runs.
    # Track the time spent by the client to process this query, so that we can
    # notice the queries that take long to process on the client side, e.g. by
    # sending excessive data.
    start_seconds = time.perf_counter()
    server_seconds = 0
-    for run in range(0, args.runs):
+    run = 0
+    while True:
        run_id = f'{query_prefix}.run{run}'
-        for conn_index, c in enumerate(connections):
-            if query_error_on_connection[conn_index]:
-                continue

+        for conn_index, c in enumerate(this_query_connections):
            try:
                res = c.execute(q, query_id = run_id)
            except Exception as e:
@ -267,8 +271,8 @@ for query_index in queries_to_run:
                e.message = run_id + ': ' + e.message
                raise

-            print(f'query\t{query_index}\t{run_id}\t{conn_index}\t{c.last_query.elapsed}')
            server_seconds += c.last_query.elapsed
+            print(f'query\t{query_index}\t{run_id}\t{conn_index}\t{c.last_query.elapsed}')

            if c.last_query.elapsed > 10:
                # Stop processing pathologically slow queries, to avoid timing out
@ -277,12 +281,37 @@ for query_index in queries_to_run:
                print(f'The query no. {query_index} is taking too long to run ({c.last_query.elapsed} s)', file=sys.stderr)
                exit(2)

+        # Be careful with the counter, after this line it's the next iteration
+        # already.
+        run += 1
+
+        # Try to run any query for at least the specified number of times,
+        # before considering other stop conditions.
+        if run < args.runs:
+            continue
+
+        # For very short queries we have a special mode where we run them for at
+        # least some time. The recommended lower bound of run time for "normal"
+        # queries is about 0.1 s, and we run them about 10 times, giving the
+        # time per query per server of about one second. Use this value as a
+        # reference for "short" queries.
+        if is_short[query_index]:
+            if server_seconds >= 2 * len(this_query_connections):
+                break
+            # Also limit the number of runs, so that we don't go crazy processing
+            # the results -- 'eqmed.sql' is really suboptimal.
+            if run >= 500:
+                break
+        else:
+            if run >= args.runs:
+                break
+
    client_seconds = time.perf_counter() - start_seconds
    print(f'client-time\t{query_index}\t{client_seconds}\t{server_seconds}')

 # Run drop queries
 drop_queries = substitute_parameters(drop_query_templates)
-for conn_index, c in enumerate(connections):
+for conn_index, c in enumerate(all_connections):
    for q in drop_queries:
        c.execute(q)
        print(f'drop\t{conn_index}\t{c.last_query.elapsed}\t{tsv_escape(q)}')
--- a/docker/test/performance-comparison/report.py
+++ b/docker/test/performance-comparison/report.py
@ -98,6 +98,9 @@ th {{

 tr:nth-child(odd) td {{filter: brightness(90%);}}

+.inconsistent-short-marking tr :nth-child(2),
+.inconsistent-short-marking tr :nth-child(3),
+.inconsistent-short-marking tr :nth-child(5),
 .all-query-times tr :nth-child(1),
 .all-query-times tr :nth-child(2),
 .all-query-times tr :nth-child(3),
@ -126,7 +129,6 @@ tr:nth-child(odd) td {{filter: brightness(90%);}}
 .test-times tr :nth-child(5),
 .test-times tr :nth-child(6),
 .test-times tr :nth-child(7),
-.test-times tr :nth-child(8),
 .concurrent-benchmarks tr :nth-child(2),
 .concurrent-benchmarks tr :nth-child(3),
 .concurrent-benchmarks tr :nth-child(4),
@ -205,9 +207,11 @@ def tableStart(title):
    global table_anchor
    table_anchor = cls
    anchor = currentTableAnchor()
+    help_anchor = '-'.join(title.lower().split(' '));
    return f"""
        <h2 id="{anchor}">
            <a class="cancela" href="#{anchor}">{title}</a>
+            <a class="cancela" href="https://github.com/ClickHouse/ClickHouse/tree/master/docker/test/performance-comparison#{help_anchor}"><sup style="color: #888">?</sup></a>
        </h2>
        <table class="{cls}">
    """
@ -250,7 +254,7 @@ def addSimpleTable(caption, columns, rows, pos=None):
 def add_tested_commits():
    global report_errors
    try:
-        addSimpleTable('Tested commits', ['Old', 'New'],
+        addSimpleTable('Tested Commits', ['Old', 'New'],
            [['<pre>{}</pre>'.format(x) for x in
                [open('left-commit.txt').read(),
                 open('right-commit.txt').read()]]])
@ -276,7 +280,7 @@ def add_report_errors():
    if not report_errors:
        return

-    text = tableStart('Errors while building the report')
+    text = tableStart('Errors while Building the Report')
    text += tableHeader(['Error'])
    for x in report_errors:
        text += tableRow([x])
@ -290,7 +294,7 @@ def add_errors_explained():
        return

    text = '<a name="fail1"/>'
-    text += tableStart('Error summary')
+    text += tableStart('Error Summary')
    text += tableHeader(['Description'])
    for row in errors_explained:
        text += tableRow(row)
@ -308,26 +312,26 @@ if args.report == 'main':

    run_error_rows = tsvRows('run-errors.tsv')
    error_tests += len(run_error_rows)
-    addSimpleTable('Run errors', ['Test', 'Error'], run_error_rows)
+    addSimpleTable('Run Errors', ['Test', 'Error'], run_error_rows)
    if run_error_rows:
        errors_explained.append([f'<a href="#{currentTableAnchor()}">There were some errors while running the tests</a>']);


    slow_on_client_rows = tsvRows('report/slow-on-client.tsv')
    error_tests += len(slow_on_client_rows)
-    addSimpleTable('Slow on client',
+    addSimpleTable('Slow on Client',
                     ['Client time,&nbsp;s', 'Server time,&nbsp;s', 'Ratio', 'Test', 'Query'],
                     slow_on_client_rows)
    if slow_on_client_rows:
        errors_explained.append([f'<a href="#{currentTableAnchor()}">Some queries are taking noticeable time client-side (missing `FORMAT Null`?)</a>']);

-    unmarked_short_rows = tsvRows('report/unmarked-short-queries.tsv')
+    unmarked_short_rows = tsvRows('report/inconsistent-short-marking.tsv')
    error_tests += len(unmarked_short_rows)
-    addSimpleTable('Short queries not marked as short',
-        ['New client time, s', 'Test', '#', 'Query'],
+    addSimpleTable('Inconsistent Short Marking',
+        ['Problem', 'Is marked as short', 'New client time, s', 'Test', '#', 'Query'],
        unmarked_short_rows)
    if unmarked_short_rows:
-        errors_explained.append([f'<a href="#{currentTableAnchor()}">Some queries have short duration but are not explicitly marked as "short"</a>']);
+        errors_explained.append([f'<a href="#{currentTableAnchor()}">Some queries have inconsistent short marking</a>']);

    def add_partial():
        rows = tsvRows('report/partial-queries-report.tsv')
@ -335,7 +339,7 @@ if args.report == 'main':
            return

        global unstable_partial_queries, slow_average_tests, tables
-        text = tableStart('Partial queries')
+        text = tableStart('Partial Queries')
        columns = ['Median time, s', 'Relative time variance', 'Test', '#', 'Query']
        text += tableHeader(columns)
        attrs = ['' for c in columns]
@ -366,7 +370,7 @@ if args.report == 'main':

        global faster_queries, slower_queries, tables

-        text = tableStart('Changes in performance')
+        text = tableStart('Changes in Performance')
        columns = [
            'Old,&nbsp;s',                                          # 0
            'New,&nbsp;s',                                          # 1
@ -423,7 +427,7 @@ if args.report == 'main':
            'Query' #7
        ]

-        text = tableStart('Unstable queries')
+        text = tableStart('Unstable Queries')
        text += tableHeader(columns)

        attrs = ['' for c in columns]
@ -444,9 +448,9 @@ if args.report == 'main':
    add_unstable_queries()

    skipped_tests_rows = tsvRows('analyze/skipped-tests.tsv')
-    addSimpleTable('Skipped tests', ['Test', 'Reason'], skipped_tests_rows)
+    addSimpleTable('Skipped Tests', ['Test', 'Reason'], skipped_tests_rows)

-    addSimpleTable('Test performance changes',
+    addSimpleTable('Test Performance Changes',
        ['Test', 'Ratio of speedup&nbsp;(-) or slowdown&nbsp;(+)', 'Queries', 'Total not OK', 'Changed perf', 'Unstable'],
        tsvRows('report/test-perf-changes.tsv'))

@ -460,36 +464,35 @@ if args.report == 'main':
            'Test',                                               #0
            'Wall clock time,&nbsp;s',                            #1
            'Total client time,&nbsp;s',                          #2
-            'Total queries',                                      #3
-            'Ignored short queries',                              #4
-            'Longest query<br>(sum for all runs),&nbsp;s',        #5
-            'Avg wall clock time<br>(sum for all runs),&nbsp;s',  #6
-            'Shortest query<br>(sum for all runs),&nbsp;s',       #7
-            # 'Runs'                                              #8
+            'Total queries',                                 #3
+            'Longest query<br>(sum for all runs),&nbsp;s',        #4
+            'Avg wall clock time<br>(sum for all runs),&nbsp;s',  #5
+            'Shortest query<br>(sum for all runs),&nbsp;s',       #6
+            # 'Runs'                                              #7
            ]

-        text = tableStart('Test times')
+        text = tableStart('Test Times')
        text += tableHeader(columns)

        allowed_average_run_time = 3.75 # 60 seconds per test at 7 runs
        attrs = ['' for c in columns]
        for r in rows:
            anchor = f'{currentTableAnchor()}.{r[0]}'
-            total_runs = (int(r[8]) + 1) * 2  # one prewarm run, two servers
-            if float(r[6]) > allowed_average_run_time * total_runs:
+            total_runs = (int(r[7]) + 1) * 2  # one prewarm run, two servers
+            if float(r[5]) > allowed_average_run_time * total_runs:
                # FIXME should be 15s max -- investigate parallel_insert
                slow_average_tests += 1
-                attrs[6] = f'style="background: {color_bad}"'
+                attrs[5] = f'style="background: {color_bad}"'
                errors_explained.append([f'<a href="#{anchor}">The test \'{r[0]}\' is too slow to run as a whole. Investigate whether the create and fill queries can be sped up'])
            else:
-                attrs[6] = ''
+                attrs[5] = ''

-            if float(r[5]) > allowed_single_run_time * total_runs:
+            if float(r[4]) > allowed_single_run_time * total_runs:
                slow_average_tests += 1
-                attrs[5] = f'style="background: {color_bad}"'
+                attrs[4] = f'style="background: {color_bad}"'
                errors_explained.append([f'<a href="./all-queries.html#all-query-times.{r[0]}.0">Some query of the test \'{r[0]}\' is too slow to run. See the all queries report'])
            else:
-                attrs[5] = ''
+                attrs[4] = ''

            text += tableRow(r, attrs, anchor)

@ -498,74 +501,7 @@ if args.report == 'main':

    add_test_times()

-    def add_benchmark_results():
-        if not os.path.isfile('benchmark/website-left.json'):
-            return
-
-        json_reports = [json.load(open(f'benchmark/website-{x}.json')) for x in ['left', 'right']]
-        stats = [next(iter(x.values()))["statistics"] for x in json_reports]
-        qps = [x["QPS"] for x in stats]
-        queries = [x["num_queries"] for x in stats]
-        errors = [x["num_errors"] for x in stats]
-        relative_diff = (qps[1] - qps[0]) / max(0.01, qps[0]);
-        times_diff = max(qps) / max(0.01, min(qps))
-
-        all_rows = []
-        header = ['Benchmark', 'Metric', 'Old', 'New', 'Relative difference', 'Times difference'];
-
-        attrs = ['' for x in header]
-        row = ['website', 'queries', f'{queries[0]:d}', f'{queries[1]:d}', '--', '--']
-        attrs[0] = 'rowspan=2'
-        all_rows.append([row, attrs])
-
-        attrs = ['' for x in header]
-        row = [None, 'queries/s', f'{qps[0]:.3f}', f'{qps[1]:.3f}', f'{relative_diff:.3f}', f'x{times_diff:.3f}']
-        if abs(relative_diff) > 0.1:
-            # More queries per second is better.
-            if relative_diff > 0.:
-                attrs[4] = f'style="background: {color_good}"'
-            else:
-                attrs[4] = f'style="background: {color_bad}"'
-        else:
-            attrs[4] = ''
-        all_rows.append([row, attrs]);
-
-        if max(errors):
-            all_rows[0][1][0] = "rowspan=3"
-            row = [''] * (len(header))
-            attrs = ['' for x in header]
-
-            attrs[0] = None
-            row[1] = 'errors'
-            row[2] = f'{errors[0]:d}'
-            row[3] = f'{errors[1]:d}'
-            row[4] = '--'
-            row[5] = '--'
-            if errors[0]:
-                attrs[2] += f' style="background: {color_bad}" '
-            if errors[1]:
-                attrs[3] += f' style="background: {color_bad}" '
-
-            all_rows.append([row, attrs])
-
-        text = tableStart('Concurrent benchmarks')
-        text += tableHeader(header)
-        for row, attrs in all_rows:
-            text += tableRow(row, attrs)
-        text += tableEnd()
-
-        global tables
-        tables.append(text)
-
-    try:
-        add_benchmark_results()
-    except:
-        report_errors.append(
-            traceback.format_exception_only(
-                *sys.exc_info()[:2])[-1])
-        pass
-
-    addSimpleTable('Metric changes',
+    addSimpleTable('Metric Changes',
        ['Metric', 'Old median value', 'New median value',
            'Relative difference', 'Times difference'],
        tsvRows('metrics/changes.tsv'))
@ -656,7 +592,7 @@ elif args.report == 'all-queries':
            'Query',                                  #9
            ]

-        text = tableStart('All query times')
+        text = tableStart('All Query Times')
        text += tableHeader(columns)

        attrs = ['' for c in columns]
--- a/src/Storages/MergeTree/MergeTreeData.cpp
+++ b/src/Storages/MergeTree/MergeTreeData.cpp
@ -1823,16 +1823,27 @@ MergeTreeData::DataPartsVector MergeTreeData::getActivePartsToReplace(
 }


-void MergeTreeData::renameTempPartAndAdd(MutableDataPartPtr & part, SimpleIncrement * increment, Transaction * out_transaction)
+bool MergeTreeData::renameTempPartAndAdd(MutableDataPartPtr & part, SimpleIncrement * increment, Transaction * out_transaction)
 {
-    auto removed = renameTempPartAndReplace(part, increment, out_transaction);
-    if (!removed.empty())
-        throw Exception("Added part " + part->name + " covers " + toString(removed.size())
-            + " existing part(s) (including " + removed[0]->name + ")", ErrorCodes::LOGICAL_ERROR);
+    if (out_transaction && &out_transaction->data != this)
+        throw Exception("MergeTreeData::Transaction for one table cannot be used with another. It is a bug.",
+            ErrorCodes::LOGICAL_ERROR);
+
+    DataPartsVector covered_parts;
+    {
+        auto lock = lockParts();
+        if (!renameTempPartAndReplace(part, increment, out_transaction, lock, &covered_parts))
+            return false;
+    }
+    if (!covered_parts.empty())
+        throw Exception("Added part " + part->name + " covers " + toString(covered_parts.size())
+            + " existing part(s) (including " + covered_parts[0]->name + ")", ErrorCodes::LOGICAL_ERROR);
+
+    return true;
 }


-void MergeTreeData::renameTempPartAndReplace(
+bool MergeTreeData::renameTempPartAndReplace(
    MutableDataPartPtr & part, SimpleIncrement * increment, Transaction * out_transaction,
    std::unique_lock<std::mutex> & lock, DataPartsVector * out_covered_parts)
 {
@ -1863,7 +1874,7 @@ void MergeTreeData::renameTempPartAndReplace(
        part_info.mutation = 0; /// it's equal to min_block by default
        part_name = part->getNewName(part_info);
    }
-    else
+    else /// Parts from ReplicatedMergeTree already have names
        part_name = part->name;

    LOG_TRACE(log, "Renaming temporary part {} to {}.", part->relative_path, part_name);
@ -1886,7 +1897,7 @@ void MergeTreeData::renameTempPartAndReplace(
    if (covering_part)
    {
        LOG_WARNING(log, "Tried to add obsolete part {} covered by {}", part_name, covering_part->getNameWithState());
-        return;
+        return false;
    }

    /// All checks are passed. Now we can rename the part on disk.
@ -1931,6 +1942,8 @@ void MergeTreeData::renameTempPartAndReplace(
        for (DataPartPtr & covered_part : covered_parts)
            out_covered_parts->emplace_back(std::move(covered_part));
    }
+
+    return true;
 }

 MergeTreeData::DataPartsVector MergeTreeData::renameTempPartAndReplace(
@ -1968,6 +1981,22 @@ void MergeTreeData::removePartsFromWorkingSet(const MergeTreeData::DataPartsVect
    }
 }

+void MergeTreeData::removePartsFromWorkingSetImmediatelyAndSetTemporaryState(const DataPartsVector & remove)
+{
+    auto lock = lockParts();
+
+    for (const auto & part : remove)
+    {
+        auto it_part = data_parts_by_info.find(part->info);
+        if (it_part == data_parts_by_info.end())
+            throw Exception("Part " + part->getNameWithState() + " not found in data_parts", ErrorCodes::LOGICAL_ERROR);
+
+        modifyPartState(part, IMergeTreeDataPart::State::Temporary);
+        /// Erase immediately
+        data_parts_indexes.erase(it_part);
+    }
+}
+
 void MergeTreeData::removePartsFromWorkingSet(const DataPartsVector & remove, bool clear_without_timeout, DataPartsLock * acquired_lock)
 {
    auto lock = (acquired_lock) ? DataPartsLock() : lockParts();
@ -3087,6 +3116,25 @@ MergeTreeData::DataPartPtr MergeTreeData::getAnyPartInPartition(
    return nullptr;
 }

+
+void MergeTreeData::Transaction::rollbackPartsToTemporaryState()
+{
+    if (!isEmpty())
+    {
+        std::stringstream ss;
+        ss << " Rollbacking parts state to temporary and removing from working set:";
+        for (const auto & part : precommitted_parts)
+            ss << " " << part->relative_path;
+        ss << ".";
+        LOG_DEBUG(data.log, "Undoing transaction.{}", ss.str());
+
+        data.removePartsFromWorkingSetImmediatelyAndSetTemporaryState(
+            DataPartsVector(precommitted_parts.begin(), precommitted_parts.end()));
+    }
+
+    clear();
+}
+
 void MergeTreeData::Transaction::rollback()
 {
    if (!isEmpty())
@ -3251,7 +3299,8 @@ MergeTreeData::MutableDataPartPtr MergeTreeData::cloneAndLoadDataPartOnSameDisk(
    }
    if (!does_storage_policy_allow_same_disk)
        throw Exception(
-            "Could not clone and load part " + quoteString(src_part->getFullPath()) + " because disk does not belong to storage policy", ErrorCodes::BAD_ARGUMENTS);
+            "Could not clone and load part " + quoteString(src_part->getFullPath()) + " because disk does not belong to storage policy",
+            ErrorCodes::BAD_ARGUMENTS);

    String dst_part_name = src_part->getNewName(dst_part_info);
    String tmp_dst_part_name = tmp_part_prefix + dst_part_name;
--- a/src/Storages/MergeTree/MergeTreeData.h
+++ b/src/Storages/MergeTree/MergeTreeData.h
@ -225,6 +225,10 @@ public:

        void rollback();

+        /// Immediately remove parts from table's data_parts set and change part
+        /// state to temporary. Useful for new parts which not present in table.
+        void rollbackPartsToTemporaryState();
+
        size_t size() const { return precommitted_parts.size(); }
        bool isEmpty() const { return precommitted_parts.empty(); }

@ -426,7 +430,8 @@ public:
    /// If out_transaction != nullptr, adds the part in the PreCommitted state (the part will be added to the
    /// active set later with out_transaction->commit()).
    /// Else, commits the part immediately.
-    void renameTempPartAndAdd(MutableDataPartPtr & part, SimpleIncrement * increment = nullptr, Transaction * out_transaction = nullptr);
+    /// Returns true if part was added. Returns false if part is covered by bigger part.
+    bool renameTempPartAndAdd(MutableDataPartPtr & part, SimpleIncrement * increment = nullptr, Transaction * out_transaction = nullptr);

    /// The same as renameTempPartAndAdd but the block range of the part can contain existing parts.
    /// Returns all parts covered by the added part (in ascending order).
@ -435,10 +440,16 @@ public:
        MutableDataPartPtr & part, SimpleIncrement * increment = nullptr, Transaction * out_transaction = nullptr);

    /// Low-level version of previous one, doesn't lock mutex
-    void renameTempPartAndReplace(
+    bool renameTempPartAndReplace(
            MutableDataPartPtr & part, SimpleIncrement * increment, Transaction * out_transaction, DataPartsLock & lock,
            DataPartsVector * out_covered_parts = nullptr);

+
+    /// Remove parts from working set immediately (without wait for background
+    /// process). Transfer part state to temporary. Have very limited usage only
+    /// for new parts which don't already present in table.
+    void removePartsFromWorkingSetImmediatelyAndSetTemporaryState(const DataPartsVector & remove);
+
    /// Removes parts from the working set parts.
    /// Parts in add must already be in data_parts with PreCommitted, Committed, or Outdated states.
    /// If clear_without_timeout is true, the parts will be deleted at once, or during the next call to
--- a/src/Storages/MergeTree/ReplicatedMergeTreeBlockOutputStream.cpp
+++ b/src/Storages/MergeTree/ReplicatedMergeTreeBlockOutputStream.cpp
@ -27,6 +27,9 @@ namespace ErrorCodes
    extern const int INSERT_WAS_DEDUPLICATED;
    extern const int TIMEOUT_EXCEEDED;
    extern const int NO_ACTIVE_REPLICAS;
+    extern const int DUPLICATE_DATA_PART;
+    extern const int PART_IS_TEMPORARILY_LOCKED;
+    extern const int LOGICAL_ERROR;
 }


@ -96,7 +99,8 @@ void ReplicatedMergeTreeBlockOutputStream::checkQuorumPrecondition(zkutil::ZooKe

    auto quorum_status = quorum_status_future.get();
    if (quorum_status.error != Coordination::Error::ZNONODE)
-        throw Exception("Quorum for previous write has not been satisfied yet. Status: " + quorum_status.data, ErrorCodes::UNSATISFIED_QUORUM_FOR_PREVIOUS_WRITE);
+        throw Exception("Quorum for previous write has not been satisfied yet. Status: " + quorum_status.data,
+                        ErrorCodes::UNSATISFIED_QUORUM_FOR_PREVIOUS_WRITE);

    /// Both checks are implicitly made also later (otherwise there would be a race condition).

@ -116,7 +120,6 @@ void ReplicatedMergeTreeBlockOutputStream::write(const Block & block)
 {
    last_block_is_duplicate = false;

-    /// TODO Is it possible to not lock the table structure here?
    storage.delayInsertOrThrowIfNeeded(&storage.partial_shutdown_event);

    auto zookeeper = storage.getZooKeeper();
@ -214,170 +217,258 @@ void ReplicatedMergeTreeBlockOutputStream::commitPart(
    metadata_snapshot->check(part->getColumns());
    assertSessionIsNotExpired(zookeeper);

-    /// Obtain incremental block number and lock it. The lock holds our intention to add the block to the filesystem.
-    /// We remove the lock just after renaming the part. In case of exception, block number will be marked as abandoned.
-    /// Also, make deduplication check. If a duplicate is detected, no nodes are created.
+    String temporary_part_relative_path = part->relative_path;

-    /// Allocate new block number and check for duplicates
-    bool deduplicate_block = !block_id.empty();
-    String block_id_path = deduplicate_block ? storage.zookeeper_path + "/blocks/" + block_id : "";
-    auto block_number_lock = storage.allocateBlockNumber(part->info.partition_id, zookeeper, block_id_path);
+    /// There is one case when we need to retry transaction in a loop.
+    /// But don't do it too many times - just as defensive measure.
+    size_t loop_counter = 0;
+    constexpr size_t max_iterations = 10;

-    if (!block_number_lock)
+    bool is_already_existing_part = false;
+
+    while (true)
    {
-        LOG_INFO(log, "Block with ID {} already exists; ignoring it.", block_id);
-        part->is_duplicate = true;
-        last_block_is_duplicate = true;
-        ProfileEvents::increment(ProfileEvents::DuplicatedInsertedBlocks);
-        return;
-    }
+        /// Obtain incremental block number and lock it. The lock holds our intention to add the block to the filesystem.
+        /// We remove the lock just after renaming the part. In case of exception, block number will be marked as abandoned.
+        /// Also, make deduplication check. If a duplicate is detected, no nodes are created.

-    Int64 block_number = block_number_lock->getNumber();
+        /// Allocate new block number and check for duplicates
+        bool deduplicate_block = !block_id.empty();
+        String block_id_path = deduplicate_block ? storage.zookeeper_path + "/blocks/" + block_id : "";
+        auto block_number_lock = storage.allocateBlockNumber(part->info.partition_id, zookeeper, block_id_path);

-    /// Set part attributes according to part_number. Prepare an entry for log.
+        /// Prepare transaction to ZooKeeper
+        /// It will simultaneously add information about the part to all the necessary places in ZooKeeper and remove block_number_lock.
+        Coordination::Requests ops;

-    part->info.min_block = block_number;
-    part->info.max_block = block_number;
-    part->info.level = 0;
-
-    String part_name = part->getNewName(part->info);
-    part->name = part_name;
-
-    StorageReplicatedMergeTree::LogEntry log_entry;
-    log_entry.type = StorageReplicatedMergeTree::LogEntry::GET_PART;
-    log_entry.create_time = time(nullptr);
-    log_entry.source_replica = storage.replica_name;
-    log_entry.new_part_name = part_name;
-    log_entry.quorum = quorum;
-    log_entry.block_id = block_id;
-    log_entry.new_part_type = part->getType();
-
-    /// Simultaneously add information about the part to all the necessary places in ZooKeeper and remove block_number_lock.
-
-    /// Information about the part.
-    Coordination::Requests ops;
-
-    storage.getCommitPartOps(ops, part, block_id_path);
-
-    /// Replication log.
-    ops.emplace_back(zkutil::makeCreateRequest(
-        storage.zookeeper_path + "/log/log-",
-        log_entry.toString(),
-        zkutil::CreateMode::PersistentSequential));
-
-    /// Deletes the information that the block number is used for writing.
-    block_number_lock->getUnlockOps(ops);
-
-    /** If you need a quorum - create a node in which the quorum is monitored.
-        * (If such a node already exists, then someone has managed to make another quorum record at the same time, but for it the quorum has not yet been reached.
-        *  You can not do the next quorum record at this time.)
-        */
-    if (quorum)
-    {
-        ReplicatedMergeTreeQuorumEntry quorum_entry;
-        quorum_entry.part_name = part_name;
-        quorum_entry.required_number_of_replicas = quorum;
-        quorum_entry.replicas.insert(storage.replica_name);
-
-        /** At this point, this node will contain information that the current replica received a part.
-            * When other replicas will receive this part (in the usual way, processing the replication log),
-            *  they will add themselves to the contents of this node.
-            * When it contains information about `quorum` number of replicas, this node is deleted,
-            *  which indicates that the quorum has been reached.
-            */
-
-        ops.emplace_back(
-            zkutil::makeCreateRequest(
-                quorum_info.status_path,
-                quorum_entry.toString(),
-                zkutil::CreateMode::Persistent));
-
-        /// Make sure that during the insertion time, the replica was not reinitialized or disabled (when the server is finished).
-        ops.emplace_back(
-            zkutil::makeCheckRequest(
-                storage.replica_path + "/is_active",
-                quorum_info.is_active_node_version));
-
-        /// Unfortunately, just checking the above is not enough, because `is_active` node can be deleted and reappear with the same version.
-        /// But then the `host` value will change. We will check this.
-        /// It's great that these two nodes change in the same transaction (see MergeTreeRestartingThread).
-        ops.emplace_back(
-            zkutil::makeCheckRequest(
-                storage.replica_path + "/host",
-                quorum_info.host_node_version));
-    }
-
-    MergeTreeData::Transaction transaction(storage); /// If you can not add a part to ZK, we'll remove it back from the working set.
-    storage.renameTempPartAndAdd(part, nullptr, &transaction);
-
-    Coordination::Responses responses;
-    Coordination::Error multi_code = zookeeper->tryMultiNoThrow(ops, responses); /// 1 RTT
-
-    if (multi_code == Coordination::Error::ZOK)
-    {
-        transaction.commit();
-        storage.merge_selecting_task->schedule();
-
-        /// Lock nodes have been already deleted, do not delete them in destructor
-        block_number_lock->assumeUnlocked();
-    }
-    else if (multi_code == Coordination::Error::ZCONNECTIONLOSS
-        || multi_code == Coordination::Error::ZOPERATIONTIMEOUT)
-    {
-        /** If the connection is lost, and we do not know if the changes were applied, we can not delete the local part
-          *  if the changes were applied, the inserted block appeared in `/blocks/`, and it can not be inserted again.
-          */
-        transaction.commit();
-        storage.enqueuePartForCheck(part->name, MAX_AGE_OF_LOCAL_PART_THAT_WASNT_ADDED_TO_ZOOKEEPER);
-
-        /// We do not know whether or not data has been inserted.
-        throw Exception("Unknown status, client must retry. Reason: " + String(Coordination::errorMessage(multi_code)),
-            ErrorCodes::UNKNOWN_STATUS_OF_INSERT);
-    }
-    else if (Coordination::isUserError(multi_code))
-    {
-        String failed_op_path = zkutil::KeeperMultiException(multi_code, ops, responses).getPathForFirstFailedOp();
-
-        if (multi_code == Coordination::Error::ZNODEEXISTS && deduplicate_block && failed_op_path == block_id_path)
+        Int64 block_number = 0;
+        String existing_part_name;
+        if (block_number_lock)
        {
-            /// Block with the same id have just appeared in table (or other replica), rollback the insertion.
-            LOG_INFO(log, "Block with ID {} already exists; ignoring it (removing part {})", block_id, part->name);
+            is_already_existing_part = false;
+            block_number = block_number_lock->getNumber();

-            part->is_duplicate = true;
-            transaction.rollback();
-            last_block_is_duplicate = true;
-            ProfileEvents::increment(ProfileEvents::DuplicatedInsertedBlocks);
-        }
-        else if (multi_code == Coordination::Error::ZNODEEXISTS && failed_op_path == quorum_info.status_path)
-        {
-            transaction.rollback();
+            /// Set part attributes according to part_number. Prepare an entry for log.

-            throw Exception("Another quorum insert has been already started", ErrorCodes::UNSATISFIED_QUORUM_FOR_PREVIOUS_WRITE);
+            part->info.min_block = block_number;
+            part->info.max_block = block_number;
+            part->info.level = 0;
+
+            part->name = part->getNewName(part->info);
+
+            /// Will add log entry about new part.
+
+            StorageReplicatedMergeTree::LogEntry log_entry;
+            log_entry.type = StorageReplicatedMergeTree::LogEntry::GET_PART;
+            log_entry.create_time = time(nullptr);
+            log_entry.source_replica = storage.replica_name;
+            log_entry.new_part_name = part->name;
+            log_entry.quorum = quorum;
+            log_entry.block_id = block_id;
+            log_entry.new_part_type = part->getType();
+
+            ops.emplace_back(zkutil::makeCreateRequest(
+                storage.zookeeper_path + "/log/log-",
+                log_entry.toString(),
+                zkutil::CreateMode::PersistentSequential));
+
+            /// Deletes the information that the block number is used for writing.
+            block_number_lock->getUnlockOps(ops);
+
+            /** If we need a quorum - create a node in which the quorum is monitored.
+              * (If such a node already exists, then someone has managed to make another quorum record at the same time,
+              *  but for it the quorum has not yet been reached.
+              *  You can not do the next quorum record at this time.)
+              */
+            if (quorum)
+            {
+                ReplicatedMergeTreeQuorumEntry quorum_entry;
+                quorum_entry.part_name = part->name;
+                quorum_entry.required_number_of_replicas = quorum;
+                quorum_entry.replicas.insert(storage.replica_name);
+
+                /** At this point, this node will contain information that the current replica received a part.
+                    * When other replicas will receive this part (in the usual way, processing the replication log),
+                    *  they will add themselves to the contents of this node.
+                    * When it contains information about `quorum` number of replicas, this node is deleted,
+                    *  which indicates that the quorum has been reached.
+                    */
+
+                ops.emplace_back(
+                    zkutil::makeCreateRequest(
+                        quorum_info.status_path,
+                        quorum_entry.toString(),
+                        zkutil::CreateMode::Persistent));
+
+                /// Make sure that during the insertion time, the replica was not reinitialized or disabled (when the server is finished).
+                ops.emplace_back(
+                    zkutil::makeCheckRequest(
+                        storage.replica_path + "/is_active",
+                        quorum_info.is_active_node_version));
+
+                /// Unfortunately, just checking the above is not enough, because `is_active`
+                /// node can be deleted and reappear with the same version.
+                /// But then the `host` value will change. We will check this.
+                /// It's great that these two nodes change in the same transaction (see MergeTreeRestartingThread).
+                ops.emplace_back(
+                    zkutil::makeCheckRequest(
+                        storage.replica_path + "/host",
+                        quorum_info.host_node_version));
+            }
        }
        else
        {
-            /// NOTE: We could be here if the node with the quorum existed, but was quickly removed.
-            transaction.rollback();
-            throw Exception("Unexpected logical error while adding block " + toString(block_number) + " with ID '" + block_id + "': "
-                            + Coordination::errorMessage(multi_code) + ", path " + failed_op_path,
-                            ErrorCodes::UNEXPECTED_ZOOKEEPER_ERROR);
+            is_already_existing_part = true;
+
+            /// This block was already written to some replica. Get the part name for it.
+            /// Note: race condition with DROP PARTITION operation is possible. User will get "No node" exception and it is Ok.
+            existing_part_name = zookeeper->get(storage.zookeeper_path + "/blocks/" + block_id);
+
+            /// If it exists on our replica, ignore it.
+            if (storage.getActiveContainingPart(existing_part_name))
+            {
+                LOG_INFO(log, "Block with ID {} already exists locally as part {}; ignoring it.", block_id, existing_part_name);
+                part->is_duplicate = true;
+                last_block_is_duplicate = true;
+                ProfileEvents::increment(ProfileEvents::DuplicatedInsertedBlocks);
+                return;
+            }
+
+            LOG_INFO(log, "Block with ID {} already exists on other replicas as part {}; will write it locally with that name.",
+                block_id, existing_part_name);
+
+            /// If it does not exist, we will write a new part with existing name.
+            /// Note that it may also appear on filesystem right now in PreCommitted state due to concurrent inserts of the same data.
+            /// It will be checked when we will try to rename directory.
+
+            part->name = existing_part_name;
+            part->info = MergeTreePartInfo::fromPartName(existing_part_name, storage.format_version);
+            /// Used only for exception messages.
+            block_number = part->info.min_block;
+
+
+            /// Do not check for duplicate on commit to ZK.
+            block_id_path.clear();
        }
-    }
-    else if (Coordination::isHardwareError(multi_code))
-    {
-        transaction.rollback();
-        throw Exception("Unrecoverable network error while adding block " + toString(block_number) + " with ID '" + block_id + "': "
-                        + Coordination::errorMessage(multi_code), ErrorCodes::UNEXPECTED_ZOOKEEPER_ERROR);
-    }
-    else
-    {
-        transaction.rollback();
-        throw Exception("Unexpected ZooKeeper error while adding block " + toString(block_number) + " with ID '" + block_id + "': "
-                        + Coordination::errorMessage(multi_code), ErrorCodes::UNEXPECTED_ZOOKEEPER_ERROR);
+
+        /// Information about the part.
+        storage.getCommitPartOps(ops, part, block_id_path);
+
+        MergeTreeData::Transaction transaction(storage); /// If you can not add a part to ZK, we'll remove it back from the working set.
+        bool renamed = false;
+        try
+        {
+            renamed = storage.renameTempPartAndAdd(part, nullptr, &transaction);
+        }
+        catch (const Exception & e)
+        {
+            if (e.code() != ErrorCodes::DUPLICATE_DATA_PART
+                && e.code() != ErrorCodes::PART_IS_TEMPORARILY_LOCKED)
+                throw;
+        }
+        if (!renamed)
+        {
+            if (is_already_existing_part)
+            {
+                LOG_INFO(log, "Part {} is duplicate and it is already written by concurrent request or fetched; ignoring it.", part->name);
+                return;
+            }
+            else
+                throw Exception(ErrorCodes::LOGICAL_ERROR, "Part with name {} is already written by concurrent request."
+                    " It should not happen for non-duplicate data parts because unique names are assigned for them. It's a bug",
+                    part->name);
+        }
+
+        Coordination::Responses responses;
+        Coordination::Error multi_code = zookeeper->tryMultiNoThrow(ops, responses); /// 1 RTT
+
+        if (multi_code == Coordination::Error::ZOK)
+        {
+            transaction.commit();
+            storage.merge_selecting_task->schedule();
+
+            /// Lock nodes have been already deleted, do not delete them in destructor
+            if (block_number_lock)
+                block_number_lock->assumeUnlocked();
+        }
+        else if (multi_code == Coordination::Error::ZCONNECTIONLOSS
+            || multi_code == Coordination::Error::ZOPERATIONTIMEOUT)
+        {
+            /** If the connection is lost, and we do not know if the changes were applied, we can not delete the local part
+              *  if the changes were applied, the inserted block appeared in `/blocks/`, and it can not be inserted again.
+              */
+            transaction.commit();
+            storage.enqueuePartForCheck(part->name, MAX_AGE_OF_LOCAL_PART_THAT_WASNT_ADDED_TO_ZOOKEEPER);
+
+            /// We do not know whether or not data has been inserted.
+            throw Exception("Unknown status, client must retry. Reason: " + String(Coordination::errorMessage(multi_code)),
+                ErrorCodes::UNKNOWN_STATUS_OF_INSERT);
+        }
+        else if (Coordination::isUserError(multi_code))
+        {
+            String failed_op_path = zkutil::KeeperMultiException(multi_code, ops, responses).getPathForFirstFailedOp();
+
+            if (multi_code == Coordination::Error::ZNODEEXISTS && deduplicate_block && failed_op_path == block_id_path)
+            {
+                /// Block with the same id have just appeared in table (or other replica), rollback thee insertion.
+                LOG_INFO(log, "Block with ID {} already exists (it was just appeared). Renaming part {} back to {}. Will retry write.",
+                    block_id, part->name, temporary_part_relative_path);
+
+                /// We will try to add this part again on the new iteration as it's just a new part.
+                /// So remove it from storage parts set immediately and transfer state to temporary.
+                transaction.rollbackPartsToTemporaryState();
+
+                part->is_temp = true;
+                part->renameTo(temporary_part_relative_path, false);
+
+                /// If this part appeared on other replica than it's better to try to write it locally one more time. If it's our part
+                /// than it will be ignored on the next itration.
+                ++loop_counter;
+                if (loop_counter == max_iterations)
+                {
+                    part->is_duplicate = true; /// Part is duplicate, just remove it from local FS
+                    throw Exception("Too many transaction retries - it may indicate an error", ErrorCodes::DUPLICATE_DATA_PART);
+                }
+                continue;
+            }
+            else if (multi_code == Coordination::Error::ZNODEEXISTS && failed_op_path == quorum_info.status_path)
+            {
+                transaction.rollback();
+                throw Exception("Another quorum insert has been already started", ErrorCodes::UNSATISFIED_QUORUM_FOR_PREVIOUS_WRITE);
+            }
+            else
+            {
+                /// NOTE: We could be here if the node with the quorum existed, but was quickly removed.
+                transaction.rollback();
+                throw Exception("Unexpected logical error while adding block " + toString(block_number) + " with ID '" + block_id + "': "
+                                + Coordination::errorMessage(multi_code) + ", path " + failed_op_path,
+                                ErrorCodes::UNEXPECTED_ZOOKEEPER_ERROR);
+            }
+        }
+        else if (Coordination::isHardwareError(multi_code))
+        {
+            transaction.rollback();
+            throw Exception("Unrecoverable network error while adding block " + toString(block_number) + " with ID '" + block_id + "': "
+                            + Coordination::errorMessage(multi_code), ErrorCodes::UNEXPECTED_ZOOKEEPER_ERROR);
+        }
+        else
+        {
+            transaction.rollback();
+            throw Exception("Unexpected ZooKeeper error while adding block " + toString(block_number) + " with ID '" + block_id + "': "
+                            + Coordination::errorMessage(multi_code), ErrorCodes::UNEXPECTED_ZOOKEEPER_ERROR);
+        }
+
+        break;
    }

    if (quorum)
    {
+        if (is_already_existing_part)
+        {
+            /// We get duplicate part without fetch
+            storage.updateQuorum(part->name);
+        }
+
        /// We are waiting for quorum to be satisfied.
        LOG_TRACE(log, "Waiting for quorum");

@ -397,14 +488,15 @@ void ReplicatedMergeTreeBlockOutputStream::commitPart(
                ReplicatedMergeTreeQuorumEntry quorum_entry(value);

                /// If the node has time to disappear, and then appear again for the next insert.
-                if (quorum_entry.part_name != part_name)
+                if (quorum_entry.part_name != part->name)
                    break;

                if (!event->tryWait(quorum_timeout_ms))
                    throw Exception("Timeout while waiting for quorum", ErrorCodes::TIMEOUT_EXCEEDED);
            }

-            /// And what if it is possible that the current replica at this time has ceased to be active and the quorum is marked as failed and deleted?
+            /// And what if it is possible that the current replica at this time has ceased to be active
+            /// and the quorum is marked as failed and deleted?
            String value;
            if (!zookeeper->tryGet(storage.replica_path + "/is_active", value, nullptr)
                || value != quorum_info.is_active_node_value)
--- a/src/Storages/StorageReplicatedMergeTree.cpp
+++ b/src/Storages/StorageReplicatedMergeTree.cpp
@ -113,6 +113,7 @@ namespace ErrorCodes
    extern const int ALL_REPLICAS_LOST;
    extern const int REPLICA_STATUS_CHANGED;
    extern const int CANNOT_ASSIGN_ALTER;
+    extern const int DIRECTORY_ALREADY_EXISTS;
    extern const int ILLEGAL_TYPE_OF_ARGUMENT;
 }

@ -3316,6 +3317,15 @@ bool StorageReplicatedMergeTree::fetchPart(const String & part_name, const Stora
            part->renameTo("detached/" + part_name, true);
        }
    }
+    catch (const Exception & e)
+    {
+        /// The same part is being written right now (but probably it's not committed yet).
+        /// We will check the need for fetch later.
+        if (e.code() == ErrorCodes::DIRECTORY_ALREADY_EXISTS)
+            return false;
+
+        throw;
+    }
    catch (...)
    {
        if (!to_detached)
@ -4775,9 +4785,11 @@ void StorageReplicatedMergeTree::fetchPartition(
        missing_parts.clear();
        for (const String & part : parts_to_fetch)
        {
+            bool fetched = false;
+
            try
            {
-                fetchPart(part, metadata_snapshot, best_replica_path, true, 0, zookeeper);
+                fetched = fetchPart(part, metadata_snapshot, best_replica_path, true, 0, zookeeper);
            }
            catch (const DB::Exception & e)
            {
@ -4786,8 +4798,10 @@ void StorageReplicatedMergeTree::fetchPartition(
                    throw;

                LOG_INFO(log, e.displayText());
-                missing_parts.push_back(part);
            }
+
+            if (!fetched)
+                missing_parts.push_back(part);
        }

        ++try_no;
--- a/tests/queries/0_stateless/01319_manual_write_to_replicas.reference
+++ b/tests/queries/0_stateless/01319_manual_write_to_replicas.reference
@ -0,0 +1,6 @@
+Hello, world
+---
+Hello, world
+Hello, world
+Hello, world
+Hello, world
--- a/tests/queries/0_stateless/01319_manual_write_to_replicas.sql
+++ b/tests/queries/0_stateless/01319_manual_write_to_replicas.sql
@ -0,0 +1,25 @@
+DROP TABLE IF EXISTS r1;
+DROP TABLE IF EXISTS r2;
+
+CREATE TABLE r1 (x String) ENGINE = ReplicatedMergeTree('/clickhouse/tables/r', 'r1') ORDER BY x;
+CREATE TABLE r2 (x String) ENGINE = ReplicatedMergeTree('/clickhouse/tables/r', 'r2') ORDER BY x;
+
+SYSTEM STOP REPLICATED SENDS;
+
+INSERT INTO r1 VALUES ('Hello, world');
+SELECT * FROM r1;
+SELECT * FROM r2;
+INSERT INTO r2 VALUES ('Hello, world');
+SELECT '---';
+SELECT * FROM r1;
+SELECT * FROM r2;
+
+SYSTEM START REPLICATED SENDS;
+SYSTEM SYNC REPLICA r1;
+SYSTEM SYNC REPLICA r2;
+
+SELECT * FROM r1;
+SELECT * FROM r2;
+
+DROP TABLE r1;
+DROP TABLE r2;
--- a/tests/queries/0_stateless/01459_manual_write_to_replicas.reference
+++ b/tests/queries/0_stateless/01459_manual_write_to_replicas.reference
@ -0,0 +1,10 @@
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
--- a/tests/queries/0_stateless/01459_manual_write_to_replicas.sh
+++ b/tests/queries/0_stateless/01459_manual_write_to_replicas.sh
@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+set -e
+
+CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+. "$CURDIR"/../shell_config.sh
+
+NUM_REPLICAS=10
+
+for i in $(seq 1 $NUM_REPLICAS); do
+    $CLICKHOUSE_CLIENT -n -q "
+        DROP TABLE IF EXISTS r$i;
+        CREATE TABLE r$i (x UInt64) ENGINE = ReplicatedMergeTree('/clickhouse/tables/01459_manual_write_ro_replicas/r', 'r$i') ORDER BY x;
+    "
+done
+
+function thread {
+    for x in {0..99}; do
+        $CLICKHOUSE_CLIENT --query "INSERT INTO r$1 SELECT $x % $NUM_REPLICAS = $1 ? $x - 1 : $x"  # Replace some records as duplicates so they will be written by other replicas
+    done
+}
+
+for i in $(seq 1 $NUM_REPLICAS); do
+    thread $i &
+done
+
+wait
+
+for i in $(seq 1 $NUM_REPLICAS); do
+    $CLICKHOUSE_CLIENT -n -q "
+        SYSTEM SYNC REPLICA r$i;
+        SELECT count(), min(x), max(x), sum(x) FROM r$i;
+        DROP TABLE IF EXISTS r$i;
+"
+done
--- a/tests/queries/0_stateless/01459_manual_write_to_replicas_quorum.reference
+++ b/tests/queries/0_stateless/01459_manual_write_to_replicas_quorum.reference
@ -0,0 +1,10 @@
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
+100	0	99	4950
--- a/tests/queries/0_stateless/01459_manual_write_to_replicas_quorum.sh
+++ b/tests/queries/0_stateless/01459_manual_write_to_replicas_quorum.sh
@ -0,0 +1,37 @@
+#!/usr/bin/env bash
+
+set -e
+
+CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+. "$CURDIR"/../shell_config.sh
+
+NUM_REPLICAS=10
+
+for i in $(seq 1 $NUM_REPLICAS); do
+    $CLICKHOUSE_CLIENT -n -q "
+        DROP TABLE IF EXISTS r$i;
+        CREATE TABLE r$i (x UInt64) ENGINE = ReplicatedMergeTree('/clickhouse/tables/01459_manual_write_ro_replicas_quorum/r', 'r$i') ORDER BY x;
+    "
+done
+
+function thread {
+    for x in {0..99}; do
+        while true; do
+            $CLICKHOUSE_CLIENT --insert_quorum 5 --query "INSERT INTO r$1 SELECT $x" 2>&1 | grep -qF 'Quorum for previous write has not been satisfied yet' || break
+        done
+    done
+}
+
+for i in $(seq 1 $NUM_REPLICAS); do
+    thread $i &
+done
+
+wait
+
+for i in $(seq 1 $NUM_REPLICAS); do
+    $CLICKHOUSE_CLIENT -n -q "
+        SYSTEM SYNC REPLICA r$i;
+        SELECT count(), min(x), max(x), sum(x) FROM r$i;
+        DROP TABLE IF EXISTS r$i;
+"
+done
--- a/utils/s3tools/s3uploader
+++ b/utils/s3tools/s3uploader
@ -68,8 +68,8 @@ def make_tar_file_for_table(clickhouse_data_path, db_name, table_name,

 USAGE_EXAMPLES = '''
 examples:
-\ts3uploader --dataset-name some_ds --access-key-id XXX --secret-access-key YYY --clickhouse-data-path /opt/clickhouse/ --table-name default.some_tbl --bucket-name some-bucket
-\ts3uploader --dataset-name some_ds --access-key-id XXX --secret-access-key YYY --file-name some_ds.tsv.xz --bucket-name some-bucket
+\t./s3uploader --dataset-name some_ds --access-key-id XXX --secret-access-key YYY --clickhouse-data-path /opt/clickhouse/ --table-name default.some_tbl --bucket-name some-bucket
+\t./s3uploader --dataset-name some_ds --access-key-id XXX --secret-access-key YYY --file-path some_ds.tsv.xz --bucket-name some-bucket --s3-path /path/to/
 '''

 if __name__ == "__main__":