ClickHouse/docker/test/performance-comparison
Alexander Kuzmenkov 1617442242 fixup
2020-09-30 14:55:24 +03:00

[draft] Performance comparison test

This is an experimental mode that compares the performance of the old and new servers side by side. Both servers are run, and each query is executed on one and then the other, measuring the times. This setup should remove much of the variability present in the current performance tests, which only run the new version and compare it with old results recorded some time in the past.

To interpret the observed results, we build a randomization distribution for the observed difference of median times between the old and new servers, under the null hypothesis that the performance distribution is the same (for the details of the method, see [1]). We consider the observed difference in performance significant if it is above 5% and above the 95th percentile of the randomization distribution. We also consider the test to be unstable if the observed difference is less than 5%, but the 95th percentile is above 5% -- this means that we are likely to observe performance differences above 5% in more than 5% of runs, so the test is likely to produce false positives.
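
The decision rule above can be sketched in Python. This is only an illustration and not the code the test actually runs: the function names, the number of virtual experiments, and the choice to express the difference relative to the median of the first group are all assumptions made for the example. Cases the text does not describe are reported as "unchanged" here.

```python
import random
import statistics


def relative_median_diff(left, right):
    """Absolute difference of medians, relative to the left median (assumed normalization)."""
    left_median = statistics.median(left)
    return abs(statistics.median(right) - left_median) / left_median


def randomization_threshold(old, new, n_virtual=10000, quantile=0.95, seed=0):
    """Return (observed difference, quantile of the randomization distribution).

    Under the null hypothesis the labels "old" and "new" are interchangeable,
    so we shuffle the combined measurements and split them again many times.
    """
    rng = random.Random(seed)
    observed = relative_median_diff(old, new)
    combined = list(old) + list(new)
    n = len(old)
    diffs = []
    for _ in range(n_virtual):
        rng.shuffle(combined)
        diffs.append(relative_median_diff(combined[:n], combined[n:]))
    diffs.sort()
    index = min(len(diffs) - 1, int(quantile * len(diffs)))
    return observed, diffs[index]


def classify(observed, threshold, min_change=0.05):
    """Apply the two rules from the text; everything else is 'unchanged'."""
    if observed > min_change and observed > threshold:
        return "changed"
    if observed < min_change and threshold > min_change:
        return "unstable"
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical query times in seconds for the two servers.
    old_times = [0.50, 0.52, 0.49, 0.51, 0.50, 0.53, 0.50]
    new_times = [0.60, 0.61, 0.59, 0.62, 0.60, 0.63, 0.60]
    observed, threshold = randomization_threshold(old_times, new_times)
    print(observed, threshold, classify(observed, threshold))
```

With only a handful of runs the randomization distribution has very few distinct values, which is discussed under "Statistical considerations" at the end of this document.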

How to Read the Report

The check status summarizes the report in a short text message like 1 faster, 10 unstable:

  • 1 faster -- how many queries became faster,
  • 1 slower -- how many queries are slower,
  • 1 too long -- how many queries are taking too long to run,
  • 1 unstable -- how many queries have unstable results,
  • 1 errors -- how many errors there are in total. Action is required for every error, this number must be zero. The number of errors includes slower tests, tests that are too long, errors while running the tests and building reports, etc. Please look at the main report page to investigate these errors.

The report page itself consists of several tables. Some of them always signify errors, e.g. "Run errors" -- the very presence of this table indicates that there were errors during the test that are not normal and must be fixed. Some tables are mostly informational, e.g. "Test times" -- they reflect normal test results. But if a cell in such a table is marked in red, this also means an error, e.g., a test is taking too long to run.

Tested Commits

Informational, no action required. Log messages for the commits that are tested. Note that for the right commit, we show the nominal tested commit pull/*/head and the real tested commit pull/*/merge, which is generated by GitHub by merging the latest master into pull/*/head and which we actually build and test in CI.

Error Summary

Action required for every item.

This table summarizes all errors that occurred during the test. Click the links to go to the description of a particular error.

Run Errors

Action required for every item -- these are errors that must be fixed.

The errors that occurred when running some test queries. For more information about the error, download the test output archive and see test-name-err.log. To reproduce, see 'How to Run' below.

Slow on Client

Action required for every item -- these are errors that must be fixed.

This table shows queries that take significantly longer to process on the client than on the server. A possible reason might be sending too much data to the client, e.g., a forgotten format Null.

Unexpected Query Duration

Action required for every item -- these are errors that must be fixed.

Queries that have a "short" duration (on the order of 0.1 s) can't be reliably tested in the normal way, where we perform a small number (about ten) of measurements for each server, because the signal-to-noise ratio is much lower. There is a special mode for such queries that instead runs them for a fixed amount of time, normally with a much higher number of measurements (up to thousands). This mode must be explicitly enabled by the test author to avoid accidental errors. It must be used only for queries that are meant to complete "immediately", such as select count(*). If your query is not supposed to be "immediate", try to make it run longer, e.g. by processing more data.

This table shows queries for which the "short" marking is not consistent with the actual query run time -- i.e., a query runs for a long time but is marked as short, or it runs very fast but is not marked as short.

If your query is really supposed to complete "immediately" and can't be made to run longer, you have to mark it as "short". To do so, write <query short="1">... in the test file. The value of the "short" attribute is evaluated as a Python expression, and substitutions are performed, so you can write something like <query short="{column1} == {column2}">select count(*) from table where {column1} > {column2}</query>, to mark only a particular combination of variables as short.
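
As a rough illustration of how such an attribute could be evaluated, here is a minimal Python sketch. The helper `is_short` and the substitution names are hypothetical and are not taken from perf.py; the sketch only assumes what the paragraph above states, namely that substitutions are applied first and the result is then evaluated as a Python expression. Note that comparison in Python is written `==`, and that substituted string values have to be quoted inside the expression to compare them as strings.

```python
def is_short(short_attribute, substitutions):
    """Substitute {name} placeholders, then evaluate the result as a Python expression.

    Hypothetical helper for illustration. eval() is acceptable here only
    because test definitions are trusted input written by developers.
    """
    expression = short_attribute.format(**substitutions)
    return bool(eval(expression))


# A constant marking: every variant of the query is short.
print(is_short("1", {}))

# Numeric substitutions can be compared directly.
print(is_short("{rows} < 1000", {"rows": 10}))

# String substitutions need quotes inside the expression.
print(is_short("'{column1}' == '{column2}'", {"column1": "a", "column2": "b"}))
```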

Partial Queries

Action required for the cells marked in red.

Shows the queries we are unable to run on an old server -- probably because they contain a new function. You should see this table when you add a new function and a performance test for it. Check that the run time and variance are acceptable (run time between 0.1 and 1 seconds, variance below 10%). If not, they will be highlighted in red.

Changes in Performance

Action required for the cells marked in red, and some cheering is appropriate for the cells marked in green.

These are the queries for which we observe a statistically significant change in performance. Note that there will always be some false positives -- we try to filter by p < 0.001, and have 2000 queries, so two false positives per run are expected. In practice we have more -- e.g. code layout changed because of some unknowable jitter in compiler internals, so the change we observe is real, but it is a 'false positive' in the sense that it is not directly caused by your changes. If, based on your knowledge of ClickHouse internals, you can decide that the observed test changes are not relevant to the changes made in the tested PR, you can ignore them.
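
The false positive estimate above is simple arithmetic, sketched here in Python. The function name is made up for the example, and the second number additionally assumes that the queries are independent, which the text does not claim.

```python
def false_positive_estimates(n_queries, p_threshold):
    """Return (expected false positives, probability of at least one).

    The probability assumes independent queries, which is a simplification.
    """
    expected = n_queries * p_threshold
    p_at_least_one = 1.0 - (1.0 - p_threshold) ** n_queries
    return expected, p_at_least_one


expected, p_any = false_positive_estimates(n_queries=2000, p_threshold=0.001)
print(f"expected false positives per run: {expected:.1f}")
print(f"chance of at least one false positive: {p_any:.2f}")
```

So even with a strict threshold, a run with no false positives at all would be the exception rather than the rule.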

You can find flame graphs for queries with performance changes in the test output archive, in files with names like 'my_test_0_Cpu_SELECT 1 FROM....FORMAT Null.left.svg'. First goes the test name, then the query number in the test, then the trace type (same as in system.trace_log), and then the server version (left is old and right is new).

Unstable Queries

Action required for the cells marked in red.

These are the queries for which we did not observe a statistically significant change in performance, but for which the variance in query performance is very high. This means that we are likely to observe big changes in performance even in the absence of real changes, e.g. when comparing the server to itself. Such queries are going to have bad sensitivity as performance tests -- if a query has, say, 50% expected variability, this means we are going to see changes in performance up to 50%, even when there were no real changes in the code. And because of this, we won't be able to detect changes less than 50% with such a query, which is pretty bad. The reasons for the high variability must be investigated and fixed; ideally, the variability should be brought under 5-10%.

The most frequent reason for instability is that the query is just too short -- e.g. below 0.1 seconds. Bringing query time to 0.2 seconds or above usually helps. Other reasons may include:

  • using a lot of memory which is allocated differently between servers, so the access time may vary. This may apply to your queries if you have a Memory engine table that is bigger than 1 GB. For example, this problem has plagued arithmetic and logical_functions tests for a long time.
  • having some threshold behavior in the query, e.g. you insert into a Buffer table and it is flushed only on some query runs, so you get a much higher time for them.

Investigating the instability is the hardest problem in performance testing, and we still have not been able to understand the reasons behind the instability of some queries. There are some data that can help you in the performance test output archive. Look for files named 'my_unstable_test_0_SELECT 1...FORMAT Null.{left,right}.metrics.rep'. They contain metrics from system.query_log.ProfileEvents and functions from stack traces from system.trace_log that vary significantly between query runs. The second column is an array of [min, med, max] values for the metric. Say, if you see PerfCacheMisses there, it may mean that the code being tested has a not-so-cache-local memory access pattern that is sensitive to memory layout.

Skipped Tests

Informational, no action required.

Shows the tests that were skipped, and the reason for it. Normally it is because the data set required for the test was not loaded, or the test is marked as 'long' -- both cases mean that the test is too big to be run per-commit.

Test Performance Changes

Informational, no action required.

This table summarizes the changes in performance of queries in each test -- how many queries have changed, how many are unstable, and what is the magnitude of the changes.

Test Times

Action required for the cells marked in red.

This table shows the run times for all the tests. You may have to fix two kinds of errors in this table:

  1. Average query run time is too long -- probably means that the preparatory steps such as creating the tables and filling them with data are taking too long. Try to make them faster.
  2. Longest query run time is too long -- some particular queries are taking too long, try to make them faster. The ideal query run time is between 0.1 and 1 s.

Metric Changes

No action required.

These are changes in median values of metrics from system.asynchronous_metrics_log. These metrics are prone to unexplained variation and you can safely ignore this table unless it's interesting to you for some particular reason (e.g. you want to compare memory usage). There are also graphs of these metrics in the performance test output archive, in the metrics folder.

Errors while Building the Report

Ask a maintainer for help. These errors normally indicate a problem with testing infrastructure.

How to Run

Run the entire docker container, specifying the PR number (0 for master) and the SHA of the commit to test. The reference revision is determined as the nearest ancestor testing release tag. It is possible to specify the reference revision and pull request (0 for master) manually.

docker run --network=host \
    --volume="$(pwd)/workspace:/workspace" --volume="$(pwd)/output:/output" \
    [-e REF_PR={} -e REF_SHA={}] \
    -e PR_TO_TEST={} -e SHA_TO_TEST={} \
    yandex/clickhouse-performance-comparison

Then see the report.html in the output directory.

There are some environment variables that influence what the test does:

  • -e CHPC_RUNS -- the number of runs;
  • -e CHPC_TEST_GREP -- the names of the tests (xml files) to run, interpreted as a grep pattern;
  • -e CHPC_LOCAL_SCRIPT -- use the comparison scripts from the docker container and not from the tested commit.

Re-generate report with your tweaks

From the workspace directory (extracted test output archive):

stage=report compare.sh

More stages are available, e.g. restart servers or run the tests. See the code.

Run a single test on the already configured servers

docker/test/performance-comparison/perf.py --host=localhost --port=9000 --runs=1 tests/performance/logical_functions_small.xml

Run all tests on some custom configuration

Start two servers manually on ports 9001 (old) and 9002 (new). Change to a new directory to be used as workspace for tests, and try something like this:

$ PATH=$PATH:~/ch4/build-gcc9-rel/programs \
    CHPC_TEST_PATH=~/ch3/ch/tests/performance \
    CHPC_TEST_GREP=visit_param \
    stage=run_tests \
    ~/ch3/ch/docker/test/performance-comparison/compare.sh

  • PATH must contain clickhouse-local and clickhouse-client.
  • CHPC_TEST_PATH -- path to performance test cases, e.g. tests/performance.
  • CHPC_TEST_GREP -- a filter for which tests to run, as a grep pattern.
  • stage -- from which execution stage to start. To run the tests, use run_tests stage.

The tests will run, and the report.html will be generated in the current directory.

A more complex setup is possible, but it is inconvenient and requires some scripting. See manual-run.sh for inspiration.

Compare two published releases

Use compare-releases.sh. It will download and extract static + dbg + test packages for both releases, and then call the main comparison script compare.sh, starting from configure stage.

compare-releases.sh 19.16.19.85 20.4.2.9

Statistical considerations

Generating a randomization distribution for medians is tricky. Suppose we have N runs for each version, and then use the combined 2N run results to make a virtual experiment. In this experiment, we only have N possible values for the median of each version. This becomes very clear if you sort those 2N runs and imagine where a window of N runs can be -- the N/2 smallest and N/2 largest values can never be medians. From these N possible values of medians, you can obtain (N/2)^2 possible values of absolute median difference. These numbers are accurate only to within ±1; I'm making an off-by-one error somewhere.

So, if your number of runs is small, e.g. 7, you'll only get 16 possible differences, and even if you make 100k virtual experiments, the randomization distribution will have only 16 steps, which produces weird effects. This means you also need enough runs. You can observe this on real data if you add more output to the query that calculates the randomization distribution, e.g., the number of unique median values.

Looking even more closely, you can see that the exact values of medians don't matter, and the randomization distribution for the difference of medians devolves into some kind of rank test. We could probably skip all these virtual experiments and calculate the resulting distribution analytically, but I don't know enough math to do it. It would be something close to the Wilcoxon test distribution.
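
The counting argument can be checked by brute force for a small odd N. The following Python sketch is an illustration only and is unrelated to the query that the test uses: the function name is made up, and the synthetic run times are powers of two purely so that every pair of distinct values gives a distinct difference. It enumerates every way to split 2N results into two groups of N and counts the distinct medians and the distinct absolute median differences.

```python
from itertools import combinations
from statistics import median


def count_median_outcomes(n_runs):
    """Count distinct medians and distinct |median difference| over all splits.

    Intended for small odd n_runs, where the median is one of the measurements.
    """
    total = 2 * n_runs
    # Synthetic, already sorted "run times"; powers of two keep differences unique.
    values = [2 ** i for i in range(total)]
    all_indexes = set(range(total))
    medians = set()
    differences = set()
    for left_indexes in combinations(range(total), n_runs):
        left = [values[i] for i in left_indexes]
        right = [values[i] for i in all_indexes - set(left_indexes)]
        left_median, right_median = median(left), median(right)
        medians.add(left_median)
        differences.add(abs(left_median - right_median))
    return len(medians), len(differences)


print(count_median_outcomes(7))  # (8, 16)
```

For 7 runs this gives 8 possible medians and 16 possible absolute differences, matching the "16 steps" mentioned above and showing where the ±1 comes from.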

References

1. Box, Hunter, Hunter "Statistics for Experimenters", p. 78: "A Randomized Design Used in the Comparison of Standard and Modified Fertilizer Mixtures for Tomato Plants."