Merge branch 'master' of github.com:yandex/ClickHouse into fix-typo

2024-11-24 08:32:02 +00:00 · 2020-07-09 00:35:47 +03:00 · 2020-07-09 00:35:47 +03:00 · ba6dbb4bb2
commit ba6dbb4bb2
parent 56b835effa 1062270580
20 changed files with 877 additions and 33 deletions
--- a/docker/test/performance-comparison/README.md
+++ b/docker/test/performance-comparison/README.md
@@ -23,21 +23,53 @@ The check status summarizes the report in a short text message like `1 faster, 1
 * `1 slower` -- how many queries are slower,
 * `1 too long` -- how many queries are taking too long to run,
 * `1 unstable` -- how many queries have unstable results,
-* `1 errors` -- how many errors there are in total. The number of errors includes slower tests, tests that are too long, errors while running the tests and building reports, etc. Please look at the main report page to investigate these errors.
+* `1 errors` -- how many errors there are in total. Action is required for every error; this number must be zero. The number of errors includes slower tests, tests that are too long, errors while running the tests and building reports, etc. Please look at the main report page to investigate these errors.

 The report page itself constists of a several tables. Some of them always signify errors, e.g. "Run errors" -- the very presence of this table indicates that there were errors during the test, that are not normal and must be fixed. Some tables are mostly informational, e.g. "Test times" -- they reflect normal test results. But if a cell in such table is marked in red, this also means an error, e.g., a test is taking too long to run.

 #### Tested commits
-Informational. Log messages for the commits that are tested. Note that for the right commit, we show nominal tested commit `pull/*/head` and real tested commit `pull/*/merge`, which is generated by GitHub by merging latest master to the `pull/*/head` and which we actually build and test in CI.
+Informational, no action required. Log messages for the commits that are tested. Note that for the right commit, we show nominal tested commit `pull/*/head` and real tested commit `pull/*/merge`, which is generated by GitHub by merging latest master to the `pull/*/head` and which we actually build and test in CI.

 #### Run errors
-These are errors that must be fixed. The errors that ocurred when running some test queries. For more information about the error, download test output archive and see `test-name-err.log`. To reproduce, see 'How to run' below.
+Action required for every item -- these are errors that must be fixed. This table lists the errors that occurred while running test queries. For more information about an error, download the test output archive and see `test-name-err.log`. To reproduce, see 'How to run' below.

 #### Slow on client
-These are errors that must be fixed. This table shows queries that take significantly longer to process on the client than on the server. A possible reason might be sending too much data to the client, e.g., a forgotten `format Null`.
+Action required for every item -- these are errors that must be fixed. This table shows queries that take significantly longer to process on the client than on the server. A possible reason might be sending too much data to the client, e.g., a forgotten `format Null`.

 #### Partial queries
-Informational, no action required if no red cells. Shows the queries we are unable to run on an old server -- probably because they contain a new function. You should see this table when you add a new function and a performance test for it. Check that the run time and variance are acceptable (run time between 0.1 and 1 seconds, variance below 10%). If not, they will be highlighted in red.
+Action required for the cells marked in red. Shows the queries we are unable to run on an old server -- probably because they contain a new function. You should see this table when you add a new function and a performance test for it. Check that the run time and variance are acceptable (run time between 0.1 and 1 seconds, variance below 10%). If not, they will be highlighted in red.
+
+#### Changes in performance
+Action required for the cells marked in red, and some cheering is appropriate for the cells marked in green. These are the queries for which we observe a statistically significant change in performance. Note that there will always be some false positives -- we try to filter by p < 0.001, and have 2000 queries, so two false positives per run are expected. In practice we have more -- e.g. code layout changed because of some unknowable jitter in compiler internals, so the change we observe is real, but it is a 'false positive' in the sense that it is not directly caused by your changes. If, based on your knowledge of ClickHouse internals, you can decide that the observed test changes are not relevant to the changes made in the tested PR, you can ignore them.
+
+You can find flame graphs for queries with performance changes in the test output archive, in files named like 'my_test_0_Cpu_SELECT 1 FROM....FORMAT Null.left.svg'. The name consists of the test name, then the query number in the test, then the trace type (same as in `system.trace_log`), and then the server version (left is old and right is new).
+
+#### Unstable queries
+Action required for the cells marked in red. These are queries for which we did not observe a statistically significant change in performance, but for which the variance in query performance is very high. This means that we are likely to observe big changes in performance even in the absence of real changes, e.g. when comparing the server to itself. Such queries are going to have bad sensitivity as performance tests -- if a query has, say, 50% expected variability, this means we are going to see changes in performance up to 50%, even when there were no real changes in the code. And because of this, we won't be able to detect changes less than 50% with such a query, which is pretty bad. The reasons for the high variability must be investigated and fixed; ideally, the variability should be brought under 5-10%. 
+
+The most frequent reason for instability is that the query is just too short -- e.g. below 0.1 seconds. Bringing query time to 0.2 seconds or above usually helps.
+Other reasons may include:
+* using a lot of memory which is allocated differently between servers, so the access time may vary. This may apply to your queries if you have a `Memory` engine table that is bigger than 1 GB. For example, this problem has plagued `arithmetic` and `logical_functions` tests for a long time.
+* having some threshold behavior in the query, e.g. you insert to a Buffer table and it is flushed only on some query runs, so you get a much higher time for them.
+
+Investigating the instability is the hardest problem in performance testing, and we still have not been able to understand the reasons behind the instability of some queries. There is some data in the performance test output archive that can help you. Look for files named 'my_unstable_test_0_SELECT 1...FORMAT Null.{left,right}.metrics.rep'. They contain the metrics from `system.query_log.ProfileEvents` and the functions from stack traces from `system.trace_log` that vary significantly between query runs. The second column is an array of \[min, med, max] values for the metric. Say, if you see `PerfCacheMisses` there, it may mean that the code being tested has a not-so-cache-local memory access pattern that is sensitive to memory layout.
+
+#### Skipped tests
+Informational, no action required. Shows the tests that were skipped, and the reason for it. Normally it is because the data set required for the test was not loaded, or the test is marked as 'long' -- both cases mean that the test is too big to be run per commit.
+
+#### Test performance changes
+Informational, no action required. This table summarizes the changes in performance of queries in each test -- how many queries have changed, how many are unstable, and what is the magnitude of the changes.
+
+#### Test times
+Action required for the cells marked in red. This table shows the run times for all the tests. You may have to fix two kinds of errors in this table:
+1) Average query run time is too long -- probably means that the preparatory steps such as creating the tables and filling them with data are taking too long. Try to make them faster.
+2) Longest query run time is too long -- some particular queries are taking too long, try to make them faster. The ideal query run time is between 0.1 and 1 s.
+
+#### Concurrent benchmarks
+No action required. This table shows the results of a concurrent benchmark where queries from `website` are run in parallel using `clickhouse-benchmark`, and requests per second values are compared for old and new servers. It shows variability up to 20% for no apparent reason, so it's probably safe to disregard it. We have it for special cases like investigating concurrency effects in memory allocators, where it may be important.
+
+#### Metric changes
+No action required. These are changes in median values of metrics from `system.asynchronous_metrics_log`. Again, they are prone to unexplained variation and you can safely ignore this table unless it's interesting to you for some particular reason (e.g. you want to compare memory usage). There are also graphs of these metrics in the performance test output archive, in the `metrics` folder.

 ### How to run
 Run the entire docker container, specifying PR number (0 for master)
--- a/docker/test/testflows/runner/Dockerfile
+++ b/docker/test/testflows/runner/Dockerfile
@@ -72,5 +72,5 @@ RUN set -x \
 VOLUME /var/lib/docker
 EXPOSE 2375
 ENTRYPOINT ["dockerd-entrypoint.sh"]
-CMD ["sh", "-c", "python3 regression.py --no-color --local --clickhouse-binary-path ${CLICKHOUSE_TESTS_SERVER_BIN_PATH} --log test.log ${TESTFLOWS_OPTS} && cat test.log | tfs report results --format json > results.json"]
+CMD ["sh", "-c", "python3 regression.py --no-color --local --clickhouse-binary-path ${CLICKHOUSE_TESTS_SERVER_BIN_PATH} --log test.log ${TESTFLOWS_OPTS}; cat test.log | tfs report results --format json > results.json"]

--- a/programs/client/CMakeLists.txt
+++ b/programs/client/CMakeLists.txt
@@ -1,6 +1,7 @@
 set (CLICKHOUSE_CLIENT_SOURCES
    Client.cpp
    ConnectionParameters.cpp
+    QueryFuzzer.cpp
    Suggest.cpp
 )

--- a/programs/client/Client.cpp
+++ b/programs/client/Client.cpp
@@ -1,5 +1,6 @@
 #include "TestHint.h"
 #include "ConnectionParameters.h"
+#include "QueryFuzzer.h"
 #include "Suggest.h"

 #if USE_REPLXX
@@ -40,6 +41,7 @@
 #include <Common/typeid_cast.h>
 #include <Common/clearPasswordFromCommandLine.h>
 #include <Common/Config/ConfigProcessor.h>
+#include <Common/PODArray.h>
 #include <Core/Types.h>
 #include <Core/QueryProcessingStage.h>
 #include <Core/ExternalTable.h>
@@ -213,6 +215,9 @@ private:

    ConnectionParameters connection_parameters;

+    QueryFuzzer fuzzer;
+    int query_fuzzer_runs;
+
    void initialize(Poco::Util::Application & self) override
    {
        Poco::Util::Application::initialize(self);
@@ -660,7 +665,14 @@ private:
        else
        {
            query_id = config().getString("query_id", "");
-            nonInteractive();
+            if (query_fuzzer_runs)
+            {
+                nonInteractiveWithFuzzing();
+            }
+            else
+            {
+                nonInteractive();
+            }

            /// If exception code isn't zero, we should return non-zero return code anyway.
            if (last_exception_received_from_server)
@@ -762,6 +774,119 @@ private:
        processQueryText(text);
    }

+    void nonInteractiveWithFuzzing()
+    {
+        if (config().has("query"))
+        {
+            // Poco configuration should not process substitutions in form of
+            // ${...} inside query
+            processWithFuzzing(config().getRawString("query"));
+            return;
+        }
+
+        // Try to stream the queries from stdin, without reading all of them
+        // into memory. The interface of the parser does not support streaming,
+        // in particular, it can't distinguish the end of partial input buffer
+        // and the final end of input file. This means we have to try to split
+        // the input into separate queries here. Two patterns of input are
+        // especially interesting:
+        // 1) multiline query:
+        //      select 1
+        //      from system.numbers;
+        //
+        // 2) csv insert with in-place data:
+        //      insert into t format CSV 1;2
+        //
+        // (1) means we can't split on new line, and (2) means we can't split on
+        // semicolon. Solution: split on ';\n'. This sequence is frequent enough
+        // in the SQL tests which are our principal input for fuzzing. Now we
+        // have another interesting case:
+        // 3) escaped semicolon followed by newline, e.g.
+        //      select ';
+        //          '
+        //
+        // To handle (3), parse as far as we can, and read more data if the
+        // parser complains. Hopefully this is enough...
+        ReadBufferFromFileDescriptor in(STDIN_FILENO);
+        std::string text;
+        while (!in.eof())
+        {
+            // Read until separator.
+            while (!in.eof())
+            {
+                char * next_separator = find_first_symbols<';'>(in.position(),
+                    in.buffer().end());
+
+                if (next_separator < in.buffer().end())
+                {
+                    next_separator++;
+                    if (next_separator < in.buffer().end()
+                        && *next_separator == '\n')
+                    {
+                        // Found ';\n', append it to the query text and try to
+                        // parse.
+                        next_separator++;
+                        text.append(in.position(), next_separator - in.position());
+                        in.position() = next_separator;
+                        break;
+                    }
+                }
+
+                // No ';\n' in the scanned part of the buffer: consume it and keep reading.
+                text.append(in.position(), next_separator - in.position());
+                in.position() = next_separator;
+
+                if (text.size() > 1024 * 1024)
+                {
+                    // We've read a lot of text and still haven't seen a separator.
+                    // Likely some pathological input, just fall through to prevent
+                    // too long loops.
+                    break;
+                }
+            }
+
+            // Parse and execute what we've read.
+            fprintf(stderr, "will now parse '%s'\n", text.c_str());
+
+            const auto * new_end = processWithFuzzing(text);
+
+            if (new_end > &text[0])
+            {
+                const auto rest_size = text.size() - (new_end - &text[0]);
+
+                fprintf(stderr, "total %zu, rest %zu\n", text.size(), rest_size);
+
+                memmove(&text[0], new_end, rest_size);
+                text.resize(rest_size);
+            }
+            else
+            {
+                fprintf(stderr, "total %zu, can't parse\n", text.size());
+            }
+
+            if (!connection->isConnected())
+            {
+                // Uh-oh...
+                std::cerr << "Lost connection to the server." << std::endl;
+                last_exception_received_from_server
+                    = std::make_unique<Exception>(210, "~");
+                return;
+            }
+
+            if (text.size() > 4 * 1024)
+            {
+                // Some pathological situation where the text is larger than 4kB
+                // and we still cannot parse a single query in it. Abort.
+                std::cerr << "Read too much text and still can't parse a query."
+                     " Aborting." << std::endl;
+                last_exception_received_from_server
+                    = std::make_unique<Exception>(1, "~");
+                // return;
+                exit(1);
+            }
+        }
+    }
+
    bool processQueryText(const String & text)
    {
        if (exit_strings.end() != exit_strings.find(trim(text, [](char c){ return isWhitespaceASCII(c) || c == ';'; })))
@@ -769,10 +894,17 @@ private:

        if (!config().has("multiquery"))
        {
+            assert(!query_fuzzer_runs);
            processTextAsSingleQuery(text);
            return true;
        }

+        if (query_fuzzer_runs)
+        {
+            processWithFuzzing(text);
+            return true;
+        }
+
        return processMultiQuery(text);
    }

@@ -871,6 +1003,121 @@ private:
    }


+    // Returns the last position we could parse.
+    const char * processWithFuzzing(const String & text)
+    {
+        /// Several queries separated by ';'.
+        /// INSERT data is ended by the end of line, not ';'.
+
+        const char * begin = text.data();
+        const char * end = begin + text.size();
+
+        while (begin < end)
+        {
+            // Skip whitespace before the query
+            while (isWhitespaceASCII(*begin) || *begin == ';')
+            {
+                ++begin;
+            }
+
+            const auto * this_query_begin = begin;
+            ASTPtr orig_ast = parseQuery(begin, end, true);
+
+            if (!orig_ast)
+            {
+                // Can't continue after a parsing error
+                return begin;
+            }
+
+            auto * as_insert = orig_ast->as<ASTInsertQuery>();
+            if (as_insert && as_insert->data)
+            {
+                // INSERT data is ended by newline
+                as_insert->end = find_first_symbols<'\n'>(as_insert->data, end);
+                begin = as_insert->end;
+            }
+
+            full_query = text.substr(this_query_begin - text.data(),
+                begin - this_query_begin);
+
+            ASTPtr fuzz_base = orig_ast;
+            for (int fuzz_step = 0; fuzz_step < query_fuzzer_runs; fuzz_step++)
+            {
+                fprintf(stderr, "fuzzing step %d for query at pos %zd\n",
+                    fuzz_step, this_query_begin - text.data());
+
+                ASTPtr ast_to_process;
+                try
+                {
+                    std::stringstream dump_before_fuzz;
+                    fuzz_base->dumpTree(dump_before_fuzz);
+                    auto base_before_fuzz = fuzz_base->formatForErrorMessage();
+
+                    ast_to_process = fuzz_base->clone();
+                    fuzzer.fuzzMain(ast_to_process);
+
+                    auto base_after_fuzz = fuzz_base->formatForErrorMessage();
+
+                    // Debug AST cloning errors.
+                    if (base_before_fuzz != base_after_fuzz)
+                    {
+                        fprintf(stderr, "base before fuzz: %s\n"
+                            "base after fuzz: %s\n", base_before_fuzz.c_str(),
+                            base_after_fuzz.c_str());
+                        fprintf(stderr, "dump before fuzz:\n%s\n",
+                            dump_before_fuzz.str().c_str());
+                        fprintf(stderr, "dump after fuzz:\n");
+                        fuzz_base->dumpTree(std::cerr);
+                        assert(false);
+                    }
+
+                    auto fuzzed_text = ast_to_process->formatForErrorMessage();
+                    if (fuzz_step > 0 && fuzzed_text == base_before_fuzz)
+                    {
+                        fprintf(stderr, "got boring ast\n");
+                        continue;
+                    }
+
+                    parsed_query = ast_to_process;
+                    query_to_send = parsed_query->formatForErrorMessage();
+
+                    processParsedSingleQuery();
+                }
+                catch (...)
+                {
+                    last_exception_received_from_server = std::make_unique<Exception>(getCurrentExceptionMessage(true), getCurrentExceptionCode());
+                    received_exception_from_server = true;
+                    std::cerr << "Error on processing query: " << ast_to_process->formatForErrorMessage() << std::endl << last_exception_received_from_server->message();
+                }
+
+                if (received_exception_from_server)
+                {
+                    // Query completed with error, ignore it and fuzz again.
+                    fprintf(stderr, "Got error, will fuzz again\n");
+
+                    received_exception_from_server = false;
+                    last_exception_received_from_server.reset();
+
+                    continue;
+                }
+                else if (ast_to_process->formatForErrorMessage().size() > 500)
+                {
+                    // ast too long, start from original ast
+                    fprintf(stderr, "current ast too long, won't elaborate\n");
+                    fuzz_base = orig_ast;
+                }
+                else
+                {
+                    // fuzz starting from this successful query
+                    fprintf(stderr, "using this ast as etalon\n");
+                    fuzz_base = ast_to_process;
+                }
+            }
+        }
+
+        return begin;
+    }
+
    void processTextAsSingleQuery(const String & text_)
    {
        full_query = text_;
@@ -906,6 +1153,7 @@ private:
    void processParsedSingleQuery()
    {
        resetOutput();
+        last_exception_received_from_server.reset();
        received_exception_from_server = false;

        if (echo_queries)
@@ -1537,8 +1785,11 @@ private:
        processed_rows += block.rows();
        initBlockOutputStream(block);

-        /// The header block containing zero rows was used to initialize block_out_stream, do not output it.
-        if (block.rows() != 0)
+        /// The header block containing zero rows was used to initialize
+        /// block_out_stream, do not output it.
+        /// Also do not output too much data if we're fuzzing.
+        if (block.rows() != 0
+            && (query_fuzzer_runs == 0 || processed_rows < 100))
        {
            block_out_stream->write(block);
            written_first_block = true;
@@ -1895,6 +2146,7 @@ public:
            ("highlight", po::value<bool>()->default_value(true), "enable or disable basic syntax highlight in interactive command line")
            ("log-level", po::value<std::string>(), "client log level")
            ("server_logs_file", po::value<std::string>(), "put server logs into specified file")
+            ("query-fuzzer-runs", po::value<int>()->default_value(0), "query fuzzer runs")
        ;

        Settings cmd_settings;
@@ -2052,6 +2304,17 @@ public:
        if (options.count("highlight"))
            config().setBool("highlight", options["highlight"].as<bool>());

+        if ((query_fuzzer_runs = options["query-fuzzer-runs"].as<int>()))
+        {
+            // Fuzzer implies multiquery.
+            config().setBool("multiquery", true);
+
+            // Ignore errors in parsing queries.
+            // TODO stop using parseQuery.
+            config().setBool("ignore-error", true);
+            ignore_error = true;
+        }
+
        argsToConfig(common_arguments, config(), 100);

        clearPasswordFromCommandLine(argc, argv);
--- a/programs/client/QueryFuzzer.cpp
+++ b/programs/client/QueryFuzzer.cpp
@@ -0,0 +1,421 @@
+#include "QueryFuzzer.h"
+
+#include <unordered_set>
+
+#include <pcg_random.hpp>
+#include <Common/assert_cast.h>
+#include <Common/typeid_cast.h>
+#include <Core/Types.h>
+#include <IO/Operators.h>
+#include <IO/UseSSL.h>
+#include <Parsers/ASTExpressionList.h>
+#include <Parsers/ASTFunction.h>
+#include <Parsers/ASTIdentifier.h>
+#include <Parsers/ASTInsertQuery.h>
+#include <Parsers/ASTLiteral.h>
+#include <Parsers/ASTQueryWithOutput.h>
+#include <Parsers/ASTSelectQuery.h>
+#include <Parsers/ASTSelectWithUnionQuery.h>
+#include <Parsers/ASTSetQuery.h>
+#include <Parsers/ASTSubquery.h>
+#include <Parsers/ASTTablesInSelectQuery.h>
+#include <Parsers/ASTUseQuery.h>
+#include <Parsers/ParserQuery.h>
+#include <Parsers/formatAST.h>
+#include <Parsers/parseQuery.h>
+
+namespace DB
+{
+
+Field QueryFuzzer::getRandomField(int type)
+{
+    switch (type)
+    {
+    case 0:
+    {
+        static constexpr Int64 values[]
+                = {-2, -1, 0, 1, 2, 3, 7, 10, 100, 255, 256, 257, 1023, 1024,
+                   1025, 65535, 65536, 65537, 1024 * 1024 - 1, 1024 * 1024,
+                   1024 * 1024 + 1, INT64_MIN, INT64_MAX};
+        return values[fuzz_rand() % (sizeof(values) / sizeof(*values))];
+    }
+    case 1:
+    {
+        static constexpr float values[]
+                = {NAN, INFINITY, -INFINITY, 0., 0.0001, 0.5, 0.9999,
+                   1., 1.0001, 2., 10.0001, 100.0001, 1000.0001};
+        return values[fuzz_rand() % (sizeof(values) / sizeof(*values))];
+    }
+    case 2:
+    {
+        static constexpr Int64 values[]
+                = {-2, -1, 0, 1, 2, 3, 7, 10, 100, 255, 256, 257, 1023, 1024,
+                   1025, 65535, 65536, 65537, 1024 * 1024 - 1, 1024 * 1024,
+                   1024 * 1024 + 1, INT64_MIN, INT64_MAX};
+        static constexpr UInt64 scales[] = {0, 1, 2, 10};
+        return DecimalField<Decimal64>(
+                    values[fuzz_rand() % (sizeof(values) / sizeof(*values))],
+                scales[fuzz_rand() % (sizeof(scales) / sizeof(*scales))]
+                );
+    }
+    default:
+        assert(false);
+        return Null{};
+    }
+}
+
+Field QueryFuzzer::fuzzField(Field field)
+{
+    const auto type = field.getType();
+
+    int type_index = -1;
+
+    if (type == Field::Types::Int64
+        || type == Field::Types::UInt64)
+    {
+        type_index = 0;
+    }
+    else if (type == Field::Types::Float64)
+    {
+        type_index = 1;
+    }
+    else if (type == Field::Types::Decimal32
+             || type == Field::Types::Decimal64
+             || type == Field::Types::Decimal128)
+    {
+        type_index = 2;
+    }
+
+    if (fuzz_rand() % 20 == 0)
+    {
+        return Null{};
+    }
+
+    if (type_index >= 0)
+    {
+        if (fuzz_rand() % 20 == 0)
+        {
+            // Change type sometimes, but not often, because it mostly leads to
+            // boring errors.
+            type_index = fuzz_rand() % 3;
+        }
+        return getRandomField(type_index);
+    }
+
+    if (type == Field::Types::String)
+    {
+        auto & str = field.get<std::string>();
+        UInt64 action = fuzz_rand() % 10;
+        switch (action)
+        {
+        case 0:
+            str = "";
+            break;
+        case 1:
+            str = str + str;
+            break;
+        case 2:
+            str = str + str + str + str;
+            break;
+        case 4:
+            if (!str.empty())
+            {
+                str[fuzz_rand() % str.size()] = '\0';
+            }
+            break;
+        default:
+            // Do nothing
+            break;
+        }
+    }
+    else if (type == Field::Types::Array || type == Field::Types::Tuple)
+    {
+        auto & arr = field.reinterpret<FieldVector>();
+
+        if (fuzz_rand() % 5 == 0 && !arr.empty())
+        {
+            size_t pos = fuzz_rand() % arr.size();
+            arr.erase(arr.begin() + pos);
+            fprintf(stderr, "erased\n");
+        }
+
+        if (fuzz_rand() % 5 == 0)
+        {
+            if (!arr.empty())
+            {
+                size_t pos = fuzz_rand() % arr.size();
+                arr.insert(arr.begin() + pos, fuzzField(arr[pos]));
+                fprintf(stderr, "inserted (pos %zu)\n", pos);
+            }
+            else
+            {
+                arr.insert(arr.begin(), getRandomField(0));
+                fprintf(stderr, "inserted (0)\n");
+            }
+
+        }
+
+        for (auto & element : arr)
+        {
+            element = fuzzField(element);
+        }
+    }
+
+    return field;
+}
+
+ASTPtr QueryFuzzer::getRandomColumnLike()
+{
+    if (column_like.empty())
+    {
+        return nullptr;
+    }
+
+    ASTPtr new_ast = column_like[fuzz_rand() % column_like.size()]->clone();
+    new_ast->setAlias("");
+
+    return new_ast;
+}
+
+void QueryFuzzer::replaceWithColumnLike(ASTPtr & ast)
+{
+    if (column_like.empty())
+    {
+        return;
+    }
+
+    std::string old_alias = ast->tryGetAlias();
+    ast = getRandomColumnLike();
+    ast->setAlias(old_alias);
+}
+
+void QueryFuzzer::replaceWithTableLike(ASTPtr & ast)
+{
+    if (table_like.empty())
+    {
+        return;
+    }
+
+    ASTPtr new_ast = table_like[fuzz_rand() % table_like.size()]->clone();
+
+    std::string old_alias = ast->tryGetAlias();
+    new_ast->setAlias(old_alias);
+
+    ast = new_ast;
+}
+
+void QueryFuzzer::fuzzColumnLikeExpressionList(ASTPtr ast)
+{
+    if (!ast)
+    {
+        return;
+    }
+
+    auto * impl = assert_cast<ASTExpressionList *>(ast.get());
+    if (fuzz_rand() % 50 == 0 && impl->children.size() > 1)
+    {
+        // Don't remove last element -- this leads to questionable
+        // constructs such as empty select.
+        impl->children.erase(impl->children.begin()
+                             + fuzz_rand() % impl->children.size());
+    }
+    if (fuzz_rand() % 50 == 0)
+    {
+        auto pos = impl->children.empty()
+                ? impl->children.begin()
+                : impl->children.begin() + fuzz_rand() % impl->children.size();
+        auto col = getRandomColumnLike();
+        if (col)
+        {
+            impl->children.insert(pos, col);
+        }
+        else
+        {
+            fprintf(stderr, "no random col!\n");
+        }
+    }
+}
+
+void QueryFuzzer::fuzz(ASTs & asts)
+{
+    for (auto & ast : asts)
+    {
+        fuzz(ast);
+    }
+}
+
+void QueryFuzzer::fuzz(ASTPtr & ast)
+{
+    if (!ast)
+        return;
+
+    if (auto * with_union = typeid_cast<ASTSelectWithUnionQuery *>(ast.get()))
+    {
+        fuzz(with_union->list_of_selects);
+    }
+    else if (auto * tables = typeid_cast<ASTTablesInSelectQuery *>(ast.get()))
+    {
+        fuzz(tables->children);
+    }
+    else if (auto * tables_element = typeid_cast<ASTTablesInSelectQueryElement *>(ast.get()))
+    {
+        fuzz(tables_element->table_join);
+        fuzz(tables_element->table_expression);
+        fuzz(tables_element->array_join);
+    }
+    else if (auto * table_expr = typeid_cast<ASTTableExpression *>(ast.get()))
+    {
+        fuzz(table_expr->database_and_table_name);
+        fuzz(table_expr->subquery);
+        fuzz(table_expr->table_function);
+    }
+    else if (auto * expr_list = typeid_cast<ASTExpressionList *>(ast.get()))
+    {
+        fuzz(expr_list->children);
+    }
+    else if (auto * fn = typeid_cast<ASTFunction *>(ast.get()))
+    {
+        fuzzColumnLikeExpressionList(fn->arguments);
+        fuzzColumnLikeExpressionList(fn->parameters);
+
+        fuzz(fn->children);
+    }
+    else if (auto * select = typeid_cast<ASTSelectQuery *>(ast.get()))
+    {
+        fuzzColumnLikeExpressionList(select->select());
+        fuzzColumnLikeExpressionList(select->groupBy());
+
+        fuzz(select->children);
+    }
+    else if (auto * literal = typeid_cast<ASTLiteral *>(ast.get()))
+    {
+        // Only change the queries sometimes.
+        int r = fuzz_rand() % 10;
+        if (r == 0)
+        {
+            literal->value = fuzzField(literal->value);
+        }
+        else if (r == 1)
+        {
+            /* replace with a random function? */
+        }
+        else if (r == 2)
+        {
+            /* replace with something column-like */
+            replaceWithColumnLike(ast);
+        }
+    }
+    else
+    {
+        fuzz(ast->children);
+    }
+}
+
+/*
+ * This function collects various parts of a query that we can then substitute
+ * into a query being fuzzed.
+ *
+ * TODO: we just stop remembering new parts after our corpus reaches a certain size.
+ * This is boring, should implement a random replacement of existing parts with
+ * small probability. Do this after we add this fuzzer to CI and fix all the
+ * problems it can routinely find even in this boring version.
+ */
+void QueryFuzzer::collectFuzzInfoMain(const ASTPtr ast)
+{
+    collectFuzzInfoRecurse(ast);
+
+    aliases.clear();
+    for (const auto & alias : aliases_set)
+    {
+        aliases.push_back(alias);
+    }
+
+    column_like.clear();
+    for (const auto & [name, value] : column_like_map)
+    {
+        column_like.push_back(value);
+    }
+
+    table_like.clear();
+    for (const auto & [name, value] : table_like_map)
+    {
+        table_like.push_back(value);
+    }
+}
+
+void QueryFuzzer::addTableLike(const ASTPtr ast)
+{
+    if (table_like_map.size() > 1000)
+    {
+        return;
+    }
+
+    const auto name = ast->formatForErrorMessage();
+    if (name.size() < 200)
+    {
+        table_like_map.insert({name, ast});
+    }
+}
+
+void QueryFuzzer::addColumnLike(const ASTPtr ast)
+{
+    if (column_like_map.size() > 1000)
+    {
+        return;
+    }
+
+    const auto name = ast->formatForErrorMessage();
+    if (name.size() < 200)
+    {
+        column_like_map.insert({name, ast});
+    }
+}
+
+void QueryFuzzer::collectFuzzInfoRecurse(const ASTPtr ast)
+{
+    if (auto * impl = dynamic_cast<ASTWithAlias *>(ast.get()))
+    {
+        if (aliases_set.size() < 1000)
+        {
+            aliases_set.insert(impl->alias);
+        }
+    }
+
+    if (typeid_cast<ASTLiteral *>(ast.get()))
+    {
+        addColumnLike(ast);
+    }
+    else if (typeid_cast<ASTIdentifier *>(ast.get()))
+    {
+        addColumnLike(ast);
+    }
+    else if (typeid_cast<ASTFunction *>(ast.get()))
+    {
+        addColumnLike(ast);
+    }
+    else if (typeid_cast<ASTTableExpression *>(ast.get()))
+    {
+        addTableLike(ast);
+    }
+    else if (typeid_cast<ASTSubquery *>(ast.get()))
+    {
+        addTableLike(ast);
+    }
+
+    for (const auto & child : ast->children)
+    {
+        collectFuzzInfoRecurse(child);
+    }
+}
+
+void QueryFuzzer::fuzzMain(ASTPtr & ast)
+{
+    collectFuzzInfoMain(ast);
+    fuzz(ast);
+
+    std::cout << std::endl;
+    formatAST(*ast, std::cout);
+    std::cout << std::endl << std::endl;
+}
+
+}
--- a/programs/client/QueryFuzzer.h
+++ b/programs/client/QueryFuzzer.h
@ -0,0 +1,58 @@
+#pragma once
+
+#include <unordered_set>
+#include <unordered_map>
+#include <vector>
+
+#include <Common/randomSeed.h>
+#include <Common/Stopwatch.h>
+#include <Core/Field.h>
+#include <Parsers/IAST.h>
+
+namespace DB
+{
+
+/*
+ * This is an AST-based query fuzzer that makes random modifications to a query
+ * AST, changing numbers, lists of columns, functions, etc. It remembers parts of
+ * queries it fuzzed previously, and can substitute these parts into new fuzzed
+ * queries, so you want to feed it a lot of queries to get an interesting mix
+ * of them. Normally we feed SQL regression tests to it.
+ */
+struct QueryFuzzer
+{
+    pcg64 fuzz_rand{randomSeed()};
+
+    // These arrays hold parts of queries that we can substitute into the query
+    // we are currently fuzzing. We add some parts from each new query we are asked
+    // to fuzz, and keep this state between queries, so the fuzzing output becomes
+    // more interesting over time, as the queries mix.
+    std::unordered_set<std::string> aliases_set;
+    std::vector<std::string> aliases;
+
+    std::unordered_map<std::string, ASTPtr> column_like_map;
+    std::vector<ASTPtr> column_like;
+
+    std::unordered_map<std::string, ASTPtr> table_like_map;
+    std::vector<ASTPtr> table_like;
+
+    // This is the only function you have to call -- it will modify the passed
+    // ASTPtr to point to new AST with some random changes.
+    void fuzzMain(ASTPtr & ast);
+
+    // Various helper functions follow; normally you shouldn't have to call them.
+    Field getRandomField(int type);
+    Field fuzzField(Field field);
+    ASTPtr getRandomColumnLike();
+    void replaceWithColumnLike(ASTPtr & ast);
+    void replaceWithTableLike(ASTPtr & ast);
+    void fuzzColumnLikeExpressionList(ASTPtr ast);
+    void fuzz(ASTs & asts);
+    void fuzz(ASTPtr & ast);
+    void collectFuzzInfoMain(const ASTPtr ast);
+    void addTableLike(const ASTPtr ast);
+    void addColumnLike(const ASTPtr ast);
+    void collectFuzzInfoRecurse(const ASTPtr ast);
+};
+
+}
--- a/programs/ya.make
+++ b/programs/ya.make
@ -17,6 +17,7 @@ SRCS(
    main.cpp

    client/Client.cpp
+    client/QueryFuzzer.cpp
    client/ConnectionParameters.cpp
    client/Suggest.cpp
    extract-from-config/ExtractFromConfig.cpp
--- a/src/Client/Connection.h
+++ b/src/Client/Connection.h
@ -170,6 +170,8 @@ public:
    /// If not connected yet, or if connection is broken - then connect. If cannot connect - throw an exception.
    void forceConnected(const ConnectionTimeouts & timeouts);

+    bool isConnected() const { return connected; }
+
    TablesStatusResponse getTablesStatus(const ConnectionTimeouts & timeouts,
                                         const TablesStatusRequest & request);

--- a/src/Common/DNSResolver.cpp
+++ b/src/Common/DNSResolver.cpp
@ -269,6 +269,8 @@ bool DNSResolver::updateCache()
    LOG_DEBUG(log, "Updating DNS cache");

    {
+        String updated_host_name = Poco::Net::DNS::hostName();
+
        std::lock_guard lock(impl->drop_mutex);

        for (const auto & host : impl->new_hosts)
@ -279,7 +281,7 @@ bool DNSResolver::updateCache()
            impl->known_addresses.insert(address);
        impl->new_addresses.clear();

-        impl->host_name.emplace(Poco::Net::DNS::hostName());
+        impl->host_name.emplace(updated_host_name);
    }

    /// FIXME Updating may take a long time because we cannot manage timeouts of getaddrinfo(...) and getnameinfo(...).
--- a/src/Common/ErrorCodes.cpp
+++ b/src/Common/ErrorCodes.cpp
@ -497,6 +497,7 @@ namespace ErrorCodes
    extern const int CASSANDRA_INTERNAL_ERROR = 528;
    extern const int NOT_A_LEADER = 529;
    extern const int CANNOT_CONNECT_RABBITMQ = 530;
+    extern const int CANNOT_FSTAT = 531;

    extern const int KEEPER_EXCEPTION = 999;
    extern const int POCO_EXCEPTION = 1000;
--- a/src/Dictionaries/TrieDictionary.cpp
+++ b/src/Dictionaries/TrieDictionary.cpp
@ -64,18 +64,8 @@ TrieDictionary::TrieDictionary(
 {
    createAttributes();
    trie = btrie_create();
-
-    try
-    {
-        loadData();
-        calculateBytesAllocated();
-    }
-    catch (...)
-    {
-        creation_exception = std::current_exception();
-    }
-
-    creation_time = std::chrono::system_clock::now();
+    loadData();
+    calculateBytesAllocated();
 }

 TrieDictionary::~TrieDictionary()
--- a/src/Dictionaries/TrieDictionary.h
+++ b/src/Dictionaries/TrieDictionary.h
@ -249,10 +249,6 @@ private:
    size_t bucket_count = 0;
    mutable std::atomic<size_t> query_count{0};

-    std::chrono::time_point<std::chrono::system_clock> creation_time;
-
-    std::exception_ptr creation_exception;
-
    Poco::Logger * logger;
 };

--- a/src/IO/WriteBufferFromFileDescriptor.cpp
+++ b/src/IO/WriteBufferFromFileDescriptor.cpp
@ -1,6 +1,8 @@
 #include <unistd.h>
 #include <errno.h>
 #include <cassert>
+#include <sys/types.h>
+#include <sys/stat.h>

 #include <Common/Exception.h>
 #include <Common/ProfileEvents.h>
@ -33,6 +35,7 @@ namespace ErrorCodes
    extern const int CANNOT_FSYNC;
    extern const int CANNOT_SEEK_THROUGH_FILE;
    extern const int CANNOT_TRUNCATE_FILE;
+    extern const int CANNOT_FSTAT;
 }


@ -130,4 +133,14 @@ void WriteBufferFromFileDescriptor::truncate(off_t length)
        throwFromErrnoWithPath("Cannot truncate file " + getFileName(), getFileName(), ErrorCodes::CANNOT_TRUNCATE_FILE);
 }

+
+off_t WriteBufferFromFileDescriptor::size()
+{
+    struct stat buf;
+    int res = fstat(fd, &buf);
+    if (-1 == res)
+        throwFromErrnoWithPath("Cannot execute fstat " + getFileName(), getFileName(), ErrorCodes::CANNOT_FSTAT);
+    return buf.st_size;
+}
+
 }
--- a/src/IO/WriteBufferFromFileDescriptor.h
+++ b/src/IO/WriteBufferFromFileDescriptor.h
@ -44,6 +44,8 @@ public:

    off_t seek(off_t offset, int whence);
    void truncate(off_t length);
+
+    off_t size();
 };

 }
--- a/src/Parsers/parseQuery.cpp
+++ b/src/Parsers/parseQuery.cpp
@ -225,6 +225,15 @@ ASTPtr tryParseQuery(
        || token_iterator->type == TokenType::Semicolon)
    {
        out_error_message = "Empty query";
+        // Token iterator skips over comments, so we'll get this error for queries
+        // like this:
+        // "
+        // -- just a comment
+        // ;
+        //"
+        // Advance the position, so that we can use this parser for stream parsing
+        // even in the presence of such queries.
+        pos = token_iterator->begin;
        return nullptr;
    }

--- a/src/Processors/Formats/Impl/CSVRowOutputFormat.cpp
+++ b/src/Processors/Formats/Impl/CSVRowOutputFormat.cpp
@ -19,7 +19,7 @@ CSVRowOutputFormat::CSVRowOutputFormat(WriteBuffer & out_, const Block & header_
 }


-void CSVRowOutputFormat::writePrefix()
+void CSVRowOutputFormat::doWritePrefix()
 {
    const auto & sample = getPort(PortKind::Main).getHeader();
    size_t columns = sample.columns();
--- a/src/Processors/Formats/Impl/CSVRowOutputFormat.h
+++ b/src/Processors/Formats/Impl/CSVRowOutputFormat.h
@ -27,10 +27,11 @@ public:
    void writeField(const IColumn & column, const IDataType & type, size_t row_num) override;
    void writeFieldDelimiter() override;
    void writeRowEndDelimiter() override;
-    void writePrefix() override;
    void writeBeforeTotals() override;
    void writeBeforeExtremes() override;

+    void doWritePrefix() override;
+
    /// https://www.iana.org/assignments/media-types/text/csv
    String getContentType() const override
    {
--- a/src/Storages/StorageFile.cpp
+++ b/src/Storages/StorageFile.cpp
@ -435,6 +435,7 @@ public:
        , metadata_snapshot(metadata_snapshot_)
        , lock(storage.rwlock)
    {
+        std::unique_ptr<WriteBufferFromFileDescriptor> naked_buffer = nullptr;
        if (storage.use_table_fd)
        {
            /** NOTE: Using real file bound to FD may be misleading:
@ -442,17 +443,21 @@ public:
              * INSERT data; SELECT *; last SELECT returns only insert_data
              */
            storage.table_fd_was_used = true;
-            write_buf = wrapWriteBufferWithCompressionMethod(std::make_unique<WriteBufferFromFileDescriptor>(storage.table_fd), compression_method, 3);
+            naked_buffer = std::make_unique<WriteBufferFromFileDescriptor>(storage.table_fd);
        }
        else
        {
            if (storage.paths.size() != 1)
                throw Exception("Table '" + storage.getStorageID().getNameForLogs() + "' is in readonly mode because of globs in filepath", ErrorCodes::DATABASE_ACCESS_DENIED);
-            write_buf = wrapWriteBufferWithCompressionMethod(
-                std::make_unique<WriteBufferFromFile>(storage.paths[0], DBMS_DEFAULT_BUFFER_SIZE, O_WRONLY | O_APPEND | O_CREAT),
-                compression_method, 3);
+            naked_buffer = std::make_unique<WriteBufferFromFile>(storage.paths[0], DBMS_DEFAULT_BUFFER_SIZE, O_WRONLY | O_APPEND | O_CREAT);
        }

+        /// If the file is not empty, the prefix (e.g. the CSVWithNames header) has already been written.
+        if (naked_buffer->size())
+            prefix_written = true;
+
+        write_buf = wrapWriteBufferWithCompressionMethod(std::move(naked_buffer), compression_method, 3);
+
        writer = FormatFactory::instance().getOutput(storage.format_name, *write_buf, metadata_snapshot->getSampleBlock(), context);
    }

@ -465,7 +470,9 @@ public:

    void writePrefix() override
    {
-        writer->writePrefix();
+        if (!prefix_written)
+            writer->writePrefix();
+        prefix_written = true;
    }

    void writeSuffix() override
@ -484,6 +491,7 @@ private:
    std::unique_lock<std::shared_mutex> lock;
    std::unique_ptr<WriteBuffer> write_buf;
    BlockOutputStreamPtr writer;
+    bool prefix_written{false};
 };

 BlockOutputStreamPtr StorageFile::write(
--- a/tests/queries/0_stateless/01375_storage_file_write_prefix.reference
+++ b/tests/queries/0_stateless/01375_storage_file_write_prefix.reference
@ -0,0 +1,30 @@
+0	0
+1	1
+2	2
+3	3
+4	4
+5	5
+6	6
+7	7
+8	8
+9	9
+0	0
+1	1
+2	2
+3	3
+4	4
+5	5
+6	6
+7	7
+8	8
+9	9
+0	0
+1	1
+2	2
+3	3
+4	4
+5	5
+6	6
+7	7
+8	8
+9	9
--- a/tests/queries/0_stateless/01375_storage_file_write_prefix.sql
+++ b/tests/queries/0_stateless/01375_storage_file_write_prefix.sql
@ -0,0 +1,14 @@
+DROP TABLE IF EXISTS tmp_01375;
+DROP TABLE IF EXISTS table_csv_01375;
+
+CREATE TABLE tmp_01375 (n UInt32, s String) ENGINE = Memory;
+CREATE TABLE table_csv_01375 AS tmp_01375 ENGINE = File(CSVWithNames);
+
+INSERT INTO table_csv_01375 SELECT number as n, toString(n) as s FROM numbers(10);
+INSERT INTO table_csv_01375 SELECT number as n, toString(n) as s FROM numbers(10);
+INSERT INTO table_csv_01375 SELECT number as n, toString(n) as s FROM numbers(10);
+
+SELECT * FROM table_csv_01375;
+
+DROP TABLE IF EXISTS tmp_01375;
+DROP TABLE IF EXISTS table_csv_01375;