---
title: 'Fuzzing ClickHouse'
date: '2021-03-08'
author: '[Alexander Kuzmenkov](https://github.com/akuzm)'
tags: ['fuzzing', 'testing']
---

Testing is a major problem in software development: there is never enough of
it. This is especially true for a database management system, whose task is to
interpret a query language that works on persistent state managed by the
system in a distributed fashion. Each of these three functions is hard enough
to test even in isolation, and it gets much worse when you combine them. As
ClickHouse developers, we know this from experience. Despite the large amount
of automated testing of all kinds that we routinely perform as part of our
continuous integration system, new bugs and regressions keep creeping in. We
are always looking for ways to improve our test coverage, and this article
describes our recent development in this area -- the AST-based query fuzzer.

A natural form of testing for a SQL DBMS is to create an SQL script describing
the test case, and record its reference result. To test, we run the script and
check that the result matches the reference. This is used in many SQL DBMSes,
and it is the default kind of test you are expected to write for any
ClickHouse feature or fix. Currently we have [73k lines of SQL tests
alone](https://github.com/ClickHouse/ClickHouse/tree/master/tests/queries/0_stateless),
which reach a [code coverage of
76%](https://clickhouse-test-reports.s3.yandex.net/0/47d684a5c35410201d4dd4f63f3287bf25cdabb7/coverage_report/test_output/index.html).

This form of testing, where a developer writes a few simplified examples of how
the feature can and cannot be used, is sometimes called "example-based
testing". Sadly, bugs often appear in various corner cases and intersections
of features, and it is not practical to enumerate them all by hand. There is a
technique for automating this process, called "property-based testing". It lets
you write more general tests of the form "for all values matching these specs,
the result of some operation on them should match this other spec". For
example, such a test can check that if you add two positive numbers, the result
is greater than both of them. But you don't specify which numbers exactly, only
these properties. Then, the property testing system randomly generates some
examples with particular numbers that match the specification, and checks that
the result also matches its specification.
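
As a toy illustration of this idea, here is a hand-rolled property check in
C++ (not tied to any particular property-testing framework): it generates
random pairs of positive numbers and verifies that their sum is greater than
either operand. A real framework would additionally shrink a failing example
to a minimal one.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

int main()
{
    std::mt19937_64 rng{42};
    std::uniform_int_distribution<int64_t> positive{1, 1'000'000'000};

    // Property: for all positive a and b, a + b is greater than both a and b.
    // We only state the property; the concrete examples are picked at random.
    for (int i = 0; i < 10000; ++i)
    {
        int64_t a = positive(rng);
        int64_t b = positive(rng);
        int64_t sum = a + b;
        assert(sum > a && sum > b);
    }
}
```
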
Property-based testing is said to be very efficient, but requires some
developer effort and expertise to write the tests in a special way. There is
another well-known testing technique that is in some sense a corner case of
property-based testing, and that doesn't require much developer time. It is
called fuzzing. When you are fuzzing your program, you feed it random inputs
generated according to some grammar, and the property you are checking is that
your program terminates correctly (no segfaults or assertions or other kinds of
program errors). Most often, the grammar of input for fuzzing is simple -- say,
bit flips and additions, or maybe some dictionary. The space of possible inputs
is huge, so to find interesting paths in it, fuzzing software records the code
paths taken by the program under test for a particular input, and focuses on
the inputs that lead to new code paths that were not seen before. It also
employs some techniques for finding interesting constant values, and so on. In
general, fuzzing allows you to find many interesting corner cases in your
program automatically, without much developer involvement.
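
For reference, coverage-guided fuzzers such as libFuzzer drive the program
under test through an entry point that receives an arbitrary byte buffer. The
sketch below only illustrates that interface; it is not ClickHouse code, and
`parseAndExecuteQuery` is a hypothetical stand-in with a deliberately planted
bug.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the code under test: rejects most inputs with an
// ordinary error and contains a planted crash for the fuzzer to find.
static void parseAndExecuteQuery(const std::string & query)
{
    if (query.rfind("SELECT", 0) != 0)
        throw std::runtime_error("Syntax error");
    if (query.find('\xff') != std::string::npos)
        abort();  // the kind of bug a fuzzer is good at finding
}

// libFuzzer calls this entry point over and over with mutated inputs, keeping
// the ones that reach new code paths, and reports crashes and failed asserts.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t * data, size_t size)
{
    try
    {
        parseAndExecuteQuery(std::string(reinterpret_cast<const char *>(data), size));
    }
    catch (...)
    {
        // User-visible errors (e.g. syntax errors) are expected for random
        // inputs; only crashes and failed assertions count as bugs here.
    }
    return 0;
}
```
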
Finding valid SQL queries with bit flips would take a long time, so there are
systems that generate valid SQL queries based on the grammar, such as
[SQLSmith](https://github.com/anse1/sqlsmith). They are successfully used for
finding bugs in databases. It would be interesting to use such a system for
ClickHouse, but it requires some up-front effort to support the ClickHouse SQL
grammar and functions, which may be different from the standard. Also, such
systems don't use any feedback, so while they are much better than systems with
primitive grammar, they still might have a hard time finding interesting
examples. But we already have a big corpus of human-written interesting SQL
queries -- it's in our regression tests. Maybe we can use them as a base for
fuzzing? We tried to do this, and it turned out to be surprisingly simple and
efficient.

Consider some SQL query from a regression test. After parsing, it is easy to
mutate the resulting AST (abstract syntax tree, an internal representation of
the parsed query) before execution to introduce random changes into the query.
For strings and arrays, we make random modifications such as inserting a random
character or doubling the string. For numbers, there are well-known Bad Numbers
such as 0, 1, powers of two and their neighbors, integer limits, and `NaN`.
`NaN`s proved to be especially effective at finding bugs, because numeric code
often has alternative branches guarded by conditions like `x < 0` and `x >= 0`,
and for a `NaN` both of these are false at the same time, so neither branch's
assumptions hold, which leads to nasty effects.
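
A simplified sketch of this kind of literal mutation might look as follows.
This is only an illustration of the approach, not the actual ClickHouse
implementation; the probabilities and the list of bad values here are made up.

```cpp
#include <cstdint>
#include <limits>
#include <random>
#include <string>
#include <vector>

std::mt19937_64 rng{123};

// Occasionally replace a numeric literal with one of the well-known "bad"
// values; otherwise keep it as is.
double fuzzNumber(double value)
{
    static const std::vector<double> bad{
        0, 1, 2, 255, 256, 65535, 65536,
        double(std::numeric_limits<int64_t>::max()),
        double(std::numeric_limits<int64_t>::min()),
        std::numeric_limits<double>::quiet_NaN()};

    if (rng() % 4 == 0)
        return bad[rng() % bad.size()];
    return value;
}

// Insert a random character into a string literal or double it.
std::string fuzzString(std::string value)
{
    switch (rng() % 3)
    {
        case 0: value.insert(rng() % (value.size() + 1), 1, char('a' + rng() % 26)); break;
        case 1: value += value; break;
        default: break;  // leave the string unchanged
    }
    return value;
}
```
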
Another interesting thing we can do is change the arguments to functions and
expressions in the select list. Naturally, all the interesting arguments can be
taken from other test queries. Same goes for changing the tables used in the
queries. When the fuzzer runs in CI, it runs queries from all the SQL tests in
random order, mixing in parts of queries from different tests, so that we can
eventually test all the possible permutations of our features.
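
This cross-pollination between test queries can be pictured with a toy AST
like the one in the following sketch. It is only schematic; the real fuzzer
operates on ClickHouse's own AST classes.

```cpp
#include <memory>
#include <random>
#include <string>
#include <vector>

std::mt19937_64 gen{7};

// A toy AST node: a function or literal with children, e.g. plus(x, 1).
struct Node
{
    std::string name;
    std::vector<std::shared_ptr<Node>> children;
};

// Fragments (subtrees) harvested from the queries of other tests.
std::vector<std::shared_ptr<Node>> fragment_pool;

// Remember every subtree of a query so that it can be reused later.
void collectFragments(const std::shared_ptr<Node> & node)
{
    fragment_pool.push_back(node);
    for (const auto & child : node->children)
        collectFragments(child);
}

// Walk a query and occasionally substitute a remembered fragment for one of
// its subtrees, producing a mix of several test queries.
void fuzzTree(Node & node)
{
    for (auto & child : node.children)
    {
        if (!fragment_pool.empty() && gen() % 10 == 0)
            child = fragment_pool[gen() % fragment_pool.size()];
        else
            fuzzTree(*child);
    }
}
```
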
The core implementation of the fuzzer is relatively small, consisting of about
700 lines of C++ code. A prototype was made in a couple of days, but naturally
it took significantly longer to polish it and to start routinely using it in
CI. It is very productive and has already let us find more than 200 bugs (see
the label [fuzz](https://github.com/ClickHouse/ClickHouse/labels/fuzz) on GitHub).
Some errors it finds are not very interesting, e.g. wrong error messages when an
argument has the wrong type. But we also found some serious logic errors and
even memory errors. We fix all the errors we find, even insignificant ones,
because this lets us ensure that under normal operation, the fuzzer doesn't
find any errors. This is similar to the approach usually taken with compiler
warnings and other optional diagnostics -- it's better to fix or disable every
single case, so that you can be sure you have no diagnostics if everything is
OK, and it's easy to notice new problems.

After fixing the majority of pre-existing errors, this fuzzer became efficient
at finding errors in new features. Pull requests introducing new features
normally add an SQL test, and we pay extra attention to the new tests when
fuzzing, generating more permutations for them. Even if the coverage of the
test is not sufficient, there is a good chance that the fuzzer will find the
missing corner cases. So when we see that all the fuzzer runs in different
configurations have failed for a particular pull request, this almost always
means that it introduces a new bug.

A major factor that makes fuzzing really efficient is that we have a lot of
assertions and other checks of program logic in our code. For debug-only
checks, we use the plain `assert` macro from `<cassert>`. For checks that are
needed even in release mode, we use an exception with a special code
`LOGICAL_ERROR` that signifies an internal program error. We did some work to
ensure that these errors are distinct from errors caused by wrong user
actions. A user error reported for a randomly generated query is normal (e.g.
it references some non-existent columns), but when we see an internal program
error, we know that it's definitely a bug, the same as a failed assertion. Of
course,
even without assertions, you get some checks for memory errors provided by the
OS (segfaults). Various kinds of sanitizers are also very useful in conjunction
with fuzzing. We run this fuzzer under clang's Address, Memory,
UndefinedBehavior and Thread sanitizers, as we do for most of our tests.
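
Schematically, the two kinds of checks look like the sketch below. The
function names are invented for the illustration, and the real ClickHouse code
uses its own exception class and error-code machinery rather than
`std::logic_error`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Debug-only invariant: compiled out in release builds (with NDEBUG).
void mergeBlocks(size_t lhs_rows, size_t rhs_rows)
{
    assert(lhs_rows > 0 && rhs_rows > 0);
    // ...
}

// Invariant that must hold even in release builds: report a violation as an
// internal program error (LOGICAL_ERROR), never as an error caused by the user.
void finalizeAggregation(size_t produced_rows, size_t expected_rows)
{
    if (produced_rows != expected_rows)
        throw std::logic_error("LOGICAL_ERROR: aggregation produced an unexpected number of rows");
    // ...
}
```
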
To see for yourself how it works, you only need the normal ClickHouse client.
Start `clickhouse-client --query-fuzzer-runs=100`, enter any query, and enjoy
the client going crazy and running a hundred random queries instead. All
queries from the current session become a source for expressions for fuzzing,
so try entering several different queries to get more interesting results. Be
careful not to do this in production! When you do this experiment, you'll soon
notice that the fuzzer tends to generate queries that are too long to run. This
is why for the CI fuzzer runs we have to configure the server to limit query
execution time, memory usage and so on using the corresponding [server
settings](https://clickhouse.tech/docs/en/operations/settings/query-complexity/#:~:text=In%20the%20default%20configuration%20file,query%20within%20a%20single%20server.).
We had a hilarious situation after that: the fuzzer figured out how to remove
the limits by generating a `SET max_execution_time = 0` query, and then
generated a never-ending query and failed. Thankfully we were able to defeat
its cleverness by using [settings
constraints](https://clickhouse.tech/docs/en/operations/settings/constraints-on-settings/).

The AST-based fuzzer we discussed is only one of the many kinds of fuzzers we
have in ClickHouse. There is a talk by Alexey Milovidov (in Russian) that
explores all the fuzzers in greater detail. Another interesting recent
development is the application of the pivoted query synthesis technique,
implemented in [SQLancer](https://github.com/sqlancer/sqlancer), to ClickHouse.
The authors are going to give [a talk about
this](https://heisenbug-piter.ru/2021/spb/talks/nr1cwknssdodjkqgzsbvh/) soon,
so stay tuned.

12-08-21 [Alexander Kuzmenkov](https://github.com/akuzm)