- TSA is a static analyzer build by Google which finds race conditions
and deadlocks at compile time.
- It works by associating a shared member variable with a
synchronization primitive that protects it. The compiler can then
check at each access if proper locking happened before. A good
introduction are [0] and [1].
- TSA requires some help by the programmer via annotations. Luckily,
LLVM's libcxx already has annotations for std::mutex, std::lock_guard,
std::shared_mutex and std::scoped_lock. This commit enables them
(--> contrib/libcxx-cmake/CMakeLists.txt).
- Further, this commit adds convenience macros for the low-level
annotations for use in ClickHouse (--> base/defines.h). For
demonstration, they are leveraged in a few places.
- As we compile with "-Wall -Wextra -Weverything", the required compiler
flag "-Wthread-safety-analysis" was already enabled. Negative checks
are an experimental feature of TSA and disabled
(--> cmake/warnings.cmake). Compile times did not increase noticeably.
- TSA is used in a few places with simple locking. I tried TSA also
where locking is more complex. The problem was usually that it is
unclear which data is protected by which lock :-(. But there was
definitely some weird code where locking looked broken. So there is
some potential to find bugs.
*** Limitations of TSA besides the ones listed in [1]:
- The programmer needs to know which lock protects which piece of shared
data. This is not always easy for large classes.
- Two synchronization primitives used in ClickHouse are not annotated in
libcxx:
(1) std::unique_lock: A releaseable lock handle often together with
std::condition_variable, e.g. in solve producer-consumer problems.
(2) std::recursive_mutex: A re-entrant mutex variant. Its usage can be
considered a design flaw + typically it is slower than a standard
mutex. In this commit, one std::recursive_mutex was converted to
std::mutex and annotated with TSA.
- For free-standing functions (e.g. helper functions) which are passed
shared data members, it can be tricky to specify the associated lock.
This is because the annotations use the normal C++ rules for symbol
resolution.
[0] https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42958.pdf
+ test -f package_folder/clickhouse-client_22.7.1.1+tsan_all.deb package_folder/clickhouse-common-static-dbg_22.7.1.1+tsan_amd64.deb package_folder/clickhouse-common-static_22.7.1.1+tsan_amd64.deb package_folder/clickhouse-keeper_22.7.1.1+tsan_amd64.deb package_folder/clickhouse-server_22.7.1.1+tsan_all.deb
/run.sh: line 13: test: too many arguments
Follow-up for: #38207 (cc @tavplubix)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This test executes SYSTEM RESTART REPLICAS, that may leave some tables
in some tests, and the problem test is
01414_mutations_and_errors_zookeeper, that has invalid values in the
table that produces the following error:
2022.06.19 19:02:07.165320 [ 1242562 ] {} <Error> MutateFromLogEntryTask: virtual bool DB::ReplicatedMergeMutateTaskBase::executeStep(): Code: 6. DB::Exception: Cannot parse string 'Hello' as UInt64: syntax error at begin of string. Note: there are toUInt64OrZero and toUInt64OrNull functions, which returns zero/NULL instead of throwing exception.: while executing 'FUNCTION _CAST(value :: 2, 'UInt64' :: 3) -> _CAST(value, 'UInt64') UInt64 : 4': (while reading from part /var/lib/clickhouse/store/700/70043200-eae1-44da-8554-0d43b7e936d7/20191002_1_1_0/): While executing MergeTreeInOrder. (CANNOT_PARSE_TEXT), Stack trace (when copying this message, always include the lines below):
Here is part of the server log that relevant for the test:
...
2022.06.19 18:33:22.495867 [ 391343 ] {e0332447-5473-4653-ba8b-b976acb304a1} <Trace> InterpreterSystemQuery: Restarting replica on test_9.replicated_mutation_table
2022.06.19 18:33:22.503462 [ 390869 ] {} <Information> test_9.replicated_mutation_table (70043200-eae1-44da-8554-0d43b7e936d7): Stopped being leader
...
2022.06.19 18:33:23.396760 [ 395825 ] {09ee374d-a8d9-47db-bdca-611d605b40c6} <Error> executeQuery: Code: 341. DB::Exception: Mutation is not finished because table shutdown was called. It will be done after table restart. (UNFINISHED) (version 22.6.1.1985 (official build)) (from [::1]:40558) (comment: '01414_mutations_and_errors_zookeeper.sh') (in query: ALTER TABLE replicated_mutation_table MODIFY COLUMN value UInt64 SETTINGS replication_alter_partitions_sync = 2), Stack trace (when copying this message, always include the lines below):
...
2022.06.19 18:33:23.467115 [ 390869 ] {} <Debug> test_9.replicated_mutation_table (70043200-eae1-44da-8554-0d43b7e936d7): Loading data parts
2022.06.19 18:33:23.471062 [ 390869 ] {} <Debug> test_9.replicated_mutation_table (70043200-eae1-44da-8554-0d43b7e936d7): Loaded data parts (3 items)
...
2022.06.19 18:33:23.515997 [ 390869 ] {} <Trace> test_9.replicated_mutation_table (ReplicatedMergeTreeRestartingThread): Restarting thread finished
...
2022.06.19 18:33:23.522475 [ 390869 ] {} <Trace> test_9.replicated_mutation_table (PartMovesBetweenShardsOrchestrator): PartMovesBetweenShardsOrchestrator thread finished
...
2022.06.19 18:33:24.960630 [ 391343 ] {e0332447-5473-4653-ba8b-b976acb304a1} <Error> executeQuery: Code: 57. DB::Exception: Cannot attach table with UUID 9b62c1d4-cf4a-4e41-bd11-bafb1446495c, because it was detached but still used by some query. Retry later. (TABLE_ALREADY_EXISTS) (version 22.6.1.1985 (official build)) (from [::1]:47448) (comment: 01646_system_restart_replicas_smoke.sql) (in query: SYSTEM RESTART REPLICAS;), Stack trace (when copying this message, always include the lines below):
...
2022.06.19 18:33:24.490940 [ 400623 ] {00c29852-e786-4e53-a44a-5f1c5f23c698} <Debug> executeQuery: (from [::1]:48804) (comment: '01414_mutations_and_errors_zookeeper.sh') SELECT distinct(value) FROM replicated_mutation_table ORDER BY value (stage: Complete)
2022.06.19 18:33:24.502168 [ 400623 ] {00c29852-e786-4e53-a44a-5f1c5f23c698} <Error> executeQuery: Code: 60. DB::Exception: Table test_9.replicated_mutation_table doesn't exist. (UNKNOWN_TABLE) (version 22.6.1.1985 (official build)) (from [::1]:48804) (comment: '01414_mutations_and_errors_zookeeper.sh') (in query: SELECT distinct(value) FROM replicated_mutation_table ORDER BY value), Stack trace (when copying this message, always include the lines below):
...
2022.06.19 18:33:25.048152 [ 395940 ] {bb31a17f-aca1-411a-ab30-c6b7598c59e5} <Debug> executeQuery: (from [::1]:49236) (comment: '01414_mutations_and_errors_zookeeper.sh') DROP TABLE IF EXISTS replicated_mutation_table (stage: Complete)
And if this table will be left, then checking error messages in
/var/log/clickhouse-server/clickhouse-server.backward.clean.log will
fail, like in [1].
[1]: https://s3.amazonaws.com/clickhouse-test-reports/38205/90b57e7445d5167ea2170bfe03af29faffc195cf/stress_test__undefined__actions_.html
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>