By some reason cmake rules for googletest sets GTEST_HAS_POSIX_RE=0
(Compatibilty? But which platform that does support ClickHouse does not
have it?)
But everything will be okay, if these macros was set PUBLIC (i.e. for
compiling googletest library itself and it's users), however it was
added as INTERFACE only (so library itself does not know about
GTEST_HAS_POSIX_RE=0), and this leads to UB, here ASan report (while I
was trying to use ASSERT_EXIT()).
<details>
<summary>ASan report</summary>
[ RUN ] Common.LSan
=================================================================
==7566==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6030005b2388 at pc 0x00000d00924c bp 0x7ffcd3b7cfb0 sp 0x7ffcd3b7c770
WRITE of size 64 at 0x6030005b2388 thread T0
0 0xd00924b in regcomp (/bld/src/unit_tests_dbms+0xd00924b) (BuildId: 40d3fa83125f9047)
1 0x29ca243b in testing::internal::RE::Init(char const*) /bld/./contrib/googletest/googletest/src/gtest-port.cc:750:15
2 0xd4d92b3 in testing::internal::RE::RE(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /bld/./contrib/googletest/googletest/include/gtest/internal/gtest-port.h:896:36
3 0xd4d92b3 in testing::PolymorphicMatcher<testing::internal::MatchesRegexMatcher> testing::ContainsRegex<char const*>(char const* const&) /bld/./contrib/googletest/googletest/include/gtest/gtest-matchers.h:868:28
4 0xd4d813a in testing::internal::MakeDeathTestMatcher(char const*) /bld/./contrib/googletest/googletest/include/gtest/internal/gtest-death-test-internal.h:173:10
5 0xd4d813a in Common_LSan_Test::TestBody() /bld/./src/Common/tests/gtest_lsan.cpp:11:5
0x6030005b2388 is located 0 bytes to the right of 24-byte region [0x6030005b2370,0x6030005b2388)
allocated by thread T0 here:
0 0xd066fbd in operator new(unsigned long) (/bld/src/unit_tests_dbms+0xd066fbd) (BuildId: 40d3fa83125f9047)
1 0xd4d913d in testing::PolymorphicMatcher<testing::internal::MatchesRegexMatcher> testing::ContainsRegex<char const*>(char const* const&) /bld/./contrib/googletest/googletest/include/gtest/gtest-matchers.h:868:24
2 0xd4d813a in testing::internal::MakeDeathTestMatcher(char const*) /bld/./contrib/googletest/googletest/include/gtest/internal/gtest-death-test-internal.h:173:10
3 0xd4d813a in Common_LSan_Test::TestBody() /bld/./src/Common/tests/gtest_lsan.cpp:11:5
</details>
This patch also updates the jemalloc version.
Note, that I've enabled page_id for jemalloc
PR_SET_VMA/PR_SET_VMA_ANON_NAME, that requires linux 5.17+ (but ignores
EINVAL anyway).
v2: add -isystem to fix reserved name for JEMALLOC_OVERRIDE___LIBC_PVALLOC
Refs: https://github.com/jemalloc/jemalloc/pull/2304
Refs: https://github.com/ClickHouse/ClickHouse/issues/31531
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
GCS server does not handle requests with port, and simply report an
error:
```xml
<?xml version="1.0"?>
<?xml version='1.0' encoding='UTF-8'?>
<Error>
<Code>InvalidURI</Code>
<Message>Couldn't parse the specified URI.</Message>
<Details>Invalid URL: storage.googleapis.com:443/...</Details>
</Error>
```
Removing the port fixes the issue. Note that there is port in the Host
header anyway.
Note, this is a problem only for proxy in a tunnel mode, since only it
sends such requests, other sends requests directly via HTTP methods.
Refs: https://github.com/ClickHouse/poco/pull/22#22 (cc @Jokser)
Refs: https://github.com/ClickHouse/poco/pull/63
Refs: #38069 (cc @CurtizJ)
Cc: @alesapin @kssenii
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Sometimes it is useful to build contrib with debug symbols for further
debugging.
With everything turned ON (i.e. debug build) I got 3.3GB vs 3.0GB w/o
this patch, 9% bloat, thoughts about this is this OK or not for you, if
not STRIP_DEBUG_SYMBOLS_HEAVY_CONTRIB can be OFF by default (regardless
of build type).
P.S. aws debug symbols adds just 1.7%.
v2: rename STRIP_HEAVY_DEBUG_SYMBOLS
v3: OMIT_HEAVY_DEBUG_SYMBOLS
v4: documentation had been removed
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.
(*) Addresses (but does not resolve) #38046
(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
preserved for compatibility. Likewise, error codes (e.g.
"ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
"HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
replacement.
[0] https://www.youtube.com/watch?v=KlZWmmflW6M
- TSA is a static analyzer build by Google which finds race conditions
and deadlocks at compile time.
- It works by associating a shared member variable with a
synchronization primitive that protects it. The compiler can then
check at each access if proper locking happened before. A good
introduction are [0] and [1].
- TSA requires some help by the programmer via annotations. Luckily,
LLVM's libcxx already has annotations for std::mutex, std::lock_guard,
std::shared_mutex and std::scoped_lock. This commit enables them
(--> contrib/libcxx-cmake/CMakeLists.txt).
- Further, this commit adds convenience macros for the low-level
annotations for use in ClickHouse (--> base/defines.h). For
demonstration, they are leveraged in a few places.
- As we compile with "-Wall -Wextra -Weverything", the required compiler
flag "-Wthread-safety-analysis" was already enabled. Negative checks
are an experimental feature of TSA and disabled
(--> cmake/warnings.cmake). Compile times did not increase noticeably.
- TSA is used in a few places with simple locking. I tried TSA also
where locking is more complex. The problem was usually that it is
unclear which data is protected by which lock :-(. But there was
definitely some weird code where locking looked broken. So there is
some potential to find bugs.
*** Limitations of TSA besides the ones listed in [1]:
- The programmer needs to know which lock protects which piece of shared
data. This is not always easy for large classes.
- Two synchronization primitives used in ClickHouse are not annotated in
libcxx:
(1) std::unique_lock: A releaseable lock handle often together with
std::condition_variable, e.g. in solve producer-consumer problems.
(2) std::recursive_mutex: A re-entrant mutex variant. Its usage can be
considered a design flaw + typically it is slower than a standard
mutex. In this commit, one std::recursive_mutex was converted to
std::mutex and annotated with TSA.
- For free-standing functions (e.g. helper functions) which are passed
shared data members, it can be tricky to specify the associated lock.
This is because the annotations use the normal C++ rules for symbol
resolution.
[0] https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42958.pdf
First part, updated most UTF8, hashing, memory and codecs. Except
utf8lower and upper, maybe a little later.
That includes huge amount of research with movemask dealing. Exact
details and blog post TBD.