This patch also updates the jemalloc version.
Note that I've enabled page_id for jemalloc
(PR_SET_VMA/PR_SET_VMA_ANON_NAME), which requires Linux 5.17+ (but
EINVAL is ignored anyway).
v2: add -isystem to fix reserved name for JEMALLOC_OVERRIDE___LIBC_PVALLOC
Refs: https://github.com/jemalloc/jemalloc/pull/2304
Refs: https://github.com/ClickHouse/ClickHouse/issues/31531
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
The GCS server does not handle requests with a port and simply reports
an error:
```xml
<?xml version="1.0"?>
<?xml version='1.0' encoding='UTF-8'?>
<Error>
<Code>InvalidURI</Code>
<Message>Couldn't parse the specified URI.</Message>
<Details>Invalid URL: storage.googleapis.com:443/...</Details>
</Error>
```
Removing the port fixes the issue. Note that the port is present in the
Host header anyway.
Note that this is a problem only for a proxy in tunnel mode, since only
it sends such requests; the others send requests directly via HTTP
methods.
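To illustrate the idea of dropping the port, here is a minimal sketch. It
is not the actual poco change: stripPort is a hypothetical helper, and it
assumes the authority has the form "host[:port]" with IPv6 literals
written in brackets.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical helper (not the poco code): drop a trailing ":<digits>" port
// from a "host[:port]" authority. IPv6 literals are assumed to be bracketed.
std::string stripPort(std::string_view authority)
{
    const auto colon = authority.rfind(':');
    if (colon == std::string_view::npos)
        return std::string(authority);

    // In "[::1]" the last ':' is inside the brackets, so it is not a port separator.
    const auto bracket = authority.rfind(']');
    if (bracket != std::string_view::npos && bracket > colon)
        return std::string(authority);

    // Strip only when everything after ':' is a non-empty run of digits.
    const auto port = authority.substr(colon + 1);
    if (port.empty())
        return std::string(authority);
    for (unsigned char c : port)
        if (!std::isdigit(c))
            return std::string(authority);

    return std::string(authority.substr(0, colon));
}
```

With this sketch, "storage.googleapis.com:443" becomes
"storage.googleapis.com", while an authority without a port is returned
unchanged.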
Refs: https://github.com/ClickHouse/poco/pull/22#22 (cc @Jokser)
Refs: https://github.com/ClickHouse/poco/pull/63
Refs: #38069 (cc @CurtizJ)
Cc: @alesapin @kssenii
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Sometimes it is useful to build contrib with debug symbols for further
debugging.
With everything turned ON (i.e. a debug build) I got 3.3GB vs 3.0GB
without this patch, a 9% bloat. Is this OK for you or not? If not,
STRIP_DEBUG_SYMBOLS_HEAVY_CONTRIB can be OFF by default (regardless of
build type).
P.S. aws debug symbols add just 1.7%.
v2: rename STRIP_HEAVY_DEBUG_SYMBOLS
v3: OMIT_HEAVY_DEBUG_SYMBOLS
v4: documentation had been removed
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.
(*) Addresses (but does not resolve) #38046
(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
preserved for compatibility. Likewise, error codes (e.g.
"ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
"HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
replacement.
[0] https://www.youtube.com/watch?v=KlZWmmflW6M
- TSA is a static analyzer built by Google which finds race conditions
and deadlocks at compile time.
- It works by associating a shared member variable with a
synchronization primitive that protects it. The compiler can then
check at each access if proper locking happened before. Good
introductions are [0] and [1].
- TSA requires some help from the programmer via annotations. Luckily,
LLVM's libcxx already has annotations for std::mutex, std::lock_guard,
std::shared_mutex and std::scoped_lock. This commit enables them
(--> contrib/libcxx-cmake/CMakeLists.txt).
- Further, this commit adds convenience macros for the low-level
annotations for use in ClickHouse (--> base/defines.h). For
demonstration, they are leveraged in a few places.
- As we compile with "-Wall -Wextra -Weverything", the required compiler
flag "-Wthread-safety-analysis" was already enabled. Negative checks
are an experimental feature of TSA and disabled
(--> cmake/warnings.cmake). Compile times did not increase noticeably.
- TSA is used in a few places with simple locking. I tried TSA also
where locking is more complex. The problem was usually that it is
unclear which data is protected by which lock :-(. But there was
definitely some weird code where locking looked broken. So there is
some potential to find bugs.
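As an illustration of the simple-locking case, here is a minimal sketch.
The EXAMPLE_* macro names and the Counter class are assumptions made up
for this example; the actual macro names in base/defines.h may differ.
The macros wrap Clang's guarded_by and requires_capability attributes and
expand to nothing on other compilers. The analysis only becomes effective
when std::mutex itself is annotated, as done for libcxx in this commit.

```cpp
#include <cstdint>
#include <mutex>

// Illustrative macro names (assumptions). They wrap Clang's TSA attributes
// and expand to nothing on other compilers.
#if defined(__clang__)
#    define EXAMPLE_GUARDED_BY(...) __attribute__((guarded_by(__VA_ARGS__)))
#    define EXAMPLE_REQUIRES(...) __attribute__((requires_capability(__VA_ARGS__)))
#else
#    define EXAMPLE_GUARDED_BY(...)
#    define EXAMPLE_REQUIRES(...)
#endif

class Counter
{
private:
    std::mutex mutex;
    // Every access to `value` must happen while `mutex` is held.
    std::uint64_t value EXAMPLE_GUARDED_BY(mutex) = 0;

    // The caller must already hold `mutex`.
    void incrementLocked() EXAMPLE_REQUIRES(mutex) { ++value; }

public:
    void increment()
    {
        std::lock_guard<std::mutex> lock(mutex);
        incrementLocked();
    }

    std::uint64_t get()
    {
        std::lock_guard<std::mutex> lock(mutex);
        return value;
    }
};
```

With annotations enabled, removing the std::lock_guard from increment()
or get() should make Clang emit a -Wthread-safety-analysis warning at
compile time.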
*** Limitations of TSA besides the ones listed in [1]:
- The programmer needs to know which lock protects which piece of shared
data. This is not always easy for large classes.
- Two synchronization primitives used in ClickHouse are not annotated in
libcxx:
(1) std::unique_lock: A releasable lock handle, often used together
with std::condition_variable, e.g. to solve producer-consumer problems.
(2) std::recursive_mutex: A re-entrant mutex variant. Its usage can be
considered a design flaw + typically it is slower than a standard
mutex. In this commit, one std::recursive_mutex was converted to
std::mutex and annotated with TSA.
- For free-standing functions (e.g. helper functions) which are passed
shared data members, it can be tricky to specify the associated lock.
This is because the annotations use the normal C++ rules for symbol
resolution.
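One possible workaround for the free-standing function case, sketched
under assumptions: pass the lock as an explicit parameter so that the
annotation can name it. The EXAMPLE_* macros, Queue and appendLocked are
hypothetical names for this example and are not taken from this commit.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative macro names (assumptions). They wrap Clang's TSA attributes
// and expand to nothing on other compilers.
#if defined(__clang__)
#    define EXAMPLE_GUARDED_BY(...) __attribute__((guarded_by(__VA_ARGS__)))
#    define EXAMPLE_REQUIRES(...) __attribute__((requires_capability(__VA_ARGS__)))
#else
#    define EXAMPLE_GUARDED_BY(...)
#    define EXAMPLE_REQUIRES(...)
#endif

// Free-standing helper: it cannot name a private member mutex of some class,
// so the lock is passed in and the annotation refers to the parameter.
void appendLocked(std::vector<int> & items, std::mutex & mutex, int item) EXAMPLE_REQUIRES(mutex)
{
    (void)mutex; // only referenced by the annotation
    items.push_back(item);
}

class Queue
{
private:
    std::mutex mutex;
    std::vector<int> items EXAMPLE_GUARDED_BY(mutex);

public:
    void push(int item)
    {
        std::lock_guard<std::mutex> lock(mutex);
        appendLocked(items, mutex, item);
    }

    std::size_t size()
    {
        std::lock_guard<std::mutex> lock(mutex);
        return items.size();
    }
};
```

The extra parameter is the cost of this approach: the helper's signature
now exposes the lock only so that the analyzer can see it.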
[0] https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42958.pdf
First part: updated most of the UTF8, hashing, memory and codec code,
except utf8lower and upper, which may follow a little later.
This included a huge amount of research on dealing with movemask. Exact
details and a blog post are TBD.