There are lots of thread pools and simple local-vs-global is not enough
already, it is good to know which one in particular uses threads.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Sorry for the clickbaity title. This is about static method
ConnectionTimeouts::getHTTPTimeouts(). It was be declared in header
IO/ConnectionTimeouts.h, and defined in header
IO/ConnectionTimeoutsContext.h (!). This is weird and caused issues with
linking on s390x (##45520). There was an attempt to fix some
inconsistencies (#45848) but neither did @Algunenano nor me at first
really understand why the definition is in the header.
Turns out that ConnectionTimeoutsContext.h is only #include'd from
source files which are part of the normal server build BUT NOT part of
the keeper standalone build (which must be enabled via CMake
-DBUILD_STANDALONE_KEEPER=1). This dependency was not documented and as
a result, some misguided workarounds were introduced earlier, e.g.
0341c6c54b
The deeper cause was that getHTTPTimeouts() is passed a "Context". This
class is part of the "dbms" libary which is deliberately not linked by
the standalone build of clickhouse-keeper. The context is only used to
read the settings and the "Settings" class is part of the
clickhouse_common library which is linked by clickhouse-keeper already.
To resolve this mess, this PR
- creates source file IO/ConnectionTimeouts.cpp and moves all
ConnectionTimeouts definitions into it, including getHTTPTimeouts().
- breaks the wrong dependency by passing "Settings" instead of "Context"
into getHTTPTimeouts().
- resolves the previous hacks
* save format string for NetException
* format exceptions
* format exceptions 2
* format exceptions 3
* format exceptions 4
* format exceptions 5
* format exceptions 6
* fix
* format exceptions 7
* format exceptions 8
* Update MergeTreeIndexGin.cpp
* Update AggregateFunctionMap.cpp
* Update AggregateFunctionMap.cpp
* fix
It does not give significant benefit, but now, you hashed/sparse_hashed
dictionaries can be filled in parallel (#40003), using sharded
dictionaries, and this should be used instead of PREALLOCATE.
Note, that dictionaries, that had been created with PREALLOCATE will
work, but simply ignore this attribute.
Fixes: #41985 (cc @alexey-milovidov)
Reverts: #23979 (cc @kitaisreal)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Previously there was one (even though very unlikely) case when the dtor
can throw - logging code or ThreadPool::wait.
Just guard the dtor with try/catch and done with it.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Support of sharded dictionary for updatable sources is questionable
since:
- sharded dictionary developed for hashed dictionary with a huge number
of keys
- updatable source requires storing the whole table in memory (due to
how reload works)
- also it is an open question will it have some benefits from the
updatable source or not, since using updatable source with a huge
number of changes in the source does not looks optimal and on the
other side if there are small amount of changes the you don't need
sharded dictionary at all
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Before it was possible for the desturctor to throw, in case of thread
allocation fails, rewrite it to trySchedule() and do sequential destroy
in this case.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
In case of skewed distribution simple division by module will not give
you good distribution between shards and eventually this can lead to
performance the same as non-sharded dictionary (except for it will
occupy +1 thread for Block::scatter).
But if HashedDictionary::blockToAttributes() will not have calls to
HashedDictionary::getShard() this can be fixed by using a more complex
key-to-shard (getShard()) mapping. And actually you do not need to call
getShard() in blockToAttributes() you can simply use passed shard, and
that's it.
And by wrapping key with intHash64() in getShard() skewed distribution
can be fixed.
Note, that previously I tried similar approach but did not removed
getShard() from blockToAttributes(), that's why it failed.
And now it works almost as fast as with simple createBlockSelector(),
just 13.6% slower (18.75min vs 16.5min, with 16 threads).
Note, that I've also tried to add libdivide for this, but it does not
improves the performance.
I've also tried the approach without scatter, and it works 20% slower
then this one (22.5min VS 18.75min, with 16 threads).
v2: Use intHashCRC32() over intHash64() for HashedDictionary::getShard()
(with intHash64() it works very slower, almost 2x slower, there was
18min with 32 threads)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Previous patches in this series has a bottleneck in rehash(). This is
the most slowest operation when insert lots of rows into the hashtable
and eventually all that thread pool sometimes work as the most slowest
thread since we did not have any queue of blocks.
This patch adds such queue and now it scales linearly, so initialy with
1 thread I had ~4 hours for 10e9 elements (UInt64 key, UInt16 value),
after this patch it works in 16 minutes with 16 threads (well actually I
have to use 32 threads because of distribution of data in the source
table).
And now with 16 threads it works 16 times faster.
Also this patch adds more optimal block splitting for the non-complex
dictionaries, and usual block splitting for complex dictionaries.
But anyway this moves the overhead from the loading into the hashtable
threads out to the reader thread, and this is better, since reader does
not uses that much CPU.
v2: fix use-after-free on failed load (add missing wait in dtor)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Right now dictionaries (here I will talk about only
HASHED/SPARSE_HASHED/COMPLEX_KEY_HASHED/COMPLEX_KEY_SPARSE_HASHED)
can load data only in one thread, since it uses one hash table that
cannot be filled from multiple threads.
And in case you have very big dictionary (i.e. 10e9 elements), it can
take a awhile to load them, especially for SPARSE_HASHED variants (and
if you have such amount of elements there, you are likely use
SPARSE_HASHED, since it requires less memory), in my env it takes ~4
hours, which is enormous amount of time.
So this patch add support of shards for dictionaries, number of shards
determine how much hash tables will use this dictionary, also, and which
is more important, how much threads it can use to load the data.
And with 16 threads this works 2x faster, not perfect though, see the
follow up patches in this series.
v0: PARTITION BY
v1: SHARDS 1
v2: SHARDS(1)
v3: tried optimized mod - logical and, but it does not gain even 10%
v4: tried squashing more (max_block_size * shards), but it does not gain even 10% either
v5: move SHARDS into layout parameters (unknown simply ignored)
v6: tune params for perf tests (to avoid too long queries)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
- lots of static_cast
- add safe_cast
- types adjustments
- config
- IStorage::read/watch
- ...
- some TODO's (to convert types in future)
P.S. That was quite a journey...
v2: fixes after rebase
v3: fix conflicts after #42308 merged
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This makes the target location consistent with other auto-generated
files like config_formats.h, config_core.h, and config_functions.h and
simplifies the build of clickhouse_common.
- ExternalDictionaryLibraryBridgeHelper provides the server-side
interface to access the dictionary part of the library bridge.
- In a later commit, CatBoostLibraryBridgeHelper will be added to access
the catboost part of the library bridge from the server.
- Rename generic file and identifier names in library-bridge to
something more dictionary-specific. This is needed because later on,
catboost will be integrated into library-bridge.
- Also: Some smaller fixes like typos and un-inlining non-performance
critical code.
- The logic remains unchanged in this commit.
The following new provile events had been added:
- FileSync - Number of times the F_FULLFSYNC/fsync/fdatasync function was called for files.
- DirectorySync - Number of times the F_FULLFSYNC/fsync/fdatasync function was called for directories.
- FileSyncElapsedMicroseconds - Total time spent waiting for F_FULLFSYNC/fsync/fdatasync syscall for files.
- DirectorySyncElapsedMicroseconds - Total time spent waiting for F_FULLFSYNC/fsync/fdatasync syscall for directories.
v2: rewrite test to sh with retries
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>