Before, the destructor could throw if thread allocation failed; rewrite it to
use trySchedule() and fall back to a sequential destroy in that case.
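A minimal standalone sketch of the pattern (illustrative only; the real code
uses ClickHouse's ThreadPool::trySchedule(), which returns false instead of
throwing when no thread can be allocated):

    // Fallback illustration: if a worker thread cannot be started, destroy
    // the hash table in the current thread instead of letting the destructor
    // throw.
    #include <cstdint>
    #include <system_error>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    int main()
    {
        std::vector<std::unordered_map<uint64_t, uint16_t>> tables(4);
        std::vector<std::thread> workers;

        for (auto & table : tables)
        {
            auto destroy = [&table] { table = {}; };   // free the table's memory
            try
            {
                workers.emplace_back(destroy);         // parallel destroy
            }
            catch (const std::system_error &)          // no thread available
            {
                destroy();                             // sequential fallback
            }
        }

        for (auto & worker : workers)
            worker.join();
    }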
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
In case of a skewed distribution, a simple division by modulo will not give
a good distribution between shards, and eventually this can lead to the same
performance as a non-sharded dictionary (except that it will occupy +1 thread
for Block::scatter).
But if HashedDictionary::blockToAttributes() does not call
HashedDictionary::getShard(), this can be fixed by using a more complex
key-to-shard (getShard()) mapping. And actually there is no need to call
getShard() in blockToAttributes() at all: simply use the shard that was
passed in, and that's it.
And by wrapping the key with intHash64() in getShard(), the skewed
distribution can be fixed.
Note that previously I tried a similar approach but did not remove getShard()
from blockToAttributes(), which is why it failed.
And now it works almost as fast as with a simple createBlockSelector(), just
13.6% slower (18.75min vs 16.5min, with 16 threads).
Note that I've also tried to add libdivide for this, but it does not improve
the performance.
I've also tried the approach without scatter, and it works 20% slower than
this one (22.5min vs 18.75min, with 16 threads).
v2: Use intHashCRC32() over intHash64() for HashedDictionary::getShard()
(with intHash64() it works much slower, almost 2x slower: it was 18min with
32 threads)
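To make the skew problem concrete, here is a small standalone illustration
(not the actual ClickHouse code; a splitmix64-style mixer stands in for
intHashCRC32()) of why hashing the key before taking the modulo fixes the
distribution:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Stand-in for intHashCRC32(), only for this demo.
    static uint64_t mixKey(uint64_t x)
    {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    static size_t getShard(uint64_t key, size_t shards, bool hash_first)
    {
        return (hash_first ? mixKey(key) : key) % shards;
    }

    int main()
    {
        const size_t shards = 16;
        std::vector<size_t> plain(shards), hashed(shards);

        // A perfectly legal but skewed key set: every key is a multiple of
        // 16, so a plain `key % shards` sends every row to shard 0 and the
        // other 15 loader threads sit idle.
        for (uint64_t i = 0; i < 1000000; ++i)
        {
            ++plain[getShard(i * 16, shards, false)];
            ++hashed[getShard(i * 16, shards, true)];
        }

        for (size_t i = 0; i < shards; ++i)
            std::printf("shard %2zu: plain=%8zu hashed=%8zu\n", i, plain[i], hashed[i]);
    }

In the real dictionary the shard is computed once per row when the block is
scattered, and blockToAttributes() just receives the shard number.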
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Previous patches in this series have a bottleneck in rehash(). This is the
slowest operation when inserting lots of rows into the hash table, and
eventually the whole thread pool sometimes works at the speed of the slowest
thread, since we did not have any queue of blocks.
This patch adds such a queue and now it scales linearly: initially, with
1 thread, I had ~4 hours for 10e9 elements (UInt64 key, UInt16 value); after
this patch it works in 16 minutes with 16 threads (well, actually I have to
use 32 threads because of the distribution of data in the source table).
And now with 16 threads it works 16 times faster.
Also this patch adds more optimal block splitting for non-complex
dictionaries, and the usual block splitting for complex dictionaries.
But anyway, this moves the overhead out of the threads that load into the
hash tables and into the reader thread, and this is better, since the reader
does not use that much CPU.
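A standalone sketch of the pattern (illustrative only, not the actual
ClickHouse code, which works on real Blocks and a thread pool): one bounded
queue and one worker per shard, so a shard that is busy in rehash() no longer
stalls the reader or the other shards:

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    using Block = std::vector<std::pair<uint64_t, uint16_t>>;

    // A bounded queue of blocks: the reader waits when a shard falls behind
    // instead of buffering unboundedly, and a slow shard does not block the
    // other shards' workers.
    class BlockQueue
    {
    public:
        void push(Block block)
        {
            std::unique_lock lock(mutex);
            not_full.wait(lock, [this] { return queue.size() < max_size; });
            queue.push(std::move(block));
            not_empty.notify_one();
        }

        std::optional<Block> pop()
        {
            std::unique_lock lock(mutex);
            not_empty.wait(lock, [this] { return !queue.empty() || finished; });
            if (queue.empty())
                return std::nullopt;               // finished and fully drained
            Block block = std::move(queue.front());
            queue.pop();
            not_full.notify_one();
            return block;
        }

        void finish()
        {
            std::lock_guard lock(mutex);
            finished = true;
            not_empty.notify_all();
        }

    private:
        static constexpr size_t max_size = 16;
        std::queue<Block> queue;
        std::mutex mutex;
        std::condition_variable not_full, not_empty;
        bool finished = false;
    };

    int main()
    {
        const size_t shards = 4;
        std::vector<BlockQueue> queues(shards);
        std::vector<std::unordered_map<uint64_t, uint16_t>> tables(shards);
        std::vector<std::thread> workers;

        // One worker per shard drains its own queue into its own hash table.
        for (size_t shard = 0; shard < shards; ++shard)
            workers.emplace_back([&, shard]
            {
                while (auto block = queues[shard].pop())
                    for (const auto & [key, value] : *block)
                        tables[shard].emplace(key, value);
            });

        // The reader splits every input block per shard and enqueues the parts.
        for (uint64_t b = 0; b < 100; ++b)
        {
            std::vector<Block> per_shard(shards);
            for (uint64_t key = b * 1000; key < (b + 1) * 1000; ++key)
                per_shard[key % shards].push_back({key, static_cast<uint16_t>(key)});
            for (size_t shard = 0; shard < shards; ++shard)
                queues[shard].push(std::move(per_shard[shard]));
        }

        for (auto & queue : queues)
            queue.finish();
        for (auto & worker : workers)
            worker.join();
    }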
v2: fix use-after-free on failed load (add missing wait in dtor)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Right now dictionaries (here I will talk only about
HASHED/SPARSE_HASHED/COMPLEX_KEY_HASHED/COMPLEX_KEY_SPARSE_HASHED)
can load data only in one thread, since they use one hash table that
cannot be filled from multiple threads.
And in case you have a very big dictionary (i.e. 10e9 elements), it can take
a while to load it, especially for the SPARSE_HASHED variants (and if you
have that many elements, you are likely to use SPARSE_HASHED, since it
requires less memory); in my env it takes ~4 hours, which is an enormous
amount of time.
So this patch adds support for shards in dictionaries; the number of shards
determines how many hash tables the dictionary will use and, more
importantly, how many threads it can use to load the data.
And with 16 threads this works 2x faster; not perfect though, see the
follow-up patches in this series.
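Schematically (illustrative standalone code, not the actual ClickHouse
classes), the change is from one hash table to SHARDS of them, each owned and
filled by its own thread during load:

    #include <cstdint>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    int main()
    {
        const size_t shards = 16;                 // e.g. SHARDS 16 in the layout

        // One hash table per shard instead of a single one; every key lives
        // in exactly one table, so each loader thread works without locking.
        std::vector<std::unordered_map<uint64_t, uint16_t>> tables(shards);

        std::vector<std::thread> loaders;
        for (size_t shard = 0; shard < shards; ++shard)
            loaders.emplace_back([&, shard]
            {
                // The real loader reads blocks from the source and keeps only
                // the keys that map to its shard; here the keys are synthetic.
                for (uint64_t key = shard; key < 1000000; key += shards)
                    tables[shard].emplace(key, static_cast<uint16_t>(key));
            });

        for (auto & loader : loaders)
            loader.join();
    }

At read time the key is routed to the single table that owns it via
getShard().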
v0: PARTITION BY
v1: SHARDS 1
v2: SHARDS(1)
v3: tried optimized mod (logical AND), but it does not gain even 10%
v4: tried squashing more (max_block_size * shards), but it does not gain even 10% either
v5: move SHARDS into layout parameters (unknown simply ignored)
v6: tune params for perf tests (to avoid too long queries)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This will avoid extra messages on destroy:
- Destroying 1 non empty hash tables (using 1 threads)
- Hash tables destroyed
And actually we cannot wait for them in tests, since the query does not wait
until the dictionary is fully unloaded:
$ pigz -cd clickhouse-server.log.gz.1 | grep 1511e339-a077-4ee7-808e-0211ece99409 -a
2022.12.11 18:21:41.069825 [ 102234 ] {1511e339-a077-4ee7-808e-0211ece99409} <Debug> executeQuery: (from [::1]:58964) (comment: 01509_dictionary_preallocate.sh) SYSTEM RELOAD DICTIONARY dict_01509_preallocate (stage: Complete)
...
2022.12.11 18:21:41.072887 [ 7291 ] {1511e339-a077-4ee7-808e-0211ece99409} <Trace> HashedDictionary: Preallocated 10000 elements
...
2022.12.11 18:21:41.076531 [ 7291 ] {1511e339-a077-4ee7-808e-0211ece99409} <Trace> HashedDictionary: Destroying 1 non empty hash tables (using 1 threads)
2022.12.11 18:21:41.076600 [ 102234 ] {1511e339-a077-4ee7-808e-0211ece99409} <Debug> MemoryTracker: Peak memory usage (for query): 3.05 MiB.
2022.12.11 18:21:41.076618 [ 102234 ] {1511e339-a077-4ee7-808e-0211ece99409} <Debug> TCPHandler: Processed in 0.007111647 sec.
2022.12.11 18:21:41.076697 [ 7291 ] {1511e339-a077-4ee7-808e-0211ece99409} <Trace> HashedDictionary: Hash tables destroyed
See: the TCPHandler finishes first, and only after that does the
HashedDictionary dtor run.
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>