It does not give significant benefit, but now, you hashed/sparse_hashed
dictionaries can be filled in parallel (#40003), using sharded
dictionaries, and this should be used instead of PREALLOCATE.
Note, that dictionaries, that had been created with PREALLOCATE will
work, but simply ignore this attribute.
Fixes: #41985 (cc @alexey-milovidov)
Reverts: #23979 (cc @kitaisreal)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Right now dictionaries (here I will talk about only
HASHED/SPARSE_HASHED/COMPLEX_KEY_HASHED/COMPLEX_KEY_SPARSE_HASHED)
can load data only in one thread, since it uses one hash table that
cannot be filled from multiple threads.
And in case you have very big dictionary (i.e. 10e9 elements), it can
take a awhile to load them, especially for SPARSE_HASHED variants (and
if you have such amount of elements there, you are likely use
SPARSE_HASHED, since it requires less memory), in my env it takes ~4
hours, which is enormous amount of time.
So this patch add support of shards for dictionaries, number of shards
determine how much hash tables will use this dictionary, also, and which
is more important, how much threads it can use to load the data.
And with 16 threads this works 2x faster, not perfect though, see the
follow up patches in this series.
v0: PARTITION BY
v1: SHARDS 1
v2: SHARDS(1)
v3: tried optimized mod - logical and, but it does not gain even 10%
v4: tried squashing more (max_block_size * shards), but it does not gain even 10% either
v5: move SHARDS into layout parameters (unknown simply ignored)
v6: tune params for perf tests (to avoid too long queries)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
On the query we have missing two options:
- `LIFETIME` Is not on the example if you copy paste you will have an Exception `DB::Exception: Cannot create dictionary with empty lifetime.`
- `SOURCE` was not mentioned and it's important to link to the main/source table.
- There was an error on the `dictGetT` function there was an additional T this function do not exist (I have tested and we need to use `dictGet`).
- Also in the Dictionary example we have no extra attribute other than the id and the two dates, and for running the queries and the `dicGet` function you need an additional attribute this is why I have added `advertiser_id` (BTW I use advertiser_id as this was use in the example just before) and also add one example, without the example it was not easy to understand what was the 'attr_name' mentioned before.
- I add an example as an user did not knew how to cast the date to a Uint64 (Because most of the time the original/raw dates are defined on the range as Date64, so this example will explain them how to cast when doing the query)