In #27152 the hash function for sparse_hash_map had been changed to
std::hash<> switch it back to DefaultHash<> (ClickHouse builtin), since
std::hash<> for numeric keys returns itself and this does not works
great with sparse_hash_map.
I've tried the example from #32480 and using some hash fixes the
performance of sparse_hashed layout.
Fixes: #32480
v2: Add comments for SparseHashMap
It was initially implemented in #15454, but was reverted in #21948 (due
to higher memory usage).
This implementation differs from the initial, since now there is
separate attribute to enable preallocation, before it was done
automatically, but this has problems with duplicates in the source.
Plus this implementation does not uses dynamic_cast, instead it extends
IDictionarySource interface.
preallocation can be used only when we know number of rows, and for this
we need:
- source clickhouse
- no filtering (i.e. lack of <where>), since filtering can filter
too much rows and eventually it may allocate memory that will
never be used.
For sparse_hash the difference is quite significant, preallocated
sparse_hash hashtable allocates ~33% faster (7.5 seconds vs 5 seconds
for insert, and the difference is more significant for higher number of
elements):
$ ninja bench-sparse_hash-run
[1/1] cd /src/ch/hashtable-bench/.cmake && ...ch/hashtable-bench/.cmake/bench-sparse_hash
sparse_hash/insert: 7.574 <!--
sparse_hash/find : 2.14426
sparse_hash/maxrss: 174MiB
sparse_hash/time: 9710.51 msec (user+sys)
$ time ninja bench-sparse_hash-preallocate-run
[1/1] cd /src/ch/hashtable-bench/.cmake && ...-bench/.cmake/bench-sparse_hash-preallocate
sparse_hash/insert: 5.0522 <!--
sparse_hash/find : 2.14024
sparse_hash/maxrss: 174MiB
sparse_hash/time: 7192.06 msec (user+sys)
P.S. the difference for sparse_hashed dictionary with 4e9 elements
(uint64, uint16) is ~18% (4975.905 vs 4103.569 sec)
v2: do not reallocate the dictionary from the progress callback
Since this will access hashtable in parallel.
v3: drop PREALLOCATE() and do this only for source=clickhouse and empty
<where>