Commit Graph

3 Commits

Author SHA1 Message Date
Azat Khuzhin
808d1a0215 Reimplement preallocate for hashed/sparse_hashed dictionaries
It was initially implemented in #15454, but was reverted in #21948 (due
to higher memory usage).

This implementation differs from the initial, since now there is
separate attribute to enable preallocation, before it was done
automatically, but this has problems with duplicates in the source.

Plus this implementation does not uses dynamic_cast, instead it extends
IDictionarySource interface.
2021-05-10 07:41:48 +03:00
Ivan
315ff4f0d9
ANTLR4 Grammar for ClickHouse and new parser (#11298) 2020-12-04 05:15:44 +03:00
Azat Khuzhin
064f901ea8 Add ability to preallocate hashtables for hashed/sparsehashed dictionaries
preallocation can be used only when we know number of rows, and for this
we need:
- source clickhouse
- no filtering (i.e. lack of <where>), since filtering can filter
  too much rows and eventually it may allocate memory that will
  never be used.

For sparse_hash the difference is quite significant, preallocated
sparse_hash hashtable allocates ~33% faster (7.5 seconds vs 5 seconds
for insert, and the difference is more significant for higher number of
elements):

    $ ninja bench-sparse_hash-run
    [1/1] cd /src/ch/hashtable-bench/.cmake && ...ch/hashtable-bench/.cmake/bench-sparse_hash
    sparse_hash/insert: 7.574 <!--
    sparse_hash/find  : 2.14426
    sparse_hash/maxrss: 174MiB
    sparse_hash/time:   9710.51 msec (user+sys)

    $ time ninja bench-sparse_hash-preallocate-run
    [1/1] cd /src/ch/hashtable-bench/.cmake && ...-bench/.cmake/bench-sparse_hash-preallocate
    sparse_hash/insert: 5.0522 <!--
    sparse_hash/find  : 2.14024
    sparse_hash/maxrss: 174MiB
    sparse_hash/time:   7192.06 msec (user+sys)

P.S. the difference for sparse_hashed dictionary with 4e9 elements
(uint64, uint16) is ~18% (4975.905 vs 4103.569 sec)

v2: do not reallocate the dictionary from the progress callback
    Since this will access hashtable in parallel.
v3: drop PREALLOCATE() and do this only for source=clickhouse and empty
    <where>
2020-10-09 22:28:14 +03:00