ClickHouse® is a real-time analytics DBMS
Azat Khuzhin a3f189e191 Optimize sharded dictionaries with skewed distribution
With a skewed key distribution, a simple modulo will not distribute keys
evenly between shards, and eventually this can lead to the same
performance as a non-sharded dictionary (except that it will occupy +1
thread for Block::scatter).

But if HashedDictionary::blockToAttributes() does not call
HashedDictionary::getShard(), this can be fixed by using a more complex
key-to-shard (getShard()) mapping. And actually you do not need to call
getShard() in blockToAttributes(): you can simply use the passed shard,
and that's it.

And wrapping the key with intHash64() in getShard() fixes the skewed
distribution.

Note that previously I tried a similar approach but did not remove
getShard() from blockToAttributes(), which is why it failed.

And now it works almost as fast as with simple createBlockSelector(),
just 13.6% slower (18.75min vs 16.5min, with 16 threads).

Note that I've also tried to add libdivide for this, but it does not
improve the performance.

I've also tried the approach without scatter, and it works 20% slower
than this one (22.5min vs 18.75min, with 16 threads).

v2: Use intHashCRC32() over intHash64() for HashedDictionary::getShard()
    (with intHash64() it is much slower, almost 2x slower: 18min with
    32 threads)
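
The effect described above can be illustrated with a minimal, standalone
sketch. It is not the ClickHouse code: mixKey() is a stand-in mixer (the
MurmurHash3 64-bit finalizer) used here in place of intHash64() or
intHashCRC32(), and shardByModulo(), shardByHash() and countPerShard()
are hypothetical helper names. It only shows why plain modulo collapses
skewed keys into one shard while hashing the key first spreads them out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in 64-bit mixer (MurmurHash3 finalizer). Assumption: used only for
// illustration; the commit itself relies on ClickHouse's own hash functions.
inline std::uint64_t mixKey(std::uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// Naive key-to-shard mapping: plain modulo on the raw key.
inline std::size_t shardByModulo(std::uint64_t key, std::size_t shards)
{
    assert(shards > 0);
    return static_cast<std::size_t>(key % shards);
}

// Hashed key-to-shard mapping: mix the key first, then take the modulo.
inline std::size_t shardByHash(std::uint64_t key, std::size_t shards)
{
    assert(shards > 0);
    return static_cast<std::size_t>(mixKey(key) % shards);
}

// Counts how many of the keys 0, stride, 2*stride, ... land in each shard.
template <typename ShardFn>
std::vector<std::size_t> countPerShard(
    ShardFn shard_fn, std::size_t num_keys, std::uint64_t stride, std::size_t shards)
{
    std::vector<std::size_t> counts(shards, 0);
    for (std::size_t i = 0; i < num_keys; ++i)
        ++counts[shard_fn(static_cast<std::uint64_t>(i) * stride, shards)];
    return counts;
}
```

For example, countPerShard(shardByModulo, 10000, 16, 16) puts every key
into shard 0, because all keys are multiples of the shard count, while
countPerShard(shardByHash, 10000, 16, 16) spreads the same keys over
several shards.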

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
.github Add a checkbox for documentation 2023-01-12 00:26:03 +03:00
base fix TSA support 2023-01-10 01:19:42 +00:00
benchmark Remove old file 2022-07-12 20:28:02 +02:00
cmake Merge pull request #44828 from ClickHouse/remove-two-lines-of-code 2023-01-04 04:50:52 +03:00
contrib Merge pull request #45144 from ClibMouse/crc-power-fix 2023-01-13 11:24:18 +01:00
docker Merge pull request #44594 from arenadata/ADQM-634 2023-01-12 15:07:45 -05:00
docs Allow to configure queue backlog of the parallel hashed dictionary loader 2023-01-13 13:39:26 +01:00
packages Remove adduser dependency 2023-01-07 01:45:54 +01:00
programs Merge pull request #44327 from kssenii/use-new-named-collections-code-2 2023-01-06 13:06:26 +01:00
rust What happens if I remove these 139 lines of code? 2023-01-03 18:35:31 +00:00
src Optimize sharded dictionaries with skewed distribution 2023-01-13 13:39:26 +01:00
tests Add ability to load hashed dictionaries using multiple threads 2023-01-13 13:39:25 +01:00
utils Update version_date.tsv and changelogs after v22.3.17.13-lts 2023-01-12 19:22:22 +00:00
.clang-format add BeforeLambdaBody to .clang-format 2022-02-11 16:51:45 +01:00
.clang-tidy Temporarily disable misc-* due to being too slow 2022-12-07 11:43:47 +00:00
.editorconfig Changed tabs to spaces in editor configs and in style guide [#CLICKHOUSE-3]. 2017-04-01 11:35:09 +03:00
.exrc Fix vim settings (wrong group for autocmd) 2022-12-03 21:23:24 +01:00
.git-blame-ignore-revs Add files with revision to ignore for git blame 2022-09-13 23:05:56 +02:00
.gitattributes Ignore core.autocrlf for tests references 2022-10-05 09:13:27 +02:00
.gitignore Integrate skim into the client/local 2022-12-14 20:57:41 +01:00
.gitmodules Changes to support the CRC32 in PowerPC to address the WeakHash collision issue. Update the reference to support the hash values based on the specific platform 2023-01-10 21:20:13 -08:00
.pylintrc Cover deprecated bad-* pylint options with black 2022-06-08 14:18:28 +02:00
.snyk Add exclusions from the Snyk scan 2022-10-31 17:47:02 +01:00
.yamllint Drop truthy.check-keys from yamllint (does not supported on CI) 2021-02-21 06:15:36 +03:00
AUTHORS Update AUTHORS 2021-09-22 11:38:03 +03:00
CHANGELOG.md Update CHANGELOG.md 2022-12-20 20:56:40 +03:00
CMakeLists.txt What happens if I remove 156 lines of code? 2023-01-03 18:51:16 +00:00
CODE_OF_CONDUCT.md Add minimal code of conduct #9676 2020-03-16 12:44:28 +03:00
CONTRIBUTING.md Mention ClickHouse CLA in CONTRIBUTING.md (#32697) 2021-12-14 03:47:19 +03:00
format_sources allow several <graphite> targets (#603) 2017-03-21 23:08:09 +04:00
LICENSE Update LICENSE 2023-01-02 00:35:32 +01:00
PreLoad.cmake Update PreLoad.cmake 2022-08-26 18:30:05 +08:00
README.md Update README.md 2022-12-21 02:16:35 +03:00
SECURITY.md Update version_date.tsv and changelogs after v22.12.1.1752-stable 2022-12-15 17:07:16 +00:00

ClickHouse — open source distributed column-oriented DBMS

ClickHouse® is an open-source column-oriented database management system that allows generating analytical data reports in real-time.

  • Official website has a quick high-level overview of ClickHouse on the main page.
  • ClickHouse Cloud: ClickHouse as a service, built by the creators and maintainers.
  • Tutorial shows how to set up and query a small ClickHouse cluster.
  • Documentation provides more in-depth information.
  • YouTube channel has a lot of content about ClickHouse in video format.
  • Slack and Telegram allow chatting with ClickHouse users in real-time.
  • Blog contains various ClickHouse-related articles, as well as announcements and reports about events.
  • Code Browser (Woboq) with syntax highlight and navigation.
  • Code Browser (github.dev) with syntax highlight, powered by github.dev.
  • Contacts can help to get your questions answered if there are any.

Upcoming events

  • Recording available: v22.12 Release Webinar - 22.12 is the ClickHouse Christmas release. There are plenty of gifts (a new JOIN algorithm among them) and we adopted something from MongoDB. Original creator, co-founder, and CTO of ClickHouse Alexey Milovidov will walk us through the highlights of the release.
  • ClickHouse Meetup at the CHEQ office in Tel Aviv - Jan 16 - We are very excited to be holding our next in-person ClickHouse meetup at the CHEQ office in Tel Aviv! Hear from CHEQ, ServiceNow and Contentsquare, as well as a deep dive presentation from ClickHouse CTO Alexey Milovidov. Join us for a fun evening of talks, food and discussion!
  • ClickHouse Meetup at Microsoft Office in Seattle - Jan 18 - Keep an eye on this space as we will be announcing speakers soon!