Commit Graph

1588 Commits

Nikolay Degterinsky
575a1a4907 Add header checks to HTTP dictionary source 2023-06-20 13:29:25 +00:00
Dmitry Kardymon
806176d88e Add input_format_csv_missing_as_default setting and tests 2023-06-15 11:23:08 +00:00
kssenii
25ae93bbf8 Merge remote-tracking branch 'upstream/master' into add-separate-access-for-use-named-collections 2023-06-14 13:33:56 +02:00
JackyWoo
a1641aa25d
Merge branch 'master' into support_redis 2023-06-12 09:53:06 +08:00
Nikolay Degterinsky
9ad8e022a8
Merge branch 'master' into update-mongo 2023-06-10 10:58:02 +02:00
pufit
55d228e78e
Merge branch 'master' into support_redis 2023-06-09 11:45:12 -04:00
kssenii
63f8a3275b Merge remote-tracking branch 'upstream/master' into add-separate-access-for-use-named-collections 2023-06-09 14:32:41 +02:00
johanngan
be8e048799 Revert invalid RegExpTreeDictionary optimization
This reverts the following commits:
- e77dd81036
- e8527e720b

Additionally, functional tests are added.

When scanning complex regexp nodes sequentially with RE2, the old code
had an optimization to break out of the loop early upon finding a leaf
node that matches. This is an invalid optimization: there is no
guarantee that it is actually a VALID match, because its parents might
NOT have matched. Semantically, a user would expect such a match to be
discarded and the search to continue. Instead, since we skipped
matching after the first false positive, subsequent nodes that would
have matched are missing from the output value. This affects both
dictGet and dictGetAll.

It's difficult to distinguish a true positive from a false positive
while looping through complex_regexp_nodes because we would have to scan
all the parents of a matching node to confirm a true positive. Trying to
do this might actually end up being slower than just scanning every
complex regexp node, because complex_regexp_nodes is only a subset of
all the tree nodes; we may end up duplicating work with scanning
that Vectorscan has already done, depending on whether the parent nodes
are "simple" or "complex". So instead of trying to fix this
optimization, just remove it entirely.
2023-06-06 16:28:44 -05:00
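
For illustration, here is a minimal standalone sketch of the ancestor check the message above describes (the struct and function names are hypothetical, not the actual RegExpTreeDictionary code): a leaf hit is only a true positive if every parent pattern matched too, which is exactly the extra scan the commit decides is not worth doing.

    #include <cstdint>
    #include <unordered_set>

    // Hypothetical node of a regexp tree; the real dictionary code differs.
    struct RegexpNode
    {
        uint64_t id = 0;
        const RegexpNode * parent = nullptr;
    };

    // A leaf hit reported by the scanner is only a true positive if every
    // ancestor pattern matched as well; otherwise it must be discarded and
    // the scan has to continue instead of breaking out of the loop early.
    bool isTruePositive(const RegexpNode & leaf, const std::unordered_set<uint64_t> & matched_ids)
    {
        for (const RegexpNode * node = leaf.parent; node != nullptr; node = node->parent)
            if (!matched_ids.contains(node->id))
                return false;
        return true;
    }
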
kssenii
adfedb4df0 Add USE NAMED COLLECTION access 2023-06-06 14:46:34 +02:00
johanngan
c0f162c5b6 Add dictGetAll function for RegExpTreeDictionary
This function outputs an array of attribute values from all regexp nodes
that matched in a regexp tree dictionary. An optional final argument can
be passed to limit the array size.
2023-06-04 23:46:04 -05:00
JackyWoo
e6d1b3c351 little fix 2023-06-02 10:05:54 +08:00
JackyWoo
f4f939162d new redis engine schema design 2023-06-02 10:05:54 +08:00
JackyWoo
357df40c8f fix tests 2023-06-02 10:05:54 +08:00
JackyWoo
b35867d907 unify storage type 2023-06-02 10:05:54 +08:00
JackyWoo
40cc8d2107 fix code style 2023-06-02 10:05:54 +08:00
JackyWoo
ce203b5ce6 Check redis table structure 2023-06-02 10:05:54 +08:00
JackyWoo
9a495cbf99 Push down filter into Redis 2023-06-02 10:05:54 +08:00
JackyWoo
e91867373c Add table function Redis 2023-06-02 10:05:54 +08:00
xiebin
28d2269661
Merge branch 'master' into master 2023-05-30 16:13:52 +08:00
xiebin
beb3690c7e if dictionary id is number, do not convert layout to complex 2023-05-30 16:09:01 +08:00
Alexander Tokmakov
876490ff40
Merge pull request #50065 from azat/dict/load-factor-range-fix
Fix hashed/sparse_hashed dictionaries max_load_factor upper range
2023-05-22 15:04:56 +03:00
Nikolay Degterinsky
183f90e45a Update MongoDB protocol 2023-05-22 09:05:23 +00:00
Azat Khuzhin
c30658a9ed Fix hashed/sparse_hashed dictionaries max_load_factor upper range
Previously, due to comparison of floats with doubles, it worked
incorrectly for the upper range:

    (lldb) p (float)0.99 > (float)0.99
    (bool) $0 = false
    (lldb) p (float)0.99 > (double)0.99
    (bool) $1 = true

This should also fix performance test errors on CI:

    clickhouse_driver.errors.ServerException: Code: 36.
    DB::Exception: default.simple_key_HASHED_dictionary_l0_99: max_load_factor parameter should be within [0.5, 0.99], got 0.99. Stack trace:

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-22 08:59:48 +02:00
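
The root cause can be demonstrated in a few lines of standalone C++ (this only shows the float-vs-double promotion, not the dictionary code itself): the nearest float to 0.99 is slightly above it, while the nearest double is slightly below it, so a mixed comparison wrongly rejects 0.99.

    int main()
    {
        constexpr float  user_value  = 0.99f;  // value parsed from the layout definition
        constexpr double upper_bound = 0.99;   // literal used in the range check

        // float vs float: 0.99 is not above the limit, as expected.
        static_assert(!(user_value > static_cast<float>(upper_bound)));

        // float vs double: user_value is promoted to double; since
        // (float)0.99 ~ 0.9900000095 and (double)0.99 ~ 0.9899999999,
        // the check wrongly reports that 0.99 is outside [0.5, 0.99].
        static_assert(user_value > upper_bound);

        return 0;
    }
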
Azat Khuzhin
0586a27432 Charge only server memory for dictionaries
Right now the memory is counted against the query/user for a dictionary,
but only if it is loaded by the user (via SYSTEM RELOAD DICTIONARY or via
dictGet()). However, it could also be loaded in the background (due to
lifetime or update_field), so, like Buffer, only server memory should be
charged.

v2: mark test as long
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Co-authored-by: Sergei Trifonov <svtrifonov@gmail.com>
2023-05-21 22:53:52 +02:00
Azat Khuzhin
e1e2a83a9e Print type of the structure that will be used for HASHED/SPARSE_HASHED
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
f8e7d2cb1f Remove part of the HashTableGrowerWithPrecalculationAndMaxLoadFactor comment
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
c9cde110cd Add initial degree as parameter for HashTableGrowerWithPrecalculationAndMaxLoadFactor
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
42eac6bfbc Wrap implementation helpers into HashedDictionaryImpl namespace
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
6f351851ad Rename grower to HashTableGrowerWithPrecalculationAndMaxLoadFactor
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
1ab130132c Add more comments into HashedDictionaryCollectionType.h
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
422cbe08fe Do not use PackedHashMap for non-POD for the purposes of layout
In clang-16 the behaviour for POD types was changed in [1]; this
does not allow us to use PackedHashMap for some types.

  [1]: 277123376c

Note that I tried to come up with a more generic solution than
enumerating types, but failed. Though now I think this is fine, since
it shows explicitly which types are not allowed for PackedHashMap.

Another option is to use -fclang-abi-compat=13.0 but I doubt it is a
good idea.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
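
A standalone sketch of the kind of compile-time guard this implies (illustrative only; the actual trait used in HashedDictionaryCollectionType.h may differ): packing is restricted to trivial, standard-layout key/value types, and everything else keeps the regular HashMap cell layout.

    #include <cstdint>
    #include <string>
    #include <type_traits>

    // Illustrative packed key/value cell: no padding between fields.
    template <typename Key, typename Value>
    struct __attribute__((packed)) PackedCell
    {
        Key key;
        Value value;
    };

    // Compile-time switch: only types that are safe to pack get the packed
    // cell; everything else keeps the regular (padded) cell layout.
    template <typename Key, typename Value>
    inline constexpr bool use_packed_cell =
        std::is_trivial_v<Key> && std::is_standard_layout_v<Key>
        && std::is_trivial_v<Value> && std::is_standard_layout_v<Value>;

    static_assert(use_packed_cell<uint64_t, uint16_t>);      // plain integers: packing is fine
    static_assert(!use_packed_cell<uint64_t, std::string>);  // non-trivial value: fall back to HashMap
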
Azat Khuzhin
2996b38606 Add ability to configure maximum load factor for the HASHED/SPARSE_HASHED layout
As it turns out, HashMap/PackedHashMap works great even with a max load
factor of 0.99. By "great" I mean at least that it works faster than
google sparsehash, not to mention its friendliness to the memory
allocator (it has zero fragmentation since it works with a contiguous
memory region, in comparison to sparsehash, which does lots of
realloc, which jemalloc does not like due to its slabs).

Here is a table of different setups:

settings                         | load (sec) | read (sec) | read (million rows/s) | bytes_allocated | RSS
-                                | -          | -          | -                     | -               | -
HASHED upstream                  | -          | -          | -                     | -               | 35GiB
SPARSE_HASHED upstream           | -          | -          | -                     | -               | 26GiB
-                                | -          | -          | -                     | -               | -
sparse_hash_map glibc hashbench  | -          | -          | -                     | -               | 17.5GiB
sparse_hash_map packed allocator | 101.878    | 231.48     | 4.32                  | -               | 17.7GiB
PackedHashMap 0.5                | 15.514     | 42.35      | 23.61                 | 20GiB           | 22GiB
hashed 0.95                      | 34.903     | 115.615    | 8.65                  | 16GiB           | 18.7GiB
**PackedHashMap 0.95**           | **93.6**   | **19.883** | **10.68**             | **10GiB**       | **12.8GiB**
PackedHashMap 0.99               | 26.113     | 83.6       | 11.96                 | 10GiB           | 12.3GiB

As the table shows, PackedHashMap with a 0.95 max_load_factor eats 2.6x
less memory than SPARSE_HASHED in upstream, and it is also 2x faster for
reads!

v2: fix grower
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
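
As a rough sketch of what a configurable max_load_factor means for the grower (purely illustrative, not the actual HashTableGrowerWithPrecalculationAndMaxLoadFactor): the table only resizes once occupancy crosses bucket_count * max_load_factor, so a factor of 0.95-0.99 trades a few extra probes for a much denser table.

    #include <cstddef>

    // Illustrative grower only: the real grower precalculates its thresholds
    // and works with power-of-two bucket counts.
    struct GrowerWithMaxLoadFactor
    {
        double max_load_factor = 0.95; // accepted range per this commit: [0.5, 0.99]
        size_t bucket_count = 256;

        // Grow only once occupancy crosses the configured load factor.
        bool overflow(size_t size) const
        {
            return static_cast<double>(size) > static_cast<double>(bucket_count) * max_load_factor;
        }

        void increase_size()
        {
            bucket_count *= 2; // keep the bucket count a power of two
        }
    };
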
Azat Khuzhin
3698302ddb Accept float values for dictionary layouts configurations
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
7b5d156cc5 Optimize SPARSE_HASHED layout (by using PackedHashMap)
In case you want a dictionary optimized for memory, SPARSE_HASHED does
not always give you what you need.

Consider the following example: <UInt64, UInt16> as <Key, Value>. This
pair will also have 6 bytes of padding (on amd64), so almost 40% of the
space is wasted.

And because of this padding, even google::sparse_hash_map does not make
the picture better; in fact, sparse_hash_map is not very friendly to
memory allocators (especially jemalloc).

Here are some numbers for a dictionary with 1e9 elements, UInt64 as
key, and UInt16 as value:

settings                         | load (sec) | read (sec) | read (million rows/s) | bytes_allocated | RSS
HASHED upstream                  | -          | -          | -                     | -               | 35GiB
SPARSE_HASHED upstream           | -          | -          | -                     | -               | 26GiB
-                                | -          | -          | -                     | -               | -
sparse_hash_map glibc hashbench  | -          | -          | -                     | -               | 17.5GiB
sparse_hash_map packed allocator | 101.878    | 231.48     | 4.32                  | -               | 17.7GiB
PackedHashMap                    | 15.514     | 42.35      | 23.61                 | 20GiB           | 22GiB

As you can see, PackedHashMap looks much better than HASHED, and even
better than SPARSE_HASHED, but slightly worse than sparse_hash_map with
the packed allocator (which is done with a custom patch to google
sparse_hash_map).

v2: rebase on top of bucket_count fix
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
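
The "almost 40%" figure follows directly from the padded pair layout; a standalone check (assuming a typical amd64 ABI) illustrates it:

    #include <cstdint>
    #include <utility>

    // <UInt64, UInt16> as <Key, Value>: alignment of the 8-byte key forces the
    // 2-byte value to be padded up, so each cell occupies 16 bytes instead of 10.
    static_assert(sizeof(std::pair<uint64_t, uint16_t>) == 16);

    // A packed cell removes the padding: 8 + 2 = 10 bytes per element,
    // i.e. the padded layout wastes 6 of 16 bytes (~38%, "almost 40%").
    struct __attribute__((packed)) PackedCell
    {
        uint64_t key;
        uint16_t value;
    };
    static_assert(sizeof(PackedCell) == 10);
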
xiebin
b358b53d31
Merge branch 'ClickHouse:master' into master 2023-05-15 11:25:54 +08:00
Alexey Milovidov
5a44dc26e7 Fixes for clang-17 2023-05-13 02:57:31 +02:00
Han Fei
ef74e64336 address comments 2023-05-11 22:18:08 +02:00
xiebin
f9eb6ca6fd if the data type of numeric key is not native uint, convert to complex. 2023-05-10 23:47:15 +08:00
Alexey Milovidov
dc3aca6e98
Merge branch 'master' into master 2023-05-10 07:44:10 +03:00
Han Fei
ddce47f79e refine table source for regexp tree dictionary 2023-05-09 20:17:54 +02:00
Han Fei
72fc567d4a
Merge branch 'master' into hanfei/regexp-dict-read 2023-05-08 16:20:12 +02:00
Han Fei
92e57817a2 Support dictionary table function for RegExpTreeDictionary 2023-05-08 16:14:08 +02:00
Alexander Tokmakov
1224ac9eda fix build 2023-05-08 00:57:13 +02:00
Alexander Tokmakov
abf6c60ad2 Merge branch 'master' into fix_dictionaries_loading_order 2023-05-08 00:31:03 +02:00
Alexey Milovidov
e633ebee85 Merge branch 'master' into concurrency-control-controllable 2023-05-07 20:07:07 +02:00
xiebin
7f9b21849c Fixed a lowercase initial letter and removed needless data 2023-05-07 19:06:06 +08:00
xbthink
72dd039d1c add comments and functional test 2023-05-07 16:22:05 +08:00
Alexey Milovidov
6fddb5bad3 Simplification 2023-05-07 06:31:00 +02:00
Alexey Milovidov
a695d6227d Make concurrency control controllable 2023-05-07 06:16:30 +02:00
Michael Kolupaev
49394a097e Fix 'noisy Warning messages' failing when there are no Warning messages 2023-05-06 10:39:59 -07:00
Bin Xie
bc16cc59ff If a dictionary is created with a complex key, automatically choose the "complex key" layout variant. 2023-05-06 11:09:45 +08:00
Alexander Tokmakov
846abe95e9 fix 2023-05-05 20:50:13 +02:00
Alexander Tokmakov
dd1bbf7c78 fix another issue with dependencies 2023-05-05 16:27:12 +02:00
Azat Khuzhin
2fd1a73812 Fix element_count for HASHED/SPARSE_HASHED with multiple attributes
Previously element_count was multiplied by the number of attributes.

Fixes: #5440
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-01 12:44:41 +02:00
Azat Khuzhin
93201f21d9 Fix load_factor for HASHED/SPARSE_HASHED dictionaries with SHARDS
Previously, bucket_count was set only for one shard, and hence
load_factor was > 1.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-01 12:44:41 +02:00
MikhailBurdukov
5c9959af49 Resolve conversation 2023-04-28 12:40:47 +00:00
MikhailBurdukov
40ad8499a0 fix 2023-04-26 21:03:27 +00:00
MikhailBurdukov
b229a28e94
Merge branch 'master' into mongo_dict_tls 2023-04-26 23:39:27 +03:00
MikhailBurdukov
389c0af922 Fix style 2023-04-26 19:36:34 +00:00
MikhailBurdukov
baaee66e85 Missing files 2023-04-26 19:29:29 +00:00
MikhailBurdukov
d76430fe90 Added options handling for mongo dict 2023-04-26 19:19:10 +00:00
Raúl Marín
f0e045bb3d Merge remote-tracking branch 'blessed/master' into arenita 2023-04-24 10:42:56 +02:00
Rory Crispin
99175aefae
trailing whitespace 2023-04-22 17:02:05 +01:00
Rory Crispin
66300402ee
Bracket on newline 2023-04-22 15:42:57 +01:00
Rory Crispin
5e80e9263e
Remove spaces around -> 2023-04-22 15:29:20 +01:00
Rory Crispin
32dcc2e37b
Merge branch 'master' into dict-lifetime-validation 2023-04-22 14:55:11 +01:00
Rory Crispin
006af1dfa1 validate direct dictionary lifetime is unset during creation 2023-04-22 15:49:04 +02:00
Kruglov Pavel
2ad161d2b7
Merge branch 'master' into non-blocking-connect 2023-04-19 13:39:40 +02:00
robot-clickhouse-ci-2
45f4a5f74c
Merge pull request #47964 from ClickHouse/fast-parquet
Read Parquet files faster
2023-04-17 19:27:38 +02:00
Raúl Marín
39f8c43a60 Merge remote-tracking branch 'blessed/master' into arenita 2023-04-17 10:33:38 +02:00
Michael Kolupaev
2d4fe85513 Something 2023-04-17 04:58:32 +00:00
Kseniia Sumarokova
6a0d9a37ce
Merge branch 'master' into fix-mysql-named-collection 2023-04-14 12:03:51 +02:00
kssenii
ad48e1d010 Fix 2023-04-13 19:36:25 +02:00
Raúl Marín
2b70e08f23 Don't count unreserved bytes in Arenas as read_bytes 2023-04-13 12:43:24 +02:00
Robert Schulze
7a21d5888c
Remove -Wshadow suppression which leaked into global namespace 2023-04-13 08:46:40 +00:00
Raúl Marín
da9a539cf7 Reduce the usage of Arena.h 2023-04-13 10:31:32 +02:00
Robert Schulze
f41354ccd6
Merge pull request #48671 from ClickHouse/rs/gcc-removal
Remove GCC remainders
2023-04-13 10:15:35 +02:00
Alexander Tokmakov
75f18b1198
Revert "Check simple dictionary key is native unsigned integer" 2023-04-13 01:32:19 +03:00
Robert Schulze
3f7ce60e03
Merge branch 'master' into rs/gcc-removal 2023-04-12 22:17:04 +02:00
Anton Popov
1520f3e924
Merge pull request #48335 from lzydmxy/check_sample_dict_key_is_correct
Check simple dictionary key is native unsigned integer
2023-04-12 14:27:39 +02:00
Robert Schulze
05606a8835
Clean up GCC warning pragmas 2023-04-11 18:21:08 +00:00
Han Fei
bf28be8837 fix 02504_regexp_dictionary_table_source 2023-04-11 17:07:44 +02:00
Han Fei
6c33180ac8
Merge branch 'master' into hanfei/refine-expmsg 2023-04-11 13:19:38 +02:00
Han Fei
363b97fab8 refine some messages of exception in regexp tree 2023-04-11 11:45:29 +02:00
Alexey Milovidov
d259217cf3
Merge pull request #48570 from azat/build/logger_useful
Remove superfluous includes of logger_userful.h from headers
2023-04-11 03:56:39 +03:00
Alexey Milovidov
93b8fc74ef
Merge pull request #48571 from azat/dict/hashed/uncaught-exception-fix
Fix uncaught exception in case of parallel loader for hashed dictionaries
2023-04-10 23:11:01 +03:00
Azat Khuzhin
79b83c4fd2 Remove superfluous includes of logger_userful.h from headers
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-04-10 17:59:30 +02:00
Azat Khuzhin
211cea5e7c Fix uncaught exception in case of parallel loader for hashed dictionaries
Since ThreadPool::wait() rethrows the first exception (if any):

<details>

<summary>stacktrace</summary>

    2023.04.09 12:53:33.629333 [ 22361 ] {} <Fatal> BaseDaemon: (version 22.13.1.1, build id: 5FB01DCAAFFF19F0A9A61E253567F90685989D2F) (from thread 23032) Terminate called for uncaught exception:
    2023.04.09 12:53:33.630179 [ 23645 ] {} <Fatal> BaseDaemon:
    2023.04.09 12:53:33.630213 [ 23645 ] {} <Fatal> BaseDaemon: Stack trace: 0x7f68b00baccc 0x7f68b006bef2 0x7f68b0056472 0x112a42fe 0x1c17f2a3 0x1c17f238 0xbf4bc3b 0x13961c6d 0x138ee529 0x138ed6bc 0x138dd2f0 0x138dd9c6 0x1571d0dd 0x16197c1f 0x161a231e 0x1619fc93 0x161a51b9 0x11151759 0x1115454e 0x7f68b00b8fd4 0x7f68b013966c
    2023.04.09 12:53:33.630247 [ 23645 ] {} <Fatal> BaseDaemon: 3. ? @ 0x7f68b00baccc in ?
    2023.04.09 12:53:33.630263 [ 23645 ] {} <Fatal> BaseDaemon: 4. gsignal @ 0x7f68b006bef2 in ?
    2023.04.09 12:53:33.630273 [ 23645 ] {} <Fatal> BaseDaemon: 5. abort @ 0x7f68b0056472 in ?
    2023.04.09 12:53:33.648815 [ 23645 ] {} <Fatal> BaseDaemon: 6. ./.build/./src/Daemon/BaseDaemon.cpp:456: terminate_handler() @ 0x112a42fe in /usr/lib/debug/usr/bin/clickhouse.debug
    2023.04.09 12:53:33.651484 [ 23645 ] {} <Fatal> BaseDaemon: 7. ./.build/./contrib/llvm-project/libcxxabi/src/cxa_handlers.cpp:61: std::__terminate(void (*)()) @ 0x1c17f2a3 in /usr/lib/debug/usr/bin/clickhouse.debug
    2023.04.09 12:53:33.654080 [ 23645 ] {} <Fatal> BaseDaemon: 8. ./.build/./contrib/llvm-project/libcxxabi/src/cxa_handlers.cpp:79: std::terminate() @ 0x1c17f238 in /usr/lib/debug/usr/bin/clickhouse.debug
    2023.04.09 12:53:35.025565 [ 23645 ] {} <Fatal> BaseDaemon: 9. ? @ 0xbf4bc3b in /usr/lib/debug/usr/bin/clickhouse.debug
    2023.04.09 12:53:36.495557 [ 23645 ] {} <Fatal> BaseDaemon: 10. DB::ParallelDictionaryLoader<(DB::DictionaryKeyType)0, true, true>::~ParallelDictionaryLoader() @ 0x13961c6d in /usr/lib/debug/usr/bin/clickhouse.debug
    2023.04.09 12:53:37.833142 [ 23645 ] {} <Fatal> BaseDaemon: 11. DB::HashedDictionary<(DB::DictionaryKeyType)0, true, true>::loadData() @ 0x138ee529 in /usr/lib/debug/usr/bin/clickhouse.debug
    2023.04.09 12:53:39.124989 [ 23645 ] {} <Fatal> BaseDaemon: 12. DB::HashedDictionary<(DB::DictionaryKeyType)0, true, true>::HashedDictionary(DB::StorageID const&, DB::DictionaryStructure const&, std::__1::shared_ptr<DB::IDictionarySource>, DB::HashedDictionaryStorageConfiguration const&, std::__1::shared_ptr<DB::Block>) @ 0x138ed6bc in /usr/lib/debug/usr/bin/clickhouse.debug

</details>

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-04-09 22:52:51 +02:00
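
A minimal sketch of the hardening this implies (illustrative, not the actual ParallelDictionaryLoader code): anything that waits for workers inside a destructor must catch what the wait rethrows, because a destructor is implicitly noexcept and an escaping exception terminates the process exactly as in the stack trace above.

    #include <future>
    #include <iostream>
    #include <stdexcept>
    #include <vector>

    // Stand-in for a parallel loader that joins its workers in the destructor.
    struct ParallelLoaderSketch
    {
        std::vector<std::future<void>> workers;

        ~ParallelLoaderSketch()
        {
            // Like ThreadPool::wait(), future::get() rethrows a worker's exception.
            // The destructor is implicitly noexcept, so letting that exception
            // escape would call std::terminate().
            for (auto & worker : workers)
            {
                try
                {
                    worker.get();
                }
                catch (...)
                {
                    std::cerr << "dictionary shard failed to load\n"; // log and keep destroying
                }
            }
        }
    };

    int main()
    {
        ParallelLoaderSketch loader;
        loader.workers.push_back(std::async(std::launch::async, [] { throw std::runtime_error("bad block"); }));
        // The destructor absorbs the worker's exception instead of terminating the process.
    }
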
Alexey Milovidov
09ea79aaf7 Add support for {server_uuid} macro 2023-04-09 03:04:26 +02:00
ltrk2
4544abc7d6 Remove dead code and unused dependencies 2023-04-06 11:37:12 -07:00
lzydmxy
529e1466df use check_dictionary_primary_key instead of check_sample_dict_key_is_correct 2023-04-04 12:04:17 +08:00
lzydmxy
368c120f42 check sample dictionary key is native unsigned integer 2023-04-03 15:48:40 +08:00
kssenii
f96e7b59a2 Better 2023-04-01 13:19:07 +02:00
kssenii
1721b70070 Merge remote-tracking branch 'upstream/master' into ilejn-dict-named-collection 2023-04-01 13:18:26 +02:00
Alexey Milovidov
e982fb9f1c
Merge pull request #47880 from azat/threadpool-introspection
ThreadPool metrics introspection
2023-03-30 01:27:31 +03:00
Azat Khuzhin
f38a7aeabe ThreadPool metrics introspection
There are lots of thread pools, and a simple local-vs-global split is no
longer enough; it is good to know which pool in particular uses threads.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-03-29 10:46:59 +02:00
Vladimir C
d32c285d17
Merge branch 'master' into vdimir/direct-dict-async-read 2023-03-28 12:41:20 +02:00
Han Fei
e3afa5090f
Merge pull request #47218 from hanfei1991/hanfei/optimize-regexp-tree-1
Refine OptimizeRegularExpression Function and RegexpTreeDict
2023-03-27 15:23:01 +02:00
vdimir
f6de216041
PullingAsyncPipelineExecutor for Direct dictionary with ClickHouse source 2023-03-27 09:52:26 +00:00
Kruglov Pavel
3ee12e21fb
Merge branch 'master' into non-blocking-connect 2023-03-23 20:53:44 +01:00
Han Fei
575c4263a3 address comments 2023-03-22 17:47:25 +01:00
avogar
38e44861ae Fix possible race conditions 2023-03-21 16:01:54 +00:00
Kseniia Sumarokova
3c550b4314
Merge pull request #46647 from kssenii/named-collections-finish
Named collections: finish replacing old code for storages
2023-03-21 12:36:46 +01:00
Robert Schulze
5b036a1a3b
More preparation for libcxx(abi), llvm, clang-tidy 16 (follow-up to #47722) 2023-03-20 12:55:03 +00:00
kssenii
cae3b335d6 Merge remote-tracking branch 'upstream/master' into named-collections-finish 2023-03-20 11:23:22 +01:00
kssenii
bb0beb7449 Merge remote-tracking branch 'upstream/master' into named-collections-finish 2023-03-17 13:02:36 +01:00
Sema Checherinda
3c6deddd1d work with comments on PR 2023-03-16 19:55:58 +01:00
Han Fei
e0954ce7be fix compile 2023-03-16 00:22:05 +01:00
Han Fei
a532503466
Merge branch 'master' into hanfei/optimize-regexp-tree-1 2023-03-15 17:56:01 +01:00
Han Fei
424e8df9ad fix style 2023-03-15 16:01:12 +01:00
Han Fei
d78a9e03ad refine 2023-03-15 15:38:11 +01:00
Kseniia Sumarokova
a9a0d2f5c4
Merge pull request #46524 from artem-yadr/master
Support for MongoDB Replica Set URI with readPreference and host:port enum in MongoDB dictionaries
2023-03-07 11:40:33 +01:00
Kseniia Sumarokova
386663953c
Merge branch 'master' into named-collections-finish 2023-03-03 12:23:38 +01:00
artem-yadr
e1352adced
Update MongoDBDictionarySource.cpp 2023-03-01 12:50:03 +03:00
Han Fei
e77dd81036 fix 2023-02-24 19:48:46 +01:00
Han Fei
e8527e720b refine regexp tree dictionary 2023-02-24 13:08:27 +01:00
kssenii
a54b011670 Finish for mysql 2023-02-20 21:37:38 +01:00
Alexey Milovidov
d8cda3dbb8 Remove PVS-Studio 2023-02-19 23:30:05 +01:00
artem-yadr
08734d4dc0 poco changes are now used in MongoDBDictionarySource 2023-02-17 14:56:21 +03:00
Maksim Kita
c469e10092 Dictionaries DictionaryStorageFetchRequest fix 2023-02-16 12:17:02 +01:00
Nikolay Degterinsky
6e4b660033 Move MongoDB and PostgreSQL sources to Sources folder 2023-02-14 22:35:10 +00:00
Ilya Golshtein
38ea27489c secure in named collections - small cleanup 2023-02-13 01:04:38 +03:00
Alexey Milovidov
44bd95a410
Merge pull request #46167 from ClickHouse/rs/reject-dos-patterns
Reject hyperscan regexes which are prone to ReDoS
2023-02-11 06:04:03 +03:00
Ilya Golshtein
3b72b3f13b secure in named collection - switched to specific_args, tests added 2023-02-10 13:42:11 +03:00
Robert Schulze
74937cf27b
Reject DoS-prone hyperscan regexes 2023-02-09 17:17:35 +00:00
Robert Schulze
e490ec91d9
Merge branch 'master' into rs/fix-fragile-linking 2023-02-09 11:33:59 +01:00
Alexander Tokmakov
8101b044fa
Merge pull request #46091 from azat/sanity-assertions
Sanity assertions for closing file descriptors
2023-02-09 01:02:03 +03:00
Ilya Golshtein
f9e81ca7de secure in named collections - initial 2023-02-08 23:30:16 +03:00
Maksim Kita
77ec255d7c
Merge pull request #45396 from kitaisreal/hashed-dictionary-sharded-nullable-fix
HashedDictionary sharded fix nullable values
2023-02-08 17:15:10 +03:00
Robert Schulze
6ff232d782
Merge branch 'master' into rs/fix-fragile-linking 2023-02-08 12:51:12 +01:00
Alexey Milovidov
55c3bbb739 Fix assertion in statistical functions 2023-02-08 00:09:41 +01:00
Robert Schulze
10af0b3e49
Reduce redundancies 2023-02-07 12:27:23 +00:00
Azat Khuzhin
8cc41b7f41 Check return value of ::close()
Note that, according to close(2), close() should not be retried on EINTR

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-02-07 11:28:22 +01:00
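
A hedged illustration of such a sanity assertion (the helper name is made up, not the actual ClickHouse function): the return value of ::close() is checked, and EINTR is deliberately not retried, since the descriptor may already have been released and reused by another thread.

    #include <cerrno>
    #include <cstdio>
    #include <cstdlib>
    #include <unistd.h>

    // Illustrative helper: treat a failed close() as a programming error.
    void closeWithAssertion(int fd)
    {
        if (0 == ::close(fd))
            return;

        // Per close(2), even on EINTR the descriptor must not be closed again:
        // it may have been released and already reused by another thread.
        std::fprintf(stderr, "close(%d) failed: errno=%d\n", fd, errno);
        std::abort();
    }
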
Han Fei
d1d893275a fix 2023-02-06 18:46:23 +01:00
Han Fei
eb76041312 address comments and add one more test 2023-02-06 17:26:20 +01:00
Maksim Kita
e8d66fb1a2 HashedDictionary sharded fix nullable values 2023-02-06 10:50:58 +01:00
Robert Schulze
84b9ff450f
Fix terribly broken, fragile and potentially cyclic linking
Sorry for the clickbaity title. This is about static method
ConnectionTimeouts::getHTTPTimeouts(). It was declared in header
IO/ConnectionTimeouts.h, and defined in header
IO/ConnectionTimeoutsContext.h (!). This is weird and caused issues with
linking on s390x (#45520). There was an attempt to fix some
inconsistencies (#45848), but at first neither @Algunenano nor I really
understood why the definition was in the header.

Turns out that ConnectionTimeoutsContext.h is only #include'd from
source files which are part of the normal server build BUT NOT part of
the keeper standalone build (which must be enabled via CMake
-DBUILD_STANDALONE_KEEPER=1). This dependency was not documented and as
a result, some misguided workarounds were introduced earlier, e.g.
0341c6c54b

The deeper cause was that getHTTPTimeouts() is passed a "Context". This
class is part of the "dbms" library, which is deliberately not linked by
the standalone build of clickhouse-keeper. The context is only used to
read the settings and the "Settings" class is part of the
clickhouse_common library which is linked by clickhouse-keeper already.

To resolve this mess, this PR

- creates source file IO/ConnectionTimeouts.cpp and moves all
  ConnectionTimeouts definitions into it, including getHTTPTimeouts().

- breaks the wrong dependency by passing "Settings" instead of "Context"
  into getHTTPTimeouts().

- resolves the previous hacks
2023-02-05 20:49:34 +00:00
Han Fei
baa345fa64 remove logs 2023-02-05 18:06:06 +01:00
Han Fei
9ea3de14ce use re2 by default 2023-02-04 10:53:54 +01:00
Han Fei
2656027c9f make it work if we dont define use_vectorscan macro 2023-02-03 14:25:53 +01:00
Han Fei
a2e43bc333 log for ci 2023-02-02 14:40:09 +01:00
Han Fei
90153c11fc fix matching priority 2023-02-01 15:09:04 +01:00
Han Fei
24b8322bc9 Merge branch 'master' into hanfei/regexp-refine 2023-01-31 17:03:51 +01:00
Bharat Nallan
f1d6e3b908
Merge branch 'master' into ncb/odbc-connection-pool-fixes 2023-01-30 15:49:04 -08:00
Alexander Tokmakov
d7c697ee38 fix 2023-01-26 15:24:39 +01:00
Alexander Tokmakov
14db798191 fix 2023-01-26 13:56:16 +01:00
Alexander Tokmakov
9b670946db Merge branch 'master' into exception_message_patterns5 2023-01-26 00:41:32 +01:00
Han Fei
7df30b1d82 remove trivial logs 2023-01-25 23:57:20 +01:00
Han Fei
f4d38b82e6 RegExpTreeDict use re2 engines when processing heavy regexps 2023-01-25 23:49:00 +01:00
Nikolay Degterinsky
6b2f3de293
Merge pull request #45512 from evillique/fix-msan-build
Fix MSan build once again (too heavy translation units)
2023-01-25 16:11:21 +01:00
Alexander Tokmakov
6eb557b2ba Merge branch 'master' into exception_message_patterns4 2023-01-25 13:49:17 +01:00
Sergei Trifonov
0d1ea05ff6
Merge pull request #45007 from ClickHouse/cancellable-mutex-integration
Fast shared mutex integration
2023-01-25 11:15:46 +01:00
Bharat Nallan
2ef8fcb318
Merge branch 'master' into ncb/odbc-connection-pool-fixes 2023-01-24 21:27:20 -08:00
Nikolay Degterinsky
97aef55a7f Fix typo 2023-01-24 23:00:02 +00:00
Nikolay Degterinsky
d8d85d9bbd Merge remote-tracking branch 'upstream/master' into fix-msan-build 2023-01-24 22:57:47 +00:00
Nikolay Degterinsky
fb6838b043 Review suggestions 2023-01-24 22:54:01 +00:00
Alexander Tokmakov
3f6594f4c6 forbid old ctor of Exception 2023-01-23 22:18:05 +01:00
Alexander Tokmakov
70d1adfe4b
Better formatting for exception messages (#45449)
* save format string for NetException

* format exceptions

* format exceptions 2

* format exceptions 3

* format exceptions 4

* format exceptions 5

* format exceptions 6

* fix

* format exceptions 7

* format exceptions 8

* Update MergeTreeIndexGin.cpp

* Update AggregateFunctionMap.cpp

* Update AggregateFunctionMap.cpp

* fix
2023-01-24 00:13:58 +03:00
Nikolay Degterinsky
f9960361db Fix MSan build 2023-01-23 14:38:07 +00:00
Sergei Trifonov
0fbfa17863
Merge branch 'master' into cancellable-mutex-integration 2023-01-23 12:44:09 +01:00
Maksim Kita
758c8f2776
Merge branch 'master' into dict/remove-preallocate 2023-01-20 13:15:37 +03:00
Han Fei
94336a9b66 fix typo 2023-01-19 13:55:29 +01:00
Han Fei
2884b8837b fix regexp logical error in stress tests 2023-01-19 12:03:54 +01:00
Bharat Nallan Chakravarthy
16a2585d55 fixes to odbc connection pooling 2023-01-18 16:38:56 -08:00
Azat Khuzhin
4366f7fb3b Remove PREALLOCATE for HASHED/SPARSE_HASHED dictionaries
It does not give a significant benefit, and now hashed/sparse_hashed
dictionaries can be filled in parallel (#40003) using sharded
dictionaries, which should be used instead of PREALLOCATE.

Note that dictionaries that had been created with PREALLOCATE will
still work, but will simply ignore this attribute.

Fixes: #41985 (cc @alexey-milovidov)
Reverts: #23979 (cc @kitaisreal)

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-18 20:18:37 +01:00
Azat Khuzhin
64e3677961 Avoid double hash calculation in HashedDictionary::getShard(StringRef)
Previously it was written this way because getShard() was a simple
modulo operation.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
2783850f08 Minor review fixes in HashedDictionary
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
6e0a7add93 Completelly exception safe HashedDictionary dtor
Previously there was one (even though very unlikely) case where the dtor
could throw: logging code or ThreadPool::wait().

Just guard the dtor with try/catch and be done with it.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
74def83c5d Destroy hashtables for hashed dictionary in parallel only for sharded dict
There can be multiple hash tables, since each attribute uses its
own hash table.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
1c0e0ea1e4 Disable sharded dictionaries with updatable sources
Support for sharded dictionaries with updatable sources is questionable
since:
- sharded dictionaries were developed for hashed dictionaries with a
  huge number of keys
- an updatable source requires storing the whole table in memory (due to
  how reload works)
- it is also an open question whether a sharded dictionary benefits from
  an updatable source at all, since using an updatable source with a
  huge number of changes does not look optimal, and on the other hand,
  if there is only a small amount of changes, you don't need a sharded
  dictionary at all

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
c97991fce1 Use shared arena for HashedDictionary::blockToAttributes()
This should decrease the number of allocations.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
01b100da61 Use shared arena in ParallelDictionaryLoader::createShardSelector() (and add missing rollback)
This should decrease the number of allocations.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
64874824b4 Minor review fixes in HashedDictionary
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
77c1f07636 Make HashedDictionary::~HashedDictionary exception safe
Before, it was possible for the destructor to throw if thread
allocation fails; rewrite it to use trySchedule() and fall back to a
sequential destroy in this case.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
a3f189e191 Optimize sharded dictionaries with skewed distribution
In case of a skewed distribution, a simple modulo will not give you a
good distribution between shards, and eventually this can lead to the
same performance as a non-sharded dictionary (except that it will
occupy +1 thread for Block::scatter).

But if HashedDictionary::blockToAttributes() has no calls to
HashedDictionary::getShard(), this can be fixed by using a more complex
key-to-shard (getShard()) mapping. And actually you do not need to call
getShard() in blockToAttributes(); you can simply use the passed shard,
and that's it.

And by wrapping the key with intHash64() in getShard(), the skewed
distribution can be fixed.

Note that I previously tried a similar approach but did not remove
getShard() from blockToAttributes(); that's why it failed.

And now it works almost as fast as with simple createBlockSelector(),
just 13.6% slower (18.75min vs 16.5min, with 16 threads).

Note that I've also tried to add libdivide for this, but it does not
improve the performance.

I've also tried the approach without scatter, and it works 20% slower
than this one (22.5min vs 18.75min, with 16 threads).

v2: Use intHashCRC32() over intHash64() for HashedDictionary::getShard()
    (with intHash64() it works much slower, almost 2x slower; it took
    18min with 32 threads)

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
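
A standalone illustration of the key-to-shard mapping change (hash_mix below is a splitmix64-style stand-in for the CRC32-based mixer the commit settles on): a plain modulo preserves whatever clustering the keys already have, while mixing the key first spreads a skewed key set evenly across the shards.

    #include <cstddef>
    #include <cstdint>

    // Stand-in 64-bit mixer (splitmix64 finalizer); the commit uses intHashCRC32().
    inline uint64_t hash_mix(uint64_t x)
    {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    // Skew-resistant shard selection: mix first, then take the modulo.
    inline size_t getShardSketch(uint64_t key, size_t shard_count)
    {
        return static_cast<size_t>(hash_mix(key) % shard_count);
    }

    // With a plain `key % shard_count`, keys like 0, 16, 32, ... (a skewed but
    // realistic pattern) all land in shard 0 when shard_count is 16; after mixing,
    // they are distributed roughly uniformly, so all loader threads stay busy.
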
Azat Khuzhin
655a564280 Parallel hash tables destroy for hashed dictionaries
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
99063b152f Allow to configure queue backlog of the parallel hashed dictionary loader
v2: Decrease default parallel_queue_backlog to 10000 (same speed)
v3: Rename parallel_queue_backlog to per_shard_load_backlog
v3: Rename per_shard_load_backlog to shard_load_queue_backlog
v4: Fix documentation
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
79ad81dfdf Implement separate queue for parallel loader of hashed dictionaries
Previous patches in this series had a bottleneck in rehash(). This is
the slowest operation when inserting lots of rows into the hash table,
and eventually the whole thread pool sometimes works only as fast as the
slowest thread, since we did not have any queue of blocks.

This patch adds such a queue, and now it scales linearly: initially,
with 1 thread, I had ~4 hours for 10e9 elements (UInt64 key, UInt16
value); after this patch it works in 16 minutes with 16 threads (well,
actually I have to use 32 threads because of the distribution of data in
the source table).

And now with 16 threads it works 16 times faster.

This patch also adds more optimal block splitting for the non-complex
dictionaries, and the usual block splitting for complex dictionaries.
Anyway, this moves the overhead from the threads loading into the hash
tables out to the reader thread, and this is better, since the reader
does not use that much CPU.

v2: fix use-after-free on failed load (add missing wait in dtor)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
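
A compact generic sketch of the per-shard queue idea (not the actual ThreadPool-based code): the reader pushes blocks into a bounded queue per shard, so a shard stuck in rehash() only stalls its own queue rather than the whole pipeline; the shard_load_queue_backlog setting from the commit above plays the role of the backlog capacity here.

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <optional>
    #include <queue>

    // Bounded blocking queue, one instance per shard. The reader thread produces
    // blocks; the shard's loader thread consumes them and fills its hash table,
    // so a slow rehash() in one shard does not block the other shards.
    template <typename Block>
    class ShardLoadQueue
    {
    public:
        explicit ShardLoadQueue(size_t backlog) : backlog_(backlog) {}

        void push(Block block)
        {
            std::unique_lock lock(mutex_);
            not_full_.wait(lock, [&] { return queue_.size() < backlog_; });
            queue_.push(std::move(block));
            not_empty_.notify_one();
        }

        // Returns std::nullopt once the producer called finish() and the queue drained.
        std::optional<Block> pop()
        {
            std::unique_lock lock(mutex_);
            not_empty_.wait(lock, [&] { return !queue_.empty() || finished_; });
            if (queue_.empty())
                return std::nullopt;
            Block block = std::move(queue_.front());
            queue_.pop();
            not_full_.notify_one();
            return block;
        }

        void finish()
        {
            std::lock_guard lock(mutex_);
            finished_ = true;
            not_empty_.notify_all();
        }

    private:
        std::mutex mutex_;
        std::condition_variable not_empty_, not_full_;
        std::queue<Block> queue_;
        size_t backlog_;
        bool finished_ = false;
    };
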
Azat Khuzhin
5d0fd3cdc4 Remove sharded overhead for non-sharded hashed dictionaries
By adding one more template parameter, HashedDictionary<sharded> (yes,
there are already too many of them for a template class that has
explicit instantiation).

This is done because perf tests [1] showed a 20% slowdown otherwise.

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/40003/8f0cf2d6b8a7df511afe901331d5e2c7b06c0b4d/performance_comparison_[1/4]/report.html

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:26 +01:00
Azat Khuzhin
345c422e28 Add ability to load hashed dictionaries using multiple threads
Right now dictionaries (here I will talk only about
HASHED/SPARSE_HASHED/COMPLEX_KEY_HASHED/COMPLEX_KEY_SPARSE_HASHED)
can load data only in one thread, since they use one hash table that
cannot be filled from multiple threads.

And in case you have a very big dictionary (i.e. 10e9 elements), it can
take a while to load, especially for the SPARSE_HASHED variants (and if
you have that many elements, you likely use SPARSE_HASHED, since it
requires less memory); in my env it takes ~4 hours, which is an enormous
amount of time.

So this patch adds support for shards in dictionaries: the number of
shards determines how many hash tables this dictionary will use and,
more importantly, how many threads it can use to load the data.

And with 16 threads this works 2x faster; not perfect though, see the
follow-up patches in this series.

v0: PARTITION BY
v1: SHARDS 1
v2: SHARDS(1)
v3: tried optimized mod - logical and, but it does not gain even 10%
v4: tried squashing more (max_block_size * shards), but it does not gain even 10% either
v5: move SHARDS into layout parameters (unknown simply ignored)
v6: tune params for perf tests (to avoid too long queries)
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-01-13 13:39:25 +01:00
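
A toy illustration of the sharding idea (not the actual HashedDictionary code, which scatters blocks from the reader instead of rescanning the source per thread): the number of shards determines both how many independent hash tables exist and how many threads can fill them concurrently.

    #include <cstdint>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    int main()
    {
        const size_t shard_count = 16;                     // e.g. SHARDS 16 in the layout
        const std::vector<uint64_t> keys = {1, 2, 3, 42};  // stand-in for the source table

        // One hash table per shard: a single table cannot be filled from many
        // threads, but N independent tables can be filled by N threads.
        std::vector<std::unordered_map<uint64_t, uint16_t>> shards(shard_count);

        std::vector<std::thread> loaders;
        for (size_t shard = 0; shard < shard_count; ++shard)
            loaders.emplace_back([&, shard]
            {
                for (uint64_t key : keys)
                    if (key % shard_count == shard)        // each thread takes only its own keys
                        shards[shard].emplace(key, static_cast<uint16_t>(key));
            });

        for (auto & loader : loaders)
            loader.join();
    }
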
serxa
693489a8ad review fixes 2023-01-12 15:51:04 +00:00
Sergei Trifonov
12d8543578
Merge branch 'master' into cancellable-mutex-integration 2023-01-12 16:03:49 +01:00
Han Fei
6ed4570f73
Merge branch 'master' into regexp-tree-dictionary 2023-01-10 15:36:30 +01:00
Maksim Kita
83a8d3ed25 RangeHashedDictionary update field primary key fix 2023-01-09 13:52:15 +01:00
Sergei Trifonov
81d2ea30ba
Merge branch 'master' into cancellable-mutex-integration 2023-01-07 19:37:46 +01:00
Anton Popov
1f32ffedf8
Merge pull request #43221 from ClickHouse/refactoring-ip-types
Replace domain IP types (IPv4, IPv6) with native
2023-01-07 12:01:21 +01:00
serxa
15bb127b01 replace every std::shared_mutex with DB::FastSharedMutex 2023-01-06 23:35:38 +00:00
Han Fei
a4427a05c2 fix build 2023-01-06 14:30:00 +01:00
Kseniia Sumarokova
573d3283b0
Merge pull request #44327 from kssenii/use-new-named-collections-code-2
Replace old named collections code with new (from #43147) part 2
2023-01-06 13:06:26 +01:00
Han Fei
cac7f65b40 fix build 2023-01-06 11:49:34 +01:00
Han Fei
744084375c fix build 2023-01-05 22:27:45 +01:00
Han Fei
ae5ee8194b fix check style 2023-01-05 17:52:05 +01:00
Han Fei
f2a9eea995 write docs and optimize regex compile 2023-01-05 17:38:01 +01:00
Yakov Olkhovskiy
7a5a36cbed
Merge branch 'master' into refactoring-ip-types 2023-01-04 11:11:06 -05:00
Han Fei
65ef7b4adc fix build 2023-01-04 12:45:12 +01:00
Nikolay Degterinsky
aa41e9b775
Merge pull request #44857 from evillique/fix-msan-build
Try to fix MSan build
2023-01-04 04:31:28 +01:00
Han Fei
00e717d7ce some improvement 2023-01-03 21:41:51 +01:00
kssenii
67509aa2d5 Merge remote-tracking branch 'upstream/master' into use-new-named-collections-code-2 2023-01-03 16:41:30 +01:00
Nikolay Degterinsky
c4431e9931 Fix MSan build 2023-01-03 02:21:26 +00:00
Alexey Milovidov
e855d3519a
Merge branch 'master' into refactoring-ip-types 2023-01-02 21:58:53 +03:00