Commit Graph

1527 Commits

Author SHA1 Message Date
johanngan
bcb058f999 Add case insensitive and dot-all modes to RegExpTree dictionary
The new per-dictionary settings control regex match semantics around
case sensitivity and the '.' wildcard with newlines. They must be set at
the dictionary level since they're applied to regex engines at
pattern-compile-time.

- regexp_dict_flag_case_insensitive: case insensitive matching
- regexp_dict_flag_dotall: '.' matches all characters including newlines

They correspond to HS_FLAG_CASELESS and HS_FLAG_DOTALL in Vectorscan
and case_sensitive and dot_nl in RE2. These are the most useful options
compatible with the internal behavior of RegExpTreeDictionary around
splitting up simple and complex patterns between Vectorscan and RE2.

The alternative is to use (?i) and/or (?s) for all patterns. However,
(?s) isn't handled properly by OptimizedRegularExpression::analyze().
And while (?i) is, it still causes the dictionary to treat the pattern
as "complex" for sequential scanning with RE2 rather than multi-matching
with Vectorscan, even though Vectorscan supports case insensitive
literal matching. Setting dictionary-wide flags is both more convenient,
and circumvents these problems.
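
As a rough illustration of the two match semantics, here is a minimal
sketch that uses Python's standard `re` module as a stand-in for
Vectorscan and RE2 (an assumption made purely for illustration; the
dictionary itself does not use Python). `re.IGNORECASE` and `re.DOTALL`
play the role of the case-insensitive and dot-all flags, and
`compile_pattern` is a hypothetical helper, not part of any real API.

```python
import re

def compile_pattern(pattern, case_insensitive=False, dotall=False):
    """Compile a pattern with flags fixed at compile time (hypothetical helper)."""
    flags = 0
    if case_insensitive:
        flags |= re.IGNORECASE  # analogous in spirit to HS_FLAG_CASELESS
    if dotall:
        flags |= re.DOTALL      # analogous in spirit to HS_FLAG_DOTALL
    return re.compile(pattern, flags)

# The text differs from the pattern in letter case and has a newline
# where the pattern has '.'.
text = "Mozilla\nFirefox"

default = compile_pattern(r"mozilla.firefox")
flagged = compile_pattern(r"mozilla.firefox", case_insensitive=True, dotall=True)

default_matches = default.search(text) is not None
flagged_matches = flagged.search(text) is not None
```

Because the flags are baked in when the pattern is compiled, the same
pattern string behaves differently depending on the settings in effect
at compile time, which is why they are set per dictionary.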
2023-09-06 11:28:53 -05:00
Vitaly Baranov
3b58c5baa6 Always check that block has rows to fix wrong allocation in HashedArrayDictionary::updateData and others. 2023-09-05 09:57:13 +02:00
Sergei Trifonov
802579f3f1
Merge pull request #49618 from ClickHouse/concurrency-control-controllable
Make concurrency control controllable
2023-08-29 19:44:51 +02:00
Raúl Marín
93dac0c880 Support clang-18 (Wmissing-field-initializers) 2023-08-23 15:53:45 +02:00
Amos Bird
076a67bdaa
Consistent file management in CMake 2023-08-21 11:45:08 +08:00
Amos Bird
c43bf153f5
Refactor 2023-08-18 15:38:46 +08:00
Amos Bird
dd0c71b32a
Add error_exit_reaction 2023-08-18 15:38:46 +08:00
Amos Bird
476f3cedc1
Various reactions when executable stderr has data 2023-08-18 15:38:45 +08:00
Sergei Trifonov
771710b377
Merge branch 'master' into concurrency-control-controllable 2023-08-11 16:50:13 +02:00
Alexey Milovidov
aa757490bd Ditch tons of garbage 2023-08-09 02:19:02 +02:00
Han Fei
65dcd79eb0 fix mem leak in RegExpTreeDictionary 2023-08-08 14:58:18 +02:00
Sergei Trifonov
01196ac41f
Merge branch 'master' into concurrency-control-controllable 2023-08-01 15:40:50 +02:00
xiebin
33e2cfcecb
Merge branch 'master' into master 2023-07-30 12:20:54 +08:00
Yakov Olkhovskiy
9a1c59a2f1
Merge branch 'master' into fix-ip-dict 2023-07-26 12:08:49 -04:00
Alexey Milovidov
21382afa2b Check for punctuation 2023-07-25 06:10:04 +02:00
Nikita Mikhaylov
ee0bbc0e54
Merge branch 'master' into headers-blacklist 2023-07-17 19:08:52 +02:00
Yakov Olkhovskiy
e95d413d9a
Merge branch 'master' into fix-ip-dict 2023-07-14 09:11:42 -04:00
Dmitry Kardymon
385a210fee Merge remote-tracking branch 'origin/master' into ADQM-870 2023-07-10 13:19:21 +00:00
robot-clickhouse
1343e5cc45
Merge pull request #51853 from kitaisreal/cache-dictionary-request-only-unique-keys-from-source
CacheDictionary request only unique keys from source
2023-07-08 20:58:16 +02:00
Maksim Kita
8266067e1a Fixed style check 2023-07-07 19:09:55 +03:00
Maksim Kita
23bd23802f CacheDictionary request only unique keys from source 2023-07-07 12:26:15 +03:00
Nikolay Degterinsky
e98d136243
Merge branch 'master' into headers-blacklist 2023-07-07 04:44:06 +02:00
Kseniia Sumarokova
e97e107bcc
Merge branch 'master' into add-separate-access-for-use-named-collections 2023-07-06 12:16:53 +02:00
Alexey Milovidov
2c96580a77
Merge branch 'master' into concurrency-control-controllable 2023-07-04 23:16:04 +03:00
Dmitry Kardymon
ab4142eb8f Merge remote-tracking branch 'clickhouse/master' into ADQM-870 2023-07-04 08:23:31 +03:00
Yakov Olkhovskiy
0529772dd8 support IPv4 and IPv6 as dictionary attributes 2023-07-04 02:19:45 +00:00
Nikolay Degterinsky
82e0237e67
Merge branch 'master' into headers-blacklist 2023-07-03 16:54:50 +02:00
kssenii
ac77f5fe6f Merge remote-tracking branch 'upstream/master' into add-separate-access-for-use-named-collections 2023-07-03 13:55:45 +02:00
Robert Schulze
fe49e98455
Follow-up to re2 update 2023-06-02 (#50949) 2023-07-03 08:28:25 +00:00
Nikolay Degterinsky
8dfa773f44
Merge branch 'master' into headers-blacklist 2023-06-30 23:40:17 +02:00
Sema Checherinda
d0d12bbf3b
Merge branch 'master' into no-finalize-WriteBufferFromOStream 2023-06-30 12:15:17 +02:00
Robert Schulze
6872084051
Merge pull request #50949 from georgthegreat/update-re2
Update contrib/re2 to 2023-06-02
2023-06-30 10:40:17 +02:00
Sema Checherinda
2a1f34e3f9
Merge branch 'master' into no-finalize-WriteBufferFromOStream 2023-06-30 08:01:05 +02:00
Igor Nikonov
56354b7251 Fix yet another place 2023-06-28 16:55:22 +00:00
Igor Nikonov
0b19c1832a Fix: detach from thread group 2023-06-28 14:15:03 +00:00
Sema Checherinda
fe97021929 add missing finalize calls in buffers 2023-06-27 16:54:14 +02:00
Yuriy Chernyshov
3e6654a1fe
Merge branch 'master' into update-re2 2023-06-24 22:34:44 +02:00
Nikita Taranov
fb7d23f245 fix build 2023-06-22 23:54:25 +02:00
Anton Kozlov
0c440b9d6f Report loading status for executable dictionaries correctly 2023-06-22 10:28:13 +00:00
Nikolay Degterinsky
575a1a4907 Add header checks to HTTP dictionary source 2023-06-20 13:29:25 +00:00
Dmitry Kardymon
806176d88e Add input_format_csv_missing_as_default setting and tests 2023-06-15 11:23:08 +00:00
kssenii
25ae93bbf8 Merge remote-tracking branch 'upstream/master' into add-separate-access-for-use-named-collections 2023-06-14 13:33:56 +02:00
JackyWoo
a1641aa25d
Merge branch 'master' into support_redis 2023-06-12 09:53:06 +08:00
Nikolay Degterinsky
9ad8e022a8
Merge branch 'master' into update-mongo 2023-06-10 10:58:02 +02:00
pufit
55d228e78e
Merge branch 'master' into support_redis 2023-06-09 11:45:12 -04:00
kssenii
63f8a3275b Merge remote-tracking branch 'upstream/master' into add-separate-access-for-use-named-collections 2023-06-09 14:32:41 +02:00
johanngan
be8e048799 Revert invalid RegExpTreeDictionary optimization
This reverts the following commits:
- e77dd81036
- e8527e720b

Additionally, functional tests are added.

When scanning complex regexp nodes sequentially with RE2, the old code
has an optimization to break out of the loop early upon finding a leaf
node that matches. This is an invalid optimization because there's no
guarantee that it's actually a VALID match, because its parents might
NOT have matched. Semantically, a user would expect this match to be
discarded and for the search to continue. Instead, since we skipped
matching after the first false positive, subsequent nodes that would
have matched are missing from the output value. This affects both
dictGet and dictGetAll.

It's difficult to distinguish a true positive from a false positive
while looping through complex_regexp_nodes because we would have to scan
all the parents of a matching node to confirm a true positive. Trying to
do this might actually end up being slower than just scanning every
complex regexp node, because complex_regexp_nodes is only a subset of
all the tree nodes; we may end up duplicating work with scanning
that Vectorscan has already done, depending on whether the parent nodes
are "simple" or "complex". So instead of trying to fix this
optimization, just remove it entirely.
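
The failure mode can be sketched with a toy model. This is a
simplification under stated assumptions: the node layout, the `scan`
function, and the sample patterns are hypothetical, every node is
scanned sequentially (the Vectorscan/RE2 split is ignored), and Python's
`re` stands in for RE2.

```python
import re

# Hypothetical tree: id, parent id (None for roots), pattern, attribute value.
NODES = [
    {"id": 1, "parent": None, "pattern": r"Android", "value": "android"},
    {"id": 2, "parent": 1, "pattern": r"Chrome/\d+", "value": "android-chrome"},
    {"id": 3, "parent": None, "pattern": r"Linux", "value": "linux"},
    {"id": 4, "parent": 3, "pattern": r"Chrome/\d+", "value": "linux-chrome"},
]
BY_ID = {n["id"]: n for n in NODES}
LEAF_IDS = {n["id"] for n in NODES} - {n["parent"] for n in NODES}

def node_matches(node, text):
    return re.search(node["pattern"], text) is not None

def is_valid(node, text):
    """A match counts only if the node and all of its ancestors match."""
    while node is not None:
        if not node_matches(node, text):
            return False
        node = BY_ID.get(node["parent"])
    return True

def scan(text, break_on_first_leaf):
    results = []
    for node in NODES:
        if not node_matches(node, text):
            continue
        if is_valid(node, text):
            results.append(node["value"])
        if break_on_first_leaf and node["id"] in LEAF_IDS:
            break  # the invalid shortcut: this leaf may be a false positive
    return results

text = "X11; Linux x86_64 Chrome/114"
with_break = scan(text, break_on_first_leaf=True)
without_break = scan(text, break_on_first_leaf=False)
```

In this toy input, the leaf under "Android" matches on its own pattern
even though its parent does not, so the early break stops the scan
before the valid "Linux" branch is ever reached.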
2023-06-06 16:28:44 -05:00
kssenii
adfedb4df0 Add USE NAMED COLLECTION access 2023-06-06 14:46:34 +02:00
johanngan
c0f162c5b6 Add dictGetAll function for RegExpTreeDictionary
This function outputs an array of attribute values from all regexp nodes
that matched in a regexp tree dictionary. An optional final argument can
be passed to limit the array size.
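
A minimal sketch of the "collect all matches, optionally capped"
behaviour, assuming a flat list of rules instead of a tree. `RULES` and
`get_all` are hypothetical names used only for illustration and do not
reflect the actual function signature.

```python
import re

# Hypothetical flat rule list of (pattern, value) pairs, in scan order.
RULES = [(r"Chrome", "chrome"), (r"Linux", "linux"), (r"Safari", "safari")]

def get_all(text, limit=None):
    """Return the values of all matching rules, optionally capped at `limit`."""
    values = []
    for pattern, value in RULES:
        if limit is not None and len(values) >= limit:
            break  # the array has reached the requested size
        if re.search(pattern, text):
            values.append(value)
    return values
```

Calling `get_all(text)` returns every matching value, while
`get_all(text, limit=2)` stops once two values have been collected.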
2023-06-04 23:46:04 -05:00
JackyWoo
e6d1b3c351 little fix 2023-06-02 10:05:54 +08:00
JackyWoo
f4f939162d new redis engine schema design 2023-06-02 10:05:54 +08:00
JackyWoo
357df40c8f fix tests 2023-06-02 10:05:54 +08:00
JackyWoo
b35867d907 unify storage type 2023-06-02 10:05:54 +08:00
JackyWoo
40cc8d2107 fix code style 2023-06-02 10:05:54 +08:00
JackyWoo
ce203b5ce6 Check redis table structure 2023-06-02 10:05:54 +08:00
JackyWoo
9a495cbf99 Push down filter into Redis 2023-06-02 10:05:54 +08:00
JackyWoo
e91867373c Add table function Redis 2023-06-02 10:05:54 +08:00
xiebin
28d2269661
Merge branch 'master' into master 2023-05-30 16:13:52 +08:00
xiebin
beb3690c7e if dictionary id is number, do not convert layout to complex 2023-05-30 16:09:01 +08:00
Alexander Tokmakov
876490ff40
Merge pull request #50065 from azat/dict/load-factor-range-fix
Fix hashed/sparse_hashed dictionaries max_load_factor upper range
2023-05-22 15:04:56 +03:00
Nikolay Degterinsky
183f90e45a Update MongoDB protocol 2023-05-22 09:05:23 +00:00
Azat Khuzhin
c30658a9ed Fix hashed/sparse_hashed dictionaries max_load_factor upper range
Previously, due to comparison of floats with doubles, the check worked
incorrectly for the upper bound of the range:

    (lldb) p (float)0.99 > (float)0.99
    (bool) $0 = false
    (lldb) p (float)0.99 > (double)0.99
    (bool) $1 = true
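
The same effect can be reproduced outside the debugger. The sketch
below uses Python's `struct` module to round 0.99 to single precision;
Python floats are C doubles, so the final comparison mirrors a float
being widened to double. `to_float32` is a hypothetical helper.

```python
import struct

def to_float32(x):
    """Round a Python float (a C double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

limit_as_float = to_float32(0.99)   # like (float)0.99
limit_as_double = 0.99              # like (double)0.99

# float compared with float: both sides carry the same rounding error.
same_width = to_float32(0.99) > limit_as_float

# float widened to double: the single-precision rounding error becomes visible.
mixed_width = limit_as_float > limit_as_double
```

The single-precision value nearest to 0.99 lies slightly above the
double-precision one, so a configured value of 0.99 stored as a float
compares as greater than a double upper bound of 0.99.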

This should also fix performance tests errors on CI:

    clickhouse_driver.errors.ServerException: Code: 36.
    DB::Exception: default.simple_key_HASHED_dictionary_l0_99: max_load_factor parameter should be within [0.5, 0.99], got 0.99. Stack trace:

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-22 08:59:48 +02:00
Azat Khuzhin
0586a27432 Charge only server memory for dictionaries
Right now memory for a dictionary is charged to the query/user, but
only if the dictionary is loaded by the user (via SYSTEM RELOAD QUERY
or via dictGet()). It can also be loaded in the background (due to
lifetime or update_field), so, as with Buffer, only server memory
should be charged.

v2: mark test as long
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Co-authored-by: Sergei Trifonov <svtrifonov@gmail.com>
2023-05-21 22:53:52 +02:00
Azat Khuzhin
e1e2a83a9e Print type of the structure that will be used for HASHED/SPARSE_HASHED
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
f8e7d2cb1f Remove part of the HashTableGrowerWithPrecalculationAndMaxLoadFactor comment
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
c9cde110cd Add initial degree as parameter for HashTableGrowerWithPrecalculationAndMaxLoadFactor
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
42eac6bfbc Wrap implementation helpers into HashedDictionaryImpl namespace
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
6f351851ad Rename grower to HashTableGrowerWithPrecalculationAndMaxLoadFactor
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
1ab130132c Add more comments into HashedDictionaryCollectionType.h
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
422cbe08fe Do not use PackedHashMap for non-POD for the purposes of layout
In clang-16 the behaviour for POD types was changed in [1], which does
not allow us to use PackedHashMap for some types.

  [1]: 277123376c

Note that I tried to come up with a more generic solution than
enumerating types, but failed. Though now I think that this is good,
since it shows which types are not allowed for PackedHashMap.

Another option is to use -fclang-abi-compat=13.0 but I doubt it is a
good idea.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
2996b38606 Add ability to configure maximum load factor for the HASHED/SPARSE_HASHED layout
As it turns out, HashMap/PackedHashMap works great even with a max load
factor of 0.99. By "great" I mean that at least it works faster than
google sparsehash, not to mention its friendliness to the memory
allocator (it has zero fragmentation since it works with a contiguous
memory region, in comparison to sparsehash, which does lots of
realloc, which jemalloc does not like due to its slabs).

Here is a table of different setups:

settings                         | load (sec) | read (sec) | read (million rows/s) | bytes_allocated | RSS
-                                | -          | -          | -                     | -               | -
HASHED upstream                  | -          | -          | -                     | -               | 35GiB
SPARSE_HASHED upstream           | -          | -          | -                     | -               | 26GiB
-                                | -          | -          | -                     | -               | -
sparse_hash_map glibc hashbench  | -          | -          | -                     | -               | 17.5GiB
sparse_hash_map packed allocator | 101.878    | 231.48     | 4.32                  | -               | 17.7GiB
PackedHashMap 0.5                | 15.514     | 42.35      | 23.61                 | 20GiB           | 22GiB
hashed 0.95                      | 34.903     | 115.615    | 8.65                  | 16GiB           | 18.7GiB
**PackedHashMap 0.95**           | **93.6**   | **19.883** | **10.68**             | **10GiB**       | **12.8GiB**
PackedHashMap 0.99               | 26.113     | 83.6       | 11.96                 | 10GiB           | 12.3GiB

As it shows, PackedHashMap with 0.95 max_load_factor eats 2.6x less
memory than SPARSE_HASHED in upstream, and it is also 2x faster for read!

v2: fix grower
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
3698302ddb Accept float values for dictionary layouts configurations
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
Azat Khuzhin
7b5d156cc5 Optimize SPARSE_HASHED layout (by using PackedHashMap)
In case you want a dictionary optimized for memory, SPARSE_HASHED does
not always give you what you need.

Consider, for example, <UInt64, UInt16> as <Key, Value>: this pair
also carries 6 bytes of padding (on amd64), so almost 40% of the space
is wasted.
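
The padding arithmetic can be checked with a small sketch that uses
Python's `ctypes` to lay out a struct with the platform's native C
alignment. `PaddedPair` is a hypothetical stand-in for the <Key, Value>
pair, not the actual type used by the dictionary.

```python
import ctypes

class PaddedPair(ctypes.Structure):
    # Native alignment: on amd64 the struct is padded up to the
    # 8-byte alignment of the UInt64 key.
    _fields_ = [("key", ctypes.c_uint64), ("value", ctypes.c_uint16)]

# Bytes actually needed for the data: 8 (key) + 2 (value).
payload = ctypes.sizeof(ctypes.c_uint64) + ctypes.sizeof(ctypes.c_uint16)

padded_size = ctypes.sizeof(PaddedPair)
padding = padded_size - payload
wasted_fraction = padding / padded_size
```

On amd64 this is expected to report 6 bytes of padding in a 16-byte
struct, that is 37.5% of the space, while a packed layout would need
only the 10 payload bytes.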

And because of this padding, even google::sparse_hash_map does not make
the picture better; in fact, sparse_hash_map is not very friendly to
memory allocators (especially jemalloc).

Here are some numbers for a dictionary with 1e9 elements, UInt64 as
key, and UInt16 as value:

settings                         | load (sec) | read (sec) | read (million rows/s) | bytes_allocated | RSS
HASHED upstream                  | -          | -          | -                     | -               | 35GiB
SPARSE_HASHED upstream           | -          | -          | -                     | -               | 26GiB
-                                | -          | -          | -                     | -               | -
sparse_hash_map glibc hashbench  | -          | -          | -                     | -               | 17.5GiB
sparse_hash_map packed allocator | 101.878    | 231.48     | 4.32                  | -               | 17.7GiB
PackedHashMap                    | 15.514     | 42.35      | 23.61                 | 20GiB           | 22GiB

As you can see, PackedHashMap looks much better than HASHED, and
even better than SPARSE_HASHED, but slightly worse than sparse_hash_map
with packed allocator (it is done with a custom patch to google
sparse_hash_map).

v2: rebase on top of bucket_count fix
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-19 06:07:21 +02:00
xiebin
b358b53d31
Merge branch 'ClickHouse:master' into master 2023-05-15 11:25:54 +08:00
Alexey Milovidov
5a44dc26e7 Fixes for clang-17 2023-05-13 02:57:31 +02:00
Han Fei
ef74e64336 address comments 2023-05-11 22:18:08 +02:00
xiebin
f9eb6ca6fd if the data type of numeric key is not native uint, convert to complex. 2023-05-10 23:47:15 +08:00
Alexey Milovidov
dc3aca6e98
Merge branch 'master' into master 2023-05-10 07:44:10 +03:00
Han Fei
ddce47f79e refine table source for regexp tree dictionary 2023-05-09 20:17:54 +02:00
Han Fei
72fc567d4a
Merge branch 'master' into hanfei/regexp-dict-read 2023-05-08 16:20:12 +02:00
Han Fei
92e57817a2 Support dictionary table function for RegExpTreeDictionary 2023-05-08 16:14:08 +02:00
Alexander Tokmakov
1224ac9eda fix build 2023-05-08 00:57:13 +02:00
Alexander Tokmakov
abf6c60ad2 Merge branch 'master' into fix_dictionaries_loading_order 2023-05-08 00:31:03 +02:00
Alexey Milovidov
e633ebee85 Merge branch 'master' into concurrency-control-controllable 2023-05-07 20:07:07 +02:00
xiebin
7f9b21849c Fixed a lowercase initial letter and removed needless data 2023-05-07 19:06:06 +08:00
xbthink
72dd039d1c add comments and functional test 2023-05-07 16:22:05 +08:00
Alexey Milovidov
6fddb5bad3 Simplification 2023-05-07 06:31:00 +02:00
Alexey Milovidov
a695d6227d Make concurrency control controllable 2023-05-07 06:16:30 +02:00
Michael Kolupaev
49394a097e Fix 'noisy Warning messages' failing when there are no Warning messages 2023-05-06 10:39:59 -07:00
Bin Xie
bc16cc59ff If a dictionary is created with a complex key, automatically choose the "complex key" layout variant. 2023-05-06 11:09:45 +08:00
Alexander Tokmakov
846abe95e9 fix 2023-05-05 20:50:13 +02:00
Alexander Tokmakov
dd1bbf7c78 fix another issue with dependencies 2023-05-05 16:27:12 +02:00
Azat Khuzhin
2fd1a73812 Fix element_count for HASHED/SPARSE_HASHED with multiple attributes
Previously element_count was multiplied by the number of attributes.

Fixes: #5440
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-01 12:44:41 +02:00
Azat Khuzhin
93201f21d9 Fix load_factor for HASHED/SPARSE_HASHED dictionaries with SHARDS
Previously, bucket_count was set only for the one shard, and hence
load_factor was > 1.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2023-05-01 12:44:41 +02:00
MikhailBurdukov
5c9959af49 Resolve conversation 2023-04-28 12:40:47 +00:00
MikhailBurdukov
40ad8499a0 fix 2023-04-26 21:03:27 +00:00
MikhailBurdukov
b229a28e94
Merge branch 'master' into mongo_dict_tls 2023-04-26 23:39:27 +03:00
MikhailBurdukov
389c0af922 Fix style 2023-04-26 19:36:34 +00:00
MikhailBurdukov
baaee66e85 Missing files 2023-04-26 19:29:29 +00:00
MikhailBurdukov
d76430fe90 Added options handling for mongo dict 2023-04-26 19:19:10 +00:00