3744 Commits

Author SHA1 Message Date
Robert Schulze
0b0d64c5ca
Don't resize output vector in each loop iteration 2022-07-06 14:45:22 +00:00
Robert Schulze
144c1edb03
Inherit default implementation of getArgumentsThatAreAlwaysConstant() 2022-07-06 14:42:19 +00:00
mergify[bot]
8cc2c3914b
Merge branch 'master' into isnullable 2022-07-06 14:40:01 +00:00
Robert Schulze
6a907b23fb
Replace typeid_cast() with checkAndGetColumnConst()
... syntactic sugar
2022-07-06 14:38:28 +00:00
Robert Schulze
0c4da85e75
More uniform naming 2022-07-06 14:29:42 +00:00
mergify[bot]
e5535f5ab4
Merge branch 'master' into mapupdate_dev 2022-07-06 13:54:36 +00:00
Nikolai Kochetov
020c99a269
Merge pull request #38617 from azat/contrib-debug-symbols
Add separate option to omit symbols from heavy contrib
2022-07-06 14:40:24 +02:00
Robert Schulze
d0b2f13f9d
Fix style check 2022-07-05 13:41:52 +02:00
lokax
e6bd0105b1 feat(Function): isNullable
Signed-off-by: lokax <m632656684@gmail.com>
2022-07-05 15:51:53 +08:00
lokax
849c46e6fa feat(Function): isNullable 2022-07-05 15:23:07 +08:00
mergify[bot]
f14b62b2d6
Merge branch 'master' into index-fix-1 2022-07-04 22:50:36 +00:00
Robert Schulze
1eed72b525
Make more multi-search methods work with non-const needles
After making function multi[Fuzzy]Match(Any|AnyIndex|AllIndices)() work
with non-const needles, 12 more functions started to fail in test
"00233_position_function_family":

multiSearchAny()
multiSearchAnyCaseInsensitive()
multiSearchAnyUTF8()
multiSearchAnyCaseInsensitiveUTF8()

multiSearchFirstPosition()
multiSearchFirstPositionCaseInsensitive()
multiSearchFirstPositionUTF8()
multiSearchFirstPositionCaseInsensitiveUTF8()

multiSearchFirstIndex()
multiSearchFirstIndexCaseInsensitive()
multiSearchFirstIndexUTF8()
multiSearchFirstIndexCaseInsensitiveUTF8()

Failing queries take the form
  select 0 = multiSearchAny('\0', CAST([], 'Array(String)'));
2022-07-04 14:00:21 +00:00
Robert Schulze
ece61f6da3
Fix davenger's review comments
https://github.com/ClickHouse/ClickHouse/pull/38434#discussion_r907397214
https://github.com/ClickHouse/ClickHouse/pull/38434#discussion_r907385290
https://github.com/ClickHouse/ClickHouse/pull/38434#discussion_r907406097

(the latter is no longer relevant as the affected places were removed in
the meantime)
2022-07-04 10:43:21 +00:00
Robert Schulze
d547aa7849
Allow non-const pattern array argument in multi[Fuzzy]Match*()
Resolves #38046
2022-07-04 10:43:16 +00:00
Alexander Gololobov
8ce8158f7f Do computations in Float32 (not Float64) for arrays of Float32 2022-07-03 10:33:11 +02:00
Alexander Gololobov
c6691cc5f2 Improved vectorized execution of main loop for array norm/distance 2022-07-02 22:45:22 +02:00
mergify[bot]
b016be264c
Merge branch 'master' into squared_l2 2022-07-02 09:17:28 +00:00
Azat Khuzhin
e8f5cd3c68 Add separate option to omit symbols from heavy contrib
Sometimes it is useful to build contrib with debug symbols for further
debugging.

With everything turned ON (i.e. debug build) I got 3.3GB vs 3.0GB w/o
this patch, i.e. 9% bloat. Let me know whether this is OK for you; if
not, STRIP_DEBUG_SYMBOLS_HEAVY_CONTRIB can be OFF by default (regardless
of build type).

P.S. aws debug symbols add just 1.7%.
v2: rename STRIP_HEAVY_DEBUG_SYMBOLS
v3: OMIT_HEAVY_DEBUG_SYMBOLS
v4: documentation had been removed
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-07-02 06:32:03 +03:00
Robert Schulze
2a1ede0f5a
Merge pull request #38589 from ClickHouse/fix-zero-bytes-in-haystack
Fix countSubstrings() & position() on patterns with 0-bytes
2022-07-01 16:15:43 +02:00
Amos Bird
84a407f381
Fix toHour monotonicity 2022-07-01 18:24:24 +08:00
Alexander Gololobov
b2b31103c5 Reuse common code for L2Squared and L2 2022-06-30 14:12:25 +02:00
Azat Khuzhin
a47355877e
Add revision() function (#38555)
It can be useful to match versions, since in some tables
(system.trace_log) there is only revision column.

P.S. came across this while digging into stress reports from CI.
P.P.S. case insensitive by analogy with version().

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-06-30 12:58:26 +02:00
Robert Schulze
81bb2242fd
Fix countSubstrings() & position() on patterns with 0-bytes
SQL functions countSubstrings(), countSubstringsCaseInsensitive(),
countSubstringsUTF8(), position(), positionCaseInsensitive(),
positionUTF8() with non-const pattern argument use fallback searchers
LibCASCIICaseSensitiveStringSearcher and LibCASCIICaseInsensitiveStringSearcher
which call ::strstr(), resp. ::strcasestr(). These functions assume that
the haystack is 0-terminated and they even document that. However, the
callers did not check if the haystack contains a 0-byte (perhaps because
it's sort of expensive). As a consequence, if the haystack contained a
zero byte in its payload, matches behind this zero byte were ignored.

    create table t (id UInt32, pattern String) engine = MergeTree() order by id;
    insert into t values (1, 'x');
    select countSubstrings('aaaxxxaa\0xxx', pattern) from t;

We returned 3 before this commit, now we return 6
2022-06-29 21:41:18 +00:00
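
The effect described in 81bb2242fd can be reproduced outside ClickHouse. The following is a minimal sketch, not the ClickHouse implementation: `countWithStrstr` and `countLengthAware` are hypothetical helper names, and `std::string_view::find` merely stands in for whatever length-aware search the actual fix uses.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Counts non-overlapping matches via strstr(), which stops at the first 0-byte.
// Precondition: haystack.data() and needle.data() point to 0-terminated buffers
// (true for string literals).
std::size_t countWithStrstr(std::string_view haystack, std::string_view needle)
{
    if (needle.empty())
        return 0;
    std::size_t count = 0;
    const char * pos = haystack.data();
    while ((pos = std::strstr(pos, needle.data())) != nullptr)
    {
        ++count;
        pos += needle.size();
    }
    return count;
}

// Counts non-overlapping matches over the full length of the haystack,
// including the bytes behind an embedded 0-byte.
std::size_t countLengthAware(std::string_view haystack, std::string_view needle)
{
    if (needle.empty())
        return 0;
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = haystack.find(needle, pos)) != std::string_view::npos)
    {
        ++count;
        pos += needle.size();
    }
    return count;
}
```

Called with the 12-byte haystack `std::string_view("aaaxxxaa\0xxx", 12)` and the needle `"x"`, the first function reports 3 matches and the second reports 6, matching the before/after numbers in the commit message.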
hexiaoting
e32a0838d1 fix bug for mapUpdate 2022-06-29 15:52:08 +08:00
Julian Gilyadov
d0d72e0b5e
Implement L2Squared Distance and Norm 2022-06-28 15:28:58 -04:00
Miel Donkers
4e9f396a48
Small improvement of the error message to hint at possible issue (#38458) 2022-06-28 13:36:30 +02:00
Robert Schulze
959cbaab02
Move loop over patterns into implementations
- This is preparation for non-const regexp arguments, where this loop
  will run for each row.
2022-06-26 16:26:13 +00:00
Robert Schulze
cb5d1a4a85
Fix style check 2022-06-26 16:25:49 +00:00
Robert Schulze
b8f67185bf
Cosmetics: Whitespaces 2022-06-26 16:25:49 +00:00
Robert Schulze
e2b11899a1
Move check if cfg allows hyperscan into implementations
- This is not needed for non-const regexp array arguments but cleans up
  the code and runs the check only in functions which actually use
  hyperscan.
2022-06-26 16:25:49 +00:00
Robert Schulze
c2cea38b97
Move local variable into if statement 2022-06-26 16:25:49 +00:00
Robert Schulze
c9ce0efa66
Instantiate MultiMatchAnyImpl template using enums
- With this, invalid combinations of the FindAny/FindAnyIndex bools are
  no longer possible and we can remove the corresponding check

- Also makes the instantiations more readable.
2022-06-26 16:25:49 +00:00
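
The idea in c9ce0efa66 of replacing two bool template parameters by a single enum can be sketched as follows. All names here (`MultiMatchMode`, `MultiMatchSketch`) are hypothetical, plain substring search stands in for regexp matching, and the 1-based index is a choice made for illustration only.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the two former bool template parameters.
// With one enum, a "FindAny and FindAnyIndex both set" combination cannot be expressed.
enum class MultiMatchMode
{
    Any,      // result: 1 if any pattern matches, else 0
    AnyIndex, // result: 1-based index of the first matching pattern, else 0
};

template <MultiMatchMode mode>
struct MultiMatchSketch
{
    // Substring search is used here instead of a real regexp engine.
    static std::size_t execute(std::string_view haystack, const std::vector<std::string_view> & patterns)
    {
        for (std::size_t i = 0; i < patterns.size(); ++i)
        {
            if (haystack.find(patterns[i]) != std::string_view::npos)
            {
                if constexpr (mode == MultiMatchMode::Any)
                    return 1;
                else
                    return i + 1;
            }
        }
        return 0;
    }
};
```

An instantiation such as `MultiMatchSketch<MultiMatchMode::AnyIndex>` also reads more clearly at the call site than a pair of `true`/`false` arguments.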
Robert Schulze
2f15d45f27
Move check for regexp array size into implementations
- This is not needed for non-const regexp array arguments (the
  cardinality of arrays is fixed per column) but it cleans up the code
  and runs the check only in functions which have restrictions on the
  number of patterns.

- For functions using hyperscan, it was checked that the number of
  regexes is < 2^32. Removed the check because I don't think anyone will
  ever specify 4 billion patterns.
2022-06-26 16:25:43 +00:00
Robert Schulze
3478db9fb6
Move check for regexp array size into implementations
- This is not needed for non-const regexp array arguments (the
  cardinality of arrays is fixed per column) but it cleans up the code
  and runs the check only in functions which have restrictions on the
  number of patterns.

- For functions using hyperscan, it was checked that the number of
  regexes is < 2^32. Removed the check because I don't think anyone will
  ever specify 4 billion patterns.
2022-06-26 15:38:12 +00:00
Robert Schulze
7913edc172
Move check for hyperscan regexp constraints into implementations
- This is preparation for non-const regexp arguments, where this check
  will run for each row.
2022-06-26 15:38:05 +00:00
Robert Schulze
89bfdd50bf
Remove unnecessary check
- getReturnTypeImpl() ensures that the haystack column has type "String"
  and we can simply assert that.
2022-06-26 15:34:24 +00:00
Robert Schulze
580d89477f
Minimally faster performance 2022-06-26 15:34:22 +00:00
Robert Schulze
4bc59c18e3
Cosmetics: Move some code around + docs + whitespaces + minor stuff 2022-06-26 15:34:15 +00:00
Robert Schulze
1273756911
Cosmetics: fmt-based exceptions 2022-06-26 15:33:18 +00:00
Robert Schulze
e5c74a14f7
Cosmetics: More consistent naming
- rename utility function and file to "checkHyperscanRegexp"
2022-06-26 15:33:18 +00:00
Robert Schulze
072e0855a8
Cosmetics: Make member variables const 2022-06-26 15:32:26 +00:00
Robert Schulze
2ebfd01c2e
Cosmetics: Pull out settings variable 2022-06-26 15:32:23 +00:00
Robert Schulze
bb7c627964
Cosmetics: Pass patterns around as std::string_view instead of StringRef
- The patterns are not used in hashing, there should not be a performance
  impact when we use stuff from the standard library instead.

- added forgotten .reserve() in FunctionsMultiStringPosition.h
2022-06-26 15:32:19 +00:00
Alexey Milovidov
b3098822e0
Merge pull request #38171 from ClickHouse/hyper-to-vectorscan
Replace hyperscan by vectorscan
2022-06-26 10:01:45 +03:00
Alexey Milovidov
0654684bd4 Fix wrong implementation of filesystem* functions 2022-06-25 06:10:50 +02:00
Andrey Zvonov
ea73d9c492
Merge branch 'master' into zvonand-base58 2022-06-24 21:37:20 +03:00
Kruglov Pavel
0201d62090
Merge pull request #38173 from Avogar/fix-short-circuit
Fix bug with nested short-circuit functions
2022-06-24 16:04:17 +02:00
Robert Schulze
2c828338f4
Replace hyperscan by vectorscan
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
    preserved for compatibility. Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
    "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
    replacement.

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
2022-06-24 10:47:52 +02:00
Andrey Zvonov
c18d09a617
Merge branch 'master' into zvonand-base58 2022-06-24 07:05:49 +03:00
zvonand
dd8203038f updated exception handling 2022-06-24 00:36:57 +05:00
zvonand
a94c40e33a Merge branch 'master' of github.com:ClickHouse/ClickHouse into dt64_timeslots 2022-06-23 17:08:28 +05:00
mergify[bot]
234f0c6399
Merge branch 'master' into revert-35914-FIPS_compliance 2022-06-23 12:06:17 +00:00
zvonand
946117ec89 Merge branch 'master' of github.com:ClickHouse/ClickHouse into zvonand-base58 2022-06-23 17:04:40 +05:00
Alexey Milovidov
5855668514 Remove trash 2022-06-22 06:23:35 +02:00
mergify[bot]
bb79eb73e6
Merge branch 'master' into fix-short-circuit 2022-06-21 10:40:07 +00:00
Kruglov Pavel
b9b58b4305
Merge pull request #37759 from Avogar/fix-nothing-error
Fix possible logical error with type Nothing in some functions
2022-06-21 12:35:05 +02:00
Larry Luo
bbd73ba727 use utility methods to access x509 struct fields. 2022-06-20 21:27:33 -04:00
zvonand
22af00b757 rename variable + fix handling of ENABLE_LIBRARIES 2022-06-20 23:53:47 +05:00
mergify[bot]
b440ee84ae
Merge branch 'master' into fix-short-circuit 2022-06-20 15:14:19 +00:00
zvonand
d4e5686b99 minor: fix message for base64 2022-06-20 20:13:09 +05:00
zvonand
78d55d6f46 small fixes 2022-06-20 19:30:54 +05:00
zvonand
832fd6e0a9 Added tests + minor updates 2022-06-19 23:10:28 +05:00
Alexey Milovidov
0cf88e0950
Revert "ClickHouse's boringssl module updated to the official version of the FIPS compliant." 2022-06-18 23:16:18 +03:00
zvonand
f4b3af091d fix zero byte 2022-06-17 23:48:14 +05:00
avogar
23f48a9fb9 Fix bug with nested short-circuit functions 2022-06-17 11:44:49 +00:00
Andrey Zvonov
f987f461e5 fix style -- rm unused ErrorCode 2022-06-17 15:00:32 +05:00
Igor Nikonov
baebbc084f
Merge pull request #38027 from ClickHouse/decimal_rounding_fix
Fix: rounding for Decimal128/Decimal256 with more than 19-digits long scale
2022-06-17 09:48:18 +02:00
zvonand
c1b2b669ab remove wrong code 2022-06-17 01:52:45 +05:00
mergify[bot]
f46f7257dd
Merge branch 'master' into fix-nothing-error 2022-06-16 10:58:03 +00:00
mergify[bot]
2557e8ad51
Merge branch 'master' into decimal_rounding_fix 2022-06-16 10:53:49 +00:00
avogar
a3a7cc7a5d Fix logical error in array mapped functions with const nullable column 2022-06-16 10:41:53 +00:00
zvonand
a800158438 wip upload 2022-06-16 15:11:41 +05:00
Danila Kutenin
048f56bf4d Fix some tests and comments 2022-06-15 14:40:21 +00:00
Danila Kutenin
08e3f77a9c Optimize most important parts with NEON SIMD
First part: updated most of the UTF8, hashing, memory and codec code,
except utf8lower and upper, which may follow a little later.

That includes a huge amount of research on dealing with movemask. Exact
details and blog post TBD.
2022-06-15 13:19:29 +00:00
mergify[bot]
d704264fae
Merge branch 'master' into decimal_rounding_fix 2022-06-15 10:47:09 +00:00
zvonand
c149c916ec initial setup 2022-06-15 11:49:55 +05:00
Igor Nikonov
bf7dd39282 Fix: decimal rounding
Fixes #37531
2022-06-14 18:03:05 +00:00
Maksim Kita
dc2e117cce UnaryLogicalFunctions improve performance using dynamic dispatch 2022-06-14 17:30:11 +02:00
zvonand
a5a980b69d Added no_sanitize 2022-06-13 19:45:54 +05:00
zvonand
54b8709cb1 minor fix 2022-06-13 19:21:07 +05:00
Robert Schulze
5f5732a2c4
Merge pull request #37969 from ClickHouse/consistent-macro-usage
More consistent use of platform macros
2022-06-10 14:10:01 +02:00
zvonand
fb67b080b9 added docs 2022-06-10 14:30:17 +03:00
zvonand
551d1ea875 fix wrong interval 2022-06-10 13:21:31 +03:00
Robert Schulze
1a0b5f33b3
More consistent use of platform macros
cmake/target.cmake defines macros for the supported platforms, this
commit changes predefined system macros to our own macros.

__linux__ --> OS_LINUX
__APPLE__ --> OS_DARWIN
__FreeBSD__ --> OS_FREEBSD
2022-06-10 10:22:31 +02:00
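
As a small illustration of the macro mapping in 1a0b5f33b3, platform-specific code would test the project-defined macros rather than the compiler-predefined ones. This is a sketch under the assumption that the build system defines `OS_LINUX`, `OS_DARWIN` or `OS_FREEBSD`; `platformName` is a hypothetical helper, and a bare compile without those definitions takes the last branch.

```cpp
#include <string_view>

// OS_LINUX / OS_DARWIN / OS_FREEBSD are expected to come from the build system
// (cmake/target.cmake according to the commit message), not from the compiler.
constexpr std::string_view platformName()
{
#if defined(OS_LINUX)
    return "linux";
#elif defined(OS_DARWIN)
    return "darwin";
#elif defined(OS_FREEBSD)
    return "freebsd";
#else
    return "unknown"; // none of the project macros is defined
#endif
}
```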
zvonand
e19653618c fix wrongfully added submodule 2022-06-10 11:19:38 +03:00
zvonand
16087ea400 enable dt64 for timeslots 2022-06-09 15:28:18 +03:00
Maksim Kita
0c1211eb61
Merge pull request #37930 from kitaisreal/function-dict-get-check-arguments-size
Function dictGet check arguments size
2022-06-08 23:25:14 +02:00
Maksim Kita
b7152fa2bf Function dictGet check arguments size 2022-06-08 17:19:30 +02:00
Maksim Kita
7d1a43cfeb Fix setting cast_ipv4_ipv6_default_on_conversion_error for internal cast 2022-06-08 12:43:39 +02:00
Maksim Kita
4e160105b9
Merge pull request #37805 from kitaisreal/dictionaries-hierarchy-nullable-key-support
Hierarchical dictionaries support nullable parent key
2022-06-08 12:36:09 +02:00
Anton Popov
df6882d2b9
Revert "Fix errors of CheckTriviallyCopyableMove type" 2022-06-07 13:53:10 +02:00
mergify[bot]
014d9e2144
Merge branch 'master' into fix-nothing-error 2022-06-07 11:24:28 +00:00
avogar
cbd50aecd4 Fix 2022-06-07 11:23:59 +00:00
Vitaly Baranov
d199478169
Merge pull request #37303 from ClickHouse/fix_trash
Try to fix some trash
2022-06-07 10:17:39 +02:00
Robert Schulze
2d87af2a15
Merge pull request #37647 from DevTeamBK/Fix-all-CheckTriviallyCopyableMove-Errors
Fix errors of CheckTriviallyCopyableMove type
2022-06-05 19:58:47 +02:00
zvonand
5475f62363 32 to 64 2022-06-05 13:06:48 +03:00
Maksim Kita
6db5c08fde Functions dictGetChildren, dictGetDescendants added support for nullable parent key 2022-06-03 17:36:16 +02:00
Maksim Kita
a0cbbd9edc Hierarchical Cache, Direct dictionaries added support for nullable parent key 2022-06-03 17:21:55 +02:00
Anton Popov
f592a802a1
Merge pull request #37482 from CurtizJ/json-new-serialization
Better binary serialization of `ColumnObject`
2022-06-03 13:29:19 +02:00
Robert Schulze
05f08357a9
Merge pull request #37764 from ClickHouse/like_with_trailing_backslash
Disallow LIKE patterns with trailing escape
2022-06-03 13:19:51 +02:00
Alexey Milovidov
1529d47207
Merge pull request #34754 from ClickHouse/llvm-14
Switch to clang/llvm 14
2022-06-03 14:07:34 +03:00
Alexey Milovidov
de16784832
Merge pull request #37633 from ClickHouse/dump-column-structure-more-precise
More precise result of the `dumpColumnStructure` and `byteSize` miscellaneous functions
2022-06-03 14:05:20 +03:00
Alexey Milovidov
ea89f81a78 Merge branch 'master' of github.com:ClickHouse/ClickHouse into llvm-14 2022-06-03 03:07:14 +02:00
Robert Schulze
657662d89f
Minor follow-up to cache table: std::{vector-->array} 2022-06-02 20:18:10 +02:00
Maksim Kita
20b55a45b2 Hierarchical dictionaries support nullable parent key 2022-06-02 19:24:23 +02:00
HeenaBansal2009
e3080f2a97 Merge remote-tracking branch 'origin' into Fix-all-CheckTriviallyCopyableMove-Errors 2022-06-02 07:30:08 -07:00
Alexander Gololobov
b34782dc6a
Merge pull request #37775 from liuneng1994/fix_date32_to_string
fix toString error on DatatypeDate32
2022-06-02 16:40:47 +03:00
Vladimir C
670c721ded
Merge pull request #37742 from ucasfl/hashid 2022-06-02 12:47:11 +02:00
Robert Schulze
4e18659bfd
Fix tests + more precise exception msg 2022-06-02 11:11:56 +02:00
liuneng1994
7b15055e72 fix toString error on DatatypeDate32
Signed-off-by: liuneng1994 <1398775315@qq.com>
2022-06-02 16:56:43 +08:00
Alexey Milovidov
b5f48a7d3f Merge branch 'master' of github.com:ClickHouse/ClickHouse into llvm-14 2022-06-01 22:09:58 +02:00
Robert Schulze
366f368d06
Disallow LIKE patterns with trailing escape
Trailing escape ('ab\') is disallowed in SQL, in standardese:

  "If an escape character is specified, then [...] If there is not a
  partitioning of the string PVC into substrings such that each substring
  has length 1 (one) or 2, no substring of length 1 (one) is the escape
  character ECV, and each substring of length 2 is the escape character
  ECV followed by either the escape character ECV, an <underscore>
  character, or the <percent> character, then an exception condition is
  raised: data exception - invalid escape sequence."

I first thought this is checked already higher up in the stack, at least
for const needles, as single trailing backslashes ('ab\') are rejected,
but then I realized that ClickHouse quotes by default. I.e., double
trailing backslashes ('ab\\') are not rejected but when interpreted as
LIKE needle ('ab\') they should be.
2022-06-01 21:38:46 +02:00
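
The partitioning rule quoted in 366f368d06 can be written as a short validity check. This is a sketch of the rule as quoted from the standard, not the ClickHouse code: `isValidLikePattern` is a hypothetical name, and the commit itself only mentions rejecting the trailing escape, so the stricter "escape must be followed by escape, `_` or `%`" part is an assumption taken from the quoted text.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `pattern` can be split into single non-escape characters and
// two-character sequences of the form: escape + (escape | '_' | '%').
bool isValidLikePattern(std::string_view pattern, char escape = '\\')
{
    for (std::size_t i = 0; i < pattern.size(); ++i)
    {
        if (pattern[i] != escape)
            continue;
        if (i + 1 == pattern.size())
            return false; // trailing escape, e.g. the needle ab\ (one backslash)
        const char next = pattern[i + 1];
        if (next != escape && next != '_' && next != '%')
            return false; // escape followed by an ordinary character
        ++i; // skip the escaped character
    }
    return true;
}
```

Note that the C++ literal `"ab\\"` is the three-character needle `ab\` and is rejected, while `"ab\\\\"` is the needle `ab\\` (an escaped backslash) and is accepted.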
Robert Schulze
b3b0716b32
Merge pull request #37544 from ClickHouse/cached_patterns
Cache compiled regexps when evaluating non-const needles
2022-06-01 19:55:25 +02:00
avogar
966b864986 Fix possible logical error with type Nothing and JSON functions 2022-06-01 16:34:31 +00:00
flynn
b62e4cec65 Fix crash of FunctionHashID 2022-06-01 12:39:16 +00:00
Alexander Tokmakov
75f49a48e1 Merge branch 'master' into fix_trash 2022-06-01 14:20:46 +02:00
Robert Schulze
600512cc08
Replace exceptions thrown for programming errors by asserts 2022-06-01 11:53:37 +02:00
Anton Popov
20e319d67a
Merge pull request #37666 from CurtizJ/optimize-coalesce
Optimize function `COALESCE` with two arguments
2022-05-31 23:48:13 +02:00
Yakov Olkhovskiy
873ac9f8ff
Merge pull request #37540 from ClickHouse/feature-server-certificate
showCertificate function implementation
2022-05-31 02:50:03 -04:00
Anton Popov
30f8eb800a optimize function coalesce with two arguments 2022-05-30 22:29:35 +00:00
Nikolai Kochetov
77b07dd0a8
Merge pull request #37163 from ClickHouse/grouping-function
Add GROUPING function
2022-05-30 20:45:04 +02:00
HeenaBansal2009
b7eb6bbd38 Fixed clang-tidy-CheckTriviallyCopyableMove-errors 2022-05-30 11:09:03 -07:00
Robert Schulze
ad12adc31c
Measure and rework internal re2 caching
This commit is based on local benchmarks of ClickHouse's re2 caching.

Question 1: -----------------------------------------------------------
Is pattern caching useful for queries with const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, '%HelloWorld') FROM T;

The short answer is: no. Runtime is (unsurprisingly) dominated by
pattern evaluation + other stuff going on in queries, but definitely not
pattern compilation. For space reasons, I omit details of the local
experiments.

(Side note: the current caching scheme is unbounded in size which poses
a DoS risk (think of multi-tenancy). This risk is more pronounced when
unbounded caching is used with non-const patterns ..., see next
question)

Question 2: -----------------------------------------------------------
Is pattern caching useful for queries with non-const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, col_needle) FROM T;

I benchmarked five caching strategies:
1. no caching as a baseline (= recompile for each row)
2. unbounded cache (= threadsafe global hash-map)
3. LRU cache (= threadsafe global hash-map + LRU queue)
4. lightweight local cache 1 (= not threadsafe local hashmap with
   collision list which grows to a certain size (here: 10 elements) and
   afterwards never changes)
5. lightweight local cache 2 (not threadsafe local hashmap without
   collision list in which a collision replaces the stored element, idea
   by Alexey)

... using a haystack of 2 mio strings and
A) 2 mio distinct simple patterns
B) 10 simple patterns
C) 2 mio distinct complex patterns
D) 10 complex patterns

For A) and C), caching does not help but these queries still allow
judging the static overhead of caching on query runtimes.

B) and D) are extreme but common cases in practice. They include
queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' :
'%pattern2%')". Caching should help significantly.

Because LIKE patterns are internally translated to re2 expressions, I
show only measurements for MATCH queries.

Results in sec, averaged over multiple measurements:

1.A): 2.12
  B): 1.68
  C): 9.75
  D): 9.45

2.A): 2.17
  B): 1.73
  C): 9.78
  D): 9.47

3.A): 9.8
  B): 0.63
  C): 31.8
  D): 0.98

4.A): 2.14
  B): 0.29
  C): 9.82
  D): 0.41

5.A): 2.12 / 2.15 / 2.26
  B): 1.51 / 0.43 / 0.30
  C): 9.97 / 9.88 / 10.13
  D): 5.70 / 0.42 / 0.43
(10/100/1000 buckets, resp. 10/1/0.1% collision rate)

Evaluation:

1. This is the baseline. I was surprised that complex patterns (C, D)
   slow down the queries so badly compared to simple patterns (A, B).
   The runtime includes evaluation costs, but as caching only helps with
   compilation, and looking at 4.D and 5.D, compilation makes up over 90%
   of the runtime!

2. No speedup compared to 1, probably due to locking overhead. The cache
   is unbounded, and in experiments with data sets > 2 mio rows, 2. is
   the only scheme to throw OOM exceptions which is not acceptable.

3. Unique patterns (A and C) lead to thrashing of the LRU cache and very
   bad runtimes due to LRU queue maintenance and locking. Works pretty
   well however with few distinct patterns (B and D).

4. This scheme is tailored to queries B and D where it performs pretty
   well. More importantly, the caching is lightweight enough to not
   deteriorate performance on datasets A and C.

5. After some tuning of the hash map size, 100 buckets seem optimal to
   be in the same ballpark with 10 distinct patterns as 4. Performance
   also does not deteriorate on A and C compared to the baseline.
   Unlike 4., this scheme behaves LRU-like and can adjust to changing
   pattern distributions.

As a conclusion, this commit implements two things:

1. Based on Q1, pattern search with const needle no longer uses
   caching. This applies to LIKE and MATCH + a few (exotic) other SQL
   functions. The code for the unbounded caching was removed.

2. Based on Q2, pattern search with non-const needles now uses method 5.
2022-05-30 20:00:35 +02:00
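
Caching strategy 5 from ad12adc31c (local, not thread-safe, fixed number of buckets, a collision replaces the stored element) can be sketched as below. This is an illustration under stated assumptions, not the ClickHouse code: `LocalPatternCacheSketch` is a hypothetical name, `std::regex` stands in for re2, and the default of 100 buckets is taken from the measurements in the commit message.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <memory>
#include <regex>
#include <string>
#include <utility>

// Not thread-safe local cache without collision list: a pattern that hashes to
// an occupied bucket simply replaces the entry stored there.
template <std::size_t num_buckets = 100>
class LocalPatternCacheSketch
{
public:
    std::shared_ptr<const std::regex> getOrCompile(const std::string & pattern)
    {
        Bucket & bucket = buckets[std::hash<std::string>{}(pattern) % num_buckets];
        if (!bucket.regex || bucket.pattern != pattern)
        {
            // Compile first so that a throwing constructor leaves the bucket untouched.
            auto compiled = std::make_shared<const std::regex>(pattern);
            bucket.pattern = pattern;
            bucket.regex = std::move(compiled);
            ++compilations;
        }
        return bucket.regex;
    }

    // Number of compilations so far, only used to observe cache hits and misses.
    std::size_t compilationCount() const { return compilations; }

private:
    struct Bucket
    {
        std::string pattern;
        std::shared_ptr<const std::regex> regex;
    };

    std::array<Bucket, num_buckets> buckets;
    std::size_t compilations = 0;
};
```

Because a colliding pattern overwrites the previous one, the cache needs no eviction bookkeeping and still adapts when the set of frequent patterns changes, which is the LRU-like behaviour the evaluation mentions.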
Alexey Milovidov
f1fb57c6ce Fix clang-tidy-14 2022-05-30 05:36:26 +02:00
Alexey Milovidov
c0e6ff4216 More precise result of "dumpColumnStructure" and "byteSize" miscellaneous functions 2022-05-30 04:56:54 +02:00
Alexey Milovidov
c1169019d2 Merge branch 'master' into llvm-14 2022-05-29 02:29:02 +02:00
Alexey Milovidov
73e2e63414
Merge pull request #37612 from ClickHouse/clang-tidy-14
Fix clang-tidy-14, part 1
2022-05-29 03:16:32 +03:00
Alexander Tokmakov
4e52f45695 Merge branch 'master' into fix_trash 2022-05-28 19:43:19 +02:00
Alexey Milovidov
c50791dd3b Fix clang-tidy-14, part 1 2022-05-27 22:52:14 +02:00
Alexey Milovidov
d2c6fd90cb Fix clang-tidy-14, part 1 2022-05-27 22:51:37 +02:00
Alexander Gololobov
9b1b30855c Fixed check for HUGE_VAL 2022-05-27 18:25:11 +02:00
Alexander Gololobov
6361c5f38c Fix for failed style check 2022-05-27 18:22:16 +02:00
Alexander Gololobov
540353566c Added LpNorm and LpDistance functions for arrays 2022-05-27 17:17:08 +02:00
Robert Schulze
80061aa3e2
Merge remote-tracking branch 'origin/master' into cached_patterns 2022-05-27 09:21:01 +02:00
Alexey Milovidov
86afa3a245
Merge pull request #37502 from ClickHouse/array_norm_dist_fixes
Renamed arrayXXNorm/arrayXXDistance functions to XXNorm/XXDistance and fixed some overflow cases
2022-05-27 00:56:29 +03:00
mergify[bot]
a7629f900f
Merge branch 'master' into normalize-utf8-performance-tests-fix 2022-05-26 10:29:55 +00:00
Maksim Kita
3a92e61827
Merge pull request #37148 from kitaisreal/dictionary-get-descendants-performance-improvement
Dictionary getDescendants performance improvement
2022-05-26 12:29:17 +02:00
Yakov Olkhovskiy
2dc160a4c3 style fix 2022-05-25 20:56:36 -04:00
Dmitry Novik
7cd7782e4f Process columns more efficiently in GROUPING() 2022-05-25 21:55:41 +00:00
Dmitry Novik
3c1b6609ae Add comments and make tests more verbose 2022-05-25 21:23:35 +00:00
Maksim Kita
58cd1bd3ec
Merge pull request #36843 from bharatnc/ncb/h3-unidirectionaledges-funcs
add h3 unidirectional edge functions
2022-05-25 22:46:40 +02:00
Maksim Kita
bee3c30f66
Merge pull request #37524 from kitaisreal/geo-distance-functions-improve-performance
Geo distance functions improve performance
2022-05-25 22:40:40 +02:00
Alexander Gololobov
168b47d0ad Use same norm and distance function names for tuples and arrays 2022-05-25 22:39:59 +02:00
Alexander Gololobov
b065839f44 always return Float64 2022-05-25 22:27:00 +02:00
Alexander Gololobov
5df14cd956 Cast arguments to result type to avoid int overflow 2022-05-25 22:27:00 +02:00
Robert Schulze
49934a3dc8
Cache compiled regexps when evaluating non-const needles
Needles in a (non-const) needle column may repeat and this commit allows
skipping compilation for known needles. Out of the different design
alternatives (see below, if someone is interested), we now maintain
- one global pattern cache,
- with a fixed size of 42k elements currently,
- and use LRU as eviction strategy.

------------------------------------------------------------------------

(sorry for the wall of text, dumping it here not for reading but just
for reference)

Write-up about considered design alternatives:

1. Keep the current global cache of const needles. For non-const
   needles, probe the cache but don't store values in it.
   Pros: need to maintain just a single cache, no problem with cache
         pollution assuming there are few distinct constant needles
   Cons: only useful if a non-const needle already occurred as a
         const needle
   --> overall too simplistic

2. Keep the current global cache for const needles. For non-const
   needles, create a local (e.g. per-query) cache
   Pros: unlike (1.), non-const needles can be skipped even if they
         did not occur yet, no pollution of the const pattern cache when
         there are very many non-const needles (e.g. large / highly
         distinct needle columns).
   Cons: caches may explode "horizontally", i.e. we'll end up with the
         const cache + caches for Q1, Q2, ... QN, this makes it harder
         to control the overall space consumption, also patterns
         residing in different caches cannot be reused between queries,
         another difficulty is that the concept of "query" does not
         really exist at matching level - there are only column chunks
         and we'd potentially end up with 1 cache / chunk

3. Queries with const and non-const needles insert into the same global
   cache.
   Pros: the advantages of (2.) + allows to reuse compiled patterns
         across parallel queries
   Cons: needs an eviction strategy to control cache size and pollution
         (and btw. (2.) also needs eviction strategies for the
         individual caches)

4. Queries with const needle use global cache, queries with non-const
   needle use a different global cache
   --> Overall similar to (3) but ignores the (likely) edge case that
       const and non-const needles overlap.

In sum, (3.) seems the simplest and most beneficial approach.

Eviction strategies:

0. Don't ever evict --> cache may grow infinitely and eventually make
   the system unusable (may even pose a DoS risk)

1. Flush the cache after a certain threshold is exceeded --> very
   simple but may lead to periodic performance drops

2. Use LRU --> more graceful performance degradation at threshold but
   comes with a (constant) performance overhead to maintain the LRU
   queue

In sum, given that the pattern compilation in RE2 should be quite costly
(pattern-to-DFA/NFA), LRU may be acceptable.
2022-05-25 22:04:06 +02:00
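
The design chosen in 49934a3dc8 (one shared cache, fixed capacity, LRU eviction) can be sketched as follows. This is a simplified illustration, not the ClickHouse code: `LruPatternCacheSketch` and its methods are hypothetical names, `std::regex` stands in for re2, and the pattern is compiled while holding the lock, which a real implementation may want to avoid.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <unordered_map>
#include <utility>

class LruPatternCacheSketch
{
public:
    explicit LruPatternCacheSketch(std::size_t max_size_) : max_size(max_size_) {}

    std::shared_ptr<const std::regex> getOrCompile(const std::string & pattern)
    {
        std::lock_guard lock(mutex);
        if (auto it = index.find(pattern); it != index.end())
        {
            // Hit: move the entry to the front of the queue (most recently used).
            queue.splice(queue.begin(), queue, it->second);
            return it->second->second;
        }
        // Miss: compile, insert at the front, evict the least recently used entry if needed.
        auto compiled = std::make_shared<const std::regex>(pattern);
        queue.emplace_front(pattern, compiled);
        index[pattern] = queue.begin();
        if (queue.size() > max_size)
        {
            index.erase(queue.back().first);
            queue.pop_back();
        }
        return compiled;
    }

    bool contains(const std::string & pattern) const
    {
        std::lock_guard lock(mutex);
        return index.count(pattern) != 0;
    }

    std::size_t size() const
    {
        std::lock_guard lock(mutex);
        return queue.size();
    }

private:
    using Entry = std::pair<std::string, std::shared_ptr<const std::regex>>;

    const std::size_t max_size;
    mutable std::mutex mutex;
    std::list<Entry> queue; // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index;
};
```

The queue maintenance and the lock on every lookup are the constant overhead named under eviction strategy 2.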
Robert Schulze
ea60a614d2
Decrease namespace indent 2022-05-25 21:56:35 +02:00
Alexey Milovidov
abf2558fba
Merge pull request #37491 from ClickHouse/match_refactoring
Refactorings of LIKE/MATCH code
2022-05-25 22:05:38 +03:00
Alexey Milovidov
4482da9eb6
Update greatCircleDistance.cpp 2022-05-25 21:59:31 +03:00
Alexander Tokmakov
779e6ea0b9 make it better, fix on cluster queries 2022-05-25 20:17:49 +02:00