Commit Graph

3611 Commits

Author SHA1 Message Date
Igor Nikonov
bf7dd39282 Fix: decimal rounding
Fixes #37531
2022-06-14 18:03:05 +00:00
Maksim Kita
dc2e117cce UnaryLogicalFunctions improve performance using dynamic dispatch 2022-06-14 17:30:11 +02:00
Robert Schulze
5f5732a2c4
Merge pull request #37969 from ClickHouse/consistent-macro-usage
More consistent use of platform macros
2022-06-10 14:10:01 +02:00
Robert Schulze
1a0b5f33b3
More consistent use of platform macros
cmake/target.cmake defines macros for the supported platforms, this
commit changes predefined system macros to our own macros.

__linux__ --> OS_LINUX
__APPLE__ --> OS_DARWIN
__FreeBSD__ --> OS_FREEBSD
2022-06-10 10:22:31 +02:00
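A rough illustration of the macro mapping described in 1a0b5f33b3. The commit states that cmake/target.cmake defines the OS_* macros. The preprocessor block at the top of this sketch is only a stand-in for that build-system step so the snippet compiles on its own, and `platformName` is a hypothetical helper, not a ClickHouse function.

```cpp
#include <string_view>

// Stand-in for the build system (assumption): outside the real build,
// derive the project macros from the compiler-predefined ones.
#if defined(__linux__) && !defined(OS_LINUX)
#    define OS_LINUX
#elif defined(__APPLE__) && !defined(OS_DARWIN)
#    define OS_DARWIN
#elif defined(__FreeBSD__) && !defined(OS_FREEBSD)
#    define OS_FREEBSD
#endif

// Application code then tests only the project-level macros,
// never __linux__ / __APPLE__ / __FreeBSD__ directly.
constexpr std::string_view platformName()
{
#if defined(OS_LINUX)
    return "linux";
#elif defined(OS_DARWIN)
    return "darwin";
#elif defined(OS_FREEBSD)
    return "freebsd";
#else
    return "unknown";
#endif
}
```

Keeping the platform test in one set of macros means a new platform or a renamed system macro only has to be handled in one place.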
Maksim Kita
0c1211eb61
Merge pull request #37930 from kitaisreal/function-dict-get-check-arguments-size
Function dictGet check arguments size
2022-06-08 23:25:14 +02:00
Maksim Kita
b7152fa2bf Function dictGet check arguments size 2022-06-08 17:19:30 +02:00
Maksim Kita
7d1a43cfeb Fix setting cast_ipv4_ipv6_default_on_conversion_error for internal cast 2022-06-08 12:43:39 +02:00
Maksim Kita
4e160105b9
Merge pull request #37805 from kitaisreal/dictionaries-hierarchy-nullable-key-support
Hierarchical dictionaries support nullable parent key
2022-06-08 12:36:09 +02:00
Anton Popov
df6882d2b9
Revert "Fix errors of CheckTriviallyCopyableMove type" 2022-06-07 13:53:10 +02:00
mergify[bot]
014d9e2144
Merge branch 'master' into fix-nothing-error 2022-06-07 11:24:28 +00:00
avogar
cbd50aecd4 Fix 2022-06-07 11:23:59 +00:00
Vitaly Baranov
d199478169
Merge pull request #37303 from ClickHouse/fix_trash
Try to fix some trash
2022-06-07 10:17:39 +02:00
Robert Schulze
2d87af2a15
Merge pull request #37647 from DevTeamBK/Fix-all-CheckTriviallyCopyableMove-Errors
Fix errors of CheckTriviallyCopyableMove type
2022-06-05 19:58:47 +02:00
Maksim Kita
6db5c08fde Functions dictGetChildren, dictGetDescendants added support for nullable parent key 2022-06-03 17:36:16 +02:00
Maksim Kita
a0cbbd9edc Hierarchical Cache, Direct dictionaries added support for nullable parent key 2022-06-03 17:21:55 +02:00
Anton Popov
f592a802a1
Merge pull request #37482 from CurtizJ/json-new-serialization
Better binary serialization of `ColumnObject`
2022-06-03 13:29:19 +02:00
Robert Schulze
05f08357a9
Merge pull request #37764 from ClickHouse/like_with_trailing_backslash
Disallow LIKE patterns with trailing escape
2022-06-03 13:19:51 +02:00
Alexey Milovidov
1529d47207
Merge pull request #34754 from ClickHouse/llvm-14
Switch to clang/llvm 14
2022-06-03 14:07:34 +03:00
Alexey Milovidov
de16784832
Merge pull request #37633 from ClickHouse/dump-column-structure-more-precise
More precise result of the `dumpColumnStructure` and `byteSize` miscellaneous functions
2022-06-03 14:05:20 +03:00
Alexey Milovidov
ea89f81a78 Merge branch 'master' of github.com:ClickHouse/ClickHouse into llvm-14 2022-06-03 03:07:14 +02:00
Robert Schulze
657662d89f
Minor follow-up to cache table: std::{vector-->array} 2022-06-02 20:18:10 +02:00
Maksim Kita
20b55a45b2 Hierarchical dictionaries support nullable parent key 2022-06-02 19:24:23 +02:00
HeenaBansal2009
e3080f2a97 Merge remote-tracking branch 'origin' into Fix-all-CheckTriviallyCopyableMove-Errors 2022-06-02 07:30:08 -07:00
Alexander Gololobov
b34782dc6a
Merge pull request #37775 from liuneng1994/fix_date32_to_string
fix toString error on DatatypeDate32
2022-06-02 16:40:47 +03:00
Vladimir C
670c721ded
Merge pull request #37742 from ucasfl/hashid 2022-06-02 12:47:11 +02:00
Robert Schulze
4e18659bfd
Fix tests + more precise exception msg 2022-06-02 11:11:56 +02:00
liuneng1994
7b15055e72 fix toString error on DatatypeDate32
Signed-off-by: liuneng1994 <1398775315@qq.com>
2022-06-02 16:56:43 +08:00
Alexey Milovidov
b5f48a7d3f Merge branch 'master' of github.com:ClickHouse/ClickHouse into llvm-14 2022-06-01 22:09:58 +02:00
Robert Schulze
366f368d06
Disallow LIKE patterns with trailing escape
Trailing escape ('ab\') is disallowed in SQL, in standardese:

  "If an escape character is specified, then [...] If there is not a
  partitioning of the string PVC into substrings such that each substring
  has length 1 (one) or 2, no substring of length 1 (one) is the escape
  character ECV, and each substring of length 2 is the escape character
  ECV followed by either the escape character ECV, an <underscore>
  character, or the <percent> character, then an exception condition is
  raised: data exception - invalid escape sequence."

I first thought this is checked already higher up in the stack, at least
for const needles, as single trailing backslashes ('ab\') are rejected,
but then I realized that ClickHouse quotes by default. I.e., double
trailing backslashes ('ab\\') are not rejected, but when interpreted as
LIKE needle ('ab\') they should be.
2022-06-01 21:38:46 +02:00
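The trailing-escape rule from 366f368d06 can be sketched as a small standalone check. This is only an illustration: `hasValidLikeEscapes` is a hypothetical name, not a function from the ClickHouse code base, and it rejects only a dangling escape character at the end of the needle, not every case covered by the rule quoted from the standard.

```cpp
#include <cstddef>
#include <string_view>

// Returns false if the LIKE needle ends with an escape character that has
// nothing left to escape (e.g. the needle  ab\ ), true otherwise.
constexpr bool hasValidLikeEscapes(std::string_view pattern, char escape = '\\')
{
    for (std::size_t i = 0; i < pattern.size(); ++i)
    {
        if (pattern[i] != escape)
            continue;
        if (i + 1 == pattern.size())
            return false; // trailing escape: invalid escape sequence
        ++i; // skip the escaped character
    }
    return true;
}
```

Note that the C++ literal `"ab\\"` is the three-character needle `ab\`, which mirrors the quoting issue described in the commit message.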
Robert Schulze
b3b0716b32
Merge pull request #37544 from ClickHouse/cached_patterns
Cache compiled regexps when evaluating non-const needles
2022-06-01 19:55:25 +02:00
avogar
966b864986 Fix possible logical error with type Nothing and JSON functions 2022-06-01 16:34:31 +00:00
flynn
b62e4cec65 Fix crash of FunctionHashID 2022-06-01 12:39:16 +00:00
Alexander Tokmakov
75f49a48e1 Merge branch 'master' into fix_trash 2022-06-01 14:20:46 +02:00
Robert Schulze
600512cc08
Replace exceptions thrown for programming errors by asserts 2022-06-01 11:53:37 +02:00
Anton Popov
20e319d67a
Merge pull request #37666 from CurtizJ/optimize-coalesce
Optimize function `COALESCE` with two arguments
2022-05-31 23:48:13 +02:00
Yakov Olkhovskiy
873ac9f8ff
Merge pull request #37540 from ClickHouse/feature-server-certificate
showCertificate function implementation
2022-05-31 02:50:03 -04:00
Anton Popov
30f8eb800a optimize function coalesce with two arguments 2022-05-30 22:29:35 +00:00
Nikolai Kochetov
77b07dd0a8
Merge pull request #37163 from ClickHouse/grouping-function
Add GROUPING function
2022-05-30 20:45:04 +02:00
HeenaBansal2009
b7eb6bbd38 Fixed clang-tidy-CheckTriviallyCopyableMove-errors 2022-05-30 11:09:03 -07:00
Robert Schulze
ad12adc31c
Measure and rework internal re2 caching
This commit is based on local benchmarks of ClickHouse's re2 caching.

Question 1: -----------------------------------------------------------
Is pattern caching useful for queries with const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, '%HelloWorld') FROM T;

The short answer is: no. Runtime is (unsurprisingly) dominated by
pattern evaluation + other stuff going on in queries, but definitely not
pattern compilation. For space reasons, I omit details of the local
experiments.

(Side note: the current caching scheme is unbounded in size which poses
a DoS risk (think of multi-tenancy). This risk is more pronounced when
unbounded caching is used with non-const patterns ..., see next
question)

Question 2: -----------------------------------------------------------
Is pattern caching useful for queries with non-const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, col_needle) FROM T;

I benchmarked five caching strategies:
1. no caching as a baseline (= recompile for each row)
2. unbounded cache (= threadsafe global hash-map)
3. LRU cache (= threadsafe global hash-map + LRU queue)
4. lightweight local cache 1 (= not threadsafe local hashmap with
   collision list which grows to a certain size (here: 10 elements) and
   afterwards never changes)
5. lightweight local cache 2 (not threadsafe local hashmap without
   collision list in which a collision replaces the stored element, idea
   by Alexey)

... using a haystack of 2 mio strings and
A) 2 mio distinct simple patterns
B) 10 simple patterns
C) 2 mio distinct complex patterns
D) 10 complex patterns

For A) and C), caching does not help, but these queries still allow us to
judge the static overhead of caching on query runtimes.

B) and D) are extreme but common cases in practice. They include
queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' :
'%pattern2%')". Caching should help significantly.

Because LIKE patterns are internally translated to re2 expressions, I
show only measurements for MATCH queries.

Results in sec, averaged over multiple measurements:

1.A): 2.12
  B): 1.68
  C): 9.75
  D): 9.45

2.A): 2.17
  B): 1.73
  C): 9.78
  D): 9.47

3.A): 9.8
  B): 0.63
  C): 31.8
  D): 0.98

4.A): 2.14
  B): 0.29
  C): 9.82
  D): 0.41

5.A): 2.12 / 2.15 / 2.26
  B): 1.51 / 0.43 / 0.30
  C): 9.97 / 9.88 / 10.13
  D): 5.70 / 0.42 / 0.43
(10/100/1000 buckets, resp. 10/1/0.1% collision rate)

Evaluation:

1. This is the baseline. I was surprised that complex patterns (C, D)
   slow down the queries so badly compared to simple patterns (A, B).
   The runtime includes evaluation costs, but as caching only helps with
   compilation, and looking at 4.D and 5.D, compilation makes up over 90%
   of the runtime!

2. No speedup compared to 1, probably due to locking overhead. The cache
   is unbounded, and in experiments with data sets > 2 mio rows, 2. is
   the only scheme to throw OOM exceptions which is not acceptable.

3. Unique patterns (A and C) lead to thrashing of the LRU cache and very
   bad runtimes due to LRU queue maintenance and locking. Works pretty
   well however with few distinct patterns (B and D).

4. This scheme is tailored to queries B and D where it performs pretty
   well. More importantly, the caching is lightweight enough to not
   deteriorate performance on datasets A and C.

5. After some tuning of the hash map size, 100 buckets seem optimal to
   be in the same ballpark with 10 distinct patterns as 4. Performance
   also does not deteriorate on A and C compared to the baseline.
   Unlike 4., this scheme behaves LRU-like and can adjust to changing
   pattern distributions.

As a conclusion, this commit implements two things:

1. Based on Q1, pattern search with const needle no longer uses
   caching. This applies to LIKE and MATCH + a few (exotic) other SQL
   functions. The code for the unbounded caching was removed.

2. Based on Q2, pattern search with non-const needles now uses method 5.
2022-05-30 20:00:35 +02:00
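Caching strategy 5 from ad12adc31c (a local, non-threadsafe hash map without collision list, where a collision replaces the stored element) might look roughly like the sketch below. Assumptions: `LocalPatternCache` and its members are hypothetical names, `std::regex` stands in for re2 so the snippet has no external dependency, and the default of 100 buckets is taken from the evaluation above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <regex>
#include <string>
#include <utility>

// Not thread-safe by design: intended to live locally, one per evaluation.
template <std::size_t NumBuckets = 100>
class LocalPatternCache
{
    static_assert(NumBuckets > 0, "cache needs at least one bucket");

public:
    const std::regex & getOrCompile(const std::string & pattern)
    {
        Bucket & bucket = buckets[std::hash<std::string>{}(pattern) % NumBuckets];
        if (!bucket.compiled || bucket.pattern != pattern)
        {
            // Miss or collision: compile first, then overwrite the bucket.
            auto compiled = std::make_unique<std::regex>(pattern);
            bucket.pattern = pattern;
            bucket.compiled = std::move(compiled);
            ++compilations;
        }
        assert(bucket.compiled != nullptr);
        return *bucket.compiled;
    }

    std::size_t compilations = 0; // exposed only to make the effect observable

private:
    struct Bucket
    {
        std::string pattern;
        std::unique_ptr<std::regex> compiled;
    };
    std::array<Bucket, NumBuckets> buckets;
};
```

Because a colliding pattern simply overwrites its bucket, no eviction queue is needed, which is consistent with the low static overhead reported for A) and C).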
Alexey Milovidov
f1fb57c6ce Fix clang-tidy-14 2022-05-30 05:36:26 +02:00
Alexey Milovidov
c0e6ff4216 More precise result of "dumpColumnStructure" and "byteSize" miscellaneous functions 2022-05-30 04:56:54 +02:00
Alexey Milovidov
c1169019d2 Merge branch 'master' into llvm-14 2022-05-29 02:29:02 +02:00
Alexey Milovidov
73e2e63414
Merge pull request #37612 from ClickHouse/clang-tidy-14
Fix clang-tidy-14, part 1
2022-05-29 03:16:32 +03:00
Alexander Tokmakov
4e52f45695 Merge branch 'master' into fix_trash 2022-05-28 19:43:19 +02:00
Alexey Milovidov
c50791dd3b Fix clang-tidy-14, part 1 2022-05-27 22:52:14 +02:00
Alexey Milovidov
d2c6fd90cb Fix clang-tidy-14, part 1 2022-05-27 22:51:37 +02:00
Alexander Gololobov
9b1b30855c Fixed check for HUGE_VAL 2022-05-27 18:25:11 +02:00
Alexander Gololobov
6361c5f38c Fix for failed style check 2022-05-27 18:22:16 +02:00
Alexander Gololobov
540353566c Added LpNorm and LpDistance functions for arrays 2022-05-27 17:17:08 +02:00
Robert Schulze
80061aa3e2
Merge remote-tracking branch 'origin/master' into cached_patterns 2022-05-27 09:21:01 +02:00
Alexey Milovidov
86afa3a245
Merge pull request #37502 from ClickHouse/array_norm_dist_fixes
Renamed arrayXXNorm/arrayXXDistance functions to XXNorm/XXDistance and fixed some overflow cases
2022-05-27 00:56:29 +03:00
mergify[bot]
a7629f900f
Merge branch 'master' into normalize-utf8-performance-tests-fix 2022-05-26 10:29:55 +00:00
Maksim Kita
3a92e61827
Merge pull request #37148 from kitaisreal/dictionary-get-descendants-performance-improvement
Dictionary getDescendants performance improvement
2022-05-26 12:29:17 +02:00
Yakov Olkhovskiy
2dc160a4c3 style fix 2022-05-25 20:56:36 -04:00
Dmitry Novik
7cd7782e4f Process columns more efficiently in GROUPING() 2022-05-25 21:55:41 +00:00
Dmitry Novik
3c1b6609ae Add comments and make tests more verbose 2022-05-25 21:23:35 +00:00
Maksim Kita
58cd1bd3ec
Merge pull request #36843 from bharatnc/ncb/h3-unidirectionaledges-funcs
add h3 unidirectional edge functions
2022-05-25 22:46:40 +02:00
Maksim Kita
bee3c30f66
Merge pull request #37524 from kitaisreal/geo-distance-functions-improve-performance
Geo distance functions improve performance
2022-05-25 22:40:40 +02:00
Alexander Gololobov
168b47d0ad Use same norm and distance function names for tuples and arrays 2022-05-25 22:39:59 +02:00
Alexander Gololobov
b065839f44 always return Float64 2022-05-25 22:27:00 +02:00
Alexander Gololobov
5df14cd956 Cast arguments to result type to avoid int overflow 2022-05-25 22:27:00 +02:00
Robert Schulze
49934a3dc8
Cache compiled regexps when evaluating non-const needles
Needles in a (non-const) needle column may repeat and this commit allows
to skip compilation for known needles. Out of the different design
alternatives (see below, if someone is interested), we now maintain
- one global pattern cache,
- with a fixed size of 42k elements currently,
- and use LRU as eviction strategy.

------------------------------------------------------------------------

(sorry for the wall of text, dumping it here not for reading but just
for reference)

Write-up about considered design alternatives:

1. Keep the current global cache of const needles. For non-const
   needles, probe the cache but don't store values in it.
   Pros: need to maintain just a single cache, no problem with cache
         pollution assuming there are few distinct constant needles
   Cons: only useful if a non-const needle already occurred as a
         const needle
   --> overall too simplistic

2. Keep the current global cache for const needles. For non-const
   needles, create a local (e.g. per-query) cache
   Pros: unlike (1.), non-const needles can be skipped even if they
         did not occur yet, no pollution of the const pattern cache when
         there are very many non-const needles (e.g. large / highly
         distinct needle columns).
   Cons: caches may explode "horizontally", i.e. we'll end up with the
         const cache + caches for Q1, Q2, ... QN, this makes it harder
         to control the overall space consumption, also patterns
         residing in different caches cannot be reused between queries,
         another difficulty is that the concept of "query" does not
         really exist at matching level - there are only column chunks
         and we'd potentially end up with 1 cache / chunk

3. Queries with const and non-const needles insert into the same global
   cache.
   Pros: the advantages of (2.) + allows to reuse compiled patterns
         across parallel queries
   Cons: needs an eviction strategy to control cache size and pollution
         (and btw. (2.) also needs eviction strategies for the
         individual caches)

4. Queries with const needle use global cache, queries with non-const
   needle use a different global cache
   --> Overall similar to (3) but ignores the (likely) edge case that
       const and non-const needles overlap.

In sum, (3.) seems the simplest and most beneficial approach.

Eviction strategies:

0. Don't ever evict --> cache may grow infinitely and eventually make
   the system unusable (may even pose a DoS risk)

1. Flush the cache after a certain threshold is exceeded --> very
   simple but may lead to periodic performance drops

2. Use LRU --> more graceful performance degradation at threshold but
   comes with a (constant) performance overhead to maintain the LRU
   queue

In sum, given that the pattern compilation in RE2 should be quite costly
(pattern-to-DFA/NFA), LRU may be acceptable.
2022-05-25 22:04:06 +02:00
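The design chosen in 49934a3dc8 (one shared pattern cache, fixed size, LRU eviction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: `LRUPatternCache` is a hypothetical name, `std::regex` stands in for re2, and for simplicity the pattern is compiled while the lock is held.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <unordered_map>
#include <utility>

class LRUPatternCache
{
public:
    explicit LRUPatternCache(std::size_t capacity_) : capacity(capacity_) { assert(capacity > 0); }

    std::shared_ptr<const std::regex> getOrCompile(const std::string & pattern)
    {
        std::lock_guard lock(mutex);
        if (auto it = index.find(pattern); it != index.end())
        {
            // Hit: move the entry to the front (most recently used).
            entries.splice(entries.begin(), entries, it->second);
            return it->second->second;
        }
        // Miss: compile, insert at the front, evict the least recently used.
        auto compiled = std::make_shared<const std::regex>(pattern);
        entries.emplace_front(pattern, compiled);
        index[pattern] = entries.begin();
        if (entries.size() > capacity)
        {
            index.erase(entries.back().first);
            entries.pop_back();
        }
        return compiled;
    }

    std::size_t size() const
    {
        std::lock_guard lock(mutex);
        return entries.size();
    }

private:
    using Entry = std::pair<std::string, std::shared_ptr<const std::regex>>;

    std::size_t capacity;
    mutable std::mutex mutex;
    std::list<Entry> entries; // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index;
};
```

Returning a `shared_ptr` keeps a compiled pattern alive for callers that still use it after it has been evicted. The list maintenance and the lock on every lookup are the constant overhead mentioned under eviction strategy 2.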
Robert Schulze
ea60a614d2
Decrease namespace indent 2022-05-25 21:56:35 +02:00
Alexey Milovidov
abf2558fba
Merge pull request #37491 from ClickHouse/match_refactoring
Refactorings of LIKE/MATCH code
2022-05-25 22:05:38 +03:00
Alexey Milovidov
4482da9eb6
Update greatCircleDistance.cpp 2022-05-25 21:59:31 +03:00
Alexander Tokmakov
779e6ea0b9 make it better, fix on cluster queries 2022-05-25 20:17:49 +02:00
Nikolai Kochetov
ff98c24d44
Merge pull request #37048 from Avogar/fix-array-map-nothing
Add default implementation for Nothing in functions
2022-05-25 19:10:40 +02:00
Yakov Olkhovskiy
6692b9c2ed showCertificate function implementation 2022-05-25 12:11:44 -04:00
Alexey Milovidov
cb92482ca5
Merge pull request #37484 from kitaisreal/function-has-all-avx2-dynamic-dispatch
Function hasAll added dynamic dispatch
2022-05-25 19:05:32 +03:00
Maksim Kita
28355114c0 Fixed tests 2022-05-25 16:19:29 +02:00
Maksim Kita
e67b3537f7 Functions normalizeUTF8 unstable performance tests fix 2022-05-25 15:54:52 +02:00
Maksim Kita
45da28ecae Improve performance of geo distance functions 2022-05-25 14:22:22 +02:00
Maksim Kita
c372c3d6aa Fix performance tests 2022-05-25 11:49:59 +02:00
Kseniia Sumarokova
b50d4549c9
Merge pull request #37356 from amosbird/partition-prune-for-s3
"Partition pruning" for s3
2022-05-25 11:03:07 +02:00
Robert Schulze
05e4fa7df1
Fix special case of trivial regexp
Previously, we would always set 1 in case of a trivial regex (which is
correct). If someone in future builds a negated operator, then this
will produce wrong results. Right now, negation of regexp (SQL: NOT
MATCH) is implemented at a higher level, so we are safe and this is more
a preventive fix.
2022-05-25 10:05:55 +02:00
Robert Schulze
01ab7b9bad
Pass strings in some places as string_view
The original goal was to change

  const auto & needle = String(
        reinterpret_cast<const char *>(cur_needle_data),
        cur_needle_length);

in Functions/MatchImpl.h into a std::string_view to save an allocation +
copy. The needle is eventually passed as search pattern into the re2
library. Re2 has an alternative constructor taking a const char * i.e. a
NULL-terminated string. Here, the needle is NULL-terminated but
1. this is only because it is passed inside a ColumnString yet this is
   not always the case (e.g. fixed string columns have a dense layout w/o
   NULL terminator).
2. assuming NULL termination for users != MatchImpl of the regex code is
   too dangerous.

So, for now we'll stay with copying to be on the safe side. One fine day
when re2 has a ptr/size ctor, we can use std::string_view.

Just changing a few other places from std::string to std::string_view
but this will not help with performance.
2022-05-25 10:05:51 +02:00
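The NULL-termination concern from 01ab7b9bad can be shown with a small standalone sketch. `patternLengthFromCString` is a hypothetical stand-in for any API that accepts only a `const char *`, and `safePatternLength` is likewise an illustrative name. The point is that a `std::string_view` over densely packed data carries no terminator, so the copy into a `std::string` is what makes the call safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical stand-in for a library entry point that reads until '\0'.
inline std::size_t patternLengthFromCString(const char * pattern)
{
    assert(pattern != nullptr);
    return std::strlen(pattern);
}

// Safe variant: pay for one allocation + copy to guarantee a terminator.
inline std::size_t safePatternLength(std::string_view needle)
{
    const std::string copy(needle);
    return patternLengthFromCString(copy.c_str());
}

// Unsafe variant, shown only as a comment: needle.data() is not guaranteed
// to be followed by '\0', so strlen could read past the end of the view.
//     patternLengthFromCString(needle.data());
```

With a dense buffer such as `const char fixed[] = {'a', 'b', 'c', 'd'};`, a view over the first two bytes yields 2 from `safePatternLength`, whereas the unsafe variant would have undefined behavior.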
Robert Schulze
040fbf3686
Tighter sanity checks in matching code 2022-05-25 10:05:06 +02:00
Robert Schulze
35bef17302
Introduce variables to hold the match result
--> nicer when debugging
2022-05-25 10:04:47 +02:00
Robert Schulze
b044d44fef
Refactoring: Make template instantiation easier to read
- introduced class MatchTraits with enums that replace bool template
  parameters

- (minor: made negation the last template parameter because negation
  executes last during evaluation)
2022-05-25 10:03:58 +02:00
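The readability gain described in b044d44fef can be illustrated with a short sketch. The commit message only says that a class MatchTraits with enums replaces bool template parameters and that negation comes last. The enum names and values below, and the `Matcher` template itself, are assumptions made for this example.

```cpp
// Enums instead of bools: each template argument documents itself.
struct MatchTraits
{
    enum class Syntax { Like, Re2 };
    enum class Case { Sensitive, Insensitive };
    enum class Result { DontNegate, Negate };
};

// Negation is the last parameter because it is applied last.
template <MatchTraits::Syntax syntax, MatchTraits::Case case_sensitivity, MatchTraits::Result result>
struct Matcher
{
    // Only the final negation step is modeled here; matching is omitted.
    static constexpr bool finalize(bool raw_match)
    {
        return result == MatchTraits::Result::Negate ? !raw_match : raw_match;
    }
};

// Compare with a bool-based instantiation such as Matcher<true, false, true>:
using NotILike = Matcher<MatchTraits::Syntax::Like, MatchTraits::Case::Insensitive, MatchTraits::Result::Negate>;
using Match = Matcher<MatchTraits::Syntax::Re2, MatchTraits::Case::Sensitive, MatchTraits::Result::DontNegate>;
```

With scoped enums, passing the arguments in the wrong order is a compile error rather than a silent behavior change.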
Bharat Nallan Chakravarthy
57cfc0bd04 check for validity of h3 index 2022-05-25 06:17:15 +05:30
Alexander Gololobov
2ff747785e
Merge pull request #37394 from ClickHouse/array_norm_dist_fixes
Do computations on the raw input data without copying to Eigen::Matrix
2022-05-24 20:59:04 +02:00
Robert Schulze
7348a0eb28
Merge pull request #37251 from ClickHouse/non_const_like
Support non-constant SQL functions (NOT) (I)LIKE and MATCH
2022-05-24 20:28:31 +02:00
Robert Schulze
028f15c4fa
Review comment: Throw LOGICAL_ERROR for different sizes of haystack / needles 2022-05-24 20:19:13 +02:00
Maksim Kita
3c0c322d7c
Merge pull request #37480 from kitaisreal/dynamic-dispatch-infrastructure-improvements
Dynamic dispatch infrastructure style fixes
2022-05-24 18:13:53 +02:00
Maksim Kita
6fb51e8bd3 Function hasAll added dynamic dispatch 2022-05-24 17:06:06 +02:00
Maksim Kita
86180614e7 Fixed tests 2022-05-24 15:33:03 +02:00
Anton Popov
e96af9fd75 better binary serialization of ColumnObject 2022-05-24 13:16:11 +00:00
Maksim Kita
e6e4b2826d Dynamic dispatch infrastructure style fixes 2022-05-24 14:25:29 +02:00
Amos Bird
c25ef92139
Fix tests 2022-05-24 18:57:55 +08:00
Amos Bird
093d315756
partition pruning for s3 2022-05-24 18:57:55 +08:00
Maksim Kita
712b000f2a
Merge pull request #37443 from kitaisreal/functions-normalize-utf8-fix
Functions normalize utf8 fix
2022-05-24 11:11:15 +02:00
Alexander Gololobov
7d0ed7e51a Remove eigen library 2022-05-24 10:24:50 +02:00
Alexander Gololobov
caad1435d5 Optimized the case when one of the arguments is Const 2022-05-24 10:24:50 +02:00
Alexander Gololobov
65fbda436a Do computations on the raw input data without copying to Eigen::Matrix 2022-05-24 10:24:50 +02:00
Bharat Nallan Chakravarthy
6e49b76cfd try suppress h3 asan errors 2022-05-24 10:22:46 +05:30
Maksim Kita
996241493f
Merge pull request #37447 from kitaisreal/binary-function-vectorized-remove-macro
BinaryFunctionVectorized remove macro
2022-05-23 16:50:12 +02:00
Maksim Kita
fe21b4ca9e Fixed style check 2022-05-23 14:41:07 +02:00
Maksim Kita
008de5c779
Merge pull request #37438 from kitaisreal/function-binary-representation-style-fixes
FunctionBinaryRepresentation style fixes
2022-05-23 13:54:15 +02:00
Maksim Kita
e550843d56 BinaryFunctionVectorized remove macro 2022-05-23 12:45:16 +02:00