Commit Graph

89681 Commits

Author SHA1 Message Date
Robert Schulze
ad12adc31c
Measure and rework internal re2 caching
This commit is based on local benchmarks of ClickHouse's re2 caching.

Question 1: -----------------------------------------------------------
Is pattern caching useful for queries with const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, '%HelloWorld') FROM T;

The short answer is: no. Runtime is (unsurprisingly) dominated by
pattern evaluation + other stuff going on in queries, but definitely not
pattern compilation. For space reasons, I omit details of the local
experiments.

(Side note: the current caching scheme is unbounded in size which poses
a DoS risk (think of multi-tenancy). This risk is more pronounced when
unbounded caching is used with non-const patterns ..., see next
question)

Question 2: -----------------------------------------------------------
Is pattern caching useful for queries with non-const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, col_needle) FROM T;

I benchmarked five caching strategies:
1. no caching as a baseline (= recompile for each row)
2. unbounded cache (= threadsafe global hash-map)
3. LRU cache (= threadsafe global hash-map + LRU queue)
4. lightweight local cache 1 (= not threadsafe local hashmap with
   collision list which grows to a certain size (here: 10 elements) and
   afterwards never changes)
5. lightweight local cache 2 (not threadsafe local hashmap without
   collision list in which a collision replaces the stored element, idea
   by Alexey)

... using a haystack of 2 million strings and
A) 2 million distinct simple patterns
B) 10 simple patterns
C) 2 million distinct complex patterns
D) 10 complex patterns

For A) and C), caching does not help, but these queries still allow
judging the static overhead of caching on query runtimes.

B) and D) are extreme but common cases in practice. They include
queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' :
'%pattern2%')". Caching should help significantly.

Because LIKE patterns are internally translated to re2 expressions, I
show only measurements for MATCH queries.

Results in sec, averaged over multiple measurements:

1.A): 2.12
  B): 1.68
  C): 9.75
  D): 9.45

2.A): 2.17
  B): 1.73
  C): 9.78
  D): 9.47

3.A): 9.8
  B): 0.63
  C): 31.8
  D): 0.98

4.A): 2.14
  B): 0.29
  C): 9.82
  D): 0.41

5.A): 2.12 / 2.15 / 2.26
  B): 1.51 / 0.43 / 0.30
  C): 9.97 / 9.88 / 10.13
  D): 5.70 / 0.42 / 0.43
(10/100/1000 buckets, resp. 10/1/0.1% collision rate)
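
As a rough cross-check of the numbers above, the following sketch
estimates how much of the baseline runtime disappears once patterns are
cached for workload D. It assumes that caching removes only compilation
work and adds negligible overhead of its own, so the result is an
estimate, not a measurement. The variable and function names are
illustrative only.

```python
# Runtimes in seconds, copied from the results above
# (workload D: 10 complex patterns).
baseline_d = 9.45        # strategy 1, no caching
local_cache_1_d = 0.41   # strategy 4
local_cache_2_d = 0.42   # strategy 5 with 100 buckets


def saved_fraction(baseline: float, cached: float) -> float:
    """Fraction of the baseline runtime that disappears with caching."""
    return (baseline - cached) / baseline


share_4 = saved_fraction(baseline_d, local_cache_1_d)
share_5 = saved_fraction(baseline_d, local_cache_2_d)
print(f"strategy 4 saves {share_4:.1%} of the baseline runtime")
print(f"strategy 5 saves {share_5:.1%} of the baseline runtime")
```

Under that assumption, both strategies remove well over 90% of the
baseline runtime for D.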

Evaluation:

1. This is the baseline. I was surprised that complex patterns (C, D)
   slow down the queries so badly compared to simple patterns (A, B).
   The runtime includes evaluation costs, but since caching only helps
   with compilation, 4.D and 5.D show that compilation makes up over
   90% of the runtime!

2. No speedup compared to 1, probably due to locking overhead. The cache
   is unbounded, and in experiments with data sets > 2 million rows, 2.
   is the only scheme to throw OOM exceptions, which is not acceptable.

3. Unique patterns (A and C) lead to thrashing of the LRU cache and very
   bad runtimes due to LRU queue maintenance and locking. Works pretty
   well however with few distinct patterns (B and D).

4. This scheme is tailored to queries B and D where it performs pretty
   well. More importantly, the caching is lightweight enough to not
   deteriorate performance on datasets A and C.

5. After some tuning of the hash map size, 100 buckets seem optimal:
   with 10 distinct patterns, performance is in the same ballpark as 4.
   Performance also does not deteriorate on A and C compared to the
   baseline.
   Unlike 4., this scheme behaves LRU-like and can adjust to changing
   pattern distributions.
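
The idea behind scheme 5 can be sketched as follows. This is a minimal
illustration, not the code from this commit: Python's `re` module stands
in for re2, and the class name, the `compilations` counter and the
default of 100 buckets are assumptions made for the example.

```python
import re


class DirectMappedPatternCache:
    """Not thread-safe: fixed number of buckets, one entry per bucket,
    no collision list. A colliding pattern replaces the stored one."""

    def __init__(self, num_buckets: int = 100) -> None:
        # Each bucket holds either None or a (pattern, compiled) pair.
        self._buckets = [None] * num_buckets
        self.compilations = 0  # for illustration only

    def get(self, pattern: str) -> re.Pattern:
        index = hash(pattern) % len(self._buckets)
        entry = self._buckets[index]
        if entry is not None and entry[0] == pattern:
            return entry[1]  # cache hit: skip compilation
        # Miss or collision: compile and overwrite the bucket.
        compiled = re.compile(pattern)
        self.compilations += 1
        self._buckets[index] = (pattern, compiled)
        return compiled


# Usage: evaluate a column of needles against one haystack value.
cache = DirectMappedPatternCache(num_buckets=100)
needles = ["foo.*bar", "^abc", "foo.*bar", "^abc", "foo.*bar"]
matches = [cache.get(n).search("foo123bar") is not None for n in needles]
```

Because a collision simply overwrites the bucket, recently used patterns
tend to stay cached and stale ones are replaced without any queue
maintenance, which is what makes the scheme behave LRU-like.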

As a conclusion, this commit implements two things:

1. Based on Q1, pattern search with const needle no longer uses
   caching. This applies to LIKE and MATCH + a few (exotic) other SQL
   functions. The code for the unbounded caching was removed.

2. Based on Q2, pattern search with non-const needles now uses method 5.
2022-05-30 20:00:35 +02:00
Robert Schulze
ff228d63e8
Fix typo 2022-05-27 10:14:13 +02:00
Robert Schulze
80061aa3e2
Merge remote-tracking branch 'origin/master' into cached_patterns 2022-05-27 09:21:01 +02:00
Dmitry Novik
96ca23acd5
Merge pull request #37582 from ClickHouse/revert-34775-union-type-cast
Revert "RFC: Fix converting types for UNION queries (may produce LOGICAL_ERROR)"
2022-05-27 04:09:20 +02:00
Dmitry Novik
3a9239b79f
Revert "RFC: Fix converting types for UNION queries (may produce LOGICAL_ERROR)" 2022-05-27 04:05:32 +02:00
Alexey Milovidov
e1a76e51fb
Merge pull request #37575 from ClickHouse/security-generator
Add security generator
2022-05-27 02:22:33 +03:00
Alexey Milovidov
ce777ac1b5
Merge pull request #37579 from ClickHouse/remove-useless-files-3
Remove useless files
2022-05-27 02:21:51 +03:00
Alexey Milovidov
48ec7ceddb Remove useless files 2022-05-27 00:39:16 +02:00
Alexey Milovidov
8ba865bb60
Merge pull request #37344 from excitoon-favorites/fixs3colonandequalssign
Fixed error with symbols in key name in S3
2022-05-27 00:58:35 +03:00
Alexey Milovidov
86afa3a245
Merge pull request #37502 from ClickHouse/array_norm_dist_fixes
Renamed arrayXXNorm/arrayXXDistance functions to XXNorm/XXDistance and fixed some overflow cases
2022-05-27 00:56:29 +03:00
Alexey Milovidov
aeacfa0d7e Readability 2022-05-26 22:23:37 +02:00
Alexey Milovidov
434d8729de Readability 2022-05-26 22:22:14 +02:00
Alexey Milovidov
359e36f421 Readability 2022-05-26 22:21:49 +02:00
Alexey Milovidov
3074be8d17 Add security generator 2022-05-26 22:19:15 +02:00
Alexander Gololobov
e655863d53
Merge pull request #37528 from kitaisreal/normalize-utf8-performance-tests-fix
Functions normalizeUTF8 unstable performance tests fix
2022-05-26 20:49:10 +02:00
Alexander Tokmakov
eb71dd4c78
Merge pull request #37547 from ClickHouse/followup_37398
Follow-up to #37398
2022-05-26 20:29:41 +03:00
Dmitry Novik
673a521d0b
Merge pull request #34775 from azat/union-type-cast
RFC: Fix converting types for UNION queries (may produce LOGICAL_ERROR)
2022-05-26 17:28:23 +02:00
Alexander Tokmakov
e8f33fb0d9 fix flaky tests 2022-05-26 14:17:05 +02:00
Azat Khuzhin
dc9ca3d70c
Fix LOGICAL_ERROR in getMaxSourcePartsSizeForMerge during merges (#37413) 2022-05-26 14:14:58 +02:00
Nikolai Kochetov
fea2401f1f
Merge pull request #37532 from ClickHouse/add-separate-mutex-for-factories-info
Use a separate mutex for query_factories_info in Context.
2022-05-26 13:03:28 +02:00
mergify[bot]
a7629f900f
Merge branch 'master' into normalize-utf8-performance-tests-fix 2022-05-26 10:29:55 +00:00
Maksim Kita
3a92e61827
Merge pull request #37148 from kitaisreal/dictionary-get-descendants-performance-improvement
Dictionary getDescendants performance improvement
2022-05-26 12:29:17 +02:00
Antonio Andelic
fe236c98d5
Merge pull request #37534 from ClickHouse/revert-37036-keeper-preprocess-operations
Revert "Add support for preprocessing ZooKeeper operations in `clickhouse-keeper`"
2022-05-26 08:14:46 +02:00
Sergei Trifonov
eedddf86fd
Merge pull request #37552 from ClickHouse/serxa-patch-1
fix root CMakeLists.txt search
2022-05-26 04:41:07 +02:00
Sergei Trifonov
417296481e
fix root CMakeLists.txt search 2022-05-26 04:39:02 +02:00
Dmitry Novik
5c3c994d2a
Merge pull request #37493 from ClickHouse/grouping-sets-optimization-fix
Fix ORDER BY optimization in case of GROUPING SETS
2022-05-26 02:25:02 +02:00
Alexey Milovidov
f321925032
Merge pull request #36341 from ClickHouse/allow-setuid-inside-clickhouse
Allow to drop privileges at startup
2022-05-26 01:07:04 +03:00
Maksim Kita
58cd1bd3ec
Merge pull request #36843 from bharatnc/ncb/h3-unidirectionaledges-funcs
add h3 unidirectional edge functions
2022-05-25 22:46:40 +02:00
Maksim Kita
bee3c30f66
Merge pull request #37524 from kitaisreal/geo-distance-functions-improve-performance
Geo distance functions improve performance
2022-05-25 22:40:40 +02:00
Maksim Kita
b12b363158 Fixed build of hierarchical index for HashedArrayDictionary 2022-05-25 22:40:19 +02:00
Alexander Gololobov
168b47d0ad Use same norm and distance function names for tuples and arrays 2022-05-25 22:39:59 +02:00
Alexander Gololobov
b065839f44 always return Float64 2022-05-25 22:27:00 +02:00
Alexander Gololobov
5df14cd956 Cast arguments to result type to avoid int overflow 2022-05-25 22:27:00 +02:00
alesapin
bf0da38d6f
Merge pull request #37402 from DanRoscigno/origin/67-replace-zookeeper-to-clickhouse-keeper-in-docs-and-tutorials
add ClickHouse Keeper to doc pages describing ZooKeeper use
2022-05-25 22:24:56 +02:00
Robert Schulze
7543841438
Merge pull request #37518 from ClickHouse/bump-cctz-to-2022-05-15
Bump cctz to 2022-05-15
2022-05-25 22:14:41 +02:00
Alexander Tokmakov
6ca6b267fa
Merge pull request #37545 from ClickHouse/revert-37424-fix_fetching_part_deadlock
Revert "(only with zero-copy replication, non-production experimental feature not recommended to use) fix possible deadlock during fetching part"
2022-05-25 23:11:16 +03:00
Alexander Tokmakov
47820c216d
Revert "(only with zero-copy replication, non-production experimental feature not recommended to use) fix possible deadlock during fetching part" 2022-05-25 23:10:33 +03:00
Robert Schulze
49934a3dc8
Cache compiled regexps when evaluating non-const needles
Needles in a (non-const) needle column may repeat and this commit allows
skipping compilation for known needles. Out of the different design
alternatives (see below, if someone is interested), we now maintain
- one global pattern cache,
- with a fixed size of 42k elements currently,
- and use LRU as eviction strategy.

------------------------------------------------------------------------

(sorry for the wall of text, dumping it here not for reading but just
for reference)

Write-up about considered design alternatives:

1. Keep the current global cache of const needles. For non-const
   needles, probe the cache but don't store values in it.
   Pros: need to maintain just a single cache, no problem with cache
         pollution assuming there are few distinct constant needles
   Cons: only useful if a non-const needle already occurred as a
         const needle
   --> overall too simplistic

2. Keep the current global cache for const needles. For non-const
   needles, create a local (e.g. per-query) cache
   Pros: unlike (1.), non-const needles can be skipped even if they
         did not occur yet, no pollution of the const pattern cache when
         there are very many non-const needles (e.g. large / highly
         distinct needle columns).
   Cons: caches may explode "horizontally", i.e. we'll end up with the
         const cache + caches for Q1, Q2, ... QN, this makes it harder
         to control the overall space consumption, also patterns
         residing in different caches cannot be reused between queries,
         another difficulty is that the concept of "query" does not
         really exist at matching level - there are only column chunks
         and we'd potentially end up with 1 cache / chunk

3. Queries with const and non-const needles insert into the same global
   cache.
   Pros: the advantages of (2.) + allows reusing compiled patterns
         across parallel queries
   Cons: needs an eviction strategy to control cache size and pollution
         (and btw. (2.) also needs eviction strategies for the
         individual caches)

4. Queries with const needle use global cache, queries with non-const
   needle use a different global cache
   --> Overall similar to (3) but ignores the (likely) edge case that
       const and non-const needles overlap.

In sum, (3.) seems the simplest and most beneficial approach.

Eviction strategies:

0. Don't ever evict --> cache may grow infinitely and eventually make
   the system unusable (may even pose a DoS risk)

1. Flush the cache after a certain threshold is exceeded --> very
   simple but may lead to periodic performance drops

2. Use LRU --> more graceful performance degradation at threshold but
   comes with a (constant) performance overhead to maintain the LRU
   queue

In sum, given that the pattern compilation in RE2 should be quite costly
(pattern-to-DFA/NFA), LRU may be acceptable.
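
A global, size-bounded pattern cache with LRU eviction, as described
above, could look roughly like this. It is a sketch under stated
assumptions and not the implementation in this commit: Python's `re`
stands in for RE2, the class and method names are made up for the
example, and compiling while holding the lock is chosen for simplicity.

```python
import re
import threading
from collections import OrderedDict


class GlobalLRUPatternCache:
    """Thread-safe cache of compiled patterns with a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._entries = OrderedDict()  # pattern -> compiled, oldest first
        self._lock = threading.Lock()

    def get(self, pattern: str) -> re.Pattern:
        with self._lock:
            compiled = self._entries.get(pattern)
            if compiled is not None:
                # Hit: mark the pattern as most recently used.
                self._entries.move_to_end(pattern)
                return compiled
            compiled = re.compile(pattern)
            self._entries[pattern] = compiled
            if len(self._entries) > self._capacity:
                # Evict the least recently used pattern.
                self._entries.popitem(last=False)
            return compiled

    def cached_patterns(self) -> list:
        with self._lock:
            return list(self._entries)


# Usage with a tiny capacity to make eviction visible.
lru = GlobalLRUPatternCache(capacity=2)
for needle in ["a+", "b+", "a+", "c+"]:
    lru.get(needle)
```

The lock and the reordering on every hit are the constant overhead
mentioned for the LRU strategy: even a lookup that finds its pattern has
to update the queue.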
2022-05-25 22:04:06 +02:00
Robert Schulze
ea60a614d2
Decrease namespace indent 2022-05-25 21:56:35 +02:00
Robert Schulze
23378ab67b
Merge pull request #37520 from ClickHouse/update-3rd-party-contribution-guide
Update 3rd party contribution guide
2022-05-25 21:54:49 +02:00
Alexey Milovidov
abf2558fba
Merge pull request #37491 from ClickHouse/match_refactoring
Refactorings of LIKE/MATCH code
2022-05-25 22:05:38 +03:00
Alexey Milovidov
4482da9eb6
Update greatCircleDistance.cpp 2022-05-25 21:59:31 +03:00
Alexey Milovidov
de90c6e6c0
Merge pull request #37533 from ClickHouse/fixes-architecture-doc
Update architecture.md
2022-05-25 21:57:26 +03:00
Alexey Milovidov
5ecde38b40
Merge pull request #37541 from ClickHouse/blinkov-patch-23
Update SECURITY.md
2022-05-25 21:54:05 +03:00
alesapin
620ab399c9
Update docs/en/operations/clickhouse-keeper.md 2022-05-25 20:23:24 +02:00
alesapin
51868a9a4f
Merge pull request #37424 from metahys/fix_fetching_part_deadlock
(only with zero-copy replication, non-production experimental feature not recommended to use) fix possible deadlock during fetching part
2022-05-25 20:15:41 +02:00
Azat Khuzhin
a813f5996e Fix converting types for UNION queries (may produce LOGICAL_ERROR)
CI found [1]:

    2022.02.20 15:14:23.969247 [ 492 ] {} <Fatal> BaseDaemon: (version 22.3.1.1, build id: 6082C357CFA6FF99) (from thread 472) (query_id: a5187ff9-962a-4e7c-86f6-8d48850a47d6) (query: SELECT 0., round(avgWeighted(x, y)) FROM (SELECT toDate(toDate('214748364.8', '-922337203.6854775808', '-0.1', NULL) - NULL, 10.000100135803223, '-2147483647'), 255 AS x, -2147483647 AS y UNION ALL SELECT y, NULL AS x, 2147483646 AS y)) Received signal Aborted (6)

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/0/26d0e5438c86e52a145aaaf4cb523c399989a878/fuzzer_astfuzzerdebug,actions//report.html

The problem is that subqueries return different headers:
- first query  -- x, y
- second query -- y, x

v2: Make order of columns strict only for UNION
    https://s3.amazonaws.com/clickhouse-test-reports/34775/9cc8c01a463d18c471853568b2f0af659a4e643f/stateless_tests__address__actions__[2/2].html
    Fixes: 00597_push_down_predicate_long
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-05-25 20:31:47 +03:00
Nikolai Kochetov
ff98c24d44
Merge pull request #37048 from Avogar/fix-array-map-nothing
Add default implementation for Nothing in functions
2022-05-25 19:10:40 +02:00
Ivan Blinkov
df84be9b43
Update SECURITY.md 2022-05-25 20:04:20 +03:00
Alexey Milovidov
97c5a4c725
Update SECURITY.md 2022-05-25 20:04:15 +03:00