Commit Graph

32 Commits

Author SHA1 Message Date
Robert Schulze
ad12adc31c
Measure and rework internal re2 caching
This commit is based on local benchmarks of ClickHouse's re2 caching.

Question 1: -----------------------------------------------------------
Is pattern caching useful for queries with const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, '%HelloWorld') FROM T;

The short answer is: no. Runtime is (unsurprisingly) dominated by
pattern evaluation + other stuff going on in queries, but definitely not
pattern compilation. For space reasons, I omit details of the local
experiments.

(Side note: the current caching scheme is unbounded in size which poses
a DoS risk (think of multi-tenancy). This risk is more pronounced when
unbounded caching is used with non-const patterns ..., see next
question)

Question 2: -----------------------------------------------------------
Is pattern caching useful for queries with non-const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, col_needle) FROM T;

I benchmarked five caching strategies:
1. no caching as a baseline (= recompile for each row)
2. unbounded cache (= threadsafe global hash-map)
3. LRU cache (= threadsafe global hash-map + LRU queue)
4. lightweight local cache 1 (= not threadsafe local hashmap with
   collision list which grows to a certain size (here: 10 elements) and
   afterwards never changes)
5. lightweight local cache 2 (not threadsafe local hashmap without
   collision list in which a collision replaces the stored element, idea
   by Alexey)

... using a haystack of 2 mio strings and
A). 2 mio distinct simple patterns
B). 10 simple patterns
C)  2 mio distinct complex patterns
D)  10 complex patterns

Fo A) and C), caching does not help but these queries still allow to
judge the static overhead of caching on query runtimes.

B) and D) are extreme but common cases in practice. They include
queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' :
'%pattern2%'). Caching should help significantly.

Because LIKE patterns are internally translated to re2 expressions, I
show only measurements for MATCH queries.

Results in sec, averaged over on multiple measurements;

1.A): 2.12
  B): 1.68
  C): 9.75
  D): 9.45

2.A): 2.17
  B): 1.73
  C): 9.78
  D): 9.47

3.A): 9.8
  B): 0.63
  C): 31.8
  D): 0.98

4.A): 2.14
  B): 0.29
  C): 9.82
  D): 0.41

5.A) 2.12 / 2.15 / 2.26
  B) 1.51 / 0.43 / 0.30
  C) 9.97 / 9.88 / 10.13
  D) 5.70 / 0.42 / 0.43
(10/100/1000 buckets, resp. 10/1/0.1% collision rate)

Evaluation:

1. This is the baseline. It was surprised that complex patterns (C, D)
   slow down the queries so badly compared to simple patterns (A, B).
   The runtime includes evaluation costs, but as caching only helps with
   compilation, and looking at 4.D and 5.D, compilation makes up over 90%
   of the runtime!

2. No speedup compared to 1, probably due to locking overhead. The cache
   is unbounded, and in experiments with data sets > 2 mio rows, 2. is
   the only scheme to throw OOM exceptions which is not acceptable.

3. Unique patterns (A and C) lead to thrashing of the LRU cache and very
   bad runtimes due to LRU queue maintenance and locking. Works pretty
   well however with few distinct patterns (B and D).

4. This scheme is tailored to queries B and D where it performs pretty
   good. More importantly, the caching is lightweight enough to not
   deteriorate performance on datasets A and C.

5. After some tuning of the hash map size, 100 buckets seem optimal to
   be in the same ballpark with 10 distinct patterns as 4. Performance
   also does not deteriorate on A and C compared to the baseline.
   Unlike 4., this scheme behaves LRU-like and can adjust to changing
   pattern distributions.

As a conclusion, this commit implementes two things:

1. Based on Q1, pattern search with const needle no longer uses
   caching. This applies to LIKE and MATCH + a few (exotic) other SQL
   functions. The code for the unbounded caching was removed.

2. Based on Q2, pattern search with non-const needles now use method 5.
2022-05-30 20:00:35 +02:00
Robert Schulze
7232f47c68
Fix Bug 37114 - ilike on FixedString(N) columns produces wrong results
The main fix is in MatchImpl.h where the "case_insensitive" parameter is
added to Regexps::get().

Also made "case_insensitive" a non-default template parameter to reduce
the risk of future bugs.

The remainder of this commit are minor random code improvements.

resoves #37114
2022-05-11 14:30:21 +02:00
Maksim Kita
538f8cbaad Fix clang-tidy warnings in Disks, Formats, Functions folders 2022-03-14 18:17:35 +00:00
Maksim Kita
c2407fee06 Fixed tests 2021-09-30 14:35:24 +03:00
Pavel Kruglov
70b51133c1 Try to simplify code 2021-08-09 18:01:08 +03:00
Pavel Kruglov
0662df8b76 Fix performance with JIT, add arguments to function isSuitableForShortCircuitArgumentsExecution 2021-08-09 17:54:14 +03:00
Pavel Kruglov
e792fa588f Mark all Functions as sutable or not for executing as short circuit arguments 2021-08-09 17:50:09 +03:00
Vasily Nemkov
a1fb16df52 setting regexp_max_matches_per_row instead of 3rd argument 2021-07-30 12:28:21 +03:00
Vasily Nemkov
ec77ba8bfc Updated extractAllGroupsHorizontal - flexible limit on number of matches per row.
If it is not set via third argument, it deafults to previously hardcoded
value 1000.
2021-07-29 15:36:55 +03:00
Nikolai Kochetov
dbaa6ffc62 Rename ContextConstPtr to ContextPtr. 2021-06-01 15:20:52 +03:00
Alexander Kuzmenkov
3f57fc085b remove mutable context references from functions interface
Also remove it from some visitors.
2021-05-28 19:45:37 +03:00
Maksim Kita
d923d9e6ef Function move file 2021-05-17 10:30:42 +03:00
Maksim Kita
947f28d430 IFunction refactoring 2021-05-15 20:33:15 +03:00
alexey-milovidov
6a2a9cecdd
Update extractAllGroups.h 2021-04-14 01:24:46 +03:00
Vasily Nemkov
77bdb5b391 Fixed erroneus failure of extractAllGroupsHorizontal on large columns 2021-04-14 00:17:06 +03:00
Ivan
495c6e03aa
Replace all Context references with std::weak_ptr (#22297)
* Replace all Context references with std::weak_ptr

* Fix shared context captured by value

* Fix build

* Fix Context with named sessions

* Fix copy context

* Fix gcc build

* Merge with master and fix build

* Fix gcc-9 build
2021-04-11 02:33:54 +03:00
Alexey Milovidov
e38ff3517d Fail fast in incorrect usage of extractAllGroups 2021-01-22 02:48:26 +03:00
Ivan Lezhankin
f897f7c93f Refactor IFunction to execute with const arguments 2020-11-17 16:24:45 +03:00
Nikolai Kochetov
295e612343 Fix build and tests. 2020-10-20 16:11:57 +03:00
Nikolai Kochetov
ce2f6a0560 Part 4. 2020-10-18 00:41:50 +03:00
Nikolai Kochetov
959424f28a Rename block to columns. 2020-10-14 17:04:50 +03:00
Nikolai Kochetov
966b1d6cf5 Rename Block to ColumnsWithTypeAndName. 2020-10-14 16:09:11 +03:00
Nikolai Kochetov
9b42bfdc36
Merge pull request #15817 from ClickHouse/new-block-for-functions-2
Use `ColumnsWithTypeAndName` instead of `Block` for function calls [part 2]
2020-10-12 00:32:58 +03:00
Alexey Milovidov
269b6383f5 Check for #pragma once in headers 2020-10-10 21:37:02 +03:00
Nikolai Kochetov
d28325a353 Replace getByPosition to [] 2020-10-10 21:24:57 +03:00
Alexey Milovidov
c37b55c3b1 Fix error in "extractAllGroups" function 2020-09-17 00:19:58 +03:00
Alexey Milovidov
edd89a8610 Fix half of typos 2020-08-08 03:47:03 +03:00
Nikolai Kochetov
e4689ce302 Make IFunction::executeImpl const. 2020-07-21 16:58:07 +03:00
Alexey Milovidov
9dc43fc435 Fix race condition in extractAllGroups 2020-06-25 19:57:30 +03:00
Alexey Milovidov
9901e4d528 Remove debug output #11554 2020-06-13 20:20:54 +03:00
Alexey Milovidov
787163d0b4 Minor modifications after merging #11554 2020-06-12 17:03:00 +03:00
Vasily Nemkov
50a184acac extractAllGroupsHorizontal and extractAllGroupsVertical
Split tests, fixed some error messages

Fixed test and error reporting of extractGroups
2020-06-11 11:03:17 +03:00