ClickHouse

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-12-04 13:32:13 +00:00

Author	SHA1	Message	Date
Robert Schulze	600512cc08	Replace exceptions thrown for programming errors by asserts	2022-06-01 11:53:37 +02:00
Robert Schulze	ad12adc31c	Measure and rework internal re2 caching

This commit is based on local benchmarks of ClickHouse's re2 caching.

Question 1:
-----------------------------------------------------------
Is pattern caching useful for queries with const LIKE/REGEX patterns? E.g.

    SELECT LIKE(col_haystack, '%HelloWorld') FROM T;

The short answer is: no. Runtime is (unsurprisingly) dominated by pattern evaluation + other stuff going on in queries, but definitely not pattern compilation. For space reasons, I omit details of the local experiments.

(Side note: the current caching scheme is unbounded in size which poses a DoS risk (think of multi-tenancy). This risk is more pronounced when unbounded caching is used with non-const patterns ..., see next question)

Question 2:
-----------------------------------------------------------
Is pattern caching useful for queries with non-const LIKE/REGEX patterns? E.g.

    SELECT LIKE(col_haystack, col_needle) FROM T;

I benchmarked five caching strategies:
1. no caching as a baseline (= recompile for each row)
2. unbounded cache (= threadsafe global hash-map)
3. LRU cache (= threadsafe global hash-map + LRU queue)
4. lightweight local cache 1 (= not threadsafe local hashmap with collision list which grows to a certain size (here: 10 elements) and afterwards never changes)
5. lightweight local cache 2 (= not threadsafe local hashmap without collision list in which a collision replaces the stored element, idea by Alexey)

... using a haystack of 2 mio strings and
A) 2 mio distinct simple patterns
B) 10 simple patterns
C) 2 mio distinct complex patterns
D) 10 complex patterns

For A) and C), caching does not help, but these queries still allow judging the static overhead of caching on query runtimes. B) and D) are extreme but common cases in practice. They include queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' : '%pattern2%')". Caching should help significantly.

Because LIKE patterns are internally translated to re2 expressions, I show only measurements for MATCH queries.

Results in sec, averaged over multiple measurements:

      A)                   B)                   C)                    D)
1.    2.12                 1.68                 9.75                  9.45
2.    2.17                 1.73                 9.78                  9.47
3.    9.8                  0.63                 31.8                  0.98
4.    2.14                 0.29                 9.82                  0.41
5.    2.12 / 2.15 / 2.26   1.51 / 0.43 / 0.30   9.97 / 9.88 / 10.13   5.70 / 0.42 / 0.43
(for 5.: 10/100/1000 buckets, resp. 10/1/0.1% collision rate)

Evaluation:
1. This is the baseline. I was surprised that complex patterns (C, D) slow down the queries so badly compared to simple patterns (A, B). The runtime includes evaluation costs, but as caching only helps with compilation, and looking at 4.D and 5.D, compilation makes up over 90% of the runtime!
2. No speedup compared to 1, probably due to locking overhead. The cache is unbounded, and in experiments with data sets > 2 mio rows, 2. is the only scheme to throw OOM exceptions, which is not acceptable.
3. Unique patterns (A and C) lead to thrashing of the LRU cache and very bad runtimes due to LRU queue maintenance and locking. Works pretty well, however, with few distinct patterns (B and D).
4. This scheme is tailored to queries B and D, where it performs pretty well. More importantly, the caching is lightweight enough to not deteriorate performance on datasets A and C.
5. After some tuning of the hash map size, 100 buckets seem optimal to be in the same ballpark with 10 distinct patterns as 4. Performance also does not deteriorate on A and C compared to the baseline. Unlike 4., this scheme behaves LRU-like and can adjust to changing pattern distributions.

As a conclusion, this commit implements two things:
1. Based on Q1, pattern search with const needle no longer uses caching. This applies to LIKE and MATCH + a few (exotic) other SQL functions. The code for the unbounded caching was removed.
2. Based on Q2, pattern search with non-const needles now uses method 5.	2022-05-30 20:00:35 +02:00
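Caching strategy 5 from the commit above (a local, non-thread-safe hash map without a collision list, where a collision replaces the stored element) can be illustrated with a minimal sketch. This is not the ClickHouse implementation: the class name `LocalPatternCache`, its methods, and the use of `std::regex` as a stand-in for re2 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch of scheme 5: a fixed number of buckets, no collision list.
// Not thread-safe by design; each user (e.g. each thread) would own its own instance.
class LocalPatternCache
{
public:
    explicit LocalPatternCache(size_t bucket_count) : buckets(bucket_count)
    {
        assert(bucket_count > 0);
    }

    // Returns the compiled pattern, compiling only if the bucket holds a different one.
    const std::regex & getOrCompile(const std::string & pattern)
    {
        Bucket & bucket = buckets[std::hash<std::string>{}(pattern) % buckets.size()];
        if (!bucket.compiled || bucket.pattern != pattern)
        {
            // Miss or collision: overwrite whatever was stored in this bucket.
            bucket.pattern = pattern;
            bucket.compiled.emplace(pattern); // std::regex stands in for re2 (assumption)
            ++compilations;
        }
        return *bucket.compiled;
    }

    size_t compilationCount() const { return compilations; }

private:
    struct Bucket
    {
        std::string pattern;
        std::optional<std::regex> compiled;
    };

    std::vector<Bucket> buckets;
    size_t compilations = 0;
};
```

Because a colliding pattern simply overwrites the previous occupant, the cache has a hard size bound and needs no eviction queue, which is consistent with the "LRU-like" behaviour described in the evaluation of scheme 5.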
Robert Schulze	49934a3dc8	Cache compiled regexps when evaluating non-const needles

Needles in a (non-const) needle column may repeat, and this commit allows skipping compilation for known needles. Out of the different design alternatives (see below, if someone is interested), we now maintain
- one global pattern cache,
- with a fixed size of 42k elements currently,
- and use LRU as eviction strategy.

------------------------------------------------------------------------
(sorry for the wall of text, dumping it here not for reading but just for reference)

Write-up about considered design alternatives:

1. Keep the current global cache of const needles. For non-const needles, probe the cache but don't store values in it.
   Pros: need to maintain just a single cache, no problem with cache pollution assuming there are few distinct constant needles
   Cons: only useful if a non-const needle already occurred as a const needle
   --> overall too simplistic

2. Keep the current global cache for const needles. For non-const needles, create a local (e.g. per-query) cache.
   Pros: unlike (1.), non-const needles can be skipped even if they did not occur yet, no pollution of the const pattern cache when there are very many non-const needles (e.g. large / highly distinct needle columns).
   Cons: caches may explode "horizontally", i.e. we'll end up with the const cache + caches for Q1, Q2, ... QN, this makes it harder to control the overall space consumption, also patterns residing in different caches cannot be reused between queries, another difficulty is that the concept of "query" does not really exist at matching level - there are only column chunks and we'd potentially end up with 1 cache / chunk

3. Queries with const and non-const needles insert into the same global cache.
   Pros: the advantages of (2.) + allows reusing compiled patterns across parallel queries
   Cons: needs an eviction strategy to control cache size and pollution (and btw. (2.) also needs eviction strategies for the individual caches)

4. Queries with const needle use global cache, queries with non-const needle use a different global cache.
   --> Overall similar to (3) but ignores the (likely) edge case that const and non-const needles overlap.

In sum, (3.) seems the simplest and most beneficial approach.

Eviction strategies:

0. Don't ever evict --> cache may grow infinitely and eventually make the system unusable (may even pose a DoS risk)
1. Flush the cache after a certain threshold is exceeded --> very simple but may lead to periodic performance drops
2. Use LRU --> more graceful performance degradation at threshold but comes with a (constant) performance overhead to maintain the LRU queue

In sum, given that the pattern compilation in RE2 should be quite costly (pattern-to-DFA/NFA), LRU may be acceptable.	2022-05-25 22:04:06 +02:00
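The design chosen in the commit above (one global cache of fixed size with LRU eviction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: `LRUPatternCache`, `getOrCompile`, `contains` and `size` are hypothetical names, and `std::regex` stands in for re2.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: thread-safe pattern cache with a fixed capacity and LRU eviction.
class LRUPatternCache
{
public:
    using RegexPtr = std::shared_ptr<const std::regex>;

    explicit LRUPatternCache(size_t max_size_) : max_size(max_size_)
    {
        assert(max_size > 0);
    }

    RegexPtr getOrCompile(const std::string & pattern)
    {
        std::lock_guard<std::mutex> lock(mutex);

        auto it = index.find(pattern);
        if (it != index.end())
        {
            // Hit: move the entry to the front (most recently used position).
            queue.splice(queue.begin(), queue, it->second);
            return it->second->second;
        }

        // Miss: compile, then evict the least recently used entry if the cache is full.
        RegexPtr compiled = std::make_shared<std::regex>(pattern);
        if (queue.size() >= max_size)
        {
            index.erase(queue.back().first);
            queue.pop_back();
        }
        queue.emplace_front(pattern, compiled);
        index[pattern] = queue.begin();
        return compiled;
    }

    bool contains(const std::string & pattern) const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return index.count(pattern) > 0;
    }

    size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return queue.size();
    }

private:
    using Entry = std::pair<std::string, RegexPtr>;

    size_t max_size;
    mutable std::mutex mutex;
    std::list<Entry> queue; // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index;
};
```

Returning a `shared_ptr` keeps a compiled pattern alive for a caller even if another thread evicts it in the meantime. The lock and the queue maintenance on every lookup are the constant overhead mentioned under eviction strategy 2.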
Robert Schulze	05e4fa7df1	Fix special case of trivial regexp Previously, we would always set 1 in case of a trivial regex (which is correct). If someone in future builds a negated operator, then this will produce wrong results. Right now, negation of regexp (SQL: NOT MATCH) is implemented at a higher level, so we are safe and this is more a preventive fix.	2022-05-25 10:05:55 +02:00
Robert Schulze	01ab7b9bad	Pass strings in some places as string_view

The original goal was to change

    const auto & needle = String(reinterpret_cast<const char *>(cur_needle_data), cur_needle_length);

in Functions/MatchImpl.h into a std::string_view to save an allocation + copy. The needle is eventually passed as search pattern into the re2 library. Re2 has an alternative constructor taking a const char *, i.e. a NULL-terminated string. Here, the needle is NULL-terminated but
1. this is only because it is passed inside a ColumnString, yet this is not always the case (e.g. fixed string columns have a dense layout w/o NULL terminator).
2. assuming NULL termination for users != MatchImpl of the regex code is too dangerous.

So, for now we'll stay with copying to be on the safe side. One fine day when re2 has a ptr/size ctor, we can use std::string_view.

Just changing a few other places from std::string to std::string_view but this will not help with performance.	2022-05-25 10:05:51 +02:00
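The NULL-termination concern in the commit above can be illustrated with a small sketch. All names here (`lengthSeenByCStringApi`, `fixedStringAt`, `safeLength`) are hypothetical and only model the situation; they are not part of ClickHouse or re2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical stand-in for an API that only accepts NUL-terminated strings.
inline size_t lengthSeenByCStringApi(const char * pattern)
{
    return std::strlen(pattern);
}

// Returns the i-th value of a dense fixed-width buffer (no terminators between values).
inline std::string_view fixedStringAt(std::string_view dense, size_t width, size_t i)
{
    assert((i + 1) * width <= dense.size());
    return dense.substr(i * width, width);
}

// Safe bridge: copy the view into an owning std::string, whose c_str() is NUL-terminated.
inline size_t safeLength(std::string_view needle)
{
    const std::string copy(needle);
    return lengthSeenByCStringApi(copy.c_str());
}
```

For a dense buffer such as `"abcdefghi"` with width 3, the first value is the view `"abc"`, but handing `view.data()` to the C-string API would make it read on to the end of the buffer. Copying first avoids this, at the price of the allocation the commit wanted to save.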
Robert Schulze	040fbf3686	Tighter sanity checks in matching code	2022-05-25 10:05:06 +02:00
Robert Schulze	35bef17302	Introduce variables to hold the match result --> nicer when debugging	2022-05-25 10:04:47 +02:00
Robert Schulze	b044d44fef	Refactoring: Make template instantiation easier to read
- introduced class MatchTraits with enums that replace bool template parameters
- (minor: made negation the last template parameter because negation executes last during evaluation)	2022-05-25 10:03:58 +02:00
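The readability gain from replacing bool template parameters with enums might look roughly like the sketch below. The enumerator names and the `MatchSketch` template are assumptions for illustration; only the name `MatchTraits` and the idea of negation as the last parameter come from the commit message.

```cpp
// Sketch only: enumerator names are assumed, not taken from the commit.
struct MatchTraits
{
    enum class Syntax { Like, Re2 };
    enum class Case { Sensitive, Insensitive };
    enum class Result { DontNegate, Negate };
};

// Hypothetical matcher template; negation is the last parameter because it is applied last.
template <MatchTraits::Syntax syntax, MatchTraits::Case case_sensitivity, MatchTraits::Result result>
struct MatchSketch
{
    static constexpr bool is_like = (syntax == MatchTraits::Syntax::Like);
    static constexpr bool is_case_insensitive = (case_sensitivity == MatchTraits::Case::Insensitive);

    // Applies the optional negation to the raw match outcome.
    static constexpr bool finalize(bool matched)
    {
        return result == MatchTraits::Result::Negate ? !matched : matched;
    }
};

// Compare with an instantiation such as MatchImpl<true, true, true>, which does not explain itself.
using NotILikeSketch = MatchSketch<MatchTraits::Syntax::Like, MatchTraits::Case::Insensitive, MatchTraits::Result::Negate>;
using MatchRe2Sketch = MatchSketch<MatchTraits::Syntax::Re2, MatchTraits::Case::Sensitive, MatchTraits::Result::DontNegate>;
```

With enums, swapping two arguments by accident becomes a compile error instead of a silent behaviour change, which is a common motivation for this kind of refactoring.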
Robert Schulze	028f15c4fa	Review comment: Throw LOGICAL_ERROR for different sizes of haystack / needles	2022-05-24 20:19:13 +02:00
Robert Schulze	e25ca139cd	Implement SQL functions (NOT) (I)LIKE() + MATCH() with non-const needles

With this commit, SQL functions LIKE and MATCH and their variants can work with non-const needle arguments. E.g.

    create table tab (id UInt32, haystack String, needle String) engine = MergeTree() order by id;
    insert into tab values (1, 'Hello', '%ell%') (2, 'World', '%orl%');
    select id, haystack, needle, like(haystack, needle) from tab;

For that, methods vectorVector() and vectorFixedVector() were added to MatchImpl. The existing code for const needles has an optimization where the compiled regexp is cached. The new code expects a different needle per row and consequently does not cache the regexp.	2022-05-23 09:41:28 +02:00
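Row-wise matching with a different needle per row, as described in the commit above, can be sketched like this. It is a simplified model rather than the MatchImpl code: `likeToRegex` and `likeVectorVector` are hypothetical names, `std::regex` stands in for re2, the LIKE translation only handles `%`, `_` and backslash escapes, and the size check via `std::logic_error` is an assumption about how mismatched columns would be rejected.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// Translates a LIKE pattern into a regular expression (simplified, hypothetical helper).
std::string likeToRegex(const std::string & like)
{
    static const std::string special = "\\^$.|?*+()[]{}";
    std::string res;
    for (size_t i = 0; i < like.size(); ++i)
    {
        char c = like[i];
        if (c == '%')
            res += ".*"; // any sequence of characters
        else if (c == '_')
            res += '.'; // exactly one character
        else
        {
            if (c == '\\' && i + 1 < like.size())
                c = like[++i]; // a backslash makes the next character literal
            if (special.find(c) != std::string::npos)
                res += '\\'; // escape regex metacharacters
            res += c;
        }
    }
    return res;
}

// Evaluates LIKE row by row; the pattern is recompiled for every row (no caching).
std::vector<bool> likeVectorVector(const std::vector<std::string> & haystacks, const std::vector<std::string> & needles)
{
    if (haystacks.size() != needles.size())
        throw std::logic_error("haystack and needle columns must have the same size");

    std::vector<bool> result(haystacks.size());
    for (size_t i = 0; i < haystacks.size(); ++i)
    {
        const std::regex re(likeToRegex(needles[i]));
        result[i] = std::regex_match(haystacks[i], re); // regex_match requires a full match
    }
    return result;
}
```

Calling `likeVectorVector({"Hello", "World"}, {"%ell%", "%orl%"})` mirrors the SQL example from the commit message. Compiling inside the loop is exactly the per-row cost that a pattern cache for non-const needles aims to reduce.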
Robert Schulze	4829ae8380	Replace overly clever const argument logic by something simpler The previous logic was smart but too inflexible to support the next commits. Replace by a simple pushdown logic where string search implementations return their const arguments instead of having the common class figure these out based on properties/traits.	2022-05-22 17:50:38 +02:00
Robert Schulze	0299cc87e4	Improve naming consistency of string search code Just renamings, nothing major ...	2022-05-22 17:50:38 +02:00
Robert Schulze	7232f47c68	Fix Bug 37114 - ilike on FixedString(N) columns produces wrong results The main fix is in MatchImpl.h where the "case_insensitive" parameter is added to Regexps::get(). Also made "case_insensitive" a non-default template parameter to reduce the risk of future bugs. The remainder of this commit consists of minor random code improvements. Resolves #37114	2022-05-11 14:30:21 +02:00
Azat Khuzhin	789dfd9f3b	Remove unbundled re2 support v2: preserve re2_st name to make PVS check pass (since docker image update fails)	2022-01-20 10:00:49 +03:00
Alexey Milovidov	8b4a6a2416	Remove cruft	2021-10-28 02:10:39 +03:00
Nikolay Degterinsky	f861da2dd1	Fix [I]LIKE function	2021-10-18 17:19:02 +03:00
Alexey Milovidov	fe6b7c77c7	Rename "common" to "base"	2021-10-02 10:13:14 +03:00
Vasily Nemkov	2c6b9aa174	Better exception messages for some String-related functions	2021-09-26 08:18:37 +03:00
Nikolai Kochetov	959424f28a	Rename block to columns.	2020-10-14 17:04:50 +03:00
alesapin	8c0581c503	Fix ilike cache	2020-10-02 17:27:47 +03:00
vdimir	7260009978	Fix argument handling in string search functions	2020-08-04 10:10:50 +03:00
vdimir	368314b930	Fix style in PositionImpl and FunctionsStringSearch	2020-08-02 14:24:39 +00:00
vdimir	9ed7df64cd	Support 'start_pos' argument in 'position' function	2020-08-02 12:56:18 +00:00
myrrc	8c3417fbf7	ILIKE operator (#12125)
* Integrated CachingAllocator into MarkCache
* fixed build errors
* reset func hotfix
* upd: Fixing build
* updated submodules links
* fix 2
* updating grabber allocator proto
* updating lost work
* updating CMake to use concepts
* some other changes to get it building (integration into MarkCache)
* further integration into caches
* updated Async metrics, fixed some build errors
* and some other errors revealing
* added perfect forwarding to some functions
* fix: forward template
* fix: constexpr modifier
* fix: FakePODAllocator missing member func
* updated PODArray constructor taking alloc params
* fix: PODArray overload with n restored
* fix: FakePODAlloc duplicating alloc() func
* added constexpr variable for alloc_tag_t
* split cache values by allocators, provided updates
* fix: memcpy
* fix: constexpr modifier
* fix: noexcept modifier
* fix: alloc_tag_t for PODArray constructor
* fix: PODArray copy ctor with different alloc
* fix: resize() signature
* updating to latest working master
* syncing with 273267
* first draft version
* fix: update Searcher to case-insensitive
* added ILIKE test
* fixed style errors, updated test, split like and ilike, added notILike
* replaced inconsistent comments
* fixed show tables ilike
* updated missing test cases
* regenerated ya.make
* Update 01355_ilike.sql
Co-authored-by: myrrc <me-clickhouse@myrrec.space>
Co-authored-by: alexey-milovidov <milovidov@yandex-team.ru>	2020-07-05 18:57:59 +03:00
Alexey Milovidov	9dc43fc435	Fix race condition in extractAllGroups	2020-06-25 19:57:30 +03:00
Alexey Milovidov	466da445e1	Every function in its own file, part 13	2020-05-07 02:21:13 +03:00

26 Commits