ClickHouse

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-12-12 01:12:12 +00:00

Author	SHA1	Message	Date
Robert Schulze	c9ce0efa66	Instantiate MultiMatchAnyImpl template using enums - With this, invalid combinations of the FindAny/FindAnyIndex bools are no longer possible and we can remove the corresponding check - Also makes the instantiations more readable.	2022-06-26 16:25:49 +00:00
Robert Schulze	580d89477f	Minimally faster performance	2022-06-26 15:34:22 +00:00
Robert Schulze	4bc59c18e3	Cosmetics: Move some code around + docs + whitespaces + minor stuff	2022-06-26 15:34:15 +00:00
Robert Schulze	bb7c627964	Cosmetics: Pass patterns around as std::string_view instead of StringRef - The patterns are not used in hashing, there should not be a performance impact when we use stuff from the standard library instead. - added forgotten .reserve() in FunctionsMultiStringPosition.h	2022-06-26 15:32:19 +00:00
Robert Schulze	2c828338f4	Replace hyperscan by vectorscan This commit migrates ClickHouse to Vectorscan. The first 10 min of [0] explain the reasons for it. () Addresses (but does not resolve) #38046 () Config parameter names (e.g. "max_hyperscan_regexp_length") are preserved for compatibility. Likewise, error codes (e.g. "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g. "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in replacement. [0] https://www.youtube.com/watch?v=KlZWmmflW6M	2022-06-24 10:47:52 +02:00
Robert Schulze	657662d89f	Minor follow-up to cache table: std::{vector-->array}	2022-06-02 20:18:10 +02:00
Robert Schulze	ad12adc31c	Measure and rework internal re2 caching This commit is based on local benchmarks of ClickHouse's re2 caching. Question 1: ----------------------------------------------------------- Is pattern caching useful for queries with const LIKE/REGEX patterns? E.g. SELECT LIKE(col_haystack, '%HelloWorld') FROM T; The short answer is: no. Runtime is (unsurprisingly) dominated by pattern evaluation + other stuff going on in queries, but definitely not pattern compilation. For space reasons, I omit details of the local experiments. (Side note: the current caching scheme is unbounded in size which poses a DoS risk (think of multi-tenancy). This risk is more pronounced when unbounded caching is used with non-const patterns ..., see next question) Question 2: ----------------------------------------------------------- Is pattern caching useful for queries with non-const LIKE/REGEX patterns? E.g. SELECT LIKE(col_haystack, col_needle) FROM T; I benchmarked five caching strategies: 1. no caching as a baseline (= recompile for each row) 2. unbounded cache (= threadsafe global hash-map) 3. LRU cache (= threadsafe global hash-map + LRU queue) 4. lightweight local cache 1 (= not threadsafe local hashmap with collision list which grows to a certain size (here: 10 elements) and afterwards never changes) 5. lightweight local cache 2 (not threadsafe local hashmap without collision list in which a collision replaces the stored element, idea by Alexey) ... using a haystack of 2 mio strings and A). 2 mio distinct simple patterns B). 10 simple patterns C) 2 mio distinct complex patterns D) 10 complex patterns Fo A) and C), caching does not help but these queries still allow to judge the static overhead of caching on query runtimes. B) and D) are extreme but common cases in practice. They include queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' : '%pattern2%'). Caching should help significantly. Because LIKE patterns are internally translated to re2 expressions, I show only measurements for MATCH queries. Results in sec, averaged over on multiple measurements; 1.A): 2.12 B): 1.68 C): 9.75 D): 9.45 2.A): 2.17 B): 1.73 C): 9.78 D): 9.47 3.A): 9.8 B): 0.63 C): 31.8 D): 0.98 4.A): 2.14 B): 0.29 C): 9.82 D): 0.41 5.A) 2.12 / 2.15 / 2.26 B) 1.51 / 0.43 / 0.30 C) 9.97 / 9.88 / 10.13 D) 5.70 / 0.42 / 0.43 (10/100/1000 buckets, resp. 10/1/0.1% collision rate) Evaluation: 1. This is the baseline. It was surprised that complex patterns (C, D) slow down the queries so badly compared to simple patterns (A, B). The runtime includes evaluation costs, but as caching only helps with compilation, and looking at 4.D and 5.D, compilation makes up over 90% of the runtime! 2. No speedup compared to 1, probably due to locking overhead. The cache is unbounded, and in experiments with data sets > 2 mio rows, 2. is the only scheme to throw OOM exceptions which is not acceptable. 3. Unique patterns (A and C) lead to thrashing of the LRU cache and very bad runtimes due to LRU queue maintenance and locking. Works pretty well however with few distinct patterns (B and D). 4. This scheme is tailored to queries B and D where it performs pretty good. More importantly, the caching is lightweight enough to not deteriorate performance on datasets A and C. 5. After some tuning of the hash map size, 100 buckets seem optimal to be in the same ballpark with 10 distinct patterns as 4. Performance also does not deteriorate on A and C compared to the baseline. Unlike 4., this scheme behaves LRU-like and can adjust to changing pattern distributions. As a conclusion, this commit implementes two things: 1. Based on Q1, pattern search with const needle no longer uses caching. This applies to LIKE and MATCH + a few (exotic) other SQL functions. The code for the unbounded caching was removed. 2. Based on Q2, pattern search with non-const needles now use method 5.	2022-05-30 20:00:35 +02:00
Robert Schulze	49934a3dc8	Cache compiled regexps when evaluating non-const needles Needles in a (non-const) needle column may repeat and this commit allows to skip compilation for known needles. Out of the different design alternatives (see below, if someone is interested), we now maintain - one global pattern cache, - with a fixed size of 42k elements currently, - and use LRU as eviction strategy. ------------------------------------------------------------------------ (sorry for the wall of text, dumping it here not for reading but just for reference) Write-up about considered design alternatives: 1. Keep the current global cache of const needles. For non-const needles, probe the cache but don't store values in it. Pros: need to maintain just a single cache, no problem with cache pollution assuming there are few distinct constant needles Cons: only useful if a non-const needle occurred as already as a const needle --> overall too simplistic 2. Keep the current global cache for const needles. For non-const needles, create a local (e.g. per-query) cache Pros: unlike (1.), non-const needles can be skipped even if they did not occur yet, no pollution of the const pattern cache when there are very many non-const needles (e.g. large / highly distinct needle columns). Cons: caches may explode "horizontally", i.e. we'll end up with the const cache + caches for Q1, Q2, ... QN, this makes it harder to control the overall space consumption, also patterns residing in different caches cannot be reused between queries, another difficulty is that the concept of "query" does not really exist at matching level - there are only column chunks and we'd potentially end up with 1 cache / chunk 3. Queries with const and non-const needles insert into the same global cache. Pros: the advantages of (2.) + allows to reuse compiled patterns accross parallel queries Cons: needs an eviction strategy to control cache size and pollution (and btw. (2.) also needs eviction strategies for the individual caches) 4. Queries with const needle use global cache, queries with non-const needle use a different global cache --> Overall similar to (3) but ignores the (likely) edge case that const and non-const needles overlap. In sum, (3.) seems the simplest and most beneficial approach. Eviction strategies: 0. Don't ever evict --> cache may grow infinitely and eventually make the system unusable (may even pose a DoS risk) 1. Flush the cache after a certain threshold is exceeded --> very simple but may lead to peridic performance drops 2. Use LRU --> more graceful performance degradation at threshold but comes with a (constant) performance overhead to maintain the LRU queue In sum, given that the pattern compilation in RE2 should be quite costly (pattern-to-DFA/NFA), LRU may be acceptable.	2022-05-25 22:04:06 +02:00
Robert Schulze	ea60a614d2	Decrease namespace indent	2022-05-25 21:56:35 +02:00
Robert Schulze	e25ca139cd	Implement SQL functions (NOT) (I)LIKE() + MATCH() with non-const needles With this commit, SQL functions LIKE and MATCH and their variants can work with non-const needle arguments. E.g. create table tab (id UInt32, haystack String, needle String) engine = MergeTree() order by id; insert into tab values (1, 'Hello', '%ell%') (2, 'World', '%orl%') select id, haystack, needle, like(haystack, needle) from tab; For that, methods vectorVector() and vectorFixedVector() were added to MatchImpl. The existing code for const needles has an optimization where the compiled regexp is cached. The new code expects a different needle per row and consequently does not cache the regexp.	2022-05-23 09:41:28 +02:00
Robert Schulze	0299cc87e4	Improve naming consistency of string search code Just renamings, nothing major ...	2022-05-22 17:50:38 +02:00
Robert Schulze	7232f47c68	Fix Bug 37114 - ilike on FixedString(N) columns produces wrong results The main fix is in MatchImpl.h where the "case_insensitive" parameter is added to Regexps::get(). Also made "case_insensitive" a non-default template parameter to reduce the risk of future bugs. The remainder of this commit are minor random code improvements. resoves #37114	2022-05-11 14:30:21 +02:00
Azat Khuzhin	66a210410f	Fix build w/o hyperscan	2022-01-20 10:02:02 +03:00
Alexey Milovidov	8b4a6a2416	Remove cruft	2021-10-28 02:10:39 +03:00
Alexey Milovidov	fe6b7c77c7	Rename "common" to "base"	2021-10-02 10:13:14 +03:00
Yatsishin Ilya	17bb938541	fix style	2021-08-23 13:59:01 +03:00
Amos Bird	45a86b1aa4	Remove silly and wrong optimization	2021-08-05 17:01:15 +08:00
Amos Bird	a1dd7e5c97	Split mutex into individual regexp construction.	2021-08-05 13:59:47 +08:00
Alexey Milovidov	37f88a1468	Whitespace	2021-01-31 12:02:54 +03:00
alesapin	8c0581c503	Fix ilike cache	2020-10-02 17:27:47 +03:00
myrrc	8c3417fbf7	ILIKE operator (#12125 ) * Integrated CachingAllocator into MarkCache * fixed build errors * reset func hotfix * upd: Fixing build * updated submodules links * fix 2 * updating grabber allocator proto * updating lost work * updating CMake to use concepts * some other changes to get it building (integration into MarkCache) * further integration into caches * updated Async metrics, fixed some build errors * and some other errors revealing * added perfect forwarding to some functions * fix: forward template * fix: constexpr modifier * fix: FakePODAllocator missing member func * updated PODArray constructor taking alloc params * fix: PODArray overload with n restored * fix: FakePODAlloc duplicating alloc() func * added constexpr variable for alloc_tag_t * split cache values by allocators, provided updates * fix: memcpy * fix: constexpr modifier * fix: noexcept modifier * fix: alloc_tag_t for PODArray constructor * fix: PODArray copy ctor with different alloc * fix: resize() signature * updating to lastest working master * syncing with 273267 * first draft version * fix: update Searcher to case-insensitive * added ILIKE test * fixed style errors, updated test, split like and ilike, added notILike * replaced inconsistent comments * fixed show tables ilike * updated missing test cases * regenerated ya.make * Update 01355_ilike.sql Co-authored-by: myrrc <me-clickhouse@myrrec.space> Co-authored-by: alexey-milovidov <milovidov@yandex-team.ru>	2020-07-05 18:57:59 +03:00
Alexey Milovidov	1462a66d1e	Fix typos	2020-06-27 22:05:00 +03:00
Alexey Milovidov	875369676f	Added a comment #11949	2020-06-25 22:46:43 +03:00
Ivan Lezhankin	e230632645	Changes required for auto-sync with Arcadia	2020-04-16 15:31:57 +03:00
Ivan Lezhankin	06446b4f08	dbms/ → src/	2020-04-03 18:14:31 +03:00

25 Commits