ClickHouse

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-12-11 08:52:06 +00:00

Author	SHA1	Message	Date
taiyang-li	c6449a1b23	fix bug	2022-11-10 10:46:51 +08:00
taiyang-li	0df7e95845	improve doc and uts	2022-11-03 16:12:19 +08:00
taiyang-li	a51a1b4394	rename alphaTokens to SplitByAlphaImpl	2022-11-03 15:06:58 +08:00
taiyang-li	5fa0968bd5	reset to original solution	2022-11-03 15:05:23 +08:00
taiyang-li	05b77b3e34	fix comment	2022-10-26 16:25:25 +08:00
taiyang-li	fcbc217a7d	enable limits for functions using FunctionTokens	2022-10-26 16:18:32 +08:00
Mikhail Filimonov	33ee858d18	Fix bug with maxsplit in the splitByChar	2022-07-25 13:11:02 +02:00
Robert Schulze	ad12adc31c	Measure and rework internal re2 caching

This commit is based on local benchmarks of ClickHouse's re2 caching.

Question 1:
-----------------------------------------------------------
Is pattern caching useful for queries with const LIKE/REGEX patterns? E.g.

SELECT LIKE(col_haystack, '%HelloWorld') FROM T;

The short answer is: no. Runtime is (unsurprisingly) dominated by pattern evaluation + other stuff going on in queries, but definitely not pattern compilation. For space reasons, I omit details of the local experiments.

(Side note: the current caching scheme is unbounded in size, which poses a DoS risk (think of multi-tenancy). This risk is more pronounced when unbounded caching is used with non-const patterns ..., see next question)

Question 2:
-----------------------------------------------------------
Is pattern caching useful for queries with non-const LIKE/REGEX patterns? E.g.

SELECT LIKE(col_haystack, col_needle) FROM T;

I benchmarked five caching strategies:
1. no caching as a baseline (= recompile for each row)
2. unbounded cache (= threadsafe global hash-map)
3. LRU cache (= threadsafe global hash-map + LRU queue)
4. lightweight local cache 1 (= not threadsafe local hashmap with collision list which grows to a certain size (here: 10 elements) and afterwards never changes)
5. lightweight local cache 2 (not threadsafe local hashmap without collision list in which a collision replaces the stored element, idea by Alexey)

... using a haystack of 2 million strings and
A) 2 million distinct simple patterns
B) 10 simple patterns
C) 2 million distinct complex patterns
D) 10 complex patterns

For A) and C), caching does not help, but these queries still allow judging the static overhead of caching on query runtimes.

B) and D) are extreme but common cases in practice. They include queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' : '%pattern2%')". Caching should help significantly.

Because LIKE patterns are internally translated to re2 expressions, I show only measurements for MATCH queries.

Results in sec, averaged over multiple measurements:

Strategy   A)                  B)                  C)                   D)
1.         2.12                1.68                9.75                 9.45
2.         2.17                1.73                9.78                 9.47
3.         9.8                 0.63                31.8                 0.98
4.         2.14                0.29                9.82                 0.41
5.         2.12 / 2.15 / 2.26  1.51 / 0.43 / 0.30  9.97 / 9.88 / 10.13  5.70 / 0.42 / 0.43

(Strategy 5: 10/100/1000 buckets, resp. 10/1/0.1% collision rate)

Evaluation:
1. This is the baseline. I was surprised that complex patterns (C, D) slow down the queries so badly compared to simple patterns (A, B). The runtime includes evaluation costs, but as caching only helps with compilation, and looking at 4.D and 5.D, compilation makes up over 90% of the runtime!
2. No speedup compared to 1, probably due to locking overhead. The cache is unbounded, and in experiments with data sets > 2 million rows, 2. is the only scheme to throw OOM exceptions, which is not acceptable.
3. Unique patterns (A and C) lead to thrashing of the LRU cache and very bad runtimes due to LRU queue maintenance and locking. Works pretty well, however, with few distinct patterns (B and D).
4. This scheme is tailored to queries B and D, where it performs pretty well. More importantly, the caching is lightweight enough to not deteriorate performance on datasets A and C.
5. After some tuning of the hash map size, 100 buckets seem optimal to be in the same ballpark with 10 distinct patterns as 4. Performance also does not deteriorate on A and C compared to the baseline. Unlike 4., this scheme behaves LRU-like and can adjust to changing pattern distributions.

As a conclusion, this commit implements two things:
1. Based on Q1, pattern search with const needle no longer uses caching. This applies to LIKE and MATCH + a few (exotic) other SQL functions. The code for the unbounded caching was removed.
2. Based on Q2, pattern search with non-const needles now uses method 5.	2022-05-30 20:00:35 +02:00
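Caching strategy 5 from the commit message above (a local hash map without a collision list, in which a collision replaces the stored element) can be sketched roughly as follows. This is a minimal illustration in Python, not ClickHouse's C++ implementation: the class name `DirectMappedRegexCache`, the `misses` counter and the helper `match_rows` are hypothetical, and Python's `re` module stands in for re2.

```python
import re


class DirectMappedRegexCache:
    """Hypothetical sketch: a fixed array of buckets, one entry per bucket.

    Not thread-safe, mirroring the "local" cache described in the commit.
    """

    def __init__(self, num_buckets=100):
        # Each bucket is either None or a (pattern, compiled_regex) pair.
        self.buckets = [None] * num_buckets
        # Number of lookups that had to call re.compile (cache misses).
        self.misses = 0

    def get(self, pattern):
        index = hash(pattern) % len(self.buckets)
        entry = self.buckets[index]
        if entry is not None and entry[0] == pattern:
            return entry[1]  # hit: reuse the compiled pattern
        # Miss or collision: compile and overwrite whatever was stored here.
        compiled = re.compile(pattern)
        self.misses += 1
        self.buckets[index] = (pattern, compiled)
        return compiled


def match_rows(haystacks, needles, cache):
    """Evaluate one (haystack, needle) pair per row, like a non-const needle column."""
    return [cache.get(n).search(h) is not None for h, n in zip(haystacks, needles)]


if __name__ == "__main__":
    cache = DirectMappedRegexCache(num_buckets=100)
    rows = ["HelloWorld", "GoodbyeWorld"] * 500
    needles = ["^Hello", "World$"] * 500
    results = match_rows(rows, needles, cache)
    print(sum(results), "matching rows,", cache.misses, "compilations")
```

Because a colliding pattern simply overwrites the bucket, recently used patterns tend to stay cached without any queue maintenance, which is presumably why the commit describes the scheme as "LRU-like".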
Robert Schulze	49934a3dc8	Cache compiled regexps when evaluating non-const needles

Needles in a (non-const) needle column may repeat, and this commit allows skipping compilation for known needles.

Out of the different design alternatives (see below, if someone is interested), we now maintain
- one global pattern cache,
- with a fixed size of 42k elements currently,
- and use LRU as eviction strategy.

------------------------------------------------------------------------

(sorry for the wall of text, dumping it here not for reading but just for reference)

Write-up about considered design alternatives:

1. Keep the current global cache of const needles. For non-const needles, probe the cache but don't store values in it.
   Pros: need to maintain just a single cache, no problem with cache pollution assuming there are few distinct constant needles
   Cons: only useful if a non-const needle already occurred as a const needle
   --> overall too simplistic

2. Keep the current global cache for const needles. For non-const needles, create a local (e.g. per-query) cache
   Pros: unlike (1.), non-const needles can be skipped even if they did not occur yet, no pollution of the const pattern cache when there are very many non-const needles (e.g. large / highly distinct needle columns).
   Cons: caches may explode "horizontally", i.e. we'll end up with the const cache + caches for Q1, Q2, ... QN, this makes it harder to control the overall space consumption, also patterns residing in different caches cannot be reused between queries, another difficulty is that the concept of "query" does not really exist at matching level - there are only column chunks and we'd potentially end up with 1 cache / chunk

3. Queries with const and non-const needles insert into the same global cache.
   Pros: the advantages of (2.) + allows reusing compiled patterns across parallel queries
   Cons: needs an eviction strategy to control cache size and pollution (and btw. (2.) also needs eviction strategies for the individual caches)

4. Queries with const needle use global cache, queries with non-const needle use a different global cache
   --> Overall similar to (3) but ignores the (likely) edge case that const and non-const needles overlap.

In sum, (3.) seems the simplest and most beneficial approach.

Eviction strategies:

0. Don't ever evict --> cache may grow infinitely and eventually make the system unusable (may even pose a DoS risk)
1. Flush the cache after a certain threshold is exceeded --> very simple but may lead to periodic performance drops
2. Use LRU --> more graceful performance degradation at threshold but comes with a (constant) performance overhead to maintain the LRU queue

In sum, given that the pattern compilation in RE2 should be quite costly (pattern-to-DFA/NFA), LRU may be acceptable.	2022-05-25 22:04:06 +02:00
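The design chosen in the commit message above (one global pattern cache of fixed size with LRU eviction) can be sketched as follows. This is only an illustration under stated assumptions: the class name `LRURegexCache` and its `misses` counter are hypothetical, Python's `re` module stands in for re2, and the lock reflects the assumption that a global cache is shared between threads.

```python
import re
import threading
from collections import OrderedDict


class LRURegexCache:
    """Hypothetical sketch of a bounded, thread-safe pattern cache with LRU eviction."""

    def __init__(self, max_size):
        self.max_size = max_size
        # Maps pattern -> compiled regex; insertion order doubles as the LRU queue
        # (least recently used entry first).
        self._entries = OrderedDict()
        self._lock = threading.Lock()
        # Number of lookups that had to call re.compile (cache misses).
        self.misses = 0

    def get(self, pattern):
        with self._lock:
            compiled = self._entries.get(pattern)
            if compiled is not None:
                # Hit: mark the pattern as most recently used.
                self._entries.move_to_end(pattern)
                return compiled
            compiled = re.compile(pattern)
            self.misses += 1
            self._entries[pattern] = compiled
            if len(self._entries) > self.max_size:
                # Evict the least recently used pattern.
                self._entries.popitem(last=False)
            return compiled

    def cached_patterns(self):
        """Return cached patterns from least to most recently used."""
        with self._lock:
            return list(self._entries.keys())


if __name__ == "__main__":
    cache = LRURegexCache(max_size=2)
    for needle in ["a+", "b+", "a+", "c+"]:
        cache.get(needle)
    # "b+" was the least recently used entry when "c+" was inserted.
    print(cache.cached_patterns(), cache.misses)
```

The lock and the reordering on every hit are the constant overhead the write-up attributes to maintaining the LRU queue.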
Robert Schulze	7232f47c68	Fix Bug 37114 - ilike on FixedString(N) columns produces wrong results. The main fix is in MatchImpl.h, where the "case_insensitive" parameter is added to Regexps::get(). Also made "case_insensitive" a non-default template parameter to reduce the risk of future bugs. The remainder of this commit consists of minor, unrelated code improvements. resolves #37114	2022-05-11 14:30:21 +02:00
Antonio Andelic	9990abb76a	Use compile-time check for Exception messages, fix wrong messages	2022-03-29 13:16:11 +00:00
Maksim Kita	538f8cbaad	Fix clang-tidy warnings in Disks, Formats, Functions folders	2022-03-14 18:17:35 +00:00
taiyang-li	b9c42effb4	change as requested	2022-02-05 19:30:40 +08:00
taiyang-li	e9c435a23f	fix style	2022-01-30 13:23:11 +08:00
taiyang-li	c9d5251e12	finish dev	2022-01-30 09:10:27 +08:00
taiyang-li	5228a3e421	commit again	2022-01-29 23:42:04 +08:00
Dmitry Novik	64b0365848	Fix build	2021-12-15 20:40:36 +03:00
Nickita Taranov	734bb5b026	support any serializable column	2021-10-28 23:17:45 +03:00
Nickita Taranov	877e8b579b	fix style	2021-10-24 14:45:12 +03:00
Nickita Taranov	2d23e3b17d	support for ColumnString	2021-10-22 18:53:02 +03:00
Nickita Taranov	a211c9ecbc	support nullable for ColumnConst	2021-10-22 18:52:59 +03:00
Pavel Kruglov	70b51133c1	Try to simplify code	2021-08-09 18:01:08 +03:00
Pavel Kruglov	0662df8b76	Fix performance with JIT, add arguments to function isSuitableForShortCircuitArgumentsExecution	2021-08-09 17:54:14 +03:00
Pavel Kruglov	e792fa588f	Mark all Functions as suitable or not for executing as short circuit arguments	2021-08-09 17:50:09 +03:00
Nikolay Degterinsky	491ddc859d	Fixed spelling and code style, added generated ya.make files	2021-06-19 18:52:09 +00:00
Nikolay Degterinsky	4fb23c25fb	Added SplitByWhitespace & SplitByNonAlpha functions (new tokenize functions)	2021-06-19 12:33:36 +00:00
Nikolai Kochetov	dbaa6ffc62	Rename ContextConstPtr to ContextPtr.	2021-06-01 15:20:52 +03:00
Alexander Kuzmenkov	3f57fc085b	remove mutable context references from functions interface. Also remove it from some visitors.	2021-05-28 19:45:37 +03:00
Maksim Kita	d923d9e6ef	Function move file	2021-05-17 10:30:42 +03:00
Maksim Kita	947f28d430	IFunction refactoring	2021-05-15 20:33:15 +03:00
Vladimir	454b77c654	Update SplitByRegexpImpl	2021-05-13 13:27:29 +03:00
abel-wang	51c1d7c7ba	split into characters when split by '' & add docs	2021-05-13 11:15:38 +08:00
abel-wang	99b9fe6c33	add function splitByRegexp	2021-05-13 10:37:09 +08:00
Ivan	495c6e03aa	Replace all Context references with std::weak_ptr (#22297) * Replace all Context references with std::weak_ptr * Fix shared context captured by value * Fix build * Fix Context with named sessions * Fix copy context * Fix gcc build * Merge with master and fix build * Fix gcc-9 build	2021-04-11 02:33:54 +03:00
Ivan Lezhankin	f897f7c93f	Refactor IFunction to execute with const arguments	2020-11-17 16:24:45 +03:00
Nikolai Kochetov	740fad66f3	Part 6.	2020-10-18 22:00:13 +03:00
Nikolai Kochetov	959424f28a	Rename block to columns.	2020-10-14 17:04:50 +03:00
Nikolai Kochetov	966b1d6cf5	Rename Block to ColumnsWithTypeAndName.	2020-10-14 16:09:11 +03:00
Nikolai Kochetov	3a17c2a7ac	Rename FunctionArguments to ColumnsWithTypeAndName	2020-10-11 22:20:20 +03:00
Nikolai Kochetov	d28325a353	Replace getByPosition to []	2020-10-10 21:24:57 +03:00
Nikolai Kochetov	a7fb2e38a5	Use ColumnWithTypeAndName as function argument instead of Block.	2020-10-09 10:41:28 +03:00
Nikolai Kochetov	e4689ce302	Make IFunction::executeImpl const.	2020-07-21 16:58:07 +03:00
Ivan Lezhankin	06446b4f08	dbms/ → src/	2020-04-03 18:14:31 +03:00

43 Commits