ClickHouse

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-11-25 17:12:03 +00:00

Author	SHA1	Message	Date
Robert Schulze	81318e07d6	Try to fix performance test results	2022-06-01 11:53:37 +02:00
Robert Schulze	600512cc08	Replace exceptions thrown for programming errors by asserts	2022-06-01 11:53:37 +02:00
Robert Schulze	ad12adc31c	Measure and rework internal re2 caching
This commit is based on local benchmarks of ClickHouse's re2 caching.
Question 1:
-----------------------------------------------------------
Is pattern caching useful for queries with const LIKE/REGEX patterns? E.g. SELECT LIKE(col_haystack, '%HelloWorld') FROM T;
The short answer is: no. Runtime is (unsurprisingly) dominated by pattern evaluation + other stuff going on in queries, but definitely not pattern compilation. For space reasons, I omit details of the local experiments.
(Side note: the current caching scheme is unbounded in size, which poses a DoS risk (think of multi-tenancy). This risk is more pronounced when unbounded caching is used with non-const patterns, see next question.)
Question 2:
-----------------------------------------------------------
Is pattern caching useful for queries with non-const LIKE/REGEX patterns? E.g. SELECT LIKE(col_haystack, col_needle) FROM T;
I benchmarked five caching strategies:
1. no caching as a baseline (= recompile for each row)
2. unbounded cache (= threadsafe global hash-map)
3. LRU cache (= threadsafe global hash-map + LRU queue)
4. lightweight local cache 1 (= not threadsafe local hashmap with collision list which grows to a certain size (here: 10 elements) and afterwards never changes)
5. lightweight local cache 2 (= not threadsafe local hashmap without collision list in which a collision replaces the stored element, idea by Alexey)
... using a haystack of 2 million strings and
A) 2 million distinct simple patterns
B) 10 simple patterns
C) 2 million distinct complex patterns
D) 10 complex patterns
For A) and C), caching does not help, but these queries still allow judging the static overhead of caching on query runtimes. B) and D) are extreme but common cases in practice. They include queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' : '%pattern2%')". Caching should help significantly.
Because LIKE patterns are internally translated to re2 expressions, I show only measurements for MATCH queries.
Results in sec, averaged over multiple measurements:
1. A): 2.12 B): 1.68 C): 9.75 D): 9.45
2. A): 2.17 B): 1.73 C): 9.78 D): 9.47
3. A): 9.8 B): 0.63 C): 31.8 D): 0.98
4. A): 2.14 B): 0.29 C): 9.82 D): 0.41
5. A): 2.12 / 2.15 / 2.26 B): 1.51 / 0.43 / 0.30 C): 9.97 / 9.88 / 10.13 D): 5.70 / 0.42 / 0.43 (10/100/1000 buckets, resp. 10/1/0.1% collision rate)
Evaluation:
1. This is the baseline. I was surprised that complex patterns (C, D) slow down the queries so badly compared to simple patterns (A, B). The runtime includes evaluation costs, but as caching only helps with compilation, and looking at 4.D and 5.D, compilation makes up over 90% of the runtime!
2. No speedup compared to 1, probably due to locking overhead. The cache is unbounded, and in experiments with data sets > 2 million rows, 2. is the only scheme to throw OOM exceptions, which is not acceptable.
3. Unique patterns (A and C) lead to thrashing of the LRU cache and very bad runtimes due to LRU queue maintenance and locking. Works pretty well, however, with few distinct patterns (B and D).
4. This scheme is tailored to queries B and D, where it performs pretty well. More importantly, the caching is lightweight enough to not deteriorate performance on datasets A and C.
5. After some tuning of the hash map size, 100 buckets seem optimal to be in the same ballpark with 10 distinct patterns as 4. Performance also does not deteriorate on A and C compared to the baseline. Unlike 4., this scheme behaves LRU-like and can adjust to changing pattern distributions.
As a conclusion, this commit implements two things:
1. Based on Q1, pattern search with const needle no longer uses caching. This applies to LIKE and MATCH + a few (exotic) other SQL functions. The code for the unbounded caching was removed.
2. Based on Q2, pattern search with non-const needles now uses method 5.	2022-05-30 20:00:35 +02:00
Robert Schulze	ff228d63e8	Fix typo	2022-05-27 10:14:13 +02:00
Robert Schulze	80061aa3e2	Merge remote-tracking branch 'origin/master' into cached_patterns	2022-05-27 09:21:01 +02:00
Dmitry Novik	96ca23acd5	Merge pull request #37582 from ClickHouse/revert-34775-union-type-cast Revert "RFC: Fix converting types for UNION queries (may produce LOGICAL_ERROR)"	2022-05-27 04:09:20 +02:00
Dmitry Novik	3a9239b79f	Revert "RFC: Fix converting types for UNION queries (may produce LOGICAL_ERROR)"	2022-05-27 04:05:32 +02:00
Alexey Milovidov	e1a76e51fb	Merge pull request #37575 from ClickHouse/security-generator Add security generator	2022-05-27 02:22:33 +03:00
Alexey Milovidov	ce777ac1b5	Merge pull request #37579 from ClickHouse/remove-useless-files-3 Remove useless files	2022-05-27 02:21:51 +03:00
Alexey Milovidov	48ec7ceddb	Remove useless files	2022-05-27 00:39:16 +02:00
Alexey Milovidov	8ba865bb60	Merge pull request #37344 from excitoon-favorites/fixs3colonandequalssign Fixed error with symbols in key name in S3	2022-05-27 00:58:35 +03:00
Alexey Milovidov	86afa3a245	Merge pull request #37502 from ClickHouse/array_norm_dist_fixes Renamed arrayXXNorm/arrayXXDistance functions to XXNorm/XXDistance and fixed some overflow cases	2022-05-27 00:56:29 +03:00
Alexey Milovidov	aeacfa0d7e	Readability	2022-05-26 22:23:37 +02:00
Alexey Milovidov	434d8729de	Readability	2022-05-26 22:22:14 +02:00
Alexey Milovidov	359e36f421	Readability	2022-05-26 22:21:49 +02:00
Alexey Milovidov	3074be8d17	Add security generator	2022-05-26 22:19:15 +02:00
Alexander Gololobov	e655863d53	Merge pull request #37528 from kitaisreal/normalize-utf8-performance-tests-fix Functions normalizeUTF8 unstable performance tests fix	2022-05-26 20:49:10 +02:00
Alexander Tokmakov	eb71dd4c78	Merge pull request #37547 from ClickHouse/followup_37398 Follow-up to #37398	2022-05-26 20:29:41 +03:00
Dmitry Novik	673a521d0b	Merge pull request #34775 from azat/union-type-cast RFC: Fix converting types for UNION queries (may produce LOGICAL_ERROR)	2022-05-26 17:28:23 +02:00
Alexander Tokmakov	e8f33fb0d9	fix flaky tests	2022-05-26 14:17:05 +02:00
Azat Khuzhin	dc9ca3d70c	Fix LOGICAL_ERROR in getMaxSourcePartsSizeForMerge during merges (#37413)	2022-05-26 14:14:58 +02:00
Nikolai Kochetov	fea2401f1f	Merge pull request #37532 from ClickHouse/add-separate-mutex-for-factories-info Use a separate mutex for query_factories_info in Context.	2022-05-26 13:03:28 +02:00
mergify[bot]	a7629f900f	Merge branch 'master' into normalize-utf8-performance-tests-fix	2022-05-26 10:29:55 +00:00
Maksim Kita	3a92e61827	Merge pull request #37148 from kitaisreal/dictionary-get-descendants-performance-improvement Dictionary getDescendants performance improvement	2022-05-26 12:29:17 +02:00
Antonio Andelic	fe236c98d5	Merge pull request #37534 from ClickHouse/revert-37036-keeper-preprocess-operations Revert "Add support for preprocessing ZooKeeper operations in `clickhouse-keeper`"	2022-05-26 08:14:46 +02:00
Sergei Trifonov	eedddf86fd	Merge pull request #37552 from ClickHouse/serxa-patch-1 fix root CMakeLists.txt search	2022-05-26 04:41:07 +02:00
Sergei Trifonov	417296481e	fix root CMakeLists.txt search	2022-05-26 04:39:02 +02:00
Dmitry Novik	5c3c994d2a	Merge pull request #37493 from ClickHouse/grouping-sets-optimization-fix Fix ORDER BY optimization in case of GROUPING SETS	2022-05-26 02:25:02 +02:00
Alexey Milovidov	f321925032	Merge pull request #36341 from ClickHouse/allow-setuid-inside-clickhouse Allow to drop privileges at startup	2022-05-26 01:07:04 +03:00
Maksim Kita	58cd1bd3ec	Merge pull request #36843 from bharatnc/ncb/h3-unidirectionaledges-funcs add h3 unidirectional edge functions	2022-05-25 22:46:40 +02:00
Maksim Kita	bee3c30f66	Merge pull request #37524 from kitaisreal/geo-distance-functions-improve-performance Geo distance functions improve performance	2022-05-25 22:40:40 +02:00
Maksim Kita	b12b363158	Fixed build of hierarchical index for HashedArrayDictionary	2022-05-25 22:40:19 +02:00
Alexander Gololobov	168b47d0ad	Use same norm and distance function names for tuples and arrays	2022-05-25 22:39:59 +02:00
Alexander Gololobov	b065839f44	always return Float64	2022-05-25 22:27:00 +02:00
Alexander Gololobov	5df14cd956	Cast arguments to result type to avoid int overflow	2022-05-25 22:27:00 +02:00
alesapin	bf0da38d6f	Merge pull request #37402 from DanRoscigno/origin/67-replace-zookeeper-to-clickhouse-keeper-in-docs-and-tutorials add ClickHouse Keeper to doc pages describing ZooKeeper use	2022-05-25 22:24:56 +02:00
Robert Schulze	7543841438	Merge pull request #37518 from ClickHouse/bump-cctz-to-2022-05-15 Bump cctz to 2022-05-15	2022-05-25 22:14:41 +02:00
Alexander Tokmakov	6ca6b267fa	Merge pull request #37545 from ClickHouse/revert-37424-fix_fetching_part_deadlock Revert "(only with zero-copy replication, non-production experimental feature not recommended to use) fix possible deadlock during fetching part"	2022-05-25 23:11:16 +03:00
Alexander Tokmakov	47820c216d	Revert "(only with zero-copy replication, non-production experimental feature not recommended to use) fix possible deadlock during fetching part"	2022-05-25 23:10:33 +03:00
Robert Schulze	49934a3dc8	Cache compiled regexps when evaluating non-const needles
Needles in a (non-const) needle column may repeat, and this commit allows skipping compilation for known needles. Out of the different design alternatives (see below, if someone is interested), we now maintain
- one global pattern cache,
- with a fixed size of 42k elements currently,
- and use LRU as eviction strategy.
------------------------------------------------------------------------
(sorry for the wall of text, dumping it here not for reading but just for reference)
Write-up about considered design alternatives:
1. Keep the current global cache of const needles. For non-const needles, probe the cache but don't store values in it.
Pros: need to maintain just a single cache, no problem with cache pollution assuming there are few distinct constant needles
Cons: only useful if a non-const needle already occurred as a const needle
--> overall too simplistic
2. Keep the current global cache for const needles. For non-const needles, create a local (e.g. per-query) cache
Pros: unlike (1.), non-const needles can be skipped even if they did not occur yet, no pollution of the const pattern cache when there are very many non-const needles (e.g. large / highly distinct needle columns).
Cons: caches may explode "horizontally", i.e. we'll end up with the const cache + caches for Q1, Q2, ... QN, this makes it harder to control the overall space consumption, also patterns residing in different caches cannot be reused between queries, another difficulty is that the concept of "query" does not really exist at matching level - there are only column chunks and we'd potentially end up with 1 cache / chunk
3. Queries with const and non-const needles insert into the same global cache.
Pros: the advantages of (2.) + allows reusing compiled patterns across parallel queries
Cons: needs an eviction strategy to control cache size and pollution (and btw. (2.) also needs eviction strategies for the individual caches)
4. Queries with const needle use global cache, queries with non-const needle use a different global cache
--> Overall similar to (3) but ignores the (likely) edge case that const and non-const needles overlap.
In sum, (3.) seems the simplest and most beneficial approach.
Eviction strategies:
0. Don't ever evict --> cache may grow infinitely and eventually make the system unusable (may even pose a DoS risk)
1. Flush the cache after a certain threshold is exceeded --> very simple but may lead to periodic performance drops
2. Use LRU --> more graceful performance degradation at threshold but comes with a (constant) performance overhead to maintain the LRU queue
In sum, given that the pattern compilation in RE2 should be quite costly (pattern-to-DFA/NFA), LRU may be acceptable.	2022-05-25 22:04:06 +02:00
Robert Schulze	ea60a614d2	Decrease namespace indent	2022-05-25 21:56:35 +02:00
Robert Schulze	23378ab67b	Merge pull request #37520 from ClickHouse/update-3rd-party-contribution-guide Update 3rd party contribution guide	2022-05-25 21:54:49 +02:00
Alexey Milovidov	abf2558fba	Merge pull request #37491 from ClickHouse/match_refactoring Refactorings of LIKE/MATCH code	2022-05-25 22:05:38 +03:00
Alexey Milovidov	4482da9eb6	Update greatCircleDistance.cpp	2022-05-25 21:59:31 +03:00
Alexey Milovidov	de90c6e6c0	Merge pull request #37533 from ClickHouse/fixes-architecture-doc Update architecture.md	2022-05-25 21:57:26 +03:00
Alexey Milovidov	5ecde38b40	Merge pull request #37541 from ClickHouse/blinkov-patch-23 Update SECURITY.md	2022-05-25 21:54:05 +03:00
alesapin	620ab399c9	Update docs/en/operations/clickhouse-keeper.md	2022-05-25 20:23:24 +02:00
alesapin	51868a9a4f	Merge pull request #37424 from metahys/fix_fetching_part_deadlock (only with zero-copy replication, non-production experimental feature not recommended to use) fix possible deadlock during fetching part	2022-05-25 20:15:41 +02:00
Azat Khuzhin	a813f5996e	Fix converting types for UNION queries (may produce LOGICAL_ERROR) CI found [1]: 2022.02.20 15:14:23.969247 [ 492 ] {} <Fatal> BaseDaemon: (version 22.3.1.1, build id: 6082C357CFA6FF99) (from thread 472) (query_id: a5187ff9-962a-4e7c-86f6-8d48850a47d6) (query: SELECT 0., round(avgWeighted(x, y)) FROM (SELECT toDate(toDate('214748364.8', '-922337203.6854775808', '-0.1', NULL) - NULL, 10.000100135803223, '-2147483647'), 255 AS x, -2147483647 AS y UNION ALL SELECT y, NULL AS x, 2147483646 AS y)) Received signal Aborted (6) [1]: https://s3.amazonaws.com/clickhouse-test-reports/0/26d0e5438c86e52a145aaaf4cb523c399989a878/fuzzer_astfuzzerdebug,actions//report.html The problem is that subqueries return different headers: - first query -- x, y - second query -- y, x v2: Make order of columns strict only for UNION https://s3.amazonaws.com/clickhouse-test-reports/34775/9cc8c01a463d18c471853568b2f0af659a4e643f/stateless_tests__address__actions__[2/2].html Fixes: 00597_push_down_predicate_long Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2022-05-25 20:31:47 +03:00
Nikolai Kochetov	ff98c24d44	Merge pull request #37048 from Avogar/fix-array-map-nothing Add default implementation for Nothing in functions	2022-05-25 19:10:40 +02:00
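
Commit ad12adc31c in the list above settles on "method 5" for non-const needles: a local, not threadsafe hash map without a collision list, in which a colliding pattern simply replaces the stored one. The following is only a rough sketch of that idea, not the code from the commit. The class and member names are hypothetical, std::regex stands in for re2 so that the example needs nothing beyond the standard library, and std::hash is an assumed hash function. The 100 buckets are the size the commit message reports as a good trade-off.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <memory>
#include <regex>
#include <string>

// Hypothetical sketch of a replace-on-collision pattern cache.
// Not threadsafe by design: one instance is meant to be used by one thread.
// std::regex is a stand-in for re2::RE2 (assumption for self-containment).
class LocalPatternCache
{
public:
    // Bucket count taken from the commit message (100 buckets).
    static constexpr std::size_t num_buckets = 100;

    // Returns the compiled pattern; compiles only on a miss or a collision.
    std::shared_ptr<const std::regex> getOrCompile(const std::string & pattern)
    {
        Bucket & bucket = buckets[std::hash<std::string>{}(pattern) % num_buckets];

        if (bucket.compiled && bucket.pattern == pattern)
        {
            ++hits;
            return bucket.compiled;
        }

        // Empty bucket or different pattern: overwrite whatever is stored.
        // There is no collision list and no eviction queue to maintain.
        ++misses;
        bucket.pattern = pattern;
        bucket.compiled = std::make_shared<std::regex>(pattern);
        return bucket.compiled;
    }

    // Counters exist only to make the behavior observable in this sketch.
    std::size_t hits = 0;
    std::size_t misses = 0;

private:
    struct Bucket
    {
        std::string pattern;
        std::shared_ptr<const std::regex> compiled;
    };

    std::array<Bucket, num_buckets> buckets;
};
```

Because a new pattern overwrites the old occupant of its bucket, recently seen patterns tend to stay cached, which is presumably what the commit message means by "behaves LRU-like". With many distinct patterns the cost per row is one hash and one string comparison on top of the compilation that would happen anyway.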

89683 Commits
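
For comparison, the earlier commit 49934a3dc8 describes one global pattern cache with a fixed size and LRU eviction, and ad12adc31c later measures the cost of the LRU queue maintenance and locking. The sketch below illustrates that general scheme (hash map + LRU queue behind a mutex) under stated assumptions. It is not the implementation from either commit: the class name, the interface, and the use of std::regex in place of re2 are all hypothetical.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of a threadsafe LRU pattern cache.
// std::regex is a stand-in for re2::RE2 (assumption for self-containment).
class LRUPatternCache
{
public:
    using CompiledPtr = std::shared_ptr<const std::regex>;

    explicit LRUPatternCache(std::size_t capacity_) : capacity(capacity_) {}

    CompiledPtr getOrCompile(const std::string & pattern)
    {
        // Every lookup takes the lock and touches the queue, even on a hit.
        std::lock_guard<std::mutex> lock(mutex);

        auto it = index.find(pattern);
        if (it != index.end())
        {
            // Hit: move the entry to the front (most recently used).
            queue.splice(queue.begin(), queue, it->second);
            return it->second->second;
        }

        // Miss: compile (under the lock, for simplicity) and insert at the front.
        CompiledPtr compiled = std::make_shared<std::regex>(pattern);
        queue.emplace_front(pattern, compiled);
        index[pattern] = queue.begin();

        // Evict the least recently used entry once the capacity is exceeded.
        if (queue.size() > capacity)
        {
            index.erase(queue.back().first);
            queue.pop_back();
        }
        return compiled;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return queue.size();
    }

private:
    using Entry = std::pair<std::string, CompiledPtr>;

    const std::size_t capacity;
    mutable std::mutex mutex;
    std::list<Entry> queue;                                          // front = most recent
    std::unordered_map<std::string, std::list<Entry>::iterator> index;
};
```

With mostly unique patterns, every call in this sketch misses, inserts, and evicts while holding the lock, which is consistent with the thrashing the benchmark reports for scheme 3 on datasets A and C.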