ClickHouse

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-12-12 17:32:32 +00:00

Author	SHA1	Message	Date
Alexander Tokmakov	d4784203b7	Revert "Fix toHour() monotonicity which can lead to incorrect query result (incorrect index analysis)"	2022-07-08 12:51:30 +03:00
Robert Schulze	524f39551c	Merge pull request #38485 from ClickHouse/multi-match-with-non_const-patterns Multi match with non const patterns	2022-07-08 09:29:10 +02:00
Roman Vasin	12f4a48957	Extend LUT range to 1900..2300	2022-07-08 06:48:05 +00:00
Alexey Milovidov	89fdcbf08c	Merge pull request #38675 from amosbird/index-fix-1 Fix toHour() monotonicity which can lead to incorrect query result (incorrect index analysis)	2022-07-08 00:32:27 +03:00
Robert Schulze	49348b833a	Simplify	2022-07-07 20:25:26 +00:00
Robert Schulze	1de5e9a7da	Avoid copying array elements	2022-07-07 12:33:34 +00:00
Nikolay Degterinsky	1869b7f408	Add functions translate & translateUTF8	2022-07-07 08:25:00 +00:00
Robert Schulze	dec184f61d	Add comment about const needle + const haystack	2022-07-06 17:41:15 +00:00
Robert Schulze	8a18705729	More variable renamings for more uniformity	2022-07-06 14:55:02 +00:00
Robert Schulze	0b0d64c5ca	Don't resize output vector in each loop iteration	2022-07-06 14:45:22 +00:00
Robert Schulze	144c1edb03	Inherit default implementation of getArgumentsThatAreAlwaysConstant()	2022-07-06 14:42:19 +00:00
mergify[bot]	8cc2c3914b	Merge branch 'master' into isnullable	2022-07-06 14:40:01 +00:00
Robert Schulze	6a907b23fb	Replace typeid_cast() with checkAndGetColumnConst() ... syntactic sugar	2022-07-06 14:38:28 +00:00
Robert Schulze	0c4da85e75	More uniform naming	2022-07-06 14:29:42 +00:00
mergify[bot]	e5535f5ab4	Merge branch 'master' into mapupdate_dev	2022-07-06 13:54:36 +00:00
Nikolai Kochetov	020c99a269	Merge pull request #38617 from azat/contrib-debug-symbols Add separate option to omit symbols from heavy contrib	2022-07-06 14:40:24 +02:00
Robert Schulze	d0b2f13f9d	Fix style check	2022-07-05 13:41:52 +02:00
lokax	e6bd0105b1	feat(Function): isNullable Signed-off-by: lokax <m632656684@gmail.com>	2022-07-05 15:51:53 +08:00
lokax	849c46e6fa	feat(Function): isNullable	2022-07-05 15:23:07 +08:00
mergify[bot]	f14b62b2d6	Merge branch 'master' into index-fix-1	2022-07-04 22:50:36 +00:00
Robert Schulze	1eed72b525	Make more multi-search methods work with non-const needles After making function multi[Fuzzy]Match(Any|AnyIndex|AllIndices)() work with non-const needles, 12 more functions started to fail in test "00233_position_function_family": multiSearchAny() multiSearchAnyCaseInsensitive() multiSearchAnyUTF8 multiSearchAnyCaseInsensitiveUTF8() multiSearchFirstPosition() multiSearchFirstPositionCaseInsensitive() multiSearchFirstPositionUTF8() multiSearchFirstPositionCaseInsensitiveUTF8() multiSearchFirstIndex() multiSearchFirstIndexCaseInsensitive() multiSearchFirstIndexUTF8() multiSearchFirstIndexCaseInsensitiveUTF8() Failing queries take the form select 0 = multiSearchAny('\0', CAST([], 'Array(String)'));	2022-07-04 14:00:21 +00:00
Robert Schulze	ece61f6da3	Fix davenger's review comments https://github.com/ClickHouse/ClickHouse/pull/38434#discussion_r907397214 https://github.com/ClickHouse/ClickHouse/pull/38434#discussion_r907385290 https://github.com/ClickHouse/ClickHouse/pull/38434#discussion_r907406097 (the latter is no longer relevant as the affected places were removed in the meantime)	2022-07-04 10:43:21 +00:00
Robert Schulze	d547aa7849	Allow non-const pattern array argument in multi[Fuzzy]Match*() Resolves #38046	2022-07-04 10:43:16 +00:00
Alexander Gololobov	8ce8158f7f	Do computations in Float32 (not Float64) for arrays of Float32	2022-07-03 10:33:11 +02:00
Alexander Gololobov	c6691cc5f2	Improved vectorized execution of main loop for array norm/distance	2022-07-02 22:45:22 +02:00
mergify[bot]	b016be264c	Merge branch 'master' into squared_l2	2022-07-02 09:17:28 +00:00
Azat Khuzhin	e8f5cd3c68	Add separate option to omit symbols from heavy contrib Sometimes it is useful to build contrib with debug symbols for further debugging. With everything turned ON (i.e. debug build) I got 3.3GB vs 3.0GB w/o this patch, 9% bloat, thoughts about this is this OK or not for you, if not STRIP_DEBUG_SYMBOLS_HEAVY_CONTRIB can be OFF by default (regardless of build type). P.S. aws debug symbols adds just 1.7%. v2: rename STRIP_HEAVY_DEBUG_SYMBOLS v3: OMIT_HEAVY_DEBUG_SYMBOLS v4: documentation had been removed Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2022-07-02 06:32:03 +03:00
Robert Schulze	2a1ede0f5a	Merge pull request #38589 from ClickHouse/fix-zero-bytes-in-haystack Fix countSubstrings() & position() on patterns with 0-bytes	2022-07-01 16:15:43 +02:00
Amos Bird	84a407f381	Fix toHour monotonicity	2022-07-01 18:24:24 +08:00
Alexander Gololobov	b2b31103c5	Reuse common code for L2Squared and L2	2022-06-30 14:12:25 +02:00
Azat Khuzhin	a47355877e	Add revision() function (#38555 ) It can be useful to match versions, since in some tables (system.trace_log) there is only revision column. P.S. came to this when was digging into stress reports from CI. P.P.S. case insensitive by analogy with version(). Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2022-06-30 12:58:26 +02:00
Robert Schulze	81bb2242fd	Fix countSubstrings() & position() on patterns with 0-bytes SQL functions countSubstrings(), countSubstringsCaseInsensitive(), countSubstringsUTF8(), position(), positionCaseInsensitive(), positionUTF8() with non-const pattern argument use fallback searchers LibCASCIICaseSensitiveStringSearcher and LibCASCIICaseInsensitiveStringSearcher which call ::strstr(), resp. ::strcasestr(). These functions assume that the haystack is 0-terminated and they even document that. However, the callers did not check if the haystack contains a 0-byte (perhaps because it's sort of expensive). As a consequence, if the haystack contained a zero byte in its payload, matches behind this zero byte were ignored. create table t (id UInt32, pattern String) engine = MergeTree() order by id; insert into t values (1, 'x'); select countSubstrings('aaaxxxaa\0xxx', pattern) from t; We returned 3 before this commit, now we return 6	2022-06-29 21:41:18 +00:00
hexiaoting	e32a0838d1	fix bug for mapUpdate	2022-06-29 15:52:08 +08:00
Julian Gilyadov	d0d72e0b5e	Implement L2Squared Distance and Norm	2022-06-28 15:28:58 -04:00
Miel Donkers	4e9f396a48	Small improvement of the error message to hint at possible issue (#38458 )	2022-06-28 13:36:30 +02:00
Robert Schulze	959cbaab02	Move loop over patterns into implementations - This is preparation for non-const regexp arguments, where this loop will run for each row.	2022-06-26 16:26:13 +00:00
Robert Schulze	cb5d1a4a85	Fix style check	2022-06-26 16:25:49 +00:00
Robert Schulze	b8f67185bf	Cosmetics: Whitespaces	2022-06-26 16:25:49 +00:00
Robert Schulze	e2b11899a1	Move check if cfg allows hyperscan into implementations - This is not needed for non-const regexp array arguments but cleans up the code and runs the check only in functions which actually use hyperscan.	2022-06-26 16:25:49 +00:00
Robert Schulze	c2cea38b97	Move local variable into if statement	2022-06-26 16:25:49 +00:00
Robert Schulze	c9ce0efa66	Instantiate MultiMatchAnyImpl template using enums - With this, invalid combinations of the FindAny/FindAnyIndex bools are no longer possible and we can remove the corresponding check - Also makes the instantiations more readable.	2022-06-26 16:25:49 +00:00
Robert Schulze	2f15d45f27	Move check for regexp array size into implementations - This is not needed for non-const regexp array arguments (the cardinality of arrays is fixed per column) but it cleans up the code and runs the check only in functions which have restrictions on the number of patterns. - For functions using hyperscans, it was checked that the number of regexes is < 2^32. Removed the check because I don't think anyone will ever specify 4 billion patterns.	2022-06-26 16:25:43 +00:00
Robert Schulze	3478db9fb6	Move check for regexp array size into implementations - This is not needed for non-const regexp array arguments (the cardinality of arrays is fixed per column) but it cleans up the code and runs the check only in functions which have restrictions on the number of patterns. - For functions using hyperscans, it was checked that the number of regexes is < 2^32. Removed the check because I don't think anyone will ever specify 4 billion patterns.	2022-06-26 15:38:12 +00:00
Robert Schulze	7913edc172	Move check for hyperscan regexp constraints into implementations - This is preparation for non-const regexp arguments, where this check will run for each row.	2022-06-26 15:38:05 +00:00
Robert Schulze	89bfdd50bf	Remove unnecessary check - getReturnTypeImpl() ensures that the haystack column has type "String" and we can simply assert that.	2022-06-26 15:34:24 +00:00
Robert Schulze	580d89477f	Minimally faster performance	2022-06-26 15:34:22 +00:00
Robert Schulze	4bc59c18e3	Cosmetics: Move some code around + docs + whitespaces + minor stuff	2022-06-26 15:34:15 +00:00
Robert Schulze	1273756911	Cosmetics: fmt-based exceptions	2022-06-26 15:33:18 +00:00
Robert Schulze	e5c74a14f7	Cosmetics: More consistent naming - rename utility function and file to "checkHyperscanRegexp"	2022-06-26 15:33:18 +00:00
Robert Schulze	072e0855a8	Cosmetics: Make member variables const	2022-06-26 15:32:26 +00:00
Robert Schulze	2ebfd01c2e	Cosmetics: Pull out settings variable	2022-06-26 15:32:23 +00:00
Robert Schulze	bb7c627964	Cosmetics: Pass patterns around as std::string_view instead of StringRef - The patterns are not used in hashing, there should not be a performance impact when we use stuff from the standard library instead. - added forgotten .reserve() in FunctionsMultiStringPosition.h	2022-06-26 15:32:19 +00:00
Alexey Milovidov	b3098822e0	Merge pull request #38171 from ClickHouse/hyper-to-vectorscan Replace hyperscan by vectorscan	2022-06-26 10:01:45 +03:00
Alexey Milovidov	0654684bd4	Fix wrong implementation of filesystem* functions	2022-06-25 06:10:50 +02:00
Andrey Zvonov	ea73d9c492	Merge branch 'master' into zvonand-base58	2022-06-24 21:37:20 +03:00
Kruglov Pavel	0201d62090	Merge pull request #38173 from Avogar/fix-short-circuit Fix bug with nested short-circuit functions	2022-06-24 16:04:17 +02:00
Robert Schulze	2c828338f4	Replace hyperscan by vectorscan This commit migrates ClickHouse to Vectorscan. The first 10 min of [0] explain the reasons for it. (*) Addresses (but does not resolve) #38046 (*) Config parameter names (e.g. "max_hyperscan_regexp_length") are preserved for compatibility. Likewise, error codes (e.g. "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g. "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in replacement. [0] https://www.youtube.com/watch?v=KlZWmmflW6M	2022-06-24 10:47:52 +02:00
Andrey Zvonov	c18d09a617	Merge branch 'master' into zvonand-base58	2022-06-24 07:05:49 +03:00
zvonand	dd8203038f	updated exception handling	2022-06-24 00:36:57 +05:00
zvonand	a94c40e33a	Merge branch 'master' of github.com:ClickHouse/ClickHouse into dt64_timeslots	2022-06-23 17:08:28 +05:00
mergify[bot]	234f0c6399	Merge branch 'master' into revert-35914-FIPS_compliance	2022-06-23 12:06:17 +00:00
zvonand	946117ec89	Merge branch 'master' of github.com:ClickHouse/ClickHouse into zvonand-base58	2022-06-23 17:04:40 +05:00
Alexey Milovidov	5855668514	Remove trash	2022-06-22 06:23:35 +02:00
mergify[bot]	bb79eb73e6	Merge branch 'master' into fix-short-circuit	2022-06-21 10:40:07 +00:00
Kruglov Pavel	b9b58b4305	Merge pull request #37759 from Avogar/fix-nothing-error Fix possible logical error with type Nothing in some functions	2022-06-21 12:35:05 +02:00
Larry Luo	bbd73ba727	use utility methods to access x509 struct fields.	2022-06-20 21:27:33 -04:00
zvonand	22af00b757	rename variable + fix handling of ENABLE_LIBRARIES	2022-06-20 23:53:47 +05:00
mergify[bot]	b440ee84ae	Merge branch 'master' into fix-short-circuit	2022-06-20 15:14:19 +00:00
zvonand	d4e5686b99	minor: fix message for base64	2022-06-20 20:13:09 +05:00
zvonand	78d55d6f46	small fixes	2022-06-20 19:30:54 +05:00
zvonand	832fd6e0a9	Added tests + minor updates	2022-06-19 23:10:28 +05:00
Alexey Milovidov	0cf88e0950	Revert "ClickHouse's boringssl module updated to the official version of the FIPS compliant."	2022-06-18 23:16:18 +03:00
zvonand	f4b3af091d	fix zero byte	2022-06-17 23:48:14 +05:00
avogar	23f48a9fb9	Fix bug with nested short-circuit functions	2022-06-17 11:44:49 +00:00
Andrey Zvonov	f987f461e5	fix style -- rm unused ErrorCode	2022-06-17 15:00:32 +05:00
Igor Nikonov	baebbc084f	Merge pull request #38027 from ClickHouse/decimal_rounding_fix Fix: rounding for Decimal128/Decimal256 with more than 19-digits long scale	2022-06-17 09:48:18 +02:00
zvonand	c1b2b669ab	remove wrong code	2022-06-17 01:52:45 +05:00
mergify[bot]	f46f7257dd	Merge branch 'master' into fix-nothing-error	2022-06-16 10:58:03 +00:00
mergify[bot]	2557e8ad51	Merge branch 'master' into decimal_rounding_fix	2022-06-16 10:53:49 +00:00
avogar	a3a7cc7a5d	Fix logical error in array mapped functions with const nullable column	2022-06-16 10:41:53 +00:00
zvonand	a800158438	wip upload	2022-06-16 15:11:41 +05:00
Danila Kutenin	048f56bf4d	Fix some tests and comments	2022-06-15 14:40:21 +00:00
Danila Kutenin	08e3f77a9c	Optimize most important parts with NEON SIMD First part, updated most UTF8, hashing, memory and codecs. Except utf8lower and upper, maybe a little later. That includes huge amount of research with movemask dealing. Exact details and blog post TBD.	2022-06-15 13:19:29 +00:00
mergify[bot]	d704264fae	Merge branch 'master' into decimal_rounding_fix	2022-06-15 10:47:09 +00:00
zvonand	c149c916ec	initial setup	2022-06-15 11:49:55 +05:00
Igor Nikonov	bf7dd39282	Fix: decimal rounding Fixes #37531	2022-06-14 18:03:05 +00:00
Maksim Kita	dc2e117cce	UnaryLogicalFunctions improve performance using dynamic dispatch	2022-06-14 17:30:11 +02:00
zvonand	a5a980b69d	Added no_sanitize	2022-06-13 19:45:54 +05:00
zvonand	54b8709cb1	minor fix	2022-06-13 19:21:07 +05:00
Robert Schulze	5f5732a2c4	Merge pull request #37969 from ClickHouse/consistent-macro-usage More consistent use of platform macros	2022-06-10 14:10:01 +02:00
zvonand	fb67b080b9	added docs	2022-06-10 14:30:17 +03:00
zvonand	551d1ea875	fix wrong interval	2022-06-10 13:21:31 +03:00
Robert Schulze	1a0b5f33b3	More consistent use of platform macros cmake/target.cmake defines macros for the supported platforms, this commit changes predefined system macros to our own macros. __linux__ --> OS_LINUX __APPLE__ --> OS_DARWIN __FreeBSD__ --> OS_FREEBSD	2022-06-10 10:22:31 +02:00
zvonand	e19653618c	fix wrongfully added submodule	2022-06-10 11:19:38 +03:00
zvonand	16087ea400	enable dt64 for timeslots	2022-06-09 15:28:18 +03:00
Maksim Kita	0c1211eb61	Merge pull request #37930 from kitaisreal/function-dict-get-check-arguments-size Function dictGet check arguments size	2022-06-08 23:25:14 +02:00
Maksim Kita	b7152fa2bf	Function dictGet check arguments size	2022-06-08 17:19:30 +02:00
Maksim Kita	7d1a43cfeb	Fix setting cast_ipv4_ipv6_default_on_conversion_error for internal cast	2022-06-08 12:43:39 +02:00
Maksim Kita	4e160105b9	Merge pull request #37805 from kitaisreal/dictionaries-hierarchy-nullable-key-support Hierarchical dictionaries support nullable parent key	2022-06-08 12:36:09 +02:00
Anton Popov	df6882d2b9	Revert "Fix errors of CheckTriviallyCopyableMove type"	2022-06-07 13:53:10 +02:00
mergify[bot]	014d9e2144	Merge branch 'master' into fix-nothing-error	2022-06-07 11:24:28 +00:00
avogar	cbd50aecd4	Fix	2022-06-07 11:23:59 +00:00
Vitaly Baranov	d199478169	Merge pull request #37303 from ClickHouse/fix_trash Try to fix some trash	2022-06-07 10:17:39 +02:00
Robert Schulze	2d87af2a15	Merge pull request #37647 from DevTeamBK/Fix-all-CheckTriviallyCopyableMove-Errors Fix errors of CheckTriviallyCopyableMove type	2022-06-05 19:58:47 +02:00
zvonand	5475f62363	32 to 64	2022-06-05 13:06:48 +03:00
Maksim Kita	6db5c08fde	Functions dictGetChildren, dictGetDescendants added support for nullable parent key	2022-06-03 17:36:16 +02:00
Maksim Kita	a0cbbd9edc	Hierarchical Cache, Direct dictionaries added support for nullable parent key	2022-06-03 17:21:55 +02:00
Anton Popov	f592a802a1	Merge pull request #37482 from CurtizJ/json-new-serialization Better binary serialization of `ColumnObject`	2022-06-03 13:29:19 +02:00
Robert Schulze	05f08357a9	Merge pull request #37764 from ClickHouse/like_with_trailing_backslash Disallow LIKE patterns with trailing escape	2022-06-03 13:19:51 +02:00
Alexey Milovidov	1529d47207	Merge pull request #34754 from ClickHouse/llvm-14 Switch to clang/llvm 14	2022-06-03 14:07:34 +03:00
Alexey Milovidov	de16784832	Merge pull request #37633 from ClickHouse/dump-column-structure-more-precise More precise result of the `dumpColumnStructure` and `byteSize` miscellaneous functions	2022-06-03 14:05:20 +03:00
Alexey Milovidov	ea89f81a78	Merge branch 'master' of github.com:ClickHouse/ClickHouse into llvm-14	2022-06-03 03:07:14 +02:00
Robert Schulze	657662d89f	Minor follow-up to cache table: std::{vector-->array}	2022-06-02 20:18:10 +02:00
Maksim Kita	20b55a45b2	Hierarchical dictionaries support nullable parent key	2022-06-02 19:24:23 +02:00
HeenaBansal2009	e3080f2a97	Merge remote-tracking branch 'origin' into Fix-all-CheckTriviallyCopyableMove-Errors	2022-06-02 07:30:08 -07:00
Alexander Gololobov	b34782dc6a	Merge pull request #37775 from liuneng1994/fix_date32_to_string fix toString error on DatatypeDate32	2022-06-02 16:40:47 +03:00
Vladimir C	670c721ded	Merge pull request #37742 from ucasfl/hashid	2022-06-02 12:47:11 +02:00
Robert Schulze	4e18659bfd	Fix tests + more precise exception msg	2022-06-02 11:11:56 +02:00
liuneng1994	7b15055e72	fix toString error on DatatypeDate32 Signed-off-by: liuneng1994 <1398775315@qq.com>	2022-06-02 16:56:43 +08:00
Alexey Milovidov	b5f48a7d3f	Merge branch 'master' of github.com:ClickHouse/ClickHouse into llvm-14	2022-06-01 22:09:58 +02:00
Robert Schulze	366f368d06	Disallow LIKE patterns with trailing escape Trailing escape ('ab\') is disallowed in SQL, in standardese: "If an escape character is specified, then [...] If there is not a partitioning of the string PVC into substrings such that each substring has length 1 (one) or 2, no substring of length 1 (one) is the escape character ECV, and each substring of length 2 is the escape character ECV followed by either the escape character ECV, an <underscore> character, or the <percent> character, then an exception condition is raised: data exception - invalid escape sequence." I first thought this is checked already higher up in the stack, at least for const needles, as single trailing backslashes ('ab\') are rejected, but then I realized that ClickHouse quotes by default. I.e., double trailing backslashes ('ab\\') are not rejected but when interpreted as LIKE needle ('ab\') they should.	2022-06-01 21:38:46 +02:00
Robert Schulze	b3b0716b32	Merge pull request #37544 from ClickHouse/cached_patterns Cache compiled regexps when evaluating non-const needles	2022-06-01 19:55:25 +02:00
avogar	966b864986	Fix possible logical error with type Nothing and JSON functions	2022-06-01 16:34:31 +00:00
flynn	b62e4cec65	Fix crash of FunctionHashID	2022-06-01 12:39:16 +00:00
Alexander Tokmakov	75f49a48e1	Merge branch 'master' into fix_trash	2022-06-01 14:20:46 +02:00
Robert Schulze	600512cc08	Replace exceptions thrown for programming errors by asserts	2022-06-01 11:53:37 +02:00
Anton Popov	20e319d67a	Merge pull request #37666 from CurtizJ/optimize-coalesce Optimize function `COALESCE` with two arguments	2022-05-31 23:48:13 +02:00
Yakov Olkhovskiy	873ac9f8ff	Merge pull request #37540 from ClickHouse/feature-server-certificate showCertificate function implementation	2022-05-31 02:50:03 -04:00
Anton Popov	30f8eb800a	optimize function coalesce with two arguments	2022-05-30 22:29:35 +00:00
Nikolai Kochetov	77b07dd0a8	Merge pull request #37163 from ClickHouse/grouping-function Add GROUPING function	2022-05-30 20:45:04 +02:00
HeenaBansal2009	b7eb6bbd38	Fixed clang-tidy-CheckTriviallyCopyableMove-errors	2022-05-30 11:09:03 -07:00
Robert Schulze	ad12adc31c	Measure and rework internal re2 caching This commit is based on local benchmarks of ClickHouse's re2 caching. Question 1: ----------------------------------------------------------- Is pattern caching useful for queries with const LIKE/REGEX patterns? E.g. SELECT LIKE(col_haystack, '%HelloWorld') FROM T; The short answer is: no. Runtime is (unsurprisingly) dominated by pattern evaluation + other stuff going on in queries, but definitely not pattern compilation. For space reasons, I omit details of the local experiments. (Side note: the current caching scheme is unbounded in size which poses a DoS risk (think of multi-tenancy). This risk is more pronounced when unbounded caching is used with non-const patterns ..., see next question) Question 2: ----------------------------------------------------------- Is pattern caching useful for queries with non-const LIKE/REGEX patterns? E.g. SELECT LIKE(col_haystack, col_needle) FROM T; I benchmarked five caching strategies: 1. no caching as a baseline (= recompile for each row) 2. unbounded cache (= threadsafe global hash-map) 3. LRU cache (= threadsafe global hash-map + LRU queue) 4. lightweight local cache 1 (= not threadsafe local hashmap with collision list which grows to a certain size (here: 10 elements) and afterwards never changes) 5. lightweight local cache 2 (not threadsafe local hashmap without collision list in which a collision replaces the stored element, idea by Alexey) ... using a haystack of 2 mio strings and A) 2 mio distinct simple patterns B) 10 simple patterns C) 2 mio distinct complex patterns D) 10 complex patterns For A) and C), caching does not help but these queries still allow to judge the static overhead of caching on query runtimes. B) and D) are extreme but common cases in practice. They include queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' : '%pattern2%')". Caching should help significantly. 
Because LIKE patterns are internally translated to re2 expressions, I show only measurements for MATCH queries. Results in sec, averaged over multiple measurements; 1.A): 2.12 B): 1.68 C): 9.75 D): 9.45 2.A): 2.17 B): 1.73 C): 9.78 D): 9.47 3.A): 9.8 B): 0.63 C): 31.8 D): 0.98 4.A): 2.14 B): 0.29 C): 9.82 D): 0.41 5.A) 2.12 / 2.15 / 2.26 B) 1.51 / 0.43 / 0.30 C) 9.97 / 9.88 / 10.13 D) 5.70 / 0.42 / 0.43 (10/100/1000 buckets, resp. 10/1/0.1% collision rate) Evaluation: 1. This is the baseline. I was surprised that complex patterns (C, D) slow down the queries so badly compared to simple patterns (A, B). The runtime includes evaluation costs, but as caching only helps with compilation, and looking at 4.D and 5.D, compilation makes up over 90% of the runtime! 2. No speedup compared to 1, probably due to locking overhead. The cache is unbounded, and in experiments with data sets > 2 mio rows, 2. is the only scheme to throw OOM exceptions which is not acceptable. 3. Unique patterns (A and C) lead to thrashing of the LRU cache and very bad runtimes due to LRU queue maintenance and locking. Works pretty well however with few distinct patterns (B and D). 4. This scheme is tailored to queries B and D where it performs pretty well. More importantly, the caching is lightweight enough to not deteriorate performance on datasets A and C. 5. After some tuning of the hash map size, 100 buckets seem optimal to be in the same ballpark with 10 distinct patterns as 4. Performance also does not deteriorate on A and C compared to the baseline. Unlike 4., this scheme behaves LRU-like and can adjust to changing pattern distributions. As a conclusion, this commit implements two things: 1. Based on Q1, pattern search with const needle no longer uses caching. This applies to LIKE and MATCH + a few (exotic) other SQL functions. The code for the unbounded caching was removed. 2. Based on Q2, pattern search with non-const needles now uses method 5.	2022-05-30 20:00:35 +02:00
Alexey Milovidov	f1fb57c6ce	Fix clang-tidy-14	2022-05-30 05:36:26 +02:00
Alexey Milovidov	c0e6ff4216	More precise result of "dumpColumnStructure" and "byteSize" miscellaneous functions	2022-05-30 04:56:54 +02:00
Alexey Milovidov	c1169019d2	Merge branch 'master' into llvm-14	2022-05-29 02:29:02 +02:00
Alexey Milovidov	73e2e63414	Merge pull request #37612 from ClickHouse/clang-tidy-14 Fix clang-tidy-14, part 1	2022-05-29 03:16:32 +03:00
Alexander Tokmakov	4e52f45695	Merge branch 'master' into fix_trash	2022-05-28 19:43:19 +02:00
Alexey Milovidov	c50791dd3b	Fix clang-tidy-14, part 1	2022-05-27 22:52:14 +02:00
Alexey Milovidov	d2c6fd90cb	Fix clang-tidy-14, part 1	2022-05-27 22:51:37 +02:00
Alexander Gololobov	9b1b30855c	Fixed check for HUGE_VAL	2022-05-27 18:25:11 +02:00
Alexander Gololobov	6361c5f38c	Fix for failed style check	2022-05-27 18:22:16 +02:00
Alexander Gololobov	540353566c	Added LpNorm and LpDistance functions for arrays	2022-05-27 17:17:08 +02:00
Robert Schulze	80061aa3e2	Merge remote-tracking branch 'origin/master' into cached_patterns	2022-05-27 09:21:01 +02:00
Alexey Milovidov	86afa3a245	Merge pull request #37502 from ClickHouse/array_norm_dist_fixes Renamed arrayXXNorm/arrayXXDistance functions to XXNorm/XXDistance and fixed some overflow cases	2022-05-27 00:56:29 +03:00
mergify[bot]	a7629f900f	Merge branch 'master' into normalize-utf8-performance-tests-fix	2022-05-26 10:29:55 +00:00
Maksim Kita	3a92e61827	Merge pull request #37148 from kitaisreal/dictionary-get-descendants-performance-improvement Dictionary getDescendants performance improvement	2022-05-26 12:29:17 +02:00
Yakov Olkhovskiy	2dc160a4c3	style fix	2022-05-25 20:56:36 -04:00
Dmitry Novik	7cd7782e4f	Process columns more efficiently in GROUPING()	2022-05-25 21:55:41 +00:00
Dmitry Novik	3c1b6609ae	Add comments and make tests more verbose	2022-05-25 21:23:35 +00:00
Maksim Kita	58cd1bd3ec	Merge pull request #36843 from bharatnc/ncb/h3-unidirectionaledges-funcs add h3 unidirectional edge functions	2022-05-25 22:46:40 +02:00
Maksim Kita	bee3c30f66	Merge pull request #37524 from kitaisreal/geo-distance-functions-improve-performance Geo distance functions improve performance	2022-05-25 22:40:40 +02:00
Alexander Gololobov	168b47d0ad	Use same norm and distance function names for tuples and arrays	2022-05-25 22:39:59 +02:00
Alexander Gololobov	b065839f44	always return Float64	2022-05-25 22:27:00 +02:00
Alexander Gololobov	5df14cd956	Cast arguments to result type to avoid int overflow	2022-05-25 22:27:00 +02:00
Robert Schulze	49934a3dc8	Cache compiled regexps when evaluating non-const needles Needles in a (non-const) needle column may repeat and this commit allows to skip compilation for known needles. Out of the different design alternatives (see below, if someone is interested), we now maintain - one global pattern cache, - with a fixed size of 42k elements currently, - and use LRU as eviction strategy. ------------------------------------------------------------------------ (sorry for the wall of text, dumping it here not for reading but just for reference) Write-up about considered design alternatives: 1. Keep the current global cache of const needles. For non-const needles, probe the cache but don't store values in it. Pros: need to maintain just a single cache, no problem with cache pollution assuming there are few distinct constant needles Cons: only useful if a non-const needle already occurred as a const needle --> overall too simplistic 2. Keep the current global cache for const needles. For non-const needles, create a local (e.g. per-query) cache Pros: unlike (1.), non-const needles can be skipped even if they did not occur yet, no pollution of the const pattern cache when there are very many non-const needles (e.g. large / highly distinct needle columns). Cons: caches may explode "horizontally", i.e. we'll end up with the const cache + caches for Q1, Q2, ... QN, this makes it harder to control the overall space consumption, also patterns residing in different caches cannot be reused between queries, another difficulty is that the concept of "query" does not really exist at matching level - there are only column chunks and we'd potentially end up with 1 cache / chunk 3. Queries with const and non-const needles insert into the same global cache. Pros: the advantages of (2.) + allows to reuse compiled patterns across parallel queries Cons: needs an eviction strategy to control cache size and pollution (and btw. (2.) 
also needs eviction strategies for the individual caches) 4. Queries with const needle use global cache, queries with non-const needle use a different global cache --> Overall similar to (3) but ignores the (likely) edge case that const and non-const needles overlap. In sum, (3.) seems the simplest and most beneficial approach. Eviction strategies: 0. Don't ever evict --> cache may grow infinitely and eventually make the system unusable (may even pose a DoS risk) 1. Flush the cache after a certain threshold is exceeded --> very simple but may lead to periodic performance drops 2. Use LRU --> more graceful performance degradation at threshold but comes with a (constant) performance overhead to maintain the LRU queue In sum, given that the pattern compilation in RE2 should be quite costly (pattern-to-DFA/NFA), LRU may be acceptable.	2022-05-25 22:04:06 +02:00
Robert Schulze	ea60a614d2	Decrease namespace indent	2022-05-25 21:56:35 +02:00
Alexey Milovidov	abf2558fba	Merge pull request #37491 from ClickHouse/match_refactoring Refactorings of LIKE/MATCH code	2022-05-25 22:05:38 +03:00
Alexey Milovidov	4482da9eb6	Update greatCircleDistance.cpp	2022-05-25 21:59:31 +03:00
Alexander Tokmakov	779e6ea0b9	make it better, fix on cluster queries	2022-05-25 20:17:49 +02:00
Nikolai Kochetov	ff98c24d44	Merge pull request #37048 from Avogar/fix-array-map-nothing Add default implementation for Nothing in functions	2022-05-25 19:10:40 +02:00
Yakov Olkhovskiy	6692b9c2ed	showCertificate function implementation	2022-05-25 12:11:44 -04:00
Alexey Milovidov	cb92482ca5	Merge pull request #37484 from kitaisreal/function-has-all-avx2-dynamic-dispatch Function hasAll added dynamic dispatch	2022-05-25 19:05:32 +03:00
Maksim Kita	28355114c0	Fixed tests	2022-05-25 16:19:29 +02:00
Maksim Kita	e67b3537f7	Functions normalizeUTF8 unstable performance tests fix	2022-05-25 15:54:52 +02:00
Maksim Kita	45da28ecae	Improve performance of geo distance functions	2022-05-25 14:22:22 +02:00
Maksim Kita	c372c3d6aa	Fix performance tests	2022-05-25 11:49:59 +02:00
Kseniia Sumarokova	b50d4549c9	Merge pull request #37356 from amosbird/partition-prune-for-s3 "Partition pruning" for s3	2022-05-25 11:03:07 +02:00
Robert Schulze	05e4fa7df1	Fix special case of trivial regexp Previously, we would always set 1 in case of a trivial regex (which is correct). If someone in future builds a negated operator, then this will produce wrong results. Right now, negation of regexp (SQL: NOT MATCH) is implemented at a higher level, so we are safe and this is more a preventive fix.	2022-05-25 10:05:55 +02:00
Robert Schulze	01ab7b9bad	Pass strings in some places as string_view The original goal was to change const auto & needle = String( reinterpret_cast<const char *>(cur_needle_data), cur_needle_length); in Functions/MatchImpl.h into a std::string_view to save an allocation + copy. The needle is eventually passed as search pattern into the re2 library. Re2 has an alternative constructor taking a const char * i.e. a NULL-terminated string. Here, the needle is NULL-terminated but 1. this is only because it is passed inside a ColumnString yet this is not always the case (e.g. fixed string columns have a dense layout w/o NULL terminator). 2. assuming NULL termination for users != MatchImpl of the regex code is too dangerous. So, for now we'll stay with copying to be on the safe side. One fine day when re2 has a ptr/size ctor, we can use std::string_view. Just changing a few other places from std::string to std::string_view but this will not help with performance.	2022-05-25 10:05:51 +02:00
Robert Schulze	040fbf3686	Tighter sanity checks in matching code	2022-05-25 10:05:06 +02:00
Robert Schulze	35bef17302	Introduce variables to hold the match result --> nicer when debugging	2022-05-25 10:04:47 +02:00
Robert Schulze	b044d44fef	Refactoring: Make template instantiation easier to read - introduced class MatchTraits with enums that replace bool template parameters - (minor: made negation the last template parameters because negation executes last during evaluation)	2022-05-25 10:03:58 +02:00
Bharat Nallan Chakravarthy	57cfc0bd04	check for validity of h3 index	2022-05-25 06:17:15 +05:30
Alexander Gololobov	2ff747785e	Merge pull request #37394 from ClickHouse/array_norm_dist_fixes Do computations on the raw input data without copying to Eigen::Matrix	2022-05-24 20:59:04 +02:00
Robert Schulze	7348a0eb28	Merge pull request #37251 from ClickHouse/non_const_like Support non-constant SQL functions (NOT) (I)LIKE and MATCH	2022-05-24 20:28:31 +02:00
Robert Schulze	028f15c4fa	Review comment: Throw LOGICAL_ERROR for different sizes of haystack / needles	2022-05-24 20:19:13 +02:00
Maksim Kita	3c0c322d7c	Merge pull request #37480 from kitaisreal/dynamic-dispatch-infrastructure-improvements Dynamic dispatch infrastructure style fixes	2022-05-24 18:13:53 +02:00
Maksim Kita	6fb51e8bd3	Function hasAll added dynamic dispatch	2022-05-24 17:06:06 +02:00
Maksim Kita	86180614e7	Fixed tests	2022-05-24 15:33:03 +02:00
Anton Popov	e96af9fd75	better binary serialization of ColumnObject	2022-05-24 13:16:11 +00:00
Maksim Kita	e6e4b2826d	Dynamic dispatch infrastructure style fixes	2022-05-24 14:25:29 +02:00
Amos Bird	c25ef92139	Fix tests	2022-05-24 18:57:55 +08:00
Amos Bird	093d315756	partition pruning for s3	2022-05-24 18:57:55 +08:00
Maksim Kita	712b000f2a	Merge pull request #37443 from kitaisreal/functions-normalize-utf8-fix Functions normalize utf8 fix	2022-05-24 11:11:15 +02:00
Alexander Gololobov	7d0ed7e51a	Remove eigen library	2022-05-24 10:24:50 +02:00
Alexander Gololobov	caad1435d5	Optimized the case when one of the arguments is Const	2022-05-24 10:24:50 +02:00
Alexander Gololobov	65fbda436a	Do computations on the raw input data without copying to Eigen::Matrix	2022-05-24 10:24:50 +02:00
Bharat Nallan Chakravarthy	6e49b76cfd	try suppress h3 asan errors	2022-05-24 10:22:46 +05:30
Maksim Kita	996241493f	Merge pull request #37447 from kitaisreal/binary-function-vectorized-remove-macro BinaryFunctionVectorized remove macro	2022-05-23 16:50:12 +02:00
Maksim Kita	fe21b4ca9e	Fixed style check	2022-05-23 14:41:07 +02:00
Maksim Kita	008de5c779	Merge pull request #37438 from kitaisreal/function-binary-representation-style-fixes FunctionBinaryRepresentation style fixes	2022-05-23 13:54:15 +02:00
Maksim Kita	e550843d56	BinaryFunctionVectorized remove macro	2022-05-23 12:45:16 +02:00
Maksim Kita	585b86446e	Added hierarchical_index_bytes_allocated column in system.dictionaries	2022-05-23 12:42:00 +02:00
Maksim Kita	be9c3d9bd4	Fixed build	2022-05-23 12:42:00 +02:00
Maksim Kita	100afa8bcf	Dictionary getDescendants performance improvement	2022-05-23 12:42:00 +02:00
Maksim Kita	78782de887	Functions normalizeUTF8 logical error fix	2022-05-23 12:19:14 +02:00
Maksim Kita	98bb34f2f2	FunctionBinaryRepresentation style fixes	2022-05-23 10:59:33 +02:00
Robert Schulze	e25ca139cd	Implement SQL functions (NOT) (I)LIKE() + MATCH() with non-const needles With this commit, SQL functions LIKE and MATCH and their variants can work with non-const needle arguments. E.g. create table tab (id UInt32, haystack String, needle String) engine = MergeTree() order by id; insert into tab values (1, 'Hello', '%ell%') (2, 'World', '%orl%') select id, haystack, needle, like(haystack, needle) from tab; For that, methods vectorVector() and vectorFixedVector() were added to MatchImpl. The existing code for const needles has an optimization where the compiled regexp is cached. The new code expects a different needle per row and consequently does not cache the regexp.	2022-05-23 09:41:28 +02:00
Alexey Milovidov	698e5e5352	Merge pull request #37415 from Joeywzr/gen_uuid Generate multiple columns with UUID	2022-05-23 00:29:42 +03:00
Robert Schulze	4829ae8380	Replace overly clever const argument logic by something simpler The previous logic was smart but too inflexible to support the next commits. Replace by a simple pushdown logic where string search implementations return their const arguments instead of having the common class figure these out based on properties/traits.	2022-05-22 17:50:38 +02:00

3807 Commits