First part, updated most UTF8, hashing, memory and codecs. Except
utf8lower and upper, maybe a little later.
That includes huge amount of research with movemask dealing. Exact
details and blog post TBD.
cmake/target.cmake defines macros for the supported platforms, this
commit changes predefined system macros to our own macros.
__linux__ --> OS_LINUX
__APPLE__ --> OS_DARWIN
__FreeBSD__ --> OS_FREEBSD
Trailing escape ('ab\') is disallowed in SQL, in standardese:
"If an escape character is specified, then [...] If there is not a
partitioning of the string PVC into substrings such that each substring
has length 1 (one) or 2, no substring of length 1 (one) is the escape
character ECV, and each substring of length 2 is the escape character
ECV followed by either the escape character ECV, an <underscore>
character, or the <percent> character, then an exception condition is
raised: data exception - invalid escape sequence."
I first thought this is checked already higher up in the stack, at least
for const needles, as single trailing backslashes ('ab\') are rejected,
but then I realized that ClickHouse quotes by default. I.e., double
trailing backslashes ('ab\\') are not rejected but when interpreted as
LIKE needle ('ab\') they should.
This commit is based on local benchmarks of ClickHouse's re2 caching.
Question 1: -----------------------------------------------------------
Is pattern caching useful for queries with const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, '%HelloWorld') FROM T;
The short answer is: no. Runtime is (unsurprisingly) dominated by
pattern evaluation + other stuff going on in queries, but definitely not
pattern compilation. For space reasons, I omit details of the local
experiments.
(Side note: the current caching scheme is unbounded in size which poses
a DoS risk (think of multi-tenancy). This risk is more pronounced when
unbounded caching is used with non-const patterns ..., see next
question)
Question 2: -----------------------------------------------------------
Is pattern caching useful for queries with non-const LIKE/REGEX
patterns? E.g. SELECT LIKE(col_haystack, col_needle) FROM T;
I benchmarked five caching strategies:
1. no caching as a baseline (= recompile for each row)
2. unbounded cache (= threadsafe global hash-map)
3. LRU cache (= threadsafe global hash-map + LRU queue)
4. lightweight local cache 1 (= not threadsafe local hashmap with
collision list which grows to a certain size (here: 10 elements) and
afterwards never changes)
5. lightweight local cache 2 (not threadsafe local hashmap without
collision list in which a collision replaces the stored element, idea
by Alexey)
... using a haystack of 2 mio strings and
A). 2 mio distinct simple patterns
B). 10 simple patterns
C) 2 mio distinct complex patterns
D) 10 complex patterns
Fo A) and C), caching does not help but these queries still allow to
judge the static overhead of caching on query runtimes.
B) and D) are extreme but common cases in practice. They include
queries like "SELECT ... WHERE LIKE (col_haystack, flag ? '%pattern1%' :
'%pattern2%'). Caching should help significantly.
Because LIKE patterns are internally translated to re2 expressions, I
show only measurements for MATCH queries.
Results in sec, averaged over on multiple measurements;
1.A): 2.12
B): 1.68
C): 9.75
D): 9.45
2.A): 2.17
B): 1.73
C): 9.78
D): 9.47
3.A): 9.8
B): 0.63
C): 31.8
D): 0.98
4.A): 2.14
B): 0.29
C): 9.82
D): 0.41
5.A) 2.12 / 2.15 / 2.26
B) 1.51 / 0.43 / 0.30
C) 9.97 / 9.88 / 10.13
D) 5.70 / 0.42 / 0.43
(10/100/1000 buckets, resp. 10/1/0.1% collision rate)
Evaluation:
1. This is the baseline. It was surprised that complex patterns (C, D)
slow down the queries so badly compared to simple patterns (A, B).
The runtime includes evaluation costs, but as caching only helps with
compilation, and looking at 4.D and 5.D, compilation makes up over 90%
of the runtime!
2. No speedup compared to 1, probably due to locking overhead. The cache
is unbounded, and in experiments with data sets > 2 mio rows, 2. is
the only scheme to throw OOM exceptions which is not acceptable.
3. Unique patterns (A and C) lead to thrashing of the LRU cache and very
bad runtimes due to LRU queue maintenance and locking. Works pretty
well however with few distinct patterns (B and D).
4. This scheme is tailored to queries B and D where it performs pretty
good. More importantly, the caching is lightweight enough to not
deteriorate performance on datasets A and C.
5. After some tuning of the hash map size, 100 buckets seem optimal to
be in the same ballpark with 10 distinct patterns as 4. Performance
also does not deteriorate on A and C compared to the baseline.
Unlike 4., this scheme behaves LRU-like and can adjust to changing
pattern distributions.
As a conclusion, this commit implementes two things:
1. Based on Q1, pattern search with const needle no longer uses
caching. This applies to LIKE and MATCH + a few (exotic) other SQL
functions. The code for the unbounded caching was removed.
2. Based on Q2, pattern search with non-const needles now use method 5.
Needles in a (non-const) needle column may repeat and this commit allows
to skip compilation for known needles. Out of the different design
alternatives (see below, if someone is interested), we now maintain
- one global pattern cache,
- with a fixed size of 42k elements currently,
- and use LRU as eviction strategy.
------------------------------------------------------------------------
(sorry for the wall of text, dumping it here not for reading but just
for reference)
Write-up about considered design alternatives:
1. Keep the current global cache of const needles. For non-const
needles, probe the cache but don't store values in it.
Pros: need to maintain just a single cache, no problem with cache
pollution assuming there are few distinct constant needles
Cons: only useful if a non-const needle occurred as already as a
const needle
--> overall too simplistic
2. Keep the current global cache for const needles. For non-const
needles, create a local (e.g. per-query) cache
Pros: unlike (1.), non-const needles can be skipped even if they
did not occur yet, no pollution of the const pattern cache when
there are very many non-const needles (e.g. large / highly
distinct needle columns).
Cons: caches may explode "horizontally", i.e. we'll end up with the
const cache + caches for Q1, Q2, ... QN, this makes it harder
to control the overall space consumption, also patterns
residing in different caches cannot be reused between queries,
another difficulty is that the concept of "query" does not
really exist at matching level - there are only column chunks
and we'd potentially end up with 1 cache / chunk
3. Queries with const and non-const needles insert into the same global
cache.
Pros: the advantages of (2.) + allows to reuse compiled patterns
accross parallel queries
Cons: needs an eviction strategy to control cache size and pollution
(and btw. (2.) also needs eviction strategies for the
individual caches)
4. Queries with const needle use global cache, queries with non-const
needle use a different global cache
--> Overall similar to (3) but ignores the (likely) edge case that
const and non-const needles overlap.
In sum, (3.) seems the simplest and most beneficial approach.
Eviction strategies:
0. Don't ever evict --> cache may grow infinitely and eventually make
the system unusable (may even pose a DoS risk)
1. Flush the cache after a certain threshold is exceeded --> very
simple but may lead to peridic performance drops
2. Use LRU --> more graceful performance degradation at threshold but
comes with a (constant) performance overhead to maintain the LRU
queue
In sum, given that the pattern compilation in RE2 should be quite costly
(pattern-to-DFA/NFA), LRU may be acceptable.
Previously, we would alsays set 1 in case of a trivial regex (which is
correct). If someone in future builds a negated operator, then this
will produce wrong results. Right now, negation of regexp (SQL: NOT
MATCH) is implemented at a higher level, so we are safe and this is more
a preventive fix.
The original goal was to get change
const auto & needle = String(
reinterpret_cast<const char *>(cur_needle_data),
cur_needle_length);
in Functions/MatchImpl.h into a std::string_view to save an allocation +
copy. The needle is eventually passed as search pattern into the re2
library. Re2 has an alternative constructor taking a const char * i.e. a
NULL-terminated string. Here, the needle is NULL-terminated but
1. this is only because it is passed inside a ColumnString yet this is
not always the case (e.g. fixed string columns has a dense layout w/o
NULL terminator).
2. assuming NULL termination for users != MatchImpl of the regex code is
too dangerous.
So, for now we'll stay with copying to be on the safe side. One fine day
when re2 has a ptr/size ctor, we can use std::string_view.
Just changing a few other places from std::string to std::string_view
but this will not help with performance.
- introduced class MatchTraits with enums that replace bool template
parameters
- (minor: made negation the last template parameters because negation
executes last during evaluation)
With this commit, SQL functions LIKE and MATCH and their variants can
work with non-const needle arguments. E.g.
create table tab
(id UInt32, haystack String, needle String)
engine = MergeTree()
order by id;
insert into tab values
(1, 'Hello', '%ell%')
(2, 'World', '%orl%')
select id, haystack, needle, like(haystack, needle)
from tab;
For that, methods vectorVector() and vectorFixedVector() were added to
MatchImpl. The existing code for const needles has an optimization where
the compiled regexp is cached. The new code expects a different needle
per row and consequently does not cache the regexp.
The previous logic was smart but too inflexible to support the next
commits. Replace by a simple pushdown logic where string search
implementations return their const arguments instead of having the
common class figure these out based on properties/traits.
In PR #37300, Alexej asked why we the compiler does not warn about
unnecessary semicolons, e.g.
f()
{
}; // <-- here
The answer is surprising: In C++98, above syntax was disallowed but by
most compilers accepted it regardless. C++>11 introduced "empty
declarations" which made the syntax legal.
The previous behavior can be restored using flag
-Wc++98-compat-extra-semi. This finds many useless semicolons which were
removed in this change. Unfortunately, there are also false positives
which would require #pragma-s and HAS_* logic (--> check_flags.cmake) to
suppress. In the end, -Wc++98-compat-extra-semi comes with extra effort
for little benefit. Therefore, this change only fixes some semicolons
but does not enable the flag.
Enable:
- bugprone-lambda-function-name: "Checks for attempts to get the name of
a function from within a lambda expression. The name of a lambda is
always something like operator(), which is almost never what was
intended."
- bugprone-unhandled-self-assignment: "Finds user-defined copy
assignment operators which do not protect the code against
self-assignment either by checking self-assignment explicitly or using
the copy-and-swap or the copy-and-move method.""
- hicpp-invalid-access-moved: "Warns if an object is used after it has
been moved."
- hicpp-use-noexcept: "This check replaces deprecated dynamic exception
specifications with the appropriate noexcept specification (introduced
in C++11)"
- hicpp-use-override: "Adds override (introduced in C++11) to overridden
virtual functions and removes virtual from those functions as it is
not required."
- performance-type-promotion-in-math-fn: "Finds calls to C math library
functions (from math.h or, in C++, cmath) with implicit float to
double promotions."
Split up:
- cppcoreguidelines-*. Some of them may be useful (haven't checked in
detail), therefore allow to toggle them individually.
Disable:
- linuxkernel-*. Obvious.
The main fix is in MatchImpl.h where the "case_insensitive" parameter is
added to Regexps::get().
Also made "case_insensitive" a non-default template parameter to reduce
the risk of future bugs.
The remainder of this commit are minor random code improvements.
resoves #37114
Official docs:
Some headers from C library were deprecated in C++ and are no longer
welcome in C++ codebases. Some have no effect in C++. For more details
refer to the C++ 14 Standard [depr.c.headers] section. This check
replaces C standard library headers with their C++ alternatives and
removes redundant ones.
Official docs:
Correct indentation helps to understand code. Mismatch of the
syntactical structure and the indentation of the code may hide serious
problems.
Omitted new checks which produce too many matches or which are
controversial (e.g. readability-identifier-length).
New checks:
- misc-misleading-bidirectional + misc-misleading-identifier
Detects potential attack as described in the Trojan Source attack
- modernize-macro-to-enum
Replaces groups of adjacent macros with an unscoped anonymous enum
- modernize-shrink-to-fit
Replace copy and swap tricks on shrinkable containers with the
shrink_to_fit() method call
- modernize-use-transparent-functors
Prefer transparent functors to non-transparent ones
- modernize-use-uncaught-exceptions
This check will warn on calls to std::uncaught_exception and replace
them with calls to std::uncaught_exceptions (uncaught_exception was
deprecated with C++17)
- performance-no-int-to-ptr
Diagnoses every integer to pointer cast
- readability-duplicate-include
Looks for duplicate includes and removes them
- readability-redundant-preprocessor
Finds potentially redundant preprocessor directives
- bugprone-lambda-function-name
Checks for attempts to get the name of a function from within a lambda
expression
- bugprone-redundant-branch-condition
Finds condition variables in nested if statements that were also
checked in the outer if statement and were not changed
- bugprone-shared-ptr-array-mismatch
Finds initializations of C++ shared pointers to non-array type that
are initialized with an array
- bugprone-stringview-nullptr
Checks for various ways that the const CharT* constructor of
std::basic_string_view can be passed a null argument and replaces them
with the default constructor in most cases
- bugprone-suspicious-memory-comparison
Finds potentially incorrect calls to memcmp() based on properties of
the arguments