Function to count number of substring occurrences in the string:
- if the needle is multi-character, non-intersecting occurrences are counted
- the code is based on the position helpers.
The following new functions are available:
- countSubstrings()
- countSubstringsCaseInsensitive()
- countSubstringsCaseInsensitiveUTF8()
v0: substringCount()
v2:
- add substringCountCaseInsensitiveUTF8
- improve tests
- fix coding style issues
- fix multichar needle
v3: rename to countSubstrings (by analogy with countEqual())
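To illustrate what "non-intersecting" means for a multi-character needle, here is a minimal sketch. It is not the ClickHouse implementation (which is built on the position helpers); the name countNonOverlapping and the empty-needle behaviour are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts occurrences of `needle` in `haystack`
// such that no two counted matches share a character.
std::size_t countNonOverlapping(const std::string & haystack, const std::string & needle)
{
    if (needle.empty())
        return 0; // assumption: an empty needle matches nothing

    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = haystack.find(needle, pos)) != std::string::npos)
    {
        ++count;
        pos += needle.size(); // jump past the whole match, so matches never intersect
    }
    return count;
}
```

With this sketch, the needle 'aa' is counted twice in 'aaaa' (not three times) and once in 'aaa'.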
Consider the following example:
CREATE TABLE test(p DateTime, k int) ENGINE MergeTree PARTITION BY toDate(p) ORDER BY k;
INSERT INTO test VALUES ('2020-09-01 00:01:02', 1), ('2020-09-01 20:01:03', 2), ('2020-09-02 00:01:03', 3);
- SELECT count() FROM test WHERE toDate(p) >= '2020-09-01' AND p <= '2020-09-01 00:00:00'
In this case the RPN will be (FUNCTION_IN_RANGE, FUNCTION_UNKNOWN (due to strict), FUNCTION_AND),
and optimize_trivial_count_query cannot use the index if there is at least one FUNCTION_UNKNOWN,
since there is no post-processing, and returning count() based only on the first predicate is wrong.
Before this patch FUNCTION_UNKNOWN was allowed for optimize_trivial_count_query, and the result was wrong.
The two examples below just show the difference; their behaviour is not changed by this patch:
- SELECT * FROM test WHERE toDate(p) >= '2020-09-01' AND p <= '2020-09-01 00:00:00'
In this case the RPN will be (FUNCTION_IN_RANGE, FUNCTION_IN_RANGE (due to non-strict), FUNCTION_AND),
so it will prune everything out and nothing will be read.
- SELECT * FROM test WHERE toDate(p) >= '2020-09-01' AND toUnixTimestamp(p)%5==0
In this case the RPN will be (FUNCTION_IN_RANGE, FUNCTION_UNKNOWN, FUNCTION_AND),
and all (two) partitions will be scanned, but due to the later filtering no rows will match.
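The rule for the trivial count can be sketched roughly as follows. The enum and the function canUseTrivialCount are hypothetical simplifications for illustration; the real RPN has more element kinds and the real check lives elsewhere.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, reduced set of RPN element kinds (only those named above).
enum class RPNFunction { FUNCTION_IN_RANGE, FUNCTION_UNKNOWN, FUNCTION_AND };

// The trivial count is answered from the index alone, with no later
// filtering of rows, so a single FUNCTION_UNKNOWN makes it unsafe.
bool canUseTrivialCount(const std::vector<RPNFunction> & rpn)
{
    for (RPNFunction element : rpn)
        if (element == RPNFunction::FUNCTION_UNKNOWN)
            return false;
    return true;
}
```

For a regular SELECT the same FUNCTION_UNKNOWN is harmless, because the rows that are read still go through the WHERE filter afterwards.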
Before this patch the following query ignores the settings for INSERT:
insert into test_parallel_insert select * from numbers_mt(65535*2) settings max_insert_threads=10
And the reason is that SETTINGS was parsed by the SELECT parser.
Fix this by pushing the SETTINGS down from the SELECT to the INSERT.
Also note that since INSERT parser does not use ParserQueryWithOutput the
following works:
insert into test_parallel_insert select * from numbers_mt(65535*2) format Null settings max_insert_threads=10
Making it implicitly cast to Date() does not look correct, since before
it returned a somewhat unexpected result:
SELECT toUnixTimestamp(today())
┌─toUnixTimestamp(today())─┐
│                    18591 │
└──────────────────────────┘
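The value 18591 looks like a day number rather than a number of seconds. A minimal sketch of the arithmetic, assuming a Date is stored as days since 1970-01-01 and ignoring time zones (the helper name dayNumToUnixTimestamp is made up for the example):

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a Date value is the number of days since 1970-01-01.
constexpr std::int64_t SECONDS_PER_DAY = 86400;

// Hypothetical helper: midnight UTC of the given day, as a Unix timestamp.
constexpr std::int64_t dayNumToUnixTimestamp(std::int64_t day_num)
{
    return day_num * SECONDS_PER_DAY;
}
```

Under that assumption, 18591 days corresponds to 1606262400 seconds (2020-11-25 00:00:00 UTC), while 18591 read as seconds would be 1970-01-01 05:09:51 UTC.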
If the connection is closed (for some reason), the setting will not be
applied after a transparent reconnect (only the native clickhouse-client
can do this, since it parses the query, but perf tests use the Python
driver).
Just use an in-place SETTINGS clause or <settings>.
Otherwise the flaky check will run them both; that is not a problem, but
this is just to trigger the flaky check, since there is some tricky issue [1]:
2020-11-20 00:35:51 01247_optimize_distributed_group_by_sharding_key: [ FAIL ] 0.67 sec. - return code 101
2020-11-20 00:35:51 Received exception from server (version 20.12.1):
2020-11-20 00:35:51 Code: 101. DB::Exception: Received from localhost:9000. DB::Exception: Received from 127.0.0.2:9000. DB::Exception: Unexpected packet Data received from client.
[1]: https://clickhouse-test-reports.s3.yandex.net/16996/8d71564056925df1415880f382aaa169cbdf37b0/functional_stateless_tests_flaky_check_(address)/test_run.txt.out.log
While reading from AggregatingMergeTree with
SimpleAggregateFunction(String) in the primary key and
optimize_aggregation_in_order, perf top shows:
Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 287759760270 lost: 0/0 drop: 0/0
Children Self Shared Object Symbol
+ 12.64% 11.39% clickhouse [.] memcpy
+ 9.08% 0.23% [unknown] [.] 0000000000000000
+ 8.45% 8.40% clickhouse [.] ProfileEvents::increment # <-- this, and in debug it has not 0.08x overhead, but 5.8x overhead
+ 7.68% 7.67% clickhouse [.] LZ4_compress_fast_extState
+ 5.29% 5.22% clickhouse [.] DB::IAggregateFunctionHelper<DB::AggregateFunctionNullUnary<true, true> >::addFree
The reason is obvious: ProfileEvents are atomic counters (and they are
also nested):
<details>
```
Samples: 7M of event 'cycles', 4000 Hz, Event count (approx.): 450726149337
ProfileEvents::increment /usr/bin/clickhouse [Percent: local period]
Percent│
│
│
│ Disassembly of section .text:
│
│ 00000000078d8900 <ProfileEvents::increment(unsigned long, unsigned long)@@Base>:
│ ProfileEvents::increment(unsigned long, unsigned long):
0.17 │ push %rbp
0.00 │ mov %rsi,%rbp
0.04 │ push %rbx
0.20 │ mov %rdi,%rbx
0.17 │ sub $0x8,%rsp
0.26 │ → callq DB::CurrentThread::getProfileEvents
│ ProfileEvents::Counters::increment(unsigned long, unsigned long):
0.00 │ lea 0x0(,%rbx,8),%rdi
0.05 │ nop
│ unsigned long std::__1::__cxx_atomic_fetch_add<unsigned long, unsigned long>(std::__1::__cxx_atomic_base_impl<unsigned long>*, unsigned long, std::__1::memory_order):
1.02 │ mov (%rax),%rdx
97.04 │ lock add %rbp,(%rdx,%rdi,1)
│ ProfileEvents::Counters::increment(unsigned long, unsigned long):
0.21 │ mov 0x10(%rax),%rax
0.04 │ test %rax,%rax
0.00 │ → jne 78d8920 <ProfileEvents::increment(unsigned long, unsigned long)@@Base+0x20>
│ ProfileEvents::increment(unsigned long, unsigned long):
0.38 │ add $0x8,%rsp
0.00 │ pop %rbx
0.04 │ pop %rbp
0.38 │ ← retq
```
</details>
The ProfileEvent in question was ArenaAllocChunks (it shows ~1.5M events
per second), and the reason is that the table has
SimpleAggregateFunction(String) in the PK, which requires an Arena.
But most of the time the Arena wasn't even used, so avoid this cost by
re-creating the Arena only if it was "used" (i.e. has new chunks).
Another possibility is to avoid populating Arena::head in the ctor, but
this would make the Arena code more complex, so for now this approach was
preferred.
Also, as a long-term solution it is worth looking at implementing the
counters via RCU (to move the extra overhead out of the write code path
and into the read side).
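The idea of re-creating the Arena only when it was used can be sketched like this. ToyArena, g_arena_alloc_chunks and resetIfUsed are hypothetical stand-ins for illustration; they are not the real Arena or ProfileEvents code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stands in for the ArenaAllocChunks counter (a plain integer in this toy).
std::size_t g_arena_alloc_chunks = 0;

// Hypothetical arena: like the real one, it allocates its head chunk in
// the constructor, so every re-creation bumps the counter.
struct ToyArena
{
    std::size_t chunks = 0;

    ToyArena() { addChunk(); }
    void addChunk() { ++chunks; ++g_arena_alloc_chunks; }
    // "Used" here means: has chunks beyond the initial head chunk.
    bool wasUsed() const { return chunks > 1; }
};

// Re-create the arena only if something was actually allocated from it.
void resetIfUsed(std::unique_ptr<ToyArena> & arena)
{
    if (arena->wasUsed())
        arena.reset(new ToyArena());
}
```

In this toy, calling resetIfUsed() on an untouched arena leaves the counter alone, which is the saving the patch is after.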
For now uuids are not generated at all, they are present only if the
part is updated manually (as you can see in the integration test).
The only place where they can be seen today by an end user is in
`system.parts` table. I looked for a way to hide this column behind an
option but couldn't find an easy one.
This is likely also required for the WAL, but we need to think about how
not to break compatibility.
Relates to #13574, https://github.com/ClickHouse/ClickHouse/issues/13574
Next 1: In the upcoming PR the plan is to integrate de-duplication based on
these fingerprints in the query pipeline.
Next 2: We'll enable automatic generation of uuids and come up with a
way for conditionally sending uuids when processing distributed queries
only when part movement is in progress.
Sometimes it is odd to get the TLD itself from
cutToFirstSignificantSubdomain() (since you will not get the TLD itself
if you pass it directly):
- cutToFirstSignificantSubdomain('org') -> ""
- cutToFirstSignificantSubdomain('www.org') -> org
- cutToFirstSignificantSubdomain('kernel.org') -> kernel.org
- cutToFirstSignificantSubdomain('www.kernel.org') -> kernel.org
So add one more function to get www.org in this case:
- cutToFirstSignificantSubdomainWithWWW('org') -> ""
- cutToFirstSignificantSubdomainWithWWW('www.org') -> www.org
- cutToFirstSignificantSubdomainWithWWW('kernel.org') -> kernel.org
- cutToFirstSignificantSubdomainWithWWW('www.kernel.org') -> kernel.org
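The difference between the two functions can be modelled with a toy that only reproduces the examples above. It assumes the TLD is always the last dot-separated label, which the real functions do not; the name cutToFirstSignificantSubdomainToy and the with_www flag are made up for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model, NOT the real implementation: the TLD is assumed to be the
// last label of the host.
std::string cutToFirstSignificantSubdomainToy(const std::string & host, bool with_www)
{
    const std::size_t last_dot = host.rfind('.');
    if (last_dot == std::string::npos)
        return ""; // a bare TLD such as "org"

    const std::size_t prev_dot
        = (last_dot == 0) ? std::string::npos : host.rfind('.', last_dot - 1);
    if (prev_dot != std::string::npos)
        return host.substr(prev_dot + 1); // "www.kernel.org" -> "kernel.org"

    // Exactly two labels: the plain variant drops a leading "www".
    if (!with_www && host.compare(0, last_dot, "www") == 0)
        return host.substr(last_dot + 1); // "www.org" -> "org"
    return host;                          // "kernel.org", or "www.org" with with_www
}
```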
P.S. not sure about the naming though, so it would be great if someone
has a suggestion for the name.