Align all the SSE4.1 requirement from StringSearcher. Use needle_size
in while loop to make the code clean.
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
The Kafka table engine allows global configuration and per-Kafka-topic
configuration. The latter uses syntax <kafka_TOPIC>, e.g. for topic
"football":
<kafka_football>
<retry_backoff_ms>250</retry_backoff_ms>
<fetch_min_bytes>100000</fetch_min_bytes>
</kafka_football>
Some users had to find out the hard way that such configuration doesn't
take effect if the topic name contains a period, e.g. "sports.football".
The reason is that ClickHouse configuration framework already uses
periods as level separators to descend the configuration hierarchy.
(Besides that, per-topic configuration at the same level as global
configuration could be considered ugly.)
Note that Kafka topics may contain characters "a-zA-Z0-9._-" (*) and
a tree-like topic organization using periods is quite common in
practice.
This PR deprecates the existing per-topic configuration syntax (but
continues to support it for backward compat) and introduces a new
per-topic configuration syntax below the global Kafka configuration of
the form:
<kafka>
<topic name="football">
<retry_backoff_ms>250</retry_backoff_ms>
<fetch_min_bytes>100000</fetch_min_bytes>
</topic>
</kafka>
The period restriction doesn't apply to XML attributes, so <topic
name="sports.football"> will work. Also, everything Kafka-related is
below <kafka>.
Considered but rejected alternatives:
- Extending Poco ConfigurationView with custom separators (e.g."/"
instead of "."). Won't work easily because ConfigurationView only
builds a path but defers descending the configuration tree to the
normal configuration classes.
- Reloading the configuration file in StorageKafka (instead of reading
the loaded file) but with a custom separator. This mode is supported
by XML configuration. Too ugly and error-prone since the true
configuration is composed from multiple configuration files.
(*) https://stackoverflow.com/a/37067544
This patch has revised the name of value and added comments to make
the SIMD StringSearcher clean and easy to understand based on pull
request 46289.
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
This patch offers an additional optimization when the needle_size is
large. If the needle_size is larger than the haystack_size, there is
no need to search any more.
The optimized SIMD StringSearcher has led at most 41.7% than Volnitsky
algorithm when the needle_size is less than 21, and fallen behind only
about 1% even when the needle_size is bigger than 50, which is not
considered as a common case.
Test platform: ICX server
Test query: SELECT COUNT(*) FROM hits WHERE URL LIKE '%{Needle}%';
Needle_size opt/baseline
5 141.7%
6 129.4%
8 118.5%
9 112.3%
10 107.4%
14 103.4%
20 100.2%
21 100.7%
51 99.0%
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
This patch offers the optimized SIMD StringSearcher by searching the first
and second chars together rather than only the first char, which will result
in big performance gain. The patch also provides a quick path when the needle
size is 1.
With this patch, I have tested the 43 queries in clickbench on ICX server.
Query 20 has got 35% performance gain. Other StringSearcher related queries
have got around 10% performance improvement. And the overall geomean of all
the queries has got 4.1% performance gain.
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>