The public suffix list may contain special characters (the format is
described in [1]):
- asterisk (*)
- exclamation mark (!)
[1]: https://github.com/publicsuffix/list/wiki/Format
It is easier to describe how they should be interpreted with an example.
Consider the following part of the list:
*.sch.uk
*.kawasaki.jp
!city.kawasaki.jp
And here are the results for `cutToFirstSignificantSubdomainCustom()`:
If you have only an asterisk (*):
foo.something.sheffield.sch.uk -> something.sheffield.sch.uk
sheffield.sch.uk -> sheffield.sch.uk
If you have an exclamation mark (!) too:
foo.kawasaki.jp -> foo.kawasaki.jp
foo.foo.kawasaki.jp -> foo.foo.kawasaki.jp
city.kawasaki.jp -> city.kawasaki.jp
some.city.kawasaki.jp -> city.kawasaki.jp
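The results above can be reproduced with a small Python sketch. This is
not the ClickHouse implementation: the names parse_rules,
public_suffix_length and cut_sketch are hypothetical, and returning the
host unchanged when it is itself a public suffix is an assumption taken
from the examples listed above.

```python
def parse_rules(lines):
    """Split public-suffix-list lines into exact, wildcard and exception sets."""
    exact, wildcard, exception = set(), set(), set()
    for line in lines:
        rule = line.strip()
        if not rule or rule.startswith("//"):
            continue  # skip blank lines and comments
        if rule.startswith("!"):
            exception.add(rule[1:])   # "!city.kawasaki.jp" -> "city.kawasaki.jp"
        elif rule.startswith("*."):
            wildcard.add(rule[2:])    # "*.sch.uk" -> "sch.uk"
        else:
            exact.add(rule)
    return exact, wildcard, exception


def public_suffix_length(labels, rules):
    """Return how many trailing labels of the host form its public suffix."""
    exact, wildcard, exception = rules
    n = len(labels)
    # Exception rules take priority: the suffix is the rule minus its
    # leftmost label.
    for i in range(n):
        if ".".join(labels[i:]) in exception:
            return n - i - 1
    # Otherwise the longest matching exact or wildcard rule wins.
    for i in range(n):
        if ".".join(labels[i:]) in exact:
            return n - i
        if i + 1 < n and ".".join(labels[i + 1:]) in wildcard:
            return n - i
    return 1  # no rule matched: treat the last label as the suffix


def cut_sketch(host, rules):
    """Keep the public suffix plus one more label (assumed behaviour)."""
    labels = host.split(".")
    suffix_len = public_suffix_length(labels, rules)
    if len(labels) <= suffix_len:
        return host  # assumption: a host that is itself a suffix is kept as is
    return ".".join(labels[-(suffix_len + 1):])


rules = parse_rules(["*.sch.uk", "*.kawasaki.jp", "!city.kawasaki.jp"])
for host in ("foo.something.sheffield.sch.uk", "some.city.kawasaki.jp"):
    print(host, "->", cut_sketch(host, rules))
```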
TLDs have been verified with the following script [2] to match the
Python publicsuffix2 module.
[2]: https://gist.github.com/azat/c1a7a9f1e3519793134ef4b1df5461a6
v2: fix StringHashTable padding requirements
Fixes: #39468
Follow-up for: #17748
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Sometimes it is useful to build contrib with debug symbols for further
debugging.
With everything turned ON (i.e. debug build) I got 3.3GB vs 3.0GB
without this patch, a 9% bloat. Is this OK for you? If not,
STRIP_DEBUG_SYMBOLS_HEAVY_CONTRIB can be OFF by default (regardless of
build type).
P.S. aws debug symbols add just 1.7%.
v2: rename STRIP_HEAVY_DEBUG_SYMBOLS
v3: OMIT_HEAVY_DEBUG_SYMBOLS
v4: documentation has been removed
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.
(*) Addresses (but does not resolve) #38046
(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
preserved for compatibility. Likewise, error codes (e.g.
"ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
"HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
replacement.
[0] https://www.youtube.com/watch?v=KlZWmmflW6M
Official docs:
Some headers from C library were deprecated in C++ and are no longer
welcome in C++ codebases. Some have no effect in C++. For more details
refer to the C++ 14 Standard [depr.c.headers] section. This check
replaces C standard library headers with their C++ alternatives and
removes redundant ones.
When I tried to add cool new clang-tidy 14 warnings, I noticed that the
current clang-tidy settings already produce a ton of warnings. This
commit addresses many of these. Almost all of them were non-critical,
e.g. C vs. C++ style casts.
Custom TLD lists (added in #17748) may contain third-level domains;
however, the builtin TLD list does not have such records, so it is not
affected.
Note that this will significantly increase hashtable lookups.
Fixes: #17748
v2: Add a note that top_level_domains_lists are not applied w/o restart
v3: Remove ExtractFirstSignificantSubdomain{Default,Custom}Lookup.h headers
v4: TLDListsHolder: remove FIXME for dense_hash_map (this is not significant)
Sometimes it is odd to get the TLD itself from
cutToFirstSignificantSubdomain() (since you will not get the TLD itself
if you pass it directly):
- cutToFirstSignificantSubdomain('org') -> ""
- cutToFirstSignificantSubdomain('www.org') -> org
- cutToFirstSignificantSubdomain('kernel.org') -> kernel.org
- cutToFirstSignificantSubdomain('www.kernel.org') -> kernel.org
So add one more function to get www.org in this case:
- cutToFirstSignificantSubdomainWithWWW('org') -> ""
- cutToFirstSignificantSubdomainWithWWW('www.org') -> www.org
- cutToFirstSignificantSubdomainWithWWW('kernel.org') -> kernel.org
- cutToFirstSignificantSubdomainWithWWW('www.kernel.org') -> kernel.org
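The difference between the two functions can be modelled with a toy
Python sketch. It only reproduces the eight results listed above and is
not the ClickHouse implementation: the name cut_with_optional_www and
its keep_www flag are hypothetical, and it assumes a single-label TLD
and that the only difference is whether a leading "www." is stripped
first.

```python
def cut_with_optional_www(host, keep_www=False):
    """Toy model: keep at most the last two labels of the host."""
    if "." not in host:
        return ""  # a bare TLD such as 'org' yields an empty result
    if not keep_www and host.startswith("www."):
        # Plain variant: drop the leading "www." before cutting.
        host = host[len("www."):]
    labels = host.split(".")
    return ".".join(labels[-2:])


for host in ("org", "www.org", "kernel.org", "www.kernel.org"):
    plain = cut_with_optional_www(host)
    with_www = cut_with_optional_www(host, keep_www=True)
    print(f"{host!r}: plain={plain!r} with_www={with_www!r}")
```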
P.S. not sure about the naming though, so it would be great if someone
has a suggestion for the name.