ClickHouse/src/Functions/URL
Azat Khuzhin 1d4a7c7290 Add support of !/* (exclamation/asterisk) in custom TLDs
The public suffix list may contain special characters (the format is
described in [1]):
- asterisk (*)
- exclamation mark (!)

  [1]: https://github.com/publicsuffix/list/wiki/Format

It is easier to describe how they should be interpreted with examples.

Consider the following part of the list:

    *.sch.uk
    *.kawasaki.jp
    !city.kawasaki.jp

And here are the results for `cutToFirstSignificantSubdomainCustom()`:

If you have only asterisk (*):

    foo.something.sheffield.sch.uk -> something.sheffield.sch.uk
    sheffield.sch.uk               -> sheffield.sch.uk

If you have exclamation mark (!) too:

    foo.kawasaki.jp                -> foo.kawasaki.jp
    foo.foo.kawasaki.jp            -> foo.foo.kawasaki.jp
    city.kawasaki.jp               -> city.kawasaki.jp
    some.city.kawasaki.jp          -> city.kawasaki.jp
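
The mapping above can be reproduced with a small standalone sketch. This is
not the code from this patch: `SuffixRules` and
`cutToFirstSignificantSubdomainSketch()` are hypothetical names, the rules
are kept in plain `std::unordered_set`s instead of the real TLD hash table,
and returning the host unchanged when no rule matches is an assumption made
only to keep the example short.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

/// Hypothetical container for the three kinds of public suffix rules.
struct SuffixRules
{
    std::unordered_set<std::string> exact;      /// "co.uk"
    std::unordered_set<std::string> wildcard;   /// "*.sch.uk" is stored as "sch.uk"
    std::unordered_set<std::string> exception;  /// "!city.kawasaki.jp" is stored as "city.kawasaki.jp"

    void add(std::string_view rule)
    {
        if (rule.substr(0, 2) == "*.")
            wildcard.emplace(rule.substr(2));
        else if (rule.substr(0, 1) == "!")
            exception.emplace(rule.substr(1));
        else
            exact.emplace(rule);
    }
};

/// Returns the public suffix plus one more label to the left (if there is one).
std::string cutToFirstSignificantSubdomainSketch(std::string_view host, const SuffixRules & rules)
{
    /// Offsets at which each label of the host starts.
    std::vector<size_t> starts{0};
    for (size_t i = 0; i < host.size(); ++i)
        if (host[i] == '.')
            starts.push_back(i + 1);

    /// Try candidates from the longest (whole host) to the shortest (last label).
    for (size_t n = 0; n < starts.size(); ++n)
    {
        std::string candidate(host.substr(starts[n]));

        /// An exception rule cancels the wildcard: the candidate itself is the result.
        if (rules.exception.count(candidate))
            return candidate;

        /// The candidate is a public suffix if it is listed as is,
        /// or if its parent is covered by a wildcard rule.
        bool is_suffix = rules.exact.count(candidate) > 0;
        if (!is_suffix && n + 1 < starts.size())
            is_suffix = rules.wildcard.count(std::string(host.substr(starts[n + 1]))) > 0;

        /// Keep one extra label in front of the suffix when the host has one.
        if (is_suffix)
            return std::string(host.substr(n == 0 ? 0 : starts[n - 1]));
    }

    /// Assumption of this sketch: no rule matched, return the host unchanged.
    return std::string(host);
}
```

With the three rules from the list above added via `add()`, the sketch
gives the same results as the two tables, e.g. `some.city.kawasaki.jp` is
cut to `city.kawasaki.jp` because the exception rule is checked before the
wildcard.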

The TLDs have been verified with the following script [2], to match the
Python publicsuffix2 module.

  [2]: https://gist.github.com/azat/c1a7a9f1e3519793134ef4b1df5461a6

v2: fix StringHashTable padding requirements
Fixes: #39468
Follow-up for: #17748
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
2022-07-26 08:34:30 +03:00
basename.cpp
CMakeLists.txt Add separate option to omit symbols from heavy contrib 2022-07-02 06:32:03 +03:00
config_functions_url.h.in
cutFragment.cpp dbms/ → src/ 2020-04-03 18:14:31 +03:00
cutQueryString.cpp
cutQueryStringAndFragment.cpp
cutToFirstSignificantSubdomain.cpp
cutToFirstSignificantSubdomainCustom.cpp
cutURLParameter.cpp
cutWWW.cpp Rename "common" to "base" 2021-10-02 10:13:14 +03:00
decodeURLComponent.cpp Function encodeURLComponent minor fixes 2022-02-17 18:34:23 +00:00
domain.cpp
domain.h First try at reducing the use of StringRef 2022-07-17 17:26:02 +00:00
domainWithoutWWW.cpp
ExtractFirstSignificantSubdomain.h Add support of !/* (exclamation/asterisk) in custom TLDs 2022-07-26 08:34:30 +03:00
extractURLParameter.cpp
extractURLParameterNames.cpp finish dev 2022-01-30 09:10:27 +08:00
extractURLParameters.cpp finish dev 2022-01-30 09:10:27 +08:00
firstSignificantSubdomain.cpp
firstSignificantSubdomainCustom.cpp
FirstSignificantSubdomainCustomImpl.h Add support of !/* (exclamation/asterisk) in custom TLDs 2022-07-26 08:34:30 +03:00
fragment.cpp
fragment.h
FunctionsURL.h Remove Arcadia 2022-04-16 00:20:47 +02:00
netloc.cpp First try at reducing the use of StringRef 2022-07-17 17:26:02 +00:00
path.cpp
path.h
pathFull.cpp
port.cpp First try at reducing the use of StringRef 2022-07-17 17:26:02 +00:00
protocol.cpp
protocol.h First try at reducing the use of StringRef 2022-07-17 17:26:02 +00:00
queryString.cpp
queryString.h
queryStringAndFragment.cpp
queryStringAndFragment.h
registerFunctionsURL.cpp to #31092_add_encodeURLComponent_function 2022-02-16 10:19:20 +08:00
tldLookup.generated.cpp Enable clang-tidy modernize-deprecated-headers & hicpp-deprecated-headers 2022-05-09 08:23:33 +02:00
tldLookup.gperf
tldLookup.h Add ability to use custom TLD list 2020-12-09 21:08:22 +03:00
tldLookup.sh
topLevelDomain.cpp First try at reducing the use of StringRef 2022-07-17 17:26:02 +00:00
URLHierarchy.cpp finish dev 2022-01-30 09:10:27 +08:00
URLPathHierarchy.cpp finish dev 2022-01-30 09:10:27 +08:00