Merge pull request #48966 from ClickHouse/rs/string-function-cleanup

Clean up string search function docs
This commit is contained in:
Robert Schulze 2023-04-20 11:53:52 +02:00 committed by GitHub
commit ba0adf705e
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 228 additions and 505 deletions

View File

@ -6,9 +6,7 @@ sidebar_label: Strings
# Functions for Working with Strings # Functions for Working with Strings
:::note
Functions for [searching](../../sql-reference/functions/string-search-functions.md) in strings and for [replacing](../../sql-reference/functions/string-replace-functions.md) in strings are described separately. Functions for [searching](../../sql-reference/functions/string-search-functions.md) in strings and for [replacing](../../sql-reference/functions/string-replace-functions.md) in strings are described separately.
:::
## empty ## empty

View File

@ -6,9 +6,7 @@ sidebar_label: Replacing in Strings
# Functions for Searching and Replacing in Strings # Functions for Searching and Replacing in Strings
:::note
Functions for [searching](../../sql-reference/functions/string-search-functions.md) and [other manipulations with strings](../../sql-reference/functions/string-functions.md) are described separately. Functions for [searching](../../sql-reference/functions/string-search-functions.md) and [other manipulations with strings](../../sql-reference/functions/string-functions.md) are described separately.
:::
## replaceOne(haystack, pattern, replacement) ## replaceOne(haystack, pattern, replacement)

View File

@ -6,19 +6,22 @@ sidebar_label: Searching in Strings
# Functions for Searching in Strings # Functions for Searching in Strings
The search is case-sensitive by default in all these functions. There are separate variants for case insensitive search. All functions in this section search by default case-sensitively. Case-insensitive search is usually provided by separate function variants.
Note that case-insensitive search follows the lowercase-uppercase rules of the English language. E.g. Uppercased `i` in English language is
`I` whereas in Turkish language it is `İ` - results for languages other than English may be unexpected.
:::note Functions in this section also assume that the searched string and the search string are single-byte encoded text. If this assumption is
Functions for [replacing](../../sql-reference/functions/string-replace-functions.md) and [other manipulations with strings](../../sql-reference/functions/string-functions.md) are described separately. violated, no exception is thrown and results are undefined. Search with UTF-8 encoded strings is usually provided by separate function
::: variants. Likewise, if a UTF-8 function variant is used and the input strings are not UTF-8 encoded text, no exception is thrown and the
results are undefined. Note that no automatic Unicode normalization is performed, you can use the
[normalizeUTF8*()](https://clickhouse.com/docs/en/sql-reference/functions/string-functions/) functions for that.
## position(haystack, needle), locate(haystack, needle)
Searches for the substring `needle` in the string `haystack`. [General strings functions](../../sql-reference/functions/string-functions.md) and [functions for replacing in strings](../../sql-reference/functions/string-replace-functions.md) are described separately.
Returns the position (in bytes) of the found substring in the string, starting from 1. ## position
For a case-insensitive search, use the function [positionCaseInsensitive](#positioncaseinsensitive). Returns the position (in bytes, starting at 1) of a substring `needle` in a string `haystack`.
**Syntax** **Syntax**
@ -26,35 +29,33 @@ For a case-insensitive search, use the function [positionCaseInsensitive](#posit
position(haystack, needle[, start_pos]) position(haystack, needle[, start_pos])
``` ```
``` sql Alias:
position(needle IN haystack) - `position(needle IN haystack)`
``` - `locate(haystack, needle[, start_pos])`.
Alias: `locate(haystack, needle[, start_pos])`.
:::note
Syntax of `position(needle IN haystack)` provides SQL-compatibility, the function works the same way as to `position(haystack, needle)`.
:::
**Arguments** **Arguments**
- `haystack` — String, in which substring will to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal). - `haystack` — String in which the search is performed. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal). - `needle` — Substring to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` Position of the first character in the string to start search. [UInt](../../sql-reference/data-types/int-uint.md). Optional. - `start_pos` Position (1-based) in `haystack` at which the search starts. [UInt](../../sql-reference/data-types/int-uint.md). Optional.
**Returned values** **Returned values**
- Starting position in bytes (counting from 1), if substring was found. - Starting position in bytes and counting from 1, if the substring was found.
- 0, if the substring was not found. - 0, if the substring was not found.
If substring `needle` is empty, these rules apply:
- if no `start_pos` was specified: return `1`
- if `start_pos = 0`: return `1`
- if `start_pos >= 1` and `start_pos <= length(haystack) + 1`: return `start_pos`
- otherwise: return `0`
The same rules also apply to functions `positionCaseInsensitive`, `positionUTF8` and `positionCaseInsensitiveUTF8`
Type: `Integer`. Type: `Integer`.
**Examples** **Examples**
The phrase “Hello, world!” contains a set of bytes representing a single-byte encoded text. The function returns some expected result:
Query:
``` sql ``` sql
SELECT position('Hello, world!', '!'); SELECT position('Hello, world!', '!');
``` ```
@ -67,6 +68,8 @@ Result:
└────────────────────────────────┘ └────────────────────────────────┘
``` ```
Example with `start_pos` argument:
``` sql ``` sql
SELECT SELECT
position('Hello, world!', 'o', 1), position('Hello, world!', 'o', 1),
@ -79,29 +82,21 @@ SELECT
└───────────────────────────────────┴───────────────────────────────────┘ └───────────────────────────────────┴───────────────────────────────────┘
``` ```
The same phrase in Russian contains characters which cant be represented using a single byte. The function returns some unexpected result (use [positionUTF8](#positionutf8) function for multi-byte encoded text): Example for `needle IN haystack` syntax:
Query: ```sql
SELECT 6 = position('/' IN s) FROM (SELECT 'Hello/World' AS s);
``` sql
SELECT position('Привет, мир!', '!');
``` ```
Result: Result:
``` text ```text
┌─position('Привет, мир!', '!')─┐ ┌─equals(6, position(s, '/'))─┐
21 │ │ 1 │
└─────────────────────────────── └─────────────────────────────┘
``` ```
If argument `needle` is empty the following rules apply: Examples with empty `needle` substring:
- if no `start_pos` was specified: return `1`
- if `start_pos = 0`: return `1`
- if `start_pos >= 1` and `start_pos <= length(haystack) + 1`: return `start_pos`
- otherwise: return `0`
The same rules also apply to functions `positionCaseInsensitive`, `positionUTF8` and `positionCaseInsensitiveUTF8`
``` sql ``` sql
SELECT SELECT
@ -120,223 +115,59 @@ SELECT
└─────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┘ └─────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┘
``` ```
**Examples for POSITION(needle IN haystack) syntax**
Query:
```sql
SELECT 3 = position('c' IN 'abc');
```
Result:
```text
┌─equals(3, position('abc', 'c'))─┐
│ 1 │
└─────────────────────────────────┘
```
Query:
```sql
SELECT 6 = position('/' IN s) FROM (SELECT 'Hello/World' AS s);
```
Result:
```text
┌─equals(6, position(s, '/'))─┐
│ 1 │
└─────────────────────────────┘
```
## positionCaseInsensitive ## positionCaseInsensitive
The same as [position](#position) returns the position (in bytes) of the found substring in the string, starting from 1. Use the function for a case-insensitive search. Like [position](#position) but searches case-insensitively.
Works under the assumption that the string contains a set of bytes representing a single-byte encoded text. If this assumption is not met and a character cant be represented using a single byte, the function does not throw an exception and returns some unexpected result. If character can be represented using two bytes, it will use two bytes and so on.
**Syntax**
``` sql
positionCaseInsensitive(haystack, needle[, start_pos])
```
**Arguments**
- `haystack` — String, in which substring will to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` — Optional parameter, position of the first character in the string to start search. [UInt](../../sql-reference/data-types/int-uint.md).
**Returned values**
- Starting position in bytes (counting from 1), if substring was found.
- 0, if the substring was not found.
Type: `Integer`.
**Example**
Query:
``` sql
SELECT positionCaseInsensitive('Hello, world!', 'hello');
```
Result:
``` text
┌─positionCaseInsensitive('Hello, world!', 'hello')─┐
│ 1 │
└───────────────────────────────────────────────────┘
```
## positionUTF8 ## positionUTF8
Returns the position (in Unicode points) of the found substring in the string, starting from 1. Like [position](#position) but assumes `haystack` and `needle` are UTF-8 encoded strings.
Works under the assumption that the string contains a set of bytes representing a UTF-8 encoded text. If this assumption is not met, the function does not throw an exception and returns some unexpected result. If character can be represented using two Unicode points, it will use two and so on.
For a case-insensitive search, use the function [positionCaseInsensitiveUTF8](#positioncaseinsensitiveutf8).
**Syntax**
``` sql
positionUTF8(haystack, needle[, start_pos])
```
**Arguments**
- `haystack` — String, in which substring will to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` — Optional parameter, position of the first character in the string to start search. [UInt](../../sql-reference/data-types/int-uint.md)
**Returned values**
- Starting position in Unicode points (counting from 1), if substring was found.
- 0, if the substring was not found.
Type: `Integer`.
**Examples** **Examples**
The phrase “Hello, world!” in Russian contains a set of Unicode points representing a single-point encoded text. The function returns some expected result: Function `positionUTF8` correctly counts character `ö` (represented by two points) as a single Unicode codepoint:
Query:
``` sql ``` sql
SELECT positionUTF8('Привет, мир!', '!'); SELECT positionUTF8('Motörhead', 'r');
``` ```
Result: Result:
``` text ``` text
┌─positionUTF8('Привет, мир!', '!')─┐ ┌─position('Motörhead', 'r')─┐
│ 12 │ │ 5 │
└───────────────────────────────────┘ └────────────────────────────┘
```
The phrase “Salut, étudiante!”, where character `é` can be represented using a one point (`U+00E9`) or two points (`U+0065U+0301`) the function can be returned some unexpected result:
Query for the letter `é`, which is represented one Unicode point `U+00E9`:
``` sql
SELECT positionUTF8('Salut, étudiante!', '!');
```
Result:
``` text
┌─positionUTF8('Salut, étudiante!', '!')─┐
│ 17 │
└────────────────────────────────────────┘
```
Query for the letter `é`, which is represented two Unicode points `U+0065U+0301`:
``` sql
SELECT positionUTF8('Salut, étudiante!', '!');
```
Result:
``` text
┌─positionUTF8('Salut, étudiante!', '!')─┐
│ 18 │
└────────────────────────────────────────┘
``` ```
## positionCaseInsensitiveUTF8 ## positionCaseInsensitiveUTF8
The same as [positionUTF8](#positionutf8), but is case-insensitive. Returns the position (in Unicode points) of the found substring in the string, starting from 1. Like [positionUTF8](#positionutf8) but searches case-insensitively.
Works under the assumption that the string contains a set of bytes representing a UTF-8 encoded text. If this assumption is not met, the function does not throw an exception and returns some unexpected result. If character can be represented using two Unicode points, it will use two and so on.
**Syntax**
``` sql
positionCaseInsensitiveUTF8(haystack, needle[, start_pos])
```
**Arguments**
- `haystack` — String, in which substring will to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` — Optional parameter, position of the first character in the string to start search. [UInt](../../sql-reference/data-types/int-uint.md)
**Returned value**
- Starting position in Unicode points (counting from 1), if substring was found.
- 0, if the substring was not found.
Type: `Integer`.
**Example**
Query:
``` sql
SELECT positionCaseInsensitiveUTF8('Привет, мир!', 'Мир');
```
Result:
``` text
┌─positionCaseInsensitiveUTF8('Привет, мир!', 'Мир')─┐
│ 9 │
└────────────────────────────────────────────────────┘
```
## multiSearchAllPositions ## multiSearchAllPositions
The same as [position](../../sql-reference/functions/string-search-functions.md#position) but returns `Array` of positions (in bytes) of the found corresponding substrings in the string. Positions are indexed starting from 1. Like [position](#position) but returns an array of positions (in bytes, starting at 1) for multiple `needle` substrings in a `haystack` string.
The search is performed on sequences of bytes without respect to string encoding and collation. :::note
All `multiSearch*()` functions only support up to 2<sup>8</sup> needles.
- For case-insensitive ASCII search, use the function `multiSearchAllPositionsCaseInsensitive`. :::
- For search in UTF-8, use the function [multiSearchAllPositionsUTF8](#multiSearchAllPositionsUTF8).
- For case-insensitive UTF-8 search, use the function multiSearchAllPositionsCaseInsensitiveUTF8.
**Syntax** **Syntax**
``` sql ``` sql
multiSearchAllPositions(haystack, [needle1, needle2, ..., needlen]) multiSearchAllPositions(haystack, [needle1, needle2, ..., needleN])
``` ```
**Arguments** **Arguments**
- `haystack` — String, in which substring will to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal). - `haystack` — String in which the search is performed. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal). - `needle` — Substrings to be searched. Array
**Returned values** **Returned values**
- Array of starting positions in bytes (counting from 1), if the corresponding substring was found and 0 if not found. - Array of the starting position in bytes and counting from 1 (if the substring was found) or 0 (if the substring was not found)
**Example** **Example**
Query:
``` sql ``` sql
SELECT multiSearchAllPositions('Hello, World!', ['hello', '!', 'world']); SELECT multiSearchAllPositions('Hello, World!', ['hello', '!', 'world']);
``` ```
@ -351,103 +182,172 @@ Result:
## multiSearchAllPositionsUTF8 ## multiSearchAllPositionsUTF8
See `multiSearchAllPositions`. Like [multiSearchAllPositions](#multiSearchAllPositions) but assumes `haystack` and the `needle`-s are UTF-8 encoded strings.
## multiSearchFirstPosition(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) ## multiSearchFirstPosition
The same as `position` but returns the leftmost offset of the string `haystack` that is matched to some of the needles. Like `position` but returns the leftmost offset in a `haystack` string which matches any of multiple `needle` strings.
For a case-insensitive search or/and in UTF-8 format use functions `multiSearchFirstPositionCaseInsensitive, multiSearchFirstPositionUTF8, multiSearchFirstPositionCaseInsensitiveUTF8`. Functions `multiSearchFirstPositionCaseInsensitive`, `multiSearchFirstPositionUTF8` and `multiSearchFirstPositionCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function.
## multiSearchFirstIndex(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) **Syntax**
```sql
multiSearchFirstPosition(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\])
```
## multiSearchFirstIndex
Returns the index `i` (starting from 1) of the leftmost found needle<sub>i</sub> in the string `haystack` and 0 otherwise. Returns the index `i` (starting from 1) of the leftmost found needle<sub>i</sub> in the string `haystack` and 0 otherwise.
For a case-insensitive search or/and in UTF-8 format use functions `multiSearchFirstIndexCaseInsensitive, multiSearchFirstIndexUTF8, multiSearchFirstIndexCaseInsensitiveUTF8`. Functions `multiSearchFirstIndexCaseInsensitive`, `multiSearchFirstIndexUTF8` and `multiSearchFirstIndexCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function.
## multiSearchAny(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) **Syntax**
```sql
multiSearchFirstIndex(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\])
```
## multiSearchAny
Returns 1, if at least one string needle<sub>i</sub> matches the string `haystack` and 0 otherwise. Returns 1, if at least one string needle<sub>i</sub> matches the string `haystack` and 0 otherwise.
For a case-insensitive search or/and in UTF-8 format use functions `multiSearchAnyCaseInsensitive, multiSearchAnyUTF8, multiSearchAnyCaseInsensitiveUTF8`. Functions `multiSearchAnyCaseInsensitive`, `multiSearchAnyUTF8` and `multiSearchAnyCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function.
:::note **Syntax**
In all `multiSearch*` functions the number of needles should be less than 2<sup>8</sup> because of implementation specification.
:::
## match(haystack, pattern), haystack REGEXP pattern operator ```sql
multiSearchAny(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\])
```
Checks whether string `haystack` matches the regular expression `pattern`. The pattern is an [re2 regular expression](https://github.com/google/re2/wiki/Syntax) which has a more limited syntax than Perl regular expressions. ## match
Returns 1 in case of a match, and 0 otherwise. Returns whether string `haystack` matches the regular expression `pattern` in [re2 regular syntax](https://github.com/google/re2/wiki/Syntax).
Matching is based on UTF-8, e.g. `.` matches the Unicode code point `¥` which is represented in UTF-8 using two bytes. The regular expression must not contain null bytes. Matching is based on UTF-8, e.g. `.` matches the Unicode code point `¥` which is represented in UTF-8 using two bytes. The regular
If the haystack or the pattern are not valid UTF-8, then the behavior is undefined. expression must not contain null bytes. If the haystack or the pattern are not valid UTF-8, then the behavior is undefined.
No automatic Unicode normalization is performed, you can use the [normalizeUTF8*()](https://clickhouse.com/docs/en/sql-reference/functions/string-functions/) functions for that.
Unlike re2's default behavior, `.` matches line breaks. To disable this, prepend the pattern with `(?-s)`. Unlike re2's default behavior, `.` matches line breaks. To disable this, prepend the pattern with `(?-s)`.
For patterns to search for substrings in a string, it is better to use functions [like](#like) or [position](#position) since they work much faster. If you only want to search substrings in a string, you can use functions [like](#like) or [position](#position) instead - they work much faster than this function.
## multiMatchAny(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) **Syntax**
The same as `match`, but returns 0 if none of the regular expressions are matched and 1 if any of the patterns matches. For patterns to search substrings in a string, it is better to use `multiSearchAny` since it works much faster. ```sql
match(haystack, pattern)
```
Alias: `haystack REGEXP pattern operator`
## multiMatchAny
Like `match` but returns 1 if at least one of the patterns match and 0 otherwise.
:::note :::note
Functions `multiMatchAny`, `multiMatchAnyIndex`, `multiMatchAllIndices` and their fuzzy equivalents (`multiFuzzyMatchAny`, Functions in the `multi[Fuzzy]Match*()` family use the the (Vectorscan)[https://github.com/VectorCamp/vectorscan] library. As such, they are only enabled if ClickHouse is compiled with support for vectorscan.
`multiFuzzyMatchAnyIndex`, `multiFuzzyMatchAllIndices`) use the (Vectorscan)[https://github.com/VectorCamp/vectorscan] library. As such,
they are only enabled if ClickHouse is compiled with support for vectorscan. To turn off all functions that use hyperscan, use setting `SET allow_hyperscan = 0;`.
Due to restrictions of vectorscan, the length of the `haystack` string must be less than 2<sup>32</sup> bytes. Due to restrictions of vectorscan, the length of the `haystack` string must be less than 2<sup>32</sup> bytes.
Hyperscan is generally vulnerable to regular expression denial of service (ReDoS) attacks (e.g. see Hyperscan is generally vulnerable to regular expression denial of service (ReDoS) attacks (e.g. see
(here)[https://www.usenix.org/conference/usenixsecurity22/presentation/turonova], (here)[https://doi.org/10.1007/s10664-021-10033-1] and (here)[https://www.usenix.org/conference/usenixsecurity22/presentation/turonova], (here)[https://doi.org/10.1007/s10664-021-10033-1] and
(here)[ https://doi.org/10.1145/3236024.3236027]. Users are adviced to check the provided patterns carefully. (here)[https://doi.org/10.1145/3236024.3236027]. Users are adviced to check the provided patterns carefully.
::: :::
## multiMatchAnyIndex(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) If you only want to search multiple substrings in a string, you can use function [multiSearchAny](#multisearchany) instead - it works much faster than this function.
The same as `multiMatchAny`, but returns any index that matches the haystack. **Syntax**
## multiMatchAllIndices(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) ```sql
multiMatchAny(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
```
The same as `multiMatchAny`, but returns the array of all indices that match the haystack in any order. ## multiMatchAnyIndex
## multiFuzzyMatchAny(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) Like `multiMatchAny` but returns any index that matches the haystack.
The same as `multiMatchAny`, but returns 1 if any pattern matches the haystack within a constant [edit distance](https://en.wikipedia.org/wiki/Edit_distance). This function relies on the experimental feature of [hyperscan](https://intel.github.io/hyperscan/dev-reference/compilation.html#approximate-matching) library, and can be slow for some corner cases. The performance depends on the edit distance value and patterns used, but it's always more expensive compared to a non-fuzzy variants. **Syntax**
## multiFuzzyMatchAnyIndex(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) ```sql
multiMatchAnyIndex(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
```
The same as `multiFuzzyMatchAny`, but returns any index that matches the haystack within a constant edit distance. ## multiMatchAllIndices
## multiFuzzyMatchAllIndices(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) Like `multiMatchAny` but returns the array of all indices that match the haystack in any order.
The same as `multiFuzzyMatchAny`, but returns the array of all indices in any order that match the haystack within a constant edit distance. **Syntax**
```sql
multiMatchAllIndices(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
```
## multiFuzzyMatchAny
Like `multiMatchAny` but returns 1 if any pattern matches the haystack within a constant [edit distance](https://en.wikipedia.org/wiki/Edit_distance). This function relies on the experimental feature of [hyperscan](https://intel.github.io/hyperscan/dev-reference/compilation.html#approximate-matching) library, and can be slow for some corner cases. The performance depends on the edit distance value and patterns used, but it's always more expensive compared to a non-fuzzy variants.
:::note :::note
`multiFuzzyMatch*` functions do not support UTF-8 regular expressions, and such expressions are treated as bytes because of hyperscan restriction. `multiFuzzyMatch*()` function family do not support UTF-8 regular expressions (it threats them as a sequence of bytes) due to restrictions of hyperscan.
::: :::
:::note **Syntax**
To turn off all functions that use hyperscan, use setting `SET allow_hyperscan = 0;`.
:::
## extract(haystack, pattern) ```sql
multiFuzzyMatchAny(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
```
Extracts a fragment of a string using a regular expression. If haystack does not match the pattern regex, an empty string is returned. If the regex does not contain subpatterns, it takes the fragment that matches the entire regex. Otherwise, it takes the fragment that matches the first subpattern. ## multiFuzzyMatchAnyIndex
## extractAll(haystack, pattern) Like `multiFuzzyMatchAny` but returns any index that matches the haystack within a constant edit distance.
Extracts all the fragments of a string using a regular expression. If haystack does not match the pattern regex, an empty string is returned. Returns an array of strings consisting of all matches to the regex. In general, the behavior is the same as the extract function (it takes the first subpattern, or the entire expression if there isnt a subpattern). **Syntax**
```sql
multiFuzzyMatchAnyIndex(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
```
## multiFuzzyMatchAllIndices
Like `multiFuzzyMatchAny` but returns the array of all indices in any order that match the haystack within a constant edit distance.
**Syntax**
```sql
multiFuzzyMatchAllIndices(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\])
```
## extract
Extracts a fragment of a string using a regular expression. If `haystack` does not match the `pattern` regex, an empty string is returned.
For regex without subpatterns, the function uses the fragment that matches the entire regex. Otherwise, it uses the fragment that matches the first subpattern.
**Syntax**
```sql
extract(haystack, pattern)
```
## extractAll
Extracts all fragments of a string using a regular expression. If `haystack` does not match the `pattern` regex, an empty string is returned.
Returns an array of strings consisting of all matches of the regex.
The behavior with respect to subpatterns is the same as in function `extract`.
**Syntax**
```sql
extractAll(haystack, pattern)
```
## extractAllGroupsHorizontal ## extractAllGroupsHorizontal
Matches all groups of the `haystack` string using the `pattern` regular expression. Returns an array of arrays, where the first array includes all fragments matching the first group, the second array - matching the second group, etc. Matches all groups of the `haystack` string using the `pattern` regular expression. Returns an array of arrays, where the first array includes all fragments matching the first group, the second array - matching the second group, etc.
:::note This function is slower than [extractAllGroupsVertical](#extractallgroups-vertical).
`extractAllGroupsHorizontal` function is slower than [extractAllGroupsVertical](#extractallgroups-vertical).
:::
**Syntax** **Syntax**
@ -468,8 +368,6 @@ If `haystack` does not match the `pattern` regex, an array of empty arrays is re
**Example** **Example**
Query:
``` sql ``` sql
SELECT extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)'); SELECT extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)');
``` ```
@ -482,10 +380,6 @@ Result:
└──────────────────────────────────────────────────────────────────────────────────────────┘ └──────────────────────────────────────────────────────────────────────────────────────────┘
``` ```
**See Also**
- [extractAllGroupsVertical](#extractallgroups-vertical)
## extractAllGroupsVertical ## extractAllGroupsVertical
Matches all groups of the `haystack` string using the `pattern` regular expression. Returns an array of arrays, where each array includes matching fragments from every group. Fragments are grouped in order of appearance in the `haystack`. Matches all groups of the `haystack` string using the `pattern` regular expression. Returns an array of arrays, where each array includes matching fragments from every group. Fragments are grouped in order of appearance in the `haystack`.
@ -509,8 +403,6 @@ If `haystack` does not match the `pattern` regex, an empty array is returned.
**Example** **Example**
Query:
``` sql ``` sql
SELECT extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)'); SELECT extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)');
``` ```
@ -523,119 +415,88 @@ Result:
└────────────────────────────────────────────────────────────────────────────────────────┘ └────────────────────────────────────────────────────────────────────────────────────────┘
``` ```
**See Also** ## like
- [extractAllGroupsHorizontal](#extractallgroups-horizontal) Returns whether string `haystack` matches the LIKE expression `pattern`.
## like(haystack, pattern), haystack LIKE pattern operator A LIKE expression can contain normal characters and the following metasymbols:
Checks whether a string matches a LIKE expression.
A LIKE expression contains a mix of normal characters and the following metasymbols:
- `%` indicates an arbitrary number of arbitrary characters (including zero characters). - `%` indicates an arbitrary number of arbitrary characters (including zero characters).
- `_` indicates a single arbitrary character. - `_` indicates a single arbitrary character.
- `\` is for escaping literals `%`, `_` and `\`. - `\` is for escaping literals `%`, `_` and `\`.
Matching is based on UTF-8, e.g. `_` matches the Unicode code point `¥` which is represented in UTF-8 using two bytes. Matching is based on UTF-8, e.g. `_` matches the Unicode code point `¥` which is represented in UTF-8 using two bytes.
If the haystack or the pattern are not valid UTF-8, then the behavior is undefined.
If the haystack or the LIKE expression are not valid UTF-8, the behavior is undefined.
No automatic Unicode normalization is performed, you can use the [normalizeUTF8*()](https://clickhouse.com/docs/en/sql-reference/functions/string-functions/) functions for that. No automatic Unicode normalization is performed, you can use the [normalizeUTF8*()](https://clickhouse.com/docs/en/sql-reference/functions/string-functions/) functions for that.
To match against literals `%`, `_` and `/` (which are LIKE metacharacters), prepend them with a backslash, i.e. `\%`, `\_` and `\\`. To match against literal `%`, `_` and `/` (which are LIKE metacharacters), prepend them with a backslash: `\%`, `\_` and `\\`.
The backslash loses its special meaning, i.e. is interpreted literally, if it prepends a character different than `%`, `_` or `\`. The backslash loses its special meaning (i.e. is interpreted literally) if it prepends a character different than `%`, `_` or `\`.
Note that ClickHouse requires backslashes in strings [to be quoted as well](../syntax.md#string), so you would actually need to write `\\%`, `\\_` and `\\\\`. Note that ClickHouse requires backslashes in strings [to be quoted as well](../syntax.md#string), so you would actually need to write `\\%`, `\\_` and `\\\\`.
For patterns of the form `%needle%`, the function is as fast as the `position` function. For LIKE expressions of the form `%needle%`, the function is as fast as the `position` function.
Other LIKE expressions are internally converted to a regular expression and executed with a performance similar to function `match`. All other LIKE expressions are internally converted to a regular expression and executed with a performance similar to function `match`.
## notLike(haystack, pattern), haystack NOT LIKE pattern operator
The same thing as `like`, but negative.
## ilike
Case insensitive variant of [like](https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions/#function-like) function. You can use `ILIKE` operator instead of the `ilike` function.
The function ignores the language, e.g. for Turkish (i/İ), the result might be incorrect.
**Syntax** **Syntax**
``` sql ```sql
ilike(haystack, pattern) like(haystack, pattern)
``` ```
**Arguments** Alias: `haystack LIKE pattern` (operator)
- `haystack` — Input string. [String](../../sql-reference/syntax.md#syntax-string-literal). ## notLike
- `pattern` — If `pattern` does not contain percent signs or underscores, then the `pattern` only represents the string itself. An underscore (`_`) in `pattern` stands for (matches) any single character. A percent sign (`%`) matches any sequence of zero or more characters.
Some `pattern` examples: Like `like` but negates the result.
``` text Alias: `haystack NOT LIKE pattern` (operator)
'abc' ILIKE 'abc' true
'abc' ILIKE 'a%' true ## ilike
'abc' ILIKE '_b_' true
'abc' ILIKE 'c' false Like `like` but searches case-insensitively.
Alias: `haystack ILIKE pattern` (operator)
## notILike
Like `ilike` but negates the result.
Alias: `haystack NOT ILIKE pattern` (operator)
## ngramDistance
Calculates the 4-gram distance between a `haystack` string and a `needle` string. For that, it counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a Float32 between 0 and 1. The smaller the result is, the more strings are similar to each other. Throws an exception if constant `needle` or `haystack` arguments are more than 32Kb in size. If any of non-constant `haystack` or `needle` arguments is more than 32Kb in size, the distance is always 1.
Functions `ngramDistanceCaseInsensitive, ngramDistanceUTF8, ngramDistanceCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function.
**Syntax**
```sql
ngramDistance(haystack, needle)
``` ```
**Returned values** ## ngramSearch
- True, if the string matches `pattern`. Like `ngramDistance` but calculates the non-symmetric difference between a `needle` string and a `haystack` string, i.e. the number of n-grams from needle minus the common number of n-grams normalized by the number of `needle` n-grams. Returns a Float32 between 0 and 1. The bigger the result is, the more likely `needle` is in the `haystack`. This function is useful for fuzzy string search. Also see function `soundex`.
- False, if the string does not match `pattern`.
**Example** Functions `ngramSearchCaseInsensitive, ngramSearchUTF8, ngramSearchCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function.
Input table:
``` text
┌─id─┬─name─────┬─days─┐
│ 1 │ January │ 31 │
│ 2 │ February │ 29 │
│ 3 │ March │ 31 │
│ 4 │ April │ 30 │
└────┴──────────┴──────┘
```
Query:
``` sql
SELECT * FROM Months WHERE ilike(name, '%j%');
```
Result:
``` text
┌─id─┬─name────┬─days─┐
│ 1 │ January │ 31 │
└────┴─────────┴──────┘
```
## notILike(haystack, pattern), haystack NOT ILIKE pattern operator
The same thing as `ilike`, but negative.
## ngramDistance(haystack, needle)
Calculates the 4-gram distance between `haystack` and `needle`: counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns float number from 0 to 1 the closer to zero, the more strings are similar to each other. If the constant `needle` or `haystack` is more than 32Kb, throws an exception. If some of the non-constant `haystack` or `needle` strings are more than 32Kb, the distance is always one.
For case-insensitive search or/and in UTF-8 format use functions `ngramDistanceCaseInsensitive, ngramDistanceUTF8, ngramDistanceCaseInsensitiveUTF8`.
## ngramSearch(haystack, needle)
Same as `ngramDistance` but calculates the non-symmetric difference between `needle` and `haystack` the number of n-grams from needle minus the common number of n-grams normalized by the number of `needle` n-grams. The closer to one, the more likely `needle` is in the `haystack`. Can be useful for fuzzy string search.
For case-insensitive search or/and in UTF-8 format use functions `ngramSearchCaseInsensitive, ngramSearchUTF8, ngramSearchCaseInsensitiveUTF8`.
:::note :::note
For UTF-8 case we use 3-gram distance. All these are not perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables collisions may occur. With UTF-8 case-insensitive format we do not use fair `tolower` function we zero the 5-th bit (starting from zero) of each codepoint byte and first bit of zeroth byte if bytes more than one this works for Latin and mostly for all Cyrillic letters. The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables collisions may occur. With UTF-8 case-insensitive format we do not use fair `tolower` function we zero the 5-th bit (starting from zero) of each codepoint byte and first bit of zeroth byte if bytes more than one this works for Latin and mostly for all Cyrillic letters.
::: :::
**Syntax**
```sql
ngramSearch(haystack, needle)
```
## countSubstrings ## countSubstrings
Returns the number of substring occurrences. Returns how often substring `needle` occurs in string `haystack`.
For a case-insensitive search, use [countSubstringsCaseInsensitive](../../sql-reference/functions/string-search-functions.md#countSubstringsCaseInsensitive) or [countSubstringsCaseInsensitiveUTF8](../../sql-reference/functions/string-search-functions.md#countSubstringsCaseInsensitiveUTF8) functions. Functions `countSubstringsCaseInsensitive` and `countSubstringsCaseInsensitiveUTF8` provide a case-insensitive and case-insensitive + UTF-8 variants of this function.
**Syntax** **Syntax**
@ -645,34 +506,18 @@ countSubstrings(haystack, needle[, start_pos])
**Arguments** **Arguments**
- `haystack`The string to search in. [String](../../sql-reference/syntax.md#syntax-string-literal). - `haystack`String in which the search is performed. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle`The substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal). - `needle`Substring to be searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` Position of the first character in the string to start search. Optional. [UInt](../../sql-reference/data-types/int-uint.md). - `start_pos` Position (1-based) in `haystack` at which the search starts. [UInt](../../sql-reference/data-types/int-uint.md). Optional.
**Returned values** **Returned values**
- Number of occurrences. - The number of occurrences.
Type: [UInt64](../../sql-reference/data-types/int-uint.md). Type: [UInt64](../../sql-reference/data-types/int-uint.md).
**Examples** **Examples**
Query:
``` sql
SELECT countSubstrings('foobar.com', '.');
```
Result:
``` text
┌─countSubstrings('foobar.com', '.')─┐
│ 1 │
└────────────────────────────────────┘
```
Query:
``` sql ``` sql
SELECT countSubstrings('aaaa', 'aa'); SELECT countSubstrings('aaaa', 'aa');
``` ```
@ -685,7 +530,7 @@ Result:
└───────────────────────────────┘ └───────────────────────────────┘
``` ```
Query: Example with `start_pos` argument:
```sql ```sql
SELECT countSubstrings('abc___abc', 'abc', 4); SELECT countSubstrings('abc___abc', 'abc', 4);
@ -699,125 +544,7 @@ Result:
└────────────────────────────────────────┘ └────────────────────────────────────────┘
``` ```
## countSubstringsCaseInsensitive ## countMatches
Returns the number of substring occurrences case-insensitive.
**Syntax**
``` sql
countSubstringsCaseInsensitive(haystack, needle[, start_pos])
```
**Arguments**
- `haystack` — The string to search in. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — The substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` — Position of the first character in the string to start search. Optional. [UInt](../../sql-reference/data-types/int-uint.md).
**Returned values**
- Number of occurrences.
Type: [UInt64](../../sql-reference/data-types/int-uint.md).
**Examples**
Query:
``` sql
SELECT countSubstringsCaseInsensitive('aba', 'B');
```
Result:
``` text
┌─countSubstringsCaseInsensitive('aba', 'B')─┐
│ 1 │
└────────────────────────────────────────────┘
```
Query:
``` sql
SELECT countSubstringsCaseInsensitive('foobar.com', 'CoM');
```
Result:
``` text
┌─countSubstringsCaseInsensitive('foobar.com', 'CoM')─┐
│ 1 │
└─────────────────────────────────────────────────────┘
```
Query:
``` sql
SELECT countSubstringsCaseInsensitive('abC___abC', 'aBc', 2);
```
Result:
``` text
┌─countSubstringsCaseInsensitive('abC___abC', 'aBc', 2)─┐
│ 1 │
└───────────────────────────────────────────────────────┘
```
## countSubstringsCaseInsensitiveUTF8
Returns the number of substring occurrences in `UTF-8` case-insensitive.
**Syntax**
``` sql
SELECT countSubstringsCaseInsensitiveUTF8(haystack, needle[, start_pos])
```
**Arguments**
- `haystack` — The string to search in. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — The substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` — Position of the first character in the string to start search. Optional. [UInt](../../sql-reference/data-types/int-uint.md).
**Returned values**
- Number of occurrences.
Type: [UInt64](../../sql-reference/data-types/int-uint.md).
**Examples**
Query:
``` sql
SELECT countSubstringsCaseInsensitiveUTF8('абв', 'A');
```
Result:
``` text
┌─countSubstringsCaseInsensitiveUTF8('абв', 'A')─┐
│ 1 │
└────────────────────────────────────────────────┘
```
Query:
```sql
SELECT countSubstringsCaseInsensitiveUTF8('аБв__АбВ__абв', 'Абв');
```
Result:
``` text
┌─countSubstringsCaseInsensitiveUTF8('аБв__АбВ__абв', 'Абв')─┐
│ 3 │
└────────────────────────────────────────────────────────────┘
```
## countMatches(haystack, pattern)
Returns the number of regular expression matches for a `pattern` in a `haystack`. Returns the number of regular expression matches for a `pattern` in a `haystack`.
@ -840,8 +567,6 @@ Type: [UInt64](../../sql-reference/data-types/int-uint.md).
**Examples** **Examples**
Query:
``` sql ``` sql
SELECT countMatches('foobar.com', 'o+'); SELECT countMatches('foobar.com', 'o+');
``` ```
@ -854,8 +579,6 @@ Result:
└──────────────────────────────────┘ └──────────────────────────────────┘
``` ```
Query:
``` sql ``` sql
SELECT countMatches('aaaa', 'aa'); SELECT countMatches('aaaa', 'aa');
``` ```
@ -868,7 +591,7 @@ Result:
└───────────────────────────────┘ └───────────────────────────────┘
``` ```
## regexpExtract(haystack, pattern[, index]) ## regexpExtract
Extracts the first string in haystack that matches the regexp pattern and corresponds to the regex group index. Extracts the first string in haystack that matches the regexp pattern and corresponds to the regex group index.
@ -899,8 +622,12 @@ SELECT
regexpExtract('100-200', '(\\d+)-(\\d+)', 1), regexpExtract('100-200', '(\\d+)-(\\d+)', 1),
regexpExtract('100-200', '(\\d+)-(\\d+)', 2), regexpExtract('100-200', '(\\d+)-(\\d+)', 2),
regexpExtract('100-200', '(\\d+)-(\\d+)', 0), regexpExtract('100-200', '(\\d+)-(\\d+)', 0),
regexpExtract('100-200', '(\\d+)-(\\d+)') regexpExtract('100-200', '(\\d+)-(\\d+)');
```
Result:
``` text
┌─regexpExtract('100-200', '(\\d+)-(\\d+)', 1)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)', 2)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)', 0)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)')─┐ ┌─regexpExtract('100-200', '(\\d+)-(\\d+)', 1)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)', 2)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)', 0)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)')─┐
│ 100 │ 200 │ 100-200 │ 100 │ │ 100 │ 200 │ 100-200 │ 100 │
└──────────────────────────────────────────────┴──────────────────────────────────────────────┴──────────────────────────────────────────────┴───────────────────────────────────────────┘ └──────────────────────────────────────────────┴──────────────────────────────────────────────┴──────────────────────────────────────────────┴───────────────────────────────────────────┘