---
toc_priority: 41
toc_title: Searching in Strings
---
# Functions for Searching Strings {#functions-for-searching-strings}

The search is case-sensitive by default in all these functions. There are separate variants for case-insensitive search.

## position(haystack, needle), locate(haystack, needle) {#position}

Returns the position (in bytes) of the found substring in the string, starting from 1.

The function assumes that the string contains a set of bytes representing single-byte encoded text. If this assumption does not hold and a character cannot be represented using a single byte, the function does not throw an exception but returns an unexpected result. A character that is encoded using two bytes occupies two positions, and so on.

For a case-insensitive search, use the function [positionCaseInsensitive](#positioncaseinsensitive).

**Syntax**

``` sql
position(haystack, needle)
```

Alias: `locate(haystack, needle)`.

**Parameters**

- `haystack`: string in which the substring is searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle`: substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).

**Returned values**

- Starting position in bytes (counting from 1), if the substring was found.
- 0, if the substring was not found.

Type: `Integer`.

**Examples**

The phrase “Hello, world!” contains a set of bytes representing single-byte encoded text. The function returns the expected result:

Query:

``` sql
SELECT position('Hello, world!', '!')
```
Result:
``` text
┌─position('Hello, world!', '!')─┐
│                             13 │
└────────────────────────────────┘
```

The same phrase in Russian contains characters that cannot be represented using a single byte. The function returns an unexpected result (use the [positionUTF8](#positionutf8) function for multi-byte encoded text):

Query:

``` sql
SELECT position('Привет, мир!', '!')
```
Result:
``` text
┌─position('Привет, мир!', '!')─┐
│                            21 │
└───────────────────────────────┘
```
## positionCaseInsensitive {#positioncaseinsensitive}

The same as [position](#position), but case-insensitive. Returns the position (in bytes) of the found substring in the string, starting from 1.

The function assumes that the string contains a set of bytes representing single-byte encoded text. If this assumption does not hold and a character cannot be represented using a single byte, the function does not throw an exception but returns an unexpected result. A character that is encoded using two bytes occupies two positions, and so on.

**Syntax**

``` sql
positionCaseInsensitive(haystack, needle)
```

**Parameters**

- `haystack`: string in which the substring is searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle`: substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).

**Returned values**

- Starting position in bytes (counting from 1), if the substring was found.
- 0, if the substring was not found.

Type: `Integer`.

**Example**

Query:

``` sql
SELECT positionCaseInsensitive('Hello, world!', 'hello')
```
Result:
``` text
┌─positionCaseInsensitive('Hello, world!', 'hello')─┐
│                                                 1 │
└───────────────────────────────────────────────────┘
```
## positionUTF8 {#positionutf8}

Returns the position (in Unicode code points) of the found substring in the string, starting from 1.

The function assumes that the string contains a set of bytes representing UTF-8 encoded text. If this assumption does not hold, the function does not throw an exception but returns an unexpected result. A character that is represented using two Unicode code points occupies two positions, and so on.

For a case-insensitive search, use the function [positionCaseInsensitiveUTF8](#positioncaseinsensitiveutf8).

**Syntax**

``` sql
positionUTF8(haystack, needle)
```

**Parameters**

- `haystack`: string in which the substring is searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle`: substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).

**Returned values**

- Starting position in Unicode code points (counting from 1), if the substring was found.
- 0, if the substring was not found.

Type: `Integer`.

**Examples**

The phrase “Hello, world!” in Russian consists of characters that are each represented by a single Unicode code point. The function returns the expected result:

Query:

``` sql
SELECT positionUTF8('Привет, мир!', '!')
```
Result:
``` text
┌─positionUTF8('Привет, мир!', '!')─┐
│                                12 │
└───────────────────────────────────┘
```

In the phrase “Salut, étudiante!”, the character `é` can be represented using one code point (`U+00E9`) or two code points (`U+0065U+0301`), so the function can return an unexpected result.

Query for the letter `é` represented by one Unicode code point, `U+00E9`:

``` sql
SELECT positionUTF8('Salut, étudiante!', '!')
```
Result:
``` text
┌─positionUTF8('Salut, étudiante!', '!')─┐
│                                     17 │
└────────────────────────────────────────┘
```

Query for the letter `é` represented by two Unicode code points, `U+0065U+0301`:

``` sql
SELECT positionUTF8('Salut, étudiante!', '!')
```
Result:
``` text
┌─positionUTF8('Salut, étudiante!', '!')─┐
│                                     18 │
└────────────────────────────────────────┘
```
## positionCaseInsensitiveUTF8 {#positioncaseinsensitiveutf8}

The same as [positionUTF8](#positionutf8), but case-insensitive. Returns the position (in Unicode code points) of the found substring in the string, starting from 1.

The function assumes that the string contains a set of bytes representing UTF-8 encoded text. If this assumption does not hold, the function does not throw an exception but returns an unexpected result. A character that is represented using two Unicode code points occupies two positions, and so on.

**Syntax**

``` sql
positionCaseInsensitiveUTF8(haystack, needle)
```

**Parameters**

- `haystack`: string in which the substring is searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle`: substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).

**Returned values**

- Starting position in Unicode code points (counting from 1), if the substring was found.
- 0, if the substring was not found.

Type: `Integer`.

**Example**

Query:

``` sql
SELECT positionCaseInsensitiveUTF8('Привет, мир!', 'Мир')
```
Result:
``` text
┌─positionCaseInsensitiveUTF8('Привет, мир!', 'Мир')─┐
│                                                  9 │
└────────────────────────────────────────────────────┘
```
## multiSearchAllPositions {#multisearchallpositions}

The same as [position](../../sql-reference/functions/string-search-functions.md#position), but returns an `Array` of positions (in bytes) of the corresponding substrings found in the string. Positions are indexed starting from 1.

The search is performed on sequences of bytes without respect to string encoding and collation.

- For case-insensitive ASCII search, use the function `multiSearchAllPositionsCaseInsensitive`.
- For search in UTF-8, use the function [multiSearchAllPositionsUTF8](#multiSearchAllPositionsUTF8).
- For case-insensitive UTF-8 search, use the function `multiSearchAllPositionsCaseInsensitiveUTF8`.

**Syntax**

``` sql
multiSearchAllPositions(haystack, [needle1, needle2, ..., needlen])
```

**Parameters**

- `haystack`: string in which the substrings are searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle`: substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).

**Returned values**

- Array of starting positions in bytes (counting from 1), one per needle, with 0 for each substring that was not found.

**Example**

Query:

``` sql
SELECT multiSearchAllPositions('Hello, World!', ['hello', '!', 'world'])
```
Result:
``` text
┌─multiSearchAllPositions('Hello, World!', ['hello', '!', 'world'])─┐
│ [0,13,0]                                                          │
└───────────────────────────────────────────────────────────────────┘
```
## multiSearchAllPositionsUTF8 {#multiSearchAllPositionsUTF8}
See `multiSearchAllPositions`.
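
As a hedged illustration of the difference between the two variants, the query below searches the same Russian phrase used in the `position` examples. The needles are chosen for illustration only. The byte-based variant counts each Cyrillic letter as two bytes, while the UTF-8 variant counts code points.

Query:

``` sql
SELECT
    multiSearchAllPositions('Привет, мир!', ['мир', '!']) AS byte_positions,
    multiSearchAllPositionsUTF8('Привет, мир!', ['мир', '!']) AS utf8_positions
```

Expected result, consistent with the `position` and `positionUTF8` examples above:

``` text
┌─byte_positions─┬─utf8_positions─┐
│ [15,21]        │ [9,12]         │
└────────────────┴────────────────┘
```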
## multiSearchFirstPosition(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) {#multisearchfirstposition}

The same as `position`, but returns the leftmost offset in the string `haystack` at which any of the needles matches.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `multiSearchFirstPositionCaseInsensitive`, `multiSearchFirstPositionUTF8`, and `multiSearchFirstPositionCaseInsensitiveUTF8`.
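
A small sketch with illustrative strings: `world` starts at byte 8 and `llo` at byte 3, so the leftmost offset is 3 regardless of the order of the needles.

Query:

``` sql
SELECT multiSearchFirstPosition('Hello, world!', ['world', 'llo']) AS pos
```

Expected result:

``` text
┌─pos─┐
│   3 │
└─────┘
```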
## multiSearchFirstIndex(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) {#multisearchfirstindexhaystack-needle1-needle2-needlen}

Returns the index `i` (starting from 1) of the leftmost found needle<sub>i</sub> in the string `haystack`, or 0 if none of the needles is found.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `multiSearchFirstIndexCaseInsensitive`, `multiSearchFirstIndexUTF8`, and `multiSearchFirstIndexCaseInsensitiveUTF8`.
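
A minimal sketch with illustrative needles. Only one needle occurs in the first call, so the result does not depend on how ties between several matching needles are resolved. The second call shows the 0 returned when nothing is found.

Query:

``` sql
SELECT
    multiSearchFirstIndex('Hello, world!', ['xyz', 'world', 'abc']) AS one_found,
    multiSearchFirstIndex('Hello, world!', ['xyz', 'abc']) AS none_found
```

Expected result:

``` text
┌─one_found─┬─none_found─┐
│         2 │          0 │
└───────────┴────────────┘
```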
## multiSearchAny(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) {#function-multisearchany}

Returns 1 if at least one string needle<sub>i</sub> is found in the string `haystack`, and 0 otherwise.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `multiSearchAnyCaseInsensitive`, `multiSearchAnyUTF8`, and `multiSearchAnyCaseInsensitiveUTF8`.
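
For illustration, with made-up needles:

Query:

``` sql
SELECT
    multiSearchAny('Hello, world!', ['xyz', 'world']) AS any_found,
    multiSearchAny('Hello, world!', ['xyz', 'abc']) AS none_found
```

Expected result:

``` text
┌─any_found─┬─none_found─┐
│         1 │          0 │
└───────────┴────────────┘
```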

!!! note "Note"
    In all `multiSearch*` functions, the number of needles should be less than 2<sup>8</sup> because of implementation details.

## match(haystack, pattern) {#matchhaystack-pattern}

Checks whether the string matches the regular expression `pattern`, which is a `re2` regular expression. The [syntax](https://github.com/google/re2/wiki/Syntax) of `re2` regular expressions is more limited than the syntax of Perl regular expressions.

Returns 0 if it doesn’t match, or 1 if it matches.

Note that the backslash symbol (`\`) is used for escaping in the regular expression. The same symbol is used for escaping in string literals. So in order to escape the symbol in a regular expression, you must write two backslashes (`\\`) in a string literal.

The regular expression works with the string as if it is a set of bytes. The regular expression can’t contain null bytes.

For patterns that only search for substrings in a string, it is better to use `LIKE` or `position`, since they work much faster.
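
The escaping rule can be sketched as follows; the inputs are illustrative. In the string literal, `\\.` reaches `re2` as `\.` (a literal dot) and `\\d` as `\d` (a digit), whereas an unescaped `.` matches any byte.

Query:

``` sql
SELECT
    match('3.14', '^\\d+\\.\\d+$') AS dot_escaped,
    match('3x14', '^\\d+\\.\\d+$') AS no_match,
    match('3x14', '^\\d+.\\d+$') AS dot_unescaped
```

Expected result:

``` text
┌─dot_escaped─┬─no_match─┬─dot_unescaped─┐
│           1 │        0 │             1 │
└─────────────┴──────────┴───────────────┘
```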
## multiMatchAny(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multimatchanyhaystack-pattern1-pattern2-patternn}

The same as `match`, but returns 0 if none of the regular expressions match and 1 if any of the patterns matches. It uses the [hyperscan](https://github.com/intel/hyperscan) library. For patterns that only search for substrings in a string, it is better to use `multiSearchAny`, since it works much faster.

!!! note "Note"
    The length of any `haystack` string must be less than 2<sup>32</sup> bytes; otherwise an exception is thrown. This restriction comes from the hyperscan API.
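
A minimal sketch with illustrative patterns, assuming a build in which hyperscan is available and `allow_hyperscan` is enabled:

Query:

``` sql
SELECT
    multiMatchAny('Hello, world!', ['^abc', 'world!$']) AS any_match,
    multiMatchAny('Hello, world!', ['^abc', '^xyz']) AS no_match
```

Expected result:

``` text
┌─any_match─┬─no_match─┐
│         1 │        0 │
└───────────┴──────────┘
```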
## multiMatchAnyIndex(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multimatchanyindexhaystack-pattern1-pattern2-patternn}

The same as `multiMatchAny`, but returns the index of any pattern that matches the haystack.
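
In the sketch below only the second pattern matches, so the answer does not depend on which of several matching patterns is picked. Assuming pattern indices are counted from 1, as for `multiSearchFirstIndex`, the query is expected to return 2. The patterns are illustrative.

``` sql
SELECT multiMatchAnyIndex('Hello, world!', ['^abc', '^Hello']) AS matched_index
```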
## multiMatchAllIndices(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multimatchallindiceshaystack-pattern1-pattern2-patternn}

The same as `multiMatchAny`, but returns an array of the indices of all patterns that match the haystack, in any order.
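
Because the order of the returned indices is not guaranteed, the sketch below wraps the call in `arraySort` to get a stable result. The patterns are illustrative, and the result assumes that pattern indices are counted from 1.

Query:

``` sql
SELECT arraySort(multiMatchAllIndices('Hello, world!', ['^Hello', '^abc', 'world!$'])) AS indices
```

Expected result under that assumption:

``` text
┌─indices─┐
│ [1,3]   │
└─────────┘
```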
## multiFuzzyMatchAny(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multifuzzymatchanyhaystack-distance-pattern1-pattern2-patternn}

The same as `multiMatchAny`, but returns 1 if any pattern matches the haystack within a constant [edit distance](https://en.wikipedia.org/wiki/Edit_distance). This function is in experimental mode and can be extremely slow. For more information, see the [hyperscan documentation](https://intel.github.io/hyperscan/dev-reference/compilation.html#approximate-matching).
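
A hedged sketch of how the `distance` argument changes the outcome; the haystack and pattern are illustrative. With a distance of 0, the pattern `a1c` should not match `abc`, while a distance of 1 should allow the single substitution. No result table is given, since the function is experimental and requires a build with hyperscan.

``` sql
SELECT
    multiFuzzyMatchAny('abc', 0, ['a1c']) AS exact,
    multiFuzzyMatchAny('abc', 1, ['a1c']) AS within_one_edit
```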
## multiFuzzyMatchAnyIndex(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multifuzzymatchanyindexhaystack-distance-pattern1-pattern2-patternn}

The same as `multiFuzzyMatchAny`, but returns the index of any pattern that matches the haystack within a constant edit distance.
## multiFuzzyMatchAllIndices(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multifuzzymatchallindiceshaystack-distance-pattern1-pattern2-patternn}

The same as `multiFuzzyMatchAny`, but returns an array of the indices of all patterns that match the haystack within a constant edit distance, in any order.

!!! note "Note"
    `multiFuzzyMatch*` functions do not support UTF-8 regular expressions; such expressions are treated as sequences of bytes because of a hyperscan restriction.

!!! note "Note"
    To turn off all functions that use hyperscan, use the setting `SET allow_hyperscan = 0;`.

## extract(haystack, pattern) {#extracthaystack-pattern}

Extracts a fragment of a string using a regular expression. If `haystack` doesn’t match the `pattern` regex, an empty string is returned. If the regex doesn’t contain subpatterns, it takes the fragment that matches the entire regex. Otherwise, it takes the fragment that matches the first subpattern.
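
The three cases described above can be sketched with an illustrative `key=value` string:

Query:

``` sql
SELECT
    extract('key=value', '=(\\w+)') AS first_subpattern,
    extract('key=value', '\\w+') AS whole_match,
    extract('key=value', '\\d+') AS no_match
```

Expected result (the last column is an empty string):

``` text
┌─first_subpattern─┬─whole_match─┬─no_match─┐
│ value            │ key         │          │
└──────────────────┴─────────────┴──────────┘
```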
## extractAll(haystack, pattern) {#extractallhaystack-pattern}

Extracts all the fragments of a string using a regular expression. Returns an array of strings consisting of all matches to the regex. If `haystack` doesn’t match the `pattern` regex, an empty array is returned. In general, the behavior is the same as the `extract` function (it takes the first subpattern, or the entire expression if there isn’t a subpattern).
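
For illustration, the query below collects every value that follows an `=` sign; the input string and the alias are made up for this sketch.

Query:

``` sql
SELECT extractAll('a=1, b=22, c=333', '=(\\d+)') AS matches
```

Expected result:

``` text
┌─matches──────────┐
│ ['1','22','333'] │
└──────────────────┘
```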
## like(haystack, pattern), haystack LIKE pattern operator {#function-like}

Checks whether a string matches a simple regular expression. The regular expression can contain the metasymbols `%` and `_`.

`%` indicates any quantity of any bytes (including zero bytes).

`_` indicates any one byte.

Use the backslash (`\`) for escaping metasymbols. See the note on escaping in the description of the `match` function.

For regular expressions like `%needle%`, the code is more optimal and works as fast as the `position` function. For other regular expressions, the code is the same as for the `match` function.
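
A short sketch of the metasymbols and of escaping, using illustrative strings. In the last two columns, `\\%` in the string literal becomes `\%` in the pattern, which matches a literal percent sign only.

Query:

``` sql
SELECT
    'Hello, world!' LIKE 'Hello%' AS prefix,
    'Hello, world!' LIKE '%w_rld%' AS one_byte,
    '100%' LIKE '100\\%' AS escaped,
    '1000' LIKE '100\\%' AS no_match
```

Expected result:

``` text
┌─prefix─┬─one_byte─┬─escaped─┬─no_match─┐
│      1 │        1 │       1 │        0 │
└────────┴──────────┴─────────┴──────────┘
```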
## notLike(haystack, pattern), haystack NOT LIKE pattern operator {#function-notlike}

The same as `like`, but negated.
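
For illustration, in both operator and function form:

Query:

``` sql
SELECT
    'Hello, world!' NOT LIKE 'Hello%' AS not_prefix,
    notLike('Hello, world!', 'Bye%') AS not_bye
```

Expected result:

``` text
┌─not_prefix─┬─not_bye─┐
│          0 │       1 │
└────────────┴─────────┘
```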
## ngramDistance(haystack, needle) {#ngramdistancehaystack-needle}

Calculates the 4-gram distance between `haystack` and `needle`: counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a float number from 0 to 1: the closer to zero, the more similar the strings are to each other. If the constant `needle` or `haystack` is larger than 32Kb, an exception is thrown. If any of the non-constant `haystack` or `needle` strings is larger than 32Kb, the distance is always one.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `ngramDistanceCaseInsensitive`, `ngramDistanceUTF8`, and `ngramDistanceCaseInsensitiveUTF8`.
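
A hedged sketch with illustrative strings. Identical strings have an empty symmetric difference, so `identical` should be 0. The other two values are not shown here because they depend on the hashed n-grams, but `similar` is expected to be noticeably smaller than `different`.

``` sql
SELECT
    ngramDistance('ClickHouse', 'ClickHouse') AS identical,
    ngramDistance('ClickHouse', 'ClickHome') AS similar,
    ngramDistance('ClickHouse', 'PostgreSQL') AS different
```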
## ngramSearch(haystack, needle) {#ngramsearchhaystack-needle}

The same as `ngramDistance`, but calculates the non-symmetric difference between `needle` and `haystack`: the number of n-grams from `needle` minus the common number of n-grams, normalized by the number of `needle` n-grams. The closer the result is to one, the more likely it is that `needle` is in the `haystack`. Can be useful for fuzzy string search.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `ngramSearchCaseInsensitive`, `ngramSearchUTF8`, and `ngramSearchCaseInsensitiveUTF8`.
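
A minimal sketch of using the function for fuzzy containment; the strings are illustrative and exact values are not shown because they depend on the hashed n-grams. `contained` is expected to be close to one and `absent` close to zero.

``` sql
SELECT
    ngramSearch('ClickHouse is a column-oriented DBMS', 'ClickHouse') AS contained,
    ngramSearch('ClickHouse is a column-oriented DBMS', 'PostgreSQL') AS absent
```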

!!! note "Note"
    For the UTF-8 case, we use 3-gram distance. None of these are perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables, so collisions may occur. With the UTF-8 case-insensitive format, we do not use a fair `tolower` function: we zero the 5-th bit (starting from zero) of each codepoint byte and the first bit of the zeroth byte if there is more than one byte. This works for Latin and mostly for all Cyrillic letters.

[Original article](https://clickhouse.tech/docs/en/query_language/functions/string_search_functions/) <!--hide-->