---
toc_priority: 41
toc_title: For Searching in Strings
---

# Functions for Searching in Strings {#functions-for-searching-strings}

The search is case-sensitive by default in all these functions. There are separate variants for case-insensitive search.

!!! note "Note"
    Functions for [replacing](../../sql-reference/functions/string-replace-functions.md) and [other manipulations with strings](../../sql-reference/functions/string-functions.md) are described separately.

## position(haystack, needle), locate(haystack, needle) {#position}

Returns the position (in bytes) of the found substring in the string, starting from 1.

For a case-insensitive search, use the function [positionCaseInsensitive](#positioncaseinsensitive).

**Syntax**

``` sql
position(haystack, needle[, start_pos])
```

Alias: `locate(haystack, needle[, start_pos])`.

**Arguments**

- `haystack` — String in which the substring is searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` – Position of the first character in the string at which the search starts. Optional. [UInt](../../sql-reference/data-types/int-uint.md).

**Returned values**

- Starting position in bytes (counting from 1), if the substring was found.
- 0, if the substring was not found.

Type: `Integer`.

**Examples**

The phrase “Hello, world!” consists of single-byte characters, so the function returns the expected result:

Query:

``` sql
SELECT position('Hello, world!', '!')
```

Result:

``` text
┌─position('Hello, world!', '!')─┐
│                             13 │
└────────────────────────────────┘
```

``` sql
SELECT
    position('Hello, world!', 'o', 1),
    position('Hello, world!', 'o', 7)
```

``` text
┌─position('Hello, world!', 'o', 1)─┬─position('Hello, world!', 'o', 7)─┐
│                                 5 │                                 9 │
└───────────────────────────────────┴───────────────────────────────────┘
```

The same phrase in Russian contains characters that can’t be represented using a single byte, so the byte position returned by the function differs from the character position. For multi-byte encoded text, use the [positionUTF8](#positionutf8) function.

Query:

``` sql
SELECT position('Привет, мир!', '!')
```

Result:

``` text
┌─position('Привет, мир!', '!')─┐
│                            21 │
└───────────────────────────────┘
```

## positionCaseInsensitive {#positioncaseinsensitive}

The same as [position](#position), but case-insensitive. Returns the position (in bytes) of the found substring in the string, starting from 1.

Works under the assumption that the string contains a set of bytes representing a single-byte encoded text. If this assumption is not met and a character can’t be represented using a single byte, the function doesn’t throw an exception and returns an unexpected result. A character that is encoded using two bytes occupies two byte positions, and so on.

**Syntax**

``` sql
positionCaseInsensitive(haystack, needle[, start_pos])
```

**Arguments**

- `haystack` — String in which the substring is searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` – Position of the first character in the string at which the search starts. Optional. [UInt](../../sql-reference/data-types/int-uint.md).

**Returned values**

- Starting position in bytes (counting from 1), if the substring was found.
- 0, if the substring was not found.

Type: `Integer`.

**Example**

Query:

``` sql
SELECT positionCaseInsensitive('Hello, world!', 'hello')
```

Result:

``` text
┌─positionCaseInsensitive('Hello, world!', 'hello')─┐
│                                                 1 │
└───────────────────────────────────────────────────┘
```

## positionUTF8 {#positionutf8}

Returns the position (in Unicode code points) of the found substring in the string, starting from 1.

Works under the assumption that the string contains a set of bytes representing a UTF-8 encoded text. If this assumption is not met, the function doesn’t throw an exception and returns an unexpected result. A character that is represented by two Unicode code points occupies two positions, and so on.

For a case-insensitive search, use the function [positionCaseInsensitiveUTF8](#positioncaseinsensitiveutf8).

**Syntax**

``` sql
positionUTF8(haystack, needle[, start_pos])
```

**Arguments**

- `haystack` — String in which the substring is searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` – Position of the first character in the string at which the search starts. Optional. [UInt](../../sql-reference/data-types/int-uint.md).

**Returned values**

- Starting position in Unicode code points (counting from 1), if the substring was found.
- 0, if the substring was not found.

Type: `Integer`.

**Examples**

In the Russian phrase “Привет, мир!” every character is represented by a single Unicode code point, so the function returns the expected result:

Query:

``` sql
SELECT positionUTF8('Привет, мир!', '!')
```

Result:

``` text
┌─positionUTF8('Привет, мир!', '!')─┐
│                                12 │
└───────────────────────────────────┘
```

In the phrase “Salut, étudiante!”, the character `é` can be represented by one code point (`U+00E9`) or by two code points (`U+0065 U+0301`), so the result depends on which representation the string uses.

Query for the letter `é` represented by one Unicode code point, `U+00E9`:

``` sql
SELECT positionUTF8('Salut, étudiante!', '!')
```

Result:

``` text
┌─positionUTF8('Salut, étudiante!', '!')─┐
│                                     17 │
└────────────────────────────────────────┘
```

Query for the letter `é` represented by two Unicode code points, `U+0065 U+0301`:

``` sql
SELECT positionUTF8('Salut, étudiante!', '!')
```

Result:

``` text
┌─positionUTF8('Salut, étudiante!', '!')─┐
│                                     18 │
└────────────────────────────────────────┘
```

## positionCaseInsensitiveUTF8 {#positioncaseinsensitiveutf8}

The same as [positionUTF8](#positionutf8), but case-insensitive. Returns the position (in Unicode code points) of the found substring in the string, starting from 1.

Works under the assumption that the string contains a set of bytes representing a UTF-8 encoded text. If this assumption is not met, the function doesn’t throw an exception and returns an unexpected result. A character that is represented by two Unicode code points occupies two positions, and so on.

**Syntax**

``` sql
positionCaseInsensitiveUTF8(haystack, needle[, start_pos])
```

**Arguments**

- `haystack` — String in which the substring is searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` – Position of the first character in the string at which the search starts. Optional. [UInt](../../sql-reference/data-types/int-uint.md).

**Returned values**

- Starting position in Unicode code points (counting from 1), if the substring was found.
- 0, if the substring was not found.

Type: `Integer`.

**Example**

Query:

``` sql
SELECT positionCaseInsensitiveUTF8('Привет, мир!', 'Мир')
```

Result:

``` text
┌─positionCaseInsensitiveUTF8('Привет, мир!', 'Мир')─┐
│                                                  9 │
└────────────────────────────────────────────────────┘
```

## multiSearchAllPositions {#multisearchallpositions}

The same as [position](../../sql-reference/functions/string-search-functions.md#position), but returns an `Array` of positions (in bytes) of the corresponding substrings found in the string. Positions are indexed starting from 1.

The search is performed on sequences of bytes without respect to string encoding and collation.

- For case-insensitive ASCII search, use the function `multiSearchAllPositionsCaseInsensitive`.
- For search in UTF-8, use the function [multiSearchAllPositionsUTF8](#multiSearchAllPositionsUTF8).
- For case-insensitive UTF-8 search, use the function `multiSearchAllPositionsCaseInsensitiveUTF8`.

**Syntax**

``` sql
multiSearchAllPositions(haystack, [needle1, needle2, ..., needlen])
```

**Arguments**

- `haystack` — String in which the substrings are searched. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — Substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).

**Returned values**

- Array of starting positions in bytes (counting from 1) of the corresponding substrings, with 0 for each substring that was not found.

**Example**

Query:

``` sql
SELECT multiSearchAllPositions('Hello, World!', ['hello', '!', 'world'])
```

Result:

``` text
┌─multiSearchAllPositions('Hello, World!', ['hello', '!', 'world'])─┐
│ [0,13,0]                                                          │
└───────────────────────────────────────────────────────────────────┘
```

## multiSearchAllPositionsUTF8 {#multiSearchAllPositionsUTF8}

See `multiSearchAllPositions`.

## multiSearchFirstPosition(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) {#multisearchfirstposition}

The same as `position`, but returns the leftmost offset in the `haystack` string that matches any of the needles.

For case-insensitive and/or UTF-8 search, use the functions `multiSearchFirstPositionCaseInsensitive`, `multiSearchFirstPositionUTF8`, and `multiSearchFirstPositionCaseInsensitiveUTF8`.

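The description above can be illustrated with a short query. This is only a sketch: the sample string, the needles, and the column aliases are chosen here for illustration and do not come from the original text.

``` sql
SELECT
    -- 'World' starts at byte 8 and '!' at byte 13, so the leftmost match is at 8
    multiSearchFirstPosition('Hello, World!', ['World', '!']) AS first_pos,
    -- none of the needles occurs in the haystack, so the result is 0
    multiSearchFirstPosition('Hello, World!', ['xyz', 'abc']) AS not_found
```

As with `position`, the offsets are counted in bytes, so for multi-byte text the UTF-8 variant is probably the better fit.
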
## multiSearchFirstIndex(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) {#multisearchfirstindexhaystack-needle1-needle2-needlen}

Returns the index `i` (starting from 1) of the leftmost found needle<sub>i</sub> in the string `haystack`, and 0 otherwise.

For case-insensitive and/or UTF-8 search, use the functions `multiSearchFirstIndexCaseInsensitive`, `multiSearchFirstIndexUTF8`, and `multiSearchFirstIndexCaseInsensitiveUTF8`.

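A minimal sketch of the return convention, using a sample string and aliases that are assumptions made for this illustration. Only one needle occurs in the first call, so the result does not depend on how ties between several matching needles are resolved.

``` sql
SELECT
    -- only the second needle occurs in the haystack, so its index (2) is returned
    multiSearchFirstIndex('Hello, World!', ['xyz', 'World']) AS idx_found,
    -- no needle occurs in the haystack, so the result is 0
    multiSearchFirstIndex('Hello, World!', ['xyz', 'abc']) AS idx_not_found
```
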
## multiSearchAny(haystack, \[needle<sub>1</sub>, needle<sub>2</sub>, …, needle<sub>n</sub>\]) {#function-multisearchany}

Returns 1 if at least one string needle<sub>i</sub> matches the string `haystack`, and 0 otherwise.

For case-insensitive and/or UTF-8 search, use the functions `multiSearchAnyCaseInsensitive`, `multiSearchAnyUTF8`, and `multiSearchAnyCaseInsensitiveUTF8`.
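
For example, the following query (a sketch with an illustrative sample string and aliases) shows both possible results:

``` sql
SELECT
    -- 'World' occurs in the haystack, so the result is 1
    multiSearchAny('Hello, World!', ['xyz', 'World']) AS any_found,
    -- neither needle occurs in the haystack, so the result is 0
    multiSearchAny('Hello, World!', ['xyz', 'abc']) AS none_found
```
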

!!! note "Note"
    In all `multiSearch*` functions, the number of needles must be less than 2<sup>8</sup> because of implementation details.

## match(haystack, pattern) {#matchhaystack-pattern}

Checks whether the string matches the `pattern` regular expression, which uses `re2` syntax. The [syntax](https://github.com/google/re2/wiki/Syntax) of `re2` regular expressions is more limited than the syntax of Perl regular expressions.

Returns 0 if it doesn’t match, or 1 if it matches.

Note that the backslash symbol (`\`) is used for escaping in the regular expression. The same symbol is used for escaping in string literals. So in order to escape the symbol in a regular expression, you must write two backslashes (`\\`) in a string literal.

The regular expression works with the string as if it is a set of bytes. The regular expression can’t contain null bytes.

To search for plain substrings in a string, it is better to use `LIKE` or `position`, since they work much faster.

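The escaping rule above is easy to get wrong, so here is a small sketch. The sample strings and aliases are illustrative assumptions; the point is that each regex backslash is written twice inside the SQL string literal.

``` sql
SELECT
    -- the pattern is searched anywhere in the string, so this returns 1
    match('Hello, World!', 'W.rld') AS found_inside,
    -- '^' anchors the pattern to the start of the string, so this returns 0
    match('Hello, World!', '^World') AS anchored,
    -- the regex \d+ is written as '\\d+' in the string literal; returns 1
    match('price: 42', '\\d+') AS has_digits,
    -- the regex \. matches only a literal dot; returns 0 because 'axb' has none
    match('axb', 'a\\.b') AS literal_dot
```
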
## multiMatchAny(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multimatchanyhaystack-pattern1-pattern2-patternn}

The same as `match`, but returns 0 if none of the regular expressions match and 1 if any of the patterns matches. It uses the [hyperscan](https://github.com/intel/hyperscan) library. To search for plain substrings in a string, it is better to use `multiSearchAny`, since it works much faster.

!!! note "Note"
    The length of any `haystack` string must be less than 2<sup>32</sup> bytes; otherwise, an exception is thrown. This restriction is imposed by the hyperscan API.

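A short sketch of the behaviour, assuming a build in which hyperscan-based functions are available and enabled. The sample string, patterns, and aliases are made up for this illustration.

``` sql
SELECT
    -- the second pattern matches, so the result is 1
    multiMatchAny('Hello, World!', ['^xyz', 'W.rld']) AS any_match,
    -- neither pattern matches (the string contains no digits), so the result is 0
    multiMatchAny('Hello, World!', ['^xyz', '\\d+']) AS no_match
```
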
## multiMatchAnyIndex(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multimatchanyindexhaystack-pattern1-pattern2-patternn}

The same as `multiMatchAny`, but returns the index of any pattern that matches the haystack.

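In the sketch below only one pattern matches, so the returned index is unambiguous. The sample data and alias are illustrative assumptions, and a hyperscan-enabled build is assumed.

``` sql
-- only the second pattern matches, so the result is 2
SELECT multiMatchAnyIndex('Hello, World!', ['^xyz', 'W.rld']) AS matched_index
```

When several patterns match, any of their indexes may be returned, so queries should not rely on a particular one.
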
## multiMatchAllIndices(haystack, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multimatchallindiceshaystack-pattern1-pattern2-patternn}

The same as `multiMatchAny`, but returns the array of all indices that match the haystack, in any order.

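Because the order of the returned indices is not fixed, a query that needs a stable result can sort the array. The following is a sketch with illustrative sample data, assuming a hyperscan-enabled build; `arraySort` is used here only to make the output deterministic.

``` sql
-- the first and third patterns match, so the sorted result is [1,3]
SELECT arraySort(multiMatchAllIndices('Hello, World!', ['^Hello', '^xyz', 'W.rld'])) AS matched_indices
```
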
## multiFuzzyMatchAny(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multifuzzymatchanyhaystack-distance-pattern1-pattern2-patternn}

The same as `multiMatchAny`, but returns 1 if any pattern matches the haystack within a constant [edit distance](https://en.wikipedia.org/wiki/Edit_distance). This function is experimental and can be extremely slow. For more information, see the [hyperscan documentation](https://intel.github.io/hyperscan/dev-reference/compilation.html#approximate-matching).

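The call shape differs from `multiMatchAny` only in the extra `distance` argument, which comes second. The query below is a sketch with an illustrative pattern that differs from the text by one substituted character; the exact result is not asserted here because it depends on hyperscan's approximate matching.

``` sql
-- the haystack contains 'World'; the pattern 'Wxrld' is one substitution away from it
SELECT multiFuzzyMatchAny('Hello, World!', 1, ['Wxrld']) AS fuzzy_match
```
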
## multiFuzzyMatchAnyIndex(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multifuzzymatchanyindexhaystack-distance-pattern1-pattern2-patternn}

The same as `multiFuzzyMatchAny`, but returns the index of any pattern that matches the haystack within a constant edit distance.

## multiFuzzyMatchAllIndices(haystack, distance, \[pattern<sub>1</sub>, pattern<sub>2</sub>, …, pattern<sub>n</sub>\]) {#multifuzzymatchallindiceshaystack-distance-pattern1-pattern2-patternn}

The same as `multiFuzzyMatchAny`, but returns the array of all indices, in any order, that match the haystack within a constant edit distance.

!!! note "Note"
    `multiFuzzyMatch*` functions do not support UTF-8 regular expressions; because of a hyperscan restriction, such expressions are treated as sequences of bytes.

!!! note "Note"
    To turn off all functions that use hyperscan, use the setting `SET allow_hyperscan = 0;`.

## extract(haystack, pattern) {#extracthaystack-pattern}

Extracts a fragment of a string using a regular expression. If `haystack` doesn’t match the `pattern` regex, an empty string is returned. If the regex doesn’t contain subpatterns, it takes the fragment that matches the entire regex. Otherwise, it takes the fragment that matches the first subpattern.

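The three cases described above can be sketched as follows. The sample string, patterns, and aliases are illustrative assumptions; note the doubled backslashes in the string literals.

``` sql
SELECT
    -- no subpattern: the fragment matching the whole regex is returned, '111'
    extract('abc=111, def=222', '\\d+') AS whole_match,
    -- with a subpattern: the fragment matching the first group is returned, 'abc'
    extract('abc=111, def=222', '(\\w+)=\\d+') AS first_group,
    -- no match: an empty string is returned
    extract('abc=111, def=222', 'xyz') AS no_match
```
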
## extractAll(haystack, pattern) {#extractallhaystack-pattern}

Extracts all the fragments of a string using a regular expression and returns them as an array of strings. If `haystack` doesn’t match the `pattern` regex, an empty array is returned. In general, the behavior is the same as for the `extract` function: each element is the fragment matching the first subpattern, or the entire expression if there is no subpattern.

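A brief sketch of the difference between a pattern with and without a subpattern; the sample string and aliases are chosen here for illustration.

``` sql
SELECT
    -- no subpattern: every whole match is collected, ['111','222']
    extractAll('abc=111, def=222', '\\d+') AS whole_matches,
    -- with a subpattern: the first group of every match is collected, ['abc','def']
    extractAll('abc=111, def=222', '(\\w+)=\\d+') AS first_groups
```

To collect several groups per match, the `extractAllGroups*` functions described below are the more suitable tool.
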
## extractAllGroupsHorizontal {#extractallgroups-horizontal}

Matches all groups of the `haystack` string using the `pattern` regular expression. Returns an array of arrays, where the first array includes all fragments matching the first group, the second array includes all fragments matching the second group, and so on.

!!! note "Note"
    The `extractAllGroupsHorizontal` function is slower than [extractAllGroupsVertical](#extractallgroups-vertical).

**Syntax**

``` sql
extractAllGroupsHorizontal(haystack, pattern)
```

**Arguments**

- `haystack` — Input string. Type: [String](../../sql-reference/data-types/string.md).
- `pattern` — Regular expression with [re2 syntax](https://github.com/google/re2/wiki/Syntax). Must contain groups, each group enclosed in parentheses. If `pattern` contains no groups, an exception is thrown. Type: [String](../../sql-reference/data-types/string.md).

**Returned value**

- Type: [Array](../../sql-reference/data-types/array.md).

If `haystack` doesn’t match the `pattern` regex, an array of empty arrays is returned.

**Example**

Query:

``` sql
SELECT extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')
```

Result:

``` text
┌─extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','def','ghi'],['111','222','333']]                                                │
└──────────────────────────────────────────────────────────────────────────────────────────┘
```

**See Also**

- [extractAllGroupsVertical](#extractallgroups-vertical)

## extractAllGroupsVertical {#extractallgroups-vertical}

Matches all groups of the `haystack` string using the `pattern` regular expression. Returns an array of arrays, where each array includes matching fragments from every group. Fragments are grouped in order of appearance in the `haystack`.

**Syntax**

``` sql
extractAllGroupsVertical(haystack, pattern)
```

**Arguments**

- `haystack` — Input string. Type: [String](../../sql-reference/data-types/string.md).
- `pattern` — Regular expression with [re2 syntax](https://github.com/google/re2/wiki/Syntax). Must contain groups, each group enclosed in parentheses. If `pattern` contains no groups, an exception is thrown. Type: [String](../../sql-reference/data-types/string.md).

**Returned value**

- Type: [Array](../../sql-reference/data-types/array.md).

If `haystack` doesn’t match the `pattern` regex, an empty array is returned.

**Example**

Query:

``` sql
SELECT extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')
```

Result:

``` text
┌─extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','111'],['def','222'],['ghi','333']]                                            │
└────────────────────────────────────────────────────────────────────────────────────────┘
```

**See Also**

- [extractAllGroupsHorizontal](#extractallgroups-horizontal)

## like(haystack, pattern), haystack LIKE pattern operator {#function-like}

Checks whether a string matches a simple regular expression.
The regular expression can contain the metasymbols `%` and `_`.

`%` indicates any quantity of any bytes (including zero bytes).

`_` indicates any one byte.

Use the backslash (`\`) for escaping metasymbols. See the note on escaping in the description of the `match` function.

For regular expressions like `%needle%`, the code is more optimal and works as fast as the `position` function.
For other regular expressions, the code is the same as for the `match` function.

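The metasymbols and the escaping rule can be sketched as follows. The sample strings and aliases are illustrative assumptions; the escaped percent sign is written with two backslashes in the string literal, as explained for `match`.

``` sql
SELECT
    -- '%' matches any number of bytes on both sides; returns 1
    'Hello, World!' LIKE '%World%' AS contains_world,
    -- '_' matches exactly one byte (here the comma); returns 1
    like('Hello, World!', 'Hello_ World!') AS one_byte,
    -- '\\%' in the literal is the escaped, literal percent sign; returns 1
    '100%' LIKE '100\\%' AS literal_percent,
    -- the comparison is case-sensitive; returns 0
    'Hello' LIKE 'hello' AS case_sensitive
```
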
## notLike(haystack, pattern), haystack NOT LIKE pattern operator {#function-notlike}

The same as `like`, but with the result negated.

## ilike {#ilike}

Case-insensitive variant of the [like](https://clickhouse.tech/docs/en/sql-reference/functions/string-search-functions/#function-like) function. You can use the `ILIKE` operator instead of the `ilike` function.

**Syntax**

``` sql
ilike(haystack, pattern)
```

**Arguments**

- `haystack` — Input string. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `pattern` — If `pattern` doesn't contain percent signs or underscores, then the `pattern` only represents the string itself. An underscore (`_`) in `pattern` stands for (matches) any single character. A percent sign (`%`) matches any sequence of zero or more characters.

Some `pattern` examples:

``` text
'abc' ILIKE 'abc'    true
'abc' ILIKE 'a%'     true
'abc' ILIKE '_b_'    true
'abc' ILIKE 'c'      false
```

**Returned values**

- True, if the string matches `pattern`.
- False, if the string doesn't match `pattern`.

**Example**

Input table:

``` text
┌─id─┬─name─────┬─days─┐
│  1 │ January  │   31 │
│  2 │ February │   29 │
│  3 │ March    │   31 │
│  4 │ April    │   30 │
└────┴──────────┴──────┘
```

Query:

``` sql
SELECT * FROM Months WHERE ilike(name, '%j%')
```

Result:

``` text
┌─id─┬─name────┬─days─┐
│  1 │ January │   31 │
└────┴─────────┴──────┘
```

**See Also**

- [like](https://clickhouse.tech/docs/en/sql-reference/functions/string-search-functions/#function-like) <!--hide-->

## ngramDistance(haystack, needle) {#ngramdistancehaystack-needle}

Calculates the 4-gram distance between `haystack` and `needle`: counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a float number from 0 to 1: the closer to zero, the more similar the strings are to each other. If the constant `needle` or `haystack` is more than 32Kb, throws an exception. If some of the non-constant `haystack` or `needle` strings are more than 32Kb, the distance is always one.

For case-insensitive and/or UTF-8 search, use the functions `ngramDistanceCaseInsensitive`, `ngramDistanceUTF8`, and `ngramDistanceCaseInsensitiveUTF8`.

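A small sketch of how the distance might be compared across inputs. The strings and aliases are illustrative assumptions. Only the first value follows directly from the definition above; the other two are left without stated results because they depend on the hashing details of the implementation.

``` sql
SELECT
    -- identical strings have an empty symmetric difference, so the distance is 0
    ngramDistance('ClickHouse', 'ClickHouse') AS identical,
    -- strings sharing most 4-grams are expected to be close to 0
    ngramDistance('ClickHouse', 'ClickHome') AS similar,
    -- strings sharing no 4-grams are expected to be close to 1
    ngramDistance('ClickHouse', 'PostgreSQL') AS different
```
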
## ngramSearch(haystack, needle) {#ngramsearchhaystack-needle}

Same as `ngramDistance`, but calculates the non-symmetric difference between `needle` and `haystack`: the number of n-grams from `needle` minus the common number of n-grams, normalized by the number of `needle` n-grams. The closer to one, the more likely `needle` is in the `haystack`. Can be useful for fuzzy string search.

For case-insensitive and/or UTF-8 search, use the functions `ngramSearchCaseInsensitive`, `ngramSearchUTF8`, and `ngramSearchCaseInsensitiveUTF8`.

!!! note "Note"
    For the UTF-8 case we use 3-gram distance. None of these are perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables, so collisions may occur. With the UTF-8 case-insensitive format we do not use a fair `tolower` function: we zero the 5-th bit (starting from zero) of each codepoint byte and the first bit of the zeroth byte if there is more than one byte. This works for Latin and for almost all Cyrillic letters.

## countSubstrings {#countSubstrings}

Returns the number of substring occurrences.

For a case-insensitive search, use the [countSubstringsCaseInsensitive](../../sql-reference/functions/string-search-functions.md#countSubstringsCaseInsensitive) or [countSubstringsCaseInsensitiveUTF8](../../sql-reference/functions/string-search-functions.md#countSubstringsCaseInsensitiveUTF8) functions.

**Syntax**

``` sql
countSubstrings(haystack, needle[, start_pos])
```

**Arguments**

- `haystack` — The string to search in. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — The substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` – Position of the first character in the string at which the search starts. Optional. [UInt](../../sql-reference/data-types/int-uint.md).

**Returned values**

- Number of occurrences.

Type: [UInt64](../../sql-reference/data-types/int-uint.md).

**Examples**

Query:

``` sql
SELECT countSubstrings('foobar.com', '.');
```

Result:

``` text
┌─countSubstrings('foobar.com', '.')─┐
│                                  1 │
└────────────────────────────────────┘
```

Query:

``` sql
SELECT countSubstrings('aaaa', 'aa');
```

Result:

``` text
┌─countSubstrings('aaaa', 'aa')─┐
│                             2 │
└───────────────────────────────┘
```

Query:

``` sql
SELECT countSubstrings('abc___abc', 'abc', 4);
```

Result:

``` text
┌─countSubstrings('abc___abc', 'abc', 4)─┐
│                                      1 │
└────────────────────────────────────────┘
```

## countSubstringsCaseInsensitive {#countSubstringsCaseInsensitive}

Returns the number of substring occurrences, ignoring case.

**Syntax**

``` sql
countSubstringsCaseInsensitive(haystack, needle[, start_pos])
```

**Arguments**

- `haystack` — The string to search in. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — The substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` – Position of the first character in the string at which the search starts. Optional. [UInt](../../sql-reference/data-types/int-uint.md).

**Returned values**

- Number of occurrences.

Type: [UInt64](../../sql-reference/data-types/int-uint.md).

**Examples**

Query:

``` sql
SELECT countSubstringsCaseInsensitive('aba', 'B');
```

Result:

``` text
┌─countSubstringsCaseInsensitive('aba', 'B')─┐
│                                          1 │
└────────────────────────────────────────────┘
```

Query:

``` sql
SELECT countSubstringsCaseInsensitive('foobar.com', 'CoM');
```

Result:

``` text
┌─countSubstringsCaseInsensitive('foobar.com', 'CoM')─┐
│                                                   1 │
└─────────────────────────────────────────────────────┘
```

Query:

``` sql
SELECT countSubstringsCaseInsensitive('abC___abC', 'aBc', 2);
```

Result:

``` text
┌─countSubstringsCaseInsensitive('abC___abC', 'aBc', 2)─┐
│                                                     1 │
└───────────────────────────────────────────────────────┘
```

## countSubstringsCaseInsensitiveUTF8 {#countSubstringsCaseInsensitiveUTF8}

Returns the number of substring occurrences in a `UTF-8` encoded string, ignoring case.

**Syntax**

``` sql
countSubstringsCaseInsensitiveUTF8(haystack, needle[, start_pos])
```

**Arguments**

- `haystack` — The string to search in. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `needle` — The substring to search for. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `start_pos` – Position of the first character in the string at which the search starts. Optional. [UInt](../../sql-reference/data-types/int-uint.md).

**Returned values**

- Number of occurrences.

Type: [UInt64](../../sql-reference/data-types/int-uint.md).

**Examples**

Query:

``` sql
SELECT countSubstringsCaseInsensitiveUTF8('абв', 'A');
```

Result:

``` text
┌─countSubstringsCaseInsensitiveUTF8('абв', 'A')─┐
│                                              1 │
└────────────────────────────────────────────────┘
```

Query:

``` sql
SELECT countSubstringsCaseInsensitiveUTF8('аБв__АбВ__абв', 'Абв');
```

Result:

``` text
┌─countSubstringsCaseInsensitiveUTF8('аБв__АбВ__абв', 'Абв')─┐
│                                                          3 │
└────────────────────────────────────────────────────────────┘
```

## countMatches(haystack, pattern) {#countmatcheshaystack-pattern}

Returns the number of regular expression matches for a `pattern` in a `haystack`.

**Syntax**

``` sql
countMatches(haystack, pattern)
```

**Arguments**

- `haystack` — The string to search in. [String](../../sql-reference/syntax.md#syntax-string-literal).
- `pattern` — The regular expression with [re2 syntax](https://github.com/google/re2/wiki/Syntax). [String](../../sql-reference/data-types/string.md).

**Returned value**

- The number of matches.

Type: [UInt64](../../sql-reference/data-types/int-uint.md).

**Examples**

Query:

``` sql
SELECT countMatches('foobar.com', 'o+');
```

Result:

``` text
┌─countMatches('foobar.com', 'o+')─┐
│                                2 │
└──────────────────────────────────┘
```

Query:

``` sql
SELECT countMatches('aaaa', 'aa');
```

Result:

``` text
┌─countMatches('aaaa', 'aa')─┐
│                          2 │
└────────────────────────────┘
```

[Original article](https://clickhouse.tech/docs/en/query_language/functions/string_search_functions/) <!--hide-->