
# Functions for Searching Strings

The search is case-sensitive by default in all these functions. There are separate variants for case-insensitive search.

## position(haystack, needle), locate(haystack, needle)

Search for the substring `needle` in the string `haystack`.
Returns the position (in bytes) of the found substring, starting from 1, or returns 0 if the substring was not found.

For a case-insensitive search, use the function `positionCaseInsensitive`.

## positionUTF8(haystack, needle)

The same as `position`, but the position is returned in Unicode code points. It assumes that the string contains bytes representing UTF-8 encoded text. If this assumption is not met, the function still returns some result and does not throw an exception.

For a case-insensitive search, use the function `positionCaseInsensitiveUTF8`.
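
The difference between byte positions and code point positions can be modelled outside ClickHouse. The following Python sketch is only an illustration of the semantics described above; the helper names `position_bytes` and `position_utf8` are hypothetical and are not ClickHouse APIs.

```python
def position_bytes(haystack: str, needle: str) -> int:
    """Model of position(): 1-based offset in bytes, 0 if not found."""
    # Search in the UTF-8 encoded bytes, so offsets count bytes.
    index = haystack.encode("utf-8").find(needle.encode("utf-8"))
    return index + 1  # find() returns -1 when absent, which becomes 0


def position_utf8(haystack: str, needle: str) -> int:
    """Model of positionUTF8(): 1-based offset in code points, 0 if not found."""
    # Python str is indexed by code point, so str.find counts code points.
    return haystack.find(needle) + 1


text = "Привет, мир"
print(position_bytes(text, "мир"))  # 15: each Cyrillic letter takes 2 bytes
print(position_utf8(text, "мир"))   # 9
print(position_bytes(text, "xyz"))  # 0
```

For ASCII-only strings both models return the same value, because every code point is one byte.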

## multiSearchAllPositions(haystack, [needle<sub>1</sub>, needle<sub>2</sub>, ..., needle<sub>n</sub>])

The same as `position`, but returns an `Array` of positions, one for each needle<sub>i</sub>.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `multiSearchAllPositionsCaseInsensitive`, `multiSearchAllPositionsUTF8`, `multiSearchAllPositionsCaseInsensitiveUTF8`.

## multiSearchFirstPosition(haystack, [needle<sub>1</sub>, needle<sub>2</sub>, ..., needle<sub>n</sub>])

The same as `position`, but returns the leftmost offset in the string `haystack` at which any of the needles matches.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `multiSearchFirstPositionCaseInsensitive`, `multiSearchFirstPositionUTF8`, `multiSearchFirstPositionCaseInsensitiveUTF8`.

## multiSearchFirstIndex(haystack, [needle<sub>1</sub>, needle<sub>2</sub>, ..., needle<sub>n</sub>])

Returns the index `i` (starting from 1) of the leftmost found needle<sub>i</sub> in the string `haystack` and 0 otherwise.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `multiSearchFirstIndexCaseInsensitive`, `multiSearchFirstIndexUTF8`, `multiSearchFirstIndexCaseInsensitiveUTF8`.

## multiSearchAny(haystack, [needle<sub>1</sub>, needle<sub>2</sub>, ..., needle<sub>n</sub>])

Returns 1 if at least one string needle<sub>i</sub> is found in the string `haystack`, and 0 otherwise.

For a case-insensitive search and/or a search in UTF-8 format, use the functions `multiSearchAnyCaseInsensitive`, `multiSearchAnyUTF8`, `multiSearchAnyCaseInsensitiveUTF8`.

**Note: in all `multiSearch*` functions the number of needles must be less than 2<sup>8</sup> because of an implementation detail.**
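
As a rough illustration of how three of the `multiSearch*` functions relate to `position`, here is a Python sketch that models them on byte strings. The function names are hypothetical stand-ins, and the sketch says nothing about how ClickHouse implements the search.

```python
from typing import List


def _position(haystack: bytes, needle: bytes) -> int:
    # 1-based byte offset of the first match, 0 if the needle is absent.
    return haystack.find(needle) + 1


def multi_search_all_positions(haystack: bytes, needles: List[bytes]) -> List[int]:
    # One position per needle, in the same order as the needles.
    return [_position(haystack, needle) for needle in needles]


def multi_search_first_position(haystack: bytes, needles: List[bytes]) -> int:
    # Leftmost offset in the haystack matched by any needle, 0 if none match.
    found = [p for p in multi_search_all_positions(haystack, needles) if p > 0]
    return min(found) if found else 0


def multi_search_any(haystack: bytes, needles: List[bytes]) -> int:
    # 1 if at least one needle occurs in the haystack, 0 otherwise.
    return int(any(needle in haystack for needle in needles))


haystack = b"Hello, World!"
needles = [b"World", b"lo", b"xyz"]
print(multi_search_all_positions(haystack, needles))   # [8, 4, 0]
print(multi_search_first_position(haystack, needles))  # 4
print(multi_search_any(haystack, needles))             # 1
```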

## match(haystack, pattern)

Checks whether the string matches the `pattern` regular expression, which is a `re2` regular expression. The [syntax](https://github.com/google/re2/wiki/Syntax) of `re2` regular expressions is more limited than the syntax of Perl regular expressions.

Returns 0 if it doesn't match, or 1 if it matches.

Note that the backslash symbol (`\`) is used for escaping in the regular expression. The same symbol is used for escaping in string literals. So in order to escape a symbol in a regular expression, you must write two backslashes (`\\`) in a string literal.

The regular expression works with the string as if it is a set of bytes. The regular expression can't contain null bytes.
For patterns that only search for substrings in a string, it is better to use `LIKE` or `position`, since they work much faster.
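
The double-backslash rule is not specific to ClickHouse: any language with backslash escapes in string literals behaves in a similar way. The Python sketch below shows the same two layers of escaping. It is an analogy only, since Python's `re` module is not `re2` and its literal syntax is not ClickHouse SQL.

```python
import re

# The literal "\\." is written with two backslashes in source code,
# but the string handed to the regex engine holds two characters: \ and .
pattern = "\\."
print(len(pattern))                             # 2

# With the backslash in place, the dot matches only a literal dot.
print(bool(re.search(pattern, "example.com")))  # True
print(bool(re.search(pattern, "example")))      # False

# Without escaping, the dot matches any character.
print(bool(re.search(".", "example")))          # True
```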

## multiMatchAny(haystack, [pattern<sub>1</sub>, pattern<sub>2</sub>, ..., pattern<sub>n</sub>])

The same as `match`, but returns 0 if none of the regular expressions match and 1 if any of the patterns matches. It uses the [hyperscan](https://github.com/intel/hyperscan) library. For patterns that only search for substrings in a string, it is better to use `multiSearchAny`, since it works much faster.

**Note: the length of any `haystack` string must be less than 2<sup>32</sup> bytes; otherwise an exception is thrown. This restriction comes from the hyperscan API.**

## multiMatchAnyIndex(haystack, [pattern<sub>1</sub>, pattern<sub>2</sub>, ..., pattern<sub>n</sub>])

The same as `multiMatchAny`, but returns the index of any pattern that matches the haystack.

## multiFuzzyMatchAny(haystack, distance, [pattern<sub>1</sub>, pattern<sub>2</sub>, ..., pattern<sub>n</sub>])

The same as `multiMatchAny`, but returns 1 if any pattern matches the haystack within a constant [edit distance](https://en.wikipedia.org/wiki/Edit_distance). This function is experimental and can be extremely slow. For more information, see the [hyperscan documentation](https://intel.github.io/hyperscan/dev-reference/compilation.html#approximate-matching).
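
To give a feel for what "within an edit distance" means, the sketch below computes the edit distance between two literal strings in Python. It assumes the Levenshtein definition (insertions, deletions and substitutions each cost 1) and compares whole strings, whereas the fuzzy functions apply the distance to regular expression matches. The function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("clickhouse", "clikhouse"))  # 1 (one deleted letter)
print(edit_distance("kitten", "sitting"))        # 3
```

With a distance of 1, a pattern such as `clickhouse` would therefore also accept the misspelling `clikhouse` under this model.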

## multiFuzzyMatchAnyIndex(haystack, distance, [pattern<sub>1</sub>, pattern<sub>2</sub>, ..., pattern<sub>n</sub>])

The same as `multiFuzzyMatchAny`, but returns any index that matches the haystack within a constant edit distance.

**Note: `multiFuzzyMatch*` functions do not support UTF-8 regular expressions; such expressions are treated as sequences of bytes because of a hyperscan restriction.**

**Note: to turn off all functions that use hyperscan, run `SET allow_hyperscan = 0;`.**

## extract(haystack, pattern)

Extracts a fragment of a string using a regular expression. If `haystack` doesn't match the `pattern` regex, an empty string is returned. If the regex doesn't contain subpatterns, it takes the fragment that matches the entire regex. Otherwise, it takes the fragment that matches the first subpattern.
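
The subpattern rule can be sketched in Python as follows. This is a model of the behavior described above, built on Python's `re` module rather than `re2`; the name `extract` here refers to a hypothetical Python helper, not to the ClickHouse function itself.

```python
import re


def extract(haystack: str, pattern: str) -> str:
    """Return the first subpattern if the regex has one, else the whole match."""
    match = re.search(pattern, haystack)
    if match is None:
        return ""  # no match yields an empty string
    if match.re.groups == 0:
        return match.group(0)  # no subpatterns: take the entire match
    # A group that did not participate in the match is None in Python.
    return match.group(1) or ""


print(extract("order-12345-eu", r"\d+"))          # 12345 (entire match)
print(extract("order-12345-eu", r"order-(\d+)"))  # 12345 (first subpattern)
print(extract("key=value", r"(\w+)=(\w+)"))       # key (only the first subpattern)
print(repr(extract("no digits", r"\d+")))         # ''
```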

## extractAll(haystack, pattern)

Extracts all the fragments of a string using a regular expression. If `haystack` doesn't match the `pattern` regex, an empty string is returned. Returns an array of strings consisting of all matches to the regex. In general, the behavior is the same as the `extract` function (it takes the first subpattern, or the entire expression if there isn't a subpattern).

## like(haystack, pattern), haystack LIKE pattern operator

Checks whether a string matches a simple regular expression.
The regular expression can contain the metasymbols `%` and `_`.

`%` indicates any quantity of any bytes (including zero bytes).

`_` indicates any one byte.

Use the backslash (`\`) for escaping metasymbols. See the note on escaping in the description of the `match` function.

For regular expressions like `%needle%`, the code is better optimized and works as fast as the `position` function.
For other regular expressions, the code is the same as for the `match` function.
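
One way to picture the metasymbols is to translate a `LIKE` pattern into an ordinary regular expression. The Python sketch below does this on UTF-8 bytes, so that `_` matches exactly one byte as described above. It is a model only: the helper names are hypothetical, and it does not show how ClickHouse evaluates `LIKE`.

```python
import re


def like_to_regex(pattern: bytes) -> bytes:
    """Translate a LIKE pattern into an equivalent byte regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i:i + 1]
        if ch == b"\\" and i + 1 < len(pattern):
            # An escaped metasymbol is matched literally.
            parts.append(re.escape(pattern[i + 1:i + 2]))
            i += 2
            continue
        if ch == b"%":
            parts.append(b".*")  # any quantity of any bytes
        elif ch == b"_":
            parts.append(b".")   # exactly one byte
        else:
            parts.append(re.escape(ch))
        i += 1
    return b"".join(parts)


def like(haystack: str, pattern: str) -> int:
    regex = like_to_regex(pattern.encode("utf-8"))
    # DOTALL lets "." match newline bytes too; fullmatch anchors both ends.
    matched = re.fullmatch(regex, haystack.encode("utf-8"), flags=re.DOTALL)
    return int(matched is not None)


print(like("hello world", "%wor%"))  # 1
print(like("hello", "h_llo"))        # 1
print(like("100%", "100\\%"))        # 1 (escaped % is a literal percent sign)
print(like("1000", "100\\%"))        # 0
print(like("é", "_"))                # 0 ("é" is two bytes in UTF-8)
```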

## notLike(haystack, pattern), haystack NOT LIKE pattern operator

The same thing as `like`, but negative.

## ngramDistance(haystack, needle)

Calculates the 4-gram distance between `haystack` and `needle`: counts the symmetric difference between the two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a float number from 0 to 1; the closer to zero, the more similar the strings are. If the `needle` is longer than 32 KB, an exception is thrown. If any of the `haystack` strings is longer than 32 KB, the distance is always 1.
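
The formula can be written out in a few lines of Python. This sketch uses exact multisets of byte 4-grams, so it is an idealized version of the definition, and its results are not guaranteed to equal what ClickHouse returns. The function names are hypothetical, and returning 0.0 when neither string has any 4-grams is a choice made for this sketch only.

```python
from collections import Counter


def ngrams(data: bytes, n: int = 4) -> Counter:
    # Multiset of all contiguous n-byte windows; empty if data is shorter than n.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_distance(haystack: str, needle: str, n: int = 4) -> float:
    a = ngrams(haystack.encode("utf-8"), n)
    b = ngrams(needle.encode("utf-8"), n)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # assumption of this sketch, not documented behavior
    # Counter subtraction keeps only positive counts, so (a - b) + (b - a)
    # is the symmetric difference of the two multisets.
    difference = sum(((a - b) + (b - a)).values())
    return difference / total


print(ngram_distance("ClickHouse", "ClickHouse"))  # 0.0 (identical strings)
print(ngram_distance("abcdef", "uvwxyz"))          # 1.0 (no shared 4-grams)
print(ngram_distance("abcdef", "abcdxx"))          # 4 differing 4-grams out of 6
```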

For a case-insensitive search and/or a search in UTF-8 format, use the functions `ngramDistanceCaseInsensitive`, `ngramDistanceUTF8`, `ngramDistanceCaseInsensitiveUTF8`.

**Note: For the UTF-8 variants, 3-gram distance is used. None of these are perfectly fair n-gram distances: n-grams are hashed with 2-byte hashes and the symmetric difference is calculated between the resulting hash tables, so collisions may occur. The case-insensitive UTF-8 variant does not use a fair `tolower` function; instead, it zeroes the 5th bit (counting from zero) of each code point byte, which works for Latin letters and for most Cyrillic letters.**
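
The bit trick from the note can be tried out in Python. The sketch below clears bit 5 (value `0x20`) of every byte, which models only the folding step and nothing else about the distance functions. The name `fold_case` is hypothetical.

```python
def fold_case(data: bytes) -> bytes:
    # 0xDF is 0xFF with bit 5 cleared, so the mask zeroes that bit in each byte.
    return bytes(byte & 0xDF for byte in data)


# Latin letters: lowercase and uppercase differ only in bit 5.
print(fold_case(b"Hello"))  # b'HELLO'

# Many Cyrillic letters fold correctly in their UTF-8 encoding...
print(fold_case("мама".encode("utf-8")) == fold_case("МАМА".encode("utf-8")))  # True

# ...but not all of them: "р" and "Р" start with different lead bytes.
print(fold_case("р".encode("utf-8")) == fold_case("Р".encode("utf-8")))        # False
```

Note that the mask is applied to every byte, so bytes that are not letters are changed as well.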


[Original article](https://clickhouse.yandex/docs/en/query_language/functions/string_search_functions/) <!--hide-->