ClickHouse/docs/en/sql-reference/functions/splitting-merging-functions.md
2023-09-18 07:42:52 +00:00

slug sidebar_position sidebar_label
/en/sql-reference/functions/splitting-merging-functions 165 Splitting Strings

Functions for Splitting Strings

splitByChar

Splits a string into substrings separated by a specified character. Uses a constant string separator which consists of exactly one character. Returns an array of selected substrings. Empty substrings may be selected if the separator occurs at the beginning or end of the string, or if there are multiple consecutive separators.

Syntax

splitByChar(separator, s[, max_substrings])

Arguments

  • separator — The separator which should contain exactly one character. String.
  • s — The string to split. String.
  • max_substrings — An optional Int64 defaulting to 0. When max_substrings > 0, the function returns at most max_substrings substrings; otherwise it returns as many substrings as possible.

Returned value(s)

Returns an array of selected substrings. Empty substrings may be selected when:

  • A separator occurs at the beginning or end of the string;
  • There are multiple consecutive separators;
  • The original string s is empty.

Type: Array(String).
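The empty-substring cases listed above can be checked with a short query. This is a minimal sketch; the input strings are illustrative, and the commented results are what the rules above imply rather than captured output.

```sql
-- Separator at the beginning, twice in a row, and at the end:
SELECT splitByChar(',', ',a,,b,');  -- expected per the rules above: ['','a','','b','']

-- Empty input string:
SELECT splitByChar(',', '');        -- expected per the rules above: ['']
```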

:::note The behavior of parameter max_substrings changed starting with ClickHouse v22.11. In older versions, max_substrings > 0 meant that max_substrings-many splits were performed and that the remainder of the string was returned as the final element of the list. For example,

  • in v22.10: SELECT splitByChar('=', 'a=b=c=d', 2); -- ['a','b','c=d']
  • in v22.11: SELECT splitByChar('=', 'a=b=c=d', 2); -- ['a','b']

The previous behavior can be restored by setting split_tokens_like_python = 1. :::
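To compare the two modes side by side, the setting can be applied per query with a SETTINGS clause. This sketch assumes the setting name given in the note is available in your ClickHouse build; the second result is deliberately not shown, since it depends on the server version.

```sql
-- Default behavior (v22.11 and later): at most 2 substrings are returned.
SELECT splitByChar('=', 'a=b=c=d', 2);  -- ['a','b'] according to the note above

-- Same query with the setting enabled for this statement only.
SELECT splitByChar('=', 'a=b=c=d', 2)
SETTINGS split_tokens_like_python = 1;
```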

Example

SELECT splitByChar(',', '1,2,3,abcde');

Result:

┌─splitByChar(',', '1,2,3,abcde')─┐
│ ['1','2','3','abcde']           │
└─────────────────────────────────┘

splitByString

Splits a string into substrings separated by a string. The separator is a constant string that may consist of multiple characters. If the separator is empty, the string s is split into an array of single characters.

Syntax

splitByString(separator, s[, max_substrings])

Arguments

  • separator — The separator. String.
  • s — The string to split. String.
  • max_substrings — An optional Int64 defaulting to 0. When max_substrings > 0, the function returns at most max_substrings substrings; otherwise it returns as many substrings as possible.

Returned value(s)

Returns an array of selected substrings. Empty substrings may be selected when:

  • A non-empty separator occurs at the beginning or end of the string;
  • There are multiple consecutive non-empty separators;
  • The original string s is empty while the separator is not empty.

Type: Array(String).

The setting split_tokens_like_python (default: 0) controls whether, when max_substrings > 0, the remaining string (if any) is included in the result array.
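The empty-substring rules for a multi-character separator can be illustrated as follows. This is a sketch with a made-up input string; the commented result follows from the rules listed above and is not captured output.

```sql
-- 'ab' occurs at the start, twice in a row in the middle, and at the end.
SELECT splitByString('ab', 'abxababyab');  -- expected per the rules above: ['','x','','y','']
```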

Example

SELECT splitByString(', ', '1, 2 3, 4,5, abcde');

Result:

┌─splitByString(', ', '1, 2 3, 4,5, abcde')─┐
│ ['1','2 3','4,5','abcde']                 │
└───────────────────────────────────────────┘

SELECT splitByString('', 'abcde');

Result:

┌─splitByString('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘

splitByRegexp

Splits a string into substrings separated by a regular expression. It uses a regular expression string regexp as the separator. If the regexp is empty, it will split the string s into an array of single characters. If no match is found for this regular expression, the string s won't be split.

Syntax

splitByRegexp(regexp, s[, max_substrings])

Arguments

  • regexp — Regular expression. Constant. String or FixedString.
  • s — The string to split. String.
  • max_substrings — An optional Int64 defaulting to 0. When max_substrings > 0, the function returns at most max_substrings substrings; otherwise it returns as many substrings as possible.

Returned value(s)

Returns an array of selected substrings. Empty substrings may be selected when:

  • A non-empty regular expression match occurs at the beginning or end of the string;
  • There are multiple consecutive non-empty regular expression matches;
  • The original string s is empty while the regular expression is not empty.

Type: Array(String).

The setting split_tokens_like_python (default: 0) controls whether, when max_substrings > 0, the remaining string (if any) is included in the result array.
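Two edge cases described above, a pattern that never matches and a match at the very start of the string, can be sketched like this. The inputs are illustrative, and the commented results are derived from the documented rules rather than captured output.

```sql
-- No digits in the input, so the string is not split.
SELECT splitByRegexp('\\d+', 'abcde');  -- expected: ['abcde']

-- A match at the beginning produces a leading empty substring.
SELECT splitByRegexp('\\d+', '12abc');  -- expected per the rules above: ['','abc']
```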

Example

SELECT splitByRegexp('\\d+', 'a12bc23de345f');

Result:

┌─splitByRegexp('\\d+', 'a12bc23de345f')─┐
│ ['a','bc','de','f']                    │
└────────────────────────────────────────┘

SELECT splitByRegexp('', 'abcde');

Result:

┌─splitByRegexp('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘

splitByWhitespace

Splits a string into substrings separated by whitespace characters. Returns an array of selected substrings.

Syntax

splitByWhitespace(s[, max_substrings])

Arguments

  • s — The string to split. String.
  • max_substrings — An optional Int64 defaulting to 0. When max_substrings > 0, the function returns at most max_substrings substrings; otherwise it returns as many substrings as possible.

Returned value(s)

Returns an array of selected substrings.

Type: Array(String).

The setting split_tokens_like_python (default: 0) controls whether, when max_substrings > 0, the remaining string (if any) is included in the result array.
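The effect of max_substrings can be sketched with a small query. It assumes the default setting value (0) and v22.11 or later semantics; the commented result follows from the argument description above and is not captured output.

```sql
-- Keep only the first two whitespace-separated substrings.
SELECT splitByWhitespace('  1!  a,  b.  ', 2);  -- expected: ['1!','a,']
```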

Example

SELECT splitByWhitespace('  1!  a,  b.  ');

Result:

┌─splitByWhitespace('  1!  a,  b.  ')─┐
│ ['1!','a,','b.']                    │
└─────────────────────────────────────┘

splitByNonAlpha

Splits a string into substrings separated by whitespace and punctuation characters. Returns an array of selected substrings.

Syntax

splitByNonAlpha(s[, max_substrings])

Arguments

  • s — The string to split. String.
  • max_substrings — An optional Int64 defaulting to 0. When max_substrings > 0, the function returns at most max_substrings substrings; otherwise it returns as many substrings as possible.

Returned value(s)

Returns an array of selected substrings.

Type: Array(String).

The setting split_tokens_like_python (default: 0) controls whether, when max_substrings > 0, the remaining string (if any) is included in the result array.

Example

SELECT splitByNonAlpha('  1!  a,  b.  ');

Result:

┌─splitByNonAlpha('  1!  a,  b.  ')─┐
│ ['1','a','b']                     │
└───────────────────────────────────┘

arrayStringConcat

Concatenates string representations of values listed in the array with the separator. separator is an optional parameter: a constant string, set to an empty string by default. Returns the string.

Syntax

arrayStringConcat(arr[, separator])

Example

SELECT arrayStringConcat(['12/05/2021', '12:50:00'], ' ') AS DateString;

Result:

┌─DateString──────────┐
│ 12/05/2021 12:50:00 │
└─────────────────────┘
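Because arrayStringConcat joins what the split functions produce, the two can be combined to replace one delimiter with another. A minimal sketch with illustrative values:

```sql
-- Split on ',' and join again with ';'.
SELECT arrayStringConcat(splitByChar(',', '1,2,3'), ';') AS joined;  -- expected: 1;2;3

-- With the separator omitted, the elements are concatenated directly.
SELECT arrayStringConcat(['a', 'b', 'c']) AS joined;                 -- expected: abc
```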

alphaTokens

Selects substrings of consecutive bytes from the ranges a-z and A-Z. Returns an array of substrings.

Syntax

alphaTokens(s[, max_substrings])

Alias: splitByAlpha

Arguments

  • s — The string to split. String.
  • max_substrings — An optional Int64 defaulting to 0. When max_substrings > 0, the function returns at most max_substrings substrings; otherwise it returns as many substrings as possible.

Returned value(s)

Returns an array of selected substrings.

Type: Array(String).

The setting split_tokens_like_python (default: 0) controls whether, when max_substrings > 0, the remaining string (if any) is included in the result array.

Example

SELECT alphaTokens('abca1abc');

Result:

┌─alphaTokens('abca1abc')─┐
│ ['abca','abc']          │
└─────────────────────────┘
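The alias listed above can be used interchangeably. The sketch below also shows that digits and punctuation both act as separators, since only the bytes a-z and A-Z are kept; the second input is illustrative and the commented results follow from that definition.

```sql
-- Same call through the alias.
SELECT splitByAlpha('abca1abc');       -- expected: ['abca','abc']

-- Digits, spaces and punctuation are all dropped.
SELECT alphaTokens('Click-House 23');  -- expected: ['Click','House']
```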

extractAllGroups

Extracts all groups from non-overlapping substrings matched by a regular expression.

Syntax

extractAllGroups(text, regexp)

Arguments

  • text — String or FixedString.
  • regexp — Regular expression. Constant. String or FixedString.

Returned values

  • If the function finds at least one matching group, it returns an Array(Array(String)) column, clustered by group_id (1 to N, where N is the number of capturing groups in regexp).

  • If there is no matching group, returns an empty array.

Type: Array.

Example

SELECT extractAllGroups('abc=123, 8="hkl"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)');

Result:

┌─extractAllGroups('abc=123, 8="hkl"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','123'],['8','"hkl"']]                                         │
└───────────────────────────────────────────────────────────────────────┘
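The no-match case described under "Returned values" can be sketched as follows. The input text is illustrative, and the commented result is what the description above implies.

```sql
-- The text contains no key=value pair, so nothing matches.
SELECT extractAllGroups('no pairs here', '(\\w+)=(\\w+)');  -- expected: []
```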

ngrams

Splits a UTF-8 string into n-grams of ngramsize symbols.

Syntax

ngrams(string, ngramsize)

Arguments

  • string — String. String or FixedString.
  • ngramsize — The size of an n-gram. UInt.

Returned values

  • Array with n-grams.

Type: Array(String).

Example

SELECT ngrams('ClickHouse', 3);

Result:

┌─ngrams('ClickHouse', 3)───────────────────────────┐
│ ['Cli','lic','ick','ckH','kHo','Hou','ous','use'] │
└───────────────────────────────────────────────────┘
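For a string of L symbols and an n-gram size n (with n ≤ L), the result contains L - n + 1 elements. The sketch below checks this on the example string, which has 10 symbols.

```sql
-- 10 - 3 + 1 = 8 trigrams, matching the result above.
SELECT length(ngrams('ClickHouse', 3));  -- expected: 8

-- 10 - 5 + 1 = 6 five-grams.
SELECT ngrams('ClickHouse', 5);  -- expected: ['Click','lickH','ickHo','ckHou','kHous','House']
```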

tokens

Splits a string into tokens using non-alphanumeric ASCII characters as separators.

Arguments

  • input_string — Any set of bytes represented as the String data type object.

Returned value

  • The resulting array of tokens from input string.

Type: Array.

Example

SELECT tokens('test1,;\\ test2,;\\ test3,;\\   test4') AS tokens;

Result:

┌─tokens────────────────────────────┐
│ ['test1','test2','test3','test4'] │
└───────────────────────────────────┘
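A second sketch on a log-like string shows how punctuation such as '/' and '.' splits what might look like a single word. The input is illustrative, and the commented result follows from the separator rule stated above.

```sql
-- Spaces, '/' and '.' are non-alphanumeric ASCII, so each acts as a separator.
SELECT tokens('GET /index.html HTTP/1.1') AS tokens;  -- expected: ['GET','index','html','HTTP','1','1']
```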