---
slug: /en/sql-reference/functions/splitting-merging-functions
sidebar_position: 165
sidebar_label: Splitting Strings
---

# Functions for Splitting Strings

## splitByChar

Splits a string into substrings separated by a specified character. Uses a constant string `separator` which consists of exactly one character.
Returns an array of selected substrings. Empty substrings may be selected if the separator occurs at the beginning or end of the string, or if there are multiple consecutive separators.

**Syntax**

``` sql
splitByChar(separator, s[, max_substrings])
```
**Arguments**

- `separator` — The separator which should contain exactly one character. [String](../data-types/string.md).
- `s` — The string to split. [String](../data-types/string.md).
- `max_substrings` — An optional `Int64` defaulting to 0. If `max_substrings` > 0, the returned array will contain at most `max_substrings` substrings, otherwise the function will return as many substrings as possible.

**Returned value(s)**

- An array of selected substrings. [Array](../data-types/array.md)([String](../data-types/string.md)).

Empty substrings may be selected when:

- A separator occurs at the beginning or end of the string;
- There are multiple consecutive separators;
- The original string `s` is empty.

:::note
The behavior of parameter `max_substrings` changed starting with ClickHouse v22.11. In versions older than that, `max_substrings > 0` meant that `max_substrings`-many splits were performed and that the remainder of the string was returned as the final element of the list.
For example,
- in v22.10: `SELECT splitByChar('=', 'a=b=c=d', 2);` returned `['a','b','c=d']`
- in v22.11: `SELECT splitByChar('=', 'a=b=c=d', 2);` returned `['a','b']`

A behavior similar to ClickHouse pre-v22.11 can be achieved by setting
[splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string), for example:
`SELECT splitByChar('=', 'a=b=c=d', 2) SETTINGS splitby_max_substrings_includes_remaining_string = 1 -- ['a', 'b=c=d']`
:::

**Example**

``` sql
SELECT splitByChar(',', '1,2,3,abcde');
```

Result:

``` text
┌─splitByChar(',', '1,2,3,abcde')─┐
│ ['1','2','3','abcde']           │
└─────────────────────────────────┘
```
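The empty-substring rules and the `max_substrings` limit can be checked with a couple of short queries. This is an illustrative sketch: the input `',a,,b,'` is made up for the example, and the results in the comments are what the rules described in this section imply, not output copied from a recorded run.

``` sql
-- Separator at the start and at the end, plus two consecutive separators:
-- each separator delimits a substring, so empty strings appear in the result.
SELECT splitByChar(',', ',a,,b,');       -- expected: ['','a','','b','']

-- Keep at most two substrings (behavior since v22.11, default settings).
SELECT splitByChar('=', 'a=b=c=d', 2);   -- expected: ['a','b']
```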

## splitByString

Splits a string into substrings separated by a string. Uses a constant string `separator`, which may consist of multiple characters. If `separator` is empty, the string `s` is split into an array of single characters.

**Syntax**

``` sql
splitByString(separator, s[, max_substrings])
```
**Arguments**

- `separator` — The separator. [String](../data-types/string.md).
- `s` — The string to split. [String](../data-types/string.md).
- `max_substrings` — An optional `Int64` defaulting to 0. If `max_substrings` > 0, the returned array will contain at most `max_substrings` substrings, otherwise the function will return as many substrings as possible.

**Returned value(s)**

- An array of selected substrings. [Array](../data-types/array.md)([String](../data-types/string.md)).

Empty substrings may be selected when:

- A non-empty separator occurs at the beginning or end of the string;
- There are multiple consecutive non-empty separators;
- The original string `s` is empty while the separator is not empty.

:::note
Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls whether the remaining string is included in the last element of the result array when argument `max_substrings` > 0.
:::

**Example**

``` sql
SELECT splitByString(', ', '1, 2 3, 4,5, abcde');
```

Result:

``` text
┌─splitByString(', ', '1, 2 3, 4,5, abcde')─┐
│ ['1','2 3','4,5','abcde']                 │
└───────────────────────────────────────────┘
```

``` sql
SELECT splitByString('', 'abcde');
```

Result:

``` text
┌─splitByString('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘
```
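The interaction between `max_substrings` and the `splitby_max_substrings_includes_remaining_string` setting is easiest to see side by side. The queries below are a minimal sketch with a made-up input string; they assume a server version of v22.11 or later.

``` sql
-- Default setting (0): the result is cut off after max_substrings elements.
SELECT splitByString(', ', 'a, b, c, d', 2);

-- Setting enabled: the last element keeps the unsplit remainder of the string.
SELECT splitByString(', ', 'a, b, c, d', 2)
SETTINGS splitby_max_substrings_includes_remaining_string = 1;
```

Going by the description of the setting, the first query should return `['a','b']` and the second `['a','b, c, d']`.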

## splitByRegexp

Splits a string into substrings separated by a regular expression. Uses a regular expression string `regexp` as the separator. If `regexp` is empty, the string `s` is split into an array of single characters. If no match is found for the regular expression, the string `s` is not split.

**Syntax**

``` sql
splitByRegexp(regexp, s[, max_substrings])
```
**Arguments**

- `regexp` — Regular expression. Constant. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `s` — The string to split. [String](../data-types/string.md).
- `max_substrings` — An optional `Int64` defaulting to 0. If `max_substrings` > 0, the returned array will contain at most `max_substrings` substrings, otherwise the function will return as many substrings as possible.

**Returned value(s)**

- An array of selected substrings. [Array](../data-types/array.md)([String](../data-types/string.md)).

Empty substrings may be selected when:

- A non-empty regular expression match occurs at the beginning or end of the string;
- There are multiple consecutive non-empty regular expression matches;
- The original string `s` is empty while the regular expression is not empty.

:::note
Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls whether the remaining string is included in the last element of the result array when argument `max_substrings` > 0.
:::

**Example**

``` sql
SELECT splitByRegexp('\\d+', 'a12bc23de345f');
```

Result:

``` text
┌─splitByRegexp('\\d+', 'a12bc23de345f')─┐
│ ['a','bc','de','f']                    │
└────────────────────────────────────────┘
```

``` sql
SELECT splitByRegexp('', 'abcde');
```

Result:

``` text
┌─splitByRegexp('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘
```
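Two further cases may help when reading the description of `splitByRegexp`: a pattern that never matches, and a pattern that absorbs optional whitespace around a delimiter. Both inputs are invented for illustration, and the commented results are what the documented behavior implies.

``` sql
-- The input contains no digits, so the pattern never matches
-- and the whole string comes back as a single element.
SELECT splitByRegexp('\\d+', 'abcdef');              -- expected: ['abcdef']

-- Split on a comma together with any whitespace around it.
SELECT splitByRegexp('\\s*,\\s*', 'a , b,c ,  d');   -- expected: ['a','b','c','d']
```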

## splitByWhitespace

Splits a string into substrings separated by whitespace characters.
Returns an array of selected substrings.

**Syntax**

``` sql
splitByWhitespace(s[, max_substrings])
```
**Arguments**

- `s` — The string to split. [String](../data-types/string.md).
- `max_substrings` — An optional `Int64` defaulting to 0. If `max_substrings` > 0, the returned array will contain at most `max_substrings` substrings, otherwise the function will return as many substrings as possible.

**Returned value(s)**

- An array of selected substrings. [Array](../data-types/array.md)([String](../data-types/string.md)).

:::note
Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls whether the remaining string is included in the last element of the result array when argument `max_substrings` > 0.
:::

**Example**

``` sql
SELECT splitByWhitespace('  1!  a,  b.  ');
```

Result:

``` text
┌─splitByWhitespace('  1!  a,  b.  ')─┐
│ ['1!','a,','b.']                    │
└─────────────────────────────────────┘
```

## splitByNonAlpha

Splits a string into substrings separated by whitespace and punctuation characters.
Returns an array of selected substrings.

**Syntax**

``` sql
splitByNonAlpha(s[, max_substrings])
```
**Arguments**

- `s` — The string to split. [String](../data-types/string.md).
- `max_substrings` — An optional `Int64` defaulting to 0. If `max_substrings` > 0, the returned array will contain at most `max_substrings` substrings, otherwise the function will return as many substrings as possible.

**Returned value(s)**

- An array of selected substrings. [Array](../data-types/array.md)([String](../data-types/string.md)).

:::note
Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls whether the remaining string is included in the last element of the result array when argument `max_substrings` > 0.
:::

**Example**

``` sql
SELECT splitByNonAlpha('  1!  a,  b.  ');
```

Result:

``` text
┌─splitByNonAlpha('  1!  a,  b.  ')─┐
│ ['1','a','b']                     │
└───────────────────────────────────┘
```
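Because `splitByWhitespace` and `splitByNonAlpha` differ only in whether punctuation counts as a separator, running both on the same input makes the difference visible. The input string and the column aliases below are chosen purely for illustration.

``` sql
SELECT
    splitByWhitespace('Hello, world! 42') AS by_whitespace,  -- expected: ['Hello,','world!','42']
    splitByNonAlpha('Hello, world! 42')   AS by_non_alpha;   -- expected: ['Hello','world','42']
```

Punctuation stays attached to the words in the first column and is dropped in the second.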

## arrayStringConcat

Concatenates string representations of values listed in the array with the separator. `separator` is an optional parameter: a constant string, set to an empty string by default.
Returns the string.

**Syntax**

```sql
arrayStringConcat(arr[, separator])
```
**Example**

``` sql
SELECT arrayStringConcat(['12/05/2021', '12:50:00'], ' ') AS DateString;
```

Result:

```text
┌─DateString──────────┐
│ 12/05/2021 12:50:00 │
└─────────────────────┘
```
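`arrayStringConcat` is the natural counterpart of the splitting functions, so a split followed by a join is a common pattern. The sketch below uses made-up inputs and an illustrative alias `joined`.

``` sql
-- Split on commas, then join the parts with a different separator.
SELECT arrayStringConcat(splitByChar(',', '1,2,3'), ';') AS joined;   -- expected: 1;2;3

-- With the separator omitted, the elements are concatenated directly.
SELECT arrayStringConcat(['a', 'b', 'c']) AS joined;                  -- expected: abc
```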

## alphaTokens

Selects substrings of consecutive bytes from the ranges a-z and A-Z. Returns an array of substrings.

**Syntax**

``` sql
alphaTokens(s[, max_substrings])
```
Alias: `splitByAlpha`

**Arguments**

- `s` — The string to split. [String](../data-types/string.md).
- `max_substrings` — An optional `Int64` defaulting to 0. If `max_substrings` > 0, the returned array will contain at most `max_substrings` substrings, otherwise the function will return as many substrings as possible.

**Returned value(s)**

- An array of selected substrings. [Array](../data-types/array.md)([String](../data-types/string.md)).

:::note
Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls whether the remaining string is included in the last element of the result array when argument `max_substrings` > 0.
:::

**Example**

``` sql
SELECT alphaTokens('abca1abc');
```

Result:

``` text
┌─alphaTokens('abca1abc')─┐
│ ['abca','abc']          │
└─────────────────────────┘
```
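The alias and the optional `max_substrings` argument can be tried out as follows. This is a small sketch with invented inputs and aliases; the commented results follow from the description of `alphaTokens`, assuming default settings.

``` sql
-- splitByAlpha is an alias, so both columns should hold the same array.
SELECT
    alphaTokens('abca1abc')  AS tokens,      -- expected: ['abca','abc']
    splitByAlpha('abca1abc') AS via_alias;

-- Keep at most two alphabetic runs.
SELECT alphaTokens('one1two2three3', 2);     -- expected: ['one','two']
```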

## extractAllGroups

Extracts all groups from non-overlapping substrings matched by a regular expression.

**Syntax**

``` sql
extractAllGroups(text, regexp)
```
**Arguments**

- `text` — [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `regexp` — Regular expression. Constant. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).

**Returned values**

- If the function finds at least one matching group, it returns an `Array(Array(String))` column, clustered by group_id (1 to N, where N is the number of capturing groups in `regexp`). If there is no matching group, it returns an empty array. [Array](../data-types/array.md).

**Example**

``` sql
SELECT extractAllGroups('abc=123, 8="hkl"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)');
```

Result:

``` text
┌─extractAllGroups('abc=123, 8="hkl"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','123'],['8','"hkl"']]                                         │
└───────────────────────────────────────────────────────────────────────┘
```
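Since each inner array holds the capturing groups of one match, the result of `extractAllGroups` can be post-processed with ordinary array functions. The sketch below pulls keys and values out of a made-up `key=value` string; the aliases `pairs`, `keys` and `vals` are illustrative.

``` sql
-- Each inner array is one match: [key, value].
WITH extractAllGroups('k1=v1;k2=v2', '(\\w+)=(\\w+)') AS pairs
SELECT
    pairs,                                -- expected: [['k1','v1'],['k2','v2']]
    arrayMap(p -> p[1], pairs) AS keys,   -- expected: ['k1','k2']
    arrayMap(p -> p[2], pairs) AS vals;   -- expected: ['v1','v2']
```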

## ngrams

Splits a UTF-8 string into n-grams of `ngramsize` symbols.

**Syntax**

``` sql
ngrams(string, ngramsize)
```
**Arguments**

- `string` — String. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `ngramsize` — The size of an n-gram. [UInt](../data-types/int-uint.md).

**Returned values**

- Array with n-grams. [Array](../data-types/array.md)([String](../data-types/string.md)).

**Example**

``` sql
SELECT ngrams('ClickHouse', 3);
```

Result:

``` text
┌─ngrams('ClickHouse', 3)───────────────────────────┐
│ ['Cli','lic','ick','ckH','kHo','Hou','ous','use'] │
└───────────────────────────────────────────────────┘
```
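A string of length `len` yields `len - ngramsize + 1` overlapping n-grams, which is why the ten-character `'ClickHouse'` produces eight 3-grams. The following sketch checks this on a shorter, made-up input as well.

``` sql
-- Sliding window of size 2 over a four-character string.
SELECT ngrams('abcd', 2);                 -- expected: ['ab','bc','cd']

-- Count the 3-grams of a ten-character string: 10 - 3 + 1.
SELECT length(ngrams('ClickHouse', 3));   -- expected: 8
```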

## tokens

Splits a string into tokens using non-alphanumeric ASCII characters as separators.

**Arguments**

- `input_string` — Any set of bytes represented as the [String](../data-types/string.md) data type object.

**Returned value**

- The resulting array of tokens from input string. [Array](../data-types/array.md).

**Example**

``` sql
SELECT tokens('test1,;\\ test2,;\\ test3,;\\ test4') AS tokens;
```

Result:

``` text
┌─tokens────────────────────────────┐
│ ['test1','test2','test3','test4'] │
└───────────────────────────────────┘
```
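For plain text, `tokens` behaves like a simple word splitter that keeps letters and digits and discards everything else. The query below is a minimal sketch with an invented sentence; it assumes the single-argument form described in this section.

``` sql
-- Commas, spaces and exclamation marks all act as separators.
SELECT tokens('Hello, world! 42') AS tokens;   -- expected: ['Hello','world','42']
```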