---
sidebar_position: 40
sidebar_label: Strings
---
# Functions for Working with Strings
:::note
Functions for [searching](../../sql-reference/functions/string-search-functions.md) and [replacing](../../sql-reference/functions/string-replace-functions.md) in strings are described separately.
:::
## empty
Checks whether the input string is empty.
**Syntax**
``` sql
empty(x)
```
A string is considered non-empty if it contains at least one byte, even if this is a space or a null byte.
The function also works for [arrays](array-functions.md#function-empty) or [UUID](uuid-functions.md#empty).
**Arguments**
- `x` — Input value. [String](../data-types/string.md).
**Returned value**
- Returns `1` for an empty string or `0` for a non-empty string.
Type: [UInt8](../data-types/int-uint.md).
**Example**
Query:
```sql
SELECT empty('');
```
Result:
```text
┌─empty('')─┐
│         1 │
└───────────┘
```
## notEmpty
Checks whether the input string is non-empty.
**Syntax**
``` sql
notEmpty(x)
```
A string is considered non-empty if it contains at least one byte, even if this is a space or a null byte.
The function also works for [arrays](array-functions.md#function-notempty) or [UUID](uuid-functions.md#notempty).
**Arguments**
- `x` — Input value. [String](../data-types/string.md).
**Returned value**
- Returns `1` for a non-empty string or `0` for an empty string.
Type: [UInt8](../data-types/int-uint.md).
**Example**
Query:
```sql
SELECT notEmpty('text');
```
Result:
```text
┌─notEmpty('text')─┐
│                1 │
└──────────────────┘
```
## length
Returns the length of a string in bytes (not in characters, and not in code points).
The result type is UInt64.
The function also works for arrays.
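A minimal sketch of byte-based counting, assuming the query text is sent as UTF-8. The literals are arbitrary illustrations; each Cyrillic letter occupies two bytes in UTF-8:

```sql
SELECT
    length('abc') AS ascii_bytes,    -- 3: one byte per ASCII character
    length('абв') AS cyrillic_bytes; -- 6: two bytes per Cyrillic letter
```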
## lengthUTF8
Returns the length of a string in Unicode code points (not in characters), assuming that the string contains valid UTF-8 encoded text. If this assumption is not met, the result is unspecified (no exception is thrown).
The result type is UInt64.
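As an illustration (the literal is an arbitrary example, assumed to be sent as UTF-8), the byte length and the code point count of a Cyrillic string differ:

```sql
SELECT
    length('абв') AS bytes,           -- 6
    lengthUTF8('абв') AS code_points; -- 3
```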
## char_length, CHAR_LENGTH
Returns the length of a string in Unicode code points (not in characters), assuming that the string contains valid UTF-8 encoded text. If this assumption is not met, the result is unspecified (no exception is thrown).
The result type is UInt64.
## character_length, CHARACTER_LENGTH
Returns the length of a string in Unicode code points (not in characters), assuming that the string contains valid UTF-8 encoded text. If this assumption is not met, the result is unspecified (no exception is thrown).
The result type is UInt64.
## leftPad
Pads the current string from the left with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Similar to the MySQL `LPAD` function.
**Syntax**
``` sql
leftPad('string', 'length'[, 'pad_string'])
```
**Arguments**
- `string` — Input string that needs to be padded. [String](../data-types/string.md).
- `length` — The length of the resulting string. [UInt](../data-types/int-uint.md). If the value is less than the input string length, then the input string is returned as-is.
- `pad_string` — The string to pad the input string with. [String](../data-types/string.md). Optional. If not specified, then the input string is padded with spaces.
**Returned value**
- The resulting string of the given length.
Type: [String](../data-types/string.md).
**Example**
Query:
``` sql
SELECT leftPad('abc', 7, '*'), leftPad('def', 7);
```
Result:
``` text
┌─leftPad('abc', 7, '*')─┬─leftPad('def', 7)─┐
│ ****abc                │     def           │
└────────────────────────┴───────────────────┘
```
## leftPadUTF8
Pads the current string from the left with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Similar to the MySQL `LPAD` function. While in the [leftPad](#leftpad) function the length is measured in bytes, in the `leftPadUTF8` function it is measured in code points.
**Syntax**
``` sql
leftPadUTF8('string','length'[, 'pad_string'])
```
**Arguments**
- `string` — Input string that needs to be padded. [String](../data-types/string.md).
- `length` — The length of the resulting string. [UInt](../data-types/int-uint.md). If the value is less than the input string length, then the input string is returned as-is.
- `pad_string` — The string to pad the input string with. [String](../data-types/string.md). Optional. If not specified, then the input string is padded with spaces.
**Returned value**
- The resulting string of the given length.
Type: [String](../data-types/string.md).
**Example**
Query:
``` sql
SELECT leftPadUTF8('абвг', 7, '*'), leftPadUTF8('дежз', 7);
```
Result:
``` text
┌─leftPadUTF8('абвг', 7, '*')─┬─leftPadUTF8('дежз', 7)─┐
│ ***абвг                     │    дежз                │
└─────────────────────────────┴────────────────────────┘
```
## rightPad
Pads the current string from the right with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Similar to the MySQL `RPAD` function.
**Syntax**
``` sql
rightPad('string', 'length'[, 'pad_string'])
```
**Arguments**
- `string` — Input string that needs to be padded. [String](../data-types/string.md).
- `length` — The length of the resulting string. [UInt](../data-types/int-uint.md). If the value is less than the input string length, then the input string is returned as-is.
- `pad_string` — The string to pad the input string with. [String](../data-types/string.md). Optional. If not specified, then the input string is padded with spaces.
**Returned value**
- The resulting string of the given length.
Type: [String](../data-types/string.md).
**Example**
Query:
``` sql
SELECT rightPad('abc', 7, '*'), rightPad('abc', 7);
```
Result:
``` text
┌─rightPad('abc', 7, '*')─┬─rightPad('abc', 7)─┐
│ abc****                 │ abc                │
└─────────────────────────┴────────────────────┘
```
## rightPadUTF8
Pads the current string from the right with spaces or a specified string (multiple times, if needed) until the resulting string reaches the given length. Similar to the MySQL `RPAD` function. While in the [rightPad](#rightpad) function the length is measured in bytes, in the `rightPadUTF8` function it is measured in code points.
**Syntax**
``` sql
rightPadUTF8('string','length'[, 'pad_string'])
```
**Arguments**
- `string` — Input string that needs to be padded. [String](../data-types/string.md).
- `length` — The length of the resulting string. [UInt](../data-types/int-uint.md). If the value is less than the input string length, then the input string is returned as-is.
- `pad_string` — The string to pad the input string with. [String](../data-types/string.md). Optional. If not specified, then the input string is padded with spaces.
**Returned value**
- The resulting string of the given length.
Type: [String](../data-types/string.md).
**Example**
Query:
``` sql
SELECT rightPadUTF8('абвг', 7, '*'), rightPadUTF8('абвг', 7);
```
Result:
``` text
┌─rightPadUTF8('абвг', 7, '*')─┬─rightPadUTF8('абвг', 7)─┐
│ абвг***                      │ абвг                    │
└──────────────────────────────┴─────────────────────────┘
```
## lower, lcase
Converts ASCII Latin symbols in a string to lowercase.
## upper, ucase
Converts ASCII Latin symbols in a string to uppercase.
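For example, applying `lower` and `upper` to an ASCII-only string (the literal is an arbitrary illustration):

```sql
SELECT
    lower('ClickHouse') AS lowered, -- clickhouse
    upper('ClickHouse') AS uppered; -- CLICKHOUSE
```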
## lowerUTF8
Converts a string to lowercase, assuming the string contains a set of bytes that make up a UTF-8 encoded text.
It does not detect the language, so the result might not be exactly correct for some languages, such as Turkish.
If the length of the UTF-8 byte sequence is different for upper and lower case of a code point, the result may be incorrect for this code point.
If the string contains a set of bytes that is not UTF-8, then the behavior is undefined.
## upperUTF8
Converts a string to uppercase, assuming the string contains a set of bytes that make up a UTF-8 encoded text.
It does not detect the language, so the result might not be exactly correct for some languages, such as Turkish.
If the length of the UTF-8 byte sequence is different for upper and lower case of a code point, the result may be incorrect for this code point.
If the string contains a set of bytes that is not UTF-8, then the behavior is undefined.
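A short sketch of the UTF-8 aware case conversions, assuming the query text is valid UTF-8. The Cyrillic literals are arbitrary examples whose upper and lower case forms have the same byte length:

```sql
SELECT
    lowerUTF8('ПРИВЕТ') AS lowered, -- привет
    upperUTF8('привет') AS uppered; -- ПРИВЕТ
```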
## isValidUTF8
Returns 1 if the set of bytes is valid UTF-8, otherwise 0.
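As a sketch, the query below checks one valid string and one byte sequence that is not valid UTF-8 (`\xF0\x80\x80\x80` is an overlong encoding). Both literals are illustrative:

```sql
SELECT
    isValidUTF8('ClickHouse') AS valid,              -- 1
    isValidUTF8('\x61\xF0\x80\x80\x80b') AS invalid; -- 0
```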
## toValidUTF8
Replaces invalid UTF-8 characters with the `�` (U+FFFD) character. All consecutive invalid characters are collapsed into one replacement character.
``` sql
toValidUTF8(input_string)
```
**Arguments**
- `input_string` — Any set of bytes represented as the [String](../../sql-reference/data-types/string.md) data type object.
Returned value: Valid UTF-8 string.
**Example**
``` sql
SELECT toValidUTF8('\x61\xF0\x80\x80\x80b');
```
``` text
┌─toValidUTF8('a����b')─┐
│ a�b                   │
└───────────────────────┘
```
## repeat
Repeats a string as many times as specified and concatenates the replicated values as a single string.
Alias: `REPEAT`.
**Syntax**
``` sql
repeat(s, n)
```
**Arguments**
- `s` — The string to repeat. [String](../../sql-reference/data-types/string.md).
- `n` — The number of times to repeat the string. [UInt](../../sql-reference/data-types/int-uint.md).
**Returned value**
A single string containing `s` repeated `n` times. If `n` \< 1, the function returns an empty string.
Type: `String`.
**Example**
Query:
``` sql
SELECT repeat('abc', 10);
```
Result:
``` text
┌─repeat('abc', 10)──────────────┐
│ abcabcabcabcabcabcabcabcabcabc │
└────────────────────────────────┘
```
## reverse
Reverses the string (as a sequence of bytes).
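A minimal example with an ASCII literal (chosen arbitrarily). Because the function works on bytes, applying it to multi-byte UTF-8 text also reverses the bytes inside each code point, which generally does not produce valid UTF-8:

```sql
SELECT reverse('Hello') AS reversed; -- olleH
```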
## reverseUTF8
Reverses a sequence of Unicode code points, assuming that the string contains a set of bytes representing a UTF-8 text. If this assumption is not met, the result is unspecified (no exception is thrown).
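For instance, with an arbitrary Cyrillic literal (assumed to be sent as valid UTF-8), the code points are reversed while each one stays intact:

```sql
SELECT reverseUTF8('привет') AS reversed; -- тевирп
```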
## format(pattern, s0, s1, …)
Formats a constant pattern with the strings listed in the arguments. `pattern` is a simplified Python format pattern. The format string contains “replacement fields” surrounded by curly braces `{}`. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output. If you need to include a brace character in the literal text, it can be escaped by doubling: `{{` and `}}`. Field names can be numbers (starting from zero) or empty (then they are treated as consecutive numbers).
``` sql
SELECT format('{1} {0} {1}', 'World', 'Hello')
```
``` text
┌─format('{1} {0} {1}', 'World', 'Hello')─┐
│ Hello World Hello                       │
└─────────────────────────────────────────┘
```
``` sql
SELECT format('{} {}', 'Hello', 'World')
```
``` text
┌─format('{} {}', 'Hello', 'World')─┐
│ Hello World                       │
└───────────────────────────────────┘
```
## concat
Concatenates the strings listed in the arguments, without a separator.
**Syntax**
``` sql
concat(s1, s2, ...)
```
**Arguments**
Values of type String or FixedString.
**Returned values**
Returns the String that results from concatenating the arguments.
If any of the argument values is `NULL`, `concat` returns `NULL`.
**Example**
Query:
``` sql
SELECT concat('Hello, ', 'World!');
```
Result:
``` text
┌─concat('Hello, ', 'World!')─┐
│ Hello, World!               │
└─────────────────────────────┘
```
## concatAssumeInjective
Same as [concat](#concat), except that you must ensure that `concat(s1, s2, ...) → sn` is injective. It is used to optimize GROUP BY.
A function is called “injective” if it always returns different results for different argument values. In other words, different arguments never yield an identical result.
**Syntax**
``` sql
concatAssumeInjective(s1, s2, ...)
```
**Arguments**
Values of type String or FixedString.
**Returned values**
Returns the String that results from concatenating the arguments.
If any of the argument values is `NULL`, `concatAssumeInjective` returns `NULL`.
**Example**
Input table:
``` sql
CREATE TABLE key_val(`key1` String, `key2` String, `value` UInt32) ENGINE = TinyLog;
INSERT INTO key_val VALUES ('Hello, ','World',1), ('Hello, ','World',2), ('Hello, ','World!',3), ('Hello',', World!',2);
SELECT * FROM key_val;
```
``` text
┌─key1────┬─key2─────┬─value─┐
│ Hello,  │ World    │     1 │
│ Hello,  │ World    │     2 │
│ Hello,  │ World!   │     3 │
│ Hello   │ , World! │     2 │
└─────────┴──────────┴───────┘
```
Query:
``` sql
SELECT concat(key1, key2), sum(value) FROM key_val GROUP BY concatAssumeInjective(key1, key2);
```
Result:
``` text
┌─concat(key1, key2)─┬─sum(value)─┐
│ Hello, World!      │          3 │
│ Hello, World!      │          2 │
│ Hello, World       │          3 │
└────────────────────┴────────────┘
```
## substring(s, offset, length), mid(s, offset, length), substr(s, offset, length)
Returns a substring that is `length` bytes long, starting from the byte at index `offset`. Indexing starts from one (as in standard SQL).
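A small sketch of one-based, byte-oriented indexing on an ASCII literal (chosen arbitrarily, so bytes and characters coincide):

```sql
SELECT
    substring('Hello, world!', 1, 5) AS first_five, -- Hello
    substring('Hello, world!', 8, 5) AS from_eight; -- world
```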
## substringUTF8(s, offset, length)
The same as substring, but for Unicode code points. Works under the assumption that the string contains a set of bytes representing a UTF-8 encoded text. If this assumption is not met, the result is unspecified (no exception is thrown).
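To illustrate the difference from the byte-based variant, the sketch below takes six units from the same Cyrillic literal (an arbitrary example, assumed to be sent as UTF-8, where each letter occupies two bytes):

```sql
SELECT
    substringUTF8('Привет, мир!', 1, 6) AS six_code_points, -- Привет
    substring('Привет, мир!', 1, 6) AS six_bytes;           -- При
```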
## appendTrailingCharIfAbsent(s, c)
If the string `s` is non-empty and does not end with the character `c`, appends `c` to the end.
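The three cases from the description can be sketched as follows; the path literals are hypothetical examples:

```sql
SELECT
    appendTrailingCharIfAbsent('/var/log', '/') AS appended,         -- /var/log/
    appendTrailingCharIfAbsent('/var/log/', '/') AS already_present, -- /var/log/
    appendTrailingCharIfAbsent('', '/') AS empty_input;              -- empty string
```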
## convertCharset(s, from, to)
Returns the string `s` converted from the encoding `from` to the encoding `to`.
## base64Encode(s)
Encodes the string `s` into base64.
Alias: `TO_BASE64`.
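For example, with an arbitrary input literal:

```sql
SELECT base64Encode('clickhouse') AS encoded; -- Y2xpY2tob3VzZQ==
```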
## base64Decode(s)
Decodes the base64-encoded string `s` into the original string. Raises an exception in case of failure.
Alias: `FROM_BASE64`.
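For example, decoding an illustrative base64 literal back into the original bytes:

```sql
SELECT base64Decode('Y2xpY2tob3VzZQ==') AS decoded; -- clickhouse
```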
## tryBase64Decode(s)
Similar to base64Decode, but returns an empty string in case of error.
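A sketch contrasting valid and invalid input. The second literal is a made-up string containing characters outside the base64 alphabet, so it is expected to fail decoding:

```sql
SELECT
    tryBase64Decode('Y2xpY2tob3VzZQ==') AS ok,  -- clickhouse
    tryBase64Decode('not base64!') AS on_error; -- empty string
```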
## endsWith(s, suffix)
Checks whether the string ends with the specified suffix. Returns 1 if it does, otherwise 0.
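For example (the literals are arbitrary; the comparison is byte-wise and therefore case-sensitive):

```sql
SELECT
    endsWith('Hello, world!', 'world!') AS matches,   -- 1
    endsWith('Hello, world!', 'World!') AS different; -- 0
```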
## startsWith(str, prefix)
Returns 1 if the string starts with the specified prefix, otherwise 0.
``` sql
SELECT startsWith('Spider-Man', 'Spi');
```
**Returned values**
- 1, if the string starts with the specified prefix.
- 0, if the string does not start with the specified prefix.
**Example**
Query:
``` sql
SELECT startsWith('Hello, world!', 'He');
```
Result:
``` text
┌─startsWith('Hello, world!', 'He')─┐
│                                 1 │
└───────────────────────────────────┘
```
## trim
Removes all specified characters from the start or end of a string.
By default removes all consecutive occurrences of common whitespace (ASCII character 32) from both ends of a string.
**Syntax**
``` sql
trim([[LEADING|TRAILING|BOTH] trim_character FROM] input_string)
```
**Arguments**
- `trim_character` — Specified characters for trim. [String](../../sql-reference/data-types/string.md).
- `input_string` — String for trim. [String](../../sql-reference/data-types/string.md).
**Returned value**
A string without leading and (or) trailing specified characters.
Type: `String`.
**Example**
Query:
``` sql
SELECT trim(BOTH ' ()' FROM '(   Hello, world!   )');
```
Result:
``` text
┌─trim(BOTH ' ()' FROM '(   Hello, world!   )')─┐
│ Hello, world!                                 │
└───────────────────────────────────────────────┘
```
## trimLeft
Removes all consecutive occurrences of common whitespace (ASCII character 32) from the beginning of a string. It does not remove other kinds of whitespace characters (tab, no-break space, etc.).
**Syntax**
``` sql
trimLeft(input_string)
```
Alias: `ltrim(input_string)`.
**Arguments**
- `input_string` — string to trim. [String](../../sql-reference/data-types/string.md).
**Returned value**
A string without leading common whitespaces.
Type: `String`.
**Example**
Query:
``` sql
SELECT trimLeft('     Hello, world!     ');
```
Result:
``` text
┌─trimLeft('     Hello, world!     ')─┐
│ Hello, world!                       │
└─────────────────────────────────────┘
```
## trimRight
Removes all consecutive occurrences of common whitespace (ASCII character 32) from the end of a string. It does not remove other kinds of whitespace characters (tab, no-break space, etc.).
**Syntax**
``` sql
trimRight(input_string)
```
Alias: `rtrim(input_string)`.
**Arguments**
- `input_string` — string to trim. [String](../../sql-reference/data-types/string.md).
**Returned value**
A string without trailing common whitespaces.
Type: `String`.
**Example**
Query:
``` sql
SELECT trimRight('     Hello, world!     ');
```
Result:
``` text
┌─trimRight('     Hello, world!     ')─┐
│      Hello, world!                   │
└──────────────────────────────────────┘
```
## trimBoth
Removes all consecutive occurrences of common whitespace (ASCII character 32) from both ends of a string. It does not remove other kinds of whitespace characters (tab, no-break space, etc.).
**Syntax**
``` sql
trimBoth(input_string)
```
Alias: `trim(input_string)`.
**Arguments**
- `input_string` — string to trim. [String](../../sql-reference/data-types/string.md).
**Returned value**
A string without leading and trailing common whitespaces.
Type: `String`.
**Example**
Query:
``` sql
SELECT trimBoth('     Hello, world!     ');
```
Result:
``` text
┌─trimBoth('     Hello, world!     ')─┐
│ Hello, world!                       │
└─────────────────────────────────────┘
```
## CRC32(s)
Returns the CRC32 checksum of a string, using CRC-32-IEEE 802.3 polynomial and initial value `0xffffffff` (zlib implementation).
The result type is UInt32.
## CRC32IEEE(s)
Returns the CRC32 checksum of a string, using CRC-32-IEEE 802.3 polynomial.
The result type is UInt32.
## CRC64(s)
Returns the CRC64 checksum of a string, using CRC-64-ECMA polynomial.
The result type is UInt64.
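The three checksum functions can be compared side by side on the same input. The literal `'123456789'` is chosen here only because it is the conventional CRC check string. zlib's `crc32` yields `0xCBF43926` (3421780262) for it, so `CRC32` is expected to return that value if it matches the zlib implementation as described above; no expected values are given for the other two:

```sql
SELECT
    CRC32('123456789') AS crc32,
    CRC32IEEE('123456789') AS crc32_ieee,
    CRC64('123456789') AS crc64;
```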
## normalizeQuery
Replaces literals, sequences of literals and complex aliases with placeholders.
**Syntax**
``` sql
normalizeQuery(x)
```
**Arguments**
- `x` — Sequence of characters. [String](../../sql-reference/data-types/string.md).
**Returned value**
- Sequence of characters with placeholders.
Type: [String](../../sql-reference/data-types/string.md).
**Example**
Query:
``` sql
SELECT normalizeQuery('[1, 2, 3, x]') AS query;
```
Result:
``` text
┌─query────┐
│ [?.., x] │
└──────────┘
```
## normalizedQueryHash
Returns identical 64-bit hash values for similar queries, ignoring the values of literals. It helps to analyze the query log.
**Syntax**
``` sql
normalizedQueryHash(x)
```
**Arguments**
- `x` — Sequence of characters. [String](../../sql-reference/data-types/string.md).
**Returned value**
- Hash value.
Type: [UInt64](../../sql-reference/data-types/int-uint.md#uint-ranges).
**Example**
Query:
``` sql
SELECT normalizedQueryHash('SELECT 1 AS `xyz`') != normalizedQueryHash('SELECT 1 AS `abc`') AS res;
```
Result:
``` text
┌─res─┐
│   1 │
└─────┘
```
## normalizeUTF8NFC
2021-10-19 22:31:39 +00:00
Converts a string to [NFC normalized form](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms), assuming the string contains a set of bytes that make up a UTF-8 encoded text.
**Syntax**
``` sql
normalizeUTF8NFC(words)
2021-10-19 22:31:39 +00:00
```
**Arguments**
- `words` — Input string that contains UTF-8 encoded text. [String](../../sql-reference/data-types/string.md).
**Returned value**
- String transformed to NFC normalization form.
Type: [String](../../sql-reference/data-types/string.md).
**Example**
Query:
``` sql
SELECT length('â'), normalizeUTF8NFC('â') AS nfc, length(nfc) AS nfc_len;
```
Result:
``` text
┌─length('â')─┬─nfc─┬─nfc_len─┐
│ 2 │ â │ 2 │
└─────────────┴─────┴─────────┘
```
## normalizeUTF8NFD
Converts a string to [NFD normalized form](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms), assuming the string contains a set of bytes that make up a UTF-8 encoded text.
**Syntax**
``` sql
normalizeUTF8NFD(words)
```
**Arguments**
- `words` — Input string that contains UTF-8 encoded text. [String](../../sql-reference/data-types/string.md).
**Returned value**
- String transformed to NFD normalization form.
Type: [String](../../sql-reference/data-types/string.md).
**Example**
Query:
``` sql
SELECT length('â'), normalizeUTF8NFD('â') AS nfd, length(nfd) AS nfd_len;
```
Result:
``` text
┌─length('â')─┬─nfd─┬─nfd_len─┐
│ 2 │ â │ 3 │
└─────────────┴─────┴─────────┘
```
## normalizeUTF8NFKC
Converts a string to [NFKC normalized form](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms), assuming the string contains a set of bytes that make up a UTF-8 encoded text.
**Syntax**
``` sql
normalizeUTF8NFKC(words)
```
**Arguments**
- `words` — Input string that contains UTF-8 encoded text. [String](../../sql-reference/data-types/string.md).
**Returned value**
- String transformed to NFKC normalization form.
Type: [String](../../sql-reference/data-types/string.md).
**Example**
Query:
``` sql
SELECT length('â'), normalizeUTF8NFKC('â') AS nfkc, length(nfkc) AS nfkc_len;
```
Result:
``` text
┌─length('â')─┬─nfkc─┬─nfkc_len─┐
│ 2 │ â │ 2 │
└─────────────┴──────┴──────────┘
```
## normalizeUTF8NFKD
Converts a string to [NFKD normalized form](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms), assuming the string contains a set of bytes that make up a UTF-8 encoded text.
**Syntax**
``` sql
normalizeUTF8NFKD(words)
```
**Arguments**
- `words` — Input string that contains UTF-8 encoded text. [String](../../sql-reference/data-types/string.md).
**Returned value**
- String transformed to NFKD normalization form.
Type: [String](../../sql-reference/data-types/string.md).
**Example**
Query:
``` sql
SELECT length('â'), normalizeUTF8NFKD('â') AS nfkd, length(nfkd) AS nfkd_len;
```
Result:
``` text
┌─length('â')─┬─nfkd─┬─nfkd_len─┐
│ 2 │ â │ 3 │
└─────────────┴──────┴──────────┘
```
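
The four forms differ most visibly on compatibility characters. The sketch below uses the ligature `ﬁ` (U+FB01) as an arbitrary test value. Under the Unicode rules, the canonical forms (NFC, NFD) leave this character unchanged, while the compatibility forms (NFKC, NFKD) are expected to expand it to the two letters `fi`.

``` sql
SELECT
    'ﬁ' AS s,                               -- single ligature character U+FB01
    normalizeUTF8NFC(s) = s AS nfc_unchanged, -- canonical form: no compatibility mapping applied
    normalizeUTF8NFKC(s) AS nfkc,            -- compatibility form, composed
    normalizeUTF8NFKD(s) AS nfkd,            -- compatibility form, decomposed
    length(s) AS bytes_before,
    length(nfkc) AS bytes_after;
```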
## encodeXMLComponent
Escapes characters so that a string can be placed into an XML text node or attribute.

The following five characters, which correspond to the XML predefined entities, are replaced: `<`, `&`, `>`, `"`, `'`.
**Syntax**
``` sql
encodeXMLComponent(x)
```
**Arguments**
- `x` — The sequence of characters. [String](../../sql-reference/data-types/string.md).
**Returned value**
- The sequence of characters with the special characters escaped.
Type: [String](../../sql-reference/data-types/string.md).
**Example**
Query:
``` sql
SELECT encodeXMLComponent('Hello, "world"!');
SELECT encodeXMLComponent('<123>');
SELECT encodeXMLComponent('&clickhouse');
SELECT encodeXMLComponent('\'foo\'');
```
Result:
``` text
Hello, &quot;world&quot;!
&lt;123&gt;
&amp;clickhouse
&apos;foo&apos;
```
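
A typical use is assembling an XML fragment from column values. The following is only a sketch: the element name `user`, the column names `name` and `bio`, and their values are hypothetical.

``` sql
SELECT concat(
    '<user name="', encodeXMLComponent(name), '">', -- escaped value is safe inside an attribute
    encodeXMLComponent(bio),                        -- escaped value is safe inside a text node
    '</user>'
) AS xml
FROM
(
    -- Hypothetical input row containing all five special characters.
    SELECT 'Tom "T" O\'Neil' AS name, 'Likes C++ & SQL <3>' AS bio
);
```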
## decodeXMLComponent
Replaces XML predefined entities with characters. The predefined entities are `&quot;`, `&amp;`, `&apos;`, `&gt;`, and `&lt;`.
This function also replaces numeric character references with Unicode characters. Both decimal (like `&#10003;`) and hexadecimal (like `&#x2713;`) forms are supported.
**Syntax**
``` sql
decodeXMLComponent(x)
```
**Arguments**
- `x` — A sequence of characters. [String](../../sql-reference/data-types/string.md).
**Returned value**
- The sequence of characters after replacement.
Type: [String](../../sql-reference/data-types/string.md).
**Example**
Query:
``` sql
SELECT decodeXMLComponent('&apos;foo&apos;');
SELECT decodeXMLComponent('&lt; &#x3A3; &gt;');
```
Result:
``` text
'foo'
< Σ >
```
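
Since this function reverses the replacements made by `encodeXMLComponent`, a round trip should give back the original string. The sketch below checks this on one arbitrary sample value; it is an illustration, not a proof for all inputs.

``` sql
WITH '<a href="x">Tom & Jerry\'s</a>' AS s -- arbitrary sample containing all five special characters
SELECT
    encodeXMLComponent(s) AS encoded,
    decodeXMLComponent(encoded) AS decoded,
    decoded = s AS round_trip_ok;
```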
**See Also**
- [List of XML and HTML character entity references](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references)
## extractTextFromHTML
A function to extract text from HTML or XHTML.
It does not necessarily conform 100% to the HTML, XML, or XHTML standards, but the implementation is reasonably accurate and fast. The rules are the following:

1. Comments are skipped. Example: `<!-- test -->`. A comment must end with `-->`. Nested comments are not possible.
    Note: constructions like `<!-->` and `<!--->` are not valid comments in HTML, but they are skipped by other rules.
2. CDATA is pasted verbatim. Note: CDATA is specific to XML/XHTML, but it is processed on a best-effort basis.
3. `script` and `style` elements are removed with all their content. Note: it is assumed that a closing tag cannot appear inside the content. For example, in a JS string literal it has to be escaped like `"<\/script>"`.
    Note: comments and CDATA are possible inside `script` or `style`; in that case closing tags are not searched for inside CDATA. Example: `<script><![CDATA[</script>]]></script>`. But they are still searched for inside comments. Sometimes it becomes complicated: `<script>var x = "<!--"; </script> var y = "-->"; alert(x + y);</script>`
    Note: `script` and `style` can be the names of XML namespaces; in that case they are not treated like usual `script` or `style` elements. Example: `<script:a>Hello</script:a>`.
    Note: whitespace is possible after the closing tag name: `</script >`, but not before it: `< / script>`.
4. Other tags or tag-like elements are skipped without their inner content. Example: `<a>.</a>`
    Note: it is expected that this HTML is illegal: `<a test=">"></a>`
    Note: it also skips tag-like constructions such as `<>`, `<!>`, etc.
    Note: a tag without an end is skipped to the end of the input: `<hello `
5. HTML and XML entities are not decoded. They must be processed by a separate function.
6. Whitespace in the text is collapsed or inserted according to specific rules.
    - Whitespace at the beginning and at the end is removed.
    - Consecutive whitespace characters are collapsed.
    - But if the text is separated by other elements and there is no whitespace, it is inserted.
    - This may produce unnatural results: `Hello<b>world</b>`, `Hello<!-- -->world`. There is no whitespace in the HTML, but the function inserts it. Also consider: `Hello<p>world</p>`, `Hello<br>world`. This behavior is reasonable for data analysis, e.g. to convert HTML to a bag of words.
7. Also note that correct handling of whitespace requires support for `<pre></pre>` and the CSS `display` and `white-space` properties.

**Syntax**
``` sql
extractTextFromHTML(x)
```
**Arguments**
- `x` — Input text. [String](../../sql-reference/data-types/string.md).
**Returned value**
- Extracted text.
Type: [String](../../sql-reference/data-types/string.md).
**Example**
The first example contains several tags and a comment, and also shows whitespace processing.
The second example shows `CDATA` and `script` tag processing.
In the third example, text is extracted from the full HTML response received by the [url](../../sql-reference/table-functions/url.md) function.
Query:
``` sql
SELECT extractTextFromHTML(' <p> A text <i>with</i><b>tags</b>. <!-- comments --> </p> ');
SELECT extractTextFromHTML('<![CDATA[The content within <b>CDATA</b>]]> <script>alert("Script");</script>');
SELECT extractTextFromHTML(html) FROM url('http://www.donothingfor2minutes.com/', RawBLOB, 'html String');
```
Result:
``` text
A text with tags .
The content within <b>CDATA</b>
Do Nothing for 2 Minutes 2:00 &nbsp;
```
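
Rules 5 and 6 above can be explored with a short sketch. The first column uses an input from the rules to show whitespace insertion between adjacent elements. The second column assumes that `decodeXMLComponent` is an acceptable separate function for decoding the entities left in the extracted text; note that it handles only the XML predefined entities and numeric character references, so HTML-specific entities such as `&nbsp;` would stay as they are.

``` sql
SELECT
    -- Rule 6: no whitespace in the input, but the function is described as inserting it.
    extractTextFromHTML('Hello<b>world</b>') AS spaced,
    -- Rule 5: entities are left as is, so they are decoded in a second step.
    decodeXMLComponent(extractTextFromHTML('<p>Tom &amp; Jerry</p>')) AS decoded;
```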