Merge remote-tracking branch 'ClickHouse/master' into change_date

2024-11-21 15:12:02 +00:00 · 2024-07-08 21:10:06 +00:00 · 2024-07-08 21:10:06 +00:00 · ef0fa20de3
commit ef0fa20de3
parent db1817a633 9e9862066f
3 changed files with 191 additions and 50 deletions
--- a/docs/en/operations/settings/settings.md
+++ b/docs/en/operations/settings/settings.md
@ -1170,6 +1170,10 @@ Data in the VALUES clause of INSERT queries is processed by a separate stream pa

 Default value: 262144 (= 256 KiB).

+:::note
+`max_query_size` cannot be set within an SQL query (e.g., `SELECT now() SETTINGS max_query_size=10000`) because ClickHouse needs to allocate a buffer to parse the query, and this buffer size is determined by the `max_query_size` setting, which must be configured before the query is executed.
+:::
+
 ## max_parser_depth {#max_parser_depth}

 Limits maximum recursion depth in the recursive descent parser. Allows controlling the stack size.
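
The ordering constraint described in the `max_query_size` note above can be pictured with a toy model. The Python sketch below is not ClickHouse code: `buffer_query`, `settings_clause`, and `run` are hypothetical helpers, and the regular expression only stands in for real parsing. It assumes only what the note states, namely that the size limit applies while the query text is buffered, before any `SETTINGS` clause has been parsed.

```python
import re

DEFAULT_MAX_QUERY_SIZE = 262144  # default quoted in the docs above (256 KiB)


def buffer_query(raw: str, max_query_size: int) -> str:
    """Hypothetical stand-in for reading query text into a size-limited buffer."""
    if len(raw.encode("utf-8")) > max_query_size:
        raise ValueError("query exceeds max_query_size")
    return raw


def settings_clause(query: str) -> dict:
    """Toy extraction of `SETTINGS name=value` pairs; only possible after buffering."""
    match = re.search(r"\bSETTINGS\s+(.*)$", query, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return {}
    pairs = re.findall(r"(\w+)\s*=\s*(\d+)", match.group(1))
    return {name: int(value) for name, value in pairs}


def run(raw: str, session_limit: int = DEFAULT_MAX_QUERY_SIZE) -> dict:
    # The limit in force is the one configured before the query arrived.
    query = buffer_query(raw, session_limit)
    # Query-level settings become visible only now, after the size check.
    return settings_clause(query)
```

In this model, a query that tries to raise its own limit through `SETTINGS max_query_size=...` is still rejected whenever it exceeds `session_limit`, because the clause is read only after the size check has passed.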
--- a/docs/en/sql-reference/functions/string-functions.md
+++ b/docs/en/sql-reference/functions/string-functions.md
@ -12,9 +12,7 @@ Functions for [searching](string-search-functions.md) in strings and for [replac

 ## empty

-Checks whether the input string is empty.
-
-A string is considered non-empty if it contains at least one byte, even if this byte is a space or the null byte.
+Checks whether the input string is empty. A string is considered non-empty if it contains at least one byte, even if this byte is a space or the null byte.

 The function is also available for [arrays](array-functions.md#function-empty) and [UUIDs](uuid-functions.md#empty).

@ -48,9 +46,7 @@ Result:

 ## notEmpty

-Checks whether the input string is non-empty.
-
-A string is considered non-empty if it contains at least one byte, even if this byte is a space or the null byte.
+Checks whether the input string is non-empty. A string is considered non-empty if it contains at least one byte, even if this byte is a space or the null byte.

 The function is also available for [arrays](array-functions.md#function-notempty) and [UUIDs](uuid-functions.md#notempty).

@ -96,7 +92,7 @@ length(s)

 **Parameters**

- `s`: An input string or array. [String](../data-types/string)/[Array](../data-types/array).
+- `s` — An input string or array. [String](../data-types/string)/[Array](../data-types/array).

 **Returned value**

@ -149,7 +145,7 @@ lengthUTF8(s)

 **Parameters**

- `s`: String containing valid UTF-8 encoded text. [String](../data-types/string).
+- `s` — String containing valid UTF-8 encoded text. [String](../data-types/string).

 **Returned value**

@ -183,8 +179,8 @@ left(s, offset)

 **Parameters**

- `s`: The string to calculate a substring from. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `offset`: The number of bytes of the offset. [UInt*](../data-types/int-uint).
+- `s` — The string to calculate a substring from. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
+- `offset` — The number of bytes of the offset. [UInt*](../data-types/int-uint).

 **Returned value**

@ -230,8 +226,8 @@ leftUTF8(s, offset)

 **Parameters**

- `s`: The UTF-8 encoded string to calculate a substring from. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `offset`: The number of bytes of the offset. [UInt*](../data-types/int-uint).
+- `s` — The UTF-8 encoded string to calculate a substring from. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
+- `offset` — The number of bytes of the offset. [UInt*](../data-types/int-uint).

 **Returned value**

@ -347,8 +343,8 @@ right(s, offset)

 **Parameters**

- `s`: The string to calculate a substring from. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `offset`: The number of bytes of the offset. [UInt*](../data-types/int-uint).
+- `s` — The string to calculate a substring from. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
+- `offset` — The number of bytes of the offset. [UInt*](../data-types/int-uint).

 **Returned value**

@ -394,8 +390,8 @@ rightUTF8(s, offset)

 **Parameters**

- `s`: The UTF-8 encoded string to calculate a substring from. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
- `offset`: The number of bytes of the offset. [UInt*](../data-types/int-uint).
+- `s` — The UTF-8 encoded string to calculate a substring from. [String](../data-types/string.md) or [FixedString](../data-types/fixedstring.md).
+- `offset` — The number of bytes of the offset. [UInt*](../data-types/int-uint).

 **Returned value**

@ -547,7 +543,7 @@ Alias: `ucase`

 **Parameters**

- `input`: A string type [String](../data-types/string.md).
+- `input` — A string type [String](../data-types/string.md).

 **Returned value**

@ -571,16 +567,47 @@ SELECT upper('clickhouse');

 Converts a string to lowercase, assuming that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.

-Does not detect the language, e.g. for Turkish the result might not be exactly correct (i/İ vs. i/I).
+:::note
+Does not detect the language, e.g. for Turkish the result might not be exactly correct (i/İ vs. i/I). If the length of the UTF-8 byte sequence is different for upper and lower case of a code point (such as `ẞ` and `ß`), the result may be incorrect for this code point.
+:::

-If the length of the UTF-8 byte sequence is different for upper and lower case of a code point, the result may be incorrect for this code point.
+**Syntax**
+
+```sql
+lowerUTF8(input)
+```
+
+**Parameters**
+
+- `input` — A string type [String](../data-types/string.md).
+
+**Returned value**
+
+- A [String](../data-types/string.md) data type value.
+
+**Example**
+
+Query:
+
+``` sql
+SELECT lowerUTF8('MÜNCHEN') as Lowerutf8;
+```
+
+Result:
+
+``` response
+┌─Lowerutf8─┐
+│ münchen   │
+└───────────┘
+```

 ## upperUTF8

 Converts a string to uppercase, assuming that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.

-
-If the length of the UTF-8 byte sequence is different for upper and lower case of a code point, the result may be incorrect for this code point.
+:::note
+Does not detect the language, e.g. for Turkish the result might not be exactly correct (i/İ vs. i/I). If the length of the UTF-8 byte sequence is different for upper and lower case of a code point (such as `ẞ` and `ß`), the result may be incorrect for this code point.
+:::

 **Syntax**

@ -590,7 +617,7 @@ upperUTF8(input)

 **Parameters**

- `input`: A string type [String](../data-types/string.md).
+- `input` — A string type [String](../data-types/string.md).

 **Returned value**

@ -604,6 +631,8 @@ Query:
 SELECT upperUTF8('München') as Upperutf8;
 ```

+Result:
+
 ``` response
 ┌─Upperutf8─┐
 │ MÜNCHEN   │
@ -614,6 +643,34 @@ SELECT upperUTF8('München') as Upperutf8;

 Returns 1, if the set of bytes constitutes valid UTF-8-encoded text, otherwise 0.

+**Syntax**
+
+``` sql
+isValidUTF8(input)
+```
+
+**Parameters**
+
+- `input` — A string type [String](../data-types/string.md).
+
+**Returned value**
+
+- Returns `1`, if the set of bytes constitutes valid UTF-8-encoded text, otherwise `0`.
+
+Query:
+
+``` sql
+SELECT isValidUTF8('\xc3\xb1') AS valid, isValidUTF8('\xc3\x28') AS invalid;
+```
+
+Result:
+
+``` response
+┌─valid─┬─invalid─┐
+│     1 │       0 │
+└───────┴─────────┘
+```
+
 ## toValidUTF8

 Replaces invalid UTF-8 characters by the `�` (U+FFFD) character. All running in a row invalid characters are collapsed into the one replacement character.
@ -883,7 +940,7 @@ Returns the substring of a string `s` which starts at the specified byte index `
 substring(s, offset[, length])
 ```

-Alias:
+Aliases:
 - `substr`
 - `mid`
 - `byteSlice`
@ -926,9 +983,9 @@ substringUTF8(s, offset[, length])

 **Arguments**

- `s`: The string to calculate a substring from. [String](../data-types/string.md), [FixedString](../data-types/fixedstring.md) or [Enum](../data-types/enum.md)
- `offset`: The starting position of the substring in `s` . [(U)Int*](../data-types/int-uint.md).
- `length`: The maximum length of the substring. [(U)Int*](../data-types/int-uint.md). Optional.
+- `s` — The string to calculate a substring from. [String](../data-types/string.md), [FixedString](../data-types/fixedstring.md) or [Enum](../data-types/enum.md)
+- `offset` — The starting position of the substring in `s` . [(U)Int*](../data-types/int-uint.md).
+- `length` — The maximum length of the substring. [(U)Int*](../data-types/int-uint.md). Optional.

 **Returned value**

@ -964,9 +1021,9 @@ Alias: `SUBSTRING_INDEX`

 **Arguments**

- s: The string to extract substring from. [String](../data-types/string.md).
- delim: The character to split. [String](../data-types/string.md).
- count: The number of occurrences of the delimiter to count before extracting the substring. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. [UInt or Int](../data-types/int-uint.md)
+- s — The string to extract substring from. [String](../data-types/string.md).
+- delim — The character to split. [String](../data-types/string.md).
+- count — The number of occurrences of the delimiter to count before extracting the substring. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. [UInt or Int](../data-types/int-uint.md)

 **Example**

@ -995,9 +1052,9 @@ substringIndexUTF8(s, delim, count)

 **Arguments**

- `s`: The string to extract substring from. [String](../data-types/string.md).
- `delim`: The character to split. [String](../data-types/string.md).
- `count`: The number of occurrences of the delimiter to count before extracting the substring. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. [UInt or Int](../data-types/int-uint.md)
+- `s` — The string to extract substring from. [String](../data-types/string.md).
+- `delim` — The character to split. [String](../data-types/string.md).
+- `count` — The number of occurrences of the delimiter to count before extracting the substring. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. [UInt or Int](../data-types/int-uint.md)

 **Returned value**

@ -1277,7 +1334,7 @@ tryBase64Decode(encoded)

 **Arguments**

- `encoded`: [String](../data-types/string.md) column or constant. If the string is not a valid Base64-encoded value, returns an empty string.
+- `encoded` — [String](../data-types/string.md) column or constant. If the string is not a valid Base64-encoded value, returns an empty string.

 **Returned value**

@ -1309,7 +1366,7 @@ tryBase64URLDecode(encodedUrl)

 **Parameters**

- `encodedURL`: [String](../data-types/string.md) column or constant. If the string is not a valid Base64-encoded value with URL-specific modifications, returns an empty string.
+- `encodedURL` — [String](../data-types/string.md) column or constant. If the string is not a valid Base64-encoded value with URL-specific modifications, returns an empty string.

 **Returned value**

@ -2011,7 +2068,7 @@ soundex(val)

 **Arguments**

- `val` - Input value. [String](../data-types/string.md)
+- `val` — Input value. [String](../data-types/string.md)

 **Returned value**

@ -2044,7 +2101,7 @@ punycodeEncode(val)

 **Arguments**

- `val` - Input value. [String](../data-types/string.md)
+- `val` — Input value. [String](../data-types/string.md)

 **Returned value**

@ -2077,7 +2134,7 @@ punycodeEncode(val)

 **Arguments**

- `val` - Punycode-encoded string. [String](../data-types/string.md)
+- `val` — Punycode-encoded string. [String](../data-types/string.md)

 **Returned value**

@ -2103,7 +2160,7 @@ Like `punycodeDecode` but returns an empty string if no valid Punycode-encoded s

 ## idnaEncode

-Returns the the ASCII representation (ToASCII algorithm) of a domain name according to the [Internationalized Domain Names in Applications](https://en.wikipedia.org/wiki/Internationalized_domain_name#Internationalizing_Domain_Names_in_Applications) (IDNA) mechanism.
+Returns the ASCII representation (ToASCII algorithm) of a domain name according to the [Internationalized Domain Names in Applications](https://en.wikipedia.org/wiki/Internationalized_domain_name#Internationalizing_Domain_Names_in_Applications) (IDNA) mechanism.
 The input string must be UTF-encoded and translatable to an ASCII string, otherwise an exception is thrown.
 Note: No percent decoding or trimming of tabs, spaces or control characters is performed.

@ -2115,7 +2172,7 @@ idnaEncode(val)

 **Arguments**

- `val` - Input value. [String](../data-types/string.md)
+- `val` — Input value. [String](../data-types/string.md)

 **Returned value**

@ -2141,7 +2198,7 @@ Like `idnaEncode` but returns an empty string in case of an error instead of thr

 ## idnaDecode

-Returns the the Unicode (UTF-8) representation (ToUnicode algorithm) of a domain name according to the [Internationalized Domain Names in Applications](https://en.wikipedia.org/wiki/Internationalized_domain_name#Internationalizing_Domain_Names_in_Applications) (IDNA) mechanism.
+Returns the Unicode (UTF-8) representation (ToUnicode algorithm) of a domain name according to the [Internationalized Domain Names in Applications](https://en.wikipedia.org/wiki/Internationalized_domain_name#Internationalizing_Domain_Names_in_Applications) (IDNA) mechanism.
 In case of an error (e.g. because the input is invalid), the input string is returned.
 Note that repeated application of `idnaEncode()` and `idnaDecode()` does not necessarily return the original string due to case normalization.

@ -2153,7 +2210,7 @@ idnaDecode(val)

 **Arguments**

- `val` - Input value. [String](../data-types/string.md)
+- `val` — Input value. [String](../data-types/string.md)

 **Returned value**

@ -2197,7 +2254,7 @@ Result:
 └───────────────────────────────────────────┘
 ```

-Alias: mismatches
+Alias: `mismatches`

 ## stringJaccardIndex

@ -2251,7 +2308,7 @@ Result:
 └─────────────────────────────────────┘
 ```

-Alias: levenshteinDistance
+Alias: `levenshteinDistance`

 ## editDistanceUTF8

@ -2277,7 +2334,7 @@ Result:
 └─────────────────────────────────────┘
 ```

-Alias: levenshteinDistanceUTF8
+Alias: `levenshteinDistanceUTF8`

 ## damerauLevenshteinDistance

@ -2355,13 +2412,93 @@ Result:

 Convert the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.

+:::note
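+Because `initCap` converts the first letter of each word to upper case and the rest to lower case, and treats every non-alphanumeric character (including an apostrophe) as a word separator, you may observe unexpected behaviour for words containing apostrophes or capital letters. For example: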
+
+```sql
+SELECT initCap('mother''s daughter'), initCap('joe McAdam');
+```
+
+will return
+
+```response
+┌─initCap('mother\'s daughter')─┬─initCap('joe McAdam')─┐
+│ Mother'S Daughter             │ Joe Mcadam            │
+└───────────────────────────────┴───────────────────────┘
+```
+
+This is a known behaviour, with no plans currently to fix it.
+:::
+
+**Syntax**
+
+```sql
+initcap(val)
+```
+
+**Arguments**
+
+- `val` — Input value. [String](../data-types/string.md).
+
+**Returned value**
+
+- `val` with the first letter of each word converted to upper case. [String](../data-types/string.md).
+
+**Example**
+
+Query:
+
+```sql
+SELECT initcap('building for fast');
+```
+
+Result:
+
+```text
+┌─initcap('building for fast')─┐
+│ Building For Fast            │
+└──────────────────────────────┘
+```
+
 ## initcapUTF8

-Like [initcap](#initcap), assuming that the string contains valid UTF-8 encoded text. If this assumption is violated, no exception is thrown and the result is undefined.
-
-Does not detect the language, e.g. for Turkish the result might not be exactly correct (i/İ vs. i/I).
+Like [initcap](#initcap), `initcapUTF8` converts the first letter of each word to upper case and the rest to lower case. Assumes that the string contains valid UTF-8 encoded text. 
+If this assumption is violated, no exception is thrown and the result is undefined.

+:::note
+This function does not detect the language, e.g. for Turkish the result might not be exactly correct (i/İ vs. i/I).
 If the length of the UTF-8 byte sequence is different for upper and lower case of a code point, the result may be incorrect for this code point.
+:::
+
+**Syntax**
+
+```sql
+initcapUTF8(val)
+```
+
+**Arguments**
+
+- `val` — Input value. [String](../data-types/string.md).
+
+**Returned value**
+
+- `val` with the first letter of each word converted to upper case. [String](../data-types/string.md).
+
+**Example**
+
+Query:
+
+```sql
+SELECT initcapUTF8('не тормозит');
+```
+
+Result:
+
+```text
+┌─initcapUTF8('не тормозит')─┐
+│ Не Тормозит                │
+└────────────────────────────┘
+```

 ## firstLine

@ -2375,7 +2512,7 @@ firstLine(val)

 **Arguments**

- `val` - Input value. [String](../data-types/string.md)
+- `val` — Input value. [String](../data-types/string.md)

 **Returned value**

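
Two notes added to `string-functions.md` in this diff can be checked with a short sketch. The Python below is an illustration only, not ClickHouse's implementation. `initcap_sketch` is a hypothetical helper that applies the documented rule (words are runs of alphanumeric characters, first letter upper case, the rest lower case) to ASCII input. `utf8_len`, also hypothetical, counts encoded bytes to show that `ẞ` and `ß` differ in UTF-8 length.

```python
import re


def initcap_sketch(text: str) -> str:
    """Apply the documented initcap rule to ASCII words (hypothetical helper)."""
    def fix_word(match):
        word = match.group(0)
        return word[0].upper() + word[1:].lower()

    # Anything that is not an ASCII letter or digit, including "'", ends a word.
    return re.sub(r"[A-Za-z0-9]+", fix_word, text)


def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


print(initcap_sketch("mother's daughter"))  # Mother'S Daughter
print(initcap_sketch("joe McAdam"))         # Joe Mcadam
print(utf8_len("ẞ"), utf8_len("ß"))         # 3 2
```

The first two lines reproduce the output shown in the `initcap` note. The last line shows the kind of byte-length mismatch that the `lowerUTF8` and `upperUTF8` notes warn about.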
--- a/utils/check-style/aspell-ignore/en/aspell-dict.txt
+++ b/utils/check-style/aspell-ignore/en/aspell-dict.txt
@ -1,4 +1,4 @@
-personal_ws-1.1 en 2758
+personal_ws-1.1 en 2942 
 AArch
 ACLs
 ALTERs
@ -1664,9 +1664,9 @@ fsync
 func
 fuzzBits
 fuzzJSON
+fuzzQuery
 fuzzer
 fuzzers
-fuzzQuery
 gRPC
 gccMurmurHash
 gcem