Merge pull request #62111 from Blargian/document_ngramXYZ

Document ngramXYZ functions
This commit is contained in:
robot-ch-test-poll4 2024-04-02 10:37:14 +02:00 committed by GitHub
commit a5f3d7d5cb
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
2 changed files with 303 additions and 8 deletions

View File

@ -481,9 +481,9 @@ Alias: `haystack NOT ILIKE pattern` (operator)
## ngramDistance ## ngramDistance
Calculates the 4-gram distance between a `haystack` string and a `needle` string. For that, it counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a Float32 between 0 and 1. The smaller the result is, the more strings are similar to each other. Throws an exception if constant `needle` or `haystack` arguments are more than 32Kb in size. If any of non-constant `haystack` or `needle` arguments is more than 32Kb in size, the distance is always 1. Calculates the 4-gram distance between a `haystack` string and a `needle` string. For this, it counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a [Float32](../../sql-reference/data-types/float.md/#float32-float64) between 0 and 1. The smaller the result is, the more similar the strings are to each other.
Functions `ngramDistanceCaseInsensitive, ngramDistanceUTF8, ngramDistanceCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function. Functions [`ngramDistanceCaseInsensitive`](#ngramdistancecaseinsensitive), [`ngramDistanceUTF8`](#ngramdistanceutf8), [`ngramDistanceCaseInsensitiveUTF8`](#ngramdistancecaseinsensitiveutf8) provide case-insensitive and/or UTF-8 variants of this function.
**Syntax** **Syntax**
@ -491,15 +491,170 @@ Functions `ngramDistanceCaseInsensitive, ngramDistanceUTF8, ngramDistanceCaseIns
ngramDistance(haystack, needle) ngramDistance(haystack, needle)
``` ```
**Parameters**
- `haystack`: First comparison string. [String literal](../syntax#string)
- `needle`: Second comparison string. [String literal](../syntax#string)
**Returned value**
- Value between 0 and 1 representing the similarity between the two strings. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
**Implementation details**
This function will throw an exception if constant `needle` or `haystack` arguments are more than 32Kb in size. If any non-constant `haystack` or `needle` arguments are more than 32Kb in size, then the distance is always 1.
**Examples**
The more similar two strings are to each other, the closer the result will be to 0 (identical).
Query:
```sql
SELECT ngramDistance('ClickHouse','ClickHouse!');
```
Result:
```response
0.06666667
```
The less similar two strings are to each, the larger the result will be.
Query:
```sql
SELECT ngramDistance('ClickHouse','House');
```
Result:
```response
0.5555556
```
## ngramDistanceCaseInsensitive
Provides a case-insensitive variant of [ngramDistance](#ngramdistance).
**Syntax**
```sql
ngramDistanceCaseInsensitive(haystack, needle)
```
**Parameters**
- `haystack`: First comparison string. [String literal](../syntax#string)
- `needle`: Second comparison string. [String literal](../syntax#string)
**Returned value**
- Value between 0 and 1 representing the similarity between the two strings. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
**Examples**
With [ngramDistance](#ngramdistance) differences in case will affect the similarity value:
Query:
```sql
SELECT ngramDistance('ClickHouse','clickhouse');
```
Result:
```response
0.71428573
```
With [ngramDistanceCaseInsensitive](#ngramdistancecaseinsensitive) case is ignored so two identical strings differing only in case will now return a low similarity value:
Query:
```sql
SELECT ngramDistanceCaseInsensitive('ClickHouse','clickhouse');
```
Result:
```response
0
```
## ngramDistanceUTF8
Provides a UTF-8 variant of [ngramDistance](#ngramdistance). Assumes that `needle` and `haystack` strings are UTF-8 encoded strings.
**Syntax**
```sql
ngramDistanceUTF8(haystack, needle)
```
**Parameters**
- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)
**Returned value**
- Value between 0 and 1 representing the similarity between the two strings. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
**Example**
Query:
```sql
SELECT ngramDistanceUTF8('abcde','cde');
```
Result:
```response
0.5
```
## ngramDistanceCaseInsensitiveUTF8
Provides a case-insensitive variant of [ngramDistanceUTF8](#ngramdistanceutf8).
**Syntax**
```sql
ngramDistanceCaseInsensitiveUTF8(haystack, needle)
```
**Parameters**
- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)
**Returned value**
- Value between 0 and 1 representing the similarity between the two strings. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
**Example**
Query:
```sql
SELECT ngramDistanceCaseInsensitiveUTF8('abcde','CDE');
```
Result:
```response
0.5
```
## ngramSearch ## ngramSearch
Like `ngramDistance` but calculates the non-symmetric difference between a `needle` string and a `haystack` string, i.e. the number of n-grams from `needle` minus the common number of n-grams normalized by the number of `needle` n-grams. Returns a Float32 between 0 and 1. The bigger the result is, the more likely `needle` is in the `haystack`. This function is useful for fuzzy string search. Also see function `soundex`. Like `ngramDistance` but calculates the non-symmetric difference between a `needle` string and a `haystack` string, i.e. the number of n-grams from the needle minus the common number of n-grams normalized by the number of `needle` n-grams. Returns a [Float32](../../sql-reference/data-types/float.md/#float32-float64) between 0 and 1. The bigger the result is, the more likely `needle` is in the `haystack`. This function is useful for fuzzy string search. Also see function [`soundex`](../../sql-reference/functions/string-functions#soundex).
Functions `ngramSearchCaseInsensitive, ngramSearchUTF8, ngramSearchCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function. Functions [`ngramSearchCaseInsensitive`](#ngramsearchcaseinsensitive), [`ngramSearchUTF8`](#ngramsearchutf8), [`ngramSearchCaseInsensitiveUTF8`](#ngramsearchcaseinsensitiveutf8) provide case-insensitive and/or UTF-8 variants of this function.
:::note
The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables collisions may occur. With UTF-8 case-insensitive format we do not use fair `tolower` function we zero the 5-th bit (starting from zero) of each codepoint byte and first bit of zeroth byte if bytes more than one this works for Latin and mostly for all Cyrillic letters.
:::
**Syntax** **Syntax**
@ -507,6 +662,140 @@ The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram
ngramSearch(haystack, needle) ngramSearch(haystack, needle)
``` ```
**Parameters**
- `haystack`: First comparison string. [String literal](../syntax#string)
- `needle`: Second comparison string. [String literal](../syntax#string)
**Returned value**
- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
**Implementation details**
:::note
The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables collisions may occur. With UTF-8 case-insensitive format we do not use fair `tolower` function we zero the 5-th bit (starting from zero) of each codepoint byte and first bit of zeroth byte if bytes more than one this works for Latin and mostly for all Cyrillic letters.
:::
**Example**
Query:
```sql
SELECT ngramSearch('Hello World','World Hello');
```
Result:
```response
0.5
```
## ngramSearchCaseInsensitive
Provides a case-insensitive variant of [ngramSearch](#ngramSearch).
**Syntax**
```sql
ngramSearchCaseInsensitive(haystack, needle)
```
**Parameters**
- `haystack`: First comparison string. [String literal](../syntax#string)
- `needle`: Second comparison string. [String literal](../syntax#string)
**Returned value**
- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
The bigger the result is, the more likely `needle` is in the `haystack`.
**Example**
Query:
```sql
SELECT ngramSearchCaseInsensitive('Hello World','hello');
```
Result:
```response
1
```
## ngramSearchUTF8
Provides a UTF-8 variant of [ngramSearch](#ngramsearch) in which `needle` and `haystack` are assumed to be UTF-8 encoded strings.
**Syntax**
```sql
ngramSearchUTF8(haystack, needle)
```
**Parameters**
- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)
**Returned value**
- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
The bigger the result is, the more likely `needle` is in the `haystack`.
**Example**
Query:
```sql
SELECT ngramSearchUTF8('абвгдеёжз', 'гдеёзд');
```
Result:
```response
0.5
```
## ngramSearchCaseInsensitiveUTF8
Provides a case-insensitive variant of [ngramSearchUTF8](#ngramsearchutf8) in which `needle` and `haystack`.
**Syntax**
```sql
ngramSearchCaseInsensitiveUTF8(haystack, needle)
```
**Parameters**
- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)
**Returned value**
- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
The bigger the result is, the more likely `needle` is in the `haystack`.
**Example**
Query:
```sql
SELECT ngramSearchCaseInsensitiveUTF8('абвГДЕёжз', 'АбвгдЕЁжз');
```
Result:
```response
0.57142854
```
## countSubstrings ## countSubstrings
Returns how often substring `needle` occurs in string `haystack`. Returns how often substring `needle` occurs in string `haystack`.

View File

@ -1984,6 +1984,9 @@ nestjs
netloc netloc
ngram ngram
ngramDistance ngramDistance
ngramDistanceCaseInsensitive
ngramDistanceCaseInsensitiveUTF
ngramDistanceUTF
ngramMinHash ngramMinHash
ngramMinHashArg ngramMinHashArg
ngramMinHashArgCaseInsensitive ngramMinHashArgCaseInsensitive
@ -1993,6 +1996,9 @@ ngramMinHashCaseInsensitive
ngramMinHashCaseInsensitiveUTF ngramMinHashCaseInsensitiveUTF
ngramMinHashUTF ngramMinHashUTF
ngramSearch ngramSearch
ngramSearchCaseInsensitive
ngramSearchCaseInsensitiveUTF
ngramSearchUTF
ngramSimHash ngramSimHash
ngramSimHashCaseInsensitive ngramSimHashCaseInsensitive
ngramSimHashCaseInsensitiveUTF ngramSimHashCaseInsensitiveUTF