mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-11-21 23:21:59 +00:00
Merge pull request #62111 from Blargian/document_ngramXYZ
Document ngramXYZ functions
commit
a5f3d7d5cb
@ -481,9 +481,9 @@ Alias: `haystack NOT ILIKE pattern` (operator)

## ngramDistance

Calculates the 4-gram distance between a `haystack` string and a `needle` string. For this, it counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a [Float32](../../sql-reference/data-types/float.md/#float32-float64) between 0 and 1. The smaller the result is, the more similar the strings are to each other.

Functions [`ngramDistanceCaseInsensitive`](#ngramdistancecaseinsensitive), [`ngramDistanceUTF8`](#ngramdistanceutf8), [`ngramDistanceCaseInsensitiveUTF8`](#ngramdistancecaseinsensitiveutf8) provide case-insensitive and/or UTF-8 variants of this function.

**Syntax**

@ -491,15 +491,170 @@ Functions `ngramDistanceCaseInsensitive, ngramDistanceUTF8, ngramDistanceCaseIns
```sql
ngramDistance(haystack, needle)
```

**Parameters**

- `haystack`: First comparison string. [String literal](../syntax#string)
- `needle`: Second comparison string. [String literal](../syntax#string)

**Returned value**

- Value between 0 and 1 representing the distance between the two strings; the closer it is to 0, the more similar the strings are. [Float32](../../sql-reference/data-types/float.md/#float32-float64)

**Implementation details**

This function throws an exception if constant `needle` or `haystack` arguments are more than 32Kb in size. If any non-constant `haystack` or `needle` arguments are more than 32Kb in size, the distance is always 1.

**Examples**

The more similar two strings are to each other, the closer the result will be to 0 (identical).

Query:

```sql
SELECT ngramDistance('ClickHouse','ClickHouse!');
```

Result:

```response
0.06666667
```

The less similar two strings are to each other, the larger the result will be.

Query:

```sql
SELECT ngramDistance('ClickHouse','House');
```

Result:

```response
0.5555556
```

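
The definition above (symmetric difference of the two 4-gram multisets, divided by the sum of their sizes) can be sketched in a few lines of Python. This is an illustration only, not ClickHouse's implementation: it uses exact multisets rather than hashed n-grams, the names `ngrams` and `ngram_distance` are made up for this sketch, and the handling of strings too short to contain any 4-gram is an assumption.

```python
from collections import Counter


def ngrams(data: bytes, n: int = 4) -> Counter:
    """Multiset of all contiguous n-byte windows in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_distance(haystack: str, needle: str, n: int = 4) -> float:
    """Size of the symmetric difference of the two n-gram multisets,
    divided by the sum of their sizes."""
    a = ngrams(haystack.encode("utf-8"), n)
    b = ngrams(needle.encode("utf-8"), n)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        # Assumption of this sketch: if neither string is long enough to
        # form an n-gram, treat the strings as identical.
        return 0.0
    # Counter subtraction keeps only positive counts, so (a - b) + (b - a)
    # is the multiset symmetric difference.
    sym_diff = sum(((a - b) + (b - a)).values())
    return sym_diff / total


print(ngram_distance("ClickHouse", "ClickHouse!"))  # 1/15
print(ngram_distance("ClickHouse", "House"))        # 5/9
print(ngram_distance("ClickHouse", "clickhouse"))   # 10/14
```

The first two values (1/15 and 5/9) agree with the `Float32` results in the examples above up to rounding. The third (10/14) corresponds to the case-sensitive comparison shown in the `ngramDistanceCaseInsensitive` section.
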
## ngramDistanceCaseInsensitive

Provides a case-insensitive variant of [ngramDistance](#ngramdistance).

**Syntax**

```sql
ngramDistanceCaseInsensitive(haystack, needle)
```

**Parameters**

- `haystack`: First comparison string. [String literal](../syntax#string)
- `needle`: Second comparison string. [String literal](../syntax#string)

**Returned value**

- Value between 0 and 1 representing the distance between the two strings; the closer it is to 0, the more similar the strings are. [Float32](../../sql-reference/data-types/float.md/#float32-float64)

**Examples**

With [ngramDistance](#ngramdistance), differences in case affect the result:

Query:

```sql
SELECT ngramDistance('ClickHouse','clickhouse');
```

Result:

```response
0.71428573
```

With [ngramDistanceCaseInsensitive](#ngramdistancecaseinsensitive), case is ignored, so two strings that differ only in case return a distance of 0:

Query:

```sql
SELECT ngramDistanceCaseInsensitive('ClickHouse','clickhouse');
```

Result:

```response
0
```

## ngramDistanceUTF8

Provides a UTF-8 variant of [ngramDistance](#ngramdistance). Assumes that the `needle` and `haystack` strings are UTF-8 encoded.

**Syntax**

```sql
ngramDistanceUTF8(haystack, needle)
```

**Parameters**

- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)

**Returned value**

- Value between 0 and 1 representing the distance between the two strings; the closer it is to 0, the more similar the strings are. [Float32](../../sql-reference/data-types/float.md/#float32-float64)

**Example**

Query:

```sql
SELECT ngramDistanceUTF8('abcde','cde');
```

Result:

```response
0.5
```

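
The practical difference from `ngramDistance` lies in the unit and size of the n-gram. The Python sketch below assumes that the UTF-8 variant forms 3-grams over code points (as the note on UTF-8 variants in this document states) while the plain variant forms 4-grams over bytes. The function names are hypothetical, and exact multisets stand in for the hashed n-grams that ClickHouse uses.

```python
from collections import Counter


def distance(a: Counter, b: Counter) -> float:
    """Symmetric difference of two n-gram multisets over the sum of their sizes."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # assumption of this sketch: no n-grams on either side
    return sum(((a - b) + (b - a)).values()) / total


def windows(seq, n: int) -> Counter:
    """Multiset of all length-n windows of a str (code points) or bytes."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_distance_utf8(haystack: str, needle: str) -> float:
    # Python's str is indexed by code point, so these are 3-grams of characters.
    return distance(windows(haystack, 3), windows(needle, 3))


def ngram_distance_bytes(haystack: str, needle: str) -> float:
    # 4-grams over the raw UTF-8 bytes, as in the byte-oriented variant.
    return distance(windows(haystack.encode("utf-8"), 4),
                    windows(needle.encode("utf-8"), 4))


print(ngram_distance_utf8("abcde", "cde"))   # 0.5, matching the example above
print(ngram_distance_bytes("abcde", "cde"))  # 1.0 in this sketch: "cde" has no 4-gram
print(ngram_distance_utf8("абвгд", "вгд"))   # 0.5, Cyrillic counted per character
```

In this sketch `'abcde'` yields the 3-grams `abc`, `bcd`, `cde` and `'cde'` yields only `cde`, so the symmetric difference has 2 elements out of 4 in total, which gives 0.5.
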
## ngramDistanceCaseInsensitiveUTF8

Provides a case-insensitive variant of [ngramDistanceUTF8](#ngramdistanceutf8).

**Syntax**

```sql
ngramDistanceCaseInsensitiveUTF8(haystack, needle)
```

**Parameters**

- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)

**Returned value**

- Value between 0 and 1 representing the distance between the two strings; the closer it is to 0, the more similar the strings are. [Float32](../../sql-reference/data-types/float.md/#float32-float64)

**Example**

Query:

```sql
SELECT ngramDistanceCaseInsensitiveUTF8('abcde','CDE');
```

Result:

```response
0.5
```

## ngramSearch

Like `ngramDistance` but calculates the non-symmetric difference between a `needle` string and a `haystack` string, i.e. the number of n-grams from the `needle` minus the number of common n-grams, normalized by the number of `needle` n-grams. Returns a [Float32](../../sql-reference/data-types/float.md/#float32-float64) between 0 and 1. The bigger the result is, the more likely `needle` is in the `haystack`. This function is useful for fuzzy string search. Also see function [`soundex`](../../sql-reference/functions/string-functions#soundex).

Functions [`ngramSearchCaseInsensitive`](#ngramsearchcaseinsensitive), [`ngramSearchUTF8`](#ngramsearchutf8), [`ngramSearchCaseInsensitiveUTF8`](#ngramsearchcaseinsensitiveutf8) provide case-insensitive and/or UTF-8 variants of this function.

**Syntax**

@ -507,6 +662,140 @@ The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram
```sql
ngramSearch(haystack, needle)
```

**Parameters**

- `haystack`: First comparison string. [String literal](../syntax#string)
- `needle`: Second comparison string. [String literal](../syntax#string)

**Returned value**

- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)

**Implementation details**

:::note
The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables, so collisions may occur. In the case-insensitive UTF-8 variants we do not use a fair `tolower` function: we zero the 5th bit (counting from zero) of each code point byte, and also the first bit of the zeroth byte if the code point has more than one byte. This works for Latin letters and for most Cyrillic letters.
:::
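
The case-folding shortcut described in the note can be illustrated as follows. This Python sketch is an interpretation of the note, not code taken from ClickHouse: it reads "first bit of zeroth byte" as the least significant bit of the leading byte, and `fold_case` is a hypothetical helper name.

```python
def fold_case(ch: str) -> bytes:
    """Apply the bit-clearing trick to the UTF-8 bytes of one code point."""
    raw = bytearray(ch.encode("utf-8"))
    # Clear bit 5 (mask 0x20) in every byte of the code point.
    for i in range(len(raw)):
        raw[i] &= 0xDF
    # For multi-byte code points, also clear the lowest bit of the leading byte
    # (assumed meaning of "first bit of zeroth byte").
    if len(raw) > 1:
        raw[0] &= 0xFE
    return bytes(raw)


for lower, upper in [("a", "A"), ("а", "А"), ("р", "Р"), ("ё", "Ё")]:
    # True means both letters fold to the same bytes.
    print(lower, upper, fold_case(lower) == fold_case(upper))
```

Under this reading the Latin pair and the Cyrillic pairs `а`/`А` and `р`/`Р` fold to the same bytes, while `ё`/`Ё` do not. That is consistent with the `ngramSearchCaseInsensitiveUTF8` example in this document returning 0.57142854 (4 of 7 3-grams) rather than 1.
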

**Example**

Query:

```sql
SELECT ngramSearch('Hello World','World Hello');
```

Result:

```response
0.5
```

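
A rough Python sketch of the search score follows. It rests on an assumption inferred from the examples in this document rather than from the source code: since a fully contained `needle` scores 1, the reported value is taken to be the share of `needle` 4-grams that also occur in `haystack`, which is one minus the normalized difference described above. The names `ngrams` and `ngram_search` are hypothetical, and exact multisets replace the hashed n-grams.

```python
from collections import Counter


def ngrams(data: bytes, n: int = 4) -> Counter:
    """Multiset of all contiguous n-byte windows in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_search(haystack: str, needle: str, n: int = 4) -> float:
    """Fraction of the needle's n-grams that also appear in the haystack."""
    hay = ngrams(haystack.encode("utf-8"), n)
    ned = ngrams(needle.encode("utf-8"), n)
    needle_total = sum(ned.values())
    if needle_total == 0:
        # Assumption of this sketch: a needle without any n-gram scores 0.
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    common = sum((hay & ned).values())
    return common / needle_total


print(ngram_search("Hello World", "World Hello"))  # 0.5
print(ngram_search("hello world", "hello"))        # 1.0
print(ngram_search("Hello World", "ClickHouse"))   # 0.0
```

For `'World Hello'` the needle has 8 4-grams, of which `Worl`, `orld`, `Hell` and `ello` also occur in `'Hello World'`, giving 4/8 = 0.5 as in the example above. The score is not symmetric: swapping the arguments changes the denominator.
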
## ngramSearchCaseInsensitive

Provides a case-insensitive variant of [ngramSearch](#ngramsearch).

**Syntax**

```sql
ngramSearchCaseInsensitive(haystack, needle)
```

**Parameters**

- `haystack`: First comparison string. [String literal](../syntax#string)
- `needle`: Second comparison string. [String literal](../syntax#string)

**Returned value**

- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)

The bigger the result is, the more likely `needle` is in the `haystack`.

**Example**

Query:

```sql
SELECT ngramSearchCaseInsensitive('Hello World','hello');
```

Result:

```response
1
```

## ngramSearchUTF8

Provides a UTF-8 variant of [ngramSearch](#ngramsearch) in which `needle` and `haystack` are assumed to be UTF-8 encoded strings.

**Syntax**

```sql
ngramSearchUTF8(haystack, needle)
```

**Parameters**

- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)

**Returned value**

- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)

The bigger the result is, the more likely `needle` is in the `haystack`.

**Example**

Query:

```sql
SELECT ngramSearchUTF8('абвгдеёжз', 'гдеёзд');
```

Result:

```response
0.5
```

## ngramSearchCaseInsensitiveUTF8

Provides a case-insensitive variant of [ngramSearchUTF8](#ngramsearchutf8) in which `needle` and `haystack` are assumed to be UTF-8 encoded strings.

**Syntax**

```sql
ngramSearchCaseInsensitiveUTF8(haystack, needle)
```

**Parameters**

- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)

**Returned value**

- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)

The bigger the result is, the more likely `needle` is in the `haystack`.

**Example**

Query:

```sql
SELECT ngramSearchCaseInsensitiveUTF8('абвГДЕёжз', 'АбвгдЕЁжз');
```

Result:

```response
0.57142854
```

## countSubstrings

Returns how often substring `needle` occurs in string `haystack`.

@ -1984,6 +1984,9 @@ nestjs
netloc
ngram
ngramDistance
ngramDistanceCaseInsensitive
ngramDistanceCaseInsensitiveUTF
ngramDistanceUTF
ngramMinHash
ngramMinHashArg
ngramMinHashArgCaseInsensitive
@ -1993,6 +1996,9 @@ ngramMinHashCaseInsensitive
ngramMinHashCaseInsensitiveUTF
ngramMinHashUTF
ngramSearch
ngramSearchCaseInsensitive
ngramSearchCaseInsensitiveUTF
ngramSearchUTF
ngramSimHash
ngramSimHashCaseInsensitive
ngramSimHashCaseInsensitiveUTF