mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-11-25 17:12:03 +00:00
Merge pull request #62111 from Blargian/document_ngramXYZ
Document ngramXYZ functions
commit
a5f3d7d5cb
@ -481,9 +481,9 @@ Alias: `haystack NOT ILIKE pattern` (operator)
|
|||||||
|
|
||||||
## ngramDistance
|
## ngramDistance
|
||||||
|
|
||||||
Calculates the 4-gram distance between a `haystack` string and a `needle` string. For that, it counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a Float32 between 0 and 1. The smaller the result is, the more strings are similar to each other. Throws an exception if constant `needle` or `haystack` arguments are more than 32Kb in size. If any of non-constant `haystack` or `needle` arguments is more than 32Kb in size, the distance is always 1.
|
Calculates the 4-gram distance between a `haystack` string and a `needle` string. For this, it counts the symmetric difference between two multisets of 4-grams and normalizes it by the sum of their cardinalities. Returns a [Float32](../../sql-reference/data-types/float.md/#float32-float64) between 0 and 1. The smaller the result is, the more similar the strings are to each other.
|
||||||
|
|
||||||
Functions `ngramDistanceCaseInsensitive, ngramDistanceUTF8, ngramDistanceCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function.
|
Functions [`ngramDistanceCaseInsensitive`](#ngramdistancecaseinsensitive), [`ngramDistanceUTF8`](#ngramdistanceutf8), [`ngramDistanceCaseInsensitiveUTF8`](#ngramdistancecaseinsensitiveutf8) provide case-insensitive and/or UTF-8 variants of this function.
|
||||||
|
|
||||||
**Syntax**
|
**Syntax**
|
||||||
|
|
||||||
@ -491,15 +491,170 @@ Functions `ngramDistanceCaseInsensitive, ngramDistanceUTF8, ngramDistanceCaseIns
|
|||||||
ngramDistance(haystack, needle)
|
ngramDistance(haystack, needle)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Parameters**
|
||||||
|
|
||||||
|
- `haystack`: First comparison string. [String literal](../syntax#string)
|
||||||
|
- `needle`: Second comparison string. [String literal](../syntax#string)
|
||||||
|
|
||||||
|
**Returned value**
|
||||||
|
|
||||||
|
- Value between 0 and 1 representing the distance between the two strings; the closer to 0, the more similar they are. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
|
||||||
|
|
||||||
|
**Implementation details**
|
||||||
|
|
||||||
|
This function will throw an exception if constant `needle` or `haystack` arguments are more than 32Kb in size. If any non-constant `haystack` or `needle` arguments are more than 32Kb in size, then the distance is always 1.
|
||||||
|
|
||||||
|
**Examples**
|
||||||
|
|
||||||
|
The more similar two strings are to each other, the closer the result will be to 0 (identical).
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramDistance('ClickHouse','ClickHouse!');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0.06666667
|
||||||
|
```
|
||||||
|
|
||||||
|
The less similar two strings are to each other, the larger the result will be.
|
||||||
|
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramDistance('ClickHouse','House');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0.5555556
|
||||||
|
```
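
The symmetric-difference calculation described for `ngramDistance` can be sketched in Python. This is a minimal illustration, not ClickHouse's implementation: it counts exact byte 4-grams with `collections.Counter` instead of hashing them, ignores the 32Kb size limits, and the names `byte_ngrams` and `ngram_distance` are hypothetical. Returning 0 when neither string has any 4-grams is also an assumption.

```python
from collections import Counter


def byte_ngrams(text: str, n: int = 4) -> Counter:
    """Multiset of all n-byte windows of the UTF-8 encoding of `text`."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_distance(haystack: str, needle: str, n: int = 4) -> float:
    """Symmetric multiset difference normalized by the total n-gram count."""
    a, b = byte_ngrams(haystack, n), byte_ngrams(needle, n)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # assumption: strings without n-grams count as identical
    # Counter subtraction keeps only positive counts, so the two one-sided
    # differences add up to the size of the symmetric difference.
    difference = sum((a - b).values()) + sum((b - a).values())
    return difference / total


print(ngram_distance("ClickHouse", "ClickHouse!"))  # 1 of 15 n-grams differs
print(ngram_distance("ClickHouse", "House"))        # 5 of 9 n-grams differ
```

Under these assumptions the sketch yields 1/15 and 5/9 for the two example queries, which agrees with the documented results up to Float32 rounding.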
|
||||||
|
|
||||||
|
## ngramDistanceCaseInsensitive
|
||||||
|
|
||||||
|
Provides a case-insensitive variant of [ngramDistance](#ngramdistance).
|
||||||
|
|
||||||
|
**Syntax**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ngramDistanceCaseInsensitive(haystack, needle)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters**
|
||||||
|
|
||||||
|
- `haystack`: First comparison string. [String literal](../syntax#string)
|
||||||
|
- `needle`: Second comparison string. [String literal](../syntax#string)
|
||||||
|
|
||||||
|
**Returned value**
|
||||||
|
|
||||||
|
- Value between 0 and 1 representing the distance between the two strings; the closer to 0, the more similar they are. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
|
||||||
|
|
||||||
|
**Examples**
|
||||||
|
|
||||||
|
With [ngramDistance](#ngramdistance), differences in case affect the distance value:
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramDistance('ClickHouse','clickhouse');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0.71428573
|
||||||
|
```
|
||||||
|
|
||||||
|
With [ngramDistanceCaseInsensitive](#ngramdistancecaseinsensitive), case is ignored, so two strings that differ only in case return a distance of 0:
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramDistanceCaseInsensitive('ClickHouse','clickhouse');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0
|
||||||
|
```
|
||||||
|
|
||||||
|
## ngramDistanceUTF8
|
||||||
|
|
||||||
|
Provides a UTF-8 variant of [ngramDistance](#ngramdistance). Assumes that `needle` and `haystack` strings are UTF-8 encoded strings.
|
||||||
|
|
||||||
|
**Syntax**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ngramDistanceUTF8(haystack, needle)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters**
|
||||||
|
|
||||||
|
- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
|
||||||
|
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)
|
||||||
|
|
||||||
|
**Returned value**
|
||||||
|
|
||||||
|
- Value between 0 and 1 representing the distance between the two strings; the closer to 0, the more similar they are. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
|
||||||
|
|
||||||
|
**Example**
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramDistanceUTF8('abcde','cde');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0.5
|
||||||
|
```
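
A comparable sketch for the UTF-8 variant, assuming (per the note on UTF-8 variants in this file) that it compares 3-grams of code points rather than 4-grams of bytes. Python strings are already sequences of code points, so plain slicing is enough. The names `codepoint_ngrams` and `ngram_distance_utf8` are hypothetical, and hashing is again left out.

```python
from collections import Counter


def codepoint_ngrams(text: str, n: int = 3) -> Counter:
    """Multiset of all windows of n consecutive code points in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_distance_utf8(haystack: str, needle: str, n: int = 3) -> float:
    """Same normalization as the byte-based sketch, but over code points."""
    a, b = codepoint_ngrams(haystack, n), codepoint_ngrams(needle, n)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # assumption: strings without n-grams count as identical
    difference = sum((a - b).values()) + sum((b - a).values())
    return difference / total


# 'abcde' has the 3-grams abc, bcd, cde; 'cde' has only cde.
print(ngram_distance_utf8("abcde", "cde"))
```

Two of the four 3-grams are unmatched, so the sketch gives 2/4 for the example query.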
|
||||||
|
|
||||||
|
## ngramDistanceCaseInsensitiveUTF8
|
||||||
|
|
||||||
|
Provides a case-insensitive variant of [ngramDistanceUTF8](#ngramdistanceutf8).
|
||||||
|
|
||||||
|
**Syntax**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ngramDistanceCaseInsensitiveUTF8(haystack, needle)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters**
|
||||||
|
|
||||||
|
- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
|
||||||
|
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)
|
||||||
|
|
||||||
|
**Returned value**
|
||||||
|
|
||||||
|
- Value between 0 and 1 representing the distance between the two strings; the closer to 0, the more similar they are. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
|
||||||
|
|
||||||
|
**Example**
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramDistanceCaseInsensitiveUTF8('abcde','CDE');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0.5
|
||||||
|
```
|
||||||
|
|
||||||
## ngramSearch
|
## ngramSearch
|
||||||
|
|
||||||
Like `ngramDistance` but calculates the non-symmetric difference between a `needle` string and a `haystack` string, i.e. the number of n-grams from `needle` minus the common number of n-grams normalized by the number of `needle` n-grams. Returns a Float32 between 0 and 1. The bigger the result is, the more likely `needle` is in the `haystack`. This function is useful for fuzzy string search. Also see function `soundex`.
|
Like `ngramDistance` but calculates the non-symmetric difference between a `needle` string and a `haystack` string, i.e. the number of n-grams from the needle minus the common number of n-grams normalized by the number of `needle` n-grams. Returns a [Float32](../../sql-reference/data-types/float.md/#float32-float64) between 0 and 1. The bigger the result is, the more likely `needle` is in the `haystack`. This function is useful for fuzzy string search. Also see function [`soundex`](../../sql-reference/functions/string-functions#soundex).
|
||||||
|
|
||||||
Functions `ngramSearchCaseInsensitive, ngramSearchUTF8, ngramSearchCaseInsensitiveUTF8` provide case-insensitive and/or UTF-8 variants of this function.
|
Functions [`ngramSearchCaseInsensitive`](#ngramsearchcaseinsensitive), [`ngramSearchUTF8`](#ngramsearchutf8), [`ngramSearchCaseInsensitiveUTF8`](#ngramsearchcaseinsensitiveutf8) provide case-insensitive and/or UTF-8 variants of this function.
|
||||||
|
|
||||||
:::note
|
|
||||||
The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables – collisions may occur. With UTF-8 case-insensitive format we do not use fair `tolower` function – we zero the 5-th bit (starting from zero) of each codepoint byte and first bit of zeroth byte if bytes more than one – this works for Latin and mostly for all Cyrillic letters.
|
|
||||||
:::
|
|
||||||
|
|
||||||
**Syntax**
|
**Syntax**
|
||||||
|
|
||||||
@ -507,6 +662,140 @@ The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram
|
|||||||
ngramSearch(haystack, needle)
|
ngramSearch(haystack, needle)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Parameters**
|
||||||
|
|
||||||
|
- `haystack`: First comparison string. [String literal](../syntax#string)
|
||||||
|
- `needle`: Second comparison string. [String literal](../syntax#string)
|
||||||
|
|
||||||
|
**Returned value**
|
||||||
|
|
||||||
|
- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
|
||||||
|
|
||||||
|
**Implementation details**
|
||||||
|
|
||||||
|
:::note
|
||||||
|
The UTF-8 variants use the 3-gram distance. These are not perfectly fair n-gram distances. We use 2-byte hashes to hash n-grams and then calculate the (non-)symmetric difference between these hash tables – collisions may occur. With UTF-8 case-insensitive format we do not use fair `tolower` function – we zero the 5-th bit (starting from zero) of each codepoint byte and first bit of zeroth byte if bytes more than one – this works for Latin and mostly for all Cyrillic letters.
|
||||||
|
:::
|
||||||
|
|
||||||
|
**Example**
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramSearch('Hello World','World Hello');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0.5
|
||||||
|
```
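
The example results in this file are consistent with reading the `ngramSearch` score as the share of the needle's 4-grams that also occur in the haystack. The following is a minimal sketch under that assumption. It uses exact byte 4-grams instead of hashes, and the name `ngram_search` and its `case_insensitive` flag (plain ASCII lowercasing) are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Multiset of all n-byte windows of `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_search(haystack: str, needle: str, n: int = 4,
                 case_insensitive: bool = False) -> float:
    """Fraction of the needle's n-grams that are also found in the haystack."""
    hay, ndl = haystack.encode("utf-8"), needle.encode("utf-8")
    if case_insensitive:
        # bytes.lower() only changes ASCII letters.
        hay, ndl = hay.lower(), ndl.lower()
    hay_grams, needle_grams = byte_ngrams(hay, n), byte_ngrams(ndl, n)
    total = sum(needle_grams.values())
    if total == 0:
        return 0.0  # assumption: a needle without n-grams never matches
    # Counter intersection takes the minimum count of each shared n-gram.
    common = sum((hay_grams & needle_grams).values())
    return common / total


print(ngram_search("Hello World", "World Hello"))
print(ngram_search("Hello World", "hello", case_insensitive=True))
```

For `'World Hello'`, four of its eight 4-grams (`Worl`, `orld`, `Hell`, `ello`) occur in `'Hello World'`, which gives 0.5. The score is not symmetric: swapping the arguments changes the denominator.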
|
||||||
|
|
||||||
|
## ngramSearchCaseInsensitive
|
||||||
|
|
||||||
|
Provides a case-insensitive variant of [ngramSearch](#ngramsearch).
|
||||||
|
|
||||||
|
**Syntax**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ngramSearchCaseInsensitive(haystack, needle)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters**
|
||||||
|
|
||||||
|
- `haystack`: First comparison string. [String literal](../syntax#string)
|
||||||
|
- `needle`: Second comparison string. [String literal](../syntax#string)
|
||||||
|
|
||||||
|
**Returned value**
|
||||||
|
|
||||||
|
- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
|
||||||
|
|
||||||
|
The bigger the result is, the more likely `needle` is in the `haystack`.
|
||||||
|
|
||||||
|
**Example**
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramSearchCaseInsensitive('Hello World','hello');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
1
|
||||||
|
```
|
||||||
|
|
||||||
|
## ngramSearchUTF8
|
||||||
|
|
||||||
|
Provides a UTF-8 variant of [ngramSearch](#ngramsearch) in which `needle` and `haystack` are assumed to be UTF-8 encoded strings.
|
||||||
|
|
||||||
|
**Syntax**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ngramSearchUTF8(haystack, needle)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters**
|
||||||
|
|
||||||
|
- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
|
||||||
|
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)
|
||||||
|
|
||||||
|
**Returned value**
|
||||||
|
|
||||||
|
- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
|
||||||
|
|
||||||
|
The bigger the result is, the more likely `needle` is in the `haystack`.
|
||||||
|
|
||||||
|
**Example**
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramSearchUTF8('абвгдеёжз', 'гдеёзд');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0.5
|
||||||
|
```
|
||||||
|
|
||||||
|
## ngramSearchCaseInsensitiveUTF8
|
||||||
|
|
||||||
|
Provides a case-insensitive variant of [ngramSearchUTF8](#ngramsearchutf8) in which `needle` and `haystack` are assumed to be UTF-8 encoded strings.
|
||||||
|
|
||||||
|
**Syntax**
|
||||||
|
|
||||||
|
```sql
|
||||||
|
ngramSearchCaseInsensitiveUTF8(haystack, needle)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters**
|
||||||
|
|
||||||
|
- `haystack`: First UTF-8 encoded comparison string. [String literal](../syntax#string)
|
||||||
|
- `needle`: Second UTF-8 encoded comparison string. [String literal](../syntax#string)
|
||||||
|
|
||||||
|
**Returned value**
|
||||||
|
|
||||||
|
- Value between 0 and 1 representing the likelihood of the `needle` being in the `haystack`. [Float32](../../sql-reference/data-types/float.md/#float32-float64)
|
||||||
|
|
||||||
|
The bigger the result is, the more likely `needle` is in the `haystack`.
|
||||||
|
|
||||||
|
**Example**
|
||||||
|
|
||||||
|
Query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT ngramSearchCaseInsensitiveUTF8('абвГДЕёжз', 'АбвгдЕЁжз');
|
||||||
|
```
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
```response
|
||||||
|
0.57142854
|
||||||
|
```
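
The result above is lower than 1 even though the two strings differ only in case. The sketch below suggests why, based on one reading of the note about the UTF-8 case-insensitive format: bit 5 of every byte of a code point is cleared, and for multi-byte code points the lowest bit of the first byte is cleared as well. Interpreting "first bit of zeroth byte" as the lowest bit is an assumption, the names `fold_codepoint` and `ngram_search_folded` are hypothetical, and hashing is omitted.

```python
from collections import Counter


def fold_codepoint(ch: str) -> bytes:
    """Approximate case folding by masking bits of the UTF-8 bytes."""
    raw = bytearray(ch.encode("utf-8"))
    if len(raw) > 1:
        raw[0] &= 0xFE  # clear bit 0 of the first byte (assumed reading)
    for i in range(len(raw)):
        raw[i] &= 0xDF  # clear bit 5 of every byte
    return bytes(raw)


def folded_ngrams(text: str, n: int = 3) -> Counter:
    """Multiset of n-grams over folded code points."""
    tokens = [fold_codepoint(ch) for ch in text]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_search_folded(haystack: str, needle: str, n: int = 3) -> float:
    """Share of the needle's folded n-grams that occur in the haystack."""
    hay_grams, needle_grams = folded_ngrams(haystack, n), folded_ngrams(needle, n)
    total = sum(needle_grams.values())
    if total == 0:
        return 0.0  # assumption: a needle without n-grams never matches
    return sum((hay_grams & needle_grams).values()) / total


print(fold_codepoint("а") == fold_codepoint("А"))  # D0 B0 and D0 90 both fold to D0 90
print(fold_codepoint("ё") == fold_codepoint("Ё"))  # D1 91 and D0 81 stay different
print(ngram_search_folded("абвГДЕёжз", "АбвгдЕЁжз"))
```

Under this reading `ё` and `Ё` do not fold to the same bytes, so the three 3-grams that contain them fail to match and 4 of the needle's 7 3-grams remain, giving 4/7, in line with the documented result.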
|
||||||
|
|
||||||
## countSubstrings
|
## countSubstrings
|
||||||
|
|
||||||
Returns how often substring `needle` occurs in string `haystack`.
|
Returns how often substring `needle` occurs in string `haystack`.
|
||||||
|
@ -1984,6 +1984,9 @@ nestjs
|
|||||||
netloc
|
netloc
|
||||||
ngram
|
ngram
|
||||||
ngramDistance
|
ngramDistance
|
||||||
|
ngramDistanceCaseInsensitive
|
||||||
|
ngramDistanceCaseInsensitiveUTF
|
||||||
|
ngramDistanceUTF
|
||||||
ngramMinHash
|
ngramMinHash
|
||||||
ngramMinHashArg
|
ngramMinHashArg
|
||||||
ngramMinHashArgCaseInsensitive
|
ngramMinHashArgCaseInsensitive
|
||||||
@ -1993,6 +1996,9 @@ ngramMinHashCaseInsensitive
|
|||||||
ngramMinHashCaseInsensitiveUTF
|
ngramMinHashCaseInsensitiveUTF
|
||||||
ngramMinHashUTF
|
ngramMinHashUTF
|
||||||
ngramSearch
|
ngramSearch
|
||||||
|
ngramSearchCaseInsensitive
|
||||||
|
ngramSearchCaseInsensitiveUTF
|
||||||
|
ngramSearchUTF
|
||||||
ngramSimHash
|
ngramSimHash
|
||||||
ngramSimHashCaseInsensitive
|
ngramSimHashCaseInsensitive
|
||||||
ngramSimHashCaseInsensitiveUTF
|
ngramSimHashCaseInsensitiveUTF
|
||||||
|