---
slug: /en/sql-reference/functions/hash-functions
sidebar_position: 85
sidebar_label: Hash
---
# Hash Functions
Hash functions can be used for the deterministic pseudo-random shuffling of elements.
Simhash is a hash function, which returns close hash values for close (similar) arguments.
## halfMD5
[Interprets](/docs/en/sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the [MD5](https://en.wikipedia.org/wiki/MD5) hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the resulting string, and interprets them as `UInt64` in big-endian byte order.
```sql
halfMD5(par1, ...)
```
The function is relatively slow (5 million short strings per second per processor core).
Consider using the [sipHash64](#hash_functions-siphash64) function instead.
**Arguments**
The function takes a variable number of input parameters. Arguments can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md). For some data types, the calculated hash value may be the same for the same values even if the argument types differ (integers of different sizes, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data).
**Returned Value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Example**
```sql
SELECT halfMD5(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS halfMD5hash, toTypeName(halfMD5hash) AS type;
```
```response
┌────────halfMD5hash─┬─type───┐
│ 186182704141653334 │ UInt64 │
└────────────────────┴────────┘
```
## MD4
Calculates the MD4 from a string and returns the resulting set of bytes as FixedString(16).
## MD5 {#hash_functions-md5}
Calculates the MD5 from a string and returns the resulting set of bytes as FixedString(16).
If you do not need MD5 in particular, but you need a decent cryptographic 128-bit hash, use the sipHash128 function instead.
If you want to get the same result as output by the md5sum utility, use lower(hex(MD5(s))).
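As a minimal sketch of the `md5sum` comparison, the query below hashes the string `abc`, whose well-known MD5 digest is `900150983cd24fb0d6963f7d28e17f72`. The shell command in the comment is only an illustration of how the same digest could be produced outside ClickHouse.

```sql
-- Hex-encode the raw 16-byte MD5 digest and lowercase it to match md5sum's output format.
-- Comparable shell command (note: no trailing newline): printf 'abc' | md5sum
SELECT lower(hex(MD5('abc'))) AS md5_hex;
-- md5_hex: 900150983cd24fb0d6963f7d28e17f72
```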
## sipHash64 {#hash_functions-siphash64}
Produces a 64-bit [SipHash](https://en.wikipedia.org/wiki/SipHash) hash value.
```sql
sipHash64(par1,...)
```
This is a cryptographic hash function. It works at least three times faster than the [MD5](#hash_functions-md5) hash function.
The function [interprets](/docs/en/sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the hash value for each of them. It then combines the hashes by the following algorithm:
1. The first and the second hash value are concatenated to an array which is hashed.
2. The previously calculated hash value and the hash of the third input parameter are hashed in a similar way.
3. This calculation is repeated for all remaining hash values of the original input.
**Arguments**
The function takes a variable number of input parameters of any of the [supported data types](/docs/en/sql-reference/data-types/index.md).
**Returned Value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
Note that the calculated hash values may be equal for the same input values of different argument types. This affects, for example, integer types of different sizes, named and unnamed `Tuple` with the same data, and `Map` and the corresponding `Array(Tuple(key, value))` type with the same data.
**Example**
```sql
SELECT sipHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS SipHash, toTypeName(SipHash) AS type;
```
```response
┌──────────────SipHash─┬─type───┐
│ 11400366955626497465 │ UInt64 │
└──────────────────────┴────────┘
```
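The note about argument types can be checked with a query along these lines. It is a sketch only: the specific values are arbitrary, and each comparison returns `1` only where the hashes actually coincide on your server version.

```sql
-- Compare hashes of the same logical value stored under different types.
SELECT
    sipHash64(toUInt8(42)) = sipHash64(toUInt64(42)) AS same_for_int_sizes,
    sipHash64(map('k', 1)) = sipHash64([('k', 1)]) AS same_for_map_and_array;
```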
## sipHash64Keyed
Same as [sipHash64](#hash_functions-siphash64) but additionally takes an explicit key argument instead of using a fixed key.
**Syntax**
```sql
sipHash64Keyed((k0, k1), par1,...)
```
**Arguments**
Same as [sipHash64](#hash_functions-siphash64), but the first argument is a tuple of two UInt64 values representing the key.
**Returned value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Example**
Query:
```sql
SELECT sipHash64Keyed((506097522914230528, 1084818905618843912), array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS SipHash, toTypeName(SipHash) AS type;
```
```response
┌─────────────SipHash─┬─type───┐
│ 8017656310194184311 │ UInt64 │
└─────────────────────┴────────┘
```
## sipHash128 {#hash_functions-siphash128}
Like [sipHash64](#hash_functions-siphash64) but produces a 128-bit hash value, i.e. the final xor-folding of the state is done up to 128 bits.
:::note
This 128-bit variant differs from the reference implementation and it's weaker.
This version exists because, when it was written, there was no official 128-bit extension for SipHash.
New projects should probably use [sipHash128Reference](#hash_functions-siphash128reference).
:::
**Syntax**
```sql
sipHash128(par1,...)
```
**Arguments**
Same as for [sipHash64](#hash_functions-siphash64).
**Returned value**
A 128-bit `SipHash` hash value of type [FixedString(16)](/docs/en/sql-reference/data-types/fixedstring.md).
**Example**
Query:
```sql
SELECT hex(sipHash128('foo', '\x01', 3));
```
Result:
```response
┌─hex(sipHash128('foo', '', 3))────┐
│ 9DE516A64A414D4B1B609415E4523F24 │
└──────────────────────────────────┘
```
## sipHash128Keyed
Same as [sipHash128](#hash_functions-siphash128) but additionally takes an explicit key argument instead of using a fixed key.
:::note
This 128-bit variant differs from the reference implementation and it's weaker.
This version exists because, when it was written, there was no official 128-bit extension for SipHash.
New projects should probably use [sipHash128ReferenceKeyed](#hash_functions-siphash128referencekeyed).
:::
**Syntax**
```sql
sipHash128Keyed((k0, k1), par1,...)
```
**Arguments**
Same as [sipHash128](#hash_functions-siphash128), but the first argument is a tuple of two UInt64 values representing the key.
**Returned value**
A 128-bit `SipHash` hash value of type [FixedString(16)](/docs/en/sql-reference/data-types/fixedstring.md).
**Example**
Query:
```sql
SELECT hex(sipHash128Keyed((506097522914230528, 1084818905618843912),'foo', '\x01', 3));
```
Result:
```response
┌─hex(sipHash128Keyed((506097522914230528, 1084818905618843912), 'foo', '', 3))─┐
│ B8467F65C8B4CFD9A5F8BD733917D9BF │
└───────────────────────────────────────────────────────────────────────────────┘
```
## sipHash128Reference {#hash_functions-siphash128reference}
Like [sipHash128](#hash_functions-siphash128) but implements the 128-bit algorithm from the original authors of SipHash.
**Syntax**
```sql
sipHash128Reference(par1,...)
```
**Arguments**
Same as for [sipHash128](#hash_functions-siphash128).
**Returned value**
A 128-bit `SipHash` hash value of type [FixedString(16)](/docs/en/sql-reference/data-types/fixedstring.md).
**Example**
Query:
```sql
SELECT hex(sipHash128Reference('foo', '\x01', 3));
```
Result:
```response
┌─hex(sipHash128Reference('foo', '', 3))─┐
│ 4D1BE1A22D7F5933C0873E1698426260 │
└────────────────────────────────────────┘
```
## sipHash128ReferenceKeyed {#hash_functions-siphash128referencekeyed}
Same as [sipHash128Reference](#hash_functions-siphash128reference) but additionally takes an explicit key argument instead of using a fixed key.
**Syntax**
```sql
sipHash128ReferenceKeyed((k0, k1), par1,...)
```
**Arguments**
Same as [sipHash128Reference](#hash_functions-siphash128reference), but the first argument is a tuple of two UInt64 values representing the key.
**Returned value**
A 128-bit `SipHash` hash value of type [FixedString(16)](/docs/en/sql-reference/data-types/fixedstring.md).
**Example**
Query:
```sql
SELECT hex(sipHash128ReferenceKeyed((506097522914230528, 1084818905618843912),'foo', '\x01', 3));
```
Result:
```response
┌─hex(sipHash128ReferenceKeyed((506097522914230528, 1084818905618843912), 'foo', '', 3))─┐
│ 630133C9722DC08646156B8130C4CDC8 │
└────────────────────────────────────────────────────────────────────────────────────────┘
```
## cityHash64
Produces a 64-bit [CityHash](https://github.com/google/cityhash) hash value.
```sql
cityHash64(par1,...)
```
This is a fast non-cryptographic hash function. It uses the CityHash algorithm for string parameters and an implementation-specific fast non-cryptographic hash function for parameters of other data types. The function uses the CityHash combinator to get the final result.
Note that Google changed the CityHash algorithm after it was added to ClickHouse. In other words, ClickHouse's cityHash64 and Google's upstream CityHash now produce different results. ClickHouse cityHash64 corresponds to CityHash v1.0.2.
**Arguments**
The function takes a variable number of input parameters. Arguments can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md). For some data types, the calculated hash value may be the same for the same values even if the argument types differ (integers of different sizes, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data).
**Returned Value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Examples**
Call example:
```sql
SELECT cityHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS CityHash, toTypeName(CityHash) AS type;
```
```response
┌─────────────CityHash─┬─type───┐
│ 12072650598913549138 │ UInt64 │
└──────────────────────┴────────┘
```
The following example shows how to compute a checksum of the entire table that does not depend on the row order:
```sql
SELECT groupBitXor(cityHash64(*)) FROM table
```
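One way to use such a checksum is to compare two copies of a table. The sketch below assumes two hypothetical tables, `events` and `events_copy`, with identical column order. Because XOR is commutative, row order does not affect the result. Note that pairs of identical rows cancel each other out under XOR, so this is a quick consistency check rather than a proof of equality.

```sql
-- `events` and `events_copy` are hypothetical table names used for illustration.
SELECT
    (SELECT groupBitXor(cityHash64(*)) FROM events) AS checksum_original,
    (SELECT groupBitXor(cityHash64(*)) FROM events_copy) AS checksum_copy,
    checksum_original = checksum_copy AS tables_match;
```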
## intHash32
Calculates a 32-bit hash code from any type of integer.
This is a relatively fast non-cryptographic hash function of average quality for numbers.
## intHash64
Calculates a 64-bit hash code from any type of integer.
It works faster than intHash32. Average quality.
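A short sketch of calling both integer hash functions side by side on the built-in `numbers` table function. No output is shown because the concrete values are not stated in this document.

```sql
SELECT
    number,
    intHash32(number) AS h32,
    intHash64(number) AS h64,
    toTypeName(h32) AS h32_type,
    toTypeName(h64) AS h64_type
FROM numbers(3);
```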
## SHA1, SHA224, SHA256, SHA512
Calculates SHA-1, SHA-224, SHA-256, SHA-512 hash from a string and returns the resulting set of bytes as [FixedString](/docs/en/sql-reference/data-types/fixedstring.md).
**Syntax**
```sql
SHA1('s')
...
SHA512('s')
```
The function works fairly slowly (SHA-1 processes about 5 million short strings per second per processor core, while SHA-224 and SHA-256 process about 2.2 million).
We recommend using this function only in cases when you need a specific hash function and you can't select it.
Even in these cases, we recommend applying the function offline and pre-calculating values when inserting them into the table, instead of applying it in `SELECT` queries.
**Arguments**
- `s` — Input string for SHA hash calculation. [String](/docs/en/sql-reference/data-types/string.md).
**Returned value**
- SHA hash as a hex-unencoded FixedString: `FixedString(20)` for SHA-1, `FixedString(28)` for SHA-224, `FixedString(32)` for SHA-256, and `FixedString(64)` for SHA-512.
Type: [FixedString](/docs/en/sql-reference/data-types/fixedstring.md).
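The digest sizes listed above can be confirmed by checking the byte length of each result. This is a small illustrative query, and the input string is arbitrary.

```sql
SELECT
    length(SHA1('abc'))   AS sha1_bytes,   -- 20
    length(SHA224('abc')) AS sha224_bytes, -- 28
    length(SHA256('abc')) AS sha256_bytes, -- 32
    length(SHA512('abc')) AS sha512_bytes; -- 64
```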
**Example**
Use the [hex](/docs/en/sql-reference/functions/encoding-functions.md/#hex) function to represent the result as a hex-encoded string.
Query:
```sql
SELECT hex(SHA1('abc'));
```
Result:
```response
┌─hex(SHA1('abc'))─────────────────────────┐
│ A9993E364706816ABA3E25717850C26C9CD0D89D │
└──────────────────────────────────────────┘
```
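To follow the recommendation of pre-calculating SHA values at insert time, one option is a `MATERIALIZED` column. The sketch below uses a hypothetical table `documents` with hypothetical column names. It is one possible layout, not a prescribed one.

```sql
-- `documents`, `body` and `body_sha256` are hypothetical names used for illustration.
CREATE TABLE documents
(
    body String,
    -- Computed once when the row is inserted and stored with the row.
    body_sha256 FixedString(32) MATERIALIZED SHA256(body)
)
ENGINE = MergeTree
ORDER BY tuple();

INSERT INTO documents (body) VALUES ('abc');

-- Reads the stored digest instead of hashing again at query time.
SELECT hex(body_sha256) FROM documents;
```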
## BLAKE3
Calculates the BLAKE3 hash of a string and returns the resulting set of bytes as [FixedString](/docs/en/sql-reference/data-types/fixedstring.md).
**Syntax**
```sql
BLAKE3('s')
```
This cryptographic hash function is integrated into ClickHouse using the BLAKE3 Rust library. The function is rather fast, performing approximately two times faster than SHA-2 while generating hashes of the same length as SHA-256.
**Arguments**
- `s` - Input string for BLAKE3 hash calculation. [String](/docs/en/sql-reference/data-types/string.md).
**Returned value**
- BLAKE3 hash as a byte array with type FixedString(32).
Type: [FixedString](/docs/en/sql-reference/data-types/fixedstring.md).
**Example**
Use the [hex](/docs/en/sql-reference/functions/encoding-functions.md/#hex) function to represent the result as a hex-encoded string.
Query:
```sql
SELECT hex(BLAKE3('ABC'))
```
Result:
```response
┌─hex(BLAKE3('ABC'))───────────────────────────────────────────────┐
│ D1717274597CF0289694F75D96D444B992A096F1AFD8E7BBFA6EBB1D360FEDFC │
└──────────────────────────────────────────────────────────────────┘
```
## URLHash(url\[, N\])
A fast, decent-quality non-cryptographic hash function for a string obtained from a URL using some type of normalization.
`URLHash(s)` Calculates a hash from a string without one of the trailing symbols `/`,`?` or `#` at the end, if present.
`URLHash(s, N)` Calculates a hash from a string up to the N level in the URL hierarchy, without one of the trailing symbols `/`,`?` or `#` at the end, if present.
Levels are the same as in URLHierarchy.
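The normalization can be illustrated with the query below. It is a sketch based on the description above: the URLs are arbitrary examples, and the comparison is expected to return `1` only if trailing-symbol stripping behaves as described. The `URLHierarchy` output is included so the levels can be inspected next to the level-limited hash.

```sql
SELECT
    -- A single trailing '/' should not change the hash.
    URLHash('https://example.com/docs/') = URLHash('https://example.com/docs') AS trailing_slash_ignored,
    -- Show the hierarchy levels alongside a hash limited to level 1.
    URLHierarchy('https://example.com/docs/page') AS levels,
    URLHash('https://example.com/docs/page', 1) AS hash_up_to_level_1;
```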
## farmFingerprint64
## farmHash64
Produces a 64-bit [FarmHash](https://github.com/google/farmhash) or Fingerprint value. `farmFingerprint64` is preferred for a stable and portable value.
```sql
farmFingerprint64(par1, ...)
farmHash64(par1, ...)
```
These functions use the `Fingerprint64` and `Hash64` methods respectively from all [available methods](https://github.com/google/farmhash/blob/master/src/farmhash.h).
**Arguments**
The function takes a variable number of input parameters. Arguments can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md). For some data types, the calculated hash value may be the same for the same values even if the argument types differ (integers of different sizes, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data).
**Returned Value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Example**
```sql
SELECT farmHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS FarmHash, toTypeName(FarmHash) AS type;
```
```response
┌─────────────FarmHash─┬─type───┐
│ 17790458267262532859 │ UInt64 │
└──────────────────────┴────────┘
```
## javaHash {#hash_functions-javahash}
Calculates JavaHash from a [string](http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/478a4add975b/src/share/classes/java/lang/String.java#l1452),
[Byte](https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/478a4add975b/src/share/classes/java/lang/Byte.java#l405),
[Short](https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/478a4add975b/src/share/classes/java/lang/Short.java#l410),
[Integer](https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/478a4add975b/src/share/classes/java/lang/Integer.java#l959),
[Long](https://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/478a4add975b/src/share/classes/java/lang/Long.java#l1060).
This hash function is neither fast nor of good quality. The only reason to use it is when this algorithm is already used in another system and you have to calculate exactly the same result.
Note that Java only supports hashing signed integers, so to hash an unsigned integer you must first cast it to the appropriate signed ClickHouse type.
**Syntax**
```sql
SELECT javaHash('')
```
**Returned value**
An `Int32` data type hash value.
**Example**
Query:
```sql
SELECT javaHash(toInt32(123));
```
Result:
```response
┌─javaHash(toInt32(123))─┐
│ 123 │
└────────────────────────┘
```
Query:
```sql
SELECT javaHash('Hello, world!');
```
Result:
```response
┌─javaHash('Hello, world!')─┐
│ -1880044555 │
└───────────────────────────┘
```
## javaHashUTF16LE
Calculates [JavaHash](http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/478a4add975b/src/share/classes/java/lang/String.java#l1452) from a string, assuming it contains bytes representing a string in UTF-16LE encoding.
**Syntax**
```sql
javaHashUTF16LE(stringUtf16le)
```
**Arguments**
- `stringUtf16le` — a string in UTF-16LE encoding.
**Returned value**
An `Int32` data type hash value.
**Example**
Correct query with UTF-16LE encoded string.
Query:
```sql
SELECT javaHashUTF16LE(convertCharset('test', 'utf-8', 'utf-16le'));
```
Result:
```response
┌─javaHashUTF16LE(convertCharset('test', 'utf-8', 'utf-16le'))─┐
│ 3556498 │
└──────────────────────────────────────────────────────────────┘
```
## hiveHash
Calculates `HiveHash` from a string.
```sql
SELECT hiveHash('')
```
This is just [JavaHash](#hash_functions-javahash) with zeroed out sign bit. This function is used in [Apache Hive](https://en.wikipedia.org/wiki/Apache_Hive) for versions before 3.0. This hash function is neither fast nor of good quality. The only reason to use it is when this algorithm is already used in another system and you have to calculate exactly the same result.
**Returned value**
An `Int32` data type hash value.
**Example**
Query:
```sql
SELECT hiveHash('Hello, world!');
```
Result:
```response
┌─hiveHash('Hello, world!')─┐
│ 267439093 │
└───────────────────────────┘
```
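The relationship between `hiveHash` and `javaHash` can be checked directly. In the sketch below, `bitAnd` with `2147483647` (`0x7FFFFFFF`) clears the sign bit. With the values shown in the examples above, `-1880044555` becomes `267439093`.

```sql
SELECT
    javaHash('Hello, world!') AS java_hash,
    hiveHash('Hello, world!') AS hive_hash,
    -- Clearing the sign bit of the Java hash should reproduce the Hive hash.
    bitAnd(java_hash, 2147483647) = hive_hash AS sign_bit_cleared_matches;
```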
## Entropy-learned hashing (experimental)
Entropy-learned hashing is disabled by default. To enable it, run `SET allow_experimental_hash_functions=1`.

Entropy-learned hashing is not a standalone hash function like `metroHash64`, `cityHash64`, `sipHash64` etc. Instead, it aims to preprocess
the data to be hashed in a way that a standalone hash function can be computed more efficiently while not compromising the hash quality,
i.e. the randomness of the hashes. For that, entropy-learned hashing chooses a subset of the bytes in a training data set of Strings which has
the same randomness (entropy) as the original Strings. For example, if the Strings are on average 100 bytes long, and we pick a subset of 5
bytes, then a hash function will be 95% less expensive to evaluate. For details of the method, refer to [Entropy-Learned Hashing: Constant
Time Hashing with Controllable Uniformity](https://doi.org/10.1145/3514221.3517894).
Entropy-learned hashing has two phases:
1. A training phase on a representative but typically small set of Strings to be hashed. Training consists of two steps:
   - Function `prepareTrainEntropyLearnedHash(data, id)` caches the training data in a global state under a given `id`. It returns dummy
     value `0` on every row.
   - Function `trainEntropyLearnedHash(id)` computes a minimal partial sub-key of the training data stored under `id` in the global
     state. The cached training data in the global state is replaced by the partial key. Dummy value `0` is returned on every row.
2. An evaluation phase where hashes are computed using the previously calculated partial sub-keys. Function `entropyLearnedHash(data, id)`
   hashes `data` using the partial sub-key stored as `id`. CityHash64 is used as hash function.
The reason that the training phase comprises two steps is that ClickHouse processes data at chunk granularity but entropy-learned hashing
needs to process the entire training set at once.
Since functions `prepareTrainEntropyLearnedHash()` and `trainEntropyLearnedHash()` access global state, they should not be called in
parallel with the same `id`.
**Syntax**
``` sql
prepareTrainEntropyLearnedHash(data, id);
trainEntropyLearnedHash(id);
entropyLearnedHash(data, id);
```
**Example**
```sql
SET allow_experimental_hash_functions=1;
CREATE TABLE tab (col String) ENGINE=Memory;
INSERT INTO tab VALUES ('aa'), ('ba'), ('ca');
SELECT prepareTrainEntropyLearnedHash(col, 'id1') AS prepared FROM tab;
SELECT trainEntropyLearnedHash('id1') AS trained FROM tab;
SELECT entropyLearnedHash(col, 'id1') as hashes FROM tab;
```
Result:
``` response
┌─prepared─┐
│ 0 │
│ 0 │
│ 0 │
└──────────┘
┌─trained─┐
│ 0 │
│ 0 │
│ 0 │
└─────────┘
┌───────────────hashes─┐
│ 2603192927274642682 │
│ 4947675599669400333 │
│ 10783339242466472992 │
└──────────────────────┘
```
## metroHash64
Produces a 64-bit [MetroHash](http://www.jandrewrogers.com/2015/05/27/metrohash/) hash value.
```sql
metroHash64(par1, ...)
```
**Arguments**
The function takes a variable number of input parameters. Arguments can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md). For some data types, the calculated hash value may be the same for the same values even if the argument types differ (integers of different sizes, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data).
**Returned Value**
A [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Example**
```sql
SELECT metroHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS MetroHash, toTypeName(MetroHash) AS type;
```
```response
┌────────────MetroHash─┬─type───┐
│ 14235658766382344533 │ UInt64 │
└──────────────────────┴────────┘
```
## jumpConsistentHash
Calculates JumpConsistentHash from a UInt64.
Accepts two arguments: a UInt64-type key and the number of buckets. Returns Int32.
For more information, see the link: [JumpConsistentHash](https://arxiv.org/pdf/1406.2294.pdf)
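A minimal usage sketch: the query assigns a handful of keys to one of 4 buckets. The bucket count and the use of `numbers` as a key source are arbitrary choices for illustration.

```sql
SELECT
    number AS key,
    jumpConsistentHash(number, 4) AS bucket
FROM numbers(5);
```

According to the linked paper, growing the number of buckets from `n` to `n + 1` is expected to move only about `1 / (n + 1)` of the keys. This can be observed by comparing `jumpConsistentHash(key, 4)` with `jumpConsistentHash(key, 5)` over a larger key range.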
## murmurHash2_32, murmurHash2_64
Produces a [MurmurHash2](https://github.com/aappleby/smhasher) hash value.
```sql
murmurHash2_32(par1, ...)
murmurHash2_64(par1, ...)
```
**Arguments**
Both functions take a variable number of input parameters. Arguments can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md). For some data types, the calculated hash value may be the same for the same values even if the argument types differ (integers of different sizes, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data).
**Returned Value**
- The `murmurHash2_32` function returns a [UInt32](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
- The `murmurHash2_64` function returns a [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Example**
```sql
SELECT murmurHash2_64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS MurmurHash2, toTypeName(MurmurHash2) AS type;
```
```response
┌──────────MurmurHash2─┬─type───┐
│ 11832096901709403633 │ UInt64 │
└──────────────────────┴────────┘
```
## gccMurmurHash
Calculates a 64-bit [MurmurHash2](https://github.com/aappleby/smhasher) hash value using the same hash seed as [gcc](https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libstdc%2B%2B-v3/include/bits/functional_hash.h#L191). It is portable between Clang and GCC builds.
**Syntax**
```sql
gccMurmurHash(par1, ...)
```
**Arguments**
- `par1, ...` — A variable number of parameters that can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md/#data_types).
**Returned value**
- Calculated hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT
gccMurmurHash(1, 2, 3) AS res1,
gccMurmurHash(('a', [1, 2, 3], 4, (4, ['foo', 'bar'], 1, (1, 2)))) AS res2
```
Result:
```response
┌─────────────────res1─┬────────────────res2─┐
│ 12384823029245979431 │ 1188926775431157506 │
└──────────────────────┴─────────────────────┘
```
## kafkaMurmurHash
Calculates a 32-bit [MurmurHash2](https://github.com/aappleby/smhasher) hash value using the same hash seed as [Kafka](https://github.com/apache/kafka/blob/461c5cfe056db0951d9b74f5adc45973670404d7/clients/src/main/java/org/apache/kafka/common/utils/Utils.java#L482) and without the highest bit to be compatible with [Default Partitioner](https://github.com/apache/kafka/blob/139f7709bd3f5926901a21e55043388728ccca78/clients/src/main/java/org/apache/kafka/clients/producer/internals/BuiltInPartitioner.java#L328).
**Syntax**
```sql
kafkaMurmurHash(par1, ...)
```
**Arguments**
- `par1, ...` — A variable number of parameters that can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md/#data_types).
**Returned value**
- Calculated hash value.
Type: [UInt32](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT
kafkaMurmurHash('foobar') AS res1,
kafkaMurmurHash(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS res2
```
Result:
```response
┌───────res1─┬─────res2─┐
│ 1357151166 │ 85479775 │
└────────────┴──────────┘
```
## murmurHash3_32, murmurHash3_64
Produces a [MurmurHash3](https://github.com/aappleby/smhasher) hash value.
```sql
murmurHash3_32(par1, ...)
murmurHash3_64(par1, ...)
```
**Arguments**
Both functions take a variable number of input parameters. Arguments can be any of the [supported data types](/docs/en/sql-reference/data-types/index.md). For some data types, the calculated hash value may be the same for the same values even if the argument types differ (integers of different sizes, named and unnamed `Tuple` with the same data, `Map` and the corresponding `Array(Tuple(key, value))` type with the same data).
**Returned Value**
- The `murmurHash3_32` function returns a [UInt32](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
- The `murmurHash3_64` function returns a [UInt64](/docs/en/sql-reference/data-types/int-uint.md) data type hash value.
**Example**
```sql
SELECT murmurHash3_32(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS MurmurHash3, toTypeName(MurmurHash3) AS type;
```
```response
┌─MurmurHash3─┬─type───┐
│ 2152717 │ UInt32 │
└─────────────┴────────┘
```
## murmurHash3_128
Produces a 128-bit [MurmurHash3](https://github.com/aappleby/smhasher) hash value.
**Syntax**
```sql
murmurHash3_128(expr)
```
**Arguments**
- `expr` — A list of [expressions](/docs/en/sql-reference/syntax.md/#syntax-expressions). [String](/docs/en/sql-reference/data-types/string.md).
**Returned value**
A 128-bit `MurmurHash3` hash value.
Type: [FixedString(16)](/docs/en/sql-reference/data-types/fixedstring.md).
**Example**
Query:
```sql
SELECT hex(murmurHash3_128('foo', 'foo', 'foo'));
```
Result:
```response
┌─hex(murmurHash3_128('foo', 'foo', 'foo'))─┐
│ F8F7AD9B6CD4CF117A71E277E2EC2931 │
└───────────────────────────────────────────┘
```
## xxh3
Produces a 64-bit [xxh3](https://github.com/Cyan4973/xxHash) hash value.
**Syntax**
```sql
xxh3(expr)
```
**Arguments**
- `expr` — A list of [expressions](/docs/en/sql-reference/syntax.md/#syntax-expressions) of any data type.
**Returned value**
A 64-bit `xxh3` hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT xxh3('Hello', 'world')
```
Result:
```response
┌─xxh3('Hello', 'world')─┐
│ 5607458076371731292 │
└────────────────────────┘
```
## xxHash32, xxHash64
Calculates `xxHash` from a string. It comes in two variants: 32-bit and 64-bit.

```sql
SELECT xxHash32('');
-- or
SELECT xxHash64('');
```
**Returned value**
A `UInt32` or `UInt64` data type hash value.
Type: `UInt32` for `xxHash32` and `UInt64` for `xxHash64`.
**Example**
Query:
```sql
SELECT xxHash32('Hello, world!');
```
Result:
```response
┌─xxHash32('Hello, world!')─┐
│ 834093149 │
└───────────────────────────┘
```
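The example above covers only the 32-bit variant. As a minimal sketch, the 64-bit variant can be checked the same way; the column aliases `hash` and `type` are illustrative, and no output is shown because it was not verified here:

```sql
-- Compute the 64-bit variant and confirm the returned data type.
SELECT xxHash64('Hello, world!') AS hash, toTypeName(hash) AS type;
```

The `type` column should report `UInt64`, in line with the returned value described above.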
**See Also**
- [xxHash](http://cyan4973.github.io/xxHash/).
## ngramSimHash

Splits an ASCII string into n-grams of `ngramsize` symbols and returns the n-gram `simhash`. The function is case-sensitive.

Can be used to detect semi-duplicate strings with [bitHammingDistance](/docs/en/sql-reference/functions/bit-functions.md/#bithammingdistance). The smaller the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the calculated `simhashes` of two strings, the more likely the strings are the same.

**Syntax**
```sql
ngramSimHash(string[, ngramsize])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT ngramSimHash('ClickHouse') AS Hash;
```
Result:
```response
┌───────Hash─┐
│ 1627567969 │
└────────────┘
```
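To illustrate the semi-duplicate detection described above, the following sketch compares the `simhashes` of two similar strings with `bitHammingDistance`. The input strings and the `distance` alias are made up for illustration, and the output is omitted because it was not verified here:

```sql
-- A small distance suggests that the two strings are near-duplicates.
SELECT bitHammingDistance(
    ngramSimHash('ClickHouse is fast'),
    ngramSimHash('ClickHouse is very fast')
) AS distance;
```

What counts as a "small" distance depends on the data, so any threshold should be chosen empirically.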
## ngramSimHashCaseInsensitive

Splits an ASCII string into n-grams of `ngramsize` symbols and returns the n-gram `simhash`. The function is case-insensitive.

Can be used to detect semi-duplicate strings with [bitHammingDistance](/docs/en/sql-reference/functions/bit-functions.md/#bithammingdistance). The smaller the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the calculated `simhashes` of two strings, the more likely the strings are the same.

**Syntax**
```sql
ngramSimHashCaseInsensitive(string[, ngramsize])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT ngramSimHashCaseInsensitive('ClickHouse') AS Hash;
```
Result:
```response
┌──────Hash─┐
│ 562180645 │
└───────────┘
```
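The difference from the case-sensitive variant can be seen by hashing the same ASCII text in two letter cases. This is a sketch with illustrative aliases; the output is not shown because it was not verified here:

```sql
-- Compare hashes of the same text written in different letter cases.
SELECT
    ngramSimHash('ClickHouse') = ngramSimHash('CLICKHOUSE') AS same_case_sensitive,
    ngramSimHashCaseInsensitive('ClickHouse') = ngramSimHashCaseInsensitive('CLICKHOUSE') AS same_case_insensitive;
```

Since the function ignores case, `same_case_insensitive` is expected to be `1`, while the case-sensitive comparison will most likely yield `0`.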
## ngramSimHashUTF8

Splits a UTF-8 string into n-grams of `ngramsize` symbols and returns the n-gram `simhash`. The function is case-sensitive.

Can be used to detect semi-duplicate strings with [bitHammingDistance](/docs/en/sql-reference/functions/bit-functions.md/#bithammingdistance). The smaller the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the calculated `simhashes` of two strings, the more likely the strings are the same.

**Syntax**
```sql
ngramSimHashUTF8(string[, ngramsize])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT ngramSimHashUTF8('ClickHouse') AS Hash;
```
Result:
```response
┌───────Hash─┐
│ 1628157797 │
└────────────┘
```
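The UTF-8 variant matters for text with multi-byte characters. The sketch below hashes the same non-ASCII string with both variants and an explicit `ngramsize` of `2`; the input string and aliases are illustrative, and it is assumed that the ASCII variant accepts such input and simply does not treat it as UTF-8:

```sql
-- Hash a string containing multi-byte characters with both variants.
SELECT
    ngramSimHash('naïve café', 2) AS ascii_hash,
    ngramSimHashUTF8('naïve café', 2) AS utf8_hash;
```

The two values may differ, because the n-grams are formed differently when multi-byte characters are involved.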
## ngramSimHashCaseInsensitiveUTF8

Splits a UTF-8 string into n-grams of `ngramsize` symbols and returns the n-gram `simhash`. The function is case-insensitive.

Can be used to detect semi-duplicate strings with [bitHammingDistance](/docs/en/sql-reference/functions/bit-functions.md/#bithammingdistance). The smaller the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the calculated `simhashes` of two strings, the more likely the strings are the same.

**Syntax**
```sql
ngramSimHashCaseInsensitiveUTF8(string[, ngramsize])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT ngramSimHashCaseInsensitiveUTF8('ClickHouse') AS Hash;
```
Result:
```response
┌───────Hash─┐
│ 1636742693 │
└────────────┘
```
## wordShingleSimHash

Splits an ASCII string into parts (shingles) of `shinglesize` words and returns the word shingle `simhash`. The function is case-sensitive.

Can be used to detect semi-duplicate strings with [bitHammingDistance](/docs/en/sql-reference/functions/bit-functions.md/#bithammingdistance). The smaller the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the calculated `simhashes` of two strings, the more likely the strings are the same.

**Syntax**
```sql
wordShingleSimHash(string[, shinglesize])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT wordShingleSimHash('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Hash;
```
Result:
```response
┌───────Hash─┐
│ 2328277067 │
└────────────┘
```
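For longer texts, word shingles are often a more natural unit than character n-grams. As a sketch, the query below compares two sentences that differ by one word, using two-word shingles; the sentences and the `distance` alias are illustrative, and the output is omitted because it was not verified here:

```sql
-- Compare the word shingle simhashes of two nearly identical sentences.
SELECT bitHammingDistance(
    wordShingleSimHash('ClickHouse is a column-oriented database management system', 2),
    wordShingleSimHash('ClickHouse is a fast column-oriented database management system', 2)
) AS distance;
```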
## wordShingleSimHashCaseInsensitive

Splits an ASCII string into parts (shingles) of `shinglesize` words and returns the word shingle `simhash`. The function is case-insensitive.

Can be used to detect semi-duplicate strings with [bitHammingDistance](/docs/en/sql-reference/functions/bit-functions.md/#bithammingdistance). The smaller the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the calculated `simhashes` of two strings, the more likely the strings are the same.

**Syntax**
```sql
wordShingleSimHashCaseInsensitive(string[, shinglesize])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT wordShingleSimHashCaseInsensitive('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Hash;
```
Result:
```response
┌───────Hash─┐
│ 2194812424 │
└────────────┘
```
## wordShingleSimHashUTF8

Splits a UTF-8 string into parts (shingles) of `shinglesize` words and returns the word shingle `simhash`. The function is case-sensitive.

Can be used to detect semi-duplicate strings with [bitHammingDistance](/docs/en/sql-reference/functions/bit-functions.md/#bithammingdistance). The smaller the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the calculated `simhashes` of two strings, the more likely the strings are the same.

**Syntax**
```sql
wordShingleSimHashUTF8(string[, shinglesize])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT wordShingleSimHashUTF8('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Hash;
```
Result:
```response
┌───────Hash─┐
│ 2328277067 │
└────────────┘
```
## wordShingleSimHashCaseInsensitiveUTF8

Splits a UTF-8 string into parts (shingles) of `shinglesize` words and returns the word shingle `simhash`. The function is case-insensitive.

Can be used to detect semi-duplicate strings with [bitHammingDistance](/docs/en/sql-reference/functions/bit-functions.md/#bithammingdistance). The smaller the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) between the calculated `simhashes` of two strings, the more likely the strings are the same.

**Syntax**
```sql
wordShingleSimHashCaseInsensitiveUTF8(string[, shinglesize])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Hash value.
Type: [UInt64](/docs/en/sql-reference/data-types/int-uint.md).
**Example**
Query:
```sql
SELECT wordShingleSimHashCaseInsensitiveUTF8('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Hash;
```
Result:
```response
┌───────Hash─┐
│ 2194812424 │
└────────────┘
```
## ngramMinHash

Splits an ASCII string into n-grams of `ngramsize` symbols and calculates hash values for each n-gram. Uses `hashnum` minimum hashes to calculate the minimum hash and `hashnum` maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. The function is case-sensitive.

Can be used to detect semi-duplicate strings with [tupleHammingDistance](/docs/en/sql-reference/functions/tuple-functions.md/#tuplehammingdistance). If either of the returned hashes is the same for two strings, the strings are considered the same.

**Syntax**
```sql
ngramMinHash(string[, ngramsize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two hashes — the minimum and the maximum.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([UInt64](/docs/en/sql-reference/data-types/int-uint.md), [UInt64](/docs/en/sql-reference/data-types/int-uint.md)).
**Example**
Query:
```sql
SELECT ngramMinHash('ClickHouse') AS Tuple;
```
Result:
```response
┌─Tuple──────────────────────────────────────┐
│ (18333312859352735453,9054248444481805918) │
└────────────────────────────────────────────┘
```
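The comparison with `tupleHammingDistance` mentioned above can be sketched as follows. The second input string and the `distance` alias are illustrative, and the output is omitted because it was not verified here:

```sql
-- Count how many of the two returned hashes (minimum and maximum) differ.
SELECT tupleHammingDistance(
    ngramMinHash('ClickHouse'),
    ngramMinHash('ClickHouze')
) AS distance;
```

Because the returned tuple has two elements, the distance is `0`, `1`, or `2`. A value below `2` means that at least one of the hashes matched, which is the similarity criterion described above.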
## ngramMinHashCaseInsensitive

Splits an ASCII string into n-grams of `ngramsize` symbols and calculates hash values for each n-gram. Uses `hashnum` minimum hashes to calculate the minimum hash and `hashnum` maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. The function is case-insensitive.

Can be used to detect semi-duplicate strings with [tupleHammingDistance](/docs/en/sql-reference/functions/tuple-functions.md/#tuplehammingdistance). If either of the returned hashes is the same for two strings, the strings are considered the same.

**Syntax**
```sql
ngramMinHashCaseInsensitive(string[, ngramsize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two hashes — the minimum and the maximum.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([UInt64](/docs/en/sql-reference/data-types/int-uint.md), [UInt64](/docs/en/sql-reference/data-types/int-uint.md)).
**Example**
Query:
```sql
SELECT ngramMinHashCaseInsensitive('ClickHouse') AS Tuple;
```
Result:
```response
┌─Tuple──────────────────────────────────────┐
│ (2106263556442004574,13203602793651726206) │
└────────────────────────────────────────────┘
```
## ngramMinHashUTF8

Splits a UTF-8 string into n-grams of `ngramsize` symbols and calculates hash values for each n-gram. Uses `hashnum` minimum hashes to calculate the minimum hash and `hashnum` maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. The function is case-sensitive.

Can be used to detect semi-duplicate strings with [tupleHammingDistance](/docs/en/sql-reference/functions/tuple-functions.md/#tuplehammingdistance). If either of the returned hashes is the same for two strings, the strings are considered the same.

**Syntax**
```sql
ngramMinHashUTF8(string[, ngramsize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two hashes — the minimum and the maximum.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([UInt64](/docs/en/sql-reference/data-types/int-uint.md), [UInt64](/docs/en/sql-reference/data-types/int-uint.md)).
**Example**
Query:
```sql
SELECT ngramMinHashUTF8('ClickHouse') AS Tuple;
```
Result:
```response
┌─Tuple──────────────────────────────────────┐
│ (18333312859352735453,6742163577938632877) │
└────────────────────────────────────────────┘
```
## ngramMinHashCaseInsensitiveUTF8

Splits a UTF-8 string into n-grams of `ngramsize` symbols and calculates hash values for each n-gram. Uses `hashnum` minimum hashes to calculate the minimum hash and `hashnum` maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. The function is case-insensitive.

Can be used to detect semi-duplicate strings with [tupleHammingDistance](/docs/en/sql-reference/functions/tuple-functions.md/#tuplehammingdistance). If either of the returned hashes is the same for two strings, the strings are considered the same.

**Syntax**
```sql
ngramMinHashCaseInsensitiveUTF8(string[, ngramsize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two hashes — the minimum and the maximum.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([UInt64](/docs/en/sql-reference/data-types/int-uint.md), [UInt64](/docs/en/sql-reference/data-types/int-uint.md)).
**Example**
Query:
```sql
SELECT ngramMinHashCaseInsensitiveUTF8('ClickHouse') AS Tuple;
```
Result:
```response
┌─Tuple───────────────────────────────────────┐
│ (12493625717655877135,13203602793651726206) │
└─────────────────────────────────────────────┘
```
## ngramMinHashArg

Splits an ASCII string into n-grams of `ngramsize` symbols and returns the n-grams with minimum and maximum hashes, calculated by the [ngramMinHash](#ngramminhash) function with the same input. The function is case-sensitive.

**Syntax**
```sql
ngramMinHashArg(string[, ngramsize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two tuples with `hashnum` n-grams each.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md)), [Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md))).
**Example**
Query:
```sql
SELECT ngramMinHashArg('ClickHouse') AS Tuple;
```
Result:
```response
┌─Tuple─────────────────────────────────────────────────────────────────────────┐
│ (('ous','ick','lic','Hou','kHo','use'),('Hou','lic','ick','ous','ckH','Cli')) │
└───────────────────────────────────────────────────────────────────────────────┘
```
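Because this function reports the n-grams behind the hashes computed by `ngramMinHash`, it can help to run both with identical arguments. The following sketch uses an `ngramsize` of `4` and a `hashnum` of `2`; these values and the aliases are chosen only for illustration, and the output is omitted because it was not verified here:

```sql
-- Show the hashes next to the n-grams they were derived from.
SELECT
    ngramMinHash('ClickHouse', 4, 2) AS hashes,
    ngramMinHashArg('ClickHouse', 4, 2) AS ngrams;
```

With these arguments, each inner tuple in `ngrams` should contain two four-character n-grams.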
## ngramMinHashArgCaseInsensitive

Splits an ASCII string into n-grams of `ngramsize` symbols and returns the n-grams with minimum and maximum hashes, calculated by the [ngramMinHashCaseInsensitive](#ngramminhashcaseinsensitive) function with the same input. The function is case-insensitive.

**Syntax**
```sql
ngramMinHashArgCaseInsensitive(string[, ngramsize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two tuples with `hashnum` n-grams each.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md)), [Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md))).
**Example**
Query:
```sql
SELECT ngramMinHashArgCaseInsensitive('ClickHouse') AS Tuple;
```
Result:
```response
┌─Tuple─────────────────────────────────────────────────────────────────────────┐
│ (('ous','ick','lic','kHo','use','Cli'),('kHo','lic','ick','ous','ckH','Hou')) │
└───────────────────────────────────────────────────────────────────────────────┘
```
## ngramMinHashArgUTF8

Splits a UTF-8 string into n-grams of `ngramsize` symbols and returns the n-grams with minimum and maximum hashes, calculated by the [ngramMinHashUTF8](#ngramminhashutf8) function with the same input. The function is case-sensitive.

**Syntax**
```sql
ngramMinHashArgUTF8(string[, ngramsize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two tuples with `hashnum` n-grams each.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md)), [Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md))).
**Example**
Query:
```sql
SELECT ngramMinHashArgUTF8('ClickHouse') AS Tuple;
```
Result:
```response
┌─Tuple─────────────────────────────────────────────────────────────────────────┐
│ (('ous','ick','lic','Hou','kHo','use'),('kHo','Hou','lic','ick','ous','ckH')) │
└───────────────────────────────────────────────────────────────────────────────┘
```
## ngramMinHashArgCaseInsensitiveUTF8

Splits a UTF-8 string into n-grams of `ngramsize` symbols and returns the n-grams with minimum and maximum hashes, calculated by the [ngramMinHashCaseInsensitiveUTF8](#ngramminhashcaseinsensitiveutf8) function with the same input. The function is case-insensitive.

**Syntax**
```sql
ngramMinHashArgCaseInsensitiveUTF8(string[, ngramsize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `ngramsize` — The size of an n-gram. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two tuples with `hashnum` n-grams each.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md)), [Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md))).
**Example**
Query:
```sql
SELECT ngramMinHashArgCaseInsensitiveUTF8('ClickHouse') AS Tuple;
```
Result:
```response
┌─Tuple─────────────────────────────────────────────────────────────────────────┐
│ (('ckH','ous','ick','lic','kHo','use'),('kHo','lic','ick','ous','ckH','Hou')) │
└───────────────────────────────────────────────────────────────────────────────┘
```
## wordShingleMinHash

Splits an ASCII string into parts (shingles) of `shinglesize` words and calculates hash values for each word shingle. Uses `hashnum` minimum hashes to calculate the minimum hash and `hashnum` maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. The function is case-sensitive.

Can be used to detect semi-duplicate strings with [tupleHammingDistance](/docs/en/sql-reference/functions/tuple-functions.md/#tuplehammingdistance). If either of the returned hashes is the same for two strings, the strings are considered the same.

**Syntax**
```sql
wordShingleMinHash(string[, shinglesize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two hashes — the minimum and the maximum.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([UInt64](/docs/en/sql-reference/data-types/int-uint.md), [UInt64](/docs/en/sql-reference/data-types/int-uint.md)).
**Example**
Query:
```sql
SELECT wordShingleMinHash('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Tuple;
```
Result:
```response
┌─Tuple──────────────────────────────────────┐
│ (16452112859864147620,5844417301642981317) │
└────────────────────────────────────────────┘
```
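The similarity criterion described above can be turned into a boolean flag. In this sketch, the column names `a` and `b`, the alias `probably_similar`, the sample sentences, and the argument values `2` and `3` are all assumptions made for illustration; the output is omitted because it was not verified here:

```sql
-- Flag a pair of texts as similar when at least one of the two hashes matches.
SELECT
    tupleHammingDistance(
        wordShingleMinHash(a, 2, 3),
        wordShingleMinHash(b, 2, 3)
    ) < 2 AS probably_similar
FROM
(
    SELECT
        'ClickHouse is a column-oriented database management system for online analytical processing' AS a,
        'ClickHouse is a column-oriented database management system for analytical processing' AS b
);
```

In practice, `a` and `b` would typically be columns of a table, while `shinglesize` and `hashnum` stay constant across the compared values.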
## wordShingleMinHashCaseInsensitive

Splits an ASCII string into parts (shingles) of `shinglesize` words and calculates hash values for each word shingle. Uses `hashnum` minimum hashes to calculate the minimum hash and `hashnum` maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. The function is case-insensitive.

Can be used to detect semi-duplicate strings with [tupleHammingDistance](/docs/en/sql-reference/functions/tuple-functions.md/#tuplehammingdistance). If either of the returned hashes is the same for two strings, the strings are considered the same.

**Syntax**
```sql
wordShingleMinHashCaseInsensitive(string[, shinglesize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two hashes — the minimum and the maximum.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([UInt64](/docs/en/sql-reference/data-types/int-uint.md), [UInt64](/docs/en/sql-reference/data-types/int-uint.md)).
**Example**
Query:
```sql
SELECT wordShingleMinHashCaseInsensitive('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Tuple;
```
Result:
```response
┌─Tuple─────────────────────────────────────┐
│ (3065874883688416519,1634050779997673240) │
└───────────────────────────────────────────┘
```
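Hashes produced by the case-sensitive and case-insensitive variants are not meant to be compared with each other. As a sketch, the query below applies both variants to the same string (the alias `string` and the sample sentence are illustrative); the output is omitted because it was not verified here:

```sql
-- Compare the hashes produced by the two variants for one and the same input.
SELECT tupleHammingDistance(
    wordShingleMinHash(string),
    wordShingleMinHashCaseInsensitive(string)
) AS distance
FROM
(
    SELECT 'ClickHouse is a column-oriented database management system for online analytical processing of queries.' AS string
);
```

A non-zero distance here reflects only the difference between the variants, not a difference between texts, so both sides of a comparison should use the same function.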
## wordShingleMinHashUTF8

Splits a UTF-8 string into parts (shingles) of `shinglesize` words and calculates hash values for each word shingle. Uses `hashnum` minimum hashes to calculate the minimum hash and `hashnum` maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. The function is case-sensitive.

Can be used to detect semi-duplicate strings with [tupleHammingDistance](/docs/en/sql-reference/functions/tuple-functions.md/#tuplehammingdistance). If either of the returned hashes is the same for two strings, the strings are considered the same.

**Syntax**
```sql
wordShingleMinHashUTF8(string[, shinglesize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two hashes — the minimum and the maximum.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([UInt64](/docs/en/sql-reference/data-types/int-uint.md), [UInt64](/docs/en/sql-reference/data-types/int-uint.md)).
**Example**
Query:
```sql
SELECT wordShingleMinHashUTF8('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Tuple;
```
Result:
```response
┌─Tuple──────────────────────────────────────┐
│ (16452112859864147620,5844417301642981317) │
└────────────────────────────────────────────┘
```
## wordShingleMinHashCaseInsensitiveUTF8

Splits a UTF-8 string into parts (shingles) of `shinglesize` words and calculates hash values for each word shingle. Uses `hashnum` minimum hashes to calculate the minimum hash and `hashnum` maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. The function is case-insensitive.

Can be used to detect semi-duplicate strings with [tupleHammingDistance](/docs/en/sql-reference/functions/tuple-functions.md/#tuplehammingdistance). If either of the returned hashes is the same for two strings, the strings are considered the same.

**Syntax**
```sql
wordShingleMinHashCaseInsensitiveUTF8(string[, shinglesize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two hashes — the minimum and the maximum.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([UInt64](/docs/en/sql-reference/data-types/int-uint.md), [UInt64](/docs/en/sql-reference/data-types/int-uint.md)).
**Example**
Query:
```sql
SELECT wordShingleMinHashCaseInsensitiveUTF8('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Tuple;
```
Result:
```response
┌─Tuple─────────────────────────────────────┐
│ (3065874883688416519,1634050779997673240) │
└───────────────────────────────────────────┘
```
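
The semi-duplicate check described above can be sketched as follows. This is an illustrative query, not part of the original reference: the two input strings are made up and differ only in letter case, so with the case-insensitive function both hash tuples are expected to match and the distance should come out as `0`.

```sql
-- Compare the (min, max) hash tuples of two strings that differ only in case.
-- tupleHammingDistance counts the positions at which the two tuples differ.
SELECT tupleHammingDistance(
    wordShingleMinHashCaseInsensitiveUTF8('ClickHouse is a column-oriented database management system for online analytical processing'),
    wordShingleMinHashCaseInsensitiveUTF8('CLICKHOUSE IS A COLUMN-ORIENTED DATABASE MANAGEMENT SYSTEM FOR ONLINE ANALYTICAL PROCESSING')
) AS distance;
```

A distance below `2` means at least one of the two hashes matches, which is the condition under which the strings are treated as the same.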
## wordShingleMinHashArg
Splits an ASCII string into parts (shingles) of `shinglesize` words each and returns the shingles with minimum and maximum word hashes, calculated by the [wordShingleMinHash](#wordshingleminhash) function with the same input. Is case sensitive.
**Syntax**
```sql
wordShingleMinHashArg(string[, shinglesize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two tuples with `hashnum` word shingles each.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md)), [Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md))).
**Example**
Query:
```sql
SELECT wordShingleMinHashArg('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS Tuple;
```
Result:
```response
┌─Tuple─────────────────────────────────────────────────────────────────┐
│ (('OLAP','database','analytical'),('online','oriented','processing')) │
└───────────────────────────────────────────────────────────────────────┘
```
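
To see how the returned shingles relate to the hashes, it may help to call `wordShingleMinHash` and `wordShingleMinHashArg` side by side with identical arguments. The query below is a sketch under that assumption; the aliases `text`, `hashes`, and `shingles` are chosen here for illustration only.

```sql
-- Same input, same shinglesize (1) and hashnum (3) for both functions:
-- `hashes` holds the (min, max) hash pair, `shingles` the word shingles behind it.
WITH 'ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).' AS text
SELECT
    wordShingleMinHash(text, 1, 3) AS hashes,
    wordShingleMinHashArg(text, 1, 3) AS shingles;
```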
## wordShingleMinHashArgCaseInsensitive
Splits an ASCII string into parts (shingles) of `shinglesize` words each and returns the shingles with minimum and maximum word hashes, calculated by the [wordShingleMinHashCaseInsensitive](#wordshingleminhashcaseinsensitive) function with the same input. Is case insensitive.
**Syntax**
```sql
wordShingleMinHashArgCaseInsensitive(string[, shinglesize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two tuples with `hashnum` word shingles each.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md)), [Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md))).
**Example**
Query:
```sql
SELECT wordShingleMinHashArgCaseInsensitive('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS Tuple;
```
Result:
```response
┌─Tuple──────────────────────────────────────────────────────────────────┐
│ (('queries','database','analytical'),('oriented','processing','DBMS')) │
└────────────────────────────────────────────────────────────────────────┘
```
## wordShingleMinHashArgUTF8
Splits a UTF-8 string into parts (shingles) of `shinglesize` words each and returns the shingles with minimum and maximum word hashes, calculated by the [wordShingleMinHashUTF8](#wordshingleminhashutf8) function with the same input. Is case sensitive.
**Syntax**
```sql
wordShingleMinHashArgUTF8(string[, shinglesize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two tuples with `hashnum` word shingles each.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md)), [Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md))).
**Example**
Query:
```sql
SELECT wordShingleMinHashArgUTF8('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS Tuple;
```
Result:
```response
┌─Tuple─────────────────────────────────────────────────────────────────┐
│ (('OLAP','database','analytical'),('online','oriented','processing')) │
└───────────────────────────────────────────────────────────────────────┘
```
## wordShingleMinHashArgCaseInsensitiveUTF8
Splits a UTF-8 string into parts (shingles) of `shinglesize` words each and returns the shingles with minimum and maximum word hashes, calculated by the [wordShingleMinHashCaseInsensitiveUTF8](#wordshingleminhashcaseinsensitiveutf8) function with the same input. Is case insensitive.
**Syntax**
```sql
wordShingleMinHashArgCaseInsensitiveUTF8(string[, shinglesize, hashnum])
```
**Arguments**
- `string` — String. [String](/docs/en/sql-reference/data-types/string.md).
- `shinglesize` — The size of a word shingle. Optional. Possible values: any number from `1` to `25`. Default value: `3`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
- `hashnum` — The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from `1` to `25`. Default value: `6`. [UInt8](/docs/en/sql-reference/data-types/int-uint.md).
**Returned value**
- Tuple with two tuples with `hashnum` word shingles each.
Type: [Tuple](/docs/en/sql-reference/data-types/tuple.md)([Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md)), [Tuple](/docs/en/sql-reference/data-types/tuple.md)([String](/docs/en/sql-reference/data-types/string.md))).
**Example**
Query:
```sql
SELECT wordShingleMinHashArgCaseInsensitiveUTF8('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS Tuple;
```
Result:
```response
┌─Tuple──────────────────────────────────────────────────────────────────┐
│ (('queries','database','analytical'),('oriented','processing','DBMS')) │
└────────────────────────────────────────────────────────────────────────┘
```
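
Because the result is a tuple of two tuples, the minimum and maximum shingle sets can be pulled apart when they need to be handled separately. The following is a minimal sketch using `tupleElement`; the aliases `t`, `min_shingles`, and `max_shingles` are illustrative names, not part of the function's interface.

```sql
-- Split the nested result into its two inner tuples.
-- Element 1 holds the shingles with minimum hashes, element 2 those with maximum hashes.
WITH wordShingleMinHashArgCaseInsensitiveUTF8('ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS t
SELECT
    tupleElement(t, 1) AS min_shingles,
    tupleElement(t, 2) AS max_shingles;
```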