---
slug: /en/sql-reference/functions/nlp-functions
sidebar_position: 130
sidebar_label: NLP (experimental)
---

# Natural Language Processing (NLP) Functions

:::warning
This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.
:::

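The setting named in the warning can be enabled for a single session. A minimal sketch, assuming a session-level `SET` is acceptable in your deployment (the setting name comes from the warning; the sample query is only an illustration):

```sql
-- Enable the experimental NLP functions for the current session.
SET allow_experimental_nlp_functions = 1;

-- Functions on this page can then be called, for example:
SELECT detectLanguage('Ich bleibe für ein paar Tage.');
```
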
## detectCharset

The `detectCharset` function detects the character set of a non-UTF8-encoded input string.

*Syntax*

``` sql
detectCharset('text_to_be_analyzed')
```

*Arguments*

- `text_to_be_analyzed` — The text to analyze, such as one or more sentences. [String](../data-types/string.md#string).

*Returned value*

- A `String` containing the code of the detected character set.

*Examples*

Query:

```sql
SELECT detectCharset('Ich bleibe für ein paar Tage.');
```

Result:

```response
┌─detectCharset('Ich bleibe für ein paar Tage.')─┐
│ WINDOWS-1252                                   │
└────────────────────────────────────────────────┘
```

## detectLanguage

Detects the language of the UTF8-encoded input string. The function uses the [CLD2 library](https://github.com/CLD2Owners/cld2) for detection and returns the 2-letter ISO language code.

The `detectLanguage` function works best when the input string contains more than 200 characters.

*Syntax*

``` sql
detectLanguage('text_to_be_analyzed')
```

*Arguments*

- `text_to_be_analyzed` — The text to analyze, such as one or more sentences. [String](../data-types/string.md#string).

*Returned value*

- The 2-letter ISO code of the detected language.

Other possible results:

- `un` = unknown, no language could be detected.
- `other` = the detected language does not have a 2-letter code.

*Examples*

Query:

```sql
SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there’s a will, there’s a way.');
```

Result:

```response
fr
```

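To label several strings at once, `detectLanguage` can be applied to each row of a query. The sketch below is an illustration rather than part of the reference: the sample strings are reused from this page, `arrayJoin` only turns the array into rows, and `lengthUTF8` is included because inputs well under 200 characters may be detected less reliably. It assumes `allow_experimental_nlp_functions = 1` is already set.

```sql
SELECT
    arrayJoin([
        'Je pense que je ne parviendrai jamais à parler français comme un natif.',
        'Ich bleibe für ein paar Tage.'
    ]) AS body,                    -- one row per sample string
    lengthUTF8(body) AS chars,     -- input length in characters
    detectLanguage(body) AS lang;  -- 2-letter code, 'un', or 'other'
```
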
## detectLanguageMixed

Similar to the `detectLanguage` function, but `detectLanguageMixed` returns a `Map` of 2-letter language codes, each mapped to the proportion of the text detected as that language.

*Syntax*

``` sql
detectLanguageMixed('text_to_be_analyzed')
```

*Arguments*

- `text_to_be_analyzed` — The text to analyze, such as one or more sentences. [String](../data-types/string.md#string).

*Returned value*

- `Map(String, Float32)`: The keys are 2-letter ISO codes and the values are the proportion of the text found for that language.

*Examples*

Query:

```sql
SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.');
```

Result:

```response
┌─detectLanguageMixed()─┐
│ {'ja':0.62,'fr':0.36} │
└───────────────────────┘
```

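Because the result is a `Map`, individual languages can be read from it with ordinary map operations. A minimal sketch, assuming `allow_experimental_nlp_functions = 1` is set; `mapKeys` and the `[]` subscript are general ClickHouse map features, and the alias names are made up for illustration:

```sql
WITH detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.') AS langs
SELECT
    mapKeys(langs) AS detected_codes, -- all language codes present in the map
    langs['fr'] AS french_share;      -- value stored under the 'fr' key
```
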
## detectProgrammingLanguage

Determines the programming language of a piece of source code. The function computes all unigrams and bigrams of commands in the source code. Then, using a marked-up dictionary that holds weights of unigrams and bigrams of commands for various programming languages, it finds the programming language with the greatest weight and returns it.

*Syntax*

``` sql
detectProgrammingLanguage('source_code')
```

*Arguments*

- `source_code` — String representation of the source code to analyze. [String](../data-types/string.md#string).

*Returned value*

- Programming language. [String](../data-types/string.md).

*Examples*

Query:

```sql
SELECT detectProgrammingLanguage('#include <iostream>');
```

Result:

```response
┌─detectProgrammingLanguage('#include <iostream>')─┐
│ C++                                              │
└──────────────────────────────────────────────────┘
```

## detectLanguageUnknown

Similar to the `detectLanguage` function, except the `detectLanguageUnknown` function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.

*Syntax*

``` sql
detectLanguageUnknown('text_to_be_analyzed')
```

*Arguments*

- `text_to_be_analyzed` — The text to analyze, such as one or more sentences. [String](../data-types/string.md#string).

*Returned value*

- The 2-letter ISO code of the detected language.

Other possible results:

- `un` = unknown, no language could be detected.
- `other` = the detected language does not have a 2-letter code.

*Examples*

Query:

```sql
SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.');
```

Result:

```response
┌─detectLanguageUnknown('Ich bleibe für ein paar Tage.')─┐
│ de                                                     │
└────────────────────────────────────────────────────────┘
```

## detectTonality

Determines the sentiment of text data. Uses a marked-up sentiment dictionary, in which each word has a tonality ranging from `-12` to `6`.
For each text, it calculates the average sentiment value of its words and returns it in the range `[-1,1]`.

:::note
This function is limited in its current form: it uses the embedded emotional dictionary at `/contrib/nlp-data/tonality_ru.zst` and works only for the Russian language.
:::

*Syntax*

``` sql
detectTonality(text)
```

*Arguments*

- `text` — The text to be analyzed. [String](../data-types/string.md#string).

*Returned value*

- The average sentiment value of the words in `text`. [Float32](../data-types/float.md).

*Examples*

Query:

```sql
SELECT detectTonality('Шарик - хороший пёс'), -- Sharik is a good dog
       detectTonality('Шарик - пёс'), -- Sharik is a dog
       detectTonality('Шарик - плохой пёс'); -- Sharik is a bad dog
```

Result:

```response
┌─detectTonality('Шарик - хороший пёс')─┬─detectTonality('Шарик - пёс')─┬─detectTonality('Шарик - плохой пёс')─┐
│                               0.44445 │                             0 │                                 -0.3 │
└───────────────────────────────────────┴───────────────────────────────┴──────────────────────────────────────┘
```

## lemmatize

Performs lemmatization on a given word. The function requires dictionaries, which can be downloaded from the [lemmagen3 models directory](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).

*Syntax*

``` sql
lemmatize('language', word)
```

*Arguments*

- `language` — Language whose rules are applied. [String](../data-types/string.md#string).
- `word` — Word that needs to be lemmatized. Must be lowercase. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT lemmatize('en', 'wolves');
```

Result:

``` text
┌─lemmatize("wolves")─┐
│ "wolf"              │
└─────────────────────┘
```

*Configuration*

This configuration specifies that the dictionary `en.bin` should be used for lemmatization of English (`en`) words. The `.bin` files can be downloaded from the [lemmagen3 models directory](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).

``` xml
<lemmatizers>
    <lemmatizer>
        <!-- highlight-start -->
        <lang>en</lang>
        <path>en.bin</path>
        <!-- highlight-end -->
    </lemmatizer>
</lemmatizers>
```

## stem

Performs stemming on a given word.

*Syntax*

``` sql
stem('language', word)
```

*Arguments*

- `language` — Language whose rules are applied. Use the two-letter [ISO 639-1 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).
- `word` — Word that needs to be stemmed. Must be lowercase. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT arrayMap(x -> stem('en', x), ['I', 'think', 'it', 'is', 'a', 'blessing', 'in', 'disguise']) as res;
```

Result:

``` text
┌─res────────────────────────────────────────────────┐
│ ['I','think','it','is','a','bless','in','disguis'] │
└────────────────────────────────────────────────────┘
```

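Because `stem` takes a single lowercase word, a sentence has to be split and lowercased before stemming. A minimal sketch, assuming whitespace-separated words with no punctuation and `allow_experimental_nlp_functions = 1` already set; `splitByChar` and `lower` are general-purpose ClickHouse string functions, not part of the NLP set:

```sql
SELECT arrayMap(
    w -> stem('en', lower(w)),                              -- lowercase each word, then stem it
    splitByChar(' ', 'I think it is a blessing in disguise') -- split the sentence on spaces
) AS res;
```
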
*Supported languages for stem()*

:::note
The `stem()` function uses the [Snowball stemming](https://snowballstem.org/) library. See the Snowball website for the current list of supported languages.
:::

- Arabic
- Armenian
- Basque
- Catalan
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Lithuanian
- Nepali
- Norwegian
- Porter
- Portuguese
- Romanian
- Russian
- Serbian
- Spanish
- Swedish
- Tamil
- Turkish
- Yiddish
## synonyms

Finds synonyms of a given word. There are two types of synonym extensions: `plain` and `wordnet`.

The `plain` extension type requires a path to a simple text file in which each line corresponds to one synonym set. Words on a line must be separated by space or tab characters.

The `wordnet` extension type requires a path to a directory containing a WordNet thesaurus. The thesaurus must contain a WordNet sense index.

*Syntax*

``` sql
synonyms('extension_name', word)
```

*Arguments*

- `extension_name` — Name of the extension in which to search. [String](../data-types/string.md#string).
- `word` — Word to search for in the extension. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT synonyms('list', 'important');
```

Result:

``` text
┌─synonyms('list', 'important')────────────┐
│ ['important','big','critical','crucial'] │
└──────────────────────────────────────────┘
```

*Configuration*

``` xml
<synonyms_extensions>
    <extension>
        <name>en</name>
        <type>plain</type>
        <path>en.txt</path>
    </extension>
    <extension>
        <name>en</name>
        <type>wordnet</type>
        <path>en/</path>
    </extension>
</synonyms_extensions>
```