---
slug: /en/sql-reference/functions/nlp-functions
sidebar_position: 130
sidebar_label: NLP (experimental)
---

# Natural Language Processing (NLP) Functions

:::warning
This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.
:::
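
As a minimal sketch, the setting named above can be enabled either for the whole session or for a single query. The sample sentence is reused from the examples on this page.

```sql
-- Enable the experimental NLP functions for the current session.
SET allow_experimental_nlp_functions = 1;

-- Alternatively, enable them for a single query only.
SELECT detectLanguage('Ich bleibe für ein paar Tage.')
SETTINGS allow_experimental_nlp_functions = 1;
```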

## detectCharset

The `detectCharset` function detects the character set of the non-UTF8-encoded input string.

*Syntax*

``` sql
detectCharset('text_to_be_analyzed')
```

*Arguments*

- `text_to_be_analyzed` — The string (one or more sentences) to analyze. [String](../data-types/string.md#string).

*Returned value*

- A `String` containing the code of the detected character set.

*Examples*

Query:

```sql
SELECT detectCharset('Ich bleibe für ein paar Tage.');
```

Result:

```response
┌─detectCharset('Ich bleibe für ein paar Tage.')─┐
│ WINDOWS-1252                                   │
└────────────────────────────────────────────────┘
```

## detectLanguage

Detects the language of the UTF8-encoded input string. The function uses the [CLD2 library](https://github.com/CLD2Owners/cld2) for detection, and it returns the 2-letter ISO language code.

The `detectLanguage` function works best when the input string contains more than 200 characters.

*Syntax*

``` sql
detectLanguage('text_to_be_analyzed')
```

*Arguments*

- `text_to_be_analyzed` — The string (one or more sentences) to analyze. [String](../data-types/string.md#string).

*Returned value*

- The 2-letter ISO code of the detected language.

Other possible results:

- `un` = unknown, no language could be detected.
- `other` = the detected language does not have a 2-letter code.

*Examples*

Query:

```sql
SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there’s a will, there’s a way.');
```

Result:

```response
fr
```
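
In practice the function is usually applied to a column rather than to a literal. The sketch below counts detected languages over a few inline sample rows. The sample strings and the `text` alias are illustrative assumptions standing in for a real `String` column, and inputs this short may well be reported as `un`.

```sql
-- Count rows per detected language over a small inline sample.
-- The array of strings stands in for a real String column.
SELECT
    detectLanguage(text) AS lang,
    count() AS texts
FROM
(
    SELECT arrayJoin([
        'Je pense que je ne parviendrai jamais à parler français comme un natif.',
        'Ich bleibe für ein paar Tage.',
        'Where there is a will, there is a way.'
    ]) AS text
)
GROUP BY lang
ORDER BY texts DESC, lang ASC
SETTINGS allow_experimental_nlp_functions = 1;
```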

## detectLanguageMixed

Similar to the `detectLanguage` function, but `detectLanguageMixed` returns a `Map` that maps each detected 2-letter language code to the share of the text in that language.


*Syntax*

``` sql
detectLanguageMixed('text_to_be_analyzed')
```

*Arguments*

- `text_to_be_analyzed` — The string (one or more sentences) to analyze. [String](../data-types/string.md#string).

*Returned value*

- `Map(String, Float32)`: The keys are 2-letter ISO codes and the values are the share of the text attributed to that language.


*Examples*

Query:

```sql
SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.');
```

Result:

```response
┌─detectLanguageMixed()─┐
│ {'ja':0.62,'fr':0.36} │
└───────────────────────┘
```
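
When only the most prominent language is needed, the returned map can be reduced to a single key. This is a sketch that relies on the general-purpose functions `mapKeys`, `mapValues`, and `arrayReverseSort`; the aliases `langs` and `dominant_language` are illustrative names.

```sql
WITH detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.') AS langs
SELECT
    langs,
    -- Order the language codes by their share, largest first, and keep the first code.
    arrayReverseSort((code, share) -> share, mapKeys(langs), mapValues(langs))[1] AS dominant_language
SETTINGS allow_experimental_nlp_functions = 1;
```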

## detectProgrammingLanguage

Determines the programming language of a piece of source code. The function computes all unigrams and bigrams of commands in the source code.
It then looks them up in a marked-up dictionary that assigns weights to unigrams and bigrams of commands for various programming languages, and returns the language with the highest total weight.

*Syntax*

``` sql
detectProgrammingLanguage('source_code')
```

*Arguments*

- `source_code` — String representation of the source code to analyze. [String](../data-types/string.md#string).

*Returned value*

- Programming language. [String](../data-types/string.md).

*Examples*

Query:

```sql
SELECT detectProgrammingLanguage('#include <iostream>');
```

Result:

```response
┌─detectProgrammingLanguage('#include <iostream>')─┐
│ C++                                              │
└──────────────────────────────────────────────────┘
```
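
Because the score is built from unigrams and bigrams, a snippet with several commands gives the function more evidence than a single line. The following sketch runs the function over two inline snippets. The snippets and the `snippet` alias are illustrative assumptions, and the detected values are not shown because they depend on the embedded dictionary.

```sql
-- Classify a few multi-line snippets; '\n' inside the literals is a line break.
SELECT
    snippet,
    detectProgrammingLanguage(snippet) AS language
FROM
(
    SELECT arrayJoin([
        '#include <iostream>\nint main() { std::cout << "hi"; return 0; }',
        'def greet(name):\n    print(name)'
    ]) AS snippet
)
SETTINGS allow_experimental_nlp_functions = 1;
```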

## detectLanguageUnknown

Similar to the `detectLanguage` function, except the `detectLanguageUnknown` function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.


*Syntax*

``` sql
detectLanguageUnknown('text_to_be_analyzed')
```

*Arguments*

- `text_to_be_analyzed` — The string (one or more sentences) to analyze. [String](../data-types/string.md#string).

*Returned value*

- The 2-letter ISO code of the detected language.

Other possible results:

- `un` = unknown, no language could be detected.
- `other` = the detected language does not have a 2-letter code.

*Examples*

Query:

```sql
SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.');
```

Result:

```response
┌─detectLanguageUnknown('Ich bleibe für ein paar Tage.')─┐
│ de                                                     │
└────────────────────────────────────────────────────────┘
```

## detectTonality

Determines the sentiment of text data. Uses a marked-up sentiment dictionary, in which each word has a tonality ranging from `-12` to `6`.
For each text, it calculates the average sentiment value of its words and returns it in the range `[-1,1]`.

:::note
This function is limited in its current form: it uses the embedded emotional dictionary at `/contrib/nlp-data/tonality_ru.zst` and works only for Russian-language text.
:::

*Syntax*

``` sql
detectTonality(text)
```

*Arguments*

- `text` — The text to be analyzed. [String](../data-types/string.md#string).

*Returned value*

- The average sentiment value of the words in `text`. [Float32](../data-types/float.md).

*Examples*

Query:

```sql
SELECT detectTonality('Шарик - хороший пёс'), -- Sharik is a good dog
       detectTonality('Шарик - пёс'), -- Sharik is a dog
       detectTonality('Шарик - плохой пёс'); -- Sharik is a bad dog
```

Result:

```response
┌─detectTonality('Шарик - хороший пёс')─┬─detectTonality('Шарик - пёс')─┬─detectTonality('Шарик - плохой пёс')─┐
│                               0.44445 │                             0 │                                 -0.3 │
└───────────────────────────────────────┴───────────────────────────────┴──────────────────────────────────────┘
```
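
The numeric score can be turned into a coarse label with `multiIf`. This is a sketch only: the `0.1` and `-0.1` thresholds and the label names are arbitrary assumptions chosen for illustration, not values defined by the function.

```sql
-- Map the tonality score to a coarse label using hypothetical thresholds.
SELECT
    text,
    detectTonality(text) AS tonality,
    multiIf(tonality > 0.1, 'positive',
            tonality < -0.1, 'negative',
            'neutral') AS label
FROM
(
    SELECT arrayJoin(['Шарик - хороший пёс', 'Шарик - пёс', 'Шарик - плохой пёс']) AS text
)
SETTINGS allow_experimental_nlp_functions = 1;
```
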

## lemmatize

Performs lemmatization on a given word. The function requires dictionaries, which can be obtained from the [lemmagen3 models repository](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).

*Syntax*

``` sql
lemmatize('language', word)
```

*Arguments*

- `language` — Language whose rules will be applied. [String](../data-types/string.md#string).
- `word` — Word to be lemmatized. Must be lowercase. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT lemmatize('en', 'wolves');
```

Result:

``` text
┌─lemmatize('en', 'wolves')─┐
│ wolf                      │
└───────────────────────────┘
```

*Configuration*

This configuration specifies that the dictionary `en.bin` should be used for lemmatization of English (`en`) words. The `.bin` files can be downloaded from the [lemmagen3 models repository](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).

``` xml
<lemmatizers>
    <lemmatizer>
        <!-- highlight-start -->
        <lang>en</lang>
        <path>en.bin</path>
        <!-- highlight-end -->
    </lemmatizer>
</lemmatizers>
```
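
Since `lemmatize` works on one lowercase word at a time, a sentence has to be lowercased and split first. The sketch below does this with `lower` and `splitByWhitespace`. It assumes an English dictionary is configured under the name `en` as shown above, and the sample sentence is illustrative.

```sql
-- Lowercase the sentence, split it into words, and lemmatize each word.
SELECT arrayMap(
    word -> lemmatize('en', word),
    splitByWhitespace(lower('The wolves were hunting'))
) AS lemmas
SETTINGS allow_experimental_nlp_functions = 1;
```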

## stem

Performs stemming on a given word.

*Syntax*

``` sql
stem('language', word)
```

*Arguments*

- `language` — Language whose rules will be applied, given as a two-letter [ISO 639-1 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). [String](../data-types/string.md#string).
- `word` — Word to be stemmed. Must be lowercase. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT arrayMap(x -> stem('en', x), ['I', 'think', 'it', 'is', 'a', 'blessing', 'in', 'disguise']) as res;
```

Result:

``` text
┌─res────────────────────────────────────────────────┐
│ ['I','think','it','is','a','bless','in','disguis'] │
└────────────────────────────────────────────────────┘
```

*Supported languages for stem()*

:::note
The `stem()` function uses the [Snowball stemming](https://snowballstem.org/) library. See the Snowball website for the current list of supported languages.
:::

- Arabic
- Armenian
- Basque
- Catalan
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Lithuanian
- Nepali
- Norwegian
- Porter
- Portuguese
- Romanian
- Russian
- Serbian
- Spanish
- Swedish
- Tamil
- Turkish
- Yiddish
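
Because `stem` expects lowercase input, raw text is best normalized before stemming. The following sketch lowercases a sentence and splits it on non-alphanumeric characters with `splitByNonAlpha` before stemming each word. The sample sentence is the one from the example above with punctuation added.

```sql
-- Lowercase the sentence, drop punctuation while splitting, and stem each word.
SELECT arrayMap(
    word -> stem('en', word),
    splitByNonAlpha(lower('I think it is a blessing in disguise.'))
) AS stems
SETTINGS allow_experimental_nlp_functions = 1;
```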

## synonyms

Finds synonyms of a given word. There are two types of synonym extensions: `plain` and `wordnet`.

The `plain` extension type requires a path to a simple text file in which each line corresponds to one synonym set. Words in a line must be separated by space or tab characters.

The `wordnet` extension type requires a path to a directory that contains a WordNet thesaurus. The thesaurus must contain a WordNet sense index.

*Syntax*

``` sql
synonyms('extension_name', word)
```

*Arguments*

- `extension_name` — Name of the extension in which the search will be performed. [String](../data-types/string.md#string).
- `word` — Word to search for in the extension. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT synonyms('list', 'important');
```

Result:

``` text
┌─synonyms('list', 'important')────────────┐
│ ['important','big','critical','crucial'] │
└──────────────────────────────────────────┘
```

*Configuration*

``` xml
<synonyms_extensions>
    <extension>
        <name>en</name>
        <type>plain</type>
        <path>en.txt</path>
    </extension>
    <extension>
        <name>en</name>
        <type>wordnet</type>
        <path>en/</path>
    </extension>
</synonyms_extensions>
```
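
With an extension configured, the function can be used to expand a list of search terms. This sketch assumes that an extension named `en` is available, as in the configuration above. The input words are illustrative, and the result depends entirely on the contents of the configured synonym file or thesaurus.

```sql
-- Look up the synonym set for each input word in the `en` extension.
SELECT
    word,
    synonyms('en', word) AS expansions
FROM
(
    SELECT arrayJoin(['important', 'big']) AS word
)
SETTINGS allow_experimental_nlp_functions = 1;
```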
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								---
-												add slugs

											
										
										
											2022-08-28 14:53:34 +00:00
+								slug: /en/sql-reference/functions/nlp-functions
-												Docs: Sort functions in sidebar

											
										
										
											2023-04-19 17:05:55 +00:00
+								sidebar_position: 130
 								sidebar_label: NLP (experimental)
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								---
-												Fix heading of nlp-functions.md
											
										
										
											2023-06-23 12:49:41 +00:00
+								# Natural Language Processing (NLP) Functions
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								:::warning
-												Removed /ja folder, cleaned up /ru markdown

											
										
										
											2022-04-09 13:29:05 +00:00
+								This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.
 								:::
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								## detectCharset
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								The `detectCharset` function detects the character set of the non-UTF8-encoded input string.
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Syntax*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
 								``` sql
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								detectCharset('text_to_be_analyzed')
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								```
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Arguments*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								- `text_to_be_analyzed` — A collection (or sentences) of strings to analyze. [String](../data-types/string.md#string).
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Returned value*
 								- A `String` containing the code of the detected character set
 								*Examples*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
 								Query:
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								```sql
 								SELECT detectCharset('Ich bleibe für ein paar Tage.');
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								```
 								Result:
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								```response
 								┌─detectCharset('Ich bleibe für ein paar Tage.')─┐
 								│ WINDOWS-1252                                   │
 								└────────────────────────────────────────────────┘
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								```
-												Update nlp-functions.md
											
										
										
											2023-05-22 16:14:23 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								## detectLanguage
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								Detects the language of the UTF8-encoded input string. The function uses the [CLD2 library](https://github.com/CLD2Owners/cld2) for detection, and it returns the 2-letter ISO language code.
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								The `detectLanguage` function works best when providing over 200 characters in the input string.
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Syntax*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
 								``` sql
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								detectLanguage('text_to_be_analyzed')
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								```
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Arguments*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								- `text_to_be_analyzed` — A collection (or sentences) of strings to analyze. [String](../data-types/string.md#string).
 								*Returned value*
 								- The 2-letter ISO code of the detected language
 								Other possible results:
 								- `un` = unknown, can not detect any language.
 								- `other` = the detected language does not have 2 letter code.
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Examples*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
 								Query:
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								```sql
 								SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there’s a will, there’s a way.');
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								```
 								Result:
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								```response
 								fr
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								```
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								## detectLanguageMixed
-												add more info to NLP docs

											
										
										
											2023-05-22 17:02:39 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								Similar to the `detectLanguage` function, but `detectLanguageMixed` returns a `Map` of 2-letter language codes that are mapped to the percentage of the certain language in the text.
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												added english and russian documentation draft

											
										
										
											2021-06-05 03:57:53 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Syntax*
-												Improve docs

											
										
										
											2021-08-02 12:32:45 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								``` sql
 								detectLanguageMixed('text_to_be_analyzed')
 								```
-												Improve docs

											
										
										
											2021-08-02 12:32:45 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Arguments*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								- `text_to_be_analyzed` — A collection (or sentences) of strings to analyze. [String](../data-types/string.md#string).
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Returned value*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								- `Map(String, Float32)`: The keys are 2-letter ISO codes and the values are a percentage of text found for that language
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Examples*
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
 								Query:
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								```sql
 								SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.');
-												added english documentation for tokenize() & stem()

											
										
										
											2021-05-10 10:42:32 +00:00
+								```
 								Result:
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								```response
 								┌─detectLanguageMixed()─┐
 								│ {'ja':0.62,'fr':0.36  │
 								└───────────────────────┘
-												Apply suggestions from code review

Co-authored-by: Alexey Boykov <33257111+mathalex@users.noreply.github.com>
											
										
										
											2021-08-02 15:54:24 +00:00
+								```
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectProgrammingLanguage

											
										
										
											2024-06-25 15:01:14 +00:00
+								## detectProgrammingLanguage
 								Determines the programming language from the source code. Calculates all the unigrams and bigrams of commands in the source code.
 								Then using a marked-up dictionary with weights of unigrams and bigrams of commands for various programming languages finds the biggest weight of the programming language and returns it.
 								*Syntax*
 								``` sql
 								detectProgrammingLanguage('source_code')
 								```
 								*Arguments*
 								- `source_code` — String representation of the source code to analyze. [String](../data-types/string.md#string).
 								*Returned value*
 								- Programming language. [String](../data-types/string.md).
 								*Examples*
 								Query:
 								```sql
 								SELECT detectProgrammingLanguage('#include <iostream>');
 								```
 								Result:
 								```response
 								┌─detectProgrammingLanguage('#include <iostream>')─┐
 								│ C++                                              │
 								└──────────────────────────────────────────────────┘
 								```
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								## detectLanguageUnknown
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								Similar to the `detectLanguage` function, except the `detectLanguageUnknown` function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Syntax*
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
 								``` sql
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								detectLanguageUnknown('text_to_be_analyzed')
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
+								```
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Arguments*
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												Standardize references to data type docs

											
										
										
											2024-05-24 03:54:16 +00:00
+								- `text_to_be_analyzed` — A collection (or sentences) of strings to analyze. [String](../data-types/string.md#string).
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Returned value*
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
 								- The 2-letter ISO code of the detected language
 								Other possible results:
 								- `un` = unknown, can not detect any language.
 								- `other` = the detected language does not have 2 letter code.
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Examples*
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
 								Query:
 								```sql
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.');
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
+								```
 								Result:
 								```response
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								┌─detectLanguageUnknown('Ich bleibe für ein paar Tage.')─┐
 								│ de                                                     │
 								└────────────────────────────────────────────────────────┘
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
+								```
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								## detectTonality
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								Determines the sentiment of text data. Uses a marked-up sentiment dictionary, in which each word has a tonality ranging from `-12` to `6`.
 								For each text, it calculates the average sentiment value of its words and returns it in the range `[-1,1]`.
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								:::note
 								This function is limited in its current form. Currently it makes use of the embedded emotional dictionary at `/contrib/nlp-data/tonality_ru.zst` and only works for the Russian language.
 								:::
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Syntax*
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
 								``` sql
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								detectTonality(text)
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
+								```
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Arguments*
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								- `text` — The text to be analyzed. [String](../data-types/string.md#string).
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Returned value*
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								- The average sentiment value of the words in `text`. [Float32](../data-types/float.md).
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								*Examples*
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
 								Query:
 								```sql
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								SELECT detectTonality('Шарик - хороший пёс'), -- Sharik is a good dog
 								       detectTonality('Шарик - пёс'), -- Sharik is a dog
 								       detectTonality('Шарик - плохой пёс'); -- Sharkik is a bad dog
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
+								```
 								Result:
 								```response
-												add detectTonality and alphabetize page

											
										
										
											2024-06-25 14:48:32 +00:00
+								┌─detectTonality('Шарик - хороший пёс')─┬─detectTonality('Шарик - пёс')─┬─detectTonality('Шарик - плохой пёс')─┐
 								│                               0.44445 │                             0 │                                 -0.3 │
 								└───────────────────────────────────────┴───────────────────────────────┴──────────────────────────────────────┘
-												Update nlp-functions.md

Added the detectLanguage functions

											
										
										
											2023-01-10 19:26:51 +00:00
+								```
## lemmatize

Performs lemmatization on a given word. It requires dictionaries, which can be downloaded from the [lemmagen3 models directory](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).

*Syntax*

``` sql
lemmatize('language', word)
```

*Arguments*

- `language` — Language whose rules will be applied. [String](../data-types/string.md#string).
- `word` — Word that needs to be lemmatized. Must be lowercase. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT lemmatize('en', 'wolves');
```

Result:

``` text
┌─lemmatize('en', 'wolves')─┐
│ wolf                      │
└───────────────────────────┘
```

*Configuration*

The following configuration specifies that the dictionary `en.bin` is used to lemmatize English (`en`) words. The `.bin` files can be downloaded from the [lemmagen3 models directory](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).

``` xml
<lemmatizers>
    <lemmatizer>
        <!-- highlight-start -->
        <lang>en</lang>
        <path>en.bin</path>
        <!-- highlight-end -->
    </lemmatizer>
</lemmatizers>
```
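
As a minimal sketch, assuming an `en` lemmatizer is configured as above and the experimental setting named at the top of this page is enabled, several words can be lemmatized in one query by combining `lemmatize` with `arrayMap`. The input words are illustrative only. No result is shown because it depends on the dictionary in use.

```sql
-- Enable the experimental NLP functions for the current session.
SET allow_experimental_nlp_functions = 1;

-- Lowercase each word first, because lemmatize expects lowercase input.
SELECT arrayMap(x -> lemmatize('en', lower(x)), ['Wolves', 'Mice', 'Running']) AS lemmas;
```
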
## stem

Performs stemming on a given word.

*Syntax*

``` sql
stem('language', word)
```

*Arguments*

- `language` — Language whose rules will be applied. Use the two-letter [ISO 639-1 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). [String](../data-types/string.md#string).
- `word` — Word that needs to be stemmed. Must be lowercase. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT arrayMap(x -> stem('en', x), ['I', 'think', 'it', 'is', 'a', 'blessing', 'in', 'disguise']) as res;
```

Result:

``` text
┌─res────────────────────────────────────────────────┐
│ ['I','think','it','is','a','bless','in','disguis'] │
└────────────────────────────────────────────────────┘
```
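
Because `word` must be lowercase, raw text usually needs to be split and normalized before stemming. The sketch below is one possible way to do that, assuming the experimental setting is enabled. It uses the existing `splitByNonAlpha` and `lower` functions, and the sentence is illustrative only. No result is shown here.

```sql
-- Enable the experimental NLP functions for the current session.
SET allow_experimental_nlp_functions = 1;

-- Split the sentence into words, lowercase each word, then stem it.
SELECT arrayMap(x -> stem('en', lower(x)),
                splitByNonAlpha('I think it is a blessing in disguise')) AS res;
```
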
*Supported languages for stem()*

:::note
The `stem()` function uses the [Snowball stemming](https://snowballstem.org/) library. See the Snowball website for the current list of supported languages.
:::

- Arabic
- Armenian
- Basque
- Catalan
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Lithuanian
- Nepali
- Norwegian
- Porter
- Portuguese
- Romanian
- Russian
- Serbian
- Spanish
- Swedish
- Tamil
- Turkish
- Yiddish

## synonyms

Finds synonyms of a given word. There are two types of synonym extensions: `plain` and `wordnet`.

With the `plain` extension type, provide the path to a simple text file in which each line corresponds to one synonym set. Words on a line must be separated by space or tab characters.

With the `wordnet` extension type, provide the path to a directory that contains a WordNet thesaurus. The thesaurus must contain a WordNet sense index.

*Syntax*

``` sql
synonyms('extension_name', word)
```

*Arguments*

- `extension_name` — Name of the extension in which the search is performed. [String](../data-types/string.md#string).
- `word` — Word to search for in the extension. [String](../data-types/string.md#string).

*Examples*

Query:

``` sql
SELECT synonyms('list', 'important');
```

Result:

``` text
┌─synonyms('list', 'important')────────────┐
│ ['important','big','critical','crucial'] │
└──────────────────────────────────────────┘
```

*Configuration*

``` xml
<synonyms_extensions>
    <extension>
        <name>en</name>
        <type>plain</type>
        <path>en.txt</path>
    </extension>
    <extension>
        <name>en</name>
        <type>wordnet</type>
        <path>en/</path>
    </extension>
</synonyms_extensions>
```
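
For the `plain` type, the file referenced by `<path>` might look like the sketch below. The contents are hypothetical: each line is one synonym set with words separated by spaces, and the first line is chosen only to be consistent with the example result above. That result also assumes an extension named `list` is configured with type `plain`.

``` text
important big critical crucial
fast quick rapid speedy
```
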