---
sidebar_position: 67
sidebar_label: NLP
---

# [experimental] Natural Language Processing functions

:::warning
This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.
:::
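
As a minimal sketch, the setting named above can be enabled for the current session before calling any of the functions on this page:

``` sql
-- Enable the experimental NLP functions for this session only.
SET allow_experimental_nlp_functions = 1;
```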

## stem

Performs stemming on a given word.

**Syntax**

``` sql
stem('language', word)
```

**Arguments**

-   `language` — Language whose rules will be applied. Must be in lowercase. [String](../../sql-reference/data-types/string.md#string).
-   `word` — Word that needs to be stemmed. Must be in lowercase. [String](../../sql-reference/data-types/string.md#string).

**Examples**

Query:

``` sql
SELECT arrayMap(x -> stem('en', x), ['I', 'think', 'it', 'is', 'a', 'blessing', 'in', 'disguise']) as res;
```

Result:

``` text
┌─res────────────────────────────────────────────────┐
│ ['I','think','it','is','a','bless','in','disguis'] │
└────────────────────────────────────────────────────┘
```
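
Because `word` must be lowercase, mixed-case input can be normalized before stemming. The following is a sketch rather than a recommended pattern: it assumes the standard `lower` function, which changes only ASCII letters, so `lowerUTF8` may be the better fit for other alphabets. The input words are illustrative.

``` sql
SET allow_experimental_nlp_functions = 1;

-- Lowercase each word first, then stem it.
SELECT arrayMap(x -> stem('en', lower(x)), ['Blessing', 'Disguise']) AS res;
```

Going by the result shown above for `blessing` and `disguise`, this should yield `bless` and `disguis`.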

## lemmatize

Performs lemmatization on a given word. The function requires dictionaries, which can be obtained from the [lemmagen3 models directory](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).

**Syntax**

``` sql
lemmatize('language', word)
```

**Arguments**

-   `language` — Language whose rules will be applied. [String](../../sql-reference/data-types/string.md#string).
-   `word` — Word that needs to be lemmatized. Must be in lowercase. [String](../../sql-reference/data-types/string.md#string).

**Examples**

Query:

``` sql
SELECT lemmatize('en', 'wolves');
```

Result:

``` text
┌─lemmatize("wolves")─┐
│              "wolf" │
└─────────────────────┘
```

Configuration:
``` xml
<lemmatizers>
    <lemmatizer>
        <lang>en</lang>
        <path>en.bin</path>
    </lemmatizer>
</lemmatizers>
```

## synonyms

Finds synonyms of a given word. There are two types of synonym extensions: `plain` and `wordnet`.

With the `plain` extension type, provide a path to a simple text file in which each line is one synonym set. Words on a line must be separated by space or tab characters.

With the `wordnet` extension type, provide a path to a directory containing a WordNet thesaurus. The thesaurus must contain a WordNet sense index.
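
To illustrate the `plain` format, a file might look like the sketch below. The contents are hypothetical: the first line is inferred from the example result further down this page, and the second line is made up to show that each line forms its own synonym set.

``` text
important big critical crucial
fast quick rapid
```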

**Syntax**

``` sql
synonyms('extension_name', word)
```

**Arguments**

-   `extension_name` — Name of the extension in which the search is performed. [String](../../sql-reference/data-types/string.md#string).
-   `word` — Word to search for in the extension. [String](../../sql-reference/data-types/string.md#string).

**Examples**

Query:

``` sql
SELECT synonyms('list', 'important');
```

Result:

``` text
┌─synonyms('list', 'important')────────────┐
│ ['important','big','critical','crucial'] │
└──────────────────────────────────────────┘
```

Configuration:
``` xml
<synonyms_extensions>
    <extension>
        <name>en</name>
        <type>plain</type>
        <path>en.txt</path>
    </extension>
    <extension>
        <name>en</name>
        <type>wordnet</type>
        <path>en/</path>
    </extension>
</synonyms_extensions>
```