| slug | sidebar_position | sidebar_label |
|---|---|---|
| /en/sql-reference/functions/nlp-functions | 130 | NLP (experimental) |
Natural Language Processing (NLP) Functions
:::note
This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable, backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.
:::
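As a minimal illustration, the setting can be enabled for the current session, or for a single query, before calling any function on this page. Only the setting name comes from the note above; where you set it (session, query, or settings profile) is a deployment choice.

```sql
-- Enable the experimental NLP functions for the current session.
SET allow_experimental_nlp_functions = 1;

-- Alternatively, enable the setting for a single query only.
SELECT stem('en', 'blessing') SETTINGS allow_experimental_nlp_functions = 1;
```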
stem
Performs stemming on a given word.
Syntax
stem('language', word)
Arguments
- language — Language whose rules will be applied. Use the two-letter ISO 639-1 code.
- word — Word that needs to be stemmed. Must be in lowercase. String.
Examples
Query:
SELECT arrayMap(x -> stem('en', x), ['I', 'think', 'it', 'is', 'a', 'blessing', 'in', 'disguise']) as res;
Result:
┌─res────────────────────────────────────────────────┐
│ ['I','think','it','is','a','bless','in','disguis'] │
└────────────────────────────────────────────────────┘
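Because stem() expects lowercase input, mixed-case text can be normalized before stemming. The sketch below assumes plain ASCII English text and uses the built-in lower and splitByWhitespace functions; for non-ASCII text, lowerUTF8 is probably the better choice. The sample sentence is illustrative only.

```sql
SELECT arrayMap(
    x -> stem('en', lower(x)),          -- lowercase each token before stemming
    splitByWhitespace('I think it is a Blessing in Disguise')
) AS res;
```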
Supported languages for stem()
:::note
The stem() function uses the Snowball stemming library. See the Snowball website for the current list of supported languages.
:::
- Arabic
- Armenian
- Basque
- Catalan
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Lithuanian
- Nepali
- Norwegian
- Porter
- Portuguese
- Romanian
- Russian
- Serbian
- Spanish
- Swedish
- Tamil
- Turkish
- Yiddish
lemmatize
Performs lemmatization on a given word. Needs dictionaries to operate, which can be obtained here.
Syntax
lemmatize('language', word)
Arguments
- language — Language whose rules will be applied. String.
- word — Word that needs to be lemmatized. Must be lowercase. String.
Examples
Query:
SELECT lemmatize('en', 'wolves');
Result:
┌─lemmatize("wolves")─┐
│ "wolf"              │
└─────────────────────┘
Configuration
This configuration specifies that the dictionary en.bin should be used for lemmatization of English (en) words. The .bin files can be downloaded from here.
<lemmatizers>
    <lemmatizer>
        <!-- highlight-start -->
        <lang>en</lang>
        <path>en.bin</path>
        <!-- highlight-end -->
    </lemmatizer>
</lemmatizers>
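With a dictionary configured, several words can be lemmatized in one query. This is a sketch only: it assumes the en lemmatizer from the configuration above is loaded and that allow_experimental_nlp_functions = 1 is set. The word list is made up for illustration.

```sql
-- Assumes a lemmatizer for 'en' is configured as shown above
-- and allow_experimental_nlp_functions = 1 is set.
SELECT
    word,
    lemmatize('en', word) AS lemma
FROM
(
    SELECT arrayJoin(['wolves', 'mice', 'running']) AS word
);
```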
synonyms
Finds synonyms of a given word. There are two types of synonym extensions: plain and wordnet.
With the plain extension type, we need to provide a path to a simple text file in which each line corresponds to one synonym set. Words on a line must be separated by space or tab characters.
With the wordnet extension type, we need to provide a path to a directory containing a WordNet thesaurus. The thesaurus must contain a WordNet sense index.
Syntax
synonyms('extension_name', word)
Arguments
- extension_name — Name of the extension in which the search will be performed. String.
- word — Word that will be searched for in the extension. String.
Examples
Query:
SELECT synonyms('list', 'important');
Result:
┌─synonyms('list', 'important')────────────┐
│ ['important','big','critical','crucial'] │
└──────────────────────────────────────────┘
Configuration
<synonyms_extensions>
    <extension>
        <name>en</name>
        <type>plain</type>
        <path>en.txt</path>
    </extension>
    <extension>
        <name>en</name>
        <type>wordnet</type>
        <path>en/</path>
    </extension>
</synonyms_extensions>
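To make the plain format concrete, the sketch below shows what a synonym file might contain and how the extension name ties the query to the configuration. The file contents are hypothetical. It also assumes a single plain extension named en, so the name passed to synonyms() matches a configured name element.

```sql
-- Hypothetical contents of en.txt: one synonym set per line,
-- words separated by spaces or tabs.
--
--   important big critical crucial
--   small little tiny
--
-- With a plain extension named 'en' pointing at that file,
-- the first argument selects the extension to search.
SELECT synonyms('en', 'important') AS syns;
```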
detectLanguage
Detects the language of the UTF8-encoded input string. The function uses the CLD2 library for detection, and it returns the 2-letter ISO language code.
The detectLanguage function works best when the input string contains more than 200 characters.
Syntax
detectLanguage('text_to_be_analyzed')
Arguments
- text_to_be_analyzed — The text (one or more sentences) to analyze. String.
Returned value
- The 2-letter ISO code of the detected language.
Other possible results:
- un = unknown, cannot detect any language.
- other = the detected language does not have a 2-letter code.
Examples
Query:
SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there’s a will, there’s a way.');
Result:
fr
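In practice the function is usually applied to a column rather than a literal. The sketch below assumes a hypothetical table reviews with a String column body, and skips short texts in line with the 200-character guidance above. The threshold and all names are illustrative.

```sql
-- Hypothetical table and column names; adjust to your schema.
SELECT
    detectLanguage(body) AS lang,
    count() AS docs
FROM reviews
WHERE lengthUTF8(body) > 200          -- skip short texts, where detection works less well
GROUP BY lang
ORDER BY docs DESC;
```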
detectLanguageMixed
Similar to the detectLanguage function, but detectLanguageMixed returns a Map of 2-letter language codes, each mapped to the proportion of the text detected as that language.
Syntax
detectLanguageMixed('text_to_be_analyzed')
Arguments
text_to_be_analyzed
— A collection (or sentences) of strings to analyze. String.
Returned value
- Map(String, Float32): The keys are 2-letter ISO codes and the values are the proportion of the text found for that language.
Examples
Query:
SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.');
Result:
┌─detectLanguageMixed()─┐
│ {'ja':0.62,'fr':0.36} │
└───────────────────────┘
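Since the result is a Map, individual languages can be read with the standard map functions. A minimal sketch, again assuming a hypothetical reviews table with a String column body:

```sql
-- Hypothetical table and column names.
SELECT
    detectLanguageMixed(body) AS langs,
    mapKeys(langs) AS detected_codes,   -- language codes found in the text
    langs['fr'] AS french_share         -- default value 0 when 'fr' is not among the keys
FROM reviews
LIMIT 10;
```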
detectLanguageUnknown
Similar to the detectLanguage function, except that the detectLanguageUnknown function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.
Syntax
detectLanguageUnknown('text_to_be_analyzed')
Arguments
- text_to_be_analyzed — The text (one or more sentences) to analyze. String.
Returned value
- The 2-letter ISO code of the detected language.
Other possible results:
- un = unknown, cannot detect any language.
- other = the detected language does not have a 2-letter code.
Examples
Query:
SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.');
Result:
┌─detectLanguageUnknown('Ich bleibe für ein paar Tage.')─┐
│ de                                                     │
└────────────────────────────────────────────────────────┘
detectCharset
The detectCharset function detects the character set of the non-UTF8-encoded input string.
Syntax
detectCharset('text_to_be_analyzed')
Arguments
- text_to_be_analyzed — The text (one or more sentences) to analyze. String.
Returned value
- A String containing the code of the detected character set.
Examples
Query:
SELECT detectCharset('Ich bleibe für ein paar Tage.');
Result:
┌─detectCharset('Ich bleibe für ein paar Tage.')─┐
│ WINDOWS-1252                                   │
└────────────────────────────────────────────────┘
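One possible use is auditing which encodings appear in imported data. The sketch below assumes a hypothetical table raw_imports with a String column payload holding text of unknown encoding.

```sql
-- Hypothetical table and column names.
SELECT
    detectCharset(payload) AS charset,
    count() AS rows_seen
FROM raw_imports
GROUP BY charset
ORDER BY rows_seen DESC;
```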