Merge pull request #65673 from Blargian/document_detectTonality

[Docs] add `detectTonality`, `detectProgrammingLanguage` to docs
This commit is contained in:
Raúl Marín 2024-06-25 17:23:15 +00:00 committed by GitHub
commit bd4f8524bf
2 changed files with 285 additions and 207 deletions


@ -6,26 +6,297 @@ sidebar_label: NLP (experimental)
# Natural Language Processing (NLP) Functions
:::warning
This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.
:::
## detectCharset
The `detectCharset` function detects the character set of the non-UTF8-encoded input string.
*Syntax*
``` sql
detectCharset('text_to_be_analyzed')
```
*Arguments*
- `text_to_be_analyzed` — The string, or collection of strings, to analyze. [String](../data-types/string.md#string).
*Returned value*
- A `String` containing the code of the detected character set
*Examples*
Query:
```sql
SELECT detectCharset('Ich bleibe für ein paar Tage.');
```
Result:
```response
┌─detectCharset('Ich bleibe für ein paar Tage.')─┐
│ WINDOWS-1252 │
└────────────────────────────────────────────────┘
```
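A detected character set can be passed straight to `convertCharset` to normalize legacy data to UTF-8. A minimal sketch, in which the table `legacy_messages` and its `raw` column are hypothetical:

```sql
-- Normalize each value to UTF-8 using its detected character set.
-- The table `legacy_messages` and its non-UTF8 `raw` String column are hypothetical.
SELECT convertCharset(raw, detectCharset(raw), 'UTF-8') AS utf8_text
FROM legacy_messages;
```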
## detectLanguage
Detects the language of the UTF8-encoded input string. The function uses the [CLD2 library](https://github.com/CLD2Owners/cld2) for detection, and it returns the 2-letter ISO language code.
The `detectLanguage` function works best when the input string contains over 200 characters.
*Syntax*
``` sql
detectLanguage('text_to_be_analyzed')
```
*Arguments*
- `text_to_be_analyzed` — The string, or collection of strings, to analyze. [String](../data-types/string.md#string).
*Returned value*
- The 2-letter ISO code of the detected language
Other possible results:
- `un` = unknown; no language could be detected.
- `other` = the detected language does not have a 2-letter code.
*Examples*
Query:
```sql
SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where theres a will, theres a way.');
```
Result:
```response
fr
```
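A common pattern is grouping rows by detected language, for example to see the language mix of a corpus. A sketch, assuming a hypothetical `reviews` table with a UTF-8 `body` column:

```sql
-- Count rows per detected language; 'un' and 'other' appear as their own buckets.
SELECT detectLanguage(body) AS language, count() AS rows
FROM reviews
GROUP BY language
ORDER BY rows DESC;
```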
## detectLanguageMixed
Similar to the `detectLanguage` function, but `detectLanguageMixed` returns a `Map` of 2-letter language codes mapped to the percentage of each detected language in the text.
*Syntax*
``` sql
detectLanguageMixed('text_to_be_analyzed')
```
*Arguments*
- `text_to_be_analyzed` — The string, or collection of strings, to analyze. [String](../data-types/string.md#string).
*Returned value*
- `Map(String, Float32)`: The keys are 2-letter ISO codes and the values are a percentage of text found for that language
*Examples*
Query:
```sql
SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.');
```
Result:
```response
┌─detectLanguageMixed()─┐
│ {'ja':0.62,'fr':0.36} │
└───────────────────────┘
```
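To keep only the dominant language, the key with the largest value can be extracted from the returned `Map` with the generic map and array functions. This is one possible approach, not part of the function itself:

```sql
-- Sort (language, share) pairs by descending share and take the first key.
WITH detectLanguageMixed('二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.') AS shares
SELECT arraySort(t -> -(t.2), arrayZip(mapKeys(shares), mapValues(shares)))[1].1 AS dominant_language;
```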
## detectProgrammingLanguage
Determines the programming language of the given source code. It calculates the unigrams and bigrams of commands in the source code, looks them up in a marked-up dictionary that assigns weights to these unigrams and bigrams for various programming languages, and returns the language with the highest total weight.
*Syntax*
``` sql
detectProgrammingLanguage('source_code')
```
*Arguments*
- `source_code` — String representation of the source code to analyze. [String](../data-types/string.md#string).
*Returned value*
- Programming language. [String](../data-types/string.md).
*Examples*
Query:
```sql
SELECT detectProgrammingLanguage('#include <iostream>');
```
Result:
```response
┌─detectProgrammingLanguage('#include <iostream>')─┐
│ C++ │
└──────────────────────────────────────────────────┘
```
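The function can also be applied to a whole column of snippets. A small sketch that builds the input rows inline with `arrayJoin`; the snippets themselves are arbitrary examples:

```sql
-- Classify several code snippets at once.
SELECT snippet, detectProgrammingLanguage(snippet) AS language
FROM (SELECT arrayJoin(['#include <iostream>', 'SELECT 1 FROM system.one']) AS snippet);
```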
## detectLanguageUnknown
Similar to the `detectLanguage` function, except the `detectLanguageUnknown` function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.
*Syntax*
``` sql
detectLanguageUnknown('text_to_be_analyzed')
```
*Arguments*
- `text_to_be_analyzed` — The string, or collection of strings, to analyze. [String](../data-types/string.md#string).
*Returned value*
- The 2-letter ISO code of the detected language
Other possible results:
- `un` = unknown; no language could be detected.
- `other` = the detected language does not have a 2-letter code.
*Examples*
Query:
```sql
SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.');
```
Result:
```response
┌─detectLanguageUnknown('Ich bleibe für ein paar Tage.')─┐
│ de │
└────────────────────────────────────────────────────────┘
```
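To see the difference from `detectLanguage`, the same sample text can be re-encoded before detection. A sketch that assumes `convertCharset` accepts the ICU charset name `UTF-16LE`:

```sql
-- Re-encode the sample text as UTF-16LE, then detect the language from the raw bytes.
SELECT detectLanguageUnknown(convertCharset('Ich bleibe für ein paar Tage.', 'UTF-8', 'UTF-16LE'));
```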
## detectTonality
Determines the sentiment of text data. Uses a marked-up sentiment dictionary, in which each word has a tonality ranging from `-12` to `6`.
For each text, it calculates the average sentiment value of its words and returns it in the range `[-1,1]`.
:::note
This function is limited in its current form. Currently it makes use of the embedded emotional dictionary at `/contrib/nlp-data/tonality_ru.zst` and only works for the Russian language.
:::
*Syntax*
``` sql
detectTonality(text)
```
*Arguments*
- `text` — The text to be analyzed. [String](../data-types/string.md#string).
*Returned value*
- The average sentiment value of the words in `text`. [Float32](../data-types/float.md).
*Examples*
Query:
```sql
SELECT detectTonality('Шарик - хороший пёс'), -- Sharik is a good dog
detectTonality('Шарик - пёс'), -- Sharik is a dog
detectTonality('Шарик - плохой пёс'); -- Sharik is a bad dog
```
Result:
```response
┌─detectTonality('Шарик - хороший пёс')─┬─detectTonality('Шарик - пёс')─┬─detectTonality('Шарик - плохой пёс')─┐
│ 0.44445 │ 0 │ -0.3 │
└───────────────────────────────────────┴───────────────────────────────┴──────────────────────────────────────┘
```
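Because the result is a plain `Float32`, it aggregates like any other number. A sketch of a per-day sentiment trend, assuming a hypothetical `comments` table with Russian-language `text` and a `created_at` timestamp:

```sql
-- Average tonality per day as a rough sentiment trend.
SELECT toDate(created_at) AS day,
       avg(detectTonality(text)) AS avg_sentiment
FROM comments
GROUP BY day
ORDER BY day;
```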
## lemmatize
Performs lemmatization on a given word. Needs dictionaries to operate, which can be obtained [here](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).
*Syntax*
``` sql
lemmatize('language', word)
```
*Arguments*
- `language` — Language whose rules will be applied. [String](../data-types/string.md#string).
- `word` — Word that needs to be lemmatized. Must be lowercase. [String](../data-types/string.md#string).
*Examples*
Query:
``` sql
SELECT lemmatize('en', 'wolves');
```
Result:
``` text
┌─lemmatize("wolves")─┐
│ "wolf" │
└─────────────────────┘
```
*Configuration*
This configuration specifies that the dictionary `en.bin` should be used for lemmatization of English (`en`) words. The `.bin` files can be downloaded from
[here](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).
``` xml
<lemmatizers>
<lemmatizer>
<!-- highlight-start -->
<lang>en</lang>
<path>en.bin</path>
<!-- highlight-end -->
</lemmatizer>
</lemmatizers>
```
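With the `en` lemmatizer configured as above, whole sentences can be lemmatized token by token using the generic array functions. This is a usage sketch, not a separate mode of the function:

```sql
-- Split a sentence into lowercase tokens and lemmatize each one.
SELECT arrayMap(w -> lemmatize('en', w),
                splitByWhitespace(lower('the wolves were running'))) AS lemmas;
```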
## stem
Performs stemming on a given word.
*Syntax*
``` sql
stem('language', word)
```
*Arguments*
- `language` — Language whose rules will be applied. Use the two-letter [ISO 639-1 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).
- `word` — Word that needs to be stemmed. Must be lowercase. [String](../data-types/string.md#string).
*Examples*
Query:
@ -40,7 +311,7 @@ Result:
│ ['I','think','it','is','a','bless','in','disguis'] │
└────────────────────────────────────────────────────┘
```
*Supported languages for stem()*
:::note
The stem() function uses the [Snowball stemming](https://snowballstem.org/) library. See the Snowball website for the up-to-date list of supported languages.
@ -76,53 +347,6 @@ The stem() function uses the [Snowball stemming](https://snowballstem.org/) libr
- Turkish
- Yiddish
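A common use of stemming is matching differently inflected forms of a search term. A sketch, assuming a hypothetical `docs` table with an English `body` column:

```sql
-- Match rows where any stemmed token equals the stemmed query term,
-- so that e.g. 'blessing' and 'blessings' both reduce to 'bless'.
SELECT count()
FROM docs
WHERE has(arrayMap(w -> stem('en', w), splitByWhitespace(lower(body))),
          stem('en', 'blessing'));
```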
## synonyms
Finds synonyms to a given word. There are two types of synonym extensions: `plain` and `wordnet`.
@ -131,18 +355,18 @@ With the `plain` extension type we need to provide a path to a simple text file,
With the `wordnet` extension type we need to provide a path to a directory with WordNet thesaurus in it. Thesaurus must contain a WordNet sense index.
*Syntax*
``` sql
synonyms('extension_name', word)
```
*Arguments*
- `extension_name` — Name of the extension in which search will be performed. [String](../data-types/string.md#string).
- `word` — Word that will be searched in extension. [String](../data-types/string.md#string).
*Examples*
Query:
@ -158,7 +382,7 @@ Result:
└──────────────────────────────────────────┘
```
*Configuration*
``` xml
<synonyms_extensions>
<extension>
@ -173,153 +397,3 @@ Result:
</extension>
</synonyms_extensions>
```
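A configured extension can be used to widen a keyword search to synonyms as well. A sketch, assuming an extension named `en` and a hypothetical `docs` table with a `body` column:

```sql
-- Match rows containing the search term or any of its synonyms.
SELECT count()
FROM docs
WHERE hasAny(splitByWhitespace(lower(body)),
             arrayConcat(['important'], synonyms('en', 'important')));
```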


@ -48,6 +48,7 @@ AutoML
Autocompletion
AvroConfluent
BIGINT
bigrams
BIGSERIAL
BORO
BSON
@ -1008,6 +1009,7 @@ UncompressedCacheBytes
UncompressedCacheCells
UnidirectionalEdgeIsValid
UniqThetaSketch
unigrams
Updatable
Uppercased
Uptime
@ -1507,9 +1509,11 @@ deserializing
destructor
destructors
detectCharset
detectLanguage
detectLanguageMixed
detectLanguageUnknown
detectProgrammingLanguage
detectTonality
determinator
deterministically
dictGet