From 29aa96037758df616ec6253289e218abd7212626 Mon Sep 17 00:00:00 2001 From: Han Fei Date: Tue, 16 May 2023 09:07:35 +0200 Subject: [PATCH 01/10] refine docs for regexp tree dictionary --- docs/en/sql-reference/dictionaries/index.md | 107 ++++++++++++++++---- 1 file changed, 89 insertions(+), 18 deletions(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index 5801b7866cb..ad5a65df994 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2197,13 +2197,13 @@ Result: └─────────────────────────────────┴───────┘ ``` -## RegExp Tree Dictionary {#regexp-tree-dictionary} +## RegExpTree Dictionary {#regexp-tree-dictionary} -Regexp Tree dictionary stores multiple trees of regular expressions with attributions. Users can retrieve strings in the dictionary. If a string matches the root of the regexp tree, we will collect the corresponding attributes of the matched root and continue to walk the children. If any of the children matches the string, we will collect attributes and rewrite the old ones if conflicts occur, then continue the traverse until we reach leaf nodes. +RegExpTree Dictionary is designed to store multiple regular expressions in a dictionary, and query if a string could match one or multiple regular expressions. In some senarioes, for example, User Agent Parser, this data structure is very useful. We can use it in both local and cloud environment. -Example of the ddl query for creating Regexp Tree dictionary: +### Use RegExpTree Dictionary in local environment - +In local environment, we create RegexpTree dictionary by a yaml file: ```sql create dictionary regexp_dict @@ -2218,11 +2218,9 @@ LAYOUT(regexp_tree) ... ``` -**Source** +The dictionary source `YAMLRegExpTree` represents the structure of a Regexp Tree. For example: -We introduce a type of source called `YAMLRegExpTree` representing the structure of Regexp Tree dictionary. 
An Example of a valid yaml config is like: - -```xml +```yaml - regexp: 'Linux/(\d+[\.\d]*).+tlinux' name: 'TencentOS' version: '\1' @@ -2240,17 +2238,15 @@ We introduce a type of source called `YAMLRegExpTree` representing the structure version: '10' ``` -The key `regexp` represents the regular expression of a tree node. The name of key is same as the dictionary key. The `name` and `version` is user-defined attributions in the dicitionary. The `versions` (which can be any name that not appear in attributions or the key) indicates the children nodes of this tree. +This config consists of a list of RegExpTree nodes. Each node has following structure: -**Back Reference** +- **regexp** means the regular expression of this node. +- **user defined attributions** is a list of dictionary attributions defined in the dictionary structure. In this case, we have two attributions: `name` and `version`. The first nodes have the both attributions. The second node only have `name` attribution, because the `version` is defined in the children nodes. + - The value of an attribution could contain a **back reference** which refers to a capture group of the matched regular expression. Reference number ranges from 1 to 9 and writes as `$1` or `\1`. During the query execution, the back reference in the value will be replaced by the matched capture group. +- **children nodes** is the secondary layer of the RegExpTree nodes, which also contains a list of RegExpTree nodes. If a string matches a regexp node in the first layer, the dictionary will check if the string matches the children nodes of it. If it matches, we assign the attributions of the matching nodes. If two or more nodes define the same attribution, chilren nodes have more priority. + - the name of **children nodes** in yaml files can be arbitrary. -The value of an attribution could contain a back reference which refers to a capture group of the matched regular expression. 
Reference number ranges from 1 to 9 and writes as `$1` or `\1`. - -During the query execution, the back reference in the value will be replaced by the matched capture group. - -**Query** - -Due to the specialty of Regexp Tree dictionary, we only allow functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull` work with it. +Due to the specialty of Regexp Tree dictionary, we only allow functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull`. Example: @@ -2260,12 +2256,87 @@ SELECT dictGet('regexp_dict', ('name', 'version'), '31/tclwebkit1024'); Result: -``` +```text ┌─dictGet('regexp_dict', ('name', 'version'), '31/tclwebkit1024')─┐ │ ('Andriod','12') │ └─────────────────────────────────────────────────────────────────┘ ``` +Explain: + +In this case, we match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in the first layer, so the dictionary will continue to look into the children nodes in the second layer and find it matches `3[12]/tclwebkit`. As a result, the value of `name` is `Andriod` defined in the first layer and the value of `version` is `12` defined in the second layer. + +### Use RegExpTree Dictionary on cloud + +We have shown how RegExpTree work in the local enviroument, but we cannot use `YAMLRegExpTree` on cloud. If we have a local yaml file, we can use this file to create RegExpTree Dictionary in the local enviroment, then dump this dictionary to a csv file by `dictionary` table function and [INTO OUTFILE](../../statements/select/into-outfile.md) clause. 
+ +```sql +select * from dictionary(regexp_dict) into outfile('regexp_dict.csv') +``` + +The content of csv file is: + +```text +1,0,"Linux/(\d+[\.\d]*).+tlinux","['version','name']","['\\1','TencentOS']" +2,0,"(\d+)/tclwebkit(\d+[\.\d]*)","['comment','version','name']","['test $1 and $2','$1','Andriod']" +3,2,"33/tclwebkit","['version']","['13']" +4,2,"3[12]/tclwebkit","['version']","['12']" +5,2,"3[12]/tclwebkit","['version']","['11']" +6,2,"3[12]/tclwebkit","['version']","['10']" +``` + +The schema of dumped file is always + +- `id UInt64` represents the identify number of the RegexpTree node. +- `parent_id UInt64` represents the id of the parent of a node. +- `regexp String` represents the regular expression string. +- `keys Array(String)` represents the names of user defined attributions. +- `values Array(String)` represents the values of user defined attributions. + +On the cloud, we can create a table `regexp_dictionary_source_table` with the above table structure. + +```sql +CREATE TABLE regexp_dictionary_source_table +( + id UInt64, + parent_id UInt64, + regexp String, + keys Array(String), + values Array(String) +) ENGINE=Memory; +``` + +Then update the local CSV by + +```bash +clickhouse client \ + --host MY_HOST \ + --secure \ + --password MY_PASSWORD \ + --query " + insert into regexp_dictionary_source_table + select * from input ('id UInt64, parent_id UInt64, regexp String, keys Array(String), values Array(String)') + FORMAT CSV" < regexp_dict.csv +``` + +You can see how to [Insert Local Files](https://clickhouse.com/docs/en/integrations/data-ingestion/insert-local-files) for more details. 
After we initialize the source table, we can create a RegexpTree by table source: + +``` sql +create dictionary regexp_dict +( + regexp String, + name String, + version String +PRIMARY KEY(regexp) +SOURCE(CLICKHOUSE(TABLE 'regexp_dictionary_source_table')) +LIFETIME(0) +LAYOUT(regexp_tree); +``` + +### Use RegexpTree Dictionary as a UA Parser + +With a powerful yaml configure file, we can use RepexpTree dictionary as a UA parser. We support [uap-core](https://github.com/ua-parser/uap-core) and demostrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh) + ## Embedded Dictionaries {#embedded-dictionaries} From e4e473ef30ff4a6ca3f198e3da6d21b1ff85ccbd Mon Sep 17 00:00:00 2001 From: Han Fei Date: Tue, 16 May 2023 11:22:14 +0200 Subject: [PATCH 02/10] Update docs/en/sql-reference/dictionaries/index.md Co-authored-by: Sergei Trifonov --- docs/en/sql-reference/dictionaries/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index ad5a65df994..1ab6370c977 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2199,7 +2199,7 @@ Result: ## RegExpTree Dictionary {#regexp-tree-dictionary} -RegExpTree Dictionary is designed to store multiple regular expressions in a dictionary, and query if a string could match one or multiple regular expressions. In some senarioes, for example, User Agent Parser, this data structure is very useful. We can use it in both local and cloud environment. +RegExpTree Dictionary is designed to store multiple regular expressions in a dictionary, and query if a string could match one or multiple regular expressions. In some scenarios, for example, User Agent Parser, this data structure is very useful. 
We can use it in both local and cloud environments. ### Use RegExpTree Dictionary in local environment From 31b8e3c4892ff03340314274ab715506aff0e95b Mon Sep 17 00:00:00 2001 From: Han Fei Date: Tue, 16 May 2023 11:22:24 +0200 Subject: [PATCH 03/10] Update docs/en/sql-reference/dictionaries/index.md Co-authored-by: Sergei Trifonov --- docs/en/sql-reference/dictionaries/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index 1ab6370c977..b50ad4bf365 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2241,9 +2241,9 @@ The dictionary source `YAMLRegExpTree` represents the structure of a Regexp Tree This config consists of a list of RegExpTree nodes. Each node has following structure: - **regexp** means the regular expression of this node. -- **user defined attributions** is a list of dictionary attributions defined in the dictionary structure. In this case, we have two attributions: `name` and `version`. The first nodes have the both attributions. The second node only have `name` attribution, because the `version` is defined in the children nodes. +- **user defined attributions** is a list of dictionary attributions defined in the dictionary structure. In this case, we have two attributions: `name` and `version`. The first nodes have both attributions. The second node only has `name` attribution, because the `version` is defined in the children nodes. - The value of an attribution could contain a **back reference** which refers to a capture group of the matched regular expression. Reference number ranges from 1 to 9 and writes as `$1` or `\1`. During the query execution, the back reference in the value will be replaced by the matched capture group. -- **children nodes** is the secondary layer of the RegExpTree nodes, which also contains a list of RegExpTree nodes. 
If a string matches a regexp node in the first layer, the dictionary will check if the string matches the children nodes of it. If it matches, we assign the attributions of the matching nodes. If two or more nodes define the same attribution, chilren nodes have more priority. +- **children nodes** is the secondary layer of the RegExpTree nodes, which also contains a list of RegExpTree nodes. If a string matches a regexp node in the first layer, the dictionary will check if the string matches the children nodes of it. If it matches, we assign the attributions of the matching nodes. If two or more nodes define the same attribution, children nodes have more priority. - the name of **children nodes** in yaml files can be arbitrary. Due to the specialty of Regexp Tree dictionary, we only allow functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull`. From ed5906f15ded79823b9b4988cca6b335a6908100 Mon Sep 17 00:00:00 2001 From: Han Fei Date: Tue, 16 May 2023 11:22:31 +0200 Subject: [PATCH 04/10] Update docs/en/sql-reference/dictionaries/index.md Co-authored-by: Sergei Trifonov --- docs/en/sql-reference/dictionaries/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index b50ad4bf365..086c0b5c0ed 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2335,7 +2335,7 @@ LAYOUT(regexp_tree); ### Use RegexpTree Dictionary as a UA Parser -With a powerful yaml configure file, we can use RepexpTree dictionary as a UA parser. We support [uap-core](https://github.com/ua-parser/uap-core) and demostrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh) +With a powerful yaml configure file, we can use RepexpTree dictionary as a UA parser. 
We support [uap-core](https://github.com/ua-parser/uap-core) and demonstrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh) ## Embedded Dictionaries {#embedded-dictionaries} From a40d86b921ac3a519b7c9d65b41afbdf6667d2b8 Mon Sep 17 00:00:00 2001 From: Han Fei Date: Tue, 16 May 2023 11:22:42 +0200 Subject: [PATCH 05/10] Update docs/en/sql-reference/dictionaries/index.md Co-authored-by: Sergei Trifonov --- docs/en/sql-reference/dictionaries/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index 086c0b5c0ed..66f661cce60 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2268,7 +2268,7 @@ In this case, we match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in t ### Use RegExpTree Dictionary on cloud -We have shown how RegExpTree work in the local enviroument, but we cannot use `YAMLRegExpTree` on cloud. If we have a local yaml file, we can use this file to create RegExpTree Dictionary in the local enviroment, then dump this dictionary to a csv file by `dictionary` table function and [INTO OUTFILE](../../statements/select/into-outfile.md) clause. +We have shown how RegExpTree work in the local environment, but we cannot use `YAMLRegExpTree` in the cloud. If we have a local yaml file, we can use this file to create RegExpTree Dictionary in the local environment, then dump this dictionary to a csv file by the `dictionary` table function and [INTO OUTFILE](../../statements/select/into-outfile.md) clause. 
```sql select * from dictionary(regexp_dict) into outfile('regexp_dict.csv') From 7df0e9d933267d7a2595e7d970c190650e09954c Mon Sep 17 00:00:00 2001 From: Han Fei Date: Tue, 16 May 2023 15:33:08 +0200 Subject: [PATCH 06/10] fix broken link --- docs/en/sql-reference/dictionaries/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index 66f661cce60..4abc41cdf42 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2268,7 +2268,7 @@ In this case, we match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in t ### Use RegExpTree Dictionary on cloud -We have shown how RegExpTree work in the local environment, but we cannot use `YAMLRegExpTree` in the cloud. If we have a local yaml file, we can use this file to create RegExpTree Dictionary in the local environment, then dump this dictionary to a csv file by the `dictionary` table function and [INTO OUTFILE](../../statements/select/into-outfile.md) clause. +We have shown how RegExpTree work in the local environment, but we cannot use `YAMLRegExpTree` in the cloud. If we have a local yaml file, we can use this file to create RegExpTree Dictionary in the local environment, then dump this dictionary to a csv file by the `dictionary` table function and [INTO OUTFILE](../statements/select/into-outfile.md) clause. 
```sql select * from dictionary(regexp_dict) into outfile('regexp_dict.csv') From 549af4d35112f598f3bbaaf196d0d994ff9076dc Mon Sep 17 00:00:00 2001 From: Han Fei Date: Wed, 17 May 2023 21:23:32 +0200 Subject: [PATCH 07/10] address comments --- docs/en/sql-reference/dictionaries/index.md | 50 +++++++++---------- ...4_regexp_dictionary_table_source.reference | 40 +++++++-------- .../02504_regexp_dictionary_table_source.sql | 14 +++--- 3 files changed, 50 insertions(+), 54 deletions(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index 4abc41cdf42..eb45247e74a 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2197,13 +2197,13 @@ Result: └─────────────────────────────────┴───────┘ ``` -## RegExpTree Dictionary {#regexp-tree-dictionary} +## Regular Expression Tree Dictionary {#regexp-tree-dictionary} -RegExpTree Dictionary is designed to store multiple regular expressions in a dictionary, and query if a string could match one or multiple regular expressions. In some scenarios, for example, User Agent Parser, this data structure is very useful. We can use it in both local and cloud environments. +Regular expression tree dictionaries are a special type of dictionary which represent the mapping from key to attributes using a tree of regular expressions. There are some use cases, e.g. parsing of (user agent)[https://en.wikipedia.org/wiki/User_agent] strings, which can be expressed elegantly with regexp tree dictionaries. -### Use RegExpTree Dictionary in local environment +### Use Regular Expression Tree Dictionary in ClickHouse Open-Source Environment -In local environment, we create RegexpTree dictionary by a yaml file: +Regular expression tree dictionaries are defined in ClickHouse open-source using the YAMLRegExpTree source which is provided the path to a YAML file containing the regular expression tree. 
```sql create dictionary regexp_dict @@ -2218,7 +2218,7 @@ LAYOUT(regexp_tree) ... ``` -The dictionary source `YAMLRegExpTree` represents the structure of a Regexp Tree. For example: +The dictionary source `YAMLRegExpTree` represents the structure of a regexp tree. For example: ```yaml - regexp: 'Linux/(\d+[\.\d]*).+tlinux' @@ -2226,7 +2226,7 @@ The dictionary source `YAMLRegExpTree` represents the structure of a Regexp Tree version: '\1' - regexp: '\d+/tclwebkit(?:\d+[\.\d]*)' - name: 'Andriod' + name: 'Android' versions: - regexp: '33/tclwebkit' version: '13' @@ -2238,15 +2238,15 @@ The dictionary source `YAMLRegExpTree` represents the structure of a Regexp Tree version: '10' ``` -This config consists of a list of RegExpTree nodes. Each node has following structure: +This config consists of a list of Regular Expression Tree nodes. Each node has following structure: - **regexp** means the regular expression of this node. -- **user defined attributions** is a list of dictionary attributions defined in the dictionary structure. In this case, we have two attributions: `name` and `version`. The first nodes have both attributions. The second node only has `name` attribution, because the `version` is defined in the children nodes. - - The value of an attribution could contain a **back reference** which refers to a capture group of the matched regular expression. Reference number ranges from 1 to 9 and writes as `$1` or `\1`. During the query execution, the back reference in the value will be replaced by the matched capture group. -- **children nodes** is the secondary layer of the RegExpTree nodes, which also contains a list of RegExpTree nodes. If a string matches a regexp node in the first layer, the dictionary will check if the string matches the children nodes of it. If it matches, we assign the attributions of the matching nodes. If two or more nodes define the same attribution, children nodes have more priority. 
- - the name of **children nodes** in yaml files can be arbitrary. +- **user defined attributes** is a list of dictionary attributes defined in the dictionary structure. In this case, we have two attributes: `name` and `version`. The first node has both attributes. The second node only has `name` attribute, because the `version` is defined in the children nodes. + - The value of an attribute could contain a **back reference** which refers to a capture group of the matched regular expression. Reference number ranges from 1 to 9 and writes as `$1` or `\1`. During the query execution, the back reference in the value will be replaced by the matched capture group. +- **children nodes** is the children of a regexp tree node, which has their own attributes and children nodes. String matching preceeds in a depth-first fasion. If a string matches any regexp node in the top layer, the dictionary checks if the string matches the children nodes of it. If it matches, we assign the attributes of the matching nodes. If two or more nodes define the same attribute, children nodes have more priority. + - the name of **children nodes** in YAML files can be arbitrary. -Due to the specialty of Regexp Tree dictionary, we only allow functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull`. +Due to the specialty of regexp tree dictionary, we only allow functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull`. Example: @@ -2258,17 +2258,17 @@ Result: ```text ┌─dictGet('regexp_dict', ('name', 'version'), '31/tclwebkit1024')─┐ -│ ('Andriod','12') │ +│ ('Android','12') │ └─────────────────────────────────────────────────────────────────┘ ``` -Explain: +In this case, we match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in the top layer's second node, so the dictionary continues to look into the children nodes and find it matches `3[12]/tclwebkit`. As a result, the value of `name` is `Android` defined in the first layer and the value of `version` is `12` defined the child node. 
-In this case, we match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in the first layer, so the dictionary will continue to look into the children nodes in the second layer and find it matches `3[12]/tclwebkit`. As a result, the value of `name` is `Andriod` defined in the first layer and the value of `version` is `12` defined in the second layer. +With a powerful YAML configure file, we can use RepexpTree dictionary as a UA parser. We support [uap-core](https://github.com/ua-parser/uap-core) and demonstrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh) -### Use RegExpTree Dictionary on cloud +### Use Regular Expression Tree Dictionary in ClickHouse Cloud -We have shown how RegExpTree work in the local environment, but we cannot use `YAMLRegExpTree` in the cloud. If we have a local yaml file, we can use this file to create RegExpTree Dictionary in the local environment, then dump this dictionary to a csv file by the `dictionary` table function and [INTO OUTFILE](../statements/select/into-outfile.md) clause. +We have shown how Regular Expression Tree work in the local environment, but we cannot use `YAMLRegExpTree` in the cloud. If we have a local YAML file, we can use this file to create Regular Expression Tree Dictionary in the local environment, then dump this dictionary to a csv file by the `dictionary` table function and [INTO OUTFILE](../statements/select/into-outfile.md) clause. 
```sql select * from dictionary(regexp_dict) into outfile('regexp_dict.csv') @@ -2278,7 +2278,7 @@ The content of csv file is: ```text 1,0,"Linux/(\d+[\.\d]*).+tlinux","['version','name']","['\\1','TencentOS']" -2,0,"(\d+)/tclwebkit(\d+[\.\d]*)","['comment','version','name']","['test $1 and $2','$1','Andriod']" +2,0,"(\d+)/tclwebkit(\d+[\.\d]*)","['comment','version','name']","['test $1 and $2','$1','Android']" 3,2,"33/tclwebkit","['version']","['13']" 4,2,"3[12]/tclwebkit","['version']","['12']" 5,2,"3[12]/tclwebkit","['version']","['11']" @@ -2287,11 +2287,11 @@ The content of csv file is: The schema of dumped file is always -- `id UInt64` represents the identify number of the RegexpTree node. +- `id UInt64` represents the id of the RegexpTree node. - `parent_id UInt64` represents the id of the parent of a node. - `regexp String` represents the regular expression string. -- `keys Array(String)` represents the names of user defined attributions. -- `values Array(String)` represents the values of user defined attributions. +- `keys Array(String)` represents the names of user defined attributes. +- `values Array(String)` represents the values of user defined attributes. On the cloud, we can create a table `regexp_dictionary_source_table` with the above table structure. @@ -2314,8 +2314,8 @@ clickhouse client \ --secure \ --password MY_PASSWORD \ --query " - insert into regexp_dictionary_source_table - select * from input ('id UInt64, parent_id UInt64, regexp String, keys Array(String), values Array(String)') + INSERT INTO regexp_dictionary_source_table + SELECT * FROM input ('id UInt64, parent_id UInt64, regexp String, keys Array(String), values Array(String)') FORMAT CSV" < regexp_dict.csv ``` @@ -2333,10 +2333,6 @@ LIFETIME(0) LAYOUT(regexp_tree); ``` -### Use RegexpTree Dictionary as a UA Parser - -With a powerful yaml configure file, we can use RepexpTree dictionary as a UA parser. 
We support [uap-core](https://github.com/ua-parser/uap-core) and demonstrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh) - ## Embedded Dictionaries {#embedded-dictionaries} diff --git a/tests/queries/0_stateless/02504_regexp_dictionary_table_source.reference b/tests/queries/0_stateless/02504_regexp_dictionary_table_source.reference index 86a74291b07..4e72cf4ce37 100644 --- a/tests/queries/0_stateless/02504_regexp_dictionary_table_source.reference +++ b/tests/queries/0_stateless/02504_regexp_dictionary_table_source.reference @@ -1,11 +1,11 @@ 1 0 Linux/(\\d+[\\.\\d]*).+tlinux ['version','name'] ['\\1','TencentOS'] -2 0 (\\d+)/tclwebkit(\\d+[\\.\\d]*) ['comment','version','name'] ['test $1 and $2','$1','Andriod'] +2 0 (\\d+)/tclwebkit(\\d+[\\.\\d]*) ['comment','version','name'] ['test $1 and $2','$1','Android'] 3 2 33/tclwebkit ['version'] ['13'] 4 2 3[12]/tclwebkit ['version'] ['12'] 5 2 3[12]/tclwebkit ['version'] ['11'] 6 2 3[12]/tclwebkit ['version'] ['10'] ('TencentOS',101,'nothing') -('Andriod',13,'test 33 and 11.10') +('Android',13,'test 33 and 11.10') ('',NULL,'nothing') ('',0,'default') 30/tclwebkit0 @@ -23,22 +23,22 @@ 42/tclwebkit12 43/tclwebkit13 44/tclwebkit14 -('Andriod',30) -('Andriod',12) -('Andriod',12) -('Andriod',13) -('Andriod',34) -('Andriod',35) -('Andriod',36) -('Andriod',37) -('Andriod',38) -('Andriod',39) -('Andriod',40) -('Andriod',41) -('Andriod',42) -('Andriod',43) -('Andriod',44) -('Andriod1',33,'matched 3') -1 0 (\\d+)/tclwebkit ['version','name'] ['$1','Andriod'] +('Android',30) +('Android',12) +('Android',12) +('Android',13) +('Android',34) +('Android',35) +('Android',36) +('Android',37) +('Android',38) +('Android',39) +('Android',40) +('Android',41) +('Android',42) +('Android',43) +('Android',44) +('Android1',33,'matched 3') +1 0 (\\d+)/tclwebkit ['version','name'] ['$1','Android'] 2 0 
33/tclwebkit ['comment','version'] ['matched 3','13'] -3 1 33/tclwebkit ['name'] ['Andriod1'] +3 1 33/tclwebkit ['name'] ['Android1'] diff --git a/tests/queries/0_stateless/02504_regexp_dictionary_table_source.sql b/tests/queries/0_stateless/02504_regexp_dictionary_table_source.sql index 15e8adce403..42d7acbf057 100644 --- a/tests/queries/0_stateless/02504_regexp_dictionary_table_source.sql +++ b/tests/queries/0_stateless/02504_regexp_dictionary_table_source.sql @@ -1,7 +1,7 @@ -- Tags: use-vectorscan -DROP TABLE IF EXISTS regexp_dictionary_source_table; DROP DICTIONARY IF EXISTS regexp_dict1; +DROP TABLE IF EXISTS regexp_dictionary_source_table; CREATE TABLE regexp_dictionary_source_table ( @@ -15,7 +15,7 @@ CREATE TABLE regexp_dictionary_source_table -- test back reference. INSERT INTO regexp_dictionary_source_table VALUES (1, 0, 'Linux/(\d+[\.\d]*).+tlinux', ['name', 'version'], ['TencentOS', '\1']) -INSERT INTO regexp_dictionary_source_table VALUES (2, 0, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Andriod', '$1', 'test $1 and $2']) +INSERT INTO regexp_dictionary_source_table VALUES (2, 0, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Android', '$1', 'test $1 and $2']) INSERT INTO regexp_dictionary_source_table VALUES (3, 2, '33/tclwebkit', ['version'], ['13']) INSERT INTO regexp_dictionary_source_table VALUES (4, 2, '3[12]/tclwebkit', ['version'], ['12']) INSERT INTO regexp_dictionary_source_table VALUES (5, 2, '3[12]/tclwebkit', ['version'], ['11']) @@ -65,14 +65,14 @@ SYSTEM RELOAD dictionary regexp_dict1; -- { serverError 489 } truncate table regexp_dictionary_source_table; INSERT INTO regexp_dictionary_source_table VALUES (1, 2, 'Linux/(\d+[\.\d]*).+tlinux', ['name', 'version'], ['TencentOS', '\1']) -INSERT INTO regexp_dictionary_source_table VALUES (2, 3, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Andriod', '$1', 'test $1 and $2']) -INSERT INTO regexp_dictionary_source_table VALUES (3, 1, 
'(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Andriod', '$1', 'test $1 and $2']) +INSERT INTO regexp_dictionary_source_table VALUES (2, 3, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Android', '$1', 'test $1 and $2']) +INSERT INTO regexp_dictionary_source_table VALUES (3, 1, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Android', '$1', 'test $1 and $2']) SYSTEM RELOAD dictionary regexp_dict1; -- { serverError 489 } -- test priority truncate table regexp_dictionary_source_table; -INSERT INTO regexp_dictionary_source_table VALUES (1, 0, '(\d+)/tclwebkit', ['name', 'version'], ['Andriod', '$1']); -INSERT INTO regexp_dictionary_source_table VALUES (3, 1, '33/tclwebkit', ['name'], ['Andriod1']); -- child has more priority than parents. +INSERT INTO regexp_dictionary_source_table VALUES (1, 0, '(\d+)/tclwebkit', ['name', 'version'], ['Android', '$1']); +INSERT INTO regexp_dictionary_source_table VALUES (3, 1, '33/tclwebkit', ['name'], ['Android1']); -- child has more priority than parents. INSERT INTO regexp_dictionary_source_table VALUES (2, 0, '33/tclwebkit', ['version', 'comment'], ['13', 'matched 3']); -- larger id has lower priority than small id. 
SYSTEM RELOAD dictionary regexp_dict1; select dictGet(regexp_dict1, ('name', 'version', 'comment'), '33/tclwebkit'); @@ -82,6 +82,6 @@ SYSTEM RELOAD dictionary regexp_dict1; -- { serverError 489 } select * from dictionary(regexp_dict1); +DROP DICTIONARY IF EXISTS regexp_dict1; DROP TABLE IF EXISTS regexp_dictionary_source_table; DROP TABLE IF EXISTS needle_table; -DROP DICTIONARY IF EXISTS regexp_dict1; From 312f751503a12fc9612f071f89a48266dfb42c37 Mon Sep 17 00:00:00 2001 From: Robert Schulze Date: Sun, 21 May 2023 13:08:55 +0000 Subject: [PATCH 08/10] Uppercase remaining SQL keywords --- docs/en/sql-reference/dictionaries/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index eb45247e74a..4e9bb2936db 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2206,7 +2206,7 @@ Regular expression tree dictionaries are a special type of dictionary which repr Regular expression tree dictionaries are defined in ClickHouse open-source using the YAMLRegExpTree source which is provided the path to a YAML file containing the regular expression tree. ```sql -create dictionary regexp_dict +CREATE DICTIONARY regexp_dict ( regexp String, name String, @@ -2322,7 +2322,7 @@ clickhouse client \ You can see how to [Insert Local Files](https://clickhouse.com/docs/en/integrations/data-ingestion/insert-local-files) for more details. 
After we initialize the source table, we can create a RegexpTree by table source: ``` sql -create dictionary regexp_dict +CREATE DICTIONARY regexp_dict ( regexp String, name String, From 9d9d4e3d62ea042952bf808143898529dd86822d Mon Sep 17 00:00:00 2001 From: Robert Schulze Date: Sun, 21 May 2023 13:33:03 +0000 Subject: [PATCH 09/10] Some fixups --- docs/en/sql-reference/dictionaries/index.md | 37 ++++++++++----------- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index 4e9bb2936db..522fe132a66 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2201,7 +2201,7 @@ Result: Regular expression tree dictionaries are a special type of dictionary which represent the mapping from key to attributes using a tree of regular expressions. There are some use cases, e.g. parsing of (user agent)[https://en.wikipedia.org/wiki/User_agent] strings, which can be expressed elegantly with regexp tree dictionaries. -### Use Regular Expression Tree Dictionary in ClickHouse Open-Source Environment +### Use Regular Expression Tree Dictionary in ClickHouse Open-Source Regular expression tree dictionaries are defined in ClickHouse open-source using the YAMLRegExpTree source which is provided the path to a YAML file containing the regular expression tree. @@ -2238,15 +2238,14 @@ The dictionary source `YAMLRegExpTree` represents the structure of a regexp tree version: '10' ``` -This config consists of a list of Regular Expression Tree nodes. Each node has following structure: +This config consists of a list of regular expression tree nodes. Each node has the following structure: -- **regexp** means the regular expression of this node. -- **user defined attributes** is a list of dictionary attributes defined in the dictionary structure. In this case, we have two attributes: `name` and `version`. The first node has both attributes. 
The second node only has `name` attribute, because the `version` is defined in the children nodes. - - The value of an attribute could contain a **back reference** which refers to a capture group of the matched regular expression. Reference number ranges from 1 to 9 and writes as `$1` or `\1`. During the query execution, the back reference in the value will be replaced by the matched capture group. -- **children nodes** is the children of a regexp tree node, which has their own attributes and children nodes. String matching preceeds in a depth-first fasion. If a string matches any regexp node in the top layer, the dictionary checks if the string matches the children nodes of it. If it matches, we assign the attributes of the matching nodes. If two or more nodes define the same attribute, children nodes have more priority. - - the name of **children nodes** in YAML files can be arbitrary. +- **regexp**: the regular expression of the node. +- **attributes**: a list of user-defined dictionary attributes. In this example, there are two attributes: `name` and `version`. The first node defines both attributes. The second node only defines attribute `name`. Attribute `version` is provided by the child nodes of the second node. + - The value of an attribute may contain **back references**, referring to capture groups of the matched regular expression. In the example, the value of attribute `version` in the first node consists of a back-reference `\1` to capture group `(\d+[\.\d]*)` in the regular expression. Back-reference numbers range from 1 to 9 and are written as `$1` or `\1` (for number 1). The back reference is replaced by the matched capture group during query execution. +- **child nodes**: a list of children of a regexp tree node, each of which has its own attributes and (potentially) children nodes. String matching proceeds in a depth-first fashion. If a string matches a regexp node, the dictionary checks if it also matches the nodes' child nodes. 
If that is the case, the attributes of the deepest matching node are assigned. Attributes of a child node overwrite equally named attributes of parent nodes. The name of child nodes in YAML files can be arbitrary, e.g. `versions` in above example. -Due to the specialty of regexp tree dictionary, we only allow functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull`. +Regexp tree dictionaries only allow access using functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull`. Example: @@ -2262,16 +2261,16 @@ Result: └─────────────────────────────────────────────────────────────────┘ ``` -In this case, we match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in the top layer's second node, so the dictionary continues to look into the children nodes and find it matches `3[12]/tclwebkit`. As a result, the value of `name` is `Android` defined in the first layer and the value of `version` is `12` defined the child node. +In this case, we first match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in the top layer's second node. The dictionary then continues to look into the child nodes and find that the string also matches `3[12]/tclwebkit`. As a result, the value of attribute `name` is `Android` (defined in the first layer) and the value of `version` is `12` (defined the child node). -With a powerful YAML configure file, we can use RepexpTree dictionary as a UA parser. We support [uap-core](https://github.com/ua-parser/uap-core) and demonstrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh) +With a powerful YAML configure file, we can use a regexp tree dictionaries as a user agent string parser. 
We support [uap-core](https://github.com/ua-parser/uap-core) and demonstrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh) ### Use Regular Expression Tree Dictionary in ClickHouse Cloud -We have shown how Regular Expression Tree work in the local environment, but we cannot use `YAMLRegExpTree` in the cloud. If we have a local YAML file, we can use this file to create Regular Expression Tree Dictionary in the local environment, then dump this dictionary to a csv file by the `dictionary` table function and [INTO OUTFILE](../statements/select/into-outfile.md) clause. +Above used `YAMLRegExpTree` source works in ClickHouse Open Source but not in ClickHouse Cloud. To use regexp tree dictionaries in ClickHouse Cloud, first create a regexp tree dictionary from a YAML file locally in ClickHouse Open Source, then dump this dictionary into a CSV file using the `dictionary` table function and the [INTO OUTFILE](../statements/select/into-outfile.md) clause. ```sql -select * from dictionary(regexp_dict) into outfile('regexp_dict.csv') +SELECT * FROM dictionary(regexp_dict) INTO OUTFILE('regexp_dict.csv') ``` The content of csv file is: @@ -2285,15 +2284,15 @@ The content of csv file is: 6,2,"3[12]/tclwebkit","['version']","['10']" ``` -The schema of dumped file is always +The schema of dumped file is: -- `id UInt64` represents the id of the RegexpTree node. -- `parent_id UInt64` represents the id of the parent of a node. -- `regexp String` represents the regular expression string. -- `keys Array(String)` represents the names of user defined attributes. -- `values Array(String)` represents the values of user defined attributes. +- `id UInt64`: the id of the RegexpTree node. +- `parent_id UInt64`: the id of the parent of a node. +- `regexp String`: the regular expression string. 
+- `keys Array(String)`: the names of user-defined attributes. +- `values Array(String)`: the values of user-defined attributes. -On the cloud, we can create a table `regexp_dictionary_source_table` with the above table structure. +To create the dictionary in ClickHouse Cloud, first create a table `regexp_dictionary_source_table` with below table structure: ```sql CREATE TABLE regexp_dictionary_source_table From 491cf8b6e199757b35cef5273ea8c3cea76879b9 Mon Sep 17 00:00:00 2001 From: Robert Schulze Date: Sun, 21 May 2023 13:43:05 +0000 Subject: [PATCH 10/10] Fix minor mistakes --- docs/en/sql-reference/dictionaries/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/sql-reference/dictionaries/index.md b/docs/en/sql-reference/dictionaries/index.md index 522fe132a66..f593cbe9052 100644 --- a/docs/en/sql-reference/dictionaries/index.md +++ b/docs/en/sql-reference/dictionaries/index.md @@ -2261,7 +2261,7 @@ Result: └─────────────────────────────────────────────────────────────────┘ ``` -In this case, we first match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in the top layer's second node. The dictionary then continues to look into the child nodes and find that the string also matches `3[12]/tclwebkit`. As a result, the value of attribute `name` is `Android` (defined in the first layer) and the value of `version` is `12` (defined the child node). +In this case, we first match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in the top layer's second node. The dictionary then continues to look into the child nodes and finds that the string also matches `3[12]/tclwebkit`. As a result, the value of attribute `name` is `Android` (defined in the first layer) and the value of attribute `version` is `12` (defined in the child node). With a powerful YAML configure file, we can use a regexp tree dictionaries as a user agent string parser. 
We support [uap-core](https://github.com/ua-parser/uap-core) and demonstrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh)