Merge pull request #49902 from hanfei1991/hanfei/regexp-doc

refine docs for regexp tree dictionary
This commit is contained in:
Han Fei 2023-05-21 23:42:35 +02:00 committed by GitHub
commit da8590a26c
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 114 additions and 48 deletions

View File

@ -2214,16 +2214,16 @@ Result:
└─────────────────────────────────┴───────┘
```
## RegExp Tree Dictionary {#regexp-tree-dictionary}
## Regular Expression Tree Dictionary {#regexp-tree-dictionary}
Regexp Tree dictionary stores multiple trees of regular expressions with attributions. Users can retrieve strings in the dictionary. If a string matches the root of the regexp tree, we will collect the corresponding attributes of the matched root and continue to walk the children. If any of the children matches the string, we will collect attributes and rewrite the old ones if conflicts occur, then continue the traverse until we reach leaf nodes.
Regular expression tree dictionaries are a special type of dictionary which represent the mapping from key to attributes using a tree of regular expressions. There are some use cases, e.g. parsing of (user agent)[https://en.wikipedia.org/wiki/User_agent] strings, which can be expressed elegantly with regexp tree dictionaries.
Example of the ddl query for creating Regexp Tree dictionary:
### Use Regular Expression Tree Dictionary in ClickHouse Open-Source
<CloudDetails />
Regular expression tree dictionaries are defined in ClickHouse open-source using the YAMLRegExpTree source which is provided the path to a YAML file containing the regular expression tree.
```sql
create dictionary regexp_dict
CREATE DICTIONARY regexp_dict
(
regexp String,
name String,
@ -2235,17 +2235,15 @@ LAYOUT(regexp_tree)
...
```
**Source**
The dictionary source `YAMLRegExpTree` represents the structure of a regexp tree. For example:
We introduce a type of source called `YAMLRegExpTree` representing the structure of Regexp Tree dictionary. An Example of a valid yaml config is like:
```xml
```yaml
- regexp: 'Linux/(\d+[\.\d]*).+tlinux'
name: 'TencentOS'
version: '\1'
- regexp: '\d+/tclwebkit(?:\d+[\.\d]*)'
name: 'Andriod'
name: 'Android'
versions:
- regexp: '33/tclwebkit'
version: '13'
@ -2257,17 +2255,14 @@ We introduce a type of source called `YAMLRegExpTree` representing the structure
version: '10'
```
The key `regexp` represents the regular expression of a tree node. The name of key is same as the dictionary key. The `name` and `version` is user-defined attributions in the dicitionary. The `versions` (which can be any name that not appear in attributions or the key) indicates the children nodes of this tree.
This config consists of a list of regular expression tree nodes. Each node has the following structure:
**Back Reference**
- **regexp**: the regular expression of the node.
- **attributes**: a list of user-defined dictionary attributes. In this example, there are two attributes: `name` and `version`. The first node defines both attributes. The second node only defines attribute `name`. Attribute `version` is provided by the child nodes of the second node.
- The value of an attribute may contain **back references**, referring to capture groups of the matched regular expression. In the example, the value of attribute `version` in the first node consists of a back-reference `\1` to capture group `(\d+[\.\d]*)` in the regular expression. Back-reference numbers range from 1 to 9 and are written as `$1` or `\1` (for number 1). The back reference is replaced by the matched capture group during query execution.
- **child nodes**: a list of children of a regexp tree node, each of which has its own attributes and (potentially) children nodes. String matching proceeds in a depth-first fashion. If a string matches a regexp node, the dictionary checks if it also matches the nodes' child nodes. If that is the case, the attributes of the deepest matching node are assigned. Attributes of a child node overwrite equally named attributes of parent nodes. The name of child nodes in YAML files can be arbitrary, e.g. `versions` in above example.
The value of an attribution could contain a back reference which refers to a capture group of the matched regular expression. Reference number ranges from 1 to 9 and writes as `$1` or `\1`.
During the query execution, the back reference in the value will be replaced by the matched capture group.
**Query**
Due to the specialty of Regexp Tree dictionary, we only allow functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull` work with it.
Regexp tree dictionaries only allow access using functions `dictGet`, `dictGetOrDefault` and `dictGetOrNull`.
Example:
@ -2277,12 +2272,83 @@ SELECT dictGet('regexp_dict', ('name', 'version'), '31/tclwebkit1024');
Result:
```
```text
┌─dictGet('regexp_dict', ('name', 'version'), '31/tclwebkit1024')─┐
│ ('Andriod','12') │
│ ('Android','12') │
└─────────────────────────────────────────────────────────────────┘
```
In this case, we first match the regular expression `\d+/tclwebkit(?:\d+[\.\d]*)` in the top layer's second node. The dictionary then continues to look into the child nodes and finds that the string also matches `3[12]/tclwebkit`. As a result, the value of attribute `name` is `Android` (defined in the first layer) and the value of attribute `version` is `12` (defined the child node).
With a powerful YAML configure file, we can use a regexp tree dictionaries as a user agent string parser. We support [uap-core](https://github.com/ua-parser/uap-core) and demonstrate how to use it in the functional test [02504_regexp_dictionary_ua_parser](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/02504_regexp_dictionary_ua_parser.sh)
### Use Regular Expression Tree Dictionary in ClickHouse Cloud
Above used `YAMLRegExpTree` source works in ClickHouse Open Source but not in ClickHouse Cloud. To use regexp tree dictionaries in ClickHouse could, first create a regexp tree dictionary from a YAML file locally in ClickHouse Open Source, then dump this dictionary into a CSV file using the `dictionary` table function and the [INTO OUTFILE](../statements/select/into-outfile.md) clause.
```sql
SELECT * FROM dictionary(regexp_dict) INTO OUTFILE('regexp_dict.csv')
```
The content of csv file is:
```text
1,0,"Linux/(\d+[\.\d]*).+tlinux","['version','name']","['\\1','TencentOS']"
2,0,"(\d+)/tclwebkit(\d+[\.\d]*)","['comment','version','name']","['test $1 and $2','$1','Android']"
3,2,"33/tclwebkit","['version']","['13']"
4,2,"3[12]/tclwebkit","['version']","['12']"
5,2,"3[12]/tclwebkit","['version']","['11']"
6,2,"3[12]/tclwebkit","['version']","['10']"
```
The schema of dumped file is:
- `id UInt64`: the id of the RegexpTree node.
- `parent_id UInt64`: the id of the parent of a node.
- `regexp String`: the regular expression string.
- `keys Array(String)`: the names of user-defined attributes.
- `values Array(String)`: the values of user-defined attributes.
To create the dictionary in ClickHouse Cloud, first create a table `regexp_dictionary_source_table` with below table structure:
```sql
CREATE TABLE regexp_dictionary_source_table
(
id UInt64,
parent_id UInt64,
regexp String,
keys Array(String),
values Array(String)
) ENGINE=Memory;
```
Then update the local CSV by
```bash
clickhouse client \
--host MY_HOST \
--secure \
--password MY_PASSWORD \
--query "
INSERT INTO regexp_dictionary_source_table
SELECT * FROM input ('id UInt64, parent_id UInt64, regexp String, keys Array(String), values Array(String)')
FORMAT CSV" < regexp_dict.csv
```
You can see how to [Insert Local Files](https://clickhouse.com/docs/en/integrations/data-ingestion/insert-local-files) for more details. After we initialize the source table, we can create a RegexpTree by table source:
``` sql
CREATE DICTIONARY regexp_dict
(
regexp String,
name String,
version String
PRIMARY KEY(regexp)
SOURCE(CLICKHOUSE(TABLE 'regexp_dictionary_source_table'))
LIFETIME(0)
LAYOUT(regexp_tree);
```
## Embedded Dictionaries {#embedded-dictionaries}
<SelfManaged />

View File

@ -1,11 +1,11 @@
1 0 Linux/(\\d+[\\.\\d]*).+tlinux ['version','name'] ['\\1','TencentOS']
2 0 (\\d+)/tclwebkit(\\d+[\\.\\d]*) ['comment','version','name'] ['test $1 and $2','$1','Andriod']
2 0 (\\d+)/tclwebkit(\\d+[\\.\\d]*) ['comment','version','name'] ['test $1 and $2','$1','Android']
3 2 33/tclwebkit ['version'] ['13']
4 2 3[12]/tclwebkit ['version'] ['12']
5 2 3[12]/tclwebkit ['version'] ['11']
6 2 3[12]/tclwebkit ['version'] ['10']
('TencentOS',101,'nothing')
('Andriod',13,'test 33 and 11.10')
('Android',13,'test 33 and 11.10')
('',NULL,'nothing')
('',0,'default')
30/tclwebkit0
@ -23,22 +23,22 @@
42/tclwebkit12
43/tclwebkit13
44/tclwebkit14
('Andriod',30)
('Andriod',12)
('Andriod',12)
('Andriod',13)
('Andriod',34)
('Andriod',35)
('Andriod',36)
('Andriod',37)
('Andriod',38)
('Andriod',39)
('Andriod',40)
('Andriod',41)
('Andriod',42)
('Andriod',43)
('Andriod',44)
('Andriod1',33,'matched 3')
1 0 (\\d+)/tclwebkit ['version','name'] ['$1','Andriod']
('Android',30)
('Android',12)
('Android',12)
('Android',13)
('Android',34)
('Android',35)
('Android',36)
('Android',37)
('Android',38)
('Android',39)
('Android',40)
('Android',41)
('Android',42)
('Android',43)
('Android',44)
('Android1',33,'matched 3')
1 0 (\\d+)/tclwebkit ['version','name'] ['$1','Android']
2 0 33/tclwebkit ['comment','version'] ['matched 3','13']
3 1 33/tclwebkit ['name'] ['Andriod1']
3 1 33/tclwebkit ['name'] ['Android1']

View File

@ -1,7 +1,7 @@
-- Tags: use-vectorscan
DROP TABLE IF EXISTS regexp_dictionary_source_table;
DROP DICTIONARY IF EXISTS regexp_dict1;
DROP TABLE IF EXISTS regexp_dictionary_source_table;
CREATE TABLE regexp_dictionary_source_table
(
@ -15,7 +15,7 @@ CREATE TABLE regexp_dictionary_source_table
-- test back reference.
INSERT INTO regexp_dictionary_source_table VALUES (1, 0, 'Linux/(\d+[\.\d]*).+tlinux', ['name', 'version'], ['TencentOS', '\1'])
INSERT INTO regexp_dictionary_source_table VALUES (2, 0, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Andriod', '$1', 'test $1 and $2'])
INSERT INTO regexp_dictionary_source_table VALUES (2, 0, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Android', '$1', 'test $1 and $2'])
INSERT INTO regexp_dictionary_source_table VALUES (3, 2, '33/tclwebkit', ['version'], ['13'])
INSERT INTO regexp_dictionary_source_table VALUES (4, 2, '3[12]/tclwebkit', ['version'], ['12'])
INSERT INTO regexp_dictionary_source_table VALUES (5, 2, '3[12]/tclwebkit', ['version'], ['11'])
@ -65,14 +65,14 @@ SYSTEM RELOAD dictionary regexp_dict1; -- { serverError 489 }
truncate table regexp_dictionary_source_table;
INSERT INTO regexp_dictionary_source_table VALUES (1, 2, 'Linux/(\d+[\.\d]*).+tlinux', ['name', 'version'], ['TencentOS', '\1'])
INSERT INTO regexp_dictionary_source_table VALUES (2, 3, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Andriod', '$1', 'test $1 and $2'])
INSERT INTO regexp_dictionary_source_table VALUES (3, 1, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Andriod', '$1', 'test $1 and $2'])
INSERT INTO regexp_dictionary_source_table VALUES (2, 3, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Android', '$1', 'test $1 and $2'])
INSERT INTO regexp_dictionary_source_table VALUES (3, 1, '(\d+)/tclwebkit(\d+[\.\d]*)', ['name', 'version', 'comment'], ['Android', '$1', 'test $1 and $2'])
SYSTEM RELOAD dictionary regexp_dict1; -- { serverError 489 }
-- test priority
truncate table regexp_dictionary_source_table;
INSERT INTO regexp_dictionary_source_table VALUES (1, 0, '(\d+)/tclwebkit', ['name', 'version'], ['Andriod', '$1']);
INSERT INTO regexp_dictionary_source_table VALUES (3, 1, '33/tclwebkit', ['name'], ['Andriod1']); -- child has more priority than parents.
INSERT INTO regexp_dictionary_source_table VALUES (1, 0, '(\d+)/tclwebkit', ['name', 'version'], ['Android', '$1']);
INSERT INTO regexp_dictionary_source_table VALUES (3, 1, '33/tclwebkit', ['name'], ['Android1']); -- child has more priority than parents.
INSERT INTO regexp_dictionary_source_table VALUES (2, 0, '33/tclwebkit', ['version', 'comment'], ['13', 'matched 3']); -- larger id has lower priority than small id.
SYSTEM RELOAD dictionary regexp_dict1;
select dictGet(regexp_dict1, ('name', 'version', 'comment'), '33/tclwebkit');
@ -82,6 +82,6 @@ SYSTEM RELOAD dictionary regexp_dict1; -- { serverError 489 }
select * from dictionary(regexp_dict1);
DROP DICTIONARY IF EXISTS regexp_dict1;
DROP TABLE IF EXISTS regexp_dictionary_source_table;
DROP TABLE IF EXISTS needle_table;
DROP DICTIONARY IF EXISTS regexp_dict1;