ClickHouse

mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-11-24 00:22:29 +00:00

Author	SHA1	Message	Date
Robert Schulze	4ed5b903b4	Docs: remove anchor prefix	2023-09-18 18:35:59 +00:00
johanngan	bcb058f999	Add case insensitive and dot-all modes to RegExpTree dictionary The new per-dictionary settings control regex match semantics around case sensitivity and the '.' wildcard with newlines. They must be set at the dictionary level since they're applied to regex engines at pattern-compile-time. - regexp_dict_flag_case_insensitive: case insensitive matching - regexp_dict_flag_dotall: '.' matches all characters including newlines They correspond to HS_FLAG_CASELESS and HS_FLAG_DOTALL in Vectorscan and case_sensitive and dot_nl in RE2. These are the most useful options compatible with the internal behavior of RegExpTreeDictionary around splitting up simple and complex patterns between Vectorscan and RE2. The alternative is to use (?i) and/or (?s) for all patterns. However, (?s) isn't handled properly by OptimizedRegularExpression::analyze(). And while (?i) is, it still causes the dictionary to treat the pattern as "complex" for sequential scanning with RE2 rather than multi-matching with Vectorscan, even though Vectorscan supports case insensitive literal matching. Setting dictionary-wide flags is both more convenient, and circumvents these problems.	2023-09-06 11:28:53 -05:00
Robert Schulze	f20dd27ba6	Clean header mess up	2023-08-17 18:47:11 +00:00
Rich Raposa	13a03c046f	Remove duplicate section from Dictionary page	2023-08-11 23:45:58 -06:00
johanngan	c0f162c5b6	Add dictGetAll function for RegExpTreeDictionary This function outputs an array of attribute values from all regexp nodes that matched in a regexp tree dictionary. An optional final argument can be passed to limit the array size.	2023-06-04 23:46:04 -05:00
Robert Schulze	54872f9e7e	Typos: Follow-up to #50476	2023-06-02 13:28:09 +00:00
Robert Schulze	65cc92a78d	CI: Fix aspell on nested docs	2023-06-02 12:24:41 +00:00
johanngan	de3b08aa5b	Clean up regexp tree dictionary documentation dictGetOrNull() relies on IDictionary::hasKeys(), which RegExpTreeDictionary doesn't implement, so this probably never worked. If you try to use it, an exception is thrown. The docs shouldn't indicate that this is supported. Also fix a markdown hyperlink in the docs.	2023-05-25 14:35:24 -05:00
Denny Crane	8a00be69b3	Update index.md	2023-05-24 10:40:33 -03:00
Han Fei	2625696591	Merge branch 'master' into hanfei/regexp-doc	2023-05-21 23:42:01 +02:00
Robert Schulze	491cf8b6e1	Fix minor mistakes	2023-05-21 13:43:05 +00:00
Robert Schulze	9d9d4e3d62	Some fixups	2023-05-21 13:40:52 +00:00
Robert Schulze	312f751503	Uppercase remaining SQL keywords	2023-05-21 13:08:55 +00:00
Azat Khuzhin	2b240d3721	Improve documentation for HASHED/SPARSE_HASHED/COMPLEX_KEY_HASHED/COMPLEX_KEY_SPARSE_HASHED Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2023-05-19 06:07:21 +02:00
Azat Khuzhin	2996b38606	Add ability to configure maximum load factor for the HASHED/SPARSE_HASHED layout As it turns out, HashMap/PackedHashMap works great even with max load factor of 0.99. By "great" I mean at least it works faster than google sparsehash, not to mention its friendliness to the memory allocator (it has zero fragmentation since it works with a contiguous memory region, in comparison to the sparsehash that does lots of realloc, which jemalloc does not like, due to its slabs). Here is a table of different setups: settings \| load (sec) \| read (sec) \| read (million rows/s) \| bytes_allocated \| RSS - \| - \| - \| - \| - \| - HASHED upstream \| - \| - \| - \| - \| 35GiB SPARSE_HASHED upstream \| - \| - \| - \| - \| 26GiB - \| - \| - \| - \| - \| - sparse_hash_map glibc hashbench \| - \| - \| - \| - \| 17.5GiB sparse_hash_map packed allocator \| 101.878 \| 231.48 \| 4.32 \| - \| 17.7GiB PackedHashMap 0.5 \| 15.514 \| 42.35 \| 23.61 \| 20GiB \| 22GiB hashed 0.95 \| 34.903 \| 115.615 \| 8.65 \| 16GiB \| 18.7GiB PackedHashMap 0.95 \| 19.883 \| 93.6 \| 10.68 \| 10GiB \| 12.8GiB PackedHashMap 0.99 \| 26.113 \| 83.6 \| 11.96 \| 10GiB \| 12.3GiB As it shows, PackedHashMap with 0.95 max_load_factor, eats 2.6x less memory than SPARSE_HASHED in upstream, and it is also 2x faster for read! v2: fix grower Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2023-05-19 06:07:21 +02:00
Han Fei	549af4d351	address comments	2023-05-17 21:23:32 +02:00
Han Fei	7df0e9d933	fix broken link	2023-05-16 15:33:08 +02:00
Han Fei	a40d86b921	Update docs/en/sql-reference/dictionaries/index.md Co-authored-by: Sergei Trifonov <svtrifonov@gmail.com>	2023-05-16 11:22:42 +02:00
Han Fei	ed5906f15d	Update docs/en/sql-reference/dictionaries/index.md Co-authored-by: Sergei Trifonov <svtrifonov@gmail.com>	2023-05-16 11:22:31 +02:00
Han Fei	31b8e3c489	Update docs/en/sql-reference/dictionaries/index.md Co-authored-by: Sergei Trifonov <svtrifonov@gmail.com>	2023-05-16 11:22:24 +02:00
Han Fei	e4e473ef30	Update docs/en/sql-reference/dictionaries/index.md Co-authored-by: Sergei Trifonov <svtrifonov@gmail.com>	2023-05-16 11:22:14 +02:00
Han Fei	29aa960377	refine docs for regexp tree dictionary	2023-05-16 09:07:35 +02:00
Han Fei	ef74e64336	address comments	2023-05-11 22:18:08 +02:00
Ivan Takarlikov	8873856ce5	Fix some grammar mistakes in documentation, code and tests	2023-05-04 13:35:18 -03:00
MikhailBurdukov	b229a28e94	Merge branch 'master' into mongo_dict_tls	2023-04-26 23:39:27 +03:00
MikhailBurdukov	7764168bd5	Resolve conflict	2023-04-26 19:50:58 +00:00
MikhailBurdukov	baaee66e85	Missing files	2023-04-26 19:29:29 +00:00
Robert Schulze	c406663442	Docs: Replace annoying three spaces in enumerations by a single space	2023-04-19 15:56:55 +00:00
DanRoscigno	6d8a2bbd48	standardize admonitions	2023-03-27 14:54:05 -04:00
rfraposa	ac5ed141d8	New nav - reverting the revert	2023-03-17 21:45:43 -05:00
Alexander Tokmakov	ec44c8293a	Revert "New navigation"	2023-03-17 21:21:11 +03:00
rfraposa	a580d7c021	Combined Dictionary pages	2023-03-08 16:52:01 -07:00
rfraposa	4b1b4a711e	Fix links	2023-03-08 00:05:58 -07:00
rfraposa	fa6f3dadba	Link fixes	2023-03-07 22:52:43 -07:00
rfraposa	4f67e3facf	Update Dictionary links	2023-03-03 20:11:51 -07:00
rfraposa	d1045b9f11	Fix Dictionary links; update install.md	2023-03-02 07:56:03 -07:00
rfraposa	17a2d7ed45	Fixing broken links	2023-03-01 16:53:17 -07:00
rfraposa	1b6916ddd2	Condensed dictionary docs into a single page	2023-02-28 12:01:52 -07:00
rfraposa	a4a5a8a7d3	Initial copy of doc-preview	2023-02-28 11:59:05 -07:00
rfraposa	e52edd4e85	Update external-dicts-dict-layout.md	2023-02-01 09:06:21 -07:00
Denny Crane	fda47bf4f8	Update external-dicts-dict-layout.md	2023-01-24 21:31:43 -04:00
Azat Khuzhin	4366f7fb3b	Remove PREALLOCATE for HASHED/SPARSE_HASHED dictionaries It does not give a significant benefit, and hashed/sparse_hashed dictionaries can now be filled in parallel (#40003) using sharded dictionaries, which should be used instead of PREALLOCATE. Note that dictionaries created with PREALLOCATE will still work, but simply ignore this attribute. Fixes: #41985 (cc @alexey-milovidov) Reverts: #23979 (cc @kitaisreal) Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2023-01-18 20:18:37 +01:00
Azat Khuzhin	99063b152f	Allow to configure queue backlog of the parallel hashed dictionary loader v2: Decrease default parallel_queue_backlog to 10000 (same speed) v3: Rename parallel_queue_backlog to per_shard_load_backlog v3: Rename per_shard_load_backlog to shard_load_queue_backlog v4: Fix documentation Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2023-01-13 13:39:26 +01:00
Azat Khuzhin	345c422e28	Add ability to load hashed dictionaries using multiple threads Right now dictionaries (here I will talk only about HASHED/SPARSE_HASHED/COMPLEX_KEY_HASHED/COMPLEX_KEY_SPARSE_HASHED) can load data only in one thread, since they use one hash table that cannot be filled from multiple threads. And in case you have a very big dictionary (i.e. 10e9 elements), it can take a while to load them, especially for SPARSE_HASHED variants (and if you have such an amount of elements there, you are likely to use SPARSE_HASHED, since it requires less memory), in my env it takes ~4 hours, which is an enormous amount of time. So this patch adds support for shards for dictionaries; the number of shards determines how many hash tables the dictionary will use and, more importantly, how many threads it can use to load the data. And with 16 threads this works 2x faster, not perfect though, see the follow up patches in this series. v0: PARTITION BY v1: SHARDS 1 v2: SHARDS(1) v3: tried optimized mod - logical and, but it does not gain even 10% v4: tried squashing more (max_block_size * shards), but it does not gain even 10% either v5: move SHARDS into layout parameters (unknown simply ignored) v6: tune params for perf tests (to avoid too long queries) Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>	2023-01-13 13:39:25 +01:00
Han Fei	6ed4570f73	Merge branch 'master' into regexp-tree-dictionary	2023-01-10 15:36:30 +01:00
Han Fei	5f8296b719	Update docs/en/sql-reference/dictionaries/external-dictionaries/regexp-tree.md Co-authored-by: Vladimir C <vdimir@clickhouse.com>	2023-01-10 14:41:06 +01:00
Ivan Blinkov	61c2f23713	Remove leftover empty lines at the end of markdown files	2023-01-09 15:15:18 +01:00
Han Fei	f2a9eea995	write docs and optimize regex compile	2023-01-05 17:38:01 +01:00
Denny Crane	850f77f4d2	Update external-dicts-dict-sources.md	2022-12-26 16:21:36 -04:00
Dale Mcdiarmid	1f5e6799ec	revert contents change	2022-12-16 12:03:57 +00:00


207 Commits