diff --git a/docker/test/util/Dockerfile b/docker/test/util/Dockerfile index f1cf029e9a2..0ee426f4e4d 100644 --- a/docker/test/util/Dockerfile +++ b/docker/test/util/Dockerfile @@ -48,6 +48,7 @@ RUN apt-get update \ gdb \ git \ gperf \ + libclang-rt-${LLVM_VERSION}-dev \ lld-${LLVM_VERSION} \ llvm-${LLVM_VERSION} \ llvm-${LLVM_VERSION}-dev \ diff --git a/docs/en/engines/table-engines/mergetree-family/invertedindexes.md b/docs/en/engines/table-engines/mergetree-family/invertedindexes.md index 2899476b847..9f759303034 100644 --- a/docs/en/engines/table-engines/mergetree-family/invertedindexes.md +++ b/docs/en/engines/table-engines/mergetree-family/invertedindexes.md @@ -2,10 +2,10 @@ slug: /en/engines/table-engines/mergetree-family/invertedindexes sidebar_label: Inverted Indexes description: Quickly find search terms in text. -keywords: [full-text search, text search] +keywords: [full-text search, text search, inverted, index, indices] --- -# Inverted indexes [experimental] +# Full-text Search using Inverted Indexes [experimental] Inverted indexes are an experimental type of [secondary indexes](/docs/en/engines/table-engines/mergetree-family/mergetree.md/#available-types-of-indices) which provide fast text search capabilities for [String](/docs/en/sql-reference/data-types/string.md) or [FixedString](/docs/en/sql-reference/data-types/fixedstring.md) @@ -13,7 +13,7 @@ columns. The main idea of an inverted index is to store a mapping from "terms" t tokenized cells of the string column. For example, the string cell "I will be a little late" is by default tokenized into six terms "I", "will", "be", "a", "little" and "late". Another kind of tokenizer is n-grams. For example, the result of 3-gram tokenization will be 21 terms "I w", " wi", "wil", "ill", "ll ", "l b", " be" etc. The more fine-granular the input strings are tokenized, the bigger but also the more -useful the resulting inverted index will be. +useful the resulting inverted index will be. 
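To get a feel for the two tokenization styles, you can experiment with ClickHouse's string functions. This is only an illustrative sketch: the functions below mimic the two styles but are not the index's internal tokenizer.

```sql
-- n-gram style: the 21 overlapping trigrams of the example string
SELECT ngrams('I will be a little late', 3);

-- default style: split on non-alphanumeric characters into six terms
SELECT splitByNonAlpha('I will be a little late');
```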
:::warning Inverted indexes are experimental and should not be used in production environments yet. They may change in the future in backward-incompatible @@ -50,7 +50,7 @@ Being a type of skipping index, inverted indexes can be dropped or added to a co ``` sql ALTER TABLE tab DROP INDEX inv_idx; -ALTER TABLE tab ADD INDEX inv_idx(s) TYPE inverted(2) GRANULARITY 1; +ALTER TABLE tab ADD INDEX inv_idx(s) TYPE inverted(2); ``` To use the index, no special functions or syntax are required. Typical string search predicates automatically leverage the index. As @@ -74,7 +74,106 @@ controls the amount of data read consumed from the underlying column before a ne intermediate memory consumption for index construction but also improves lookup performance since fewer segments need to be checked on average to evaluate a query. +## Full-text search of the Hacker News dataset + +Let's look at the performance improvements of inverted indexes on a large dataset with lots of text. We will use 28.7M rows of comments on the popular Hacker News website. 
Here is the table without an inverted index: + +```sql +CREATE TABLE hackernews ( + id UInt64, + deleted UInt8, + type String, + author String, + timestamp DateTime, + comment String, + dead UInt8, + parent UInt64, + poll UInt64, + children Array(UInt32), + url String, + score UInt32, + title String, + parts Array(UInt32), + descendants UInt32 +) +ENGINE = MergeTree +ORDER BY (type, author); +``` + +The 28.7M rows are in a Parquet file in S3 - let's insert them into the `hackernews` table: + +```sql +INSERT INTO hackernews + SELECT * FROM s3Cluster( + 'default', + 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet', + 'Parquet', + ' + id UInt64, + deleted UInt8, + type String, + by String, + time DateTime, + text String, + dead UInt8, + parent UInt64, + poll UInt64, + kids Array(UInt32), + url String, + score UInt32, + title String, + parts Array(UInt32), + descendants UInt32'); +``` + +Consider the following simple search for the term `ClickHouse` (in any combination of upper and lower case) in the `comment` column: + +```sql +SELECT count() +FROM hackernews +WHERE hasToken(lower(comment), 'clickhouse'); +``` + +Notice it takes 3 seconds to execute the query: + +```response +┌─count()─┐ +│ 1145 │ +└─────────┘ + +1 row in set. Elapsed: 3.001 sec. Processed 28.74 million rows, 9.75 GB (9.58 million rows/s., 3.25 GB/s.) +``` + +We will use `ALTER TABLE` to add an inverted index on the lowercase of the `comment` column, then materialize it (materialization can take a while; wait for it to finish before re-running the query): + +```sql +ALTER TABLE hackernews + ADD INDEX comment_lowercase(lower(comment)) TYPE inverted; + +ALTER TABLE hackernews MATERIALIZE INDEX comment_lowercase; +``` + +We run the same query... + +```sql +SELECT count() +FROM hackernews +WHERE hasToken(lower(comment), 'clickhouse') +``` + +...and notice the query executes 4x faster: + +```response +┌─count()─┐ +│ 1145 │ +└─────────┘ + +1 row in set. Elapsed: 0.747 sec. 
Processed 4.49 million rows, 1.77 GB (6.01 million rows/s., 2.37 GB/s.) +``` + +:::note Unlike other secondary indices, inverted indexes (for now) map to row numbers (row ids) instead of granule ids. The reason for this design is performance. In practice, users often search for multiple terms at once. For example, filter predicate `WHERE s LIKE '%little%' OR s LIKE '%big%'` can be evaluated directly using an inverted index by forming the union of the row id lists for terms "little" and "big". This also means that the parameter `GRANULARITY` supplied to index creation has no meaning (it may be removed from the syntax in the future). +::: \ No newline at end of file diff --git a/docs/en/operations/caches.md b/docs/en/operations/caches.md index d912b8a5990..0f9156048c4 100644 --- a/docs/en/operations/caches.md +++ b/docs/en/operations/caches.md @@ -22,6 +22,6 @@ Additional cache types: - [Dictionaries](../sql-reference/dictionaries/index.md) data cache. - Schema inference cache. - [Filesystem cache](storing-data.md) over S3, Azure, Local and other disks. -- [(Experimental) Query result cache](query-result-cache.md). +- [(Experimental) Query cache](query-cache.md). To drop one of the caches, use [SYSTEM DROP ... CACHE](../sql-reference/statements/system.md#drop-mark-cache) statements. diff --git a/docs/en/operations/query-result-cache.md b/docs/en/operations/query-result-cache.md deleted file mode 100644 index 046b75ac5c5..00000000000 --- a/docs/en/operations/query-result-cache.md +++ /dev/null @@ -1,112 +0,0 @@ ---- -slug: /en/operations/query-result-cache -sidebar_position: 65 -sidebar_label: Query Result Cache [experimental] ---- - -# Query Result Cache [experimental] - -The query result cache allows to compute `SELECT` queries just once and to serve further executions of the same query directly from the -cache. Depending on the type of the queries, this can dramatically reduce latency and resource consumption of the ClickHouse server. 
- -## Background, Design and Limitations - -Query result caches can generally be viewed as transactionally consistent or inconsistent. - -- In transactionally consistent caches, the database invalidates (discards) cached query results if the result of the `SELECT` query changes - or potentially changes. In ClickHouse, operations which change the data include inserts/updates/deletes in/of/from tables or collapsing - merges. Transactionally consistent caching is especially suitable for OLTP databases, for example - [MySQL](https://dev.mysql.com/doc/refman/5.6/en/query-cache.html) (which removed query result cache after v8.0) and - [Oracle](https://docs.oracle.com/database/121/TGDBA/tune_result_cache.htm). -- In transactionally inconsistent caches, slight inaccuracies in query results are accepted under the assumption that all cache entries are - assigned a validity period after which they expire (e.g. 1 minute) and that the underlying data changes only little during this period. - This approach is overall more suitable for OLAP databases. As an example where transactionally inconsistent caching is sufficient, - consider an hourly sales report in a reporting tool which is simultaneously accessed by multiple users. Sales data changes typically - slowly enough that the database only needs to compute the report once (represented by the first `SELECT` query). Further queries can be - served directly from the query result cache. In this example, a reasonable validity period could be 30 min. - -Transactionally inconsistent caching is traditionally provided by client tools or proxy packages interacting with the database. As a result, -the same caching logic and configuration is often duplicated. With ClickHouse's query result cache, the caching logic moves to the server -side. This reduces maintenance effort and avoids redundancy. - -:::warning -The query result cache is an experimental feature that should not be used in production. There are known cases (e.g. 
in distributed query -processing) where wrong results are returned. -::: - -## Configuration Settings and Usage - -As long as the result cache is experimental it must be activated using the following configuration setting: - -```sql -SET allow_experimental_query_result_cache = true; -``` - -Afterwards, setting [use_query_result_cache](settings/settings.md#use-query-result-cache) can be used to control whether a specific query or -all queries of the current session should utilize the query result cache. For example, the first execution of query - -```sql -SELECT some_expensive_calculation(column_1, column_2) -FROM table -SETTINGS use_query_result_cache = true; -``` - -will store the query result in the query result cache. Subsequent executions of the same query (also with parameter `use_query_result_cache -= true`) will read the computed result from the cache and return it immediately. - -The way the cache is utilized can be configured in more detail using settings [enable_writes_to_query_result_cache](settings/settings.md#enable-writes-to-query-result-cache) -and [enable_reads_from_query_result_cache](settings/settings.md#enable-reads-from-query-result-cache) (both `true` by default). The first -settings controls whether query results are stored in the cache, whereas the second parameter determines if the database should try to -retrieve query results from the cache. For example, the following query will use the cache only passively, i.e. attempt to read from it but -not store its result in it: - -```sql -SELECT some_expensive_calculation(column_1, column_2) -FROM table -SETTINGS use_query_result_cache = true, enable_writes_to_query_result_cache = false; -``` - -For maximum control, it is generally recommended to provide settings "use_query_result_cache", "enable_writes_to_query_result_cache" and -"enable_reads_from_query_result_cache" only with specific queries. It is also possible to enable caching at user or profile level (e.g. 
via -`SET use_query_result_cache = true`) but one should keep in mind that all `SELECT` queries including monitoring or debugging queries to -system tables may return cached results then. - -The query result cache can be cleared using statement `SYSTEM DROP QUERY RESULT CACHE`. The content of the query result cache is displayed -in system table `SYSTEM.QUERY_RESULT_CACHE`. The number of query result cache hits and misses are shown as events "QueryCacheHits" and -"QueryCacheMisses" in system table `SYSTEM.EVENTS`. Both counters are only updated for `SELECT` queries which run with setting -"use_query_result_cache = true". Other queries do not affect the cache miss counter. - -The query result cache exists once per ClickHouse server process. However, cache results are by default not shared between users. This can -be changed (see below) but doing so is not recommended for security reasons. - -Query results are referenced in the query result cache by the [Abstract Syntax Tree (AST)](https://en.wikipedia.org/wiki/Abstract_syntax_tree) -of their query. This means that caching is agnostic to upper/lowercase, for example `SELECT 1` and `select 1` are treated as the same query. -To make the matching more natural, all query-level settings related to the query result cache are removed from the AST. - -If the query was aborted due to an exception or user cancellation, no entry is written into the query result cache. - -The size of the query result cache, the maximum number of cache entries and the maximum size of cache entries (in bytes and in records) can -be configured using different [server configuration options](server-configuration-parameters/settings.md#server_configuration_parameters_query-result-cache). - -To define how long a query must run at least such that its result can be cached, you can use setting -[query_result_cache_min_query_duration](settings/settings.md#query-result-cache-min-query-duration). 
For example, the result of query - -``` sql -SELECT some_expensive_calculation(column_1, column_2) -FROM table -SETTINGS use_query_result_cache = true, query_result_cache_min_query_duration = 5000; -``` - -is only cached if the query runs longer than 5 seconds. It is also possible to specify how often a query needs to run until its result is -cached - for that use setting [query_result_cache_min_query_runs](settings/settings.md#query-result-cache-min-query-runs). - -Entries in the query result cache become stale after a certain time period (time-to-live). By default, this period is 60 seconds but a -different value can be specified at session, profile or query level using setting [query_result_cache_ttl](settings/settings.md#query-result-cache-ttl). - -Also, results of queries with non-deterministic functions such as `rand()` and `now()` are not cached. This can be overruled using -setting [query_result_cache_store_results_of_queries_with_nondeterministic_functions](settings/settings.md#query-result-cache-store-results-of-queries-with-nondeterministic-functions). - -Finally, entries in the query cache are not shared between users due to security reasons. For example, user A must not be able to bypass a -row policy on a table by running the same query as another user B for whom no such policy exists. However, if necessary, cache entries can -be marked accessible by other users (i.e. shared) by supplying setting -[query_result_cache_share_between_users](settings/settings.md#query-result-cache-share-between-users). diff --git a/docs/en/operations/settings/settings.md b/docs/en/operations/settings/settings.md index 81c0531427f..32224056114 100644 --- a/docs/en/operations/settings/settings.md +++ b/docs/en/operations/settings/settings.md @@ -1303,7 +1303,7 @@ Default value: `3`. ## use_query_cache {#use-query-cache} -If turned on, `SELECT` queries may utilize the [query cache](../query-cache.md). 
Parameters [enable_reads_from_query_cache](#enable-readsfrom-query-cache) +If turned on, `SELECT` queries may utilize the [query cache](../query-cache.md). Parameters [enable_reads_from_query_cache](#enable-reads-from-query-cache) and [enable_writes_to_query_cache](#enable-writes-to-query-cache) control in more detail how the cache is used. Possible values: diff --git a/docs/en/sql-reference/statements/system.md b/docs/en/sql-reference/statements/system.md index 300205a7ef4..f9f55acfcec 100644 --- a/docs/en/sql-reference/statements/system.md +++ b/docs/en/sql-reference/statements/system.md @@ -283,7 +283,7 @@ SYSTEM START REPLICATION QUEUES [[db.]replicated_merge_tree_family_table_name] Wait until a `ReplicatedMergeTree` table will be synced with other replicas in a cluster. Will run until `receive_timeout` if fetches currently disabled for the table. ``` sql -SYSTEM SYNC REPLICA [db.]replicated_merge_tree_family_table_name +SYSTEM SYNC REPLICA [ON CLUSTER cluster_name] [db.]replicated_merge_tree_family_table_name ``` After running this statement the `[db.]replicated_merge_tree_family_table_name` fetches commands from the common replicated log into its own replication queue, and then the query waits till the replica processes all of the fetched commands. diff --git a/docs/en/sql-reference/table-functions/s3.md b/docs/en/sql-reference/table-functions/s3.md index 32d33f1dca5..d7199717798 100644 --- a/docs/en/sql-reference/table-functions/s3.md +++ b/docs/en/sql-reference/table-functions/s3.md @@ -2,11 +2,12 @@ slug: /en/sql-reference/table-functions/s3 sidebar_position: 45 sidebar_label: s3 +keywords: [s3, gcs, bucket] --- # s3 Table Function -Provides table-like interface to select/insert files in [Amazon S3](https://aws.amazon.com/s3/). This table function is similar to [hdfs](../../sql-reference/table-functions/hdfs.md), but provides S3-specific features. 
+Provides a table-like interface to select/insert files in [Amazon S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/). This table function is similar to the [hdfs function](../../sql-reference/table-functions/hdfs.md), but provides S3-specific features. **Syntax** @@ -14,9 +15,24 @@ Provides table-like interface to select/insert files in [Amazon S3](https://aws. s3(path [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression]) ``` +:::tip GCS +The S3 Table Function integrates with Google Cloud Storage by using the GCS XML API and HMAC keys. See the [Google interoperability docs](https://cloud.google.com/storage/docs/interoperability) for more details about the endpoint and HMAC. + +For GCS, substitute your HMAC key and HMAC secret where you see `aws_access_key_id` and `aws_secret_access_key`. +::: + **Arguments** - `path` — Bucket url with path to file. Supports following wildcards in readonly mode: `*`, `?`, `{abc,def}` and `{N..M}` where `N`, `M` — numbers, `'abc'`, `'def'` — strings. For more information see [here](../../engines/table-engines/integrations/s3.md#wildcards-in-path). + + :::note GCS + The GCS path uses this format because the endpoint for the Google XML API is different from the JSON API: + ``` + https://storage.googleapis.com/<bucket>/<folder>/<filename> + ``` + and not ~~https://storage.cloud.google.com~~. + ::: + - `format` — The [format](../../interfaces/formats.md#formats) of the file. - `structure` — Structure of the table. Format `'column1_name column1_type, column2_name column2_type, ...'`. - `compression` — Parameter is optional. Supported values: `none`, `gzip/gz`, `brotli/br`, `xz/LZMA`, `zstd/zst`. By default, it will autodetect compression by file extension.
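As a quick sketch of the GCS usage described above (the bucket name, object path, and HMAC credentials below are placeholders, not real values):

```sql
SELECT count()
FROM s3(
    'https://storage.googleapis.com/my-gcs-bucket/data/events.parquet',
    'GOOGHMACKEYID',    -- HMAC key (placeholder)
    'GOOGHMACSECRET',   -- HMAC secret (placeholder)
    'Parquet');
```

The same call with an Amazon S3 bucket URL and AWS credentials reads from S3 instead; only the endpoint and key type change.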
diff --git a/programs/server/dashboard.html b/programs/server/dashboard.html index 859ce78068c..74bd9746a4d 100644 --- a/programs/server/dashboard.html +++ b/programs/server/dashboard.html @@ -76,7 +76,7 @@ #charts { height: 100%; - display: flex; + display: none; flex-flow: row wrap; gap: 1rem; } @@ -170,6 +170,14 @@ background: var(--button-background-color); } + #auth-error { + color: var(--error-color); + + display: flex; + flex-flow: row nowrap; + justify-content: center; + } + form { display: inline; } @@ -293,6 +301,7 @@ +