CLICKHOUSEDOCS-580: Fixed select.md and some of the broken links.

Sergei Shtykov 2020-04-14 15:56:34 +03:00
parent a7bf6ad667
commit 37a971fbe5
2 changed files with 775 additions and 5 deletions


@ -5,6 +5,6 @@ toc_title: Set
# Set {#set}
Used for the right half of an [IN](../../../sql_reference/statements/select.md#select-in-operators) expression.
Used for the right half of an [IN](../../statements/select.md#select-in-operators) expression.
[Original article](https://clickhouse.tech/docs/en/data_types/special_data_types/set/) <!--hide-->


@ -248,7 +248,7 @@ Here, a sample of 10% is taken from the second half of the data.
### ARRAY JOIN Clause {#select-array-join-clause}
Allows executing `JOIN` with an array or nested data structure. The intent is similar to the [arrayJoin](../../sql_reference/functions/array_join.md#functions_arrayjoin) function, but its functionality is broader.
Allows executing `JOIN` with an array or nested data structure. The intent is similar to the [arrayJoin](../functions/array_join.md#functions_arrayjoin) function, but its functionality is broader.
``` sql
SELECT <expr_list>
@ -602,7 +602,777 @@ USING (equi_column1, ... equi_columnN, asof_column)
For example, consider the following tables:
``` text
         table_1                           table_2

  event   | ev_time | user_id       event   | ev_time | user_id
----------|---------|----------   ----------|---------|----------
              ...                               ...
event_1_1 |  12:00  |  42         event_2_1 |  11:59  |  42
              ...                 event_2_2 |  12:30  |  42
event_1_2 |  13:00  |  42         event_2_3 |  13:00  |  42
              ...                               ...
```
`ASOF JOIN` can take the timestamp of a user event from `table_1` and find an event in `table_2` where the timestamp is closest to the timestamp of the event from `table_1` corresponding to the closest match condition. Equal timestamp values are the closest if available. Here, the `user_id` column can be used for joining on equality and the `ev_time` column can be used for joining on the closest match. In our example, `event_1_1` can be joined with `event_2_1` and `event_1_2` can be joined with `event_2_3`, but `event_2_2` can't be joined.
!!! note "Note"
`ASOF` join is **not** supported in the [Join](../../engines/table_engines/special/join.md) table engine.
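As a minimal sketch of the example above, assuming that `table_1` and `table_2` exist with the columns shown and that `ev_time` has an orderable type, the join could be written as follows. The aliases `t1`, `t2`, `event_1`, and `event_2` are illustrative only.

``` sql
SELECT
    t1.event AS event_1,
    t1.ev_time AS ev_time_1,
    t2.event AS event_2,
    t2.ev_time AS ev_time_2
FROM table_1 AS t1
ASOF LEFT JOIN table_2 AS t2
ON t1.user_id = t2.user_id          -- equality condition
    AND t1.ev_time >= t2.ev_time    -- closest match condition
```

With this condition, each row of `table_1` is paired with the latest row of `table_2` for the same `user_id` whose `ev_time` is not later than its own.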
To set the default strictness value, use the session configuration parameter [join\_default\_strictness](../../operations/settings/settings.md#settings-join_default_strictness).
#### GLOBAL JOIN {#global-join}
When using a normal `JOIN`, the query is sent to remote servers. Subqueries are run on each of them in order to make the right table, and the join is performed with this table. In other words, the right table is formed on each server separately.
When using `GLOBAL ... JOIN`, first the requestor server runs a subquery to calculate the right table. This temporary table is passed to each remote server, and queries are run on them using the temporary data that was transmitted.
Be careful when using `GLOBAL`. For more information, see the section [Distributed subqueries](#select-distributed-subqueries).
#### Usage Recommendations {#usage-recommendations}
When running a `JOIN`, there is no optimization of the order of execution in relation to other stages of the query. The join (a search in the right table) is run before filtering in `WHERE` and before aggregation. To set the processing order explicitly, we recommend joining a subquery with a subquery, so that filtering and aggregation happen inside each subquery before the join.
Example:
``` sql
SELECT
    CounterID,
    hits,
    visits
FROM
(
    SELECT
        CounterID,
        count() AS hits
    FROM test.hits
    GROUP BY CounterID
) ANY LEFT JOIN
(
    SELECT
        CounterID,
        sum(Sign) AS visits
    FROM test.visits
    GROUP BY CounterID
) USING CounterID
ORDER BY hits DESC
LIMIT 10
```
``` text
┌─CounterID─┬───hits─┬─visits─┐
│ 1143050 │ 523264 │ 13665 │
│ 731962 │ 475698 │ 102716 │
│ 722545 │ 337212 │ 108187 │
│ 722889 │ 252197 │ 10547 │
│ 2237260 │ 196036 │ 9522 │
│ 23057320 │ 147211 │ 7689 │
│ 722818 │ 90109 │ 17847 │
│ 48221 │ 85379 │ 4652 │
│ 19762435 │ 77807 │ 7026 │
│ 722884 │ 77492 │ 11056 │
└───────────┴────────┴────────┘
```
Subqueries don't allow you to set names or use them for referencing a column from a specific subquery.
The columns specified in `USING` must have the same names in both subqueries, and the other columns must be named differently. You can use aliases to change the names of columns in subqueries (the example uses the aliases `hits` and `visits`).
The `USING` clause specifies one or more columns to join, which establishes the equality of these columns. The list of columns is set without brackets. More complex join conditions are not supported.
The right table (the subquery result) resides in RAM. If there isn't enough memory, you can't run a `JOIN`.
Each time a query is run with the same `JOIN`, the subquery is run again because the result is not cached. To avoid this, use the special [Join](../../engines/table_engines/special/join.md) table engine, which is a prepared array for joining that is always in RAM.
In some cases, it is more efficient to use `IN` instead of `JOIN`.
Among the various types of `JOIN`, the most efficient is `ANY LEFT JOIN`, then `ANY INNER JOIN`. The least efficient are `ALL LEFT JOIN` and `ALL INNER JOIN`.
If you need a `JOIN` for joining with dimension tables (these are relatively small tables that contain dimension properties, such as names for advertising campaigns), a `JOIN` might not be very convenient due to the fact that the right table is re-accessed for every query. For such cases, there is an “external dictionaries” feature that you should use instead of `JOIN`. For more information, see the section [External dictionaries](../dictionaries/external_dictionaries/external_dicts.md).
**Memory Limitations**
ClickHouse uses the [hash join](https://en.wikipedia.org/wiki/Hash_join) algorithm. ClickHouse takes the `<right_subquery>` and creates a hash table for it in RAM. If you need to restrict the memory consumption of the join operation, use the following settings:
- [max\_rows\_in\_join](../../operations/settings/query_complexity.md#settings-max_rows_in_join) — Limits number of rows in the hash table.
- [max\_bytes\_in\_join](../../operations/settings/query_complexity.md#settings-max_bytes_in_join) — Limits size of the hash table.
When any of these limits is reached, ClickHouse acts as the [join\_overflow\_mode](../../operations/settings/query_complexity.md#settings-join_overflow_mode) setting instructs.
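A minimal sketch of how these settings might be applied for a session. The numeric limits are arbitrary illustrative values, not recommendations.

``` sql
-- Illustrative limits only; choose values that fit your workload.
SET max_rows_in_join = 10000000;      -- at most 10 million rows in the hash table
SET max_bytes_in_join = 1000000000;   -- at most about 1 GB for the hash table
SET join_overflow_mode = 'throw';     -- raise an exception when a limit is reached
```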
#### Processing of Empty or NULL Cells {#processing-of-empty-or-null-cells}
While joining tables, empty cells may appear. The setting [join\_use\_nulls](../../operations/settings/settings.md#join_use_nulls) defines how ClickHouse fills these cells.
If the `JOIN` keys are [Nullable](../data_types/nullable.md) fields, the rows where at least one of the keys has the value [NULL](../syntax.md#null-literal) are not joined.
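The effect of `join_use_nulls` can be sketched with the `test.hits` and `test.visits` tables used in the earlier example (assuming they are available). With the setting enabled, counters that have no matching row in the right subquery should show `NULL` in `visits` rather than the default value of the column type.

``` sql
SET join_use_nulls = 1;

SELECT
    CounterID,
    hits,
    visits  -- NULL for counters that have no rows in test.visits
FROM
(
    SELECT
        CounterID,
        count() AS hits
    FROM test.hits
    GROUP BY CounterID
) ANY LEFT JOIN
(
    SELECT
        CounterID,
        sum(Sign) AS visits
    FROM test.visits
    GROUP BY CounterID
) USING CounterID
```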
#### Syntax Limitations {#syntax-limitations}
For multiple `JOIN` clauses in a single `SELECT` query:
- Taking all the columns via `*` is available only if tables are joined, not subqueries.
- The `PREWHERE` clause is not available.
For `ON`, `WHERE`, and `GROUP BY` clauses:
- Arbitrary expressions cannot be used in `ON`, `WHERE`, and `GROUP BY` clauses, but you can define an expression in a `SELECT` clause and then use it in these clauses via an alias.
### WHERE Clause {#select-where}
If there is a WHERE clause, it must contain an expression with the UInt8 type. This is usually an expression with comparison and logical operators.
This expression will be used for filtering data before all other transformations.
If the table engine supports indexes, ClickHouse analyzes the expression to determine whether indexes can be used.
### PREWHERE Clause {#prewhere-clause}
This clause has the same meaning as the WHERE clause. The difference is in which data is read from the table.
When using PREWHERE, first only the columns necessary for executing PREWHERE are read. Then the other columns are read that are needed for running the query, but only those blocks where the PREWHERE expression is true.
It makes sense to use PREWHERE if there are filtration conditions that are used by a minority of the columns in the query, but that provide strong data filtration. This reduces the volume of data to read.
For example, it is useful to write PREWHERE for queries that extract a large number of columns, but that only have filtration for a few columns.
PREWHERE is only supported by tables from the `*MergeTree` family.
A query may simultaneously specify PREWHERE and WHERE. In this case, PREWHERE precedes WHERE.
If the optimize\_move\_to\_prewhere setting is set to 1 and PREWHERE is omitted, the system uses heuristics to automatically move parts of expressions from WHERE to PREWHERE.
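A minimal sketch of a query that combines both clauses, assuming a `*MergeTree` table `test.hits` with the columns `CounterID`, `EventDate`, `Title`, `URL`, and `Referer` (the column set is an assumption made for illustration):

``` sql
SELECT
    Title,
    URL,
    Referer
FROM test.hits
PREWHERE CounterID = 34                     -- selective filter on a single column, read first
WHERE EventDate >= toDate('2014-03-17')     -- applied to the blocks that passed PREWHERE
```

Here only `CounterID` is read for the whole scanned range. The other columns are read only for blocks where the `PREWHERE` condition holds.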
### GROUP BY Clause {#select-group-by-clause}
This is one of the most important parts of a column-oriented DBMS.
If there is a GROUP BY clause, it must contain a list of expressions. Each expression will be referred to here as a “key”.
All the expressions in the SELECT, HAVING, and ORDER BY clauses must be calculated from keys or from aggregate functions. In other words, each column selected from the table must be used either in keys or inside aggregate functions.
If a query contains only table columns inside aggregate functions, the GROUP BY clause can be omitted, and aggregation by an empty set of keys is assumed.
Example:
``` sql
SELECT
    count(),
    median(FetchTiming > 60 ? 60 : FetchTiming),
    count() - sum(Refresh)
FROM hits
```
However, in contrast to standard SQL, if the table doesn't have any rows (either there aren't any at all, or there aren't any after using WHERE to filter), an empty result is returned, and not the result from one of the rows containing the initial values of aggregate functions.
As opposed to MySQL (and conforming to standard SQL), you can't get some value of some column that is not in a key or aggregate function (except constant expressions). To work around this, you can use the `any` aggregate function (which returns the first encountered value) or `min`/`max`.
Example:
``` sql
SELECT
    domainWithoutWWW(URL) AS domain,
    count(),
    any(Title) AS title -- getting the first occurred page header for each domain.
FROM hits
GROUP BY domain
```
For every different key value encountered, GROUP BY calculates a set of aggregate function values.
GROUP BY is not supported for array columns.
A constant can't be specified as an argument for aggregate functions. Example: `sum(1)`. Instead of this, you can get rid of the constant. Example: `count()`.
#### NULL processing {#null-processing}
For grouping, ClickHouse interprets [NULL](../syntax.md) as a value, and `NULL=NULL`.
Here's an example to show what this means.
Assume you have this table:
``` text
┌─x─┬────y─┐
│ 1 │ 2 │
│ 2 │ ᴺᵁᴸᴸ │
│ 3 │ 2 │
│ 3 │ 3 │
│ 3 │ ᴺᵁᴸᴸ │
└───┴──────┘
```
The query `SELECT sum(x), y FROM t_null_big GROUP BY y` results in:
``` text
┌─sum(x)─┬────y─┐
│ 4 │ 2 │
│ 3 │ 3 │
│ 5 │ ᴺᵁᴸᴸ │
└────────┴──────┘
```
You can see that `GROUP BY` for `y = NULL` summed up `x`, as if `NULL` were a regular value.
If you pass several keys to `GROUP BY`, the result will give you all the combinations of the selection, as if `NULL` were a specific value.
#### WITH TOTALS Modifier {#with-totals-modifier}
If the WITH TOTALS modifier is specified, another row will be calculated. This row will have key columns containing default values (zeros or empty strings), and columns of aggregate functions with the values calculated across all the rows (the “total” values).
This extra row is output in JSON\*, TabSeparated\*, and Pretty\* formats, separately from the other rows. In the other formats, this row is not output.
In JSON\* formats, this row is output as a separate totals field. In TabSeparated\* formats, the row comes after the main result, preceded by an empty row (after the other data). In Pretty\* formats, the row is output as a separate table after the main result.
`WITH TOTALS` can be run in different ways when HAVING is present. The behavior depends on the totals\_mode setting.
By default, `totals_mode = 'before_having'`. In this case, totals is calculated across all rows, including the ones that don't pass through HAVING and max\_rows\_to\_group\_by.
The other alternatives include only the rows that pass through HAVING in totals, and behave differently with the setting `max_rows_to_group_by` and `group_by_overflow_mode = 'any'`.
`after_having_exclusive`: Don't include rows that didn't pass through `max_rows_to_group_by`. In other words, totals will have less than or the same number of rows as it would if `max_rows_to_group_by` were omitted.
`after_having_inclusive`: Include all the rows that didn't pass through max\_rows\_to\_group\_by in totals. In other words, totals will have more than or the same number of rows as it would if `max_rows_to_group_by` were omitted.
`after_having_auto`: Count the number of rows that passed through HAVING. If it is more than a certain amount (by default, 50%), include all the rows that didn't pass through max\_rows\_to\_group\_by in totals. Otherwise, do not include them.
`totals_auto_threshold`: By default, 0.5. The coefficient for `after_having_auto`.
If `max_rows_to_group_by` and `group_by_overflow_mode = 'any'` are not used, all variations of `after_having` are the same, and you can use any of them (for example, `after_having_auto`).
You can use WITH TOTALS in subqueries, including subqueries in the JOIN clause (in this case, the respective total values are combined).
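A minimal sketch of `WITH TOTALS` together with `HAVING`, assuming the `test.hits` table from the earlier examples. The threshold `100000` is an arbitrary illustrative value.

``` sql
SET totals_mode = 'after_having_auto';

SELECT
    CounterID,
    count() AS hits
FROM test.hits
GROUP BY CounterID WITH TOTALS
HAVING hits > 100000
ORDER BY hits DESC
LIMIT 10
```

In a Pretty\* format, the totals row should appear as a separate table after the main result.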
#### GROUP BY in External Memory {#select-group-by-in-external-memory}
You can enable dumping temporary data to the disk to restrict memory usage during `GROUP BY`.
The [max\_bytes\_before\_external\_group\_by](../../operations/settings/settings.md#settings-max_bytes_before_external_group_by) setting determines the threshold RAM consumption for dumping `GROUP BY` temporary data to the file system. If set to 0 (the default), it is disabled.
When using `max_bytes_before_external_group_by`, we recommend that you set `max_memory_usage` about twice as high. This is necessary because there are two stages to aggregation: reading the data and forming intermediate data (1) and merging the intermediate data (2). Dumping data to the file system can only occur during stage 1. If the temporary data wasn't dumped, then stage 2 might require up to the same amount of memory as in stage 1.
For example, if [max\_memory\_usage](../../operations/settings/settings.md#settings_max_memory_usage) was set to 10000000000 and you want to use external aggregation, it makes sense to set `max_bytes_before_external_group_by` to 10000000000, and max\_memory\_usage to 20000000000. When external aggregation is triggered (if there was at least one dump of temporary data), maximum consumption of RAM is only slightly more than `max_bytes_before_external_group_by`.
With distributed query processing, external aggregation is performed on remote servers. In order for the requester server to use only a small amount of RAM, set `distributed_aggregation_memory_efficient` to 1.
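The values from the example above can be applied for a session as follows (a sketch; adjust the numbers to the memory actually available):

``` sql
SET max_bytes_before_external_group_by = 10000000000;  -- start dumping GROUP BY data at about 10 GB
SET max_memory_usage = 20000000000;                    -- about twice the external aggregation threshold
SET distributed_aggregation_memory_efficient = 1;      -- keep RAM usage low on the requester server
```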
When merging data flushed to the disk, as well as when merging results from remote servers when the `distributed_aggregation_memory_efficient` setting is enabled, ClickHouse consumes up to `1/256 * the_number_of_threads` of the total amount of RAM.
When external aggregation is enabled, if there was less than `max_bytes_before_external_group_by` of data (i.e. data was not flushed), the query runs just as fast as without external aggregation. If any temporary data was flushed, the run time will be several times longer (approximately three times).
If you have an `ORDER BY` with a `LIMIT` after `GROUP BY`, then the amount of used RAM depends on the amount of data in `LIMIT`, not in the whole table. But if the `ORDER BY` doesn't have `LIMIT`, don't forget to enable external sorting (`max_bytes_before_external_sort`).
### LIMIT BY Clause {#limit-by-clause}
A query with the `LIMIT n BY expressions` clause selects the first `n` rows for each distinct value of `expressions`. The key for `LIMIT BY` can contain any number of [expressions](../syntax.md#syntax-expressions).
ClickHouse supports the following syntax:
- `LIMIT [offset_value, ]n BY expressions`
- `LIMIT n OFFSET offset_value BY expressions`
During query processing, ClickHouse selects data ordered by sorting key. The sorting key is set explicitly using an [ORDER BY](#select-order-by) clause or implicitly as a property of the table engine. Then ClickHouse applies `LIMIT n BY expressions` and returns the first `n` rows for each distinct combination of `expressions`. If `OFFSET` is specified, then for each data block that belongs to a distinct combination of `expressions`, ClickHouse skips `offset_value` number of rows from the beginning of the block and returns a maximum of `n` rows as a result. If `offset_value` is bigger than the number of rows in the data block, ClickHouse returns zero rows from the block.
`LIMIT BY` is not related to `LIMIT`. They can both be used in the same query.
**Examples**
Sample table:
``` sql
CREATE TABLE limit_by(id Int, val Int) ENGINE = Memory;
INSERT INTO limit_by values(1, 10), (1, 11), (1, 12), (2, 20), (2, 21);
```
Queries:
``` sql
SELECT * FROM limit_by ORDER BY id, val LIMIT 2 BY id
```
``` text
┌─id─┬─val─┐
│ 1 │ 10 │
│ 1 │ 11 │
│ 2 │ 20 │
│ 2 │ 21 │
└────┴─────┘
```
``` sql
SELECT * FROM limit_by ORDER BY id, val LIMIT 1, 2 BY id
```
``` text
┌─id─┬─val─┐
│ 1 │ 11 │
│ 1 │ 12 │
│ 2 │ 21 │
└────┴─────┘
```
The `SELECT * FROM limit_by ORDER BY id, val LIMIT 2 OFFSET 1 BY id` query returns the same result.
The following query returns the top 5 referrers for each `domain, device_type` pair with a maximum of 100 rows in total (`LIMIT n BY + LIMIT`).
``` sql
SELECT
    domainWithoutWWW(URL) AS domain,
    domainWithoutWWW(REFERRER_URL) AS referrer,
    device_type,
    count() cnt
FROM hits
GROUP BY domain, referrer, device_type
ORDER BY cnt DESC
LIMIT 5 BY domain, device_type
LIMIT 100
```
### HAVING Clause {#having-clause}
Allows filtering the result received after GROUP BY, similar to the WHERE clause.
WHERE and HAVING differ in that WHERE is performed before aggregation (GROUP BY), while HAVING is performed after it.
If aggregation is not performed, HAVING can't be used.
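A minimal sketch that uses both clauses, assuming the `test.hits` table from the earlier examples. The date and the threshold are arbitrary illustrative values.

``` sql
SELECT
    CounterID,
    count() AS hits
FROM test.hits
WHERE EventDate >= toDate('2014-03-17')  -- filters rows before aggregation
GROUP BY CounterID
HAVING hits > 100000                     -- filters groups after aggregation
```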
### ORDER BY Clause {#select-order-by}
The ORDER BY clause contains a list of expressions, which can each be assigned DESC or ASC (the sorting direction). If the direction is not specified, ASC is assumed. ASC is sorted in ascending order, and DESC in descending order. The sorting direction applies to a single expression, not to the entire list. Example: `ORDER BY Visits DESC, SearchPhrase`
For sorting by String values, you can specify collation (comparison). Example: `ORDER BY SearchPhrase COLLATE 'tr'` - for sorting by keyword in ascending order, using the Turkish alphabet, case insensitive, assuming that strings are UTF-8 encoded. COLLATE can be specified or not for each expression in ORDER BY independently. If ASC or DESC is specified, COLLATE is specified after it. When using COLLATE, sorting is always case-insensitive.
We only recommend using COLLATE for final sorting of a small number of rows, since sorting with COLLATE is less efficient than normal sorting by bytes.
Rows that have identical values for the list of sorting expressions are output in an arbitrary order, which can also be nondeterministic (different each time).
If the ORDER BY clause is omitted, the order of the rows is also undefined, and may be nondeterministic as well.
`NaN` and `NULL` sorting order:
- With the modifier `NULLS FIRST` — First `NULL`, then `NaN`, then other values.
- With the modifier `NULLS LAST` — First the values, then `NaN`, then `NULL`.
- Default — The same as with the `NULLS LAST` modifier.
Example:
For the table
``` text
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │ 2 │
│ 1 │ nan │
│ 2 │ 2 │
│ 3 │ 4 │
│ 5 │ 6 │
│ 6 │ nan │
│ 7 │ ᴺᵁᴸᴸ │
│ 6 │ 7 │
│ 8 │ 9 │
└───┴──────┘
```
Run the query `SELECT * FROM t_null_nan ORDER BY y NULLS FIRST` to get:
``` text
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 7 │ ᴺᵁᴸᴸ │
│ 1 │ nan │
│ 6 │ nan │
│ 2 │ 2 │
│ 2 │ 2 │
│ 3 │ 4 │
│ 5 │ 6 │
│ 6 │ 7 │
│ 8 │ 9 │
└───┴──────┘
```
When floating point numbers are sorted, NaNs are separate from the other values. Regardless of the sorting order, NaNs come at the end. In other words, for ascending sorting they are placed as if they are larger than all the other numbers, while for descending sorting they are placed as if they are smaller than the rest.
Less RAM is used if a small enough LIMIT is specified in addition to ORDER BY. Otherwise, the amount of memory spent is proportional to the volume of data for sorting. For distributed query processing, if GROUP BY is omitted, sorting is partially done on remote servers, and the results are merged on the requestor server. This means that for distributed sorting, the volume of data to sort can be greater than the amount of memory on a single server.
If there is not enough RAM, it is possible to perform sorting in external memory (creating temporary files on a disk). Use the setting `max_bytes_before_external_sort` for this purpose. If it is set to 0 (the default), external sorting is disabled. If it is enabled, when the volume of data to sort reaches the specified number of bytes, the collected data is sorted and dumped into a temporary file. After all data is read, all the sorted files are merged and the results are output. By default, files are written to the `/var/lib/clickhouse/tmp/` directory; you can change this location with the tmp\_path parameter in the server config.
Running a query may use more memory than max\_bytes\_before\_external\_sort. For this reason, this setting must have a value significantly smaller than max\_memory\_usage. As an example, if your server has 128 GB of RAM and you need to run a single query, set max\_memory\_usage to 100 GB, and max\_bytes\_before\_external\_sort to 80 GB.
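The 128 GB example above can be sketched as session settings. The byte values are decimal approximations of 100 GB and 80 GB.

``` sql
SET max_memory_usage = 100000000000;               -- about 100 GB for the query
SET max_bytes_before_external_sort = 80000000000;  -- switch to external sorting at about 80 GB
```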
External sorting works much less effectively than sorting in RAM.
### SELECT Clause {#select-select}
[Expressions](../syntax.md#syntax-expressions) specified in the `SELECT` clause are calculated after all the operations in the clauses described above are finished. These expressions work as if they apply to separate rows in the result. If expressions in the `SELECT` clause contain aggregate functions, then ClickHouse processes aggregate functions and expressions used as their arguments during the [GROUP BY](#select-group-by-clause) aggregation.
If you want to include all columns in the result, use the asterisk (`*`) symbol. For example, `SELECT * FROM ...`.
To match some columns in the result with a [re2](https://en.wikipedia.org/wiki/RE2_(software)) regular expression, you can use the `COLUMNS` expression.
``` sql
COLUMNS('regexp')
```
For example, consider the table:
``` sql
CREATE TABLE default.col_names (aa Int8, ab Int8, bc Int8) ENGINE = TinyLog
```
The following query selects data from all the columns containing the `a` symbol in their name.
``` sql
SELECT COLUMNS('a') FROM col_names
```
``` text
┌─aa─┬─ab─┐
│ 1 │ 1 │
└────┴────┘
```
The selected columns are not returned in alphabetical order.
You can use multiple `COLUMNS` expressions in a query and apply functions to them.
For example:
``` sql
SELECT COLUMNS('a'), COLUMNS('c'), toTypeName(COLUMNS('c')) FROM col_names
```
``` text
┌─aa─┬─ab─┬─bc─┬─toTypeName(bc)─┐
│ 1 │ 1 │ 1 │ Int8 │
└────┴────┴────┴────────────────┘
```
Each column returned by the `COLUMNS` expression is passed to the function as a separate argument. You can also pass other arguments to the function if it supports them. Be careful when using functions. If a function doesn't support the number of arguments you have passed to it, ClickHouse throws an exception.
For example:
``` sql
SELECT COLUMNS('a') + COLUMNS('c') FROM col_names
```
``` text
Received exception from server (version 19.14.1):
Code: 42. DB::Exception: Received from localhost:9000. DB::Exception: Number of arguments for function plus doesn't match: passed 3, should be 2.
```
In this example, `COLUMNS('a')` returns two columns: `aa` and `ab`. `COLUMNS('c')` returns the `bc` column. The `+` operator can't apply to 3 arguments, so ClickHouse throws an exception with the relevant message.
Columns that matched the `COLUMNS` expression can have different data types. If `COLUMNS` doesn't match any columns and is the only expression in `SELECT`, ClickHouse throws an exception.
### DISTINCT Clause {#select-distinct}
If DISTINCT is specified, only a single row will remain out of all the sets of fully matching rows in the result.
The result will be the same as if GROUP BY were specified across all the fields specified in SELECT without aggregate functions. But there are several differences from GROUP BY:
- DISTINCT can be applied together with GROUP BY.
- When ORDER BY is omitted and LIMIT is defined, the query stops running immediately after the required number of different rows has been read.
- Data blocks are output as they are processed, without waiting for the entire query to finish running.
DISTINCT is not supported if SELECT has at least one array column.
`DISTINCT` works with [NULL](../syntax.md) as if `NULL` were a specific value, and `NULL=NULL`. In other words, in the `DISTINCT` results, different combinations with `NULL` only occur once.
ClickHouse supports using the `DISTINCT` and `ORDER BY` clauses for different columns in one query. The `DISTINCT` clause is executed before the `ORDER BY` clause.
Example table:
``` text
┌─a─┬─b─┐
│ 2 │ 1 │
│ 1 │ 2 │
│ 3 │ 3 │
│ 2 │ 4 │
└───┴───┘
```
When selecting data with the `SELECT DISTINCT a FROM t1 ORDER BY b ASC` query, we get the following result:
``` text
┌─a─┐
│ 2 │
│ 1 │
│ 3 │
└───┘
```
If we change the sorting direction `SELECT DISTINCT a FROM t1 ORDER BY b DESC`, we get the following result:
``` text
┌─a─┐
│ 3 │
│ 1 │
│ 2 │
└───┘
```
The row `2, 4` was removed by `DISTINCT` before sorting, because the value `a = 2` had already been taken from the row `2, 1`.
Take this implementation detail into account when writing queries.
### LIMIT Clause {#limit-clause}
`LIMIT m` allows you to select the first `m` rows from the result.
`LIMIT n, m` allows you to select the first `m` rows from the result after skipping the first `n` rows. The `LIMIT m OFFSET n` syntax is also supported.
`n` and `m` must be non-negative integers.
If there isn't an `ORDER BY` clause that explicitly sorts results, the result may be arbitrary and nondeterministic.
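As a short sketch, the two equivalent forms can be tried on the `limit_by` sample table created in the LIMIT BY section above. Both queries skip the first row of the sorted result and return the next two rows, `(1, 11)` and `(1, 12)`.

``` sql
SELECT * FROM limit_by ORDER BY id, val LIMIT 1, 2;

SELECT * FROM limit_by ORDER BY id, val LIMIT 2 OFFSET 1;
```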
### UNION ALL Clause {#union-all-clause}
You can use UNION ALL to combine any number of queries. Example:
``` sql
SELECT CounterID, 1 AS table, toInt64(count()) AS c
FROM test.hits
GROUP BY CounterID
UNION ALL
SELECT CounterID, 2 AS table, sum(Sign) AS c
FROM test.visits
GROUP BY CounterID
HAVING c > 0
```
Only UNION ALL is supported. The regular UNION (UNION DISTINCT) is not supported. If you need UNION DISTINCT, you can write SELECT DISTINCT from a subquery containing UNION ALL.
Queries that are parts of UNION ALL can be run simultaneously, and their results can be mixed together.
The structure of results (the number and type of columns) must match for the queries. But the column names can differ. In this case, the column names for the final result will be taken from the first query. Type casting is performed for unions. For example, if two queries being combined have the same field with non-`Nullable` and `Nullable` types from a compatible type, the resulting `UNION ALL` has a `Nullable` type field.
Queries that are parts of UNION ALL can't be enclosed in brackets. ORDER BY and LIMIT are applied to separate queries, not to the final result. If you need to apply a conversion to the final result, you can put all the queries with UNION ALL in a subquery in the FROM clause.
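A sketch of the subquery approach, reusing the `UNION ALL` example above to sort and limit the combined result:

``` sql
SELECT *
FROM
(
    SELECT CounterID, 1 AS table, toInt64(count()) AS c
    FROM test.hits
    GROUP BY CounterID

    UNION ALL

    SELECT CounterID, 2 AS table, sum(Sign) AS c
    FROM test.visits
    GROUP BY CounterID
    HAVING c > 0
)
ORDER BY c DESC  -- applied to the combined result
LIMIT 10
```

Replacing `SELECT *` with `SELECT DISTINCT *` in the outer query gives the `UNION DISTINCT` behavior mentioned above.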
### INTO OUTFILE Clause {#into-outfile-clause}
Add the `INTO OUTFILE filename` clause (where filename is a string literal) to redirect query output to the specified file.
In contrast to MySQL, the file is created on the client side. The query will fail if a file with the same filename already exists.
This functionality is available in the command-line client and clickhouse-local (a query sent via HTTP interface will fail).
The default output format is TabSeparated (the same as in the command-line client batch mode).
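A minimal sketch for the command-line client, assuming the `test.hits` table from the earlier examples. The file name `hits_by_counter.csv` is a hypothetical example.

``` sql
SELECT
    CounterID,
    count() AS hits
FROM test.hits
GROUP BY CounterID
INTO OUTFILE 'hits_by_counter.csv'  -- created on the client side; must not already exist
FORMAT CSV
```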
### FORMAT Clause {#format-clause}
Specify `FORMAT format` to get data in any specified format.
You can use this for convenience, or for creating dumps.
For more information, see the section “Formats”.
If the FORMAT clause is omitted, the default format is used, which depends on both the settings and the interface used for accessing the DB. For the HTTP interface and the command-line client in batch mode, the default format is TabSeparated. For the command-line client in interactive mode, the default format is PrettyCompact (it has attractive and compact tables).
When using the command-line client, data is passed to the client in an internal efficient format. The client independently interprets the FORMAT clause of the query and formats the data itself (thus relieving the network and the server from the load).
### IN Operators {#select-in-operators}
The `IN`, `NOT IN`, `GLOBAL IN`, and `GLOBAL NOT IN` operators are covered separately, since their functionality is quite rich.
The left side of the operator is either a single column or a tuple.
Examples:
``` sql
SELECT UserID IN (123, 456) FROM ...
SELECT (CounterID, UserID) IN ((34, 123), (101500, 456)) FROM ...
```
If the left side is a single column that is in the index, and the right side is a set of constants, the system uses the index for processing the query.
Don't list too many values explicitly (i.e. millions). If a data set is large, put it in a temporary table (for example, see the section “External data for query processing”), then use a subquery.
The right side of the operator can be a set of constant expressions, a set of tuples with constant expressions (shown in the examples above), or the name of a database table or SELECT subquery in brackets.
If the right side of the operator is the name of a table (for example, `UserID IN users`), this is equivalent to the subquery `UserID IN (SELECT * FROM users)`. Use this when working with external data that is sent along with the query. For example, the query can be sent together with a set of user IDs loaded to the users temporary table, which should be filtered.
If the right side of the operator is a table name that has the Set engine (a prepared data set that is always in RAM), the data set will not be created over again for each query.
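A minimal sketch of using a table with the Set engine on the right side of `IN`. The table name `user_set` is hypothetical, and the type of `UserID` in `test.hits` is assumed to be `UInt64`.

``` sql
CREATE TABLE user_set (UserID UInt64) ENGINE = Set;

INSERT INTO user_set VALUES (123), (456);

-- The prepared set is reused; it is not rebuilt for each query.
SELECT count()
FROM test.hits
WHERE UserID IN user_set;
```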
The subquery may specify more than one column for filtering tuples.
Example:
``` sql
SELECT (CounterID, UserID) IN (SELECT CounterID, UserID FROM ...) FROM ...
```
The columns to the left and right of the IN operator should have the same type.
The IN operator and subquery may occur in any part of the query, including in aggregate functions and lambda functions.
Example:
``` sql
SELECT
    EventDate,
    avg(UserID IN
    (
        SELECT UserID
        FROM test.hits
        WHERE EventDate = toDate('2014-03-17')
    )) AS ratio
FROM test.hits
GROUP BY EventDate
ORDER BY EventDate ASC
``` text
┌──EventDate─┬────ratio─┐
│ 2014-03-17 │ 1 │
│ 2014-03-18 │ 0.807696 │
│ 2014-03-19 │ 0.755406 │
│ 2014-03-20 │ 0.723218 │
│ 2014-03-21 │ 0.697021 │
│ 2014-03-22 │ 0.647851 │
│ 2014-03-23 │ 0.648416 │
└────────────┴──────────┘
```
For each day after March 17th, this query calculates the share of pageviews made by users who visited the site on March 17th.
A subquery in the IN clause is always run just one time on a single server. There are no dependent subqueries.
#### NULL processing {#null-processing-1}
During request processing, the IN operator assumes that the result of an operation with [NULL](../syntax.md) is always equal to `0`, regardless of whether `NULL` is on the right or left side of the operator. `NULL` values are not included in any dataset, do not correspond to each other and cannot be compared.
Here is an example with the `t_null` table:
``` text
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │ 3 │
└───┴──────┘
```
Running the query `SELECT x FROM t_null WHERE y IN (NULL,3)` gives you the following result:
``` text
┌─x─┐
│ 2 │
└───┘
```
You can see that the row in which `y = NULL` is dropped from the query results. This is because ClickHouse can't decide whether `NULL` is included in the `(NULL,3)` set, so it returns `0` as the result of the operation, and `SELECT` excludes this row from the final output.
``` sql
SELECT y IN (NULL, 3)
FROM t_null
```
``` text
┌─in(y, tuple(NULL, 3))─┐
│                     0 │
│                     1 │
└───────────────────────┘
```
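If rows with `NULL` need to be kept, one option is to test for `NULL` explicitly instead of listing it in the set. The sketch below assumes a definition of `t_null` that matches the data shown above; the engine and column types are illustrative.

``` sql
-- Assumed definition of `t_null`; the engine and column types are illustrative.
CREATE TABLE t_null (x Int8, y Nullable(Int8)) ENGINE = TinyLog;
INSERT INTO t_null VALUES (1, NULL), (2, 3);

-- Test for NULL explicitly instead of listing it in the IN set.
SELECT x FROM t_null WHERE y IN (3) OR y IS NULL;
```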
#### Distributed Subqueries {#select-distributed-subqueries}
There are two options for `IN` with subqueries (similar to `JOIN`): normal `IN` / `JOIN` and `GLOBAL IN` / `GLOBAL JOIN`. They differ in how they are run for distributed query processing.
!!! attention "Attention"
    Remember that the algorithms described below may work differently depending on the `distributed_product_mode` [setting](../../operations/settings/settings.md).
When using the regular `IN`, the query is sent to remote servers, and each of them runs the subqueries in the `IN` or `JOIN` clause.
When using `GLOBAL IN` / `GLOBAL JOIN`, all the subqueries for `GLOBAL IN` / `GLOBAL JOIN` are run first, and the results are collected in temporary tables. Then the temporary tables are sent to each remote server, where the queries are run using this temporary data.
For a non-distributed query, use the regular `IN` / `JOIN`.
Be careful when using subqueries in the `IN` / `JOIN` clauses for distributed query processing.
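The `distributed_product_mode` setting mentioned above can be changed per session. The sketch below shows two of its values as described in the settings reference; check the behavior against the documentation for your server version.

``` sql
-- Reject IN / JOIN subqueries that reference distributed tables unless GLOBAL is used.
SET distributed_product_mode = 'deny';

-- Rewrite such subqueries to GLOBAL IN / GLOBAL JOIN automatically.
SET distributed_product_mode = 'global';
```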
Let's look at some examples. Assume that each server in the cluster has a normal **local\_table**. Each server also has a **distributed\_table** table with the **Distributed** type, which looks at all the servers in the cluster.
For a query to the **distributed\_table**, the query will be sent to all the remote servers and run on them using the **local\_table**.
For example, the query
``` sql
SELECT uniq(UserID) FROM distributed_table
```
will be sent to all remote servers as
``` sql
SELECT uniq(UserID) FROM local_table
```
and run on each of them in parallel, until it reaches the stage where intermediate results can be combined. Then the intermediate results will be returned to the requestor server and merged on it, and the final result will be sent to the client.
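For reference, the table layout assumed in these examples could be defined roughly as follows. This is a sketch only: the cluster name `my_cluster`, the database `default`, the column types, and the sorting key are assumptions, and `local_table` has to exist on every server in the cluster.

``` sql
-- Hypothetical names: the cluster `my_cluster` and the database `default` are assumptions.
CREATE TABLE local_table
(
    CounterID UInt32,
    UserID UInt64
)
ENGINE = MergeTree()
ORDER BY (CounterID, UserID);

-- Looks at `local_table` on all servers of the cluster; rand() spreads inserted rows randomly.
CREATE TABLE distributed_table AS local_table
ENGINE = Distributed(my_cluster, default, local_table, rand());
```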
Now let's examine a query with `IN`:
``` sql
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM local_table WHERE CounterID = 34)
```
This query calculates the intersection of the audiences of two sites.
This query will be sent to all remote servers as
``` sql
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM local_table WHERE CounterID = 34)
```
In other words, the data set in the IN clause will be collected on each server independently, only across the data that is stored locally on each of the servers.
This will work correctly and optimally if you are prepared for this case and have spread data across the cluster servers such that the data for a single UserID resides entirely on a single server. In this case, all the necessary data will be available locally on each server. Otherwise, the result will be inaccurate. We refer to this variation of the query as “local IN”.
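One way to arrange for all data of a single UserID to land on one server is to shard by a function of `UserID` when inserting through the distributed table. The definition below is a sketch under that assumption; `my_cluster` and `default` are hypothetical names, and the sharding key only affects rows written through the distributed table.

``` sql
-- Hypothetical cluster and database names; intHash64(UserID) is the sharding key.
CREATE TABLE distributed_table AS local_table
ENGINE = Distributed(my_cluster, default, local_table, intHash64(UserID));
```

With such a layout, the “local IN” form of the query sees every row for a given user on the same server.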
To correct how the query works when data is spread randomly across the cluster servers, you could specify **distributed\_table** inside a subquery. The query would look like this:
``` sql
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```
This query will be sent to all remote servers as
``` sql
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```
The subquery will begin running on each remote server. Since the subquery uses a distributed table, the subquery that is on each remote server will be resent to every remote server as
``` sql
SELECT UserID FROM local_table WHERE CounterID = 34
```
For example, if you have a cluster of 100 servers, executing the entire query will require 10,000 elementary requests, which is generally considered unacceptable.
In such cases, you should always use `GLOBAL IN` instead of `IN`. Let's look at how it works for the query
``` sql
SELECT uniq(UserID) FROM distributed_table WHERE CounterID = 101500 AND UserID GLOBAL IN (SELECT UserID FROM distributed_table WHERE CounterID = 34)
```
The requestor server will run the subquery
``` sql
SELECT UserID FROM distributed_table WHERE CounterID = 34
```
and the result will be put in a temporary table in RAM. Then the request will be sent to each remote server as
``` sql
SELECT uniq(UserID) FROM local_table WHERE CounterID = 101500 AND UserID GLOBAL IN _data1
```
and the temporary table `_data1` will be sent to every remote server with the query (the name of the temporary table is implementation-defined).
This is more efficient than using the normal `IN`. However, keep the following points in mind:
1. When creating a temporary table, data is not made unique. To reduce the volume of data transmitted over the network, specify `DISTINCT` in the subquery. (You don't need to do this for a normal `IN`.)
2. The temporary table will be sent to all the remote servers. Transmission does not account for network topology. For example, if 10 remote servers reside in a datacenter that is very remote in relation to the requestor server, the data will be sent 10 times over the channel to the remote datacenter. Try to avoid large data sets when using GLOBAL IN.
3. When transmitting data to remote servers, restrictions on network bandwidth are not configurable. You might overload the network.
4. Try to distribute data across servers so that you don't need to use `GLOBAL IN` on a regular basis.
5. If you need to use GLOBAL IN often, plan the location of the ClickHouse cluster so that a single group of replicas resides in no more than one data center with a fast network between them, so that a query can be processed entirely within a single data center.
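Following the first point, the earlier `GLOBAL IN` query can be written with `DISTINCT` in the subquery so that the temporary table sent to the remote servers contains each `UserID` only once:

``` sql
SELECT uniq(UserID)
FROM distributed_table
WHERE CounterID = 101500
  AND UserID GLOBAL IN
  (
      -- DISTINCT keeps the transmitted temporary table as small as possible.
      SELECT DISTINCT UserID
      FROM distributed_table
      WHERE CounterID = 34
  )
```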
It also makes sense to specify a local table in the `GLOBAL IN` clause, in case this local table is only available on the requestor server and you want to use data from it on remote servers.
### Extreme Values {#extreme-values}
In addition to results, you can also get minimum and maximum values for the result columns. To do this, set the **extremes** setting to 1. Minimums and maximums are calculated for numeric types, dates, and dates with times. For other columns, the default values are output.
Two extra rows are calculated: the minimums and the maximums, respectively. These two extra rows are output in `JSON*`, `TabSeparated*`, and `Pretty*` [formats](../../interfaces/formats.md), separate from the other rows. They are not output for other formats.
In `JSON*` formats, the extreme values are output in a separate `extremes` field. In `TabSeparated*` formats, the rows come after the main result, and after `totals` if present. They are preceded by an empty row (after the other data). In `Pretty*` formats, the rows are output as a separate table after the main result, and after `totals` if present.
Extreme values are calculated for rows before `LIMIT`, but after `LIMIT BY`. However, when using `LIMIT offset, size`, the rows before `offset` are included in `extremes`. In stream requests, the result may also include a small number of rows that passed through `LIMIT`.
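A short way to try this is to enable the setting and query a small generated data set. The sketch below uses the `numbers` table function; the column aliases `n` and `s` are arbitrary.

``` sql
SET extremes = 1;

-- `n` is numeric, so its minimum and maximum are calculated;
-- `s` is a string, so default values are output for it.
SELECT number AS n, toString(number) AS s
FROM numbers(5)
FORMAT JSON
```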
### Notes {#notes}
The `GROUP BY` and `ORDER BY` clauses do not support positional arguments. This differs from MySQL, but conforms to standard SQL.
For example, `GROUP BY 1, 2` will be interpreted as grouping by constants (i.e. aggregation of all rows into one).
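The difference can be seen by comparing the two queries below against the `test.hits` table used earlier. This is a sketch of the behavior described here, not a statement about every server version.

``` sql
-- `1` is treated as a constant, so all rows fall into a single group.
SELECT count() FROM test.hits GROUP BY 1;

-- To group by a column, name it explicitly.
SELECT EventDate, count() FROM test.hits GROUP BY EventDate;
```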
You can use synonyms (`AS` aliases) in any part of a query.
You can put an asterisk in any part of a query instead of an expression. When the query is analyzed, the asterisk is expanded to a list of all table columns (excluding the `MATERIALIZED` and `ALIAS` columns). There are only a few cases when using an asterisk is justified:
- When creating a table dump.
- For tables containing just a few columns, such as system tables.
- For getting information about what columns are in a table. In this case, set `LIMIT 1`. But it is better to use the `DESC TABLE` query.
- When there is strong filtration on a small number of columns using `PREWHERE`.
- In subqueries (since columns that aren't needed for the external query are excluded from subqueries).
In all other cases, we don't recommend using the asterisk, since it only gives you the drawbacks of a columnar DBMS instead of the advantages.
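As an illustration of the alternatives, the statements below inspect the table structure with `DESC TABLE` and then read only the columns that are needed. They assume the `test.hits` table from the earlier examples.

``` sql
-- List the columns without reading any data.
DESC TABLE test.hits;

-- Read only the required columns instead of using an asterisk.
SELECT EventDate, UserID FROM test.hits LIMIT 10;
```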
[Original article](https://clickhouse.tech/docs/en/query_language/select/) <!--hide-->