ClickHouse/docs/en/query_language/agg_functions/reference.md

999 lines
34 KiB
Markdown
Raw Normal View History

# Function Reference
## count {#agg_function-count}
Counts the number of rows or not-NULL values.
ClickHouse supports the following syntaxes for `count`:
- `count(expr)` or `COUNT(DISTINCT expr)`.
- `count()` or `COUNT(*)`. The `count()` syntax is a ClickHouse-specific implementation.
**Parameters**
The function can take:
- Zero parameters.
- One [expression](../syntax.md#syntax-expressions).
**Returned value**
- If the function is called without parameters it counts the number of rows.
- If the [expression](../syntax.md#syntax-expressions) is passed, then the function counts how many times this expression returned not null. If the expression returns a value of the [Nullable](../../data_types/nullable.md) data type, then the result of `count` stays not `Nullable`. The function returns 0 if the expression returned `NULL` for all the rows.
In both cases the type of the returned value is [UInt64](../../data_types/int_uint.md).
**Details**
ClickHouse supports the `COUNT(DISTINCT ...)` syntax. The behavior of this construction depends on the [count_distinct_implementation](../../operations/settings/settings.md#settings-count_distinct_implementation) setting. It defines which of the [uniq*](#agg_function-uniq) functions is used to perform the operation. By default the [uniqExact](#agg_function-uniqexact) function.
A `SELECT count() FROM table` query is not optimized, because the number of entries in the table is not stored separately. It chooses some small column from the table and count the number of values in it.
**Examples**
Example 1:
```sql
SELECT count() FROM t
```
```text
┌─count()─┐
│ 5 │
└─────────┘
```
Example 2:
```sql
SELECT name, value FROM system.settings WHERE name = 'count_distinct_implementation'
```
```text
┌─name──────────────────────────┬─value─────┐
│ count_distinct_implementation │ uniqExact │
└───────────────────────────────┴───────────┘
```
```sql
SELECT count(DISTINCT num) FROM t
```
```text
┌─uniqExact(num)─┐
│ 3 │
└────────────────┘
```
This example shows that `count(DISTINCT num)` is performed by the function `uniqExact` corresponding to the `count_distinct_implementation` setting value.
## any(x) {#agg_function-any}
Selects the first encountered value.
The query can be executed in any order and even in a different order each time, so the result of this function is indeterminate.
To get a determinate result, you can use the 'min' or 'max' function instead of 'any'.
In some cases, you can rely on the order of execution. This applies to cases when SELECT comes from a subquery that uses ORDER BY.
When a `SELECT` query has the `GROUP BY` clause or at least one aggregate function, ClickHouse (in contrast to MySQL) requires that all expressions in the `SELECT`, `HAVING`, and `ORDER BY` clauses be calculated from keys or from aggregate functions. In other words, each column selected from the table must be used either in keys or inside aggregate functions. To get behavior like in MySQL, you can put the other columns in the `any` aggregate function.
2018-04-23 06:20:21 +00:00
## anyHeavy(x)
Selects a frequently occurring value using the [heavy hitters](http://www.cs.umd.edu/~samir/498/karp.pdf) algorithm. If there is a value that occurs more than in half the cases in each of the query's execution threads, this value is returned. Normally, the result is nondeterministic.
```
anyHeavy(column)
```
**Arguments**
- `column` The column name.
**Example**
WIP on docs (#3813) * CLICKHOUSE-4063: less manual html @ index.md * CLICKHOUSE-4063: recommend markdown="1" in README.md * CLICKHOUSE-4003: manually purge custom.css for now * CLICKHOUSE-4064: expand <details> before any print (including to pdf) * CLICKHOUSE-3927: rearrange interfaces/formats.md a bit * CLICKHOUSE-3306: add few http headers * Remove copy-paste introduced in #3392 * Hopefully better chinese fonts #3392 * get rid of tabs @ custom.css * Apply comments and patch from #3384 * Add jdbc.md to ToC and some translation, though it still looks badly incomplete * minor punctuation * Add some backlinks to official website from mirrors that just blindly take markdown sources * Do not make fonts extra light * find . -name '*.md' -type f | xargs -I{} perl -pi -e 's//g' {} * find . -name '*.md' -type f | xargs -I{} perl -pi -e 's/ sql/g' {} * Remove outdated stuff from roadmap.md * Not so light font on front page too * Refactor Chinese formats.md to match recent changes in other languages * Update some links on front page * Remove some outdated comment * Add twitter link to front page * More front page links tuning * Add Amsterdam meetup link * Smaller font to avoid second line * Add Amsterdam link to README.md * Proper docs nav translation * Back to 300 font-weight except Chinese * fix docs build * Update Amsterdam link * remove symlinks * more zh punctuation * apply lost comment by @zhang2014 * Apply comments by @zhang2014 from #3417 * Remove Beijing link * rm incorrect symlink * restore content of docs/zh/operations/table_engines/index.md * CLICKHOUSE-3751: stem terms while searching docs * CLICKHOUSE-3751: use English stemmer in non-English docs too * CLICKHOUSE-4135 fix * Remove past meetup link * Add blog link to top nav * Add ContentSquare article link * Add form link to front page + refactor some texts * couple markup fixes * minor * Introduce basic ODBC driver page in docs * More verbose 3rd party libs disclaimer * Put third-party stuff into a separate folder * Separate third-party stuff in ToC too * Update links * Move stuff that is not really (only) a client library into a separate page * Add clickhouse-hdfs-loader link * Some introduction for "interfaces" section * Rewrite tcp.md * http_interface.md -> http.md * fix link * Remove unconvenient error for now * try to guess anchor instead of failing * remove symlink * Remove outdated info from introduction * remove ru roadmap.md * replace ru roadmap.md with symlink * Update roadmap.md * lost file * Title case in toc_en.yml * Sync "Functions" ToC section with en * Remove reference to pretty old ClickHouse release from docs * couple lost symlinks in fa * Close quote in proper place * Rewrite en/getting_started/index.md * Sync en<>ru getting_started/index.md * minor changes * Some gui.md refactoring * Translate DataGrip section to ru * Translate DataGrip section to zh * Translate DataGrip section to fa * Translate DBeaver section to fa * Translate DBeaver section to zh * Split third-party GUI to open-source and commercial * Mention some RDBMS integrations + ad-hoc translation fixes * Add rel="external nofollow" to outgoing links from docs * Lost blank lines * Fix class name * More rel="external nofollow" * Apply suggestions by @sundy-li * Mobile version of front page improvements * test * test 2 * test 3 * Update LICENSE * minor docs fix * Highlight current article as suggested by @sundy-li * fix link destination * Introduce backup.md (only "en" for now) * Mention INSERT+SELECT in backup.md * Some improvements for replication.md * Add backup.md to toc * Mention clickhouse-backup tool * Mention LightHouse in third-party GUI list * Introduce interfaces/third-party/proxy.md * Add clickhouse-bulk to proxy.md * Major extension of integrations.md contents * fix link target * remove unneeded file * better toc item name * fix markdown * better ru punctuation * Add yet another possible backup approach * Simplify copying permalinks to headers * Support non-eng link anchors in docs + update some deps * Generate anchors for single-page mode automatically * Remove anchors to top of pages * Remove anchors that nobody links to * build fixes * fix few links * restore css * fix some links * restore gifs * fix lost words * more docs fixes * docs fixes * NULL anchor * update urllib3 dependency * more fixes
2018-12-12 17:28:00 +00:00
Take the [OnTime](../../getting_started/example_datasets/ontime.md) data set and select any frequently occurring value in the `AirlineID` column.
``` sql
SELECT anyHeavy(AirlineID) AS res
FROM ontime
```
2018-04-23 06:20:21 +00:00
```
┌───res─┐
│ 19690 │
└───────┘
```
## anyLast(x)
Selects the last value encountered.
The result is just as indeterminate as for the `any` function.
##groupBitAnd
Applies bitwise `AND` for series of numbers.
```
groupBitAnd(expr)
```
**Parameters**
`expr` An expression that results in `UInt*` type.
**Return value**
Value of the `UInt*` type.
**Example**
Test data:
```
binary decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
```
Query:
```
SELECT groupBitAnd(num) FROM t
```
Where `num` is the column with the test data.
Result:
```
binary decimal
00000100 = 4
```
##groupBitOr
Applies bitwise `OR` for series of numbers.
```
groupBitOr(expr)
```
**Parameters**
`expr` An expression that results in `UInt*` type.
**Return value**
Value of the `UInt*` type.
**Example**
Test data:
```
binary decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
```
Query:
```
SELECT groupBitOr(num) FROM t
```
Where `num` is the column with the test data.
Result:
```
binary decimal
01111101 = 125
```
##groupBitXor
Applies bitwise `XOR` for series of numbers.
```
groupBitXor(expr)
```
**Parameters**
`expr` An expression that results in `UInt*` type.
**Return value**
Value of the `UInt*` type.
**Example**
Test data:
```
binary decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
```
Query:
```
SELECT groupBitXor(num) FROM t
```
Where `num` is the column with the test data.
Result:
```
binary decimal
01101000 = 104
```
2019-06-14 10:29:16 +00:00
## groupBitmap
Bitmap or Aggregate calculations from a unsigned integer column, return cardinality of type UInt64, if add suffix -State, then return [bitmap object](../functions/bitmap_functions.md).
```
groupBitmap(expr)
```
**Parameters**
`expr` An expression that results in `UInt*` type.
**Return value**
Value of the `UInt64` type.
**Example**
Test data:
```
2019-07-03 20:58:10 +00:00
UserID
1
1
2
3
```
Query:
```
2019-07-03 20:58:10 +00:00
SELECT groupBitmap(UserID) as num FROM t
```
Result:
```
num
3
```
## min(x) {#agg_function-min}
Calculates the minimum.
## max(x) {#agg_function-max}
Calculates the maximum.
## argMin(arg, val) {#agg_function-argMin}
Calculates the 'arg' value for a minimal 'val' value. If there are several different values of 'arg' for minimal values of 'val', the first of these values encountered is output.
**Example:**
```
┌─user─────┬─salary─┐
│ director │ 5000 │
│ manager │ 3000 │
│ worker │ 1000 │
└──────────┴────────┘
SELECT argMin(user, salary) FROM salary
┌─argMin(user, salary)─┐
│ worker │
└──────────────────────┘
```
## argMax(arg, val) {#agg_function-argMax}
Calculates the 'arg' value for a maximum 'val' value. If there are several different values of 'arg' for maximum values of 'val', the first of these values encountered is output.
## sum(x) {#agg_function-sum}
Calculates the sum.
Only works for numbers.
## sumWithOverflow(x)
Computes the sum of the numbers, using the same data type for the result as for the input parameters. If the sum exceeds the maximum value for this data type, the function returns an error.
Only works for numbers.
2019-02-04 13:30:28 +00:00
## sumMap(key, value) {#agg_functions-summap}
Totals the 'value' array according to the keys specified in the 'key' array.
The number of elements in 'key' and 'value' must be the same for each row that is totaled.
Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
Example:
``` sql
CREATE TABLE sum_map(
date Date,
timeslot DateTime,
statusMap Nested(
status UInt16,
requests UInt64
)
) ENGINE = Log;
INSERT INTO sum_map VALUES
('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10]);
SELECT
timeslot,
sumMap(statusMap.status, statusMap.requests)
FROM sum_map
GROUP BY timeslot
```
```
┌────────────timeslot─┬─sumMap(statusMap.status, statusMap.requests)─┐
│ 2000-01-01 00:00:00 │ ([1,2,3,4,5],[10,10,20,10,10]) │
│ 2000-01-01 00:01:00 │ ([4,5,6,7,8],[10,10,20,10,10]) │
└─────────────────────┴──────────────────────────────────────────────┘
```
## skewPop
Computes the [skewness](https://en.wikipedia.org/wiki/Skewness) of a sequence.
```
skewPop(expr)
```
**Parameters**
`expr` — [Expression](../syntax.md#syntax-expressions) returning a number.
**Returned value**
The skewness of the given distribution. Type — [Float64](../../data_types/float.md)
**Example**
```sql
SELECT skewPop(value) FROM series_with_value_column
```
## skewSamp
Computes the [sample skewness](https://en.wikipedia.org/wiki/Skewness) of a sequence.
It represents an unbiased estimate of the skewness of a random variable if passed values form its sample.
```
skewSamp(expr)
```
**Parameters**
`expr` — [Expression](../syntax.md#syntax-expressions) returning a number.
**Returned value**
The skewness of the given distribution. Type — [Float64](../../data_types/float.md). If `n <= 1` (`n` is the size of the sample), then the function returns `nan`.
**Example**
```sql
SELECT skewSamp(value) FROM series_with_value_column
```
## kurtPop
Computes the [kurtosis](https://en.wikipedia.org/wiki/Kurtosis) of a sequence.
```
kurtPop(expr)
```
**Parameters**
`expr` — [Expression](../syntax.md#syntax-expressions) returning a number.
**Returned value**
The kurtosis of the given distribution. Type — [Float64](../../data_types/float.md)
**Example**
```sql
SELECT kurtPop(value) FROM series_with_value_column
```
## kurtSamp
Computes the [sample kurtosis](https://en.wikipedia.org/wiki/Kurtosis) of a sequence.
It represents an unbiased estimate of the kurtosis of a random variable if passed values form its sample.
```
kurtSamp(expr)
```
**Parameters**
`expr` — [Expression](../syntax.md#syntax-expressions) returning a number.
**Returned value**
The kurtosis of the given distribution. Type — [Float64](../../data_types/float.md). If `n <= 1` (`n` is a size of the sample), then the function returns `nan`.
**Example**
```sql
SELECT kurtSamp(value) FROM series_with_value_column
```
## timeSeriesGroupSum(uid, timestamp, value) {#agg_function-timeseriesgroupsum}
`timeSeriesGroupSum` can aggregate different time series that sample timestamp not alignment.
It will use linear interpolation between two sample timestamp and then sum time-series together.
- `uid` is the time series unique id, `UInt64`.
- `timestamp` is Int64 type in order to support millisecond or microsecond.
- `value` is the metric.
The function returns array of tuples with `(timestamp, aggregated_value)` pairs.
Before using this function make sure `timestamp` is in ascending order.
Example:
```
┌─uid─┬─timestamp─┬─value─┐
│ 1 │ 2 │ 0.2 │
│ 1 │ 7 │ 0.7 │
│ 1 │ 12 │ 1.2 │
│ 1 │ 17 │ 1.7 │
│ 1 │ 25 │ 2.5 │
│ 2 │ 3 │ 0.6 │
│ 2 │ 8 │ 1.6 │
│ 2 │ 12 │ 2.4 │
│ 2 │ 18 │ 3.6 │
│ 2 │ 24 │ 4.8 │
└─────┴───────────┴───────┘
```
```
CREATE TABLE time_series(
uid UInt64,
timestamp Int64,
value Float64
) ENGINE = Memory;
INSERT INTO time_series VALUES
(1,2,0.2),(1,7,0.7),(1,12,1.2),(1,17,1.7),(1,25,2.5),
(2,3,0.6),(2,8,1.6),(2,12,2.4),(2,18,3.6),(2,24,4.8);
SELECT timeSeriesGroupSum(uid, timestamp, value)
FROM (
SELECT * FROM time_series order by timestamp ASC
);
```
And the result will be:
```
[(2,0.2),(3,0.9),(7,2.1),(8,2.4),(12,3.6),(17,5.1),(18,5.4),(24,7.2),(25,2.5)]
```
## timeSeriesGroupRateSum(uid, ts, val) {#agg_function-timeseriesgroupratesum}
Similarly timeSeriesGroupRateSum, timeSeriesGroupRateSum will Calculate the rate of time-series and then sum rates together.
Also, timestamp should be in ascend order before use this function.
Use this function, the result above case will be:
```
[(2,0),(3,0.1),(7,0.3),(8,0.3),(12,0.3),(17,0.3),(18,0.3),(24,0.3),(25,0.1)]
```
## avg(x) {#agg_function-avg}
Calculates the average.
Only works for numbers.
The result is always Float64.
## uniq {#agg_function-uniq}
Calculates the approximate number of different values of the argument.
```
uniq(x[, ...])
```
**Parameters**
Function takes the variable number of parameters. Parameters can be of types: `Tuple`, `Array`, `Date`, `DateTime`, `String`, numeric types.
**Returned value**
- The number of the [UInt64](../../data_types/int_uint.md) type.
**Implementation details**
Function:
2018-11-22 22:46:46 +00:00
- Calculates a hash for all parameters in the aggregate, then uses it in calculations.
- Uses an adaptive sampling algorithm. For the calculation state, the function uses a sample of element hash values with a size up to 65536.
This algorithm is very accurate and very efficient on CPU. When query contains several of these functions, using `uniq` is almost as fast as using other aggregate functions.
- Provides the result deterministically (it doesn't depend on the order of query processing).
We recommend to use this function in almost all scenarios.
**See Also**
- [uniqCombined](#agg_function-uniqcombined)
- [uniqHLL12](#agg_function-uniqhll12)
- [uniqExact](#agg_function-uniqexact)
## uniqCombined {#agg_function-uniqcombined}
Calculates the approximate number of different values of the argument.
```
uniqCombined(HLL_precision)(x[, ...])
```
The `uniqCombined` function is a good choice for calculating the number of different values, but keep in mind that the estimation error for large sets (200 million elements and more) will become larger than theoretical value due to poor choice of hash function.
**Parameters**
Function takes the variable number of parameters. Parameters can be of types: `Tuple`, `Array`, `Date`, `DateTime`, `String`, numeric types.
`HLL_precision` is the base-2 logarithm of the number of cells in [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). Optional, you can use the function as `uniqCombined(x[, ...])`. The default value for `HLL_precision` is 17, that is effectively 96 KiB of space (2^17 cells of 6 bits each).
**Returned value**
- The number of the [UInt64](../../data_types/int_uint.md) type.
**Implementation details**
Function:
- Calculates a hash for all parameters in the aggregate, then uses it in calculations.
- Uses combination of three algorithms: array, hash table and HyperLogLog with an error correction table.
For small number of distinct elements, the array is used. When the set size becomes larger the hash table is used, while it is smaller than HyperLogLog data structure. For larger number of elements, the HyperLogLog is used, and it will occupy fixed amount of memory.
- Provides the result deterministically (it doesn't depend on the order of query processing).
In comparison with the [uniq](#agg_function-uniq) function the `uniqCombined`:
- Consumes several times less memory.
- Calculates with several times higher accuracy.
- Performs slightly lower usually. In some scenarios `uniqCombined` can perform better than `uniq`, for example, with distributed queries that transmit a large number of aggregation states over the network.
**See Also**
- [uniq](#agg_function-uniq)
- [uniqHLL12](#agg_function-uniqhll12)
- [uniqExact](#agg_function-uniqexact)
## uniqHLL12 {#agg_function-uniqhll12}
Calculates the approximate number of different values of the argument, using the [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algortithm.
```
uniqHLL12(x[, ...])
```
**Parameters**
Function takes the variable number of parameters. Parameters can be of types: `Tuple`, `Array`, `Date`, `DateTime`, `String`, numeric types.
**Returned value**
- The number of the [UInt64](../../data_types/int_uint.md) type.
**Implementation details**
Function:
- Calculates a hash for all parameters in the aggregate, then uses it in calculations.
- Uses the HyperLogLog algorithm to approximate the number of different values of the argument.
212 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (up to ~10% error) for small data sets (<10K elements). However, the result is fairly accurate for high-cardinality data sets (10K-100M), with a maximum error of ~1.6%. Starting from 100M, the estimation error increases, and the function will return very inaccurate results for data sets with extremely high cardinality (1B+ elements).
- Provides the determinate result (it doesn't depend on the order of query processing).
We don't recommend using this function. In most cases, use the [uniq](#agg_function-uniq) or [uniqCombined](#agg_function-uniqcombined) function.
**See Also**
- [uniq](#agg_function-uniq)
- [uniqCombined](#agg_function-uniqcombined)
- [uniqExact](#agg_function-uniqexact)
## uniqExact {#agg_function-uniqexact}
Calculates the exact number of different values of the argument.
```
uniqExact(x[, ...])
```
Use the `uniqExact` function if you definitely need an exact result. Otherwise use the [uniq](#agg_function-uniq) function.
The `uniqExact` function uses more memory than the `uniq`, because the size of the state has unbounded growth as the number of different values increases.
**Parameters**
Function takes the variable number of parameters. Parameters can be of types: `Tuple`, `Array`, `Date`, `DateTime`, `String`, numeric types.
**See Also**
- [uniq](#agg_function-uniq)
- [uniqCombined](#agg_function-uniqcombined)
- [uniqHLL12](#agg_function-uniqhll12)
## groupArray(x), groupArray(max_size)(x)
Creates an array of argument values.
Values can be added to the array in any (indeterminate) order.
The second version (with the `max_size` parameter) limits the size of the resulting array to `max_size` elements.
For example, `groupArray (1) (x)` is equivalent to `[any (x)]`.
In some cases, you can still rely on the order of execution. This applies to cases when `SELECT` comes from a subquery that uses `ORDER BY`.
2018-04-23 06:20:21 +00:00
## groupArrayInsertAt(x)
Inserts a value into the array in the specified position.
Accepts the value and position as input. If several values are inserted into the same position, any of them might end up in the resulting array (the first one will be used in the case of single-threaded execution). If no value is inserted into a position, the position is assigned the default value.
Optional parameters:
- The default value for substituting in empty positions.
- The length of the resulting array. This allows you to receive arrays of the same size for all the aggregate keys. When using this parameter, the default value must be specified.
## groupUniqArray(x), groupUniqArray(max_size)(x)
Creates an array from different argument values. Memory consumption is the same as for the `uniqExact` function.
The second version (with the `max_size` parameter) limits the size of the resulting array to `max_size` elements.
2019-06-14 10:29:16 +00:00
For example, `groupUniqArray(1)(x)` is equivalent to `[any(x)]`.
## quantile(level)(x)
Approximates the `level` quantile. `level` is a constant, a floating-point number from 0 to 1.
We recommend using a `level` value in the range of `[0.01, 0.99]`
Don't use a `level` value equal to 0 or 1 use the `min` and `max` functions for these cases.
In this function, as well as in all functions for calculating quantiles, the `level` parameter can be omitted. In this case, it is assumed to be equal to 0.5 (in other words, the function will calculate the median).
Works for numbers, dates, and dates with times.
Returns: for numbers `Float64`; for dates a date; for dates with times a date with time.
2018-02-07 07:33:52 +00:00
Uses [reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) with a reservoir size up to 8192.
If necessary, the result is output with linear approximation from the two neighboring values.
This algorithm provides very low accuracy. See also: `quantileTiming`, `quantileTDigest`, `quantileExact`.
The result depends on the order of running the query, and is nondeterministic.
When using multiple `quantile` (and similar) functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the `quantiles` (and similar) functions.
## quantileDeterministic(level)(x, determinator)
Works the same way as the `quantile` function, but the result is deterministic and does not depend on the order of query execution.
To achieve this, the function takes a second argument the "determinator". This is a number whose hash is used instead of a random number generator in the reservoir sampling algorithm. For the function to work correctly, the same determinator value should not occur too often. For the determinator, you can use an event ID, user ID, and so on.
Don't use this function for calculating timings. There is a more suitable function for this purpose: `quantileTiming`.
## quantileTiming(level)(x)
Computes the quantile of 'level' with a fixed precision.
Works for numbers. Intended for calculating quantiles of page loading time in milliseconds.
If the value is greater than 30,000 (a page loading time of more than 30 seconds), the result is equated to 30,000.
If the total value is not more than about 5670, then the calculation is accurate.
Otherwise:
- if the time is less than 1024 ms, then the calculation is accurate.
- otherwise the calculation is rounded to a multiple of 16 ms.
When passing negative values to the function, the behavior is undefined.
The returned value has the Float32 type. If no values were passed to the function (when using `quantileTimingIf`), 'nan' is returned. The purpose of this is to differentiate these instances from zeros. See the note on sorting NaNs in "ORDER BY clause".
The result is determinate (it doesn't depend on the order of query processing).
For its purpose (calculating quantiles of page loading times), using this function is more effective and the result is more accurate than for the `quantile` function.
## quantileTimingWeighted(level)(x, weight)
Differs from the `quantileTiming` function in that it has a second argument, "weights". Weight is a non-negative integer.
The result is calculated as if the `x` value were passed `weight` number of times to the `quantileTiming` function.
## quantileExact(level)(x)
Computes the quantile of 'level' exactly. To do this, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n) memory, where 'n' is the number of values that were passed. However, for a small number of values, the function is very effective.
## quantileExactWeighted(level)(x, weight)
Computes the quantile of 'level' exactly. In addition, each value is counted with its weight, as if it is present 'weight' times. The arguments of the function can be considered as histograms, where the value 'x' corresponds to a histogram "column" of the height 'weight', and the function itself can be considered as a summation of histograms.
A hash table is used as the algorithm. Because of this, if the passed values are frequently repeated, the function consumes less RAM than `quantileExact`. You can use this function instead of `quantileExact` and specify the weight as 1.
## quantileTDigest(level)(x)
Approximates the quantile level using the [t-digest](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf) algorithm. The maximum error is 1%. Memory consumption by State is proportional to the logarithm of the number of passed values.
The performance of the function is lower than for `quantile` or `quantileTiming`. In terms of the ratio of State size to precision, this function is much better than `quantile`.
The result depends on the order of running the query, and is nondeterministic.
2018-04-23 06:20:21 +00:00
## median(x)
All the quantile functions have corresponding median functions: `median`, `medianDeterministic`, `medianTiming`, `medianTimingWeighted`, `medianExact`, `medianExactWeighted`, `medianTDigest`. They are synonyms and their behavior is identical.
## quantiles(level1, level2, ...)(x)
All the quantile functions also have corresponding quantiles functions: `quantiles`, `quantilesDeterministic`, `quantilesTiming`, `quantilesTimingWeighted`, `quantilesExact`, `quantilesExactWeighted`, `quantilesTDigest`. These functions calculate all the quantiles of the listed levels in one pass, and return an array of the resulting values.
## varSamp(x)
Calculates the amount `Σ((x - x̅)^2) / (n - 1)`, where `n` is the sample size and `x̅`is the average value of `x`.
It represents an unbiased estimate of the variance of a random variable if passed values form its sample.
Returns `Float64`. When `n <= 1`, returns `+∞`.
## varPop(x)
2019-02-14 16:44:46 +00:00
Calculates the amount `Σ((x - x̅)^2) / n`, where `n` is the sample size and `x̅`is the average value of `x`.
In other words, dispersion for a set of values. Returns `Float64`.
## stddevSamp(x)
The result is equal to the square root of `varSamp(x)`.
## stddevPop(x)
The result is equal to the square root of `varPop(x)`.
2018-04-23 06:20:21 +00:00
## topK(N)(column)
Returns an array of the most frequent values in the specified column. The resulting array is sorted in descending order of frequency of values (not by the values themselves).
Implements the [ Filtered Space-Saving](http://www.l2f.inesc-id.pt/~fmmb/wiki/uploads/Work/misnis.ref0a.pdf) algorithm for analyzing TopK, based on the reduce-and-combine algorithm from [Parallel Space Saving](https://arxiv.org/pdf/1401.0702.pdf).
2018-03-25 02:04:22 +00:00
```
topK(N)(column)
```
This function doesn't provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values.
We recommend using the `N < 10 ` value; performance is reduced with large `N` values. Maximum value of ` N = 65536`.
**Arguments**
2018-03-25 02:04:22 +00:00
- 'N' is the number of values.
- ' x ' The column.
**Example**
WIP on docs (#3813) * CLICKHOUSE-4063: less manual html @ index.md * CLICKHOUSE-4063: recommend markdown="1" in README.md * CLICKHOUSE-4003: manually purge custom.css for now * CLICKHOUSE-4064: expand <details> before any print (including to pdf) * CLICKHOUSE-3927: rearrange interfaces/formats.md a bit * CLICKHOUSE-3306: add few http headers * Remove copy-paste introduced in #3392 * Hopefully better chinese fonts #3392 * get rid of tabs @ custom.css * Apply comments and patch from #3384 * Add jdbc.md to ToC and some translation, though it still looks badly incomplete * minor punctuation * Add some backlinks to official website from mirrors that just blindly take markdown sources * Do not make fonts extra light * find . -name '*.md' -type f | xargs -I{} perl -pi -e 's//g' {} * find . -name '*.md' -type f | xargs -I{} perl -pi -e 's/ sql/g' {} * Remove outdated stuff from roadmap.md * Not so light font on front page too * Refactor Chinese formats.md to match recent changes in other languages * Update some links on front page * Remove some outdated comment * Add twitter link to front page * More front page links tuning * Add Amsterdam meetup link * Smaller font to avoid second line * Add Amsterdam link to README.md * Proper docs nav translation * Back to 300 font-weight except Chinese * fix docs build * Update Amsterdam link * remove symlinks * more zh punctuation * apply lost comment by @zhang2014 * Apply comments by @zhang2014 from #3417 * Remove Beijing link * rm incorrect symlink * restore content of docs/zh/operations/table_engines/index.md * CLICKHOUSE-3751: stem terms while searching docs * CLICKHOUSE-3751: use English stemmer in non-English docs too * CLICKHOUSE-4135 fix * Remove past meetup link * Add blog link to top nav * Add ContentSquare article link * Add form link to front page + refactor some texts * couple markup fixes * minor * Introduce basic ODBC driver page in docs * More verbose 3rd party libs disclaimer * Put third-party stuff into a separate folder * Separate third-party stuff in ToC too * Update links * Move stuff that is not really (only) a client library into a separate page * Add clickhouse-hdfs-loader link * Some introduction for "interfaces" section * Rewrite tcp.md * http_interface.md -> http.md * fix link * Remove unconvenient error for now * try to guess anchor instead of failing * remove symlink * Remove outdated info from introduction * remove ru roadmap.md * replace ru roadmap.md with symlink * Update roadmap.md * lost file * Title case in toc_en.yml * Sync "Functions" ToC section with en * Remove reference to pretty old ClickHouse release from docs * couple lost symlinks in fa * Close quote in proper place * Rewrite en/getting_started/index.md * Sync en<>ru getting_started/index.md * minor changes * Some gui.md refactoring * Translate DataGrip section to ru * Translate DataGrip section to zh * Translate DataGrip section to fa * Translate DBeaver section to fa * Translate DBeaver section to zh * Split third-party GUI to open-source and commercial * Mention some RDBMS integrations + ad-hoc translation fixes * Add rel="external nofollow" to outgoing links from docs * Lost blank lines * Fix class name * More rel="external nofollow" * Apply suggestions by @sundy-li * Mobile version of front page improvements * test * test 2 * test 3 * Update LICENSE * minor docs fix * Highlight current article as suggested by @sundy-li * fix link destination * Introduce backup.md (only "en" for now) * Mention INSERT+SELECT in backup.md * Some improvements for replication.md * Add backup.md to toc * Mention clickhouse-backup tool * Mention LightHouse in third-party GUI list * Introduce interfaces/third-party/proxy.md * Add clickhouse-bulk to proxy.md * Major extension of integrations.md contents * fix link target * remove unneeded file * better toc item name * fix markdown * better ru punctuation * Add yet another possible backup approach * Simplify copying permalinks to headers * Support non-eng link anchors in docs + update some deps * Generate anchors for single-page mode automatically * Remove anchors to top of pages * Remove anchors that nobody links to * build fixes * fix few links * restore css * fix some links * restore gifs * fix lost words * more docs fixes * docs fixes * NULL anchor * update urllib3 dependency * more fixes
2018-12-12 17:28:00 +00:00
Take the [OnTime](../../getting_started/example_datasets/ontime.md) data set and select the three most frequently occurring values in the `AirlineID` column.
``` sql
SELECT topK(3)(AirlineID) AS res
FROM ontime
```
```
┌─res─────────────────┐
│ [19393,19790,19805] │
└─────────────────────┘
```
## covarSamp(x, y)
Calculates the value of `Σ((x - x̅)(y - y̅)) / (n - 1)`.
Returns Float64. When `n <= 1`, returns +∞.
## covarPop(x, y)
Calculates the value of `Σ((x - x̅)(y - y̅)) / n`.
## corr(x, y)
Calculates the Pearson correlation coefficient: `Σ((x - x̅)(y - y̅)) / sqrt(Σ((x - x̅)^2) * Σ((y - y̅)^2))`.
2019-05-25 11:18:08 +00:00
## simpleLinearRegression
2019-05-24 09:47:26 +00:00
2019-05-25 11:18:08 +00:00
Performs simple (unidimensional) linear regression.
2019-05-24 09:47:26 +00:00
```
2019-05-25 11:18:08 +00:00
simpleLinearRegression(x, y)
2019-05-24 09:47:26 +00:00
```
Parameters:
- `x` — Column with dependent variable values.
- `y` — Column with explanatory variable values.
2019-05-24 09:47:26 +00:00
2019-05-25 09:56:08 +00:00
Returned values:
2019-05-24 09:47:26 +00:00
Constants `(a, b)` of the resulting line `y = a*x + b`.
2019-05-24 09:47:26 +00:00
**Examples**
```sql
2019-05-25 11:18:08 +00:00
SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [0, 1, 2, 3])
2019-05-24 09:47:26 +00:00
```
```text
2019-05-25 11:18:08 +00:00
┌─arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [0, 1, 2, 3])─┐
│ (1,0) │
└───────────────────────────────────────────────────────────────────┘
2019-05-24 09:47:26 +00:00
```
```sql
2019-05-25 11:18:08 +00:00
SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6])
2019-05-24 09:47:26 +00:00
```
```text
2019-05-25 11:18:08 +00:00
┌─arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6])─┐
│ (1,3) │
└───────────────────────────────────────────────────────────────────┘
2019-05-24 09:47:26 +00:00
```
## stochasticLinearRegression {#agg_functions-stochasticlinearregression}
2019-05-25 15:51:56 +00:00
This function implements stochastic linear regression. It supports custom parameters for learning rate, L2 regularization coefficient, mini-batch size and has few methods for updating weights ([simple SGD](https://en.wikipedia.org/wiki/Stochastic_gradient_descent), [Momentum](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum), [Nesterov](https://mipt.ru/upload/medialibrary/d7e/41-91.pdf)).
2019-05-25 15:51:56 +00:00
### Parameters {#agg_functions-stochasticlinearregression-parameters}
2019-05-25 15:51:56 +00:00
There are 4 customizable parameters. They are passed to the function sequentially, but there is no need to pass all four - default values will be used, however good model required some parameter tuning.
```text
stochasticLinearRegression(1.0, 1.0, 10, 'SGD')
2019-05-25 15:51:56 +00:00
```
2019-05-30 22:48:44 +00:00
1. `learning rate` is the coefficient on step length, when gradient descent step is performed. Too big learning rate may cause infinite weights of the model. Default is `0.00001`.
2. `l2 regularization coefficient` which may help to prevent overfitting. Default is `0.1`.
3. `mini-batch size` sets the number of elements, which gradients will be computed and summed to perform one step of gradient descent. Pure stochastic descent uses one element, however having small batches(about 10 elements) make gradient steps more stable. Default is `15`.
2019-05-25 15:51:56 +00:00
4. `method for updating weights`, there are 3 of them: `SGD`, `Momentum`, `Nesterov`. `Momentum` and `Nesterov` require little bit more computations and memory, however they happen to be useful in terms of speed of convergance and stability of stochastic gradient methods. Default is `'SGD'`.
### Usage {#agg_functions-stochasticlinearregression-usage}
2019-05-25 15:51:56 +00:00
`stochasticLinearRegression` is used in two steps: fitting the model and predicting on new data. In order to fit the model and save its state for later usage we use `-State` combinator, which basically saves the state (model weights, etc).
To predict we use function [evalMLMethod](../functions/machine_learning_functions.md#machine_learning_methods-evalmlmethod), which takes a state as an argument as well as features to predict on.
2019-05-25 15:51:56 +00:00
<a name="stochasticlinearregression-usage-fitting"></a>
1. Fitting
2019-05-25 15:51:56 +00:00
Such query may be used.
2019-05-25 15:51:56 +00:00
```sql
CREATE TABLE IF NOT EXISTS train_data
(
param1 Float64,
param2 Float64,
target Float64
) ENGINE = Memory;
2019-05-25 15:51:56 +00:00
CREATE TABLE your_model ENGINE = Memory AS SELECT
stochasticLinearRegressionState(0.1, 0.0, 5, 'SGD')(target, param1, param2)
2019-05-25 15:51:56 +00:00
AS state FROM train_data;
2019-05-25 15:51:56 +00:00
```
2019-05-25 15:51:56 +00:00
Here we also need to insert data into `train_data` table. The number of parameters is not fixed, it depends only on number of arguments, passed into `linearRegressionState`. They all must be numeric values.
Note that the column with target value(which we would like to learn to predict) is inserted as the first argument.
2. Predicting
2019-05-25 15:51:56 +00:00
After saving a state into the table, we may use it multiple times for prediction, or even merge with other states and create new even better models.
2019-05-25 15:51:56 +00:00
```sql
WITH (SELECT state FROM your_model) AS model SELECT
evalMLMethod(model, param1, param2) FROM test_data
```
2019-05-25 15:51:56 +00:00
The query will return a column of predicted values. Note that first argument of `evalMLMethod` is `AggregateFunctionState` object, next are columns of features.
2019-05-25 15:51:56 +00:00
`test_data` is a table like `train_data` but may not contain target value.
### Notes {#agg_functions-stochasticlinearregression-notes}
2019-05-25 15:51:56 +00:00
1. To merge two models user may create such query:
```sql
SELECT state1 + state2 FROM your_models
```
where `your_models` table contains both models. This query will return new `AggregateFunctionState` object.
2019-05-25 15:51:56 +00:00
2. User may fetch weights of the created model for its own purposes without saving the model if no `-State` combinator is used.
```sql
SELECT stochasticLinearRegression(0.01)(target, param1, param2) FROM train_data
2019-05-25 15:51:56 +00:00
```
Such query will fit the model and return its weights - first are weights, which correspond to the parameters of the model, the last one is bias. So in the example above the query will return a column with 3 values.
2019-05-25 15:51:56 +00:00
**See Also**
- [stochasticLogisticRegression](#agg_functions-stochasticlogisticregression)
- [Difference between linear and logistic regressions](https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression)
2019-05-28 18:59:18 +00:00
## stochasticLogisticRegression {#agg_functions-stochasticlogisticregression}
2019-05-28 18:59:18 +00:00
This function implements stochastic logistic regression. It can be used for binary classification problem, supports the same custom parameters as stochasticLinearRegression and works the same way.
2019-05-28 18:59:18 +00:00
### Parameters {#agg_functions-stochasticlogisticregression-parameters}
2019-05-28 18:59:18 +00:00
Parameters are exactly the same as in stochasticLinearRegression:
`learning rate`, `l2 regularization coefficient`, `mini-batch size`, `method for updating weights`.
For more information see [parameters](#agg_functions-stochasticlinearregression-parameters).
2019-05-28 18:59:18 +00:00
```text
stochasticLogisticRegression(1.0, 1.0, 10, 'SGD')
2019-05-28 18:59:18 +00:00
```
1. Fitting
See the `Fitting` section in the [stochasticLinearRegression](#stochasticlinearregression-usage-fitting) description.
2019-05-28 18:59:18 +00:00
Predicted labels have to be in [-1, 1].
2. Predicting
Using saved state we can predict probability of object having label `1`.
2019-05-28 18:59:18 +00:00
```sql
WITH (SELECT state FROM your_model) AS model SELECT
evalMLMethod(model, param1, param2) FROM test_data
```
2019-05-28 18:59:18 +00:00
The query will return a column of probabilities. Note that first argument of `evalMLMethod` is `AggregateFunctionState` object, next are columns of features.
We can also set a bound of probability, which assigns elements to different labels.
2019-05-28 18:59:18 +00:00
```sql
SELECT ans < 1.1 AND ans > 0.5 FROM
(WITH (SELECT state FROM your_model) AS model SELECT
evalMLMethod(model, param1, param2) AS ans FROM test_data)
```
Then the result will be labels.
2019-05-28 18:59:18 +00:00
`test_data` is a table like `train_data` but may not contain target value.
**See Also**
- [stochasticLinearRegression](#agg_functions-stochasticlinearregression)
- [Difference between linear and logistic regressions.](https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression)
[Original article](https://clickhouse.yandex/docs/en/query_language/agg_functions/reference/) <!--hide-->