ClickHouse/docs/en/sql-reference/window-functions/index.md
Robert Schulze cdf28f9b71
Minor fixups
2023-04-19 16:16:51 +00:00

597 lines
28 KiB
Markdown

---
slug: /en/sql-reference/window-functions/
sidebar_position: 62
sidebar_label: Window Functions
title: Window Functions
---
ClickHouse supports the standard grammar for defining windows and window functions. The following features are currently supported:
| Feature | Support or workaround |
|------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ad hoc window specification (`count(*) over (partition by id order by time desc)`) | supported |
| expressions involving window functions, e.g. `(count(*) over ()) / 2)` | supported |
| `WINDOW` clause (`select ... from table window w as (partition by id)`) | supported |
| `ROWS` frame | supported |
| `RANGE` frame | supported, the default |
| `INTERVAL` syntax for `DateTime` `RANGE OFFSET` frame | not supported, specify the number of seconds instead (`RANGE` works with any numeric type). |
| `GROUPS` frame | not supported |
| Calculating aggregate functions over a frame (`sum(value) over (order by time)`) | all aggregate functions are supported |
| `rank()`, `dense_rank()`, `row_number()` | supported |
| `lag/lead(value, offset)` | Not supported. Workarounds: |
| | 1) replace with `any(value) over (.... rows between <offset> preceding and <offset> preceding)`, or `following` for `lead` |
| | 2) use `lagInFrame/leadInFrame`, which are analogous, but respect the window frame. To get behavior identical to `lag/lead`, use `rows between unbounded preceding and unbounded following` |
| ntile(buckets) | Supported. Specify window like, (partition by x order by y rows between unbounded preceding and unounded following). |
## ClickHouse-specific Window Functions
### nonNegativeDerivative(metric_column, timestamp_column[, INTERVAL X UNITS])
Finds non-negative derivative for given `metric_column` by `timestamp_column`.
`INTERVAL` can be omitted, default is `INTERVAL 1 SECOND`.
The computed value is the following for each row:
- `0` for 1st row,
- ${metric_i - metric_{i-1} \over timestamp_i - timestamp_{i-1}} * interval$ for $i_th$ row.
## References
### GitHub Issues
The roadmap for the initial support of window functions is [in this issue](https://github.com/ClickHouse/ClickHouse/issues/18097).
All GitHub issues related to window funtions have the [comp-window-functions](https://github.com/ClickHouse/ClickHouse/labels/comp-window-functions) tag.
### Tests
These tests contain the examples of the currently supported grammar:
https://github.com/ClickHouse/ClickHouse/blob/master/tests/performance/window_functions.xml
https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/01591_window_functions.sql
### Postgres Docs
https://www.postgresql.org/docs/current/sql-select.html#SQL-WINDOW
https://www.postgresql.org/docs/devel/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
https://www.postgresql.org/docs/devel/functions-window.html
https://www.postgresql.org/docs/devel/tutorial-window.html
### MySQL Docs
https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html
https://dev.mysql.com/doc/refman/8.0/en/window-functions-usage.html
https://dev.mysql.com/doc/refman/8.0/en/window-functions-frames.html
## Syntax
```text
aggregate_function (column_name)
OVER ([[PARTITION BY grouping_column] [ORDER BY sorting_column]
[ROWS or RANGE expression_to_bound_rows_withing_the_group]] | [window_name])
FROM table_name
WINDOW window_name as ([[PARTITION BY grouping_column] [ORDER BY sorting_column])
```
- `PARTITION BY` - defines how to break a resultset into groups.
- `ORDER BY` - defines how to order rows inside the group during calculation aggregate_function.
- `ROWS or RANGE` - defines bounds of a frame, aggregate_function is calculated within a frame.
- `WINDOW` - allows to reuse a window definition with multiple exressions.
### Functions
These functions can be used only as a window function.
`row_number()` - Number the current row within its partition starting from 1.
`first_value(x)` - Return the first non-NULL value evaluated within its ordered frame.
`last_value(x)` - Return the last non-NULL value evaluated within its ordered frame.
`nth_value(x, offset)` - Return the first non-NULL value evaluated against the nth row (offset) in its ordered frame.
`rank()` - Rank the current row within its partition with gaps.
`dense_rank()` - Rank the current row within its partition without gaps.
`lagInFrame(x)` - Return a value evaluated at the row that is at a specified physical offset row before the current row within the ordered frame.
`leadInFrame(x)` - Return a value evaluated at the row that is offset rows after the current row within the ordered frame.
```text
PARTITION
┌─────────────────┐ <-- UNBOUNDED PRECEDING (BEGINNING of the PARTITION)
│ │
│ │
│=================│ <-- N PRECEDING <─┐
│ N ROWS │ │ F
│ Before CURRENT │ │ R
│~~~~~~~~~~~~~~~~~│ <-- CURRENT ROW │ A
│ M ROWS │ │ M
│ After CURRENT │ │ E
│=================│ <-- M FOLLOWING <─┘
│ │
│ │
└─────────────────┘ <--- UNBOUNDED FOLLOWING (END of the PARTITION)
```
## Examples
```sql
CREATE TABLE wf_partition
(
`part_key` UInt64,
`value` UInt64,
`order` UInt64
)
ENGINE = Memory;
INSERT INTO wf_partition FORMAT Values
(1,1,1), (1,2,2), (1,3,3), (2,0,0), (3,0,0);
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key) AS frame_values
FROM wf_partition
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
1 1 1 [1,2,3] <
1 2 2 [1,2,3] 1-st group
1 3 3 [1,2,3] <
2 0 0 [0] <- 2-nd group
3 0 0 [0] <- 3-d group
└──────────┴───────┴───────┴──────────────┘
```
```sql
CREATE TABLE wf_frame
(
`part_key` UInt64,
`value` UInt64,
`order` UInt64
)
ENGINE = Memory;
INSERT INTO wf_frame FORMAT Values
(1,1,1), (1,2,2), (1,3,3), (1,4,4), (1,5,5);
-- frame is bounded by bounds of a partition (BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key ORDER BY order ASC
Rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
1 1 1 [1,2,3,4,5]
1 2 2 [1,2,3,4,5]
1 3 3 [1,2,3,4,5]
1 4 4 [1,2,3,4,5]
1 5 5 [1,2,3,4,5]
└──────────┴───────┴───────┴──────────────┘
-- short form - no bound expression, no order by
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
1 1 1 [1,2,3,4,5]
1 2 2 [1,2,3,4,5]
1 3 3 [1,2,3,4,5]
1 4 4 [1,2,3,4,5]
1 5 5 [1,2,3,4,5]
└──────────┴───────┴───────┴──────────────┘
-- frame is bounded by the beggining of a partition and the current row
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key ORDER BY order ASC
Rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
1 1 1 [1]
1 2 2 [1,2]
1 3 3 [1,2,3]
1 4 4 [1,2,3,4]
1 5 5 [1,2,3,4,5]
└──────────┴───────┴───────┴──────────────┘
-- short form (frame is bounded by the beggining of a partition and the current row)
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key ORDER BY order ASC) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
1 1 1 [1]
1 2 2 [1,2]
1 3 3 [1,2,3]
1 4 4 [1,2,3,4]
1 5 5 [1,2,3,4,5]
└──────────┴───────┴───────┴──────────────┘
-- frame is bounded by the beggining of a partition and the current row, but order is backward
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key ORDER BY order DESC) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
1 1 1 [5,4,3,2,1]
1 2 2 [5,4,3,2]
1 3 3 [5,4,3]
1 4 4 [5,4]
1 5 5 [5]
└──────────┴───────┴───────┴──────────────┘
-- sliding frame - 1 PRECEDING ROW AND CURRENT ROW
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key ORDER BY order ASC
Rows BETWEEN 1 PRECEDING AND CURRENT ROW) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
1 1 1 [1]
1 2 2 [1,2]
1 3 3 [2,3]
1 4 4 [3,4]
1 5 5 [4,5]
└──────────┴───────┴───────┴──────────────┘
-- sliding frame - Rows BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING
SELECT
part_key,
value,
order,
groupArray(value) OVER (PARTITION BY part_key ORDER BY order ASC
Rows BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) AS frame_values
FROM wf_frame
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┐
1 1 1 [1,2,3,4,5]
1 2 2 [1,2,3,4,5]
1 3 3 [2,3,4,5]
1 4 4 [3,4,5]
1 5 5 [4,5]
└──────────┴───────┴───────┴──────────────┘
-- row_number does not respect the frame, so rn_1 = rn_2 = rn_3 != rn_4
SELECT
part_key,
value,
order,
groupArray(value) OVER w1 AS frame_values,
row_number() OVER w1 AS rn_1,
sum(1) OVER w1 AS rn_2,
row_number() OVER w2 AS rn_3,
sum(1) OVER w2 AS rn_4
FROM wf_frame
WINDOW
w1 AS (PARTITION BY part_key ORDER BY order DESC),
w2 AS (PARTITION BY part_key ORDER BY order DESC
Rows BETWEEN 1 PRECEDING AND CURRENT ROW)
ORDER BY
part_key ASC,
value ASC;
┌─part_key─┬─value─┬─order─┬─frame_values─┬─rn_1─┬─rn_2─┬─rn_3─┬─rn_4─┐
1 1 1 [5,4,3,2,1] 5 5 5 2
1 2 2 [5,4,3,2] 4 4 4 2
1 3 3 [5,4,3] 3 3 3 2
1 4 4 [5,4] 2 2 2 2
1 5 5 [5] 1 1 1 1
└──────────┴───────┴───────┴──────────────┴──────┴──────┴──────┴──────┘
-- first_value and last_value respect the frame
SELECT
groupArray(value) OVER w1 AS frame_values_1,
first_value(value) OVER w1 AS first_value_1,
last_value(value) OVER w1 AS last_value_1,
groupArray(value) OVER w2 AS frame_values_2,
first_value(value) OVER w2 AS first_value_2,
last_value(value) OVER w2 AS last_value_2
FROM wf_frame
WINDOW
w1 AS (PARTITION BY part_key ORDER BY order ASC),
w2 AS (PARTITION BY part_key ORDER BY order ASC Rows BETWEEN 1 PRECEDING AND CURRENT ROW)
ORDER BY
part_key ASC,
value ASC;
┌─frame_values_1─┬─first_value_1─┬─last_value_1─┬─frame_values_2─┬─first_value_2─┬─last_value_2─┐
[1] 1 1 [1] 1 1
[1,2] 1 2 [1,2] 1 2
[1,2,3] 1 3 [2,3] 2 3
[1,2,3,4] 1 4 [3,4] 3 4
[1,2,3,4,5] 1 5 [4,5] 4 5
└────────────────┴───────────────┴──────────────┴────────────────┴───────────────┴──────────────┘
-- second value within the frame
SELECT
groupArray(value) OVER w1 AS frame_values_1,
nth_value(value, 2) OVER w1 AS second_value
FROM wf_frame
WINDOW w1 AS (PARTITION BY part_key ORDER BY order ASC Rows BETWEEN 3 PRECEDING AND CURRENT ROW)
ORDER BY
part_key ASC,
value ASC
┌─frame_values_1─┬─second_value─┐
[1] 0
[1,2] 2
[1,2,3] 2
[1,2,3,4] 2
[2,3,4,5] 3
└────────────────┴──────────────┘
-- second value within the frame + Null for missing values
SELECT
groupArray(value) OVER w1 AS frame_values_1,
nth_value(toNullable(value), 2) OVER w1 AS second_value
FROM wf_frame
WINDOW w1 AS (PARTITION BY part_key ORDER BY order ASC Rows BETWEEN 3 PRECEDING AND CURRENT ROW)
ORDER BY
part_key ASC,
value ASC
┌─frame_values_1─┬─second_value─┐
[1] ᴺᵁᴸᴸ
[1,2] 2
[1,2,3] 2
[1,2,3,4] 2
[2,3,4,5] 3
└────────────────┴──────────────┘
```
## Real world examples
### Maximum/total salary per department.
```sql
CREATE TABLE employees
(
`department` String,
`employee_name` String,
`salary` Float
)
ENGINE = Memory;
INSERT INTO employees FORMAT Values
('Finance', 'Jonh', 200),
('Finance', 'Joan', 210),
('Finance', 'Jean', 505),
('IT', 'Tim', 200),
('IT', 'Anna', 300),
('IT', 'Elen', 500);
SELECT
department,
employee_name AS emp,
salary,
max_salary_per_dep,
total_salary_per_dep,
round((salary / total_salary_per_dep) * 100, 2) AS `share_per_dep(%)`
FROM
(
SELECT
department,
employee_name,
salary,
max(salary) OVER wndw AS max_salary_per_dep,
sum(salary) OVER wndw AS total_salary_per_dep
FROM employees
WINDOW wndw AS (PARTITION BY department
rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY
department ASC,
employee_name ASC
);
┌─department─┬─emp──┬─salary─┬─max_salary_per_dep─┬─total_salary_per_dep─┬─share_per_dep(%)─┐
Finance Jean 505 505 915 55.19
Finance Joan 210 505 915 22.95
Finance Jonh 200 505 915 21.86
IT Anna 300 500 1000 30
IT Elen 500 500 1000 50
IT Tim 200 500 1000 20
└────────────┴──────┴────────┴────────────────────┴──────────────────────┴──────────────────┘
```
### Cumulative sum.
```sql
CREATE TABLE warehouse
(
`item` String,
`ts` DateTime,
`value` Float
)
ENGINE = Memory
INSERT INTO warehouse VALUES
('sku38', '2020-01-01', 9),
('sku38', '2020-02-01', 1),
('sku38', '2020-03-01', -4),
('sku1', '2020-01-01', 1),
('sku1', '2020-02-01', 1),
('sku1', '2020-03-01', 1);
SELECT
item,
ts,
value,
sum(value) OVER (PARTITION BY item ORDER BY ts ASC) AS stock_balance
FROM warehouse
ORDER BY
item ASC,
ts ASC;
┌─item──┬──────────────────ts─┬─value─┬─stock_balance─┐
sku1 2020-01-01 00:00:00 1 1
sku1 2020-02-01 00:00:00 1 2
sku1 2020-03-01 00:00:00 1 3
sku38 2020-01-01 00:00:00 9 9
sku38 2020-02-01 00:00:00 1 10
sku38 2020-03-01 00:00:00 -4 6
└───────┴─────────────────────┴───────┴───────────────┘
```
### Moving / Sliding Average (per 3 rows)
```sql
CREATE TABLE sensors
(
`metric` String,
`ts` DateTime,
`value` Float
)
ENGINE = Memory;
insert into sensors values('cpu_temp', '2020-01-01 00:00:00', 87),
('cpu_temp', '2020-01-01 00:00:01', 77),
('cpu_temp', '2020-01-01 00:00:02', 93),
('cpu_temp', '2020-01-01 00:00:03', 87),
('cpu_temp', '2020-01-01 00:00:04', 87),
('cpu_temp', '2020-01-01 00:00:05', 87),
('cpu_temp', '2020-01-01 00:00:06', 87),
('cpu_temp', '2020-01-01 00:00:07', 87);
SELECT
metric,
ts,
value,
avg(value) OVER
(PARTITION BY metric ORDER BY ts ASC Rows BETWEEN 2 PRECEDING AND CURRENT ROW)
AS moving_avg_temp
FROM sensors
ORDER BY
metric ASC,
ts ASC;
┌─metric───┬──────────────────ts─┬─value─┬───moving_avg_temp─┐
cpu_temp 2020-01-01 00:00:00 87 87
cpu_temp 2020-01-01 00:00:01 77 82
cpu_temp 2020-01-01 00:00:02 93 85.66666666666667
cpu_temp 2020-01-01 00:00:03 87 85.66666666666667
cpu_temp 2020-01-01 00:00:04 87 89
cpu_temp 2020-01-01 00:00:05 87 87
cpu_temp 2020-01-01 00:00:06 87 87
cpu_temp 2020-01-01 00:00:07 87 87
└──────────┴─────────────────────┴───────┴───────────────────┘
```
### Moving / Sliding Average (per 10 seconds)
```sql
SELECT
metric,
ts,
value,
avg(value) OVER (PARTITION BY metric ORDER BY ts
Range BETWEEN 10 PRECEDING AND CURRENT ROW) AS moving_avg_10_seconds_temp
FROM sensors
ORDER BY
metric ASC,
ts ASC;
┌─metric───┬──────────────────ts─┬─value─┬─moving_avg_10_seconds_temp─┐
cpu_temp 2020-01-01 00:00:00 87 87
cpu_temp 2020-01-01 00:01:10 77 77
cpu_temp 2020-01-01 00:02:20 93 93
cpu_temp 2020-01-01 00:03:30 87 87
cpu_temp 2020-01-01 00:04:40 87 87
cpu_temp 2020-01-01 00:05:50 87 87
cpu_temp 2020-01-01 00:06:00 87 87
cpu_temp 2020-01-01 00:07:10 87 87
└──────────┴─────────────────────┴───────┴────────────────────────────┘
```
### Moving / Sliding Average (per 10 days)
Temperature is stored with second precision, but using `Range` and `ORDER BY toDate(ts)` we form a frame with the size of 10 units, and because of `toDate(ts)` the unit is a day.
```sql
CREATE TABLE sensors
(
`metric` String,
`ts` DateTime,
`value` Float
)
ENGINE = Memory;
insert into sensors values('ambient_temp', '2020-01-01 00:00:00', 16),
('ambient_temp', '2020-01-01 12:00:00', 16),
('ambient_temp', '2020-01-02 11:00:00', 9),
('ambient_temp', '2020-01-02 12:00:00', 9),
('ambient_temp', '2020-02-01 10:00:00', 10),
('ambient_temp', '2020-02-01 12:00:00', 10),
('ambient_temp', '2020-02-10 12:00:00', 12),
('ambient_temp', '2020-02-10 13:00:00', 12),
('ambient_temp', '2020-02-20 12:00:01', 16),
('ambient_temp', '2020-03-01 12:00:00', 16),
('ambient_temp', '2020-03-01 12:00:00', 16),
('ambient_temp', '2020-03-01 12:00:00', 16);
SELECT
metric,
ts,
value,
round(avg(value) OVER (PARTITION BY metric ORDER BY toDate(ts)
Range BETWEEN 10 PRECEDING AND CURRENT ROW),2) AS moving_avg_10_days_temp
FROM sensors
ORDER BY
metric ASC,
ts ASC;
┌─metric───────┬──────────────────ts─┬─value─┬─moving_avg_10_days_temp─┐
ambient_temp 2020-01-01 00:00:00 16 16
ambient_temp 2020-01-01 12:00:00 16 16
ambient_temp 2020-01-02 11:00:00 9 12.5
ambient_temp 2020-01-02 12:00:00 9 12.5
ambient_temp 2020-02-01 10:00:00 10 10
ambient_temp 2020-02-01 12:00:00 10 10
ambient_temp 2020-02-10 12:00:00 12 11
ambient_temp 2020-02-10 13:00:00 12 11
ambient_temp 2020-02-20 12:00:01 16 13.33
ambient_temp 2020-03-01 12:00:00 16 16
ambient_temp 2020-03-01 12:00:00 16 16
ambient_temp 2020-03-01 12:00:00 16 16
└──────────────┴─────────────────────┴───────┴─────────────────────────┘
```
## Related Content
- Blog: [Working with time series data in ClickHouse](https://clickhouse.com/blog/working-with-time-series-data-and-functions-ClickHouse)
- Blog: [Window and array functions for Git commit sequences](https://clickhouse.com/blog/clickhouse-window-array-functions-git-commits)
- Blog: [Getting Data Into ClickHouse - Part 3 - Using S3](https://clickhouse.com/blog/getting-data-into-clickhouse-part-3-s3)