---
toc_priority: 47
toc_title: OPTIMIZE
---
# OPTIMIZE Statement {#misc_operations-optimize}
This query tries to initiate an unscheduled merge of data parts for a table.
!!! warning "Warning"
    `OPTIMIZE` can't fix the `Too many parts` error.
**Syntax**
``` sql
OPTIMIZE TABLE [db.]name [ON CLUSTER cluster] [PARTITION partition | PARTITION ID 'partition_id'] [FINAL] [DEDUPLICATE [BY expression]]
```
The `OPTIMIZE` query is supported for the [MergeTree](../../engines/table-engines/mergetree-family/mergetree.md) family, the [MaterializedView](../../engines/table-engines/special/materializedview.md) engine, and the [Buffer](../../engines/table-engines/special/buffer.md) engine. Other table engines aren't supported.
When `OPTIMIZE` is used with the [ReplicatedMergeTree](../../engines/table-engines/mergetree-family/replication.md) family of table engines, ClickHouse creates a task for merging and waits for execution on all nodes (if the `replication_alter_partitions_sync` setting is enabled).
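
As a rough sketch, waiting for replicas could be requested per session as shown here. The table name `replicated_events` is hypothetical. The meaning of the value `2` (wait for all replicas) is an assumption about this setting that should be checked against the settings reference for your ClickHouse version.

``` sql
-- Assumption: 0 = do not wait, 1 = wait for own execution, 2 = wait for all replicas.
SET replication_alter_partitions_sync = 2;

-- `replicated_events` is a hypothetical ReplicatedMergeTree table.
OPTIMIZE TABLE replicated_events FINAL;
```
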
- If `OPTIMIZE` does not perform a merge for any reason, it does not notify the client. To enable notifications, use the [optimize_throw_if_noop](../../operations/settings/settings.md#setting-optimize_throw_if_noop) setting.
- If you specify a `PARTITION`, only the specified partition is optimized. [How to set partition expression](../../sql-reference/statements/alter/index.md#alter-how-to-specify-part-expr).
- If you specify `FINAL`, optimization is performed even when all the data is already in one part. The merge is also forced even if concurrent merges are being performed.
- If you specify `DEDUPLICATE`, completely identical rows are deduplicated: all columns are compared unless a `BY` clause is specified. This makes sense only for the MergeTree engine.
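
The options above can be combined as in the following minimal sketch. The table `visits` and its monthly partitioning (`PARTITION BY toYYYYMM(event_date)`) are hypothetical and serve only to make the partition values concrete.

``` sql
-- Report an exception to the client if no merge is performed.
SET optimize_throw_if_noop = 1;

-- Merge only one partition, addressed by its partition expression value.
OPTIMIZE TABLE visits PARTITION 202105;

-- The same partition addressed by its string ID, forcing a merge
-- even if the partition already consists of a single part.
OPTIMIZE TABLE visits PARTITION ID '202105' FINAL;
```

With `optimize_throw_if_noop` left at its default, the first `OPTIMIZE` could return without having merged anything and the client would not be told.
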
## BY expression {#by-expression}
If you want to perform deduplication on a custom set of columns rather than on all of them, you can specify the list of columns explicitly or use any combination of [`*`](../../sql-reference/statements/select/index.md#asterisk), [`COLUMNS`](../../sql-reference/statements/select/index.md#columns-expression) or [`EXCEPT`](../../sql-reference/statements/select/index.md#except-modifier) expressions. The explicitly written or implicitly expanded list of columns must include all columns specified in the row ordering expression (both primary and sorting keys) and in the partitioning expression (partitioning key).
!!! note "Note"
    Note that `*` behaves just like in `SELECT`: [MATERIALIZED](../../sql-reference/statements/create/table.md#materialized) and [ALIAS](../../sql-reference/statements/create/table.md#alias) columns are not used for expansion.
It is also an error to specify an empty list of columns, to write an expression that results in an empty list of columns, or to deduplicate by an `ALIAS` column.
**Syntax**
``` sql
OPTIMIZE TABLE table DEDUPLICATE; -- all columns
OPTIMIZE TABLE table DEDUPLICATE BY *; -- excludes MATERIALIZED and ALIAS columns
OPTIMIZE TABLE table DEDUPLICATE BY colX,colY,colZ;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);
```
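
The constraints on the `BY` list can be illustrated with a small sketch. The table `dedup_demo` and its column names are hypothetical. The statements expected to be rejected are commented out so that the script runs as a whole, and the exact error messages are not shown because they depend on the server version.

``` sql
-- Hypothetical table used only for this illustration.
CREATE TABLE dedup_demo (
    pk Int32,
    sk Int32,
    val UInt32,
    part UInt32,
    al UInt32 ALIAS 2
) ENGINE = MergeTree
PARTITION BY part
ORDER BY (pk, sk);

-- Accepted: the list covers the sorting key (pk, sk) and the partitioning key (part).
OPTIMIZE TABLE dedup_demo DEDUPLICATE BY pk, sk, part;

-- Expected to be rejected: the sorting key column sk is missing from the list.
-- OPTIMIZE TABLE dedup_demo DEDUPLICATE BY pk, part;

-- Expected to be rejected: the expression expands to an empty list of columns.
-- OPTIMIZE TABLE dedup_demo DEDUPLICATE BY * EXCEPT (pk, sk, val, part);

-- Expected to be rejected: al is an ALIAS column.
-- OPTIMIZE TABLE dedup_demo DEDUPLICATE BY pk, sk, part, al;
```
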
**Examples**
Consider the table:
``` sql
CREATE TABLE example (
    primary_key Int32,
    secondary_key Int32,
    value UInt32,
    partition_key UInt32,
    materialized_value UInt32 MATERIALIZED 12345,
    aliased_value UInt32 ALIAS 2,
    PRIMARY KEY primary_key
) ENGINE=MergeTree
PARTITION BY partition_key
ORDER BY (primary_key, secondary_key);
```
``` sql
INSERT INTO example (primary_key, secondary_key, value, partition_key)
VALUES (0, 0, 0, 0), (0, 0, 0, 0), (1, 1, 2, 2), (1, 1, 2, 3), (1, 1, 3, 3);
```
``` sql
SELECT * FROM example;
```
Result:
```
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           0 │             0 │     0 │             0 │
│           0 │             0 │     0 │             0 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             2 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             3 │
│           1 │             1 │     3 │             3 │
└─────────────┴───────────────┴───────┴───────────────┘
```
When columns for deduplication are not specified, all of them are taken into account. A row is removed only if all values in all columns are equal to the corresponding values in the previous row:
``` sql
OPTIMIZE TABLE example FINAL DEDUPLICATE;
```
``` sql
SELECT * FROM example;
```
Result:
```
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             2 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           0 │             0 │     0 │             0 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             3 │
│           1 │             1 │     3 │             3 │
└─────────────┴───────────────┴───────┴───────────────┘
```
When columns are specified implicitly, the table is deduplicated by all columns that are not `ALIAS` or `MATERIALIZED`. Considering the table above, these are `primary_key`, `secondary_key`, `value`, and `partition_key` columns:
```sql
OPTIMIZE TABLE example FINAL DEDUPLICATE BY *;
```
``` sql
SELECT * FROM example;
```
Result:
```
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             2 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           0 │             0 │     0 │             0 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             3 │
│           1 │             1 │     3 │             3 │
└─────────────┴───────────────┴───────┴───────────────┘
```
Deduplicate by all columns that are not `ALIAS` or `MATERIALIZED` and are explicitly not `value`, that is, by the `primary_key`, `secondary_key`, and `partition_key` columns:
``` sql
OPTIMIZE TABLE example FINAL DEDUPLICATE BY * EXCEPT value;
```
``` sql
SELECT * FROM example;
```
Result:
```
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             2 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           0 │             0 │     0 │             0 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             3 │
└─────────────┴───────────────┴───────┴───────────────┘
```
Deduplicate explicitly by `primary_key`, `secondary_key`, and `partition_key` columns:
```sql
OPTIMIZE TABLE example FINAL DEDUPLICATE BY primary_key, secondary_key, partition_key;
```
``` sql
SELECT * FROM example;
```
Result:
```
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             2 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           0 │             0 │     0 │             0 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             3 │
└─────────────┴───────────────┴───────┴───────────────┘
```
Deduplicate by any column matching a regex: `primary_key`, `secondary_key`, and `partition_key` columns:
```sql
OPTIMIZE TABLE example FINAL DEDUPLICATE BY COLUMNS('.*_key');
```
``` sql
SELECT * FROM example;
```
Result:
```
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           0 │             0 │     0 │             0 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             2 │
└─────────────┴───────────────┴───────┴───────────────┘
┌─primary_key─┬─secondary_key─┬─value─┬─partition_key─┐
│           1 │             1 │     2 │             3 │
└─────────────┴───────────────┴───────┴───────────────┘
```
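
One way to check whether any duplicates remain on the chosen columns is an aggregate query over the same column list. This is only a sketch against the `example` table above; the alias `duplicates` is introduced here for illustration.

``` sql
-- List key combinations that still occur more than once.
SELECT
    primary_key,
    secondary_key,
    partition_key,
    count() AS duplicates
FROM example
GROUP BY primary_key, secondary_key, partition_key
HAVING duplicates > 1;
```

After a successful deduplication by these three columns, this query would be expected to return no rows.
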