ClickHouse/docs/en/sql-reference/statements/optimize.md

4.6 KiB
Raw Blame History

toc_priority toc_title
47 OPTIMIZE

OPTIMIZE Statement

OPTIMIZE TABLE [db.]name [ON CLUSTER cluster] [PARTITION partition | PARTITION ID 'partition_id'] [FINAL] [DEDUPLICATE [BY expression]]

This query tries to initialize an unscheduled merge of data parts for tables with a table engine from the MergeTree family.

The OPTMIZE query is also supported for the MaterializedView and the Buffer engines. Other table engines arent supported.

When OPTIMIZE is used with the ReplicatedMergeTree family of table engines, ClickHouse creates a task for merging and waits for execution on all nodes (if the replication_alter_partitions_sync setting is enabled).

  • If OPTIMIZE doesnt perform a merge for any reason, it doesnt notify the client. To enable notifications, use the optimize_throw_if_noop setting.
  • If you specify a PARTITION, only the specified partition is optimized. How to set partition expression.
  • If you specify FINAL, optimization is performed even when all the data is already in one part. Also merge is forced even if concurrent merges are performed.
  • If you specify DEDUPLICATE, then completely identical rows (unless by-clause is specified) will be deduplicated (all columns are compared), it makes sense only for the MergeTree engine.

BY expression

If you want to perform deduplication on custom set of columns rather than on all, you can specify list of columns explicitly or use any combination of *, COLUMNS or EXCEPT expressions. The explictly written or implicitly expanded list of columns must include all columns specified in row ordering expression (both primary and sorting keys) and partitioning expression (partitioning key).

Note that * behaves just like in SELECT: MATERIALIZED, and ALIAS columns are not used for expansion. Also, it is an error to specify empty list of columns, or write an expression that results in an empty list of columns, or deduplicate by an ALIAS column.

OPTIMIZE TABLE table DEDUPLICATE; -- the old one
OPTIMIZE TABLE table DEDUPLICATE BY *; -- not the same as the old one, excludes MATERIALIZED columns (see the note above)
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);

Example:

A silly synthetic table.

CREATE TABLE example (
    primary_key Int32,
    secondary_key Int32,
    value UInt32,
    partition_key UInt32,
    materialized_value UInt32 MATERIALIZED 12345,
    aliased_value UInt32 ALIAS 2,
    PRIMARY KEY primary_key
) ENGINE=MergeTree
PARTITION BY partition_key
ORDER BY (primary_key, secondary_key);
-- The 'old' deduplicate, all columns are taken into account, i.e. row is removed only if all values in all columns are equal to corresponding values in previous row.
OPTIMIZE TABLE example FINAL DEDUPLICATE;
-- Deduplicate by all columns that are not `ALIAS` or `MATERIALIZED`: `primary_key`, `secondary_key`, `value`, `partition_key`, and `materialized_value` columns.
OPTIMIZE TABLE example FINAL DEDUPLICATE BY *;
-- Deduplicate by all columns that are not `ALIAS` or `MATERIALIZED` and explicitly not `materialized_value`: `primary_key`, `secondary_key`, `value`, and `partition_key` columns.
OPTIMIZE TABLE example FINAL DEDUPLICATE BY * EXCEPT materialized_value;
-- Deduplicate explicitly by `primary_key`, `secondary_key`, and `partition_key` columns.
OPTIMIZE TABLE example FINAL DEDUPLICATE BY primary_key, secondary_key, partition_key;
-- Deduplicate by any column matching a regex: `primary_key`, `secondary_key`, and `partition_key` columns.
OPTIMIZE TABLE example FINAL DEDUPLICATE BY COLUMNS('.*_key');

!!! warning "Warning" OPTIMIZE cant fix the “Too many parts” error.