mirror of https://github.com/ClickHouse/ClickHouse.git synced 2024-09-24 02:30:51 +00:00

Alina Terekhova 4b33108dac

Co-authored-by: olgarev <56617294+olgarev@users.noreply.github.com>

2021-05-17 22:13:41 +02:00

7.2 KiB

Raw Blame History

toc_priority	toc_title
47	OPTIMIZE

OPTIMIZE Statement

This query tries to initialize an unscheduled merge of data parts for tables.

!!! warning "Warning" OPTIMIZE can’t fix the Too many parts error.

Syntax

OPTIMIZE TABLE [db.]name [ON CLUSTER cluster] [PARTITION partition | PARTITION ID 'partition_id'] [FINAL] [DEDUPLICATE [BY expression]]

The OPTMIZE query is supported for MergeTree family, the MaterializedView and the Buffer engines. Other table engines aren’t supported.

When OPTIMIZE is used with the ReplicatedMergeTree family of table engines, ClickHouse creates a task for merging and waits for execution on all nodes (if the replication_alter_partitions_sync setting is enabled).

If OPTIMIZE doesn’t perform a merge for any reason, it doesn’t notify the client. To enable notifications, use the optimize_throw_if_noop setting.
If you specify a PARTITION, only the specified partition is optimized. How to set partition expression.
If you specify FINAL, optimization is performed even when all the data is already in one part. Also merge is forced even if concurrent merges are performed.
If you specify DEDUPLICATE, then completely identical rows (unless by-clause is specified) will be deduplicated (all columns are compared), it makes sense only for the MergeTree engine.

BY expression

If you want to perform deduplication on custom set of columns rather than on all, you can specify list of columns explicitly or use any combination of *, COLUMNS or EXCEPT expressions. The explictly written or implicitly expanded list of columns must include all columns specified in row ordering expression (both primary and sorting keys) and partitioning expression (partitioning key).

!!! note "Note" Notice that * behaves just like in SELECT: MATERIALIZED and ALIAS columns are not used for expansion. Also, it is an error to specify empty list of columns, or write an expression that results in an empty list of columns, or deduplicate by an ALIAS column.

Syntax

OPTIMIZE TABLE table DEDUPLICATE; -- all columns 
OPTIMIZE TABLE table DEDUPLICATE BY *; -- excludes MATERIALIZED and ALIAS columns
OPTIMIZE TABLE table DEDUPLICATE BY colX,colY,colZ;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);

Examples

Consider the table:


CREATE TABLE dup_example (
    pk Int32, -- primary key
    sk Int32, -- secondary key
    value UInt32,
    mat UInt32 MATERIALIZED rand(),  -- materialized value
    alias UInt32 ALIAS 2, -- aliased value
    PRIMARY KEY pk
) ENGINE=MergeTree
ORDER BY (pk, sk);

Columns for deduplication are not specified, so all of them are taken into account. Row is removed only if all values in all columns are equal to corresponding values in previous row:

OPTIMIZE TABLE dup_example FINAL DEDUPLICATE;

Deduplicate by all columns that are not ALIAS or MATERIALIZED: in our case deduplicate by pk, sk and value columns.

OPTIMIZE TABLE dup_example FINAL DEDUPLICATE BY *;

Deduplicate explicitly by pk, sk, value and mat columns.

OPTIMIZE TABLE dup_example FINAL DEDUPLICATE BY pk, sk, value, mat;

Deduplicate by columns matching a regex '.*k': pk and sk columns.

OPTIMIZE TABLE dup_example FINAL DEDUPLICATE BY COLUMNS('.*k');

Error cases

Note that primary key column should not be missed in any BY expression. These queries will face errors:

OPTIMIZE TABLE dup_example DEDUPLICATE BY * EXCEPT(pk);
OPTIMIZE TABLE dup_example DEDUPLICATE BY sk, value;

Empty list cases:

OPTIMIZE TABLE dup_example DEDUPLICATE BY * EXCEPT(pk, sk, value, mat, alias); -- server error
OPTIMIZE TABLE dup_example DEDUPLICATE BY; -- syntax error

Replicated example on a ReplicatedMergeTree table engine:

DROP TABLE IF EXISTS replicated_deduplicate_by_columns_r1;
DROP TABLE IF EXISTS replicated_deduplicate_by_columns_r2;

SET replication_alter_partitions_sync = 2;

CREATE TABLE IF NOT EXISTS replicated_deduplicate_by_columns_r1 (
    id Int32, 
    value UInt32, 
    insert_time_ns DateTime64(9) MATERIALIZED now64(9), 
    insert_replica_id UInt8 MATERIALIZED randConstant()
) ENGINE=ReplicatedMergeTree('zookeeper_name_configured_in_auxiliary_zookeepers:path', 'r1') 
ORDER BY id;

CREATE TABLE IF NOT EXISTS replicated_deduplicate_by_columns_r2 (
    id Int32, 
    value UInt32, 
    insert_time_ns DateTime64(9) MATERIALIZED now64(9), 
    insert_replica_id UInt8 MATERIALIZED randConstant()
) ENGINE=ReplicatedMergeTree('zookeeper_name_configured_in_auxiliary_zookeepers:path', 'r2') 
ORDER BY id;

For the ReplicatedMergeTree engine we give the path to the table and name of the replica in Zookeeper.

Insert some data into both replicas and wait for them to sync:

SYSTEM SYNC REPLICA replicated_deduplicate_by_columns_r2;
SYSTEM SYNC REPLICA replicated_deduplicate_by_columns_r1;

Check that we have data on replicas:

SELECT 'r1', id, value, count(), uniqExact(insert_time_ns), uniqExact(insert_replica_id) 
FROM replicated_deduplicate_by_columns_r1 
GROUP BY id, value 
ORDER BY id, value;

SELECT 'r2', id, value, count(), uniqExact(insert_time_ns), uniqExact(insert_replica_id) 
FROM replicated_deduplicate_by_columns_r2 
GROUP BY id, value 
ORDER BY id, value;

Remove full duplicates from replica r1 based on all columns:

OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE;

Remove duplicates from replica r1 based on all columns that are not ALIAS or MATERIALIZED:

OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY *; -- except insert_time_ns, insert_replica_id

Deduplicate replica r1 explicitly by id and value columns:

OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY id, value;

Deduplicate by columns matching a regex:

OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY COLUMNS('[id, value]');

OPTIMIZE TABLE replicated_deduplicate_by_columns_r1 FINAL DEDUPLICATE BY COLUMNS('[i]') EXCEPT(insert_time_ns, insert_replica_id);

Don't forget to DROP tables and replicas SYSTEM DROP REPLICA afterwards.

7.2 KiB Raw Blame History Unescape Escape

OPTIMIZE Statement

BY expression

7.2 KiB

Raw Blame History