ClickHouse/tests/performance/codecs_int_insert.xml
Robert Schulze e6167d6b36
Deprecate Gorilla compression of non-float columns
Reasons:

1. The original Gorilla paper proposed a compression schema for pairs of
   time stamps and double-precision FP values. ClickHouse's Gorilla
   codec only implements compression of the latter and it does not
   impose any data type restrictions.
   - Data types != Float* or (U)Int* (e.g. Decimal, Point etc.) are
     definitely not supposed to be used with Gorilla.
   - (U)Int* types are debatable. The paper only considers
     integers-stored-as-FP-values, a practical use case for which
     Gorilla works well. Standalone integers are not considered which
     makes them at least suspicious.

2. Achieve consistency with FPC, another specialized floating-point
   timeseries codec, which rejects non-float data.

3. On practical datasets, ZSTD is often "good enough" (**) so it should
   be okay to disincentive non-ZSTD codecs a little bit. If needed,
   Delta and DoubleDelta codecs are viable alternative for slowly
   changing (time-series-like) integer sequences.

Since on-prem and hosted users may still have Gorilla-compressed
non-float data, this combination is only deprecated for now. No warning
or error will be emitted. Users are encouraged to migrate
Gorilla-compressed non-float data to an alternative codec. It is planned
to treat Gorilla-compressed non-float columns as "suspicious" six months
after this commit (i.e. in v23.6). Even then, it will still be possible
to set "allow_suspicious_codecs = true" and read and write
Gorilla-compressed non-float data.

(*) Sec. 4.1.2, "Gorilla restricts the value element in its tuple to a
    double floating point type.", https://doi.org/10.14778/2824032.2824078

(**) https://clickhouse.com/blog/optimize-clickhouse-codecs-compression-schema
2023-01-20 17:31:16 +00:00

54 lines
2.0 KiB
XML

<test>
<settings>
<allow_suspicious_codecs>1</allow_suspicious_codecs>
</settings>
<substitutions>
<substitution>
<name>codec</name>
<values>
<value>NONE</value> <!-- as a baseline -->
<value>LZ4</value>
<value>ZSTD</value>
<value>Delta</value>
<value>T64</value>
<value>DoubleDelta</value>
</values>
</substitution>
<substitution>
<name>type</name>
<values>
<value>UInt64</value>
</values>
</substitution>
<substitution>
<name>seq_type</name>
<values>
<value>seq</value>
<value>mon</value>
<value>rnd</value>
</values>
</substitution>
<substitution>
<name>num_rows</name>
<values>
<value>20000000</value>
</values>
</substitution>
</substitutions>
<create_query>CREATE TABLE IF NOT EXISTS codec_{seq_type}_{type}_{codec} (n {type} CODEC({codec}))
ENGINE = MergeTree PARTITION BY tuple() ORDER BY tuple()
SETTINGS parts_to_delay_insert = 5000, parts_to_throw_insert = 5000;</create_query>
<create_query>system stop merges</create_query>
<!-- Using limit to make query finite, allowing it to be run multiple times in a loop, reducing mean error -->
<query>INSERT INTO codec_seq_{type}_{codec} (n) SELECT number FROM system.numbers LIMIT {num_rows} SETTINGS max_threads=1</query>
<query>INSERT INTO codec_mon_{type}_{codec} (n) SELECT number*512+(intHash64(number)%512) FROM system.numbers LIMIT {num_rows} SETTINGS max_threads=1</query>
<query>INSERT INTO codec_rnd_{type}_{codec} (n) SELECT intHash64(number) FROM system.numbers LIMIT {num_rows} SETTINGS max_threads=1</query>
<drop_query>system start merges</drop_query>
<drop_query>DROP TABLE IF EXISTS codec_{seq_type}_{type}_{codec}</drop_query>
</test>