From 398427806f8eb378a1f6a46d0d93843ad63280ab Mon Sep 17 00:00:00 2001
From: Alexei Averchenko
Date: Fri, 11 Nov 2016 15:36:12 +0300
Subject: [PATCH 1/3] Grammar in architecture.md

I'm proposing a slight update to the architecture.md's lede. I tried to preserve the meaning precisely (because I know nothing about ClickHouse) while making it more pleasant to read.
---
 doc/developers/architecture.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/developers/architecture.md b/doc/developers/architecture.md
index 6b683916597..f8dd3ffa8ba 100644
--- a/doc/developers/architecture.md
+++ b/doc/developers/architecture.md
@@ -1,13 +1,13 @@
 # ClickHouse quick architecture overview

-> Gray text is for side notes you don't have to read.
+> Optional side notes are in grey.

-ClickHouse is a true column oriented DBMS. Data is stored by columns. Even more, during query execution, data is processed by arrays (vectors, chunks of columns). In all places, where it is possible, operations on data are dispatched not for individual values but for arrays. It is called "vectorized query execution". This allows to lower dispatch cost relatively to cost of actual data processing.
+ClickHouse is a true column oriented DBMS. Data is stored by columns, and furthermore, during query execution data is processed by arrays (vectors, chunks of columns). Whenever possible, operations are dispatched not on individual values but on arrays. It is called "vectorized query execution", and it helps lower dispatch cost relative to the cost of actual data processing.

->This idea is not any new. It is dated back to `APL` programming language and its descendants: `A+`, `J`, `K`, `Q`. Array programming is widely used in scientific data processing. Also, this idea is not new for relational databases: for example, it is used in `Vectorwise` system.
+>This idea is nothing new. It dates back to the `APL` programming language and its descendants: `A+`, `J`, `K`, `Q`.
Array programming is widely used in scientific data processing. Neither is this idea something new in relational databases: for example, it is used in the `Vectorwise` system.

->To speed up query processing, there are two different approaches: vectorized query execution and runtime code generation. In second approach, the code is generated for every kind of query on the fly, removing all indirection and dynamic dispatch. No one of these approaches is strictly better than the other. Runtime code generation could be better if it will fuse many operations together and could fully utilize CPU execution units and pipeline. Vectorized query execution could be worse because it must deal with temporary vectors, that must be written to cache and read back. If temporary data does not fit in L2 cache, it becomes an issue. But vectorized query execution more easily utilize SIMD capabilities of CPU. There is [research paper](http://15721.courses.cs.cmu.edu/spring2016/papers/p5-sompolski.pdf) from our friends that shows, that better to combine both approaches. ClickHouse mostly use vectorized query execution and has limited initial support for runtime code generation (only inner loop for first stage of GROUP BY could be compiled).
+>There are two different approaches for speeding up query processing: vectorized query execution and runtime code generation. In the latter, the code is generated for every kind of query on the fly, removing all indirection and dynamic dispatch. None of these approaches is strictly better than the other. Runtime code generation can be better when fuses many operations together, thus fully utilizing CPU execution units and pipeline. Vectorized query execution can be worse, because it must deal with temporary vectors that must be written to cache and read back. If the temporary data does not fit in L2 cache, this becomes an issue. But vectorized query execution more easily utilizes SIMD capabilities of CPU.
A [research paper](http://15721.courses.cs.cmu.edu/spring2016/papers/p5-sompolski.pdf) written by our friends shows that it is better to combine both approaches. ClickHouse mostly uses vectorized query execution and has limited initial support for runtime code generation (only the inner loop of first stage of GROUP BY can be compiled).

 ## Columns

From 5b72f0956a27a4f724f7507e969938a0b2e03541 Mon Sep 17 00:00:00 2001
From: Alexey Milovidov
Date: Fri, 11 Nov 2016 20:33:43 +0300
Subject: [PATCH 2/3] Setting level to zero after ATTACH [#METR-23438].

---
 dbms/src/Storages/StorageReplicatedMergeTree.cpp | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/dbms/src/Storages/StorageReplicatedMergeTree.cpp b/dbms/src/Storages/StorageReplicatedMergeTree.cpp
index d067cfa7709..13c971e0bbf 100644
--- a/dbms/src/Storages/StorageReplicatedMergeTree.cpp
+++ b/dbms/src/Storages/StorageReplicatedMergeTree.cpp
@@ -2526,6 +2526,7 @@ static String getFakePartNameForDrop(const String & month_name, UInt64 left, UIn
     DayNum_t right_date = DayNum_t(static_cast(left_date) + lut.daysInMonth(start_time) - 1);

     /// The level is right-left+1: a part could not have been formed by that many merges or more.
+    /// TODO This is not true for parts after ATTACH.
     return ActiveDataPartSet::getPartName(left_date, right_date, left, right, right - left + 1);
 }

@@ -2745,6 +2746,7 @@ void StorageReplicatedMergeTree::attachPartition(ASTPtr query, const Field & fie
     ActiveDataPartSet::Part part;
     ActiveDataPartSet::parsePartName(part_name, part);
     part.left = part.right = --min_used_number;
+    part.level = 0; /// previous level has no sense after attach.
     String new_part_name = ActiveDataPartSet::getPartName(part.left_date, part.right_date, part.left, part.right, part.level);
     LOG_INFO(log, "Will attach " << part_name << " as " << new_part_name);

From 155bd8005c89d3fcb5cf952102442e6f422ccbdb Mon Sep 17 00:00:00 2001
From: f1yegor
Date: Sat, 12 Nov 2016 10:21:38 +0100
Subject: [PATCH 3/3] mention ...OrZero functions: toInt32OrZero and etc

---
 doc/reference_en.html | 1 +
 doc/reference_ru.html | 1 +
 2 files changed, 2 insertions(+)

diff --git a/doc/reference_en.html b/doc/reference_en.html
index 6e526d42136..df9e97422dc 100644
--- a/doc/reference_en.html
+++ b/doc/reference_en.html
@@ -4515,6 +4515,7 @@ Zero as an argument is considered "false," while any non-zero value is
 ===toUInt8, toUInt16, toUInt32, toUInt64===
 ===toInt8, toInt16, toInt32, toInt64===
 ===toFloat32, toFloat64===
+===toUInt8OrZero, toUInt16OrZero, toUInt32OrZero, toUInt64OrZero, toInt8OrZero, toInt16OrZero, toInt32OrZero, toInt64OrZero, toFloat32OrZero, toFloat64OrZero===
 ===toDate, toDateTime===
 ===toString===

diff --git a/doc/reference_ru.html b/doc/reference_ru.html
index 36e40493f90..26ad9098327 100644
--- a/doc/reference_ru.html
+++ b/doc/reference_ru.html
@@ -4579,6 +4579,7 @@ LIMIT 10
 ===toUInt8, toUInt16, toUInt32, toUInt64===
 ===toInt8, toInt16, toInt32, toInt64===
 ===toFloat32, toFloat64===
+===toUInt8OrZero, toUInt16OrZero, toUInt32OrZero, toUInt64OrZero, toInt8OrZero, toInt16OrZero, toInt32OrZero, toInt64OrZero, toFloat32OrZero, toFloat64OrZero===
 ===toDate, toDateTime===
 ===toString===