CLICKHOUSE-2720: progress on website and reference (#886)

* update presentations

* CLICKHOUSE-2936: redirect from clickhouse.yandex.ru and clickhouse.yandex.com

* update submodule

* lost files

* CLICKHOUSE-2981: prefer sphinx docs over original reference

* CLICKHOUSE-2981: docs styles more similar to main website + add flags to switch language links

* update presentations

* Less confusing directory structure (docs -> doc/reference/)

* Minify sphinx docs too

* Website release script: fail fast + pass docker hash on deploy

* Do not underline links in docs

* shorter

* cleanup docker images

* tune nginx config

* CLICKHOUSE-3043: get rid of habrastorage links

* Lost translation

* CLICKHOUSE-2936: temporary client-side redirect

* behaves weird in test

* put redirect back

* CLICKHOUSE-3047: copy docs txts to public too

* move to proper file

* remove old pages to avoid confusion

* Remove reference redirect warning for now

* Refresh README.md

* Yellow buttons in docs

* Use svg flags instead of unicode ones in docs

* fix test website instance

* Put flags to separate files

* wrong flag

* Copy Yandex.Metrica introduction from main page to docs

* Yet another home page structure change, couple new blocks (CLICKHOUSE-3045)

* Update Contacts section

* CLICKHOUSE-2849: more detailed legal information

* CLICKHOUSE-2978 preparation - split by files

* More changes in Contacts block

* Tune texts on index page

* update presentations

* One more benchmark

* Add usage sections to index page, adapted from slides

* Get the roadmap started, based on slides from last ClickHouse Meetup

* CLICKHOUSE-2977: some rendering tuning

* Get rid of excessive section in the end of getting started

* Make headers linkable

* CLICKHOUSE-2981: links to editing reference - https://github.com/yandex/ClickHouse/issues/849

* CLICKHOUSE-2981: fix mobile styles in docs

* Ban crawling of duplicating docs

* Open some external links in new tab

* Ban old docs too

* Lots of trivial fixes in english docs

* Lots of trivial fixes in russian docs

* Remove getting started copies in markdown

* Add Yandex.Webmaster

* Fix some sphinx warnings

* More warnings fixed in english docs

* More sphinx warnings fixed

* Add code-block:: text

* More code-block:: text

* These headers don't look that good

* Better switch between documentation languages

* merge use_case.rst into ya_metrika_task.rst

* Edit the agg_functions.rst texts

* Add lost empty lines

* Lost blank lines

* Add new logo sizes

* update presentations

* Next step in migrating to new documentation

* Fix all warnings in en reference

* Fix all warnings in ru reference

* Re-arrange existing reference

* Move operation tips to main reference

* Fix typos noticed by milovidov@

* Get rid of zookeeper.md

* Looks like duplicate of tutorial.html

* Fix some mess with html tags in tutorial

* No idea why nobody noticed this before, but it was completely unclear where to get the data

* Match code block styling between main and tutorial pages (in favor of the latter)

* Get rid of some copypaste in tutorial

* Normalize header styles

* Move example_datasets to sphinx

* Move presentations submodule to website

* Move and update README.md

* No point in duplicating articles from habrahabr here

* Move development-related docs as is for now

* doc/reference/ -> docs/ (to match the URL on website)

* Adapt links to match the previous commit

* Adapt development docs to rst (still lacks translation and strikethrough support)

* clean on release

* blacklist presentations in gulp

* strikethrough support in sphinx

* just copy development folder for now

* fix weird introduction in style article

* Style guide translation (WIP)

* Finish style guide translation to English

* gulp clean separately

* Update year in LICENSE

* Initial CONTRIBUTING.md

* Fix remaining links to old docs in tutorial

* Some tutorial fixes

* Typo

* Another typo

* Update list of authors from yandex-team according to git log
Author: Ivan Blinkov, 2017-06-20 17:19:03 +03:00 (committed by alexey-milovidov)
Parent: 9c7d30e0df
Commit: 67c2e50331
393 changed files with 5279 additions and 3658 deletions

.gitmodules (2 changed lines)

@@ -1,3 +1,3 @@
 [submodule "doc/presentations"]
-path = doc/presentations
+path = website/presentations
 url = https://github.com/yandex/clickhouse-presentations.git

AUTHORS (50 changed lines)

@@ -1,23 +1,43 @@
The following authors have created the source code of "ClickHouse"
published and distributed by YANDEX LLC as the owner:
Alexey Milovidov <milovidov@yandex-team.ru>
Vitaliy Lyudvichenko <vludv@yandex-team.ru>
Alexander Makarov <asealback@yandex-team.ru>
Alexander Prudaev <aprudaev@yandex-team.ru>
Alexey Arno <af-arno@yandex-team.ru>
Andrey Mironov <hertz@yandex-team.ru>
Michael Kolupaev <mkolupaev@yandex-team.ru>
Pavel Kartavyy <kartavyy@yandex-team.ru>
Evgeniy Gatov <egatov@yandex-team.ru>
Sergey Fedorov <fets@yandex-team.ru>
Vyacheslav Alipov <alipov@yandex-team.ru>
Vladimir Chebotarev <chebotarev@yandex-team.ru>
Dmitry Galuza <galuza@yandex-team.ru>
Sergey Magidovich <mgsergio@yandex-team.ru>
Anton Tikhonov <rokerjoker@yandex-team.ru>
Alexey Milovidov <milovidov@yandex-team.ru>
Alexey Tronov <vkusny@yandex-team.ru>
Alexey Vasiliev <loudhorr@yandex-team.ru>
Alexey Zatelepin <ztlpn@yandex-team.ru>
Amy Krishnevsky <krishnevsky@yandex-team.ru>
Andrey M <hertz@yandex-team.ru>
Andrey Mironov <hertz@yandex-team.ru>
Andrey Urusov <drobus@yandex-team.ru>
Vsevolod Orlov <vorloff@yandex-team.ru>
Roman Peshkurov <peshkurov@yandex-team.ru>
Michael Razuvaev <razuvaev@yandex-team.ru>
Anton Tikhonov <rokerjoker@yandex-team.ru>
Dmitry Bilunov <kmeaw@yandex-team.ru>
Dmitry Galuza <galuza@yandex-team.ru>
Eugene Konkov <konkov@yandex-team.ru>
Evgeniy Gatov <egatov@yandex-team.ru>
Ilya Khomutov <robert@yandex-team.ru>
Ilya Korolev <breeze@yandex-team.ru>
Ivan Blinkov <blinkov@yandex-team.ru>
Maxim Nikulin <mnikulin@yandex-team.ru>
Michael Kolupaev <mkolupaev@yandex-team.ru>
Michael Razuvaev <razuvaev@yandex-team.ru>
Nikolai Kochetov <nik-kochetov@yandex-team.ru>
Nikolay Vasiliev <lonlylocly@yandex-team.ru>
Nikolay Volosatov <bamx23@yandex-team.ru>
Pavel Artemkin <stanly@yandex-team.ru>
Pavel Kartaviy <kartavyy@yandex-team.ru>
Roman Nozdrin <drrtuy@yandex-team.ru>
Roman Peshkurov <peshkurov@yandex-team.ru>
Sergey Fedorov <fets@yandex-team.ru>
Sergey Lazarev <hamilkar@yandex-team.ru>
Sergey Magidovich <mgsergio@yandex-team.ru>
Sergey Serebryanik <serebrserg@yandex-team.ru>
Sergey Veletskiy <velom@yandex-team.ru>
Vasily Okunev <okunev@yandex-team.ru>
Vitaliy Lyudvichenko <vludv@yandex-team.ru>
Vladimir Chebotarev <chebotarev@yandex-team.ru>
Vsevolod Orlov <vorloff@yandex-team.ru>
Vyacheslav Alipov <alipov@yandex-team.ru>
Yuriy Galitskiy <orantius@yandex-team.ru>

CONTRIBUTING.md (new file, 32 lines)

@@ -0,0 +1,32 @@
# Contributing to ClickHouse
## Technical info
The developer guide for writing ClickHouse code is published on the official website alongside the usage and operations documentation:
https://clickhouse.yandex/docs/en/development/index.html
## Legal info
In order for us (YANDEX LLC) to accept patches and other contributions from you, you will have to adopt our Yandex Contributor License Agreement (the "**CLA**"). You can find the current version of the CLA here:
1) https://yandex.ru/legal/cla/?lang=en (in English) and
2) https://yandex.ru/legal/cla/?lang=ru (in Russian).
By adopting the CLA, you state the following:
* You willingly license your contributions to us for our open source projects under the terms of the CLA,
* You have read the terms and conditions of the CLA and agree with them in full,
* You are legally able to provide and license your contributions as stated,
* We may use your contributions for our open source projects and for any of our other projects too,
* We rely on your assurances concerning the rights of third parties in relation to your contributions.
If you agree with these principles, please read and adopt our CLA. By providing us your contributions, you declare that you have already read and adopted our CLA, and we may freely merge your contributions into the corresponding open source project and use them further in accordance with the terms and conditions of the CLA.
If you have already adopted the terms and conditions of the CLA, you can provide your contributions. When you submit your pull request, please add the following information to it:
```
I hereby agree to the terms of the CLA available at: [link].
```
Replace the bracketed text as follows:
* [link] is the link to the current version of the CLA: https://yandex.ru/legal/cla/?lang=en (in English) or https://yandex.ru/legal/cla/?lang=ru (in Russian).
It is enough to provide us such a notification once.


@@ -1,4 +1,4 @@
-Copyright 2016 YANDEX LLC
+Copyright 2016-2017 YANDEX LLC
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.


@@ -1,5 +0,0 @@
Read the documentation and install ClickHouse: [https://clickhouse.yandex/](https://clickhouse.yandex/)
How to build ClickHouse from source: [build.md](./build.md)
Example datasets to play with are located in the [example_datasets](./example_datasets) directory.


@@ -1,117 +0,0 @@
ClickHouse administration tips.
CPU
SSE 4.2 instruction set support is required. Most recent (since 2008) CPUs have this instruction set.
When choosing between a CPU with more cores but a slightly lower frequency and a CPU with fewer cores but a higher frequency, choose the former.
For example, 16 cores at 2600 MHz is better than 8 cores at 3600 MHz.
Hyper-Threading
Don't disable hyper-threading. Some queries will benefit from hyper-threading and some will not.
Turbo-Boost
Don't disable turbo-boost. It provides a significant performance gain under a typical load.
Use the 'turbostat' tool to see the real CPU frequency under load.
CPU scaling governor
Always use the 'performance' scaling governor. The 'ondemand' governor performs much worse, even under constantly high demand.
echo 'performance' | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
CPU throttling
Your CPU can overheat. Use 'dmesg' to see whether its clock rate was limited due to thermal throttling.
Your CPU can also be power-capped in the datacenter. Use the 'turbostat' tool under load to monitor this.
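For example, a quick check for both conditions might look like this (a minimal sketch; tool availability and output format vary by distribution):
```
# Look for thermal throttling messages in the kernel log.
dmesg | grep -i throttl
# Watch the effective CPU frequency and package power under load (Ctrl+C to stop).
sudo turbostat
```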
RAM
For small amounts of data (up to ~200 GB compressed), it is best to use as much RAM as the volume of data.
For larger amounts of data, if you run interactive (online) queries, use a reasonable amount of RAM (128 GB or more) so that the hot data fits in the page cache.
Even for data volumes of ~50 TB per server, using 128 GB of RAM is much better for query performance than 64 GB.
Swap
Disable swap. The only reason not to disable swap is if you are running ClickHouse on your personal laptop or desktop.
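A minimal sketch of disabling swap (the fstab edit is optional and the device names are installation-specific):
```
# Turn off all swap devices for the current boot.
sudo swapoff -a
# Optionally make it permanent by commenting out swap entries in /etc/fstab.
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab
```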
Huge pages
Disable transparent huge pages. It interferes badly with memory allocators, leading to major performance degradation.
echo 'never' | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
Use 'perf top' to monitor the time the kernel spends on memory management.
Don't allocate permanent huge pages.
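For instance, verifying the current state might look like this (a sketch; field names can differ slightly between kernels):
```
# "never" should be the selected policy after the command above.
cat /sys/kernel/mm/transparent_hugepage/enabled
# No permanent huge pages should be reserved.
grep HugePages_Total /proc/meminfo
```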
Storage subsystem
If you can afford SSDs, use SSDs.
Otherwise, use HDDs. 7200 RPM SATA HDDs are fine.
Prefer a larger number of servers with local storage over a smaller number of servers with huge disk shelves.
Of course, huge disk shelves are fine for archive storage that is queried rarely.
RAID
When using HDDs, you could use RAID-10, RAID-5, RAID-6 or RAID-50.
Use Linux software RAID (mdadm). It is better not to use LVM.
When creating RAID-10, choose the 'far' layout.
Prefer RAID-10 if you can afford it.
Don't use RAID-5 on more than 4 HDDs - use RAID-6 or RAID-50. RAID-6 is better.
When using RAID-5, RAID-6, or RAID-50, always increase stripe_cache_size, because the default setting is awful.
echo 4096 | sudo tee /sys/block/md2/md/stripe_cache_size
The exact number is calculated from the number of devices and the chunk size: 2 * num_devices * chunk_size_in_bytes / 4096.
A chunk size of 1024 KB is fine for all RAID configurations.
Never use chunk sizes that are too small or too large.
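As an illustration of the formula above (8 devices and a 1024K chunk are example values, not recommendations):
```
# 2 * num_devices * chunk_size_in_bytes / 4096; for 8 devices with a 1024K chunk this gives 4096.
echo $(( 2 * 8 * 1024 * 1024 / 4096 )) | sudo tee /sys/block/md2/md/stripe_cache_size
```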
On SSDs, you could use RAID-0.
Regardless of RAID, always use replication for data safety.
Enable NCQ and use a high queue depth. Use the CFQ scheduler for HDDs and noop for SSDs. Don't lower the readahead setting.
Enable the write cache on HDDs.
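For example (sda/sdb are placeholders for an HDD and an SSD; adjust to your devices):
```
# CFQ for a rotational disk, noop for an SSD.
echo cfq  | sudo tee /sys/block/sda/queue/scheduler
echo noop | sudo tee /sys/block/sdb/queue/scheduler
# Enable the write cache on the HDD.
sudo hdparm -W1 /dev/sda
```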
Filesystem
Ext4 is fine. Mount it with noatime,nobarrier.
XFS is fine too, but it is less tested with ClickHouse.
Most other filesystems should also work fine. Filesystems with delayed allocation are better.
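A hedged example of such a mount (the device and mount point are placeholders):
```
# One-off remount of an already mounted ext4 data volume with the recommended options.
sudo mount -o remount,noatime,nobarrier /var/lib/clickhouse
# Persistent variant: use the same options in the corresponding /etc/fstab entry, e.g.
# /dev/md2  /var/lib/clickhouse  ext4  noatime,nobarrier  0 0
```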
Linux kernel
Don't use an overly old Linux kernel. For example, as of 2015, 3.18.19 was fine.
You can use the Yandex kernel: https://github.com/yandex/smart - it gives at least a 5% performance increase.
Network
When using IPv6, you must increase the route cache.
Linux kernels before 3.2 have awful bugs in the IPv6 implementation.
Prefer at least a 10 Gbit network. 1 Gbit will also work, but it is much worse for repairing replicas that hold tens of terabytes of data and for processing huge distributed queries with a lot of intermediate data.
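As a sketch, increasing the IPv6 route cache might look like the following (the value is an assumption; tune it for your traffic):
```
# Increase the IPv6 route cache size (value is an assumption).
sudo sysctl -w net.ipv6.route.max_size=16384
```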
ZooKeeper
You probably already have ZooKeeper running for other purposes.
It's fine to reuse an existing ZooKeeper installation if it is not overloaded.
Use a recent version of ZooKeeper; at least 3.5 is fine. The version in your Linux package repository might be outdated.
With default settings, ZooKeeper has a time bomb:
"A ZooKeeper server will not remove old snapshots and log files when using the default configuration (see autopurge below), this is the responsibility of the operator."
You need to defuse the bomb.
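Concretely, defusing it means enabling autopurge; the values below match the production zoo.cfg shown in the next file (the config path is an assumption):
```
# Append autopurge settings to the ZooKeeper config (path is an assumption).
sudo tee -a /etc/zookeeper/conf/zoo.cfg <<'EOF'
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
EOF
```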


@@ -1,138 +0,0 @@
As of 2017-05-20, we have the following configuration in production:
ZooKeeper version is 3.5.1.
zoo.cfg:
```
# http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=30000
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10
maxClientCnxns=2000
maxSessionTimeout=60000000
# the directory where the snapshot is stored.
dataDir=/opt/zookeeper/{{ cluster['name'] }}/data
# Place the dataLogDir to a separate physical disc for better performance
dataLogDir=/opt/zookeeper/{{ cluster['name'] }}/logs
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
# To avoid seeks ZooKeeper allocates space in the transaction log file in
# blocks of preAllocSize kilobytes. The default block size is 64M. One reason
# for changing the size of the blocks is to reduce the block size if snapshots
# are taken more often. (Also, see snapCount).
preAllocSize=131072
# Clients can submit requests faster than ZooKeeper can process them,
# especially if there are a lot of clients. To prevent ZooKeeper from running
# out of memory due to queued requests, ZooKeeper will throttle clients so that
# there is no more than globalOutstandingLimit outstanding requests in the
# system. The default limit is 1,000. ZooKeeper logs transactions to a
# transaction log. After snapCount transactions are written to a log file a
# snapshot is started and a new transaction log file is started. The default
# snapCount is 10,000.
snapCount=3000000
# If this option is defined, requests will be will logged to a trace file named
# traceFile.year.month.day.
#traceFile=
# Leader accepts client connections. Default value is "yes". The leader machine
# coordinates updates. For higher update throughput at the slight expense of
# read throughput the leader can be configured to not accept clients and focus
# on coordination.
leaderServes=yes
standaloneEnabled=false
dynamicConfigFile=/etc/zookeeper-{{ cluster['name'] }}/conf/zoo.cfg.dynamic
```
Java version:
```
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
```
Java parameters:
```
NAME=zookeeper-{{ cluster['name'] }}
ZOOCFGDIR=/etc/$NAME/conf
# TODO this is really ugly
# How to find out, which jars are needed?
# seems, that log4j requires the log4j.properties file to be in the classpath
CLASSPATH="$ZOOCFGDIR:/usr/build/classes:/usr/build/lib/*.jar:/usr/share/zookeeper/zookeeper-3.5.1-metrika.jar:/usr/share/zookeeper/slf4j-log4j12-1.7.5.jar:/usr/share/zookeeper/slf4j-api-1.7.5.jar:/usr/share/zookeeper/servlet-api-2.5-20081211.jar:/usr/share/zookeeper/netty-3.7.0.Final.jar:/usr/share/zookeeper/log4j-1.2.16.jar:/usr/share/zookeeper/jline-2.11.jar:/usr/share/zookeeper/jetty-util-6.1.26.jar:/usr/share/zookeeper/jetty-6.1.26.jar:/usr/share/zookeeper/javacc.jar:/usr/share/zookeeper/jackson-mapper-asl-1.9.11.jar:/usr/share/zookeeper/jackson-core-asl-1.9.11.jar:/usr/share/zookeeper/commons-cli-1.2.jar:/usr/src/java/lib/*.jar:/usr/etc/zookeeper"
ZOOCFG="$ZOOCFGDIR/zoo.cfg"
ZOO_LOG_DIR=/var/log/$NAME
USER=zookeeper
GROUP=zookeeper
PIDDIR=/var/run/$NAME
PIDFILE=$PIDDIR/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME
JAVA=/usr/bin/java
ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
ZOO_LOG4J_PROP="INFO,ROLLINGFILE"
JMXLOCALONLY=false
JAVA_OPTS="-Xms{{ cluster.get('xms','128M') }} \
-Xmx{{ cluster.get('xmx','1G') }} \
-Xloggc:/var/log/$NAME/zookeeper-gc.log \
-XX:+UseGCLogFileRotation \
-XX:NumberOfGCLogFiles=16 \
-XX:GCLogFileSize=16M \
-verbose:gc \
-XX:+PrintGCTimeStamps \
-XX:+PrintGCDateStamps \
-XX:+PrintGCDetails \
-XX:+PrintTenuringDistribution \
-XX:+PrintGCApplicationStoppedTime \
-XX:+PrintGCApplicationConcurrentTime \
-XX:+PrintSafepointStatistics \
-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled"
```
Salt init:
```
description "zookeeper-{{ cluster['name'] }} centralized coordination service"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
limit nofile 8192 8192
pre-start script
[ -r "/etc/zookeeper-{{ cluster['name'] }}/conf/environment" ] || exit 0
. /etc/zookeeper-{{ cluster['name'] }}/conf/environment
[ -d $ZOO_LOG_DIR ] || mkdir -p $ZOO_LOG_DIR
chown $USER:$GROUP $ZOO_LOG_DIR
end script
script
. /etc/zookeeper-{{ cluster['name'] }}/conf/environment
[ -r /etc/default/zookeeper ] && . /etc/default/zookeeper
if [ -z "$JMXDISABLE" ]; then
JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=$JMXLOCALONLY"
fi
exec start-stop-daemon --start -c $USER --exec $JAVA --name zookeeper-{{ cluster['name'] }} \
-- -cp $CLASSPATH $JAVA_OPTS -Dzookeeper.log.dir=${ZOO_LOG_DIR} \
-Dzookeeper.root.logger=${ZOO_LOG4J_PROP} $ZOOMAIN $ZOOCFG
end script
```


@@ -1,166 +0,0 @@
# How to build ClickHouse
The build should work on Ubuntu Linux 12.04, 14.04, or newer.
With appropriate changes, it should also work on any other Linux distribution.
The build is not intended to work on Mac OS X.
Only x86_64 with SSE 4.2 is supported. Support for AArch64 is experimental.
To test for SSE 4.2, do
```
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
```
## Install Git and CMake
```
sudo apt-get install git cmake
```
## Detect number of threads
```
export THREADS=$(grep -c ^processor /proc/cpuinfo)
```
## Install GCC 6
There are several ways to do it.
### 1. Install from PPA package.
```
sudo apt-get install software-properties-common
sudo apt-add-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-6 g++-6
```
### 2. Install GCC 6 from sources.
Example:
```
# Download gcc from https://gcc.gnu.org/mirrors.html
wget ftp://ftp.fu-berlin.de/unix/languages/gcc/releases/gcc-6.2.0/gcc-6.2.0.tar.bz2
tar xf gcc-6.2.0.tar.bz2
cd gcc-6.2.0
./contrib/download_prerequisites
cd ..
mkdir gcc-build
cd gcc-build
../gcc-6.2.0/configure --enable-languages=c,c++
make -j $THREADS
sudo make install
hash gcc g++
gcc --version
sudo ln -s /usr/local/bin/gcc /usr/local/bin/gcc-6
sudo ln -s /usr/local/bin/g++ /usr/local/bin/g++-6
sudo ln -s /usr/local/bin/gcc /usr/local/bin/cc
sudo ln -s /usr/local/bin/g++ /usr/local/bin/c++
# /usr/local/bin/ should be in $PATH
```
## Use GCC 6 for builds
```
export CC=gcc-6
export CXX=g++-6
```
## Install required libraries from packages
```
sudo apt-get install libicu-dev libreadline-dev libmysqlclient-dev libssl-dev unixodbc-dev
```
# Checkout ClickHouse sources
To get the latest stable version:
```
git clone -b stable git@github.com:yandex/ClickHouse.git
# or: git clone -b stable https://github.com/yandex/ClickHouse.git
cd ClickHouse
```
For development, switch to the `master` branch.
For the latest release candidate, switch to the `testing` branch.
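For example, switching to one of these branches after the clone above might look like this (a minimal sketch):
```
# Switch to the branch you need, e.g. the latest release candidate:
git checkout testing
```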
# Build ClickHouse
There are two variants of build.
## 1. Build release package.
### Install prerequisites to build debian packages.
```
sudo apt-get install devscripts dupload fakeroot debhelper
```
### Install recent version of clang.
Clang is embedded into the ClickHouse package and used at runtime. The minimum version is 3.8.0. It is optional.
There are two variants:
#### 1. Build clang from sources.
```
cd ..
sudo apt-get install subversion
mkdir llvm
cd llvm
svn co http://llvm.org/svn/llvm-project/llvm/tags/RELEASE_400/final llvm
cd llvm/tools
svn co http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_400/final clang
cd ..
cd projects/
svn co http://llvm.org/svn/llvm-project/compiler-rt/tags/RELEASE_400/final compiler-rt
cd ../..
mkdir build
cd build/
cmake -D CMAKE_BUILD_TYPE:STRING=Release ../llvm
make -j $THREADS
sudo make install
hash clang
```
#### 2. Install from packages.
On Ubuntu 16.04 or newer:
```
sudo apt-get install clang
```
You may also build ClickHouse with clang for development purposes.
For production releases, GCC is used.
### Run release script.
```
rm -f ../clickhouse*.deb
./release
```
You will find the built packages in the parent directory.
```
ls -l ../clickhouse*.deb
```
Note that usage of debian packages is not required.
ClickHouse has no runtime dependencies except libc,
so it could work on almost any Linux.
### Installing the freshly built packages on a development server.
```
sudo dpkg -i ../clickhouse*.deb
sudo service clickhouse-server start
```
## 2. Build to work with code.
```
mkdir build
cd build
cmake ..
make -j $THREADS
cd ..
```
To create an executable, run
```
make clickhouse
```
This will create the dbms/src/Server/clickhouse executable, which can be used with --client or --server.
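For instance, a quick smoke test of the freshly built binary might look like the sketch below (the config path is an assumption; adjust it to your checkout):
```
cd build
# Start a server with a config file (path is an assumption).
./dbms/src/Server/clickhouse --server --config-file=../dbms/src/Server/config.xml &
sleep 5
# Connect with the client and run a trivial query.
./dbms/src/Server/clickhouse --client --query="SELECT 1"
```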


@@ -1,44 +0,0 @@
# How to build ClickHouse
The build should work on Mac OS X 10.12. If you're using an earlier version, you can try to build ClickHouse using Gentoo Prefix and clang sl in this instruction.
With appropriate changes, the build should work on any other OS X version.
## Install Homebrew
```
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
```
## Install required compilers, tools, libraries
```
brew install cmake gcc icu4c mysql openssl unixodbc libtool gettext homebrew/dupes/zlib readline boost --cc=gcc-6
```
# Checkout ClickHouse sources
To get the latest stable version:
```
git clone -b stable --recursive --depth=10 git@github.com:yandex/ClickHouse.git
# or: git clone -b stable --recursive --depth=10 https://github.com/yandex/ClickHouse.git
cd ClickHouse
```
For development, switch to the `master` branch.
For the latest release candidate, switch to the `testing` branch.
# Build ClickHouse
```
mkdir build
cd build
cmake .. -DCMAKE_CXX_COMPILER=`which g++-6` -DCMAKE_C_COMPILER=`which gcc-6`
make -j `sysctl -n hw.ncpu`
cd ..
```
# Caveats
If you intend to run clickhouse-server, make sure to increase the system's maxfiles variable. See [MacOS.md](https://github.com/yandex/ClickHouse/blob/master/MacOS.md) for more details.
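As a hedged illustration (the exact limit values are assumptions; see MacOS.md for the authoritative steps):
```
# Raise the open-file limits for the current session (values are assumptions).
sudo launchctl limit maxfiles 524288 524288
ulimit -n 524288
```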


@@ -1,214 +0,0 @@
# Overview of ClickHouse architecture
> Optional side notes are in grey.
ClickHouse is a true column-oriented DBMS. Data is stored by columns, and during query execution data is processed by arrays (vectors or chunks of columns). Whenever possible, operations are dispatched on arrays, rather than on individual values. This is called "vectorized query execution," and it helps lower dispatch cost relative to the cost of actual data processing.
>This idea is nothing new. It dates back to the `APL` programming language and its descendants: `A+`, `J`, `K`, and `Q`. Array programming is widely used in scientific data processing. Neither is this idea something new in relational databases: for example, it is used in the `Vectorwise` system.
>There are two different approaches for speeding up query processing: vectorized query execution and runtime code generation. In the latter, the code is generated for every kind of query on the fly, removing all indirection and dynamic dispatch. Neither of these approaches is strictly better than the other. Runtime code generation can be better when it fuses many operations together, thus fully utilizing CPU execution units and the pipeline. Vectorized query execution can be less practical, because it involves temporary vectors that must be written to cache and read back. If the temporary data does not fit in the L2 cache, this becomes an issue. But vectorized query execution more easily utilizes the SIMD capabilities of the CPU. A [research paper](http://15721.courses.cs.cmu.edu/spring2016/papers/p5-sompolski.pdf) written by our friends shows that it is better to combine both approaches. ClickHouse mainly uses vectorized query execution and has limited initial support for runtime code generation (only the inner loop of the first stage of GROUP BY can be compiled).
## Columns
To represent columns in memory (actually, chunks of columns), the `IColumn` interface is used. This interface provides helper methods for implementation of various relational operators. Almost all operations are immutable: they do not modify the original column, but create a new modified one. For example, the `IColumn::filter` method accepts a filter byte mask and creates a new filtered column. It is used for the `WHERE` and `HAVING` relational operators. Additional examples: the `IColumn::permute` method to support `ORDER BY`, the `IColumn::cut` method to support `LIMIT`, and so on.
Various `IColumn` implementations (`ColumnUInt8`, `ColumnString` and so on) are responsible for the memory layout of columns. Memory layout is usually a contiguous array. For the integer type of columns it is just one contiguous array, like `std::vector`. For `String` and `Array` columns, it is two vectors: one for all array elements, placed contiguously, and a second one for offsets to the beginning of each array. There is also `ColumnConst` that stores just one value in memory, but looks like a column.
## Field
Nevertheless, it is possible to work with individual values as well. To represent an individual value, `Field` is used. `Field` is just a discriminated union of `UInt64`, `Int64`, `Float64`, `String` and `Array`. `IColumn` has the `operator[]` method to get the n-th value as a `Field`, and the `insert` method to append a `Field` to the end of a column. These methods are not very efficient, because they require dealing with temporary `Field` objects representing an individual value. There are more efficient methods, such as `insertFrom`, `insertRangeFrom`, and so on.
`Field` doesn't have enough information about a specific data type for a table. For example, `UInt8`, `UInt16`, `UInt32`, and `UInt64` are all represented as `UInt64` in a `Field`.
### Leaky abstractions
`IColumn` has methods for common relational transformations of data, but they don't meet all needs. For example, `ColumnUInt64` doesn't have a method to calculate the sum of two columns, and `ColumnString` doesn't have a method to run a substring search. These countless routines are implemented outside of `IColumn`.
Various functions on columns can be implemented in a generic, non-efficient way using `IColumn` methods to extract `Field` values, or in a specialized way using knowledge of inner memory layout of data in a specific `IColumn` implementation. To do this, functions are cast to a specific `IColumn` type and deal with internal representation directly. For example, `ColumnUInt64` has the `getData` method that returns a reference to an internal array, then a separate routine reads or fills that array directly. In fact, we have "leaky abstractions" to allow efficient specializations of various routines.
## Data types
`IDataType` is responsible for serialization and deserialization: for reading and writing chunks of columns or individual values in binary or text form.
`IDataType` directly corresponds to data types in tables. For example, there are `DataTypeUInt32`, `DataTypeDateTime`, `DataTypeString` and so on.
`IDataType` and `IColumn` are only loosely related to each other. Different data types can be represented in memory by the same `IColumn` implementations. For example, `DataTypeUInt32` and `DataTypeDateTime` are both represented by `ColumnUInt32` or `ColumnConstUInt32`. In addition, the same data type can be represented by different `IColumn` implementations. For example, `DataTypeUInt8` can be represented by `ColumnUInt8` or `ColumnConstUInt8`.
`IDataType` only stores metadata. For instance, `DataTypeUInt8` doesn't store anything at all (except vptr) and `DataTypeFixedString` stores just `N` (the size of fixed-size strings).
`IDataType` has helper methods for various data formats. Examples are methods to serialize a value with possible quoting, to serialize a value for JSON, and to serialize a value as part of XML format. There is no direct correspondence to data formats. For example, the different data formats `Pretty` and `TabSeparated` can use the same `serializeTextEscaped` helper method from the `IDataType` interface.
## Block
A `Block` is a container that represents a subset (chunk) of a table in memory. It is just a set of triples: `(IColumn, IDataType, column name)`. During query execution, data is processed by `Block`s. If we have a `Block`, we have data (in the `IColumn` object), we have information about its type (in `IDataType`) that tells us how to deal with that column, and we have the column name (either the original column name from the table, or some artificial name assigned for getting temporary results of calculations).
When we calculate some function over columns in a block, we add another column with its result to the block, and we don't touch columns for arguments of the function because operations are immutable. Later, unneeded columns can be removed from the block, but not modified. This is convenient for elimination of common subexpressions.
>Blocks are created for every processed chunk of data. Note that for the same type of calculation, the column names and types remain the same for different blocks, and only column data changes. It is better to split block data from the block header, because small block sizes will have a high overhead of temporary strings for copying shared_ptrs and column names.
## Block Streams
Block streams are for processing data. We use streams of blocks to read data from somewhere, perform data transformations, or write data to somewhere. `IBlockInputStream` has the `read` method to fetch the next block while available. `IBlockOutputStream` has the `write` method to push the block somewhere.
Streams are responsible for:
1. Reading or writing to a table. The table just returns a stream for reading or writing blocks.
2. Implementing data formats. For example, if you want to output data to a terminal in `Pretty` format, you create a block output stream where you push blocks, and it formats them.
3. Performing data transformations. Let's say you have `IBlockInputStream` and want to create a filtered stream. You create `FilterBlockInputStream` and initialize it with your stream. Then when you pull a block from `FilterBlockInputStream`, it pulls a block from your stream, filters it, and returns the filtered block to you. Query execution pipelines are represented this way.
There are more sophisticated transformations. For example, when you pull from `AggregatingBlockInputStream`, it reads all data from its source, aggregates it, and then returns a stream of aggregated data for you. Another example: `UnionBlockInputStream` accepts many input sources in the constructor and also a number of threads. It launches multiple threads and reads from multiple sources in parallel.
> Block streams use the "pull" approach to control flow: when you pull a block from the first stream, it consequently pulls the required blocks from nested streams, and the entire execution pipeline will work. Neither "pull" nor "push" is the best solution, because control flow is implicit, and that limits implementation of various features like simultaneous execution of multiple queries (merging many pipelines together). This limitation could be overcome with coroutines or just running extra threads that wait for each other. We may have more possibilities if we make control flow explicit: if we locate the logic for passing data from one calculation unit to another outside of those calculation units. Read this [nice article](http://journal.stuffwithstuff.com/2013/01/13/iteration-inside-and-out/) for more thoughts.
> We should note that the query execution pipeline creates temporary data at each step. We try to keep block size small enough so that temporary data fits in the CPU cache. With that assumption, writing and reading temporary data is almost free in comparison with other calculations. We could consider an alternative, which is to fuse many operations in the pipeline together, to make the pipeline as short as possible and remove much of the temporary data. This could be an advantage, but it also has drawbacks. For example, a split pipeline makes it easy to implement caching intermediate data, stealing intermediate data from similar queries running at the same time, and merging pipelines for similar queries.
## Formats
Data formats are implemented with block streams. There are "presentational" formats only suitable for output of data to the client, such as `Pretty` format, which provides only `IBlockOutputStream`. And there are input/output formats, such as `TabSeparated` or `JSONEachRow`.
There are also row streams: `IRowInputStream` and `IRowOutputStream`. They allow you to pull/push data by individual rows, not by blocks. And they are only needed to simplify implementation of row-oriented formats. The wrappers `BlockInputStreamFromRowInputStream` and `BlockOutputStreamFromRowOutputStream` allow you to convert row-oriented streams to regular block-oriented streams.
## IO
For byte-oriented input/output, there are `ReadBuffer` and `WriteBuffer` abstract classes. They are used instead of C++ `iostream`s. Don't worry: every mature C++ project is using something other than `iostream`s for good reasons.
`ReadBuffer` and `WriteBuffer` are just a contiguous buffer and a cursor pointing to the position in that buffer. Implementations may own or not own the memory for the buffer. There is a virtual method to fill the buffer with the following data (for `ReadBuffer`) or to flush the buffer somewhere (for `WriteBuffer`). The virtual methods are rarely called.
Implementations of `ReadBuffer`/`WriteBuffer` are used for working with files, file descriptors, and network sockets, for implementing compression (`CompressedWriteBuffer` is initialized with another WriteBuffer and performs compression before writing data to it), and for other purposes; the names `ConcatReadBuffer`, `LimitReadBuffer`, and `HashingWriteBuffer` speak for themselves.
Read/WriteBuffers only deal with bytes. To help with formatted input/output (for instance, to write a number in decimal format), there are functions from `ReadHelpers` and `WriteHelpers` header files.
Let's look at what happens when you want to write a result set in `JSON` format to stdout. You have a result set ready to be fetched from `IBlockInputStream`. You create `WriteBufferFromFileDescriptor(STDOUT_FILENO)` to write bytes to stdout. You create `JSONRowOutputStream`, initialized with that `WriteBuffer`, to write rows in `JSON` to stdout. You create `BlockOutputStreamFromRowOutputStream` on top of it, to represent it as `IBlockOutputStream`. Then you call `copyData` to transfer data from `IBlockInputStream` to `IBlockOutputStream`, and everything works. Internally, `JSONRowOutputStream` will write various JSON delimiters and call the `IDataType::serializeTextJSON` method with a reference to `IColumn` and the row number as arguments. Consequently, `IDataType::serializeTextJSON` will call a method from `WriteHelpers.h`: for example, `writeText` for numeric types and `writeJSONString` for `DataTypeString`.
## Tables
Tables are represented by the `IStorage` interface. Different implementations of that interface are different table engines. Examples are `StorageMergeTree`, `StorageMemory`, and so on. Instances of these classes are just tables.
The most important `IStorage` methods are `read` and `write`. There are also `alter`, `rename`, `drop`, and so on. The `read` method accepts the following arguments: the set of columns to read from a table, the `AST` query to consider, and the desired number of streams to return. It returns one or multiple `IBlockInputStream` objects and information about the stage of data processing that was completed inside a table engine during query execution.
In most cases, the read method is only responsible for reading the specified columns from a table, not for any further data processing. All further data processing is done by the query interpreter and is outside the responsibility of `IStorage`.
But there are notable exceptions:
- The AST query is passed to the `read` method and the table engine can use it to derive index usage and to read less data from a table.
- Sometimes the table engine can process data itself to a specific stage. For example, `StorageDistributed` can send a query to remote servers, ask them to process data to a stage where data from different remote servers can be merged, and return that preprocessed data.
The query interpreter then finishes processing the data.
The table's `read` method can return multiple `IBlockInputStream` objects to allow parallel data processing. These multiple block input streams can read from a table in parallel. Then you can wrap these streams with various transformations (such as expression evaluation or filtering) that can be calculated independently and create a `UnionBlockInputStream` on top of them, to read from multiple streams in parallel.
There are also `TableFunction`s. These are functions that return a temporary `IStorage` object to use in the `FROM` clause of a query.
To get a quick idea of how to implement your own table engine, look at something simple, like `StorageMemory` or `StorageTinyLog`.
> As the result of the `read` method, `IStorage` returns `QueryProcessingStage` information about what parts of the query were already calculated inside storage. Currently we have only very coarse granularity for that information. There is no way for the storage to say "I have already processed this part of the expression in WHERE, for this range of data". We need to work on that.
## Parsers
A query is parsed by a hand-written recursive descent parser. For example, `ParserSelectQuery` just recursively calls the underlying parsers for various parts of the query. Parsers create an `AST`. The `AST` is represented by nodes, which are instances of `IAST`.
> Parser generators are not used for historical reasons.
## Interpreters
Interpreters are responsible for creating the query execution pipeline from an `AST`. There are simple interpreters, such as `InterpreterExistsQuery`and `InterpreterDropQuery`, or the more sophisticated `InterpreterSelectQuery`. The query execution pipeline is a combination of block input or output streams. For example, the result of interpreting the `SELECT` query is the `IBlockInputStream` to read the result set from; the result of the INSERT query is the `IBlockOutputStream` to write data for insertion to; and the result of interpreting the `INSERT SELECT` query is the `IBlockInputStream` that returns an empty result set on the first read, but that copies data from `SELECT` to `INSERT` at the same time.
`InterpreterSelectQuery` uses `ExpressionAnalyzer` and `ExpressionActions` machinery for query analysis and transformations. This is where most rule-based query optimizations are done. `ExpressionAnalyzer` is quite messy and should be rewritten: various query transformations and optimizations should be extracted into separate classes to allow modular transformations of the query.
## Functions
There are ordinary functions and aggregate functions. For aggregate functions, see the next section.
Ordinary functions don't change the number of rows; they work as if they are processing each row independently. In fact, functions are not called for individual rows, but for `Block`s of data to implement vectorized query execution.
There are some miscellaneous functions, like `blockSize`, `rowNumberInBlock`, and `runningAccumulate`, that exploit block processing and violate the independence of rows.
ClickHouse has strong typing, so implicit type conversion doesn't occur. If a function doesn't support a specific combination of types, an exception will be thrown. But functions can work (be overloaded) for many different combinations of types. For example, the `plus` function (to implement the `+` operator) works for any combination of numeric types: `UInt8` + `Float32`, `UInt16` + `Int8`, and so on. Also, some variadic functions can accept any number of arguments, such as the `concat` function.
Implementing a function may be slightly inconvenient because a function explicitly dispatches supported data types and supported `IColumns`. For example, the `plus` function has code generated by instantiation of a C++ template for each combination of numeric types, and for constant or non-constant left and right arguments.
> This is a nice place to implement runtime code generation to avoid template code bloat. Also, it will make it possible to add fused functions like fused multiply-add, or to make multiple comparisons in one loop iteration.
> Due to vectorized query execution, functions are not short-circuited. For example, if you write `WHERE f(x) AND g(y)`, both sides will be calculated, even for rows where `f(x)` is zero (except when `f(x)` is a zero constant expression). But if the selectivity of the `f(x)` condition is high, and calculation of `f(x)` is much cheaper than `g(y)`, it's better to implement multi-pass calculation: first calculate `f(x)`, then filter columns by the result, and then calculate `g(y)` only for smaller, filtered chunks of data.
## Aggregate Functions
Aggregate functions are stateful functions. They accumulate passed values into some state, and allow you to get results from that state. They are managed with the `IAggregateFunction` interface. States can be rather simple (the state for `AggregateFunctionCount` is just a single `UInt64` value) or quite complex (the state of `AggregateFunctionUniqCombined` is a combination of a linear array, a hash table and a `HyperLogLog` probabilistic data structure).
To deal with multiple states while executing a high-cardinality `GROUP BY` query, states are allocated in `Arena` (a memory pool), or they could be allocated in any suitable piece of memory. States can have a non-trivial constructor and destructor: for example, complex aggregation states can allocate additional memory themselves. This requires some attention to creating and destroying states and properly passing their ownership, to keep track of who and when will destroy states.
Aggregation states can be serialized and deserialized to pass over the network during distributed query execution or to write them on disk where there is not enough RAM. They can even be stored in a table with the `DataTypeAggregateFunction` to allow incremental aggregation of data.
> The serialized data format for aggregate function states is not versioned right now. This is ok if aggregate states are only stored temporarily. But we have the `AggregatingMergeTree` table engine for incremental aggregation, and people are already using it in production. This is why we should add support for backward compatibility when changing the serialized format for any aggregate function in the future.
## Server
The server implements several different interfaces:
- An HTTP interface for any foreign clients.
- A TCP interface for the native ClickHouse client and for cross-server communication during distributed query execution.
- An interface for transferring data for replication.
Internally, it is just a basic multithreaded server without coroutines, fibers, etc. Since the server is not designed to process a high rate of simple queries but is intended to process a relatively low rate of complex queries, each of them can process a vast amount of data for analytics.
The server initializes the `Context` class with the necessary environment for query execution: the list of available databases, users and access rights, settings, clusters, the process list, the query log, and so on. This environment is used by interpreters.
We maintain full backward and forward compatibility for the server TCP protocol: old clients can talk to new servers and new clients can talk to old servers. But we don't want to maintain it eternally, and we are removing support for old versions after about one year.
> For all external applications, we recommend using the HTTP interface because it is simple and easy to use. The TCP protocol is more tightly linked to internal data structures: it uses an internal format for passing blocks of data and it uses custom framing for compressed data. We haven't released a C library for that protocol because it requires linking most of the ClickHouse codebase, which is not practical.
## Distributed query execution
Servers in a cluster setup are mostly independent. You can create a `Distributed` table on one or all servers in a cluster. The `Distributed` table does not store data itself; it only provides a "view" to all local tables on multiple nodes of a cluster. When you SELECT from a `Distributed` table, it rewrites that query, chooses remote nodes according to load balancing settings, and sends the query to them. The `Distributed` table requests remote servers to process a query just up to a stage where intermediate results from different servers can be merged. Then it receives the intermediate results and merges them. The distributed table tries to distribute as much work as possible to remote servers, and does not send much intermediate data over the network.
> Things become more complicated when you have subqueries in IN or JOIN clauses and each of them uses a `Distributed` table. We have different strategies for execution of these queries.
> There is no global query plan for distributed query execution. Each node has its own local query plan for its part of the job. We only have simple one-pass distributed query execution: we send queries for remote nodes and then merge the results. But this is not feasible for difficult queries with high cardinality GROUP BYs or with a large amount of temporary data for JOIN: in such cases, we need to "reshuffle" data between servers, which requires additional coordination. ClickHouse does not support that kind of query execution, and we need to work on it.
## Merge Tree
`MergeTree` is a family of storage engines that supports indexing by primary key. The primary key can be an arbitrary tuple of columns or expressions. Data in a `MergeTree` table is stored in "parts". Each part stores data in the primary key order (data is ordered lexicographically by the primary key tuple). All the table columns are stored in separate `column.bin` files in these parts. The files consist of compressed blocks. Each block is usually from 64 KB to 1 MB of uncompressed data, depending on the average value size. The blocks consist of column values placed contiguously one after the other. Column values are in the same order for each column (the order is defined by the primary key), so when you iterate by many columns, you get values for the corresponding rows.
The primary key itself is "sparse". It doesn't address each single row, but only some ranges of data. A separate `primary.idx` file has the value of the primary key for each N-th row, where N is called `index_granularity` (usually, N = 8192). Also, for each column, we have `column.mrk` files with "marks," which are offsets to each N-th row in the data file. Each mark is a pair: the offset in the file to the beginning of the compressed block, and the offset in the decompressed block to the beginning of data. Usually compressed blocks are aligned by marks, and the offset in the decompressed block is zero. Data for `primary.idx` always resides in memory and data for `column.mrk` files is cached.
When we are going to read something from a part in `MergeTree`, we look at `primary.idx` data and locate ranges that could possibly contain requested data, then look at `column.mrk` data and calculate offsets for where to start reading those ranges. Because of sparseness, excess data may be read. ClickHouse is not suitable for a high load of simple point queries, because the entire range with `index_granularity` rows must be read for each key, and the entire compressed block must be decompressed for each column. We made the index sparse because we must be able to maintain trillions of rows per single server without noticeable memory consumption for the index. Also, because the primary key is sparse, it is not unique: it cannot check the existence of the key in the table at INSERT time. You could have many rows with the same key in a table.
When you `INSERT` a bunch of data into `MergeTree`, that bunch is sorted by primary key order and forms a new part. To keep the number of parts relatively low, there are background threads that periodically select some parts and merge them to a single sorted part. That's why it is called `MergeTree`. Of course, merging leads to "write amplification". All parts are immutable: they are only created and deleted, but not modified. When SELECT is run, it holds a snapshot of the table (a set of parts). After merging, we also keep old parts for some time to make recovery after failure easier, so if we see that some merged part is probably broken, we can replace it with its source parts.
`MergeTree` is not an LSM tree because it doesn't contain "memtable" and "log": inserted data is written directly to the filesystem. This makes it suitable only for inserting data in batches, not by individual rows and not very frequently: about once per second is ok, but a thousand times a second is not. We did it this way for simplicity's sake, and because we are already inserting data in batches in our applications.
> MergeTree tables can only have one (primary) index: there aren't any secondary indices. It would be nice to allow multiple physical representations under one logical table, for example, to store data in more than one physical order or even to allow representations with pre-aggregated data along with original data.
> There are MergeTree engines that are doing additional work during background merges. Examples are `CollapsingMergeTree` and `AggregatingMergeTree`. This could be treated as special support for updates. Keep in mind that these are not real updates because users usually have no control over the time when background merges will be executed, and data in a `MergeTree` table is almost always stored in more than one part, not in completely merged form.
## Replication
Replication in ClickHouse is implemented on a per-table basis. You could have some replicated and some non-replicated tables on the same server. You could also have tables replicated in different ways, such as one table with two-factor replication and another with three-factor.
Replication is implemented in the `ReplicatedMergeTree` storage engine. The path in `ZooKeeper` is specified as a parameter for the storage engine. All tables with the same path in `ZooKeeper` become replicas of each other: they synchronise their data and maintain consistency. Replicas can be added and removed dynamically simply by creating or dropping a table.
Replication uses an asynchronous multi-master scheme. You can insert data into any replica that has a session with `ZooKeeper`, and data is replicated to all other replicas asynchronously. Because ClickHouse doesn't support UPDATEs, replication is conflict-free. As there is no quorum acknowledgment of inserts, just-inserted data might be lost if one node fails.
Metadata for replication is stored in ZooKeeper. There is a replication log that lists what actions to do. Actions are: get part; merge parts; drop partition, etc. Each replica copies the replication log to its queue and then executes the actions from the queue. For example, on insertion, the "get part" action is created in the log, and every replica downloads that part. Merges are coordinated between replicas to get byte-identical results. All parts are merged in the same way on all replicas. To achieve this, one replica is elected as the leader, and that replica initiates merges and writes "merge parts" actions to the log.
Replication is physical: only compressed parts are transferred between nodes, not queries. To lower the network cost (to avoid network amplification), merges are processed on each replica independently in most cases. Large merged parts are sent over the network only in cases of significant replication lag.
In addition, each replica stores its state in ZooKeeper as the set of parts and its checksums. When the state on the local filesystem diverges from the reference state in ZooKeeper, the replica restores its consistency by downloading missing and broken parts from other replicas. When there is some unexpected or broken data in the local filesystem, ClickHouse does not remove it, but moves it to a separate directory and forgets it.
> The ClickHouse cluster consists of independent shards, and each shard consists of replicas. The cluster is not elastic, so after adding a new shard, data is not rebalanced between shards automatically. Instead, the cluster load will be uneven. This implementation gives you more control, and it is fine for relatively small clusters such as tens of nodes. But for clusters with hundreds of nodes that we are using in production, this approach becomes a significant drawback. We should implement a table engine that will span its data across the cluster with dynamically replicated regions that could be split and balanced between clusters automatically.


@@ -1,32 +0,0 @@
# How to run tests
The **clickhouse-test** utility that is used for functional testing is written using Python 2.x.
It also requires you to have some third-party packages:
```
$ pip install lxml termcolor
```
## In a nutshell:
- Put the `clickhouse` program to `/usr/bin` (or `PATH`)
- Create a `clickhouse-client` symlink in `/usr/bin` pointing to `clickhouse`
- Start the `clickhouse` server
- `cd dbms/tests/`
- Run `./clickhouse-test`
## Example usage
Run `./clickhouse-test --help` to see available options.
To run tests without having to create a symlink or mess with `PATH`:
```
./clickhouse-test -c "../../build/dbms/src/Server/clickhouse --client"
```
To run a single test, e.g. `00395_nullable`:
```
./clickhouse-test 00395
```


@@ -1,268 +0,0 @@
This benchmark was created by Vadim Tkachenko, see:
https://www.percona.com/blog/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/
https://www.percona.com/blog/2009/10/26/air-traffic-queries-in-luciddb/
https://www.percona.com/blog/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/
https://www.percona.com/blog/2014/04/21/using-apache-hadoop-and-impala-together-with-mysql-for-data-analysis/
https://www.percona.com/blog/2016/01/07/apache-spark-with-air-ontime-performance-data/
http://nickmakos.blogspot.ru/2012/08/analyzing-air-traffic-performance-with.html
1. https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh
for s in `seq 1987 2017`
do
for m in `seq 1 12`
do
wget http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
2.
CREATE TABLE `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
3.
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --host=example-perftest01j --query="INSERT INTO ontime FORMAT CSVWithNames"; done
4.
SELECT count() FROM ontime
Q0. select avg(c1) from (select Year, Month, count(*) as c1 from ontime group by Year, Month);
Q1. The number of flights per day of week from 2000 to 2008
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q2. The number of flights delayed by more than 10 minutes, grouped by day of week, for 2000-2008
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC
Q3. The number of delays by airport for 2000-2008
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10
Q4. The number of delays by carrier for 2007
SELECT Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC
Q5. The percentage of flights delayed by more than 10 minutes, for each carrier, for 2007.
SELECT Carrier, c, c2, c*1000/c2 as c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year=2007
GROUP BY Carrier
)
ANY INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE Year=2007
GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;
A better version of the same query:
SELECT Carrier, avg(DepDelay > 10) * 1000 AS c3 FROM ontime WHERE Year = 2007 GROUP BY Carrier ORDER BY Carrier
Q6. The same query for a broader range of years, 2000-2008.
SELECT Carrier, c, c2, c*1000/c2 as c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year >= 2000 AND Year <= 2008
GROUP BY Carrier
)
ANY INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE Year >= 2000 AND Year <= 2008
GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;
A better version of the same query:
SELECT Carrier, avg(DepDelay > 10) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier ORDER BY Carrier
Q7. The percentage of flights delayed by more than 10 minutes, per year.
SELECT Year, c1/c2
FROM
(
select
Year,
count(*)*1000 as c1
from ontime
WHERE DepDelay>10
GROUP BY Year
)
ANY INNER JOIN
(
select
Year,
count(*) as c2
from ontime
GROUP BY Year
) USING (Year)
ORDER BY Year
A better version of the same query:
SELECT Year, avg(DepDelay > 10) FROM ontime GROUP BY Year ORDER BY Year
Q8. The most popular destinations by the number of directly connected cities, for various ranges of years.
SELECT DestCityName, uniqExact(OriginCityName) AS u FROM ontime WHERE Year >= 2000 and Year <= 2010 GROUP BY DestCityName ORDER BY u DESC LIMIT 10;
Q9. select Year, count(*) as c1 from ontime group by Year;
Q10.
select
min(Year), max(Year), Carrier, count(*) as cnt,
sum(ArrDelayMinutes>30) as flights_delayed,
round(sum(ArrDelayMinutes>30)/count(*),2) as rate
FROM ontime
WHERE
DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI')
and DestState not in ('AK', 'HI', 'PR', 'VI')
and FlightDate < '2010-01-01'
GROUP by Carrier
HAVING cnt > 100000 and max(Year) > 1990
ORDER by rate DESC
LIMIT 1000;
Bonus:
SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month)
select avg(c1) from (select Year,Month,count(*) as c1 from ontime group by Year,Month)
SELECT DestCityName, uniqExact(OriginCityName) AS u FROM ontime GROUP BY DestCityName ORDER BY u DESC LIMIT 10;
SELECT OriginCityName, DestCityName, count() AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
SELECT OriginCityName, count() AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;


@ -1,120 +0,0 @@
AMPLab Big Data Benchmark: https://amplab.cs.berkeley.edu/benchmark/
1. Register a free account at aws.amazon.com. You will need a credit card, an e-mail address, and a phone number.
2. Create a new access key at https://console.aws.amazon.com/iam/home?nc2=h_m_sc#security_credential
3. sudo apt-get install s3cmd
4.
mkdir tiny; cd tiny;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/tiny/ .
cd ..
mkdir 1node; cd 1node;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/1node/ .
cd ..
mkdir 5nodes; cd 5nodes;
s3cmd sync s3://big-data-benchmark/pavlo/text-deflate/5nodes/ .
cd ..
5.
CREATE TABLE rankings_tiny
(
pageURL String,
pageRank UInt32,
avgDuration UInt32
) ENGINE = Log;
CREATE TABLE uservisits_tiny
(
sourceIP String,
destinationURL String,
visitDate Date,
adRevenue Float32,
UserAgent String,
cCode FixedString(3),
lCode FixedString(6),
searchWord String,
duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);
CREATE TABLE rankings_1node
(
pageURL String,
pageRank UInt32,
avgDuration UInt32
) ENGINE = Log;
CREATE TABLE uservisits_1node
(
sourceIP String,
destinationURL String,
visitDate Date,
adRevenue Float32,
UserAgent String,
cCode FixedString(3),
lCode FixedString(6),
searchWord String,
duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);
CREATE TABLE rankings_5nodes_on_single
(
pageURL String,
pageRank UInt32,
avgDuration UInt32
) ENGINE = Log;
CREATE TABLE uservisits_5nodes_on_single
(
sourceIP String,
destinationURL String,
visitDate Date,
adRevenue Float32,
UserAgent String,
cCode FixedString(3),
lCode FixedString(6),
searchWord String,
duration UInt32
) ENGINE = MergeTree(visitDate, visitDate, 8192);
6.
for i in tiny/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_tiny FORMAT CSV"; done
for i in tiny/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_tiny FORMAT CSV"; done
for i in 1node/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_1node FORMAT CSV"; done
for i in 1node/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_1node FORMAT CSV"; done
for i in 5nodes/rankings/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO rankings_5nodes_on_single FORMAT CSV"; done
for i in 5nodes/uservisits/*.deflate; do echo $i; zlib-flate -uncompress < $i | clickhouse-client --host=example-perftest01j --query="INSERT INTO uservisits_5nodes_on_single FORMAT CSV"; done
7.
SELECT pageURL, pageRank FROM rankings_1node WHERE pageRank > 1000
SELECT substring(sourceIP, 1, 8), sum(adRevenue) FROM uservisits_1node GROUP BY substring(sourceIP, 1, 8)
SELECT
sourceIP,
sum(adRevenue) AS totalRevenue,
avg(pageRank) AS pageRank
FROM rankings_1node ALL INNER JOIN
(
SELECT
sourceIP,
destinationURL AS pageURL,
adRevenue
FROM uservisits_1node
WHERE (visitDate > '1980-01-01') AND (visitDate < '1980-04-01')
) USING pageURL
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1


@ -1,61 +0,0 @@
1. Download all data from http://labs.criteo.com/downloads/download-terabyte-click-logs/
2.
CREATE TABLE criteo_log (date Date, clicked UInt8, int1 Int32, int2 Int32, int3 Int32, int4 Int32, int5 Int32, int6 Int32, int7 Int32, int8 Int32, int9 Int32, int10 Int32, int11 Int32, int12 Int32, int13 Int32, cat1 String, cat2 String, cat3 String, cat4 String, cat5 String, cat6 String, cat7 String, cat8 String, cat9 String, cat10 String, cat11 String, cat12 String, cat13 String, cat14 String, cat15 String, cat16 String, cat17 String, cat18 String, cat19 String, cat20 String, cat21 String, cat22 String, cat23 String, cat24 String, cat25 String, cat26 String) ENGINE = Log
3.
for i in {00..23}; do echo $i; zcat datasets/criteo/day_${i#0}.gz | sed -r 's/^/2000-01-'${i/00/24}'\t/' | clickhouse-client --host=example-perftest01j --query="INSERT INTO criteo_log FORMAT TabSeparated"; done
4.
CREATE TABLE criteo
(
date Date,
clicked UInt8,
int1 Int32,
int2 Int32,
int3 Int32,
int4 Int32,
int5 Int32,
int6 Int32,
int7 Int32,
int8 Int32,
int9 Int32,
int10 Int32,
int11 Int32,
int12 Int32,
int13 Int32,
icat1 UInt32,
icat2 UInt32,
icat3 UInt32,
icat4 UInt32,
icat5 UInt32,
icat6 UInt32,
icat7 UInt32,
icat8 UInt32,
icat9 UInt32,
icat10 UInt32,
icat11 UInt32,
icat12 UInt32,
icat13 UInt32,
icat14 UInt32,
icat15 UInt32,
icat16 UInt32,
icat17 UInt32,
icat18 UInt32,
icat19 UInt32,
icat20 UInt32,
icat21 UInt32,
icat22 UInt32,
icat23 UInt32,
icat24 UInt32,
icat25 UInt32,
icat26 UInt32
) ENGINE = MergeTree(date, intHash32(icat1), (date, intHash32(icat1)), 8192)
5.
INSERT INTO criteo SELECT date, clicked, int1, int2, int3, int4, int5, int6, int7, int8, int9, int10, int11, int12, int13, reinterpretAsUInt32(unhex(cat1)) AS icat1, reinterpretAsUInt32(unhex(cat2)) AS icat2, reinterpretAsUInt32(unhex(cat3)) AS icat3, reinterpretAsUInt32(unhex(cat4)) AS icat4, reinterpretAsUInt32(unhex(cat5)) AS icat5, reinterpretAsUInt32(unhex(cat6)) AS icat6, reinterpretAsUInt32(unhex(cat7)) AS icat7, reinterpretAsUInt32(unhex(cat8)) AS icat8, reinterpretAsUInt32(unhex(cat9)) AS icat9, reinterpretAsUInt32(unhex(cat10)) AS icat10, reinterpretAsUInt32(unhex(cat11)) AS icat11, reinterpretAsUInt32(unhex(cat12)) AS icat12, reinterpretAsUInt32(unhex(cat13)) AS icat13, reinterpretAsUInt32(unhex(cat14)) AS icat14, reinterpretAsUInt32(unhex(cat15)) AS icat15, reinterpretAsUInt32(unhex(cat16)) AS icat16, reinterpretAsUInt32(unhex(cat17)) AS icat17, reinterpretAsUInt32(unhex(cat18)) AS icat18, reinterpretAsUInt32(unhex(cat19)) AS icat19, reinterpretAsUInt32(unhex(cat20)) AS icat20, reinterpretAsUInt32(unhex(cat21)) AS icat21, reinterpretAsUInt32(unhex(cat22)) AS icat22, reinterpretAsUInt32(unhex(cat23)) AS icat23, reinterpretAsUInt32(unhex(cat24)) AS icat24, reinterpretAsUInt32(unhex(cat25)) AS icat25, reinterpretAsUInt32(unhex(cat26)) AS icat26 FROM criteo_log;
DROP TABLE criteo_log;

File diff suppressed because one or more lines are too long


@ -1,29 +0,0 @@
Compile dbgen: https://github.com/vadimtk/ssb-dbgen
```
git clone git@github.com:vadimtk/ssb-dbgen.git
cd ssb-dbgen
make
```
You will see some warnings during the build; this is OK.
Place `dbgen` and `dists.dss` somewhere with at least 800 GB of free disk space.
Generate data:
```
./dbgen -s 1000 -T c
./dbgen -s 1000 -T l
```
Create tables in ClickHouse: https://github.com/alexey-milovidov/ssb-clickhouse/blob/cc8fd4d9b99859d12a6aaf46b5f1195c7a1034f9/create.sql
For a single-node setup, create just the MergeTree tables.
For a distributed setup, you must configure the cluster `perftest_3shards_1replicas` in the configuration file.
Then create the MergeTree tables on each node and create Distributed tables on top of them, as in the sketch below.
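This is only a sketch of what the Distributed tables might look like (the `customerd`/`lineorderd` names and the `default` database are assumptions following the naming used below, not prescribed by this guide):
```
CREATE TABLE customerd AS customer
    ENGINE = Distributed(perftest_3shards_1replicas, default, customer);
CREATE TABLE lineorderd AS lineorder
    ENGINE = Distributed(perftest_3shards_1replicas, default, lineorder);
```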
Load the data (for the distributed setup, replace `customer` with `customerd`):
```
cat customer.tbl | sed 's/$/2000-01-01/' | clickhouse-client --query "INSERT INTO customer FORMAT CSV"
cat lineorder.tbl | clickhouse-client --query "INSERT INTO lineorder FORMAT CSV"
```


@ -1,18 +0,0 @@
http://dumps.wikimedia.org/other/pagecounts-raw/
for i in {2007..2016}; do for j in {01..12}; do echo $i-$j >&2; curl -sS "http://dumps.wikimedia.org/other/pagecounts-raw/$i/$i-$j/" | grep -oE 'pagecounts-[0-9]+-[0-9]+\.gz'; done; done | sort | uniq | tee links.txt
cat links.txt | while read link; do wget http://dumps.wikimedia.org/other/pagecounts-raw/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1/')/$(echo $link | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})[0-9]{2}-[0-9]+\.gz/\1-\2/')/$link; done
CREATE TABLE wikistat
(
date Date,
time DateTime,
project String,
subproject String,
path String,
hits UInt64,
size UInt64
) ENGINE = MergeTree(date, (path, time), 8192);
ls -1 /opt/wikistat/ | grep gz | while read i; do echo $i; gzip -cd /opt/wikistat/$i | ./wikistat-loader --time="$(echo -n $i | sed -r 's/pagecounts-([0-9]{4})([0-9]{2})([0-9]{2})-([0-9]{2})([0-9]{2})([0-9]{2})\.gz/\1-\2-\3 \4-00-00/')" | clickhouse-client --query="INSERT INTO wikistat FORMAT TabSeparated"; done


@ -1,266 +0,0 @@
## Introduction
To start with, we need a machine. For example, let's create a virtual instance in OpenStack with the following characteristics:
```
RAM 61GB
VCPUs 16 VCPU
Disk 40GB
Ephemeral Disk 100GB
```
OS:
```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
```
We will work with open data from the On Time database, provided by the United States Department of Transportation. Information about it, the structure of the table, and example queries can be found here:
```
https://github.com/yandex/ClickHouse/blob/master/doc/example_datasets/1_ontime.txt
```
## Building
To build ClickHouse, we will follow the instructions located here:
```
https://github.com/yandex/ClickHouse/blob/master/doc/build.md
```
Install the required packages. After that, let's run the following command from the directory with the ClickHouse source code:
```
~/ClickHouse$ ./release
```
The build completes successfully:
![](images/build_completed.png)
Install the packages and start ClickHouse:
```
sudo apt-get install ../clickhouse-server-base_1.1.53960_amd64.deb ../clickhouse-server-common_1.1.53960_amd64.deb
sudo apt-get install ../clickhouse-client_1.1.53960_amd64.deb
sudo service clickhouse-server start
```
## Creating the table
Before loading the data into ClickHouse, let's start the ClickHouse console client in order to create a table with the necessary fields:
```
$ clickhouse-client
```
The table is created with the following query:
```
:) create table `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
```
![](images/table_created.png)
Information about the table can be obtained with the following queries:
```
:) desc ontime
```
```
:) show create ontime
```
## Loading the data
Let's download the data:
```
for s in `seq 1987 2015`; do
for m in `seq 1 12`; do
wget http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
```
After that, load the data into ClickHouse:
```
for i in *.zip; do
echo $i
unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --query="insert into ontime format CSVWithNames"
done
```
## Working with data
Let's check if the table contains something:
```
:) select FlightDate, FlightNum, OriginCityName, DestCityName from ontime limit 10;
```
![](images/something.png)
Now let's craft a more complicated query: for example, output the fraction of flights delayed by more than 10 minutes, for each year:
```
select Year, c1/c2
from
(
select
Year,
count(*)*100 as c1
from ontime
where DepDelay > 10
group by Year
)
any inner join
(
select
Year,
count(*) as c2
from ontime
group by Year
) using (Year)
order by Year;
```
![](images/complicated.png)
## Additionally
### Copying a table
Suppose we need to copy 1% of the records (the luckiest ones) from the ```ontime``` table to a new table named ```ontime_ltd```.
To do that, run the following queries:
```
:) create table ontime_ltd as ontime;
:) insert into ontime_ltd select * from ontime where rand() % 100 = 42;
```
### Working with multiple tables
If you need to query data from multiple tables at once, use the ```merge(database, regexp)``` table function:
```
:) select (select count(*) from merge(default, 'ontime.*'))/(select count() from ontime);
```
![](images/1_percent.png)
### List of running queries
For diagnostics, one usually needs to know what exactly ClickHouse is doing at the moment. Let's execute a query that will run for a long time:
```
:) select sleep(1000);
```
If we then run ```clickhouse-client``` in another terminal, we can output the list of running queries and some useful additional information about them:
```
:) show processlist;
```
![](images/long_query.png)


@ -1,267 +0,0 @@
## Introduction
To start with, we need a machine. For example, let's create a virtual instance in OpenStack with the following characteristics:
```
RAM 61GB
VCPUs 16 VCPU
Disk 40GB
Ephemeral Disk 100GB
```
OS:
```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
```
We will work with open data from the On Time database, provided by the United States Department of Transportation. Information about it, the structure of the table, and example queries can be found here:
```
https://github.com/yandex/ClickHouse/blob/master/doc/example_datasets/1_ontime.txt
```
## Building
To build ClickHouse, we will follow the instructions located here:
```
https://github.com/yandex/ClickHouse/blob/master/doc/build.md
```
Install the required packages. After that, run the following command from the directory with the ClickHouse source code:
```
~/ClickHouse$ ./release
```
The build completes successfully:
![](images/build_completed.png)
Install the packages and start ClickHouse:
```
sudo apt-get install ../clickhouse-server-base_1.1.53960_amd64.deb ../clickhouse-server-common_1.1.53960_amd64.deb
sudo apt-get install ../clickhouse-client_1.1.53960_amd64.deb
sudo service clickhouse-server start
```
## Creating the table
Before loading the On Time data into ClickHouse, let's start the ClickHouse console client in order to create a table with the necessary fields:
```
$ clickhouse-client
```
The table is created with the following query:
```
:) create table `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
```
![](images/table_created.png)
Information about the table can be obtained with the following queries:
```
:) desc ontime
```
```
:) show create ontime
```
## Loading the data
Let's download the data:
```
for s in `seq 1987 2015`; do
for m in `seq 1 12`; do
wget http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
```
Now we need to load the data into ClickHouse:
```
for i in *.zip; do
echo $i
unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --query="insert into ontime format CSVWithNames"
done
```
## Working with data
Let's check that the table contains something:
```
:) select FlightDate, FlightNum, OriginCityName, DestCityName from ontime limit 10;
```
![](images/something.png)
Now let's come up with a more complicated query: for example, output the fraction of flights delayed by more than 10 minutes, for each year:
```
select Year, c1/c2
from
(
select
Year,
count(*)*100 as c1
from ontime
where DepDelay > 10
group by Year
)
any inner join
(
select
Year,
count(*) as c2
from ontime
group by Year
) using (Year)
order by Year;
```
![](images/complicated.png)
## Additionally
### Copying a table
Suppose we need to copy 1% of the records (the luckiest ones) from the ```ontime``` table to a new table named ```ontime_ltd```. To do that, run the following queries:
```
:) create table ontime_ltd as ontime;
:) insert into ontime_ltd select * from ontime where rand() % 100 = 42;
```
### Working with multiple tables
If you need to query data from multiple tables at once, use the ```merge(database, regexp)``` table function:
```
:) select (select count(*) from merge(default, 'ontime.*'))/(select count() from ontime);
```
![](images/1_percent.png)
### List of running queries
For diagnostics, one often needs to know what exactly ClickHouse is doing at the moment. Let's run a query that will take a very long time:
```
:) select sleep(1000);
```
If we now run ```clickhouse-client``` in another terminal, we can output the list of running queries and some useful additional information about them:
```
:) show processlist;
```
![](images/long_query.png)

Binary files not shown (10 images).


@ -1 +0,0 @@
databases, data structures, web analytics, yandex, yandex.metrica, big data, columnar database, olap


@ -1,262 +0,0 @@
Yandex.Metrica is a web analytics system. There are two Yandex.Metricas: <a href='https://metrika.yandex.ru/'>Metrica for websites</a> and <a href='https://appmetrika.yandex.ru/'>Metrica for apps</a>. The input is a stream of data: events that happen on websites or in apps. Our task is to process this data and present it in a form suitable for analysis.
Processing the data is not the problem. The problem is how, and in what form, to store the results of that processing so that they are convenient to work with. The output of Yandex.Metrica is, as a rule, reports that are displayed in the Metrica interface. So the question is how to store data for reports.
<img src="https://habrastorage.org/files/e4e/c0f/8bc/e4ec0f8bc416435ba78565e68ca85b5f.png"/>
Yandex.Metrica has been running since 2008, more than seven years. Over that time we had to change our approach to data storage several times. Each time this happened because a particular solution worked too poorly: with too little performance headroom, not reliably enough and with too many operational problems, using too many computing resources, or simply not letting us implement what we wanted.
<cut>
Let's look at what end result we need to store the data for.
The old Metrica for websites has about 40 "fixed" reports (for example, the visitor geography report), several tools for in-page analytics (for example, the click map), Webvisor (which lets you study the actions of individual visitors in great detail), and, separately, the report builder.
In the new Metrica, as well as in Metrica for mobile apps, instead of "fixed" reports every report can be modified arbitrarily: you can add new dimensions (for example, add a breakdown by landing page to the search phrases report), segment and compare (for example, compare traffic sources for all visitors versus visitors from Moscow), change the set of metrics, and so on.
Naturally, this requires completely different approaches to data storage. We will look at them now.
<h1>MyISAM</h1>
In the very beginning, Metrica was created as part of Yandex.Direct. Direct used MyISAM tables to store statistics, and we started out with the same. We used MyISAM for storing "fixed" reports from 2008 to 2011. Let's look at what the table structure for a report should be, using the geography report as an example.
A report is shown for a specific website (more precisely, a Metrica counter number), so the primary key must include the counter number, CounterID. The user can choose an arbitrary reporting period (a date interval). It would be unreasonable to store data for every pair of dates; instead, data is stored per date and summed over the requested interval at query time. So the primary key includes the date, Date.
In the report, data is displayed for regions, either as a tree of countries, areas, cities and so on, or as a list, for example a list of cities. It makes sense to put the region identifier (RegionID) into the primary key of the table and assemble the tree on the application side rather than in the database.
Among the metrics there is, for example, the average visit duration, so the table columns must contain the number of visits and the total visit duration.
As a result, the table structure is: CounterID, Date, RegionID -> Visits, SumVisitTime, …
Let's look at what happens when we want to get a report. A SELECT query is executed with the condition <code>WHERE CounterID = c AND Date BETWEEN min_date AND max_date</code>, that is, a range read over the primary key.
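A minimal sketch of the kind of per-report table described above (the table and column names are illustrative, not the actual Metrica schema):
```
CREATE TABLE report_geography
(
    CounterID INT UNSIGNED NOT NULL,
    `Date` DATE NOT NULL,
    RegionID INT UNSIGNED NOT NULL,
    Visits BIGINT UNSIGNED NOT NULL,
    SumVisitTime BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (CounterID, `Date`, RegionID)
) ENGINE = MyISAM;

-- Fetching the report is a range read over the primary key:
SELECT RegionID, SUM(Visits) AS Visits, SUM(SumVisitTime) / SUM(Visits) AS AvgVisitTime
FROM report_geography
WHERE CounterID = 1234 AND `Date` BETWEEN '2011-01-01' AND '2011-01-31'
GROUP BY RegionID;
```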
<strong>How is the data actually stored on disk?</strong>
A MyISAM table consists of a data file and an index file. If nothing was deleted from the table and the rows did not change their length on update, the data file is just serialized rows laid out one after another in insertion order. An index (including the primary key) is a B-tree whose leaves contain offsets into the data file.
When we read a range from the index, a set of offsets into the data file is extracted from the index, and reads are then performed on the data file at those offsets.
Assume the natural situation where the index is in RAM (the key cache in MySQL, or the system page cache) while the data is not cached in RAM, and assume we are using hard disks. The time needed to read the data depends on how much data has to be read and how many seeks are required. The number of seeks is determined by how locally the data is laid out on disk.
Events arrive in Metrica in an order that almost matches event time. In this incoming stream, the data of different counters is scattered completely arbitrarily: the incoming data is local in time, but not local by counter number. Consequently, when writing to a MyISAM table, the data of different counters is also laid out completely randomly.
This means that reading the data for a report requires roughly as many random reads as there are rows we need in the table. A typical 7200 RPM hard disk can do 100 to 200 random reads per second; a RAID array, used properly, proportionally more. A five-year-old SSD can do 30,000 random reads per second, but we cannot afford to keep our data on SSDs.
So if our report needs to read 10,000 rows, it will hardly take less than 10 seconds, which is completely unacceptable.
InnoDB is better suited to reads over a primary key range, because InnoDB uses a <a href='http://en.wikipedia.org/wiki/Database_index#Clustered'>clustered primary key</a> (that is, the data is stored ordered by the primary key). But InnoDB was impossible to use because of its low write speed. If reading this made you think of <a href='http://www.tokutek.com/tokudb-for-mysql/'>TokuDB</a>, keep reading.
To make MyISAM faster for primary key range selects, we used a few tricks.
Sorting the table. Since the data has to be updated incrementally, sorting the table once is not enough, and sorting it every time is impossible. Still, this can be done periodically for old data.
Partitioning. The table is split into a number of smaller tables by primary key ranges. The hope is that the data of one partition will be stored more or less locally, and range queries over the primary key will be faster. This method could be called a "manual clustered primary key". It slows down inserts somewhat; by choosing the number of partitions, one can usually find a compromise between insert and read speed.
Splitting the data into generations. With one partitioning scheme reads may be too slow, with another inserts, and with an intermediate one, both. In that case the data can be split into two or more generations. For example, call the first generation operational data: it is partitioned in insertion order (by time), or not partitioned at all. Call the second generation archive data: it is partitioned in read order (by counter number). Data is moved from one generation to the next by a script, not too often (say, once a day), and is read from all generations at once. This usually helps, but adds quite a lot of complexity.
All of these tricks, and some others, were used in Yandex.Metrica at some point in the past to keep everything working at least somehow.
To summarize the drawbacks:
<ul>
<li>it is very hard to maintain data locality on disk;</li>
<li>the table is locked while data is being written;</li>
<li>replication is slow, and replicas often lag behind;</li>
<li>data consistency after a crash is not guaranteed;</li>
<li>complex aggregates, such as the number of unique visitors, are very hard to compute and store;</li>
<li>data compression is hard to apply and works inefficiently;</li>
<li>indexes take up a lot of space and often do not fit in RAM;</li>
<li>the data has to be sharded manually;</li>
<li>a lot of computation has to be done in application code, after the SELECT;</li>
<li>operational complexity.</li>
</ul>
<img src="https://habrastorage.org/files/7a9/757/7e3/7a97577e332d44ed8fab1b2613dd647e.png"/>
<em>Data locality on disk, an artist's impression</em>
Overall, using MyISAM was extremely inconvenient. During the day, the servers ran with their disk arrays at 100% load (constant head movement). Under such load, disks fail more often than usual. We used disk shelves (16 disks) on the servers, so RAID arrays had to be rebuilt rather often. Meanwhile, replication lagged even more, and sometimes a replica had to be rebuilt from scratch. Switching the master was extremely inconvenient. To choose the replica to send queries to, we used MySQL Proxy, and that turned out to be quite unfortunate (we later replaced it with HAProxy).
Despite these drawbacks, as of 2011 we stored more than 580 billion rows in MyISAM tables. Then everything was converted to Metrage and deleted, which eventually freed up a lot of servers.
<h1>Metrage</h1>
The name Metrage comes from the words "Metrica" and "aggregated data". We have been using Metrage to store fixed reports from 2010 to the present.
Suppose we have the following workload:
<ul>
<li>data is constantly written to the database in small batches;</li>
<li>the write stream is comparatively large: several hundred thousand rows per second;</li>
<li>read queries are comparatively rare: tens to hundreds of queries per second;</li>
<li>all reads are over a primary key range, up to millions of rows per query;</li>
<li>rows are fairly short: about 100 bytes uncompressed.</li>
</ul>
The <a href='http://en.wikipedia.org/wiki/Log-structured_merge-tree'>LSM-Tree</a> data structure suits this workload well.
An LSM-Tree is a comparatively small set of data "parts" on disk. Each part contains data sorted by primary key. New data first goes into some in-memory data structure (a MemTable) and is then written to disk as a new sorted part. Periodically, in the background, several sorted parts are merged into one larger sorted part (compaction). So a comparatively small set of parts is maintained at all times.
The LSM-Tree is a fairly common data structure.
Among embeddable libraries, LSM-Trees are implemented by <a href='https://github.com/google/leveldb'>LevelDB</a> and <a href='http://rocksdb.org/'>RocksDB</a>.
LSM-Trees are used in <a href='http://hbase.apache.org/'>HBase</a> and <a href='http://cassandra.apache.org/'>Cassandra</a>.
<img src="https://habrastorage.org/files/826/43d/3f9/82643d3f92324e2498f43a82d2200485.png"/>
<em>How an LSM-Tree works</em>
Metrage is also an LSM-Tree. Arbitrary data structures (fixed at compile time) can be used as "rows" in Metrage. Each row is a key-value pair. A key is a structure with equality and inequality comparison operations. A value is an arbitrary structure with update (add something) and merge (aggregate, that is, combine with another value) operations. In short, the values are <a href='http://en.wikipedia.org/wiki/Conflict-free_replicated_data_type'>CRDTs</a>.
Values can be simple structures (a tuple of numbers) or more complex ones (a hash table for computing the number of unique visitors, a structure for the click map).
Using the update and merge operations, incremental aggregation is performed continuously:
<ul>
<li>when inserting data, while a new batch is being formed in RAM;</li>
<li>during background merges;</li>
<li>during read queries.</li>
</ul>
Metrage also contains the domain-specific logic we need, which runs at query time. For example, for the regions report, the key in the table contains the identifier of the lowest-level region (city, town), and if we need a report by country, the data is rolled up to countries on the database server side.
The advantages of this data structure are as follows.
Data is located fairly locally on the hard disk, so range reads are fast.
Data is compressed in blocks. Because it is stored in sorted order, compression is quite strong even with fast compression algorithms (in 2010 we used <a href='http://www.quicklz.com/'>QuickLZ</a>; since 2011 we have used <a href='http://fastcompression.blogspot.ru/p/lz4.html'>LZ4</a>).
Storing data in sorted order makes it possible to use a sparse index. A sparse index is an array of primary key values for every Nth row, with N on the order of thousands. Such an index is as compact as possible and always fits in RAM.
Since reads are not very frequent, but each read touches quite a few rows, the extra latency from having many parts, from decompressing a data block, and from reading extra rows due to the sparseness of the index does not matter.
Written data parts are never modified. This allows reading and writing without locks: a snapshot of the data is taken for reading.
The code is simple and uniform, and at the same time we can easily implement all the domain-specific logic we need.
We had to write Metrage rather than adapt an existing solution because there was no existing solution. LevelDB, for example, did not exist in 2010, and TokuDB was only available commercially at the time. All existing systems implementing LSM-Trees were suited to storing unstructured data, BLOB -> BLOB mappings with small variations. Adapting such a system to work with arbitrary CRDTs would have taken far longer than developing Metrage.
Converting the data from MySQL to Metrage was quite labor-intensive: the pure running time of the conversion program was only about a week, but the bulk of the work took two months to finish. After moving the reports to Metrage, we immediately saw the Metrica interface speed up. The 90th percentile of load time for the page titles report dropped from 26 seconds to 0.8 seconds (total time, including all database queries and subsequent data transformations). Query processing time in Metrage itself (across all reports) is: median 6 ms, 90th percentile 31 ms, 99th percentile 334 ms.
Over five years of operation, Metrage has proven to be a reliable, trouble-free solution; there have been only a few minor failures in all that time. Its advantages in efficiency and simplicity of use, compared with storing data in MyISAM, are dramatic.
We currently store 3.37 trillion rows in Metrage, on 39 * 2 servers. We are gradually moving away from Metrage and have already deleted several of the largest tables. Metrage's drawback is that it works efficiently only for fixed reports: it aggregates data and stores aggregated data, and to do that, all the ways in which we want to aggregate the data have to be enumerated in advance. If we aggregate data in 40 different ways, Metrica will have 40 reports and no more.
<h1>OLAPServer</h1>
In Yandex.Metrica, the data volume and the load are large enough that the main problem is to build a solution that works at all: one that solves the task and copes with the load within a reasonable amount of computing resources. So most of the effort often goes into getting a minimal working prototype.
One such prototype was OLAPServer. We used OLAPServer from 2009 to 2013 as the data structure behind the report builder.
The task is to produce arbitrary reports whose structure becomes known only at the moment the user requests the report. Pre-aggregated data cannot be used here, because it is impossible to anticipate in advance all the combinations of dimensions, metrics, and conditions. So non-aggregated data has to be stored. For example, for reports computed from visits, we need a table in which every visit corresponds to a row and every parameter a report can be built on corresponds to a column.
The workload is as follows:
<ul>
<li>there is a wide "<a href='http://en.wikipedia.org/wiki/Fact_table'>fact table</a>" containing a large number of columns (hundreds);</li>
<li>reads pull a fairly large number of rows from the database, but only a small subset of the columns;</li>
<li>read queries are comparatively rare (usually no more than a hundred per second per server);</li>
<li>for simple queries, latencies of around 50 ms are acceptable;</li>
<li>column values are fairly small: numbers and short strings (for example, 60 bytes per URL);</li>
<li>high throughput is required when processing a single query (up to billions of rows per second per server);</li>
<li>the query result is substantially smaller than the source data, i.e. the data is filtered or aggregated;</li>
<li>the data update scenario is comparatively simple, usually append-only batches; there are no complex transactions.</li>
</ul>
For this workload (let's call it the <a href='http://en.wikipedia.org/wiki/Online_analytical_processing'>OLAP</a> workload), <a href='http://en.wikipedia.org/wiki/Column-oriented_DBMS'>column-oriented DBMSs</a> are the best fit. These are DBMSs in which the data of each column is stored separately, and the data of one column is stored together.
Column-oriented DBMSs work well for the OLAP workload for the following reasons.
1. I/O.
1.1. An analytical query only needs to read a small number of the table's columns. In a column-oriented database you can read only the data you need. For example, if you need only 5 columns out of 100, you can expect a 20x reduction in I/O.
1.2. Since data is read in batches, it is easier to compress, and column-wise data also compresses better. This further reduces the amount of I/O.
1.3. Thanks to the reduced I/O, more data fits into the system cache.
For example, the query "count the number of records for each advertising network" requires reading one column, the advertising network ID, which takes 1 byte uncompressed. If most visits did not come from advertising networks, you can count on at least tenfold compression of this column. Using a fast compression algorithm, data can be decompressed at a rate of more than several gigabytes of uncompressed data per second. In other words, such a query can run at a speed of roughly several billion rows per second on a single server.
2. CPU.
Since executing a query requires processing a fairly large number of rows, it pays to dispatch all operations not for individual rows but for whole vectors (for example, the vectorized engine in the <a href='http://sites.computer.org/debull/A12mar/vectorwise.pdf'>VectorWise</a> DBMS), or to implement the query execution engine so that dispatch overhead is close to zero (for example, code generation with <a href='http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/'>LLVM in Cloudera Impala</a>). Otherwise, with any halfway decent disk subsystem, the query interpreter will inevitably be CPU-bound. It makes sense not only to store data by columns, but also to process it, as far as possible, by columns.
<img src="https://habrastorage.org/files/2eb/c3e/79e/2ebc3e79e506499098bafb4ff59603c3.gif"/>
<em>Getting a report from a column-oriented database</em>
There are quite a few column-oriented DBMSs, for example: <a href='http://www.vertica.com/'>Vertica</a>, ParAccel (<a href='http://www.actian.com/products/analytics-platform/matrix-mpp-analytics-database/'>Actian Matrix</a>, <a href='http://aws.amazon.com/redshift/'>Amazon Redshift</a>), Sybase IQ (SAP IQ), <a href='http://www.exasol.com/en/'>Exasol</a>, <a href='https://www.infobright.com/'>Infobright</a>, <a href='https://github.com/infinidb/infinidb'>InfiniDB</a>, <a href='https://www.monetdb.org/Home'>MonetDB</a> (VectorWise, <a href='http://www.actian.com/products/analytics-platform/vector-smp-analytics-database/'>Actian Vector</a>), <a href='http://luciddb.sourceforge.net/'>LucidDB</a>, <a href='http://hana.sap.com/abouthana.html'>SAP HANA</a>, <a href='http://research.google.com/pubs/pub36632.html'>Google Dremel</a>, <a href='http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf'>Google PowerDrill</a>, <a href='http://druid.io/'>Metamarkets Druid</a>, <a href='http://kx.com/software.php'>kdb+</a>, and so on.
Traditionally row-oriented DBMSs have also started to offer columnar storage: examples are the column store index in <a href='https://msdn.microsoft.com/en-us/gg492088.aspx'>MS SQL Server</a>, <a href='http://docs.memsql.com/latest/concepts/columnar/'>MemSQL</a>, <a href='https://github.com/citusdata/cstore_fdw'>cstore_fdw</a> for Postgres, and the ORC-File and <a href='http://parquet.apache.org/'>Parquet</a> formats for Hadoop.
OLAPServer is the simplest and most limited possible implementation of a column-oriented database. It supports just one table, defined at compile time: the visits table. Data is updated not in real time, as everywhere else in Metrica, but several times a day. The only supported data types are fixed-length numbers of 1-8 bytes, and the only supported query form is <code>SELECT keys..., aggregates... FROM table WHERE condition1 AND condition2 AND... GROUP BY keys ORDER BY column_nums...</code>
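A hypothetical query in the only shape OLAPServer supported (the column names here are invented for illustration; note that even dates would have to be stored as plain numbers, since only fixed-size numeric columns were available):
```
-- keys first, then aggregates; conditions are simple ANDs; ORDER BY refers to output columns by number
SELECT CounterID, TraficSourceID, count(), sum(Duration)
FROM visits
WHERE CounterID = 1234 AND IsNewUser = 1
GROUP BY CounterID, TraficSourceID
ORDER BY 4 DESC
```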
Despite this limited functionality, OLAPServer coped with the report builder successfully. But it could not solve the problem of making every Yandex.Metrica report customizable. For example, a report containing URLs could not be obtained through the report builder, because OLAPServer did not store URLs; nor could we implement a feature our users often need: viewing landing pages for search phrases.
As of 2013, we stored 728 billion rows in OLAPServer. Then all the data was moved to ClickHouse and deleted.
<h1>ClickHouse</h1>
Using OLAPServer, we came to understand how well column-oriented DBMSs handle ad-hoc analytics over non-aggregated data. If any report can be built from non-aggregated data, the question arises: do we need to pre-aggregate data in advance at all, the way we do with Metrage?
On the one hand, pre-aggregation reduces the amount of data used at the very moment the report page is loaded.
On the other hand, aggregated data is a very limited solution, for the following reasons:
<ul>
<li>you must know the list of reports the user needs in advance;</li>
<li>in other words, the user cannot build an arbitrary report;</li>
<li>when aggregating over a large number of keys, the data volume does not shrink and aggregation is useless;</li>
<li>with a large number of reports, there are too many aggregation variants (combinatorial explosion);</li>
<li>when aggregating by high-cardinality keys (for example, URLs), the data volume does not shrink much (less than twofold);</li>
<li>because of this, the data volume may not shrink at all during aggregation, but grow;</li>
<li>users will not look at all the reports we compute for them, so most of the computation is wasted;</li>
<li>it is hard to maintain logical consistency when storing a large number of different aggregations.</li>
</ul>
As you can see, if nothing is aggregated and we work with non-aggregated data, this may even reduce the amount of computation.
But working only with non-aggregated data places very high demands on the efficiency of the system that executes the queries.
If we aggregate data in advance, we do so constantly (in real time), but asynchronously with respect to user queries. We only have to keep up with aggregating the data in real time; when a report is requested, mostly pre-prepared data is used.
If we do not aggregate in advance, all the work has to be done at the moment of the user's request, while the user is waiting for the report page to load. This means we may have to process many billions of rows during a single query, and the faster, the better.
For that you need a good column-oriented DBMS. There was no column-oriented DBMS on the market that could handle internet-analytics tasks on the scale of the Russian internet well enough without prohibitively expensive licenses. Had we used some of the solutions listed in the previous section, the license costs would have exceeded the cost of all our servers and staff many times over.
As an alternative to commercial column-oriented DBMSs, solutions for efficient ad-hoc analytics over data stored in distributed computing systems have recently started to appear: <a href='http://impala.io/'>Cloudera Impala</a>, <a href='https://spark.apache.org/sql/'>Spark SQL</a>, <a href='https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920'>Presto</a>, <a href='http://drill.apache.org/'>Apache Drill</a>. Although such systems can work effectively on queries for internal analytical tasks, it is hard to imagine them as the backend of the web interface of an analytics system available to external users.
Yandex has developed its own column-oriented DBMS: ClickHouse. Let's go over its main advantages.
<strong>ClickHouse works at big data scale</strong>
In the new Yandex.Metrica, ClickHouse is used to store all the data for reports. The database volume (as of December 2015) is 11.4 trillion rows (for the main Metrica alone). The rows are non-aggregated data used to build reports in real time. Each row in the largest tables has more than 200 columns.
<strong>ClickHouse scales linearly</strong>
ClickHouse lets you grow the cluster by adding new servers as needed. For example, the main Yandex.Metrica cluster was expanded from 60 to 394 servers over two years. For fault tolerance, the servers are placed in different datacenters. ClickHouse can use all the hardware available to process a single query, reaching speeds of more than 1 terabyte per second (uncompressed data, only the columns actually used).
<strong>ClickHouse is fast</strong>
ClickHouse's high performance is a particular point of pride for us. According to our tests, ClickHouse processes queries faster than any other system we could get our hands on. For example, ClickHouse is on average 2.8 to 3.4 times faster than Vertica. There is no single silver bullet that makes ClickHouse so fast; its performance is the result of systematic work that we keep doing.
<strong>ClickHouse is feature-rich</strong>
ClickHouse supports a dialect of SQL, including subqueries and JOINs (local and distributed).
There are numerous SQL extensions: functions for web analytics, arrays and nested data structures, higher-order functions, aggregate functions for approximate calculations using sketching, and so on. Working with ClickHouse, you get the convenience of a relational DBMS.
<strong>ClickHouse is easy to use</strong>
Although ClickHouse is an in-house development, we made it feel like an off-the-shelf product that is easy to install by following the instructions and works right away. And although ClickHouse can run on large clusters, it can also be installed on a single server or even a virtual machine.
ClickHouse was developed by the Yandex.Metrica team. At the same time, we managed to make the system flexible and extensible enough to be used successfully for a variety of tasks: there are now more than a dozen applications of ClickHouse inside the company.
ClickHouse is a good fit for building all kinds of analytical tools. Indeed, if ClickHouse copes with the workload of the main Yandex.Metrica, you can be confident that it will handle other tasks with a large performance margin.
Metrica for mobile apps was especially lucky in this respect: when it was being developed at the end of 2013, ClickHouse was already available. To process AppMetrica data, we simply wrote one program that takes the incoming data and, after light processing, writes it to ClickHouse. Any feature available in the AppMetrica interface is simply a SELECT query.
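For example (a purely hypothetical query; the table and column names are not the real AppMetrica schema), a report widget in the interface might boil down to something like:
```
SELECT EventName, uniq(DeviceID) AS users, count() AS events
FROM mobile_events
WHERE APIKey = 42 AND EventDate >= today() - 7
GROUP BY EventName
ORDER BY users DESC
LIMIT 50
```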
ClickHouse is used to store and analyze the logs of various Yandex services. The typical solution here would be Logstash and ElasticSearch, but it does not hold up on any reasonably large data stream.
ClickHouse also works as a time series database: at Yandex it is used as the backend for <a href='http://graphite.wikidot.com/'>Graphite</a> instead of Ceres/Whisper, which makes it possible to work with more than a trillion metrics on a single server.
Analysts use ClickHouse for internal tasks. By a rough estimate, ClickHouse is about three orders of magnitude more efficient than traditional data processing methods (MapReduce scripts). This is not merely a quantitative difference: with such computation speed, you can afford fundamentally different ways of solving problems.
If an analyst is given the task of producing a report, a good analyst will not produce just that one report. Instead, they will first build a dozen other reports to study the nature of the data better and to check the hypotheses that come up along the way. It often makes sense to look at the data from different angles even without any specific goal, in order to find new hypotheses and test them.
This is only possible if the speed of data analysis allows research to be done interactively. The faster queries run, the more hypotheses can be tested. Working with ClickHouse, you get the feeling that your speed of thinking has increased.
In traditional systems, data lies, figuratively speaking, as dead weight at the bottom of a swamp. You can do anything with it, but it will take a long time and be very inconvenient. If your data is in ClickHouse, it is "live" data: you can explore it in any slice and drill down to every individual row.
<h1>Conclusions</h1>
It so happens that Yandex.Metrica <a href='http://w3techs.com/technologies/overview/traffic_analysis/all'>is</a> the second largest web analytics system in the world. The volume of data coming into Metrica has grown from 200 million events per day at the beginning of 2009 to a little over 20 billion in 2015. To give users rich capabilities while not breaking under the growing load, we had to keep changing our approach to data storage.
Hardware efficiency is very important to us. In our experience, with large data volumes you should worry not about how well the system scales, but about how efficiently each unit of resources is used: each CPU core, disk and SSD, RAM, and network. After all, if your system already uses hundreds of servers and you need to work ten times more efficiently, you are unlikely to simply be able to deploy thousands of servers, no matter how well the system scales.
To achieve maximum efficiency, specialization for a particular class of tasks is essential. There is no data structure that copes well with completely different workloads. For example, it is obvious that a key-value store is unsuitable for analytical queries. The heavier the load on the system, the more specialization is required, and one should not be afraid to use fundamentally different data structures for different tasks.
We managed to make Yandex.Metrica relatively cheap in terms of hardware. This lets us provide the service even for the largest websites and mobile apps; Yandex.Metrica has no competitors on that field. For example, if you have a popular mobile app, you can use <a href='https://appmetrika.yandex.ru/'>Yandex.Metrica for apps</a> for free, even if your app is more popular than Yandex.Maps.

View File

@ -1 +0,0 @@
Evolution of data structures in Yandex.Metrica

View File

@ -1 +0,0 @@
yandex, open-source, big data, clickhouse, columnar database, olap, databases, data structures, web analytics, c++

View File

@ -1,480 +0,0 @@
Today, Yandex's internally developed <a href="https://clickhouse.yandex/">analytical DBMS ClickHouse</a> becomes available to everyone. The source code is published on <a href="https://github.com/yandex/ClickHouse">GitHub</a> under the Apache 2.0 license.
<img src="https://habrastorage.org/files/d9b/066/e61/d9b066e61e1f480a977d889dc03ded99.png"/>
ClickHouse lets you run analytical queries interactively on data that is updated in real time. The system scales to tens of trillions of records and petabytes of stored data. Using ClickHouse opens up possibilities that were previously hard even to imagine: you can store the entire stream of data without pre-aggregation and quickly get reports in any slice. ClickHouse was developed at Yandex for the needs of <a href="https://metrika.yandex.ru/">Yandex.Metrica</a>, the second largest web analytics system in the world.
In this article we will tell you how and why ClickHouse appeared at Yandex and what it can do, compare it with other systems, and show how to set it up yourself with minimal effort.
<cut />
<h1>Where ClickHouse's niche is</h1>
Why would anyone need ClickHouse when there are so many other technologies for working with big data?
If you just need to store logs, you have plenty of options. You can load logs into Hadoop and analyze them with Hive, Spark, or Impala. In that case you do not necessarily need ClickHouse. Things get harder when you need to run interactive queries over non-aggregated data that arrives in the system in real time. For this task, no open technology of adequate quality has existed until now.
There are separate areas where other systems can be used. They can be classified as follows:
<ol>
<li>Commercial OLAP DBMSs for use in your own infrastructure.
Examples: <a href="http://www8.hp.com/ru/ru/software-solutions/advanced-sql-big-data-analytics/">HP Vertica</a>, <a href="http://www.actian.com/products/big-data-analytics-platforms-with-hadoop/vector-smp-analytics-database/">Actian Vector</a>, <a href="http://www.actian.com/products/big-data-analytics-platforms-with-hadoop/matrix-mpp-analytics-databases/">Actian Matrix</a>, <a href="http://www.exasol.com/en/">EXASol</a>, <a href="https://go.sap.com/cis/cmp/ppc/crm-ru15-3di-ppc-it-a2/index.html">Sybase IQ</a>, and others.
How we differ: we made the technology open and free.
</li>
<li>Cloud solutions.
Examples: <a href="http://aws.amazon.com/redshift/">Amazon Redshift</a> and <a href="https://cloud.google.com/bigquery/">Google BigQuery</a>.
How we differ: customers can use ClickHouse in their own infrastructure and avoid paying for the cloud.
</li>
<li>Add-ons on top of Hadoop.
Examples: <a href="http://impala.io/">Cloudera Impala</a>, <a href="http://spark.apache.org/sql/">Spark SQL</a>, <a href="https://prestodb.io/">Facebook Presto</a>, <a href="https://drill.apache.org/">Apache Drill</a>.
How we differ:
<ul>
<li>unlike Hadoop, ClickHouse can serve analytical queries even within a mass-scale publicly available service, such as Yandex.Metrica;</li>
<li>ClickHouse does not require a deployed Hadoop infrastructure to run; it is easy to use and suitable even for small projects;</li>
<li>ClickHouse can load data in real time and takes care of storage and indexing on its own;</li>
<li>unlike Hadoop, ClickHouse works across geographically distributed datacenters.</li>
</ul>
</li>
<li>Open-source OLAP DBMSs.
Examples: <a href="https://github.com/infinidb/infinidb">InfiniDB</a>, <a href="https://www.monetdb.org/">MonetDB</a>, <a href="https://github.com/LucidDB/luciddb">LucidDB</a>.
Development of all these projects has been abandoned; they were never mature enough and, in essence, never left the alpha stage. These systems were not distributed, which is critically necessary for processing big data. The active development of ClickHouse, the maturity of the technology, and its focus on the practical needs that arise when processing big data are driven by the tasks of Yandex. Without being battle-tested on real tasks that go beyond the capabilities of existing systems, it would have been impossible to create a quality product.
</li>
<li>Open-source analytics systems that are not relational OLAP DBMSs.
Examples: <a href="http://druid.io/">Metamarkets Druid</a>, <a href="http://kylin.apache.org/">Apache Kylin</a>.
How we differ: ClickHouse does not require pre-aggregation of data. ClickHouse supports a dialect of SQL and provides the convenience of a relational DBMS.
</li>
</ol>
Within the fairly narrow niche that ClickHouse occupies, it still has no alternatives. Within a broader range of applications, ClickHouse can turn out to be more advantageous than other systems in terms of <a href="https://clickhouse.yandex/benchmark.html">query processing speed</a>, efficiency of resource usage, and ease of operation.
<img src="https://habrastorage.org/files/37e/1b6/556/37e1b65562844a7a9c2477f9b5f7cda1.png"/>
<i>A click map in Yandex.Metrica and the corresponding ClickHouse query</i>
Initially we developed ClickHouse exclusively for the needs of <a href="https://metrika.yandex.ru/">Yandex.Metrica</a>: to build reports interactively over non-aggregated logs of user actions. Because the system is a full-fledged DBMS with very broad functionality, <a href="https://clickhouse.yandex/reference_ru.html">detailed documentation</a> was written as early as 2012, at the very start of its use. This distinguishes ClickHouse from many typical internal developments: specialized and embedded data structures for solving specific tasks, such as Metrage and OLAPServer, which I described in a <a href="http://habrahabr.ru/company/yandex/blog/273305/">previous article</a>.
The rich functionality and detailed documentation led ClickHouse to gradually spread across many departments of Yandex. It unexpectedly turned out that the system can be installed from the instructions and works out of the box, without involving the developers. ClickHouse came to be used in Direct, Market, Mail, AdFox, Webmaster, in monitoring, and in business analytics. ClickHouse either made it possible to solve tasks for which there were previously no suitable tools, or to solve tasks orders of magnitude more efficiently than other systems.
Gradually, demand arose for using ClickHouse outside of Yandex's internal products as well. For example, in 2013 ClickHouse was used to analyze metadata about events of the <a href="https://www.yandex.com/company/press_center/press_releases/2012/2012-04-10/">LHCb experiment at CERN</a>. The system could have been used more widely, but at that time its closed status got in the way. Another example: the open-source technology <a href="https://tech.yandex.ru/tank/">Yandex.Tank</a> uses ClickHouse inside Yandex to store telemetry data, while for external users only MySQL was available as the backend, which is poorly suited for this task.
As the user base expanded, it became necessary to spend somewhat more effort on development, although not much compared to the effort spent on Metrica's own tasks. In return, we get a better product, especially in terms of usability.
Expanding the user base lets us consider use cases that would hardly have come to mind otherwise. It also helps us find bugs and inconveniences faster, which matters for the main application of ClickHouse in Metrica as well. All of this, without a doubt, improves the quality of the product. That is why it is in our interest to make ClickHouse open today.
<h1>How to stop worrying and start using ClickHouse</h1>
Let's try working with ClickHouse on a "toy" open dataset: information about flights in the USA from 1987 to 2015. This cannot be called big data (only 166 million rows, 63 GB of uncompressed data), but you can download it quickly and start experimenting. The data is available <a href="https://yadi.sk/d/pOZxpa42sDdgm">here</a>.
The data can also be downloaded from the original source. How to do that is described <a href="https://github.com/yandex/ClickHouse/raw/master/doc/example_datasets/1_ontime.txt">here</a>.
To begin with, let's install ClickHouse on a single server. Later we will also show how to install ClickHouse on a cluster with sharding and replication.
On Ubuntu and Debian Linux, you can install ClickHouse from <a href="https://clickhouse.yandex/#download">prebuilt packages</a>. On other Linux systems, you can <a href="https://github.com/yandex/ClickHouse/blob/master/doc/build.md">build ClickHouse from source</a> and install it yourself.
The clickhouse-client package contains the <a href="https://clickhouse.yandex/reference_ru.html#%D0%9A%D0%BB%D0%B8%D0%B5%D0%BD%D1%82%20%D0%BA%D0%BE%D0%BC%D0%B0%D0%BD%D0%B4%D0%BD%D0%BE%D0%B9%20%D1%81%D1%82%D1%80%D0%BE%D0%BA%D0%B8">clickhouse-client</a> program, the ClickHouse client for interactive use. The clickhouse-server-base package contains the clickhouse-server binary, and clickhouse-server-common contains the server configuration files.
The server configuration files are located in /etc/clickhouse-server/. The main thing to pay attention to before getting started is the path element, which determines where the data is stored. You do not have to modify config.xml directly, which is inconvenient when packages are updated. Instead, you can override the needed elements <a href="https://clickhouse.yandex/reference_ru.html#%D0%9A%D0%BE%D0%BD%D1%84%D0%B8%D0%B3%D1%83%D1%80%D0%B0%D1%86%D0%B8%D0%BE%D0%BD%D0%BD%D1%8B%D0%B5%20%D1%84%D0%B0%D0%B9%D0%BB%D1%8B">in files in the config.d directory</a>.
It also makes sense to look at the <a href="https://clickhouse.yandex/reference_ru.html#%D0%9F%D1%80%D0%B0%D0%B2%D0%B0%20%D0%B4%D0%BE%D1%81%D1%82%D1%83%D0%BF%D0%B0">access rights settings</a>.
The server does not start by itself when the package is installed and does not restart itself on upgrade.
To start the server, run:
<source lang="Bash">sudo service clickhouse-server start</source>
By default, the server logs are located in the /var/log/clickhouse-server/ directory.
Once the Ready for connections message appears in the log, the server will accept connections.
To connect to the server, use the clickhouse-client program.
<spoiler title="Quick reference">
Interactive mode:
<source lang="Bash">
clickhouse-client
clickhouse-client --host=... --port=... --user=... --password=...
</source>
Enabling multiline queries:
<source lang="Bash">
clickhouse-client -m
clickhouse-client --multiline
</source>
Running queries in batch mode:
<source lang="Bash">
clickhouse-client --query='SELECT 1'
echo 'SELECT 1' | clickhouse-client
</source>
Inserting data in a specified format:
<source lang="Bash">
clickhouse-client --query='INSERT INTO table VALUES' &lt; data.txt
clickhouse-client --query='INSERT INTO table FORMAT TabSeparated' &lt; data.tsv
</source>
</spoiler>
<h3>Creating a table for the test data</h3>
<spoiler title="Creating the table">
<source lang="Bash">
$ clickhouse-client --multiline
ClickHouse client version 0.0.53720.
Connecting to localhost:9000.
Connected to ClickHouse server version 0.0.53720.
:) CREATE TABLE ontime
(
Year UInt16,
Quarter UInt8,
Month UInt8,
DayofMonth UInt8,
DayOfWeek UInt8,
FlightDate Date,
UniqueCarrier FixedString(7),
AirlineID Int32,
Carrier FixedString(2),
TailNum String,
FlightNum String,
OriginAirportID Int32,
OriginAirportSeqID Int32,
OriginCityMarketID Int32,
Origin FixedString(5),
OriginCityName String,
OriginState FixedString(2),
OriginStateFips String,
OriginStateName String,
OriginWac Int32,
DestAirportID Int32,
DestAirportSeqID Int32,
DestCityMarketID Int32,
Dest FixedString(5),
DestCityName String,
DestState FixedString(2),
DestStateFips String,
DestStateName String,
DestWac Int32,
CRSDepTime Int32,
DepTime Int32,
DepDelay Int32,
DepDelayMinutes Int32,
DepDel15 Int32,
DepartureDelayGroups String,
DepTimeBlk String,
TaxiOut Int32,
WheelsOff Int32,
WheelsOn Int32,
TaxiIn Int32,
CRSArrTime Int32,
ArrTime Int32,
ArrDelay Int32,
ArrDelayMinutes Int32,
ArrDel15 Int32,
ArrivalDelayGroups Int32,
ArrTimeBlk String,
Cancelled UInt8,
CancellationCode FixedString(1),
Diverted UInt8,
CRSElapsedTime Int32,
ActualElapsedTime Int32,
AirTime Int32,
Flights Int32,
Distance Int32,
DistanceGroup UInt8,
CarrierDelay Int32,
WeatherDelay Int32,
NASDelay Int32,
SecurityDelay Int32,
LateAircraftDelay Int32,
FirstDepTime String,
TotalAddGTime String,
LongestAddGTime String,
DivAirportLandings String,
DivReachedDest String,
DivActualElapsedTime String,
DivArrDelay String,
DivDistance String,
Div1Airport String,
Div1AirportID Int32,
Div1AirportSeqID Int32,
Div1WheelsOn String,
Div1TotalGTime String,
Div1LongestGTime String,
Div1WheelsOff String,
Div1TailNum String,
Div2Airport String,
Div2AirportID Int32,
Div2AirportSeqID Int32,
Div2WheelsOn String,
Div2TotalGTime String,
Div2LongestGTime String,
Div2WheelsOff String,
Div2TailNum String,
Div3Airport String,
Div3AirportID Int32,
Div3AirportSeqID Int32,
Div3WheelsOn String,
Div3TotalGTime String,
Div3LongestGTime String,
Div3WheelsOff String,
Div3TailNum String,
Div4Airport String,
Div4AirportID Int32,
Div4AirportSeqID Int32,
Div4WheelsOn String,
Div4TotalGTime String,
Div4LongestGTime String,
Div4WheelsOff String,
Div4TailNum String,
Div5Airport String,
Div5AirportID Int32,
Div5AirportSeqID Int32,
Div5WheelsOn String,
Div5TotalGTime String,
Div5LongestGTime String,
Div5WheelsOff String,
Div5TailNum String
)
ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);
</source>
</spoiler>
We created a table of the <a href="https://clickhouse.yandex/reference_ru.html#MergeTree">MergeTree</a> type. MergeTree family tables are recommended for any serious use. Such tables have a primary key by which the data is incrementally sorted, which allows queries over a range of the primary key to run quickly.
For example, if we have logs of an ad network and we need to show reports for specific advertiser clients, the primary key in the table should start with the client identifier, so that to get the data for one client it is enough to read only a small range of data.
<h3>Loading data into the table</h3>
<source lang="Bash">xz -v -c -d &lt; ontime.csv.xz | clickhouse-client --query="INSERT INTO ontime FORMAT CSV"</source>
The INSERT query in ClickHouse lets you load data in any <a href="https://clickhouse.yandex/reference_ru.html#%D0%A4%D0%BE%D1%80%D0%BC%D0%B0%D1%82%D1%8B">supported format</a>. Loading data uses O(1) memory, and an INSERT query can take any amount of data as input. Data should always be inserted <a href="https://clickhouse.yandex/reference_ru.html#%D0%9F%D1%80%D0%BE%D0%B8%D0%B7%D0%B2%D0%BE%D0%B4%D0%B8%D1%82%D0%B5%D0%BB%D1%8C%D0%BD%D0%BE%D1%81%D1%82%D1%8C%20%D0%BF%D1%80%D0%B8%20%D0%B2%D1%81%D1%82%D0%B0%D0%B2%D0%BA%D0%B5%20%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85.">in batches that are not too small</a>. Inserting a block of data of up to max_insert_block_size (= 1&nbsp;048&nbsp;576 rows by default) is atomic: the block is either inserted in full or not at all. If the connection breaks during an insert, you may not know whether the block was inserted. To achieve exactly-once semantics, <a href="https://clickhouse.yandex/reference_ru.html#%D0%A0%D0%B5%D0%BF%D0%BB%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F%20%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85">replicated tables</a> support idempotency: you can insert the same block of data again, possibly to a different replica, and it will be inserted only once. In this example we insert data from localhost, so we are not concerned about batching or exactly-once semantics.
The INSERT query into MergeTree tables is non-blocking, as is SELECT. After the data is loaded, or even while it is still loading, we can already run SELECTs.
One suboptimal thing in this example is that the table uses the String data type where <a href="https://clickhouse.yandex/reference_ru.html#Enum">Enum</a> or a numeric type would do. If the set of distinct string values is known to be small (for example, operating system names or mobile phone manufacturers), we recommend using Enums or numbers for maximum performance. If the set of strings is potentially unbounded (for example, search queries or URLs), use the String data type.
Second, note that in this example the table structure contains the redundant columns Year, Quarter, Month, DayOfMonth, DayOfWeek, whereas FlightDate alone would suffice. Most likely this was done for the benefit of other DBMSs, in which functions for manipulating dates and times may not be fast enough. In ClickHouse this is unnecessary, because the <a href="https://clickhouse.yandex/reference_ru.html#%D0%A4%D1%83%D0%BD%D0%BA%D1%86%D0%B8%D0%B8%20%D0%B4%D0%BB%D1%8F%20%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B%20%D1%81%20%D0%B4%D0%B0%D1%82%D0%B0%D0%BC%D0%B8%20%D0%B8%20%D0%B2%D1%80%D0%B5%D0%BC%D0%B5%D0%BD%D0%B5%D0%BC">corresponding functions</a> are well optimized. However, extra columns are not a problem: since ClickHouse is a <a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS">column-oriented DBMS</a>, you can afford to have quite a lot of columns in a table. Hundreds of columns are normal for ClickHouse.
<h3>Examples of working with the loaded data</h3>
<ul>
<li><spoiler title="какие направления были самыми популярными в 2015 году;">
<source lang="SQL">
SELECT
OriginCityName,
DestCityName,
count(*) AS flights,
bar(flights, 0, 20000, 40)
FROM ontime WHERE Year = 2015 GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20
</source><img src="https://habrastorage.org/files/a85/18a/200/a8518a200d6d405a95ee80ea1c8e1c90.png"/>
<source lang="SQL">
SELECT
OriginCityName &lt; DestCityName ? OriginCityName : DestCityName AS a,
OriginCityName &lt; DestCityName ? DestCityName : OriginCityName AS b,
count(*) AS flights,
bar(flights, 0, 40000, 40)
FROM ontime WHERE Year = 2015 GROUP BY a, b ORDER BY flights DESC LIMIT 20
</source><img src="https://habrastorage.org/files/d35/78d/b55/d3578db55e304bd7b5eba818abdb53f5.png"/>
</spoiler>
</li>
<li><spoiler title="из каких городов отправляется больше рейсов;">
<source lang="SQL">
SELECT OriginCityName, count(*) AS flights FROM ontime GROUP BY OriginCityName ORDER BY flights DESC LIMIT 20
</source><img src="https://habrastorage.org/files/ef4/141/f34/ef4141f348234773a5349c4bd3e8f804.png"/></spoiler>
</li>
<li><spoiler title="из каких городов можно улететь по максимальному количеству направлений;">
<source lang="SQL">
SELECT OriginCityName, uniq(Dest) AS u FROM ontime GROUP BY OriginCityName ORDER BY u DESC LIMIT 20
</source><img src="https://habrastorage.org/files/240/9f4/9d1/2409f49d11fb4aa1b8b5ff34cf9ca75d.png"/></spoiler>
</li>
<li><spoiler title="как зависит задержка вылета рейсов от дня недели;">
<source lang="SQL">
SELECT DayOfWeek, count() AS c, avg(DepDelay &gt; 60) AS delays FROM ontime GROUP BY DayOfWeek ORDER BY DayOfWeek
</source><img src="https://habrastorage.org/files/885/e50/793/885e507930e34b7c8f788d25e7ca2bcf.png"/></spoiler>
</li>
<li><spoiler title="из каких городов, самолёты чаще задерживаются с вылетом более чем на час;">
<source lang="SQL">
SELECT OriginCityName, count() AS c, avg(DepDelay &gt; 60) AS delays
FROM ontime
GROUP BY OriginCityName
HAVING c &gt; 100000
ORDER BY delays DESC
LIMIT 20
</source><img src="https://habrastorage.org/files/ac2/926/56d/ac292656d03946d0aba35c75783a31f2.png"/></spoiler>
</li>
<li><spoiler title="какие наиболее длинные рейсы;">
<source lang="SQL">
SELECT OriginCityName, DestCityName, count(*) AS flights, avg(AirTime) AS duration
FROM ontime
GROUP BY OriginCityName, DestCityName
ORDER BY duration DESC
LIMIT 20
</source><img src="https://habrastorage.org/files/7b3/c2e/685/7b3c2e685832439b8c373bf2015131d2.png"/></spoiler>
</li>
<li><spoiler title="распределение времени задержки прилёта, по авиакомпаниям;">
<source lang="SQL">
SELECT Carrier, count() AS c, round(quantileTDigest(0.99)(DepDelay), 2) AS q
FROM ontime GROUP BY Carrier ORDER BY q DESC
</source><img src="https://habrastorage.org/files/49c/332/e3d/49c332e3d93146ba8f46beef6b2b02b0.png"/></spoiler>
</li>
<li><spoiler title="какие авиакомпании прекратили перелёты;">
<source lang="SQL">
SELECT Carrier, min(Year), max(Year), count()
FROM ontime GROUP BY Carrier HAVING max(Year) &lt; 2015 ORDER BY count() DESC
</source><img src="https://habrastorage.org/files/249/56f/1a2/24956f1a2efc48d78212586958aa036c.png"/></spoiler>
</li>
<li><spoiler title="в какие города стали больше летать в 2015 году;">
<source lang="SQL">
SELECT
DestCityName,
sum(Year = 2014) AS c2014,
sum(Year = 2015) AS c2015,
c2015 / c2014 AS diff
FROM ontime
WHERE Year IN (2014, 2015)
GROUP BY DestCityName
HAVING c2014 &gt; 10000 AND c2015 &gt; 1000 AND diff &gt; 1
ORDER BY diff DESC
</source><img src="https://habrastorage.org/files/f31/32f/4d1/f3132f4d1c0d42eab26d9111afe7771a.png"/></spoiler>
</li>
<li><spoiler title="перелёты в какие города больше зависят от сезонности.">
<source lang="SQL">
SELECT
DestCityName,
any(total),
avg(abs(monthly * 12 - total) / total) AS avg_month_diff
FROM
(
SELECT DestCityName, count() AS total
FROM ontime GROUP BY DestCityName HAVING total &gt; 100000
)
ALL INNER JOIN
(
SELECT DestCityName, Month, count() AS monthly
FROM ontime GROUP BY DestCityName, Month HAVING monthly &gt; 10000
)
USING DestCityName
GROUP BY DestCityName
ORDER BY avg_month_diff DESC
LIMIT 20
</source><img src="https://habrastorage.org/files/26b/2c7/aae/26b2c7aae21a4c76800cb1c7a33a374d.png"/></spoiler>
</li>
</ul>
<h3>How to install ClickHouse on a cluster of several servers</h3>
In terms of installed software, a ClickHouse cluster is homogeneous, with no dedicated nodes. You install ClickHouse on all servers of the cluster, then describe the cluster configuration in the configuration file, create a local table on each server, and then create a <a href="https://clickhouse.yandex/reference_ru.html#Distributed">Distributed table</a>.
A <a href="https://clickhouse.yandex/reference_ru.html#Distributed">Distributed table</a> is a kind of "view" over the local tables of a ClickHouse cluster. A SELECT from a Distributed table is executed in a distributed fashion, using the resources of all shards of the cluster. You can declare configurations for several different clusters and create several Distributed tables that look at different clusters.
<spoiler title="Configuration of a cluster with three shards, each storing the data on a single replica">
<source lang="XML">
&lt;remote_servers&gt;
&lt;perftest_3shards_1replicas&gt;
&lt;shard&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest01j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;/shard&gt;
&lt;shard&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest02j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;/shard&gt;
&lt;shard&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest03j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;/shard&gt;
&lt;/perftest_3shards_1replicas&gt;
&lt;/remote_servers&gt;
</source>
</spoiler>
Creating a local table:
<source lang="SQL">CREATE TABLE ontime_local (...) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);</source>
Creating a Distributed table that views the local tables on the cluster:
<source lang="SQL">CREATE TABLE ontime_all AS ontime_local ENGINE = Distributed(perftest_3shards_1replicas, default, ontime_local, rand());</source>
You can create the Distributed table on all servers of the cluster; then distributed queries can be sent to any server of the cluster. Besides the Distributed table, you can also use the <a href="https://clickhouse.yandex/reference_ru.html#remote">remote table function</a>.
To spread the table across several servers, let's do an <a href="https://clickhouse.yandex/reference_ru.html#INSERT">INSERT SELECT</a> into the Distributed table.
<source lang="SQL">INSERT INTO ontime_all SELECT * FROM ontime;</source>
Note that this approach is not suitable for resharding large tables; use the built-in <a href="https://clickhouse.yandex/reference_ru.html#Перешардирование">resharding functionality</a> for that instead.
As expected, longer queries run several times faster when executed on three servers rather than one. <spoiler title="Example">
<img src="https://habrastorage.org/files/ece/020/129/ece020129fdf4a18a6e75daf2e699cb9.png"/>
You may notice that the quantile calculation results differ slightly. This is because the implementation of the <a href="https://github.com/tdunning/t-digest/raw/master/docs/t-digest-paper/histo.pdf">t-digest</a> algorithm is non-deterministic: it depends on the order in which the data is processed.</spoiler>
In this example we used a cluster of three shards, each consisting of a single replica. For real tasks, for fault tolerance, each shard should consist of two or three replicas located in different datacenters. (An arbitrary number of replicas is supported.)
<spoiler title="Configuration of a cluster with one shard storing the data on three replicas">
<source lang="XML">
&lt;remote_servers&gt;
...
&lt;perftest_1shards_3replicas&gt;
&lt;shard&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest01j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest02j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest03j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;/shard&gt;
&lt;/perftest_1shards_3replicas&gt;
&lt;/remote_servers&gt;
</source>
</spoiler>
Replication (storing metadata and coordinating actions) requires <a href="http://zookeeper.apache.org/">ZooKeeper</a>. ClickHouse takes care of data consistency on the replicas itself and performs recovery after failures. It is recommended to place the ZooKeeper cluster on separate servers.
In fact, using ZooKeeper is not mandatory: in the simplest cases you can duplicate the data by writing it to all replicas manually and not use the built-in replication mechanism. But this approach is not recommended, because in that case ClickHouse cannot guarantee data consistency across the replicas.
<spoiler title="Specify the ZooKeeper addresses in the configuration file">
<source lang="XML">
&lt;zookeeper-servers&gt;
&lt;node&gt;
&lt;host&gt;zoo01.yandex.ru&lt;/host&gt;
&lt;port&gt;2181&lt;/port&gt;
&lt;/node&gt;
&lt;node&gt;
&lt;host&gt;zoo02.yandex.ru&lt;/host&gt;
&lt;port&gt;2181&lt;/port&gt;
&lt;/node&gt;
&lt;node&gt;
&lt;host&gt;zoo03.yandex.ru&lt;/host&gt;
&lt;port&gt;2181&lt;/port&gt;
&lt;/node&gt;
&lt;/zookeeper-servers&gt;
</source>
</spoiler>
We also define substitutions that identify the shard and the replica; they will be used when creating the table.
<source lang="XML">
&lt;macros&gt;
&lt;shard&gt;01&lt;/shard&gt;
&lt;replica&gt;01&lt;/replica&gt;
&lt;/macros&gt;
</source>
If there are no other replicas yet when a replicated table is created, the first replica is created; if there are, a new replica is created that clones the data of the existing replicas. You can either create all the replica tables at once and then load the data into them, or create some replicas first and add the others later, after or during data loading.
<source lang="SQL">
CREATE TABLE ontime_replica (...)
ENGINE = ReplicatedMergeTree(
'/clickhouse_perftest/tables/{shard}/ontime',
'{replica}',
FlightDate,
(Year, FlightDate),
8192);
</source>
Here you can see that we use the <a href="https://clickhouse.yandex/reference_ru.html#ReplicatedMergeTree">ReplicatedMergeTree</a> table type, passing as parameters the ZooKeeper path containing the shard identifier, as well as the replica identifier.
<source lang="SQL">INSERT INTO ontime_replica SELECT * FROM ontime;</source>
Replication works in multi-master mode. You can insert data into any replica, and the data is automatically propagated to all replicas. Replication is asynchronous, so at any given moment the replicas may not contain all of the recently written data. At least one replica must be available to accept writes. The others will download the new data and restore consistency as soon as they become active again. This scheme allows for the possibility of losing just-inserted data.
<h1>How to influence the development of ClickHouse</h1>
If you have questions, you can ask them in the comments to this article or on <a href="http://stackoverflow.com/">StackOverflow</a>. You can also start a discussion in the <a href="https://groups.google.com/group/clickhouse">Google group</a> or send your suggestions to the clickhouse-feedback@yandex-team.ru mailing list. And if you would like to try working on ClickHouse from the inside, we invite you to join our team at Yandex. We have open <a href="https://yandex.ru/jobs/vacancies/dev/?tags=c%2B%2B">vacancies</a> and <a href="https://yandex.ru/jobs/vacancies/interns/summer">internships</a>.

View File

@ -1 +0,0 @@
Yandex open-sources ClickHouse

@ -1 +0,0 @@
Subproject commit 7ddc53e4d70bc5f6e63d044b99ec81f3b0d0bd5a

View File

@ -1 +0,0 @@
../../website/logo.svg

docs/README.md Normal file
View File

@ -0,0 +1,3 @@
This is the source code for the ClickHouse documentation, which is published on the official website:
* In English: https://clickhouse.yandex/docs/en/
* In Russian: https://clickhouse.yandex/docs/ru/

View File

@ -29,6 +29,10 @@ a.reference {
border-bottom: none;
}
span.strike {
text-decoration: line-through;
}
input[type="submit"] {
border: none!important;
background: #fc0;

View File

@ -4,19 +4,19 @@ $(function() {
var pathname = window.location.pathname;
var url;
if (pathname.indexOf('html') >= 0) {
url = pathname.replace('/docs/', 'https://github.com/yandex/ClickHouse/edit/master/doc/reference/').replace('html', 'rst');
url = pathname.replace('/docs/', 'https://github.com/yandex/ClickHouse/edit/master/docs/').replace('html', 'rst');
} else {
if (pathname.indexOf('/single/') >= 0) {
if (pathname.indexOf('ru') >= 0) {
url = 'https://github.com/yandex/ClickHouse/tree/master/doc/reference/ru';
url = 'https://github.com/yandex/ClickHouse/tree/master/docs/ru';
} else {
url = 'https://github.com/yandex/ClickHouse/tree/master/doc/reference/en';
url = 'https://github.com/yandex/ClickHouse/tree/master/docs/en';
}
} else {
if (pathname.indexOf('ru') >= 0) {
url = 'https://github.com/yandex/ClickHouse/edit/master/doc/reference/ru/index.rst';
url = 'https://github.com/yandex/ClickHouse/edit/master/docs/ru/index.rst';
} else {
url = 'https://github.com/yandex/ClickHouse/edit/master/doc/reference/en/index.rst';
url = 'https://github.com/yandex/ClickHouse/edit/master/docs/en/index.rst';
}
}
}


docs/_static/logo.svg vendored Normal file
View File

@ -0,0 +1,12 @@
<svg xmlns="http://www.w3.org/2000/svg" width="54" height="48" viewBox="0 0 9 8">
<style>
.o{fill:#fc0}
.r{fill:#f00}
</style>
<path class="r" d="M0,7 h1 v1 h-1 z"/>
<path class="o" d="M0,0 h1 v7 h-1 z"/>
<path class="o" d="M2,0 h1 v8 h-1 z"/>
<path class="o" d="M4,0 h1 v8 h-1 z"/>
<path class="o" d="M6,0 h1 v8 h-1 z"/>
<path class="o" d="M8,3.25 h1 v1.5 h-1 z"/>
</svg>


View File

@ -0,0 +1,227 @@
Overview of ClickHouse architecture
===================================
ClickHouse is a true column-oriented DBMS. Data is stored by columns, and during query execution data is processed by arrays (vectors or chunks of columns). Whenever possible, operations are dispatched on arrays, rather than on individual values. This is called "vectorized query execution," and it helps lower dispatch cost relative to the cost of actual data processing.
This idea is nothing new. It dates back to the ``APL`` programming language and its descendants: ``A+``, ``J``, ``K``, and ``Q``. Array programming is widely used in scientific data processing. Neither is this idea something new in relational databases: for example, it is used in the ``Vectorwise`` system.
There are two different approaches for speeding up query processing: vectorized query execution and runtime code generation. In the latter, the code is generated for every kind of query on the fly, removing all indirection and dynamic dispatch. Neither of these approaches is strictly better than the other. Runtime code generation can be better when it fuses many operations together, thus fully utilizing CPU execution units and the pipeline. Vectorized query execution can be less practical, because it involves temporary vectors that must be written to cache and read back. If the temporary data does not fit in the L2 cache, this becomes an issue. But vectorized query execution more easily utilizes the SIMD capabilities of the CPU. A `research paper <http://15721.courses.cs.cmu.edu/spring2016/papers/p5-sompolski.pdf>`_ written by our friends shows that it is better to combine both approaches. ClickHouse mainly uses vectorized query execution and has limited initial support for runtime code generation (only the inner loop of the first stage of GROUP BY can be compiled).
Columns
-------
To represent columns in memory (actually, chunks of columns), the ``IColumn`` interface is used. This interface provides helper methods for implementation of various relational operators. Almost all operations are immutable: they do not modify the original column, but create a new modified one. For example, the ``IColumn::filter`` method accepts a filter byte mask and creates a new filtered column. It is used for the ``WHERE`` and ``HAVING`` relational operators. Additional examples: the ``IColumn::permute`` method to support ``ORDER BY``, the ``IColumn::cut`` method to support ``LIMIT``, and so on.
Various ``IColumn`` implementations (``ColumnUInt8``, ``ColumnString`` and so on) are responsible for the memory layout of columns. Memory layout is usually a contiguous array. For the integer type of columns it is just one contiguous array, like ``std::vector``. For ``String`` and ``Array`` columns, it is two vectors: one for all array elements, placed contiguously, and a second one for offsets to the beginning of each array. There is also ``ColumnConst`` that stores just one value in memory, but looks like a column.
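To make the shape of this interface more concrete, here is a minimal hypothetical sketch of an immutable integer column with a ``filter``-like operation. The real ``IColumn`` hierarchy is much richer; the class and method names below are illustrative only:

.. code-block:: cpp

    // Minimal sketch of an immutable integer column with a filter() operation.
    // The real IColumn/ColumnUInt64 interfaces are far richer; this only shows the idea
    // that operations return a new column instead of modifying the original one.
    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <vector>

    class ExampleColumnUInt64
    {
    public:
        explicit ExampleColumnUInt64(std::vector<uint64_t> data_) : data(std::move(data_)) {}

        size_t size() const { return data.size(); }
        uint64_t operator[](size_t n) const { return data[n]; }

        /// Keeps the rows where mask[i] is non-zero, as a WHERE/HAVING implementation would.
        std::shared_ptr<ExampleColumnUInt64> filter(const std::vector<uint8_t> & mask) const
        {
            std::vector<uint64_t> result;
            for (size_t i = 0; i < data.size(); ++i)
                if (mask[i])
                    result.push_back(data[i]);
            return std::make_shared<ExampleColumnUInt64>(std::move(result));
        }

    private:
        std::vector<uint64_t> data;
    };
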
Field
-----
Nevertheless, it is possible to work with individual values as well. To represent an individual value, ``Field`` is used. ``Field`` is just a discriminated union of ``UInt64``, ``Int64``, ``Float64``, ``String`` and ``Array``. ``IColumn`` has the ``operator[]`` method to get the n-th value as a ``Field``, and the ``insert`` method to append a ``Field`` to the end of a column. These methods are not very efficient, because they require dealing with temporary ``Field`` objects representing an individual value. There are more efficient methods, such as ``insertFrom``, ``insertRangeFrom``, and so on.
``Field`` doesn't have enough information about a specific data type for a table. For example, ``UInt8``, ``UInt16``, ``UInt32``, and ``UInt64`` are all represented as ``UInt64`` in a ``Field``.
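The following toy snippet shows the idea of such a discriminated union using ``std::variant``. The real ``Field`` is hand-written and also holds arrays, which are omitted here for brevity:

.. code-block:: cpp

    // Toy illustration of a Field-like discriminated union built on std::variant.
    // It only demonstrates that per-value access goes through temporary objects
    // that can carry any of the supported value kinds.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <variant>

    using ExampleField = std::variant<uint64_t, int64_t, double, std::string>;

    int main()
    {
        ExampleField f = uint64_t{42};          // UInt8..UInt64 would all widen to UInt64 here
        std::cout << std::get<uint64_t>(f) << '\n';
        f = std::string("hello");               // the same object can carry a string instead
        std::cout << std::get<std::string>(f) << '\n';
    }
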
Leaky abstractions
------------------
``IColumn`` has methods for common relational transformations of data, but they don't meet all needs. For example, ``ColumnUInt64`` doesn't have a method to calculate the sum of two columns, and ``ColumnString`` doesn't have a method to run a substring search. These countless routines are implemented outside of ``IColumn``.
Various functions on columns can be implemented in a generic, non-efficient way using ``IColumn`` methods to extract ``Field`` values, or in a specialized way using knowledge of the inner memory layout of data in a specific ``IColumn`` implementation. To do this, a function casts the column to a specific ``IColumn`` type and works with its internal representation directly. For example, ``ColumnUInt64`` has the ``getData`` method that returns a reference to an internal array, and a separate routine then reads or fills that array directly. In fact, we have "leaky abstractions" to allow efficient specializations of various routines.
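A hypothetical sketch of this pattern: a routine that first tries the fast path through a concrete column type and only falls back to generic per-row access. All names below are made up for illustration:

.. code-block:: cpp

    // Sketch of the "leaky abstraction" pattern: sumColumn() first tries the fast path
    // through the concrete column type and its direct getData() access, then falls back
    // to generic per-row access. All names are illustrative, not the real ClickHouse ones.
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    struct IExampleColumn
    {
        virtual ~IExampleColumn() = default;
        virtual size_t size() const = 0;
        virtual uint64_t getUInt(size_t n) const = 0;       // generic, slow per-row access
    };

    struct ExampleColumnUInt64 : IExampleColumn
    {
        std::vector<uint64_t> data;
        size_t size() const override { return data.size(); }
        uint64_t getUInt(size_t n) const override { return data[n]; }
        const std::vector<uint64_t> & getData() const { return data; }   // direct access to the array
    };

    uint64_t sumColumn(const IExampleColumn & column)
    {
        if (auto * concrete = dynamic_cast<const ExampleColumnUInt64 *>(&column))
            return std::accumulate(concrete->getData().begin(), concrete->getData().end(), uint64_t{0});

        uint64_t sum = 0;                                    // generic fallback
        for (size_t i = 0; i < column.size(); ++i)
            sum += column.getUInt(i);
        return sum;
    }
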
Data types
----------
``IDataType`` is responsible for serialization and deserialization: for reading and writing chunks of columns or individual values in binary or text form.
``IDataType`` directly corresponds to data types in tables. For example, there are ``DataTypeUInt32``, ``DataTypeDateTime``, ``DataTypeString`` and so on.
``IDataType`` and ``IColumn`` are only loosely related to each other. Different data types can be represented in memory by the same ``IColumn`` implementations. For example, ``DataTypeUInt32`` and ``DataTypeDateTime`` are both represented by ``ColumnUInt32`` or ``ColumnConstUInt32``. In addition, the same data type can be represented by different ``IColumn`` implementations. For example, ``DataTypeUInt8`` can be represented by ``ColumnUInt8`` or ``ColumnConstUInt8``.
``IDataType`` only stores metadata. For instance, ``DataTypeUInt8`` doesn't store anything at all (except vptr) and ``DataTypeFixedString`` stores just ``N`` (the size of fixed-size strings).
``IDataType`` has helper methods for various data formats. Examples are methods to serialize a value with possible quoting, to serialize a value for JSON, and to serialize a value as part of XML format. There is no direct correspondence to data formats. For example, the different data formats ``Pretty`` and ``TabSeparated`` can use the same ``serializeTextEscaped`` helper method from the ``IDataType`` interface.
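As a rough illustration of this division of labor, the made-up data type below renders the same value differently for plain text and for JSON output; escaping and the real interface are omitted:

.. code-block:: cpp

    // Made-up sketch: the data type decides how a value is rendered in each output
    // format, independent of how the column stores it in memory. Escaping is omitted.
    #include <iostream>
    #include <sstream>
    #include <string>

    struct ExampleDataTypeString
    {
        void serializeText(const std::string & value, std::ostream & out) const { out << value; }
        void serializeTextJSON(const std::string & value, std::ostream & out) const { out << '"' << value << '"'; }
    };

    int main()
    {
        ExampleDataTypeString type;
        std::ostringstream plain, json;
        type.serializeText("hello", plain);
        type.serializeTextJSON("hello", json);
        std::cout << plain.str() << ' ' << json.str() << '\n';   // prints: hello "hello"
    }
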
Block
-----
A ``Block`` is a container that represents a subset (chunk) of a table in memory. It is just a set of triples: ``(IColumn, IDataType, column name)``. During query execution, data is processed by ``Block`` objects. If we have a ``Block``, we have data (in the ``IColumn`` object), we have information about its type (in ``IDataType``) that tells us how to deal with that column, and we have the column name (either the original column name from the table, or some artificial name assigned for getting temporary results of calculations).
When we calculate some function over columns in a block, we add another column with its result to the block, and we don't touch columns for arguments of the function because operations are immutable. Later, unneeded columns can be removed from the block, but not modified. This is convenient for elimination of common subexpressions.
Blocks are created for every processed chunk of data. Note that for the same type of calculation, the column names and types remain the same for different blocks, and only column data changes. It is better to split block data from the block header, because small block sizes will have a high overhead of temporary strings for copying shared_ptrs and column names.
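Assuming the triple structure described above, a bare-bones sketch of a block might look like this (placeholder types, not the real headers):

.. code-block:: cpp

    // A block is conceptually just a set of (column, data type, name) triples.
    // Placeholder classes stand in for the real IColumn and IDataType.
    #include <memory>
    #include <string>
    #include <vector>

    struct IExampleColumn { virtual ~IExampleColumn() = default; };
    struct IExampleDataType { virtual ~IExampleDataType() = default; };

    struct ExampleColumnWithTypeAndName
    {
        std::shared_ptr<IExampleColumn> column;
        std::shared_ptr<IExampleDataType> type;
        std::string name;   // original column name or an artificial name for an intermediate result
    };

    struct ExampleBlock
    {
        std::vector<ExampleColumnWithTypeAndName> columns;

        // The result of a function is just another triple appended to the block;
        // existing columns are never modified in place.
        void insert(ExampleColumnWithTypeAndName col) { columns.push_back(std::move(col)); }
    };
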
Block Streams
-------------
Block streams are for processing data. We use streams of blocks to read data from somewhere, perform data transformations, or write data to somewhere. ``IBlockInputStream`` has the ``read`` method to fetch the next block while available. ``IBlockOutputStream`` has the ``write`` method to push the block somewhere.
Streams are responsible for:
#. Reading or writing to a table. The table just returns a stream for reading or writing blocks.
#. Implementing data formats. For example, if you want to output data to a terminal in ``Pretty`` format, you create a block output stream where you push blocks, and it formats them.
#. Performing data transformations. Let's say you have ``IBlockInputStream`` and want to create a filtered stream. You create ``FilterBlockInputStream`` and initialize it with your stream. Then when you pull a block from ``FilterBlockInputStream``, it pulls a block from your stream, filters it, and returns the filtered block to you. Query execution pipelines are represented this way.
There are more sophisticated transformations. For example, when you pull from ``AggregatingBlockInputStream``, it reads all data from its source, aggregates it, and then returns a stream of aggregated data for you. Another example: ``UnionBlockInputStream`` accepts many input sources in the constructor and also a number of threads. It launches multiple threads and reads from multiple sources in parallel.
Block streams use the "pull" approach to control flow: when you pull a block from the first stream, it consequently pulls the required blocks from nested streams, and the entire execution pipeline will work. Neither "pull" nor "push" is the best solution, because control flow is implicit, and that limits implementation of various features like simultaneous execution of multiple queries (merging many pipelines together). This limitation could be overcome with coroutines or just running extra threads that wait for each other. We may have more possibilities if we make control flow explicit: if we locate the logic for passing data from one calculation unit to another outside of those calculation units. Read this `nice article <http://journal.stuffwithstuff.com/2013/01/13/iteration-inside-and-out/>`_ for more thoughts.
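As a concrete illustration of the pull model and of wrapping one stream in another, here is a simplified sketch with placeholder types; the real streams work on ``Block`` objects and arbitrary filter expressions:

.. code-block:: cpp

    // Simplified sketch of the pull model: a filtering stream wraps another stream and
    // transforms blocks as they are pulled through it. ExampleBlock stands in for the
    // real Block, and the filter condition is hard-coded for brevity.
    #include <memory>
    #include <optional>
    #include <vector>

    struct ExampleBlock { std::vector<int> rows; };

    struct IExampleBlockInputStream
    {
        virtual ~IExampleBlockInputStream() = default;
        virtual std::optional<ExampleBlock> read() = 0;      // empty optional means end of data
    };

    class ExampleFilterBlockInputStream : public IExampleBlockInputStream
    {
    public:
        explicit ExampleFilterBlockInputStream(std::shared_ptr<IExampleBlockInputStream> input_)
            : input(std::move(input_)) {}

        std::optional<ExampleBlock> read() override
        {
            auto block = input->read();                      // pull a block from the nested stream
            if (!block)
                return std::nullopt;
            ExampleBlock filtered;
            for (int value : block->rows)                    // keep only rows matching the condition
                if (value > 0)
                    filtered.rows.push_back(value);
            return filtered;
        }

    private:
        std::shared_ptr<IExampleBlockInputStream> input;
    };
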
We should note that the query execution pipeline creates temporary data at each step. We try to keep block size small enough so that temporary data fits in the CPU cache. With that assumption, writing and reading temporary data is almost free in comparison with other calculations. We could consider an alternative, which is to fuse many operations in the pipeline together, to make the pipeline as short as possible and remove much of the temporary data. This could be an advantage, but it also has drawbacks. For example, a split pipeline makes it easy to implement caching intermediate data, stealing intermediate data from similar queries running at the same time, and merging pipelines for similar queries.
Formats
-------
Data formats are implemented with block streams. There are "presentational" formats only suitable for output of data to the client, such as ``Pretty`` format, which provides only ``IBlockOutputStream``. And there are input/output formats, such as ``TabSeparated`` or ``JSONEachRow``.
There are also row streams: ``IRowInputStream`` and ``IRowOutputStream``. They allow you to pull/push data by individual rows, not by blocks. And they are only needed to simplify implementation of row-oriented formats. The wrappers ``BlockInputStreamFromRowInputStream`` and ``BlockOutputStreamFromRowOutputStream`` allow you to convert row-oriented streams to regular block-oriented streams.
I/O
---
For byte-oriented input/output, there are ``ReadBuffer`` and ``WriteBuffer`` abstract classes. They are used instead of C++ ``iostream``'s. Don't worry: every mature C++ project is using something other than ``iostream``'s for good reasons.
``ReadBuffer`` and ``WriteBuffer`` are just a contiguous buffer and a cursor pointing to the position in that buffer. Implementations may own or not own the memory for the buffer. There is a virtual method to fill the buffer with the following data (for ``ReadBuffer``) or to flush the buffer somewhere (for ``WriteBuffer``). The virtual methods are rarely called.
Implementations of ``ReadBuffer``/``WriteBuffer`` are used for working with files and file descriptors and network sockets, for implementing compression (``CompressedWriteBuffer`` is initialized with another ``WriteBuffer`` and performs compression before writing data to it), and for other purposes; the names ``ConcatReadBuffer``, ``LimitReadBuffer``, and ``HashingWriteBuffer`` speak for themselves.
Read/WriteBuffers only deal with bytes. To help with formatted input/output (for instance, to write a number in decimal format), there are functions from ``ReadHelpers`` and ``WriteHelpers`` header files.
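A simplified sketch of the write-buffer idea under these assumptions (a contiguous buffer, a cursor, and a rarely called virtual method) might look as follows; the names mirror the description above but are not the real classes:

.. code-block:: cpp

    // Simplified sketch: writes go into a contiguous buffer through a cursor, and the
    // virtual next() method is called only when the buffer is full or on destruction.
    // These are illustrative classes, not the real ReadBuffer/WriteBuffer.
    #include <cstddef>
    #include <unistd.h>
    #include <vector>

    class ExampleWriteBuffer
    {
    public:
        explicit ExampleWriteBuffer(size_t size) : buffer(size), pos(0) {}
        virtual ~ExampleWriteBuffer() = default;

        void write(const char * data, size_t size)
        {
            for (size_t i = 0; i < size; ++i)
            {
                if (pos == buffer.size())
                    next();                                  // rarely called: only when the buffer is full
                buffer[pos++] = data[i];
            }
        }

    protected:
        virtual void next() = 0;                             // flush the buffer somewhere
        std::vector<char> buffer;
        size_t pos;
    };

    class ExampleWriteBufferFromFileDescriptor : public ExampleWriteBuffer
    {
    public:
        explicit ExampleWriteBufferFromFileDescriptor(int fd_) : ExampleWriteBuffer(4096), fd(fd_) {}
        ~ExampleWriteBufferFromFileDescriptor() override { next(); }

    protected:
        void next() override
        {
            if (pos > 0)
                (void)::write(fd, buffer.data(), pos);       // error handling omitted in this sketch
            pos = 0;
        }

    private:
        int fd;
    };
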
Let's look at what happens when you want to write a result set in ``JSON`` format to stdout. You have a result set ready to be fetched from ``IBlockInputStream``. You create ``WriteBufferFromFileDescriptor(STDOUT_FILENO)`` to write bytes to stdout. You create ``JSONRowOutputStream``, initialized with that ``WriteBuffer``, to write rows in ``JSON`` to stdout. You create ``BlockOutputStreamFromRowOutputStream`` on top of it, to represent it as ``IBlockOutputStream``. Then you call ``copyData`` to transfer data from ``IBlockInputStream`` to ``IBlockOutputStream``, and everything works. Internally, ``JSONRowOutputStream`` will write various JSON delimiters and call the ``IDataType::serializeTextJSON`` method with a reference to ``IColumn`` and the row number as arguments. Consequently, ``IDataType::serializeTextJSON`` will call a method from ``WriteHelpers.h``: for example, ``writeText`` for numeric types and ``writeJSONString`` for ``DataTypeString``.
Tables
------
Tables are represented by the ``IStorage`` interface. Different implementations of that interface are different table engines. Examples are ``StorageMergeTree``, ``StorageMemory``, and so on. Instances of these classes are just tables.
The most important ``IStorage`` methods are ``read`` and ``write``. There are also ``alter``, ``rename``, ``drop``, and so on. The ``read`` method accepts the following arguments: the set of columns to read from a table, the ``AST`` query to consider, and the desired number of streams to return. It returns one or multiple ``IBlockInputStream`` objects and information about the stage of data processing that was completed inside a table engine during query execution.
In most cases, the read method is only responsible for reading the specified columns from a table, not for any further data processing. All further data processing is done by the query interpreter and is outside the responsibility of ``IStorage``.
But there are notable exceptions:
- The AST query is passed to the ``read`` method and the table engine can use it to derive index usage and to read less data from a table.
- Sometimes the table engine can process data itself to a specific stage. For example, ``StorageDistributed`` can send a query to remote servers, ask them to process data to a stage where data from different remote servers can be merged, and return that preprocessed data.
The query interpreter then finishes processing the data.
The table's ``read`` method can return multiple ``IBlockInputStream`` objects to allow parallel data processing. These multiple block input streams can read from a table in parallel. Then you can wrap these streams with various transformations (such as expression evaluation or filtering) that can be calculated independently and create a ``UnionBlockInputStream`` on top of them, to read from multiple streams in parallel.
There are also ``TableFunction``s. These are functions that return a temporary ``IStorage`` object to use in the ``FROM`` clause of a query.
To get a quick idea of how to implement your own table engine, look at something simple, like ``StorageMemory`` or ``StorageTinyLog``.
As the result of the ``read`` method, ``IStorage`` returns ``QueryProcessingStage`` information about what parts of the query were already calculated inside storage. Currently we have only very coarse granularity for that information. There is no way for the storage to say "I have already processed this part of the expression in WHERE, for this range of data". We need to work on that.
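Putting the description of ``read`` together, a hypothetical, heavily simplified storage interface could be sketched like this (illustrative names and signatures only, not the real ``IStorage``):

.. code-block:: cpp

    // Heavily simplified sketch of a storage interface with a read() method as described
    // above. All names here are illustrative placeholders, not the real ClickHouse classes.
    #include <memory>
    #include <string>
    #include <vector>

    struct IExampleBlockInputStream { virtual ~IExampleBlockInputStream() = default; };
    struct IExampleBlockOutputStream { virtual ~IExampleBlockOutputStream() = default; };
    struct ExampleAST {};                                    // parsed query
    using ExampleASTPtr = std::shared_ptr<ExampleAST>;

    struct IExampleStorage
    {
        virtual ~IExampleStorage() = default;

        /// Returns one or more streams that can be read from in parallel.
        virtual std::vector<std::shared_ptr<IExampleBlockInputStream>> read(
            const std::vector<std::string> & column_names,   // only these columns are read
            const ExampleASTPtr & query,                     // the engine may use it to derive index usage
            unsigned max_streams) = 0;                       // desired number of parallel streams

        virtual std::shared_ptr<IExampleBlockOutputStream> write(const ExampleASTPtr & query) = 0;
    };
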
Parsers
-------
A query is parsed by a hand-written recursive descent parser. For example, ``ParserSelectQuery`` just recursively calls the underlying parsers for various parts of the query. Parsers create an ``AST``. The ``AST`` is represented by nodes, which are instances of ``IAST``.
Parser generators are not used for historical reasons.
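For illustration only, here is a toy recursive descent parser that recognizes queries of the form ``SELECT <identifier>`` and builds a tiny AST; the real ``ParserSelectQuery`` and ``IAST`` are far more elaborate:

.. code-block:: cpp

    // Toy recursive descent parser for queries of the form "SELECT <identifier>".
    // Names here are made up; it only shows the general shape of hand-written parsing.
    #include <cctype>
    #include <memory>
    #include <string>
    #include <vector>

    struct IExampleAST
    {
        virtual ~IExampleAST() = default;
        std::vector<std::shared_ptr<IExampleAST>> children;
    };

    struct ExampleASTIdentifier : IExampleAST { std::string name; };
    struct ExampleASTSelectQuery : IExampleAST {};

    std::shared_ptr<IExampleAST> parseTrivialSelect(const std::string & query)
    {
        size_t pos = 0;
        auto skip_spaces = [&] { while (pos < query.size() && std::isspace(static_cast<unsigned char>(query[pos]))) ++pos; };

        skip_spaces();
        if (query.compare(pos, 6, "SELECT") != 0)            // expect the SELECT keyword
            return nullptr;
        pos += 6;
        skip_spaces();

        auto identifier = std::make_shared<ExampleASTIdentifier>();
        while (pos < query.size() && (std::isalnum(static_cast<unsigned char>(query[pos])) || query[pos] == '_'))
            identifier->name += query[pos++];
        if (identifier->name.empty())                        // expect exactly one identifier
            return nullptr;

        auto select = std::make_shared<ExampleASTSelectQuery>();
        select->children.push_back(identifier);              // the identifier becomes a child node
        return select;
    }
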
Interpreters
------------
Interpreters are responsible for creating the query execution pipeline from an ``AST``. There are simple interpreters, such as ``InterpreterExistsQuery`` and ``InterpreterDropQuery``, and the more sophisticated ``InterpreterSelectQuery``. The query execution pipeline is a combination of block input or output streams. For example, the result of interpreting the ``SELECT`` query is the ``IBlockInputStream`` to read the result set from; the result of the ``INSERT`` query is the ``IBlockOutputStream`` to write data for insertion to; and the result of interpreting the ``INSERT SELECT`` query is the ``IBlockInputStream`` that returns an empty result set on the first read, but that copies data from ``SELECT`` to ``INSERT`` at the same time.
``InterpreterSelectQuery`` uses the ``ExpressionAnalyzer`` and ``ExpressionActions`` machinery for query analysis and transformations. This is where most rule-based query optimizations are done. ``ExpressionAnalyzer`` is quite messy and should be rewritten: various query transformations and optimizations should be extracted into separate classes to allow modular transformations of the query.
Functions
---------
There are ordinary functions and aggregate functions. For aggregate functions, see the next section.
Ordinary functions don't change the number of rows; they work as if they were processing each row independently. In fact, functions are not called for individual rows, but for whole blocks (``Block``) of data to implement vectorized query execution.
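A minimal sketch of what vectorized execution of an ordinary function means in practice: a hypothetical ``plus`` over two unsigned integer columns is one tight loop over whole chunks of data rather than a per-row call:

.. code-block:: cpp

    // Hypothetical "plus" over two UInt64 columns: one tight loop over the whole chunk
    // of data, which the compiler can auto-vectorize (SIMD), instead of a per-row call.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::vector<uint64_t> examplePlus(const std::vector<uint64_t> & a, const std::vector<uint64_t> & b)
    {
        std::vector<uint64_t> result(a.size());              // assumes a.size() == b.size()
        for (size_t i = 0; i < a.size(); ++i)
            result[i] = a[i] + b[i];
        return result;
    }
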
There are some miscellaneous functions, like ``blockSize``, ``rowNumberInBlock``, and ``runningAccumulate``, that exploit block processing and violate the independence of rows.
ClickHouse has strong typing, so implicit type conversion doesn't occur. If a function doesn't support a specific combination of types, an exception will be thrown. But functions can work (be overloaded) for many different combinations of types. For example, the ``plus`` function (to implement the ``+`` operator) works for any combination of numeric types: ``UInt8`` + ``Float32``, ``UInt16`` + ``Int8``, and so on. Also, some variadic functions can accept any number of arguments, such as the ``concat`` function.
Implementing a function may be slightly inconvenient because a function explicitly dispatches supported data types and supported ``IColumns``. For example, the ``plus`` function has code generated by instantiation of a C++ template for each combination of numeric types, and for constant or non-constant left and right arguments.
This is a nice place to implement runtime code generation to avoid template code bloat. Also, it will make it possible to add fused functions like fused multiply-add, or to make multiple comparisons in one loop iteration.
Due to vectorized query execution, functions are not short-circuited. For example, if you write ``WHERE f(x) AND g(y)``, both sides will be calculated, even for rows where ``f(x)`` is zero (except when ``f(x)`` is a zero constant expression). But if the selectivity of the ``f(x)`` condition is high, and calculation of ``f(x)`` is much cheaper than ``g(y)``, it's better to implement multi-pass calculation: first calculate ``f(x)``, then filter columns by the result, and then calculate ``g(y)`` only for smaller, filtered chunks of data.
Aggregate Functions
-------------------
Aggregate functions are stateful functions. They accumulate passed values into some state, and allow you to get results from that state. They are managed with the ``IAggregateFunction`` interface. States can be rather simple (the state for ``AggregateFunctionCount`` is just a single ``UInt64`` value) or quite complex (the state of ``AggregateFunctionUniqCombined`` is a combination of a linear array, a hash table and a ``HyperLogLog`` probabilistic data structure).
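A bare-bones sketch of such a state for a simple average (illustrative only, not the real ``IAggregateFunction`` interface): it accumulates values, can be merged with another partial state, and produces a final result:

.. code-block:: cpp

    // Illustrative state for a simple average: it accumulates values, can be merged with
    // another partial state (for example, one computed on another server), and produces
    // a final result on demand.
    #include <cstdint>

    struct ExampleAvgState
    {
        uint64_t count = 0;
        double sum = 0.0;

        void add(double value)                       // accumulate one value into the state
        {
            ++count;
            sum += value;
        }

        void merge(const ExampleAvgState & other)    // combine partial aggregation states
        {
            count += other.count;
            sum += other.sum;
        }

        double result() const { return count ? sum / count : 0.0; }
    };
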
To deal with multiple states while executing a high-cardinality ``GROUP BY`` query, states are allocated in ``Arena`` (a memory pool), or they could be allocated in any suitable piece of memory. States can have a non-trivial constructor and destructor: for example, complex aggregation states can allocate additional memory themselves. This requires some attention to creating and destroying states and properly passing their ownership, to keep track of who and when will destroy states.
Aggregation states can be serialized and deserialized to pass them over the network during distributed query execution, or to write them to disk when there is not enough RAM. They can even be stored in a table with the ``DataTypeAggregateFunction`` data type to allow incremental aggregation of data.
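For example, intermediate states can be produced with the ``-State`` combinator, stored in ``AggregateFunction`` columns, and finalized later with ``-Merge`` (a sketch; the table and column names are made up):

.. code-block:: sql

    CREATE TABLE test.visits_agg
    (
        StartDate Date,
        CounterID UInt32,
        Users AggregateFunction(uniq, UInt64)
    ) ENGINE = AggregatingMergeTree(StartDate, (CounterID, StartDate), 8192)

    INSERT INTO test.visits_agg
        SELECT StartDate, CounterID, uniqState(UserID)
        FROM test.visits
        GROUP BY StartDate, CounterID

    SELECT CounterID, uniqMerge(Users) FROM test.visits_agg GROUP BY CounterID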
The serialized data format for aggregate function states is not versioned right now. This is ok if aggregate states are only stored temporarily. But we have the ``AggregatingMergeTree`` table engine for incremental aggregation, and people are already using it in production. This is why we should add support for backward compatibility when changing the serialized format for any aggregate function in the future.
Server
------
The server implements several different interfaces:
- An HTTP interface for any foreign clients.
- A TCP interface for the native ClickHouse client and for cross-server communication during distributed query execution.
- An interface for transferring data for replication.
Internally, it is just a basic multithreaded server without coroutines, fibers, etc. The server is not designed to process a high rate of simple queries, but rather a relatively low rate of complex queries, each of which can process a vast amount of data for analytics.
The server initializes the ``Context`` class with the necessary environment for query execution: the list of available databases, users and access rights, settings, clusters, the process list, the query log, and so on. This environment is used by interpreters.
We maintain full backward and forward compatibility for the server TCP protocol: old clients can talk to new servers and new clients can talk to old servers. But we don't want to maintain it eternally, and we are removing support for old versions after about one year.
For all external applications, we recommend using the HTTP interface because it is simple and easy to use. The TCP protocol is more tightly linked to internal data structures: it uses an internal format for passing blocks of data and it uses custom framing for compressed data. We haven't released a C library for that protocol because it requires linking most of the ClickHouse codebase, which is not practical.
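For example, assuming a server running locally on the default HTTP port 8123, a query can be sent with any HTTP client:

.. code-block:: bash

    curl 'http://localhost:8123/' --data-binary 'SELECT 1'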
Distributed Query Execution
---------------------------
Servers in a cluster setup are mostly independent. You can create a ``Distributed`` table on one or all servers in a cluster. The ``Distributed`` table does not store data itself; it only provides a "view" to all local tables on multiple nodes of a cluster. When you ``SELECT`` from a ``Distributed`` table, it rewrites that query, chooses remote nodes according to load balancing settings, and sends the query to them. The ``Distributed`` table asks remote servers to process a query just up to a stage where intermediate results from different servers can be merged. Then it receives the intermediate results and merges them. The distributed table tries to distribute as much work as possible to remote servers and does not send much intermediate data over the network.
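A sketch of such a setup (the cluster name ``my_cluster`` and the table names are hypothetical; the cluster is defined in the server configuration, and a local table ``hits_local`` is assumed to exist on every node):

.. code-block:: sql

    CREATE TABLE hits_all AS hits_local
        ENGINE = Distributed(my_cluster, default, hits_local, rand())

    SELECT count() FROM hits_all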
Things become more complicated when you have subqueries in ``IN`` or ``JOIN`` clauses and each of them uses a ``Distributed`` table. We have different strategies for the execution of these queries.
There is no global query plan for distributed query execution. Each node has its own local query plan for its part of the job. We only have simple one-pass distributed query execution: we send queries to remote nodes and then merge the results. But this is not feasible for difficult queries with a high-cardinality ``GROUP BY`` or with a large amount of temporary data for a ``JOIN``: in such cases, we need to "reshuffle" data between servers, which requires additional coordination. ClickHouse does not support that kind of query execution, and we need to work on it.
Merge Tree
----------
``MergeTree`` is a family of storage engines that supports indexing by primary key. The primary key can be an arbitrary tuple of columns or expressions. Data in a ``MergeTree`` table is stored in "parts". Each part stores data in the primary key order (data is ordered lexicographically by the primary key tuple). All the table columns are stored in separate ``column.bin`` files in these parts. The files consist of compressed blocks. Each block is usually from 64 KB to 1 MB of uncompressed data, depending on the average value size. The blocks consist of column values placed contiguously one after the other. Column values are in the same order for each column (the order is defined by the primary key), so when you iterate over many columns, you get values for the corresponding rows.
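For reference, a minimal table with this engine might look like the following (the column names are examples; the engine parameters are the date column, the primary key tuple and the index granularity):

.. code-block:: sql

    CREATE TABLE hits_local
    (
        EventDate Date,
        CounterID UInt32,
        UserID UInt64
    ) ENGINE = MergeTree(EventDate, (CounterID, EventDate), 8192)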
The primary key itself is "sparse". It doesn't address each single row, but only some ranges of data. A separate ``primary.idx`` file has the value of the primary key for each N-th row, where N is called ``index_granularity`` (usually, N = 8192). Also, for each column, we have ``column.mrk`` files with "marks," which are offsets to each N-th row in the data file. Each mark is a pair: the offset in the file to the beginning of the compressed block, and the offset in the decompressed block to the beginning of data. Usually compressed blocks are aligned by marks, and the offset in the decompressed block is zero. Data for ``primary.idx`` always resides in memory and data for ``column.mrk`` files is cached.
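Schematically, the contents of one part directory might look like this (a simplified, hypothetical example with two columns; the directory name encodes the covered date range, block numbers and merge level):

.. code-block:: text

    20170601_20170607_1_10_2/
        checksums.txt
        columns.txt
        primary.idx        # primary key values for every index_granularity-th row
        CounterID.bin      # compressed column data
        CounterID.mrk      # marks: offsets into CounterID.bin
        EventDate.bin
        EventDate.mrk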
When we are going to read something from a part in ``MergeTree``, we look at ``primary.idx`` data and locate ranges that could possibly contain the requested data, then look at ``column.mrk`` data and calculate offsets for where to start reading those ranges. Because of sparseness, excess data may be read. ClickHouse is not suitable for a high load of simple point queries, because the entire range with ``index_granularity`` rows must be read for each key, and the entire compressed block must be decompressed for each column. We made the index sparse because we must be able to maintain trillions of rows per single server without noticeable memory consumption for the index. Also, because the primary key is sparse, it is not unique: the existence of a key cannot be checked at ``INSERT`` time. You could have many rows with the same key in a table.
When you ``INSERT`` a bunch of data into ``MergeTree``, that bunch is sorted by primary key order and forms a new part. To keep the number of parts relatively low, there are background threads that periodically select some parts and merge them into a single sorted part. That's why it is called ``MergeTree``. Of course, merging leads to "write amplification". All parts are immutable: they are only created and deleted, but not modified. When a ``SELECT`` query is run, it holds a snapshot of the table (a set of parts). After merging, we also keep old parts for some time to make recovery after failure easier, so if we see that some merged part is probably broken, we can replace it with its source parts.
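Merges normally happen in the background, but an unscheduled merge can also be requested manually (the table name is an example):

.. code-block:: sql

    OPTIMIZE TABLE hits_local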
``MergeTree`` is not an LSM tree because it doesn't contain a "memtable" and a "log": inserted data is written directly to the filesystem. This makes it suitable only for inserting data in batches, not by individual rows and not too frequently: about once per second is ok, but a thousand times a second is not. We did it this way for simplicity's sake, and because we are already inserting data in batches in our applications.
MergeTree tables can only have one (primary) index: there aren't any secondary indices. It would be nice to allow multiple physical representations under one logical table, for example, to store data in more than one physical order or even to allow representations with pre-aggregated data along with original data.
There are ``MergeTree`` engines that do additional work during background merges. Examples are ``CollapsingMergeTree`` and ``AggregatingMergeTree``. This could be treated as special support for updates. Keep in mind that these are not real updates, because users usually have no control over when background merges are executed, and data in a ``MergeTree`` table is almost always stored in more than one part, not in completely merged form.
Replication
-----------
Replication in ClickHouse is implemented on a per-table basis. You could have some replicated and some non-replicated tables on the same server. You could also have tables replicated in different ways, such as one table with two replicas and another with three.
Replication is implemented in the ``ReplicatedMergeTree`` storage engine. The path in ``ZooKeeper`` is specified as a parameter for the storage engine. All tables with the same path in ``ZooKeeper`` become replicas of each other: they synchronise their data and maintain consistency. Replicas can be added and removed dynamically simply by creating or dropping a table.
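A sketch of a replicated table definition (the ZooKeeper path and the ``{shard}``/``{replica}`` macros are examples; macros are usually substituted from the server configuration):

.. code-block:: sql

    CREATE TABLE hits_replica
    (
        EventDate Date,
        CounterID UInt32,
        UserID UInt64
    ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits', '{replica}',
        EventDate, (CounterID, EventDate), 8192)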
Replication uses an asynchronous multi-master scheme. You can insert data into any replica that has a session with ``ZooKeeper``, and data is replicated to all other replicas asynchronously. Because ClickHouse doesn't support UPDATEs, replication is conflict-free. As there is no quorum acknowledgment of inserts, just-inserted data might be lost if one node fails.
Metadata for replication is stored in ZooKeeper. There is a replication log that lists what actions to do. Actions are: get part; merge parts; drop partition, etc. Each replica copies the replication log to its queue and then executes the actions from the queue. For example, on insertion, the "get part" action is created in the log, and every replica downloads that part. Merges are coordinated between replicas to get byte-identical results. All parts are merged in the same way on all replicas. To achieve this, one replica is elected as the leader, and that replica initiates merges and writes "merge parts" actions to the log.
Replication is physical: only compressed parts are transferred between nodes, not queries. To lower the network cost (to avoid network amplification), merges are processed on each replica independently in most cases. Large merged parts are sent over the network only in cases of significant replication lag.
In addition, each replica stores its state in ZooKeeper as the set of parts and their checksums. When the state on the local filesystem diverges from the reference state in ZooKeeper, the replica restores its consistency by downloading missing and broken parts from other replicas. When there is some unexpected or broken data in the local filesystem, ClickHouse does not remove it, but moves it to a separate directory and forgets about it.
The ClickHouse cluster consists of independent shards, and each shard consists of replicas. The cluster is not elastic, so after adding a new shard, data is not rebalanced between shards automatically. Instead, the cluster load becomes uneven. This implementation gives you more control, and it is fine for relatively small clusters of tens of nodes. But for clusters with hundreds of nodes that we are using in production, this approach becomes a significant drawback. We should implement a table engine that spans its data across the cluster with dynamically replicated regions that could be split and balanced between servers automatically.


@ -0,0 +1,195 @@
How to build ClickHouse on Linux
================================
The build should work on Ubuntu Linux 12.04, 14.04 or newer.
With appropriate changes, it should also work on any other Linux distribution.
The build is not intended to work on Mac OS X.
Only x86_64 with SSE 4.2 is supported. Support for AArch64 is experimental.
To check for SSE 4.2 support, run:
.. code-block:: bash
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
Install Git and CMake
---------------------
.. code-block:: bash
sudo apt-get install git cmake
Detect number of threads
------------------------
.. code-block:: bash
export THREADS=$(grep -c ^processor /proc/cpuinfo)
Install GCC 6
-------------
There are several ways to do it.
Install from PPA package
~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
sudo apt-get install software-properties-common
sudo apt-add-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-6 g++-6
Install from sources
~~~~~~~~~~~~~~~~~~~~
Example:
.. code-block:: bash
# Download gcc from https://gcc.gnu.org/mirrors.html
wget ftp://ftp.fu-berlin.de/unix/languages/gcc/releases/gcc-6.2.0/gcc-6.2.0.tar.bz2
tar xf gcc-6.2.0.tar.bz2
cd gcc-6.2.0
./contrib/download_prerequisites
cd ..
mkdir gcc-build
cd gcc-build
../gcc-6.2.0/configure --enable-languages=c,c++
make -j $THREADS
sudo make install
hash gcc g++
gcc --version
sudo ln -s /usr/local/bin/gcc /usr/local/bin/gcc-6
sudo ln -s /usr/local/bin/g++ /usr/local/bin/g++-6
sudo ln -s /usr/local/bin/gcc /usr/local/bin/cc
sudo ln -s /usr/local/bin/g++ /usr/local/bin/c++
# /usr/local/bin/ should be in $PATH
Use GCC 6 for builds
--------------------
.. code-block:: bash
export CC=gcc-6
export CXX=g++-6
Install required libraries from packages
----------------------------------------
.. code-block:: bash
sudo apt-get install libicu-dev libreadline-dev libmysqlclient-dev libssl-dev unixodbc-dev
Checkout ClickHouse sources
---------------------------
To get the latest stable version:
.. code-block:: bash
git clone -b stable git@github.com:yandex/ClickHouse.git
# or: git clone -b stable https://github.com/yandex/ClickHouse.git
cd ClickHouse
For development, switch to the ``master`` branch.
For the latest release candidate, switch to the ``testing`` branch.
Build ClickHouse
----------------
There are two build variants.
Build release package
~~~~~~~~~~~~~~~~~~~~~
Install the prerequisites for building Debian packages.
.. code-block:: bash
sudo apt-get install devscripts dupload fakeroot debhelper
Install a recent version of clang.
Clang is embedded into the ClickHouse package and used at runtime. The minimum version is 3.8.0. It is optional.
You can build clang from sources:
.. code-block:: bash
cd ..
sudo apt-get install subversion
mkdir llvm
cd llvm
svn co http://llvm.org/svn/llvm-project/llvm/tags/RELEASE_400/final llvm
cd llvm/tools
svn co http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_400/final clang
cd ..
cd projects/
svn co http://llvm.org/svn/llvm-project/compiler-rt/tags/RELEASE_400/final compiler-rt
cd ../..
mkdir build
cd build/
cmake -D CMAKE_BUILD_TYPE:STRING=Release ../llvm
make -j $THREADS
sudo make install
hash clang
Or install it from packages. On Ubuntu 16.04 or newer:
.. code-block:: bash
sudo apt-get install clang
You may also build ClickHouse with clang for development purposes.
For production releases, GCC is used.
Run the release script:
.. code-block:: bash
rm -f ../clickhouse*.deb
./release
You will find the built packages in the parent directory:
.. code-block:: bash
ls -l ../clickhouse*.deb
Note that using the Debian packages is not required.
ClickHouse has no runtime dependencies except libc, so it can work on almost any Linux distribution.
To install the just-built packages on a development server:
.. code-block:: bash
sudo dpkg -i ../clickhouse*.deb
sudo service clickhouse-server start
Build to work with code
~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
mkdir build
cd build
cmake ..
make -j $THREADS
cd ..
To create an executable, run ``make clickhouse``.
This will create the ``dbms/src/Server/clickhouse`` executable, which can be used with the ``--client`` or ``--server`` arguments.
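For example, assuming the build above was run from the repository root, the server and the client can be started directly (the config path is an example):

.. code-block:: bash

    # start the server
    ./build/dbms/src/Server/clickhouse --server --config-file=dbms/src/Server/config.xml

    # in another terminal, connect with the client
    ./build/dbms/src/Server/clickhouse --client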


@ -3,7 +3,7 @@
# How to build ClickHouse under debian-based systems (ubuntu)
# apt install -y curl sudo
# curl https://raw.githubusercontent.com/yandex/ClickHouse/master/doc/build_debian.sh | sh
# curl https://raw.githubusercontent.com/yandex/ClickHouse/master/docs/en/development/build_debian.sh | sh
# install compiler and libs
sudo apt install -y git bash cmake gcc-6 g++-6 libicu-dev libreadline-dev libmysqlclient-dev unixodbc-dev libltdl-dev libssl-dev


@ -16,7 +16,7 @@
# Variant 3: Manual build:
# pkg install -y curl sudo
# curl https://raw.githubusercontent.com/yandex/ClickHouse/master/doc/build_freebsd.sh | sh
# curl https://raw.githubusercontent.com/yandex/ClickHouse/master/docs/en/development/build_freebsd.sh | sh
# install compiler and libs
sudo pkg install devel/git devel/cmake shells/bash devel/icu devel/libltdl databases/unixODBC devel/google-perftools devel/libzookeeper devel/libdouble-conversion archivers/zstd archivers/liblz4 devel/sparsehash devel/re2


@ -0,0 +1,52 @@
How to build ClickHouse on Mac OS X
===================================
The build should work on Mac OS X 10.12. If you're using an earlier version, you can try building ClickHouse with Gentoo Prefix and clang, following a separate instruction.
With appropriate changes, the build should also work on other OS X versions.
Install Homebrew
----------------
.. code-block:: bash
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install required compilers, tools, libraries
--------------------------------------------
.. code-block:: bash
brew install cmake gcc icu4c mysql openssl unixodbc libtool gettext homebrew/dupes/zlib readline boost --cc=gcc-6
Checkout ClickHouse sources
---------------------------
To get the latest stable version:
.. code-block:: bash
git clone -b stable --recursive --depth=10 git@github.com:yandex/ClickHouse.git
# or: git clone -b stable --recursive --depth=10 https://github.com/yandex/ClickHouse.git
cd ClickHouse
For development, switch to the ``master`` branch.
For the latest release candidate, switch to the ``testing`` branch.
Build ClickHouse
----------------
.. code-block:: bash
mkdir build
cd build
cmake .. -DCMAKE_CXX_COMPILER=`which g++-6` -DCMAKE_C_COMPILER=`which gcc-6`
make -j `sysctl -n hw.ncpu`
cd ..
Caveats
-------
If you intend to run clickhouse-server, make sure to increase the system's maxfiles limit. See `MacOS.md <https://github.com/yandex/ClickHouse/blob/master/MacOS.md>`_ for more details.


@ -0,0 +1,7 @@
ClickHouse Development
======================
.. toctree::
:glob:
*


@ -0,0 +1,734 @@
.. role:: strike
:class: strike
How to write C++ code
=====================
General
-------
#. This text should be considered a set of recommendations.
#. If you edit some code, it makes sense to keep the style of your changes consistent with the rest of that code.
#. Style is needed to keep the code consistent. Consistency makes it easier (more convenient) to read the code, and it also helps with code navigation.
#. Many rules do not have any logical explanation and just come from existing practice.
Formatting
----------
#. Most of the formatting is done automatically by ``clang-format``.
#. Indent is 4 spaces wide. Configure your IDE to insert 4 spaces on pressing Tab.
#. Curly braces on separate lines.
.. code-block:: cpp
inline void readBoolText(bool & x, ReadBuffer & buf)
{
char tmp = '0';
readChar(tmp, buf);
x = tmp != '0';
}
#. But if the function body is short enough (a single statement), you can put the whole thing on one line. In this case, put spaces around the curly braces, except for the last one at the end.
.. code-block:: cpp
inline size_t mask() const { return buf_size() - 1; }
inline size_t place(HashValue x) const { return x & mask(); }
#. For functions, there are no spaces around the parentheses.
.. code-block:: cpp
void reinsert(const Value & x)
.. code-block:: cpp
memcpy(&buf[place_value], &x, sizeof(x));
#. In ``if``, ``for``, ``while`` and similar statements (in contrast to function calls), there should be a space before the opening parenthesis.
.. code-block:: cpp
for (size_t i = 0; i < rows; i += storage.index_granularity)
#. There should be spaces around binary operators (+, -, \*, /, %, ...) and around the ternary operator ``?:``.
.. code-block:: cpp
UInt16 year = (s[0] - '0') * 1000 + (s[1] - '0') * 100 + (s[2] - '0') * 10 + (s[3] - '0');
UInt8 month = (s[5] - '0') * 10 + (s[6] - '0');
UInt8 day = (s[8] - '0') * 10 + (s[9] - '0');
#. If there's a line break, the operator is written on the new line with additional indent.
.. code-block:: cpp
if (elapsed_ns)
message << " ("
<< rows_read_on_server * 1000000000 / elapsed_ns << " rows/s., "
<< bytes_read_on_server * 1000.0 / elapsed_ns << " MB/s.) ";
#. It is ok to insert additional spaces to align the code.
.. code-block:: cpp
dst.ClickLogID = click.LogID;
dst.ClickEventID = click.EventID;
dst.ClickGoodEvent = click.GoodEvent;
#. No spaces around the ``.`` and ``->`` operators.
If necessary, these operators can be moved to the next line with additional indent.
#. Unary operators (``--, ++, *, &``, ...) are not separated from the argument by a space.
#. A space is put after a comma or semicolon, not before.
#. The ``[]`` operator is not surrounded by spaces.
#. In ``template <...>``, put a space between ``template`` and ``<``; no space after ``<`` or before ``>``.
.. code-block:: cpp
template <typename TKey, typename TValue>
struct AggregatedStatElement
#. In classes and structs, the ``public``, ``private`` and ``protected`` keywords are written at the same indentation level as ``class``/``struct``, while the rest of the contents are indented deeper.
.. code-block:: cpp
template <typename T, typename Ptr = std::shared_ptr<T>>
class MultiVersion
{
public:
/// A specific version of the object to use. The shared_ptr manages the lifetime of the version.
using Version = Ptr;
#. If there's only one namespace in a file and there's nothing else significant, there is no need to indent inside the namespace.
#. If the body of an ``if``, ``for``, ``while``, etc. block consists of only one statement, it's not required to wrap it in curly braces; instead, you can put the statement on a separate line. This statement can itself be an ``if``, ``for``, ``while``, etc. block. But if the inner statement contains curly braces or ``else``, this option should not be used.
.. code-block:: cpp
/// If the files are not open, open them.
if (streams.empty())
for (const auto & name : column_names)
streams.emplace(name, std::make_unique<Stream>(
storage.files[name].data_file.path(),
storage.files[name].marks[mark_number].offset));
#. No trailing whitespace at the end of lines.
#. Source code should be in UTF-8 encoding.
#. It's ok to have non-ASCII characters in string literals.
.. code-block:: cpp
<< ", " << (timer.elapsed() / chunks_stats.hits) << " μsec/hit.";
#. Don't put multiple statements on a single line.
#. Inside functions, do not separate logical blocks by more than one empty line.
#. Functions, classes and similar constructs are separated by one or two empty lines.
#. ``const`` (related to the value) is written before the type name.
.. code-block:: cpp
const char * pos
.. code-block:: cpp
const std::string & s
:strike:`char const * pos`
#. When declaring a pointer or reference, the \* and & symbols should be surrounded by spaces.
.. code-block:: cpp
const char * pos
:strike:`const char\* pos`
:strike:`const char \*pos`
#. Alias template types with the ``using`` keyword (except in the simplest cases). It can be declared even locally, for example inside functions.
.. code-block:: cpp
using FileStreams = std::map<std::string, std::shared_ptr<Stream>>;
FileStreams streams;
:strike:`std::map<std::string, std::shared_ptr<Stream>> streams;`
#. Do not declare several variables of different types in one statement.
:strike:`int x, *y;`
#. C-style casts should be avoided.
:strike:`std::cerr << (int)c << std::endl;`
.. code-block:: cpp
std::cerr << static_cast<int>(c) << std::endl;
#. In classes and structs, group members and functions separately inside each visibility scope.
#. For small classes and structs, it is not necessary to split method declarations and implementations.
The same goes for small methods.
For templated classes and structs, it is better not to split declarations and implementations (because they have to be defined in the same translation unit anyway).
#. Lines should be wrapped at 140 characters, not 80.
#. Always use prefix increment/decrement if postfix is not required.
.. code-block:: cpp
for (Names::const_iterator it = column_names.begin(); it != column_names.end(); ++it)
Comments
--------
#. You should write comments in all non-trivial places.
This is very important. While writing a comment, you might even realize that the code does the wrong thing or is completely unnecessary.
.. code-block:: cpp
/** Part of piece of memory, that can be used.
* For example, if internal_buffer is 1MB, and there was only 10 bytes loaded to buffer from file for reading,
* then working_buffer will have size of only 10 bytes
* (working_buffer.end() will point to position right after those 10 bytes available for read).
*/
#. Comments can be as detailed as necessary.
#. Comments are written before the relevant code. In rare cases, they come after it on the same line.
.. code-block:: cpp
/** Parses and executes the query.
*/
void executeQuery(
ReadBuffer & istr, /// Where to read the query from (and data for INSERT, if applicable)
WriteBuffer & ostr, /// Where to write the result
Context & context, /// DB, tables, data types, engines, functions, aggregate functions...
BlockInputStreamPtr & query_plan, /// Here a description of how the query was executed may be written
QueryProcessingStage::Enum stage = QueryProcessingStage::Complete); /// Up to which stage to process the SELECT query
#. Comments should be written in English only.
#. When writing a library, put its detailed description in its main header file.
#. You shouldn't write comments that don't provide additional information. For instance, you *CAN'T* write empty comments like this one:
.. code-block:: cpp
/*
* Procedure Name:
* Original procedure name:
* Author:
* Date of creation:
* Dates of modification:
* Modification authors:
* Original file name:
* Purpose:
* Intent:
* Designation:
* Classes used:
* Constants:
* Local variables:
* Parameters:
* Date of creation:
* Purpose:
*/
(the example is borrowed from http://home.tamk.fi/~jaalto/course/coding-style/doc/unmaintainable-code/)
#. You shouldn't write garbage comments (author, creation date...) at the beginning of each file.
#. Single-line comments should start with three slashes: ``///``; multiline comments with ``/**``. These comments are considered "documenting".
Note: such comments could be used to generate documentation with Doxygen. But in practice Doxygen is not used, because it is more convenient to use an IDE for code navigation.
#. At the beginning and end of multiline comments there should be no empty lines (except the line that closes the comment).
#. For commented-out code, use simple comments, not "documenting" ones. Delete commented-out code before committing.
#. Do not use profanity in comments or code.
#. Do not use too many question marks, exclamation points or capital letters.
:strike:`/// WHAT THE FAIL???`
#. Do not make delimiters out of comments.
:strike:`/*******************************************************/`
#. Do not create discussions in comments.
:strike:`/// Why you did this?`
#. Do not comment the end of a block to describe what kind of block it was.
:strike:`} /// for`
Names
-----
#. Names of variables and class members — in lowercase with underscores.
.. code-block:: cpp
size_t max_block_size;
#. Names of functions (methods) - in camelCase starting with a lowercase letter.
.. code-block:: cpp
std::string getName() const override { return "Memory"; }
#. Names of classes (structs) - CamelCase starting with an uppercase letter. Prefixes are not used, except ``I`` for interfaces.
.. code-block:: cpp
class StorageMemory : public IStorage
#. Names of ``using`` aliases - same as classes; they can have a _t suffix.
#. Names of template type arguments: in simple cases - T; T, U; T1, T2.
In more complex cases, either follow the class naming rules or add the prefix T.
.. code-block:: cpp
template <typename TKey, typename TValue>
struct AggregatedStatElement
#. Names of template constant arguments: same as variable names or N in simple cases.
.. code-block:: cpp
template <bool without_www>
struct ExtractDomain
#. For abstract classes (interfaces) you can add ``I`` to the start of the name.
.. code-block:: cpp
class IBlockInputStream
#. If a variable is used fairly locally, you can use a short name.
Otherwise, use a descriptive name.
.. code-block:: cpp
bool info_successfully_loaded = false;
#. ``define``'s should be in ALL_CAPS with underscores. The same goes for global constants.
.. code-block:: cpp
#define MAX_SRC_TABLE_NAMES_TO_STORE 1000
#. Names of files should match their contents.
If a file contains a single class, name it like the class, in CamelCase.
If a file contains a single function, name it like the function, in camelCase.
#. If a name contains an abbreviation:
* for variable names, the abbreviation should be all lowercase;
``mysql_connection``
:strike:`mySQL_connection`
* for class and function names, the abbreviation should be kept in uppercase;
``MySQLConnection``
:strike:`MySqlConnection`
#. Constructor arguments that are used just to initialize class members should be named the same as the corresponding members, but with an underscore suffix.
.. code-block:: cpp
FileQueueProcessor(
const std::string & path_,
const std::string & prefix_,
std::shared_ptr<FileHandler> handler_)
: path(path_),
prefix(prefix_),
handler(handler_),
log(&Logger::get("FileQueueProcessor"))
{
}
The underscore suffix can be omitted if the argument is not used in the constructor body.
#. Naming of local variables and class members is the same (no prefixes required).
``timer``
:strike:`m_timer`
#. Constants in enums - CamelCase starting with an uppercase letter. ALL_CAPS is also ok. If the enum is not local, use ``enum class``.
.. code-block:: cpp
enum class CompressionMethod
{
QuickLZ = 0,
LZ4 = 1,
};
#. All names - in English. Transliteration from Russian is not allowed.
:strike:`Stroka`
#. Abbreviations are fine only if they are well known (when you can easily find what they mean on Wikipedia or with a web search).
``AST`` ``SQL``
:strike:`NVDH (some random letters)`
Using shortened words is ok if the shortening is commonly used. You can also spell out the whole word in a nearby comment.
#. C++ source code extensions should be .cpp. Header files - only .h.
:strike:`.hpp` :strike:`.cc` :strike:`.C` :strike:`.inl`
``.inl.h`` is ok, but not :strike:`.h.inl`
How to write code
-----------------
#. Memory management.
Manual memory deallocation (``delete``) is ok only in library code, in destructors.
In application code, memory should be freed by an object that owns it.
Examples:
* you can put an object on the stack or make it a member of another class.
* use containers for many small objects.
* for automatic deallocation of a small number of heap-allocated objects, use ``shared_ptr``/``unique_ptr``.
#. Resource management.
Use RAII and see above.
#. Error handling.
Use exceptions. In most cases you should only throw exceptions, not catch them (because of RAII).
In offline data processing applications, it's often ok not to catch exceptions.
In server code that serves user requests, you should usually catch exceptions only at the top level of the connection handler.
In thread functions, you should catch and store all exceptions to rethrow them in the main thread after join.
.. code-block:: cpp
/// If there were no other calculations yet, let's do it synchronously
if (!started)
{
calculate();
started = true;
}
else /// If the calculations are already in progress, let's wait
pool.wait();
if (exception)
exception->rethrow();
Never hide exceptions without handling them. Never just blindly write all exceptions to the log.
:strike:`catch (...) {}`
If you need to ignore some exceptions, do so only for specific ones and rethrow the rest.
.. code-block:: cpp
catch (const DB::Exception & e)
{
if (e.code() == ErrorCodes::UNKNOWN_AGGREGATE_FUNCTION)
return nullptr;
else
throw;
}
When using functions with error codes, always check the result and throw an exception in case of error.
.. code-block:: cpp
if (0 != close(fd))
throwFromErrno("Cannot close file " + file_name, ErrorCodes::CANNOT_CLOSE_FILE);
Asserts are not used.
#. Exception types.
There is no need for a complex exception hierarchy in application code. The exception message should be understandable to an operations engineer.
#. Throwing exceptions from destructors.
Not recommended, but allowed.
Use the following options:
* Create a function (done() or finalize()) that does in advance all the work that might lead to an exception. If that function was called, there should be no exceptions in the destructor later.
* Work that is too complex (for example, sending messages over the network) can be put in a separate method that the class user will have to call before destruction.
* If there is an exception in the destructor anyway, it's better to log it than to hide it.
* In simple applications, it is ok to rely on std::terminate (in case of noexcept by default in C++11) to handle exceptions.
#. Anonymous code blocks.
It is ok to create an anonymous code block to make some variables local to it, so that they are destroyed earlier than they otherwise would be.
.. code-block:: cpp
Block block = data.in->read();
{
std::lock_guard<std::mutex> lock(mutex);
data.ready = true;
data.block = block;
}
ready_any.set();
#. Multithreading.
For offline data processing applications:
* Try to make the code as fast as possible on a single core.
* Make it parallel only if single-core performance turns out to be insufficient.
In server applications:
* use a thread pool for request handling;
* so far, there have been no tasks where userspace context switching was really necessary.
Fork is not used to parallelize code.
#. Synchronizing threads.
Often it is possible to make different threads use different memory cells (even better, different cache lines) and avoid any synchronization (except joinAll).
If synchronization is necessary, in most cases a mutex under lock_guard is enough.
In other cases, use system synchronization primitives. Do not use busy waiting.
Atomic operations should be used only in the simplest cases.
Do not try to implement lock-free data structures unless it is your primary area of expertise.
#. Pointers vs references.
Prefer references.
#. const.
Use constant references, pointers to constants, const_iterator, and const methods.
Consider const to be the default and use non-const only when necessary.
When passing variables by value, using const usually does not make sense.
#. unsigned.
unsigned is ok if necessary.
#. Numeric types.
Use the types UInt8, UInt16, UInt32, UInt64, Int8, Int16, Int32, Int64, as well as size_t, ssize_t, ptrdiff_t.
Do not use the types signed/unsigned long, long long, short, signed char, unsigned char, or char.
#. Passing arguments.
Pass complex values by reference (including std::string).
If a function takes ownership of an object created on the heap, make the argument type shared_ptr or unique_ptr.
#. Returning values.
In most cases just use return. Do not write :strike:`return std::move(res)`.
If a function allocates an object on the heap and returns it, use shared_ptr or unique_ptr.
In rare cases you might need to return a value via an argument; in such cases, the argument should be a reference.
.. code-block:: cpp
using AggregateFunctionPtr = std::shared_ptr<IAggregateFunction>;
/** Creates an aggregate function by its name
*/
class AggregateFunctionFactory
{
public:
AggregateFunctionFactory();
AggregateFunctionPtr get(const String & name, const DataTypes & argument_types) const;
#. namespace.
There is no need to use a separate namespace for application code or small libraries.
For medium to large libraries, put everything in a namespace.
You can use an additional detail namespace in a .h file to hide implementation details.
In a .cpp file you can use static or an anonymous namespace to hide symbols.
You can also use a namespace for an enum to prevent its names from polluting the outer namespace, but it's better to use enum class.
#. Delayed initialization.
If arguments are required for initialization, do not write a default constructor.
If you later need to delay initialization, you can add a default constructor that creates an invalid object.
For a small number of objects, you can use shared_ptr/unique_ptr.
.. code-block:: cpp
Loader(DB::Connection * connection_, const std::string & query, size_t max_block_size_);
/// For delayed initialization
Loader() {}
#. Virtual methods.
Do not mark methods or the destructor as virtual if the class is not intended for polymorphic use.
#. Encoding.
Always use UTF-8. Use ``std::string``, ``char *``. Do not use ``std::wstring``, ``wchar_t``.
#. Logging.
See the examples in the code.
Remove debug logging before committing.
Logging in loops should be avoided, even at the Trace level.
Logs should be readable even with the most detailed settings.
Log mostly in application code.
Log messages should be in English and understandable to operations engineers.
#. I/O.
In inner loops (critical for application performance) you can't use iostreams (especially stringstream).
Use the DB/IO library instead.
#. Date and time.
See DateLUT library.
#. include.
Always use ``#pragma once`` instead of include guards.
#. using.
Don't use ``using namespace``. ``using`` something specific is fine; try to put it as locally as possible.
#. Do not use trailing return types unless necessary.
:strike:`auto f() -> void;`
#. Do not declare and initialize variables like this:
:strike:`auto s = std::string{"Hello"};`
Do it like this instead:
``std::string s = "Hello";``
``std::string s{"Hello"};``
#. For virtual functions, write ``virtual`` in the base class and don't forget to write ``override`` in derived classes.
Unused C++ features
-------------------
#. Virtual inheritance.
#. Exception specifiers from C++03.
#. Function try blocks, except for the main function in tests.
Platform
--------
#. We write code for a specific platform, but other things being equal, cross-platform and portable code is preferred.
#. Language is C++17. GNU extensions can be used if necessary.
#. The compiler is gcc. As of April 2017, version 6.3 is used. The code can also be compiled with clang 4.
The standard library from gcc is used.
#. OS - Linux Ubuntu, not older than Precise.
#. Code is written for the x86_64 CPU architecture.
The SSE 4.2 CPU instruction set is currently required.
#. ``-Wall -Werror`` compilation flags are used.
#. Static linking is used by default. See the ``ldd`` output for the list of exceptions to this rule.
#. Code is developed and debugged with release settings.
Tools
-----
#. KDevelop is a good IDE.
#. For debugging: gdb, valgrind (memcheck), strace, -fsanitize=..., tcmalloc_minimal_debug.
#. For profiling: Linux Perf, valgrind (callgrind), strace -cf.
#. Source code is in Git.
#. Compilation is managed by CMake.
#. Releases are in deb packages.
#. Commits to master should not break the build.
Though only selected revisions are considered workable.
#. Commit as often as possible, even if the code is not quite ready yet.
Use branches for this.
If your code is not buildable yet, exclude it from the build before pushing to master;
you'll have a few days to fix or delete it from master after that.
#. For non-trivial changes, use branches and publish them on the server.
#. Unused code is removed from the repository.
Libraries
---------
#. The C++14 standard library is used (experimental extensions are fine), as well as the boost and Poco frameworks.
#. If necessary, you can use any well-known library available in the OS packages.
If there's a good ready-made solution, it is used even if it requires installing one more dependency.
(But be prepared to remove bad libraries from the code.)
#. It is ok to use a library not from packages if it is not available there or is too old.
#. If the library is small and does not have its own complex build system, you should put its sources in the contrib folder.
#. Already used libraries are preferred.
General
-------
#. Write as little code as possible.
#. Try the simplest solution.
#. Do not write code if you do not yet know how it will work.
#. In the simplest cases, prefer using over classes and structs.
#. Write copy constructors, assignment operators, destructors (except virtual ones), move constructors and move assignment operators only if there's no other option. You can use ``default``.
#. It is encouraged to simplify code.
Additional
----------
#. Explicit std:: for types from stddef.h.
Not recommended, but allowed.
#. Explicit std:: for functions from the C standard library.
Not recommended. For example, write memcpy instead of std::memcpy.
Sometimes there are non-standard functions with similar names, like memmem. It would look weird to have memcpy with the std:: prefix next to memmem without it. Though specifying std:: is not prohibited.
#. Usage of C functions when there are alternatives in the C++ standard library.
Allowed if they are more efficient. For example, use memcpy instead of std::copy for copying large chunks of memory.
#. Multiline function arguments.
All of the following styles are allowed:
.. code-block:: cpp
function(
T1 x1,
T2 x2)
.. code-block:: cpp
function(
size_t left, size_t right,
const RangesInDataParts & ranges,
size_t limit)
.. code-block:: cpp
function(size_t left, size_t right,
const RangesInDataParts & ranges,
size_t limit)
.. code-block:: cpp
function(size_t left, size_t right,
const RangesInDataParts & ranges,
size_t limit)
.. code-block:: cpp
function(
size_t left,
size_t right,
const RangesInDataParts & ranges,
size_t limit)


@ -0,0 +1,38 @@
How to run ClickHouse tests
===========================
The ``clickhouse-test`` utility that is used for functional testing is written using Python 2.x.
It also requires you to have some third-party packages:
.. code-block:: bash
$ pip install lxml termcolor
In a nutshell:
- Put the ``clickhouse`` program to ``/usr/bin`` (or ``PATH``)
- Create a ``clickhouse-client`` symlink in ``/usr/bin`` pointing to ``clickhouse``
- Start the ``clickhouse`` server
- ``cd dbms/tests/``
- Run ``./clickhouse-test``
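The steps above might look roughly like this (the paths are examples; adjust them to your build and configuration):

.. code-block:: bash

    sudo cp build/dbms/src/Server/clickhouse /usr/bin/
    sudo ln -s /usr/bin/clickhouse /usr/bin/clickhouse-client
    clickhouse --server --config-file=dbms/src/Server/config.xml &
    cd dbms/tests/
    ./clickhouse-test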
Example usage
-------------
Run ``./clickhouse-test --help`` to see available options.
To run tests without having to create a symlink or mess with ``PATH``:
.. code-block:: bash
./clickhouse-test -c "../../build/dbms/src/Server/clickhouse --client"
To run a single test, i.e. ``00395_nullable``:
.. code-block:: bash
./clickhouse-test 00395


@ -5,7 +5,7 @@ It is possible to add your own dictionaries from various data sources. The data
A dictionary can be stored completely in RAM and updated regularly, or it can be partially cached in RAM and dynamically load missing values.
The configuration of external dictionaries is in a separate file or files specified in the ``dictionaries_config`` configuration parameter.
This parameter contains the absolute or relative path to the file with the dictionary configuration. A relative path is relative to the directory with the server config file. The path can contain wildcards * and ?, in which case all matching files are found. Example: dictionaries/*.xml.
This parameter contains the absolute or relative path to the file with the dictionary configuration. A relative path is relative to the directory with the server config file. The path can contain wildcards ``*`` and ``?``, in which case all matching files are found. Example: ``dictionaries/*.xml``.
The dictionary configuration, as well as the set of files with the configuration, can be updated without restarting the server. The server checks updates every 5 seconds. This means that dictionaries can be enabled dynamically.


@ -7,6 +7,8 @@ Outputs data in JSON format. Besides data tables, it also outputs column names a
SELECT SearchPhrase, count() AS c FROM test.hits GROUP BY SearchPhrase WITH TOTALS ORDER BY c DESC LIMIT 5 FORMAT JSON
.. code-block:: json
{
"meta":
[


@ -12,6 +12,8 @@ The Pretty format supports outputting total values (when using WITH TOTALS) and
SELECT EventDate, count() AS c FROM test.hits GROUP BY EventDate WITH TOTALS ORDER BY EventDate FORMAT PrettyCompact
.. code-block:: text
┌──EventDate─┬───────c─┐
│ 2014-03-17 │ 1406958 │
│ 2014-03-18 │ 1383658 │
