ClickHouse

Feedback

Ask any questions on Stack Overflow.

Use the Google Group for discussion.

Or send a private message to the developers: clickhouse-feedback@yandex-team.com.

Software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied.

ClickHouse is an open-source column-oriented database management system that allows generating analytical data reports in real time.

ClickHouse manages extremely large volumes of data in a stable and sustainable manner. It currently powers Yandex.Metrica, the world’s second-largest web analytics platform, with over 13 trillion database records and over 20 billion events a day, generating customized reports on the fly, directly from non-aggregated data. The system was also successfully deployed at CERN’s LHCb experiment to store and process metadata on 10 billion events, with over 1,000 attributes per event, registered in 2011.

ClickHouse. Just makes you think faster.

Run more queries in the same amount of time
Test more hypotheses
Slice and dice your data in many more new ways
Look at your data from new angles
Discover new dimensions

Linearly scalable

ClickHouse allows companies to add servers to their clusters when necessary without investing time or money in additional DBMS modification. The system has been successfully serving Yandex.Metrica while the number of servers in its main cluster alone, spread across six geographically distributed datacenters, has grown from 60 to 394 in two years.

ClickHouse scales well both vertically and horizontally. It adapts easily to clusters with hundreds of nodes as well as to a single server or even a virtual machine. Current installations include some with more than two trillion rows per node and some with 100 TB of storage per node.

Hardware-efficient

ClickHouse processes typical analytical queries two to three orders of magnitude faster than traditional row-oriented systems with the same available IO throughput. The system’s columnar format allows fitting more hot data in the server’s RAM, which leads to a shorter response time.
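As a rough illustration of why a columnar layout reduces IO, the sketch below counts the bytes a scan has to touch when a query needs only a few columns. The table shape (1,000 rows, 10 columns of 8-byte integers) and the function names are hypothetical, and the model deliberately ignores compression, caching and indexes.

```python
import struct

# Hypothetical table shape, chosen only for illustration.
NUM_ROWS = 1000
NUM_COLS = 10
VALUE_SIZE = struct.calcsize("<q")  # 8 bytes per 64-bit integer value


def bytes_read_row_store(num_rows, num_cols):
    # A row store keeps whole rows together, so a full scan reads every
    # column of every row, even when the query uses only a few of them.
    return num_rows * num_cols * VALUE_SIZE


def bytes_read_column_store(num_rows, cols_needed):
    # A column store keeps each column separately, so a scan reads only
    # the columns the query actually references.
    return num_rows * cols_needed * VALUE_SIZE


row_bytes = bytes_read_row_store(NUM_ROWS, NUM_COLS)
col_bytes = bytes_read_column_store(NUM_ROWS, cols_needed=2)
print(row_bytes, col_bytes, row_bytes / col_bytes)  # 80000 16000 5.0
```

In this toy model the saving equals the ratio of total columns to used columns; real tables with hundreds of columns and compressed data can shift the ratio much further.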

ClickHouse minimizes the number of seeks for range queries, which makes rotational drives more efficient to use, because it continually maintains locality of reference for stored data.

ClickHouse is CPU efficient because of its vectorized query execution and runtime code generation.

By minimizing data transfers for most types of queries, ClickHouse enables companies to manage their data and create reports without using a network that supports high-performance computing.

Fast

ClickHouse’s performance exceeds that of comparable column-oriented DBMSs currently available on the market. A single server processes hundreds of millions to more than a billion rows, and tens of gigabytes of data, per second.

ClickHouse uses all available hardware to its full potential to process each query as fast as possible. The peak processing performance for a single query (after decompression, only used columns) stands at more than 2 terabytes per second.

Fault-tolerant

ClickHouse supports multi-master asynchronous replication and can be deployed across multiple datacenters. Downtime of a single node or of an entire datacenter won’t affect the system’s availability for reads and writes. Distributed reads are automatically balanced across live replicas without increasing latency. Replicated data are synchronized automatically or semi-automatically after the downtime.

Feature-rich

ClickHouse features a number of built-in, user-friendly web analytics capabilities, including probabilistic data structures for fast and memory-efficient calculation of cardinalities and quantiles, functions for working with URLs and IP addresses (both IPv4 and IPv6), and functions for handling dates, times and time zones.
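To give a feel for how a probabilistic structure can estimate cardinality in bounded memory, here is a minimal sketch of a K-minimum-values estimator. This is a generic textbook technique picked for illustration, not a description of the algorithm ClickHouse itself uses; the function names and the choice of SHA-256 as the hash are assumptions.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-random float in [0, 1) (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items while keeping only k hash values."""
    heap = []        # negated hashes, so heap[0] is the largest hash kept
    members = set()  # hashes currently kept, used to skip duplicates
    for item in items:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # The new hash is smaller than the largest one kept: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: count is exact
    # The k-th smallest of n uniform hashes lies near k / n, so n is about (k - 1) / h_k.
    return int((k - 1) / -heap[0])


# 30,000 values in the stream, but only 10,000 distinct ones.
stream = [i % 10_000 for i in range(30_000)]
print(kmv_estimate(stream))
```

Memory use depends on `k` rather than on the number of distinct values, and a larger `k` buys a tighter estimate.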

Data management methods available in ClickHouse, such as arrays, array joins and nested data structures, are extremely efficient for managing denormalized data.

ClickHouse can join both distributed and co-located data, as it supports local joins and distributed joins. It can also use external dictionaries (dimension tables loaded from an external source) for seamless joins.

ClickHouse supports approximate query processing: you can trade some accuracy for faster results, which is indispensable when dealing with terabytes and petabytes of data.
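One common way to approximate a query is to process only a deterministic sample of the data and scale the result. The sketch below illustrates that general idea under stated assumptions: the helper names, the 1-in-10 sampling rate and the hash function are hypothetical and do not describe ClickHouse internals.

```python
import hashlib


def in_sample(key, denominator=10):
    """Keep roughly 1/denominator of all keys, chosen by hash so the choice is repeatable."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % denominator == 0


def approximate_count(keys, predicate, denominator=10):
    """Count matching keys on the sample only, then scale up to the full data set."""
    matched = sum(1 for key in keys if in_sample(key, denominator) and predicate(key))
    return matched * denominator


def is_even(key):
    return key % 2 == 0


keys = range(100_000)
exact = sum(1 for key in keys if is_even(key))
approx = approximate_count(keys, is_even)
print(exact, approx)
```

The predicate is evaluated on about a tenth of the rows, at the cost of a small sampling error in the result.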

The system’s conditional aggregate functions, together with its calculation of totals and extremes, let you get results in a single query instead of running several.
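The idea behind conditional aggregation can be sketched as a single pass that maintains several aggregates at once, some of them restricted by a condition. The sample rows and field names below are made up for illustration, and the code models the concept only, not ClickHouse syntax.

```python
# Hypothetical visit records, used only for this illustration.
visits = [
    {"is_mobile": True, "duration": 30},
    {"is_mobile": False, "duration": 120},
    {"is_mobile": True, "duration": 45},
    {"is_mobile": False, "duration": 10},
]


def summarize(rows):
    """Compute totals, conditional aggregates and extremes in one pass over the rows."""
    summary = {
        "total_visits": 0,
        "mobile_visits": 0,    # a count restricted by a condition
        "mobile_duration": 0,  # a sum restricted by the same condition
        "min_duration": None,
        "max_duration": None,
    }
    for row in rows:
        duration = row["duration"]
        summary["total_visits"] += 1
        if row["is_mobile"]:
            summary["mobile_visits"] += 1
            summary["mobile_duration"] += duration
        if summary["min_duration"] is None or duration < summary["min_duration"]:
            summary["min_duration"] = duration
        if summary["max_duration"] is None or duration > summary["max_duration"]:
            summary["max_duration"] = duration
    return summary


result = summarize(visits)
print(result)
```

Without conditional aggregates, the mobile-only figures would need a second, filtered pass over the same data.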

Simple and handy

ClickHouse is simple and works out of the box. Besides running on clusters with hundreds of nodes, it can be easily installed on a single server or even a virtual machine. No development experience or code-writing skills are required to install ClickHouse.

Highly reliable

ClickHouse has been managing petabytes of data serving a number of high-load mass-audience services of Russia’s leading search provider and one of Europe’s largest IT companies, Yandex. Since 2012, ClickHouse has been providing robust database management for the company’s web analytics service, comparison shopping platform, email service, online advertising platform, business intelligence and infrastructure monitoring.

ClickHouse is a purely distributed system running on independent nodes, with no single point of failure.

Software or hardware failures or misconfigurations do not result in loss of data. Instead of deleting "broken" data, ClickHouse saves it or asks you what to do before starting up. All data are checksummed before every read or write to disk or network. It is virtually impossible to delete data by accident.
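The general pattern of checksumming data on write and verifying it on read can be sketched as follows. CRC32 and the four-byte header layout are assumptions made for this example; the text above does not say which checksum or block format ClickHouse uses.

```python
import struct
import zlib


def pack_block(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 before writing it to disk or network."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def unpack_block(block: bytes) -> bytes:
    """Verify the stored checksum on read and refuse to return corrupted data."""
    (expected,) = struct.unpack("<I", block[:4])
    payload = block[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: block is corrupted")
    return payload


block = pack_block(b"hello, columns")
assert unpack_block(block) == b"hello, columns"

# Flip the bits of the last byte to simulate corruption in storage or transit.
corrupted = block[:-1] + bytes([block[-1] ^ 0xFF])
try:
    unpack_block(corrupted)
except ValueError as error:
    print(error)  # checksum mismatch: block is corrupted
```

Raising an error instead of silently returning the payload lets the caller decide what to do with the damaged block, for example fetching it from another replica.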

ClickHouse offers flexible limits on query complexity and resource usage, which can be fine-tuned using settings. It is possible to simultaneously serve both a number of high priority low-latency requests and some long-running queries with lowered priority.

Opens new possibilities

ClickHouse streamlines all your data processing. It’s easy to use: ingest all your structured data into the system, and it is instantly available for reports. New columns for new properties or dimensions can be easily added to the system at any time without slowing it down.

ClickHouse works 100-1,000x faster than traditional approaches. In contrast to data management methods where vast amounts of raw data in its native format are kept as a ‘data lake’ for any given query, ClickHouse in most cases offers instant results: the data is processed faster than it takes to compose a query.

Key Features

True column-oriented
Vectorized query execution
Data compression
Parallel and distributed query execution
Real-time data ingestion
On-disk locality of reference
Real-time query processing
Cross-datacenter replication
High availability
SQL support
Local and distributed joins
Pluggable external dimension tables
Arrays and nested data types
Approximate query processing
Probabilistic data structures
Full support of IPv6
Features for web analytics
State-of-the-art algorithms
Detailed documentation
Clean documented code

Applications

Web and App analytics
Advertising networks and RTB
Telecommunications
E-commerce
Information security
Monitoring and telemetry
Business intelligence
Online games
Internet of Things

Download

System requirements: Linux, x86_64 with SSE 4.2.

Install packages for Ubuntu 16.04 (Xenial), Ubuntu 14.04 (Trusty), or Ubuntu 12.04 (Precise):

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4    # optional

sudo mkdir -p /etc/apt/sources.list.d
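# "trusty" is the Ubuntu 14.04 codename; for another release, substituting its
# codename in the path below is assumed to work (repository layout not verified here).
echo "deb http://repo.yandex.ru/clickhouse/trusty stable main" |
    sudo tee /etc/apt/sources.list.d/clickhouse.list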
sudo apt-get update

sudo apt-get install clickhouse-server-common clickhouse-client

sudo service clickhouse-server start
clickhouse-client

Read the documentation.

Or build ClickHouse from source by following the build instructions.