ClickHouse/website/tutorial.html
Ivan Blinkov 515a0f061f Work in progress on CLICKHOUSE-2720 & CLICKHOUSE-3067 (#1046)
* update presentations

* CLICKHOUSE-2936: redirect from clickhouse.yandex.ru and clickhouse.yandex.com

* update submodule

* lost files

* CLICKHOUSE-2981: prefer sphinx docs over original reference

* CLICKHOUSE-2981: docs styles more similar to main website + add flags to switch language links

* update presentations

* Less confusing directory structure (docs -> doc/reference/)

* Minify sphinx docs too

* Website release script: fail fast + pass docker hash on deploy

* Do not underline links in docs

* shorter

* cleanup docker images

* tune nginx config

* CLICKHOUSE-3043: get rid of habrastorage links

* Lost translation

* CLICKHOUSE-2936: temporary client-side redirect

* behaves weird in test

* put redirect back

* CLICKHOUSE-3047: copy docs txts to public too

* move to proper file

* remove old pages to avoid confusion

* Remove reference redirect warning for now

* Refresh README.md

* Yellow buttons in docs

* Use svg flags instead of unicode ones in docs

* fix test website instance

* Put flags to separate files

* wrong flag

* Copy Yandex.Metrica introduction from main page to docs

* Yet another home page structure change, couple new blocks (CLICKHOUSE-3045)

* Update Contacts section

* CLICKHOUSE-2849: more detailed legal information

* CLICKHOUSE-2978 preparation - split by files

* More changes in Contacts block

* Tune texts on index page

* update presentations

* One more benchmark

* Add usage sections to index page, adapted from slides

* Get the roadmap started, based on slides from last ClickHouse Meetup

* CLICKHOUSE-2977: some rendering tuning

* Get rid of excessive section in the end of getting started

* Make headers linkable

* CLICKHOUSE-2981: links to editing reference - https://github.com/yandex/ClickHouse/issues/849

* CLICKHOUSE-2981: fix mobile styles in docs

* Ban crawling of duplicating docs

* Open some external links in new tab

* Ban old docs too

* Lots of trivial fixes in english docs

* Lots of trivial fixes in russian docs

* Remove getting started copies in markdown

* Add Yandex.Webmaster

* Fix some sphinx warnings

* More warnings fixed in english docs

* More sphinx warnings fixed

* Add code-block:: text

* More code-block:: text

* These headers look not that well

* Better switch between documentation languages

* merge use_case.rst into ya_metrika_task.rst

* Edit the agg_functions.rst texts

* Add lost empty lines

* Lost blank lines

* Add new logo sizes

* update presentations

* Next step in migrating to new documentation

* Fix all warnings in en reference

* Fix all warnings in ru reference

* Re-arrange existing reference

* Move operation tips to main reference

* Fix typos noticed by milovidov@

* Get rid of zookeeper.md

* Looks like duplicate of tutorial.html

* Fix some mess with html tags in tutorial

* No idea why nobody noticed this before, but it was completely not clear whet to get the data

* Match code block styling between main and tutorial pages (in favor of the latter)

* Get rid of some copypaste in tutorial

* Normalize header styles

* Move example_datasets to sphinx

* Move presentations submodule to website

* Move and update README.md

* No point in duplicating articles from habrahabr here

* Move development-related docs as is for now

* doc/reference/ -> docs/ (to match the URL on website)

* Adapt links to match the previous commit

* Adapt development docs to rst (still lacks translation and strikethrough support)

* clean on release

* blacklist presentations in gulp

* strikethrough support in sphinx

* just copy development folder for now

* fix weird introduction in style article

* Style guide translation (WIP)

* Finish style guide translation to English

* gulp clean separately

* Update year in LICENSE

* Initial CONTRIBUTING.md

* Fix remaining links to old docs in tutorial

* Some tutorial fixes

* Typo

* Another typo

* Update list of authors from yandex-team accoding to git log

* Fix diff with master

* couple fixes in en what_is_clickhouse.rst

* Try different link to blog in Russian

* Swap words

* Slightly larger line height

* CLICKHOUSE-3089: disable hyphenation in docs

* update presentations

* Fix copying of txt files

* update submodule

* CLICKHOUSE-3108: fix overflow issues in mobile version

* Less weird tutorial header in mobile version

* CLICKHOUSE-3073: skip sourcemaps by default

* CLICKHOUSE-3067: rename item in docs navigation

* fix list markup

* CLICKHOUSE-3067: some documentation style tuning

* CLICKHOUSE-3067: less laggy single page documentation
2017-08-04 17:52:08 +03:00

651 lines
26 KiB
HTML
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang="ru">
<head>
<meta charset="utf-8"/>
<title>ClickHouse Quick Start Guide</title>
<link rel="shortcut icon" href="favicon.ico"/>
<meta name="description" content="Quick start guide to ClickHouse — open-source distributed column-oriented DBMS"/>
<meta name="keywords"
content="tutorial, ClickHouse, DBMS, OLAP, relational, analytics, analytical, big data, open-source, SQL, web-analytics"/>
<link href="index.css" media="all" rel="stylesheet" />
</head>
<body>
<div class="page">
<div>
<div id="tutorial_logo">
<a href="/">
<svg xmlns="http://www.w3.org/2000/svg" width="90" height="80" viewBox="0 0 9 8">
<path class="red" d="M0,7 h1 v1 h-1 z"></path>
<path class="orange" d="M0,0 h1 v7 h-1 z"></path>
<path class="orange" d="M2,0 h1 v8 h-1 z"></path>
<path class="orange" d="M4,0 h1 v8 h-1 z"></path>
<path class="orange" d="M6,0 h1 v8 h-1 z"></path>
<path class="orange" d="M8,3.25 h1 v1.5 h-1 z"></path>
</svg>
</a>
</div>
<h1 id="tutorial_title">ClickHouse</h1>
<h2 id="tutorial_subtitle">Tutorial</h2>
</div>
<p>Let's get started with sample dataset from open sources. We will use USA civil flights data since 1987 till 2015.
It's hard to call this sample a Big Data (contains 166 millions rows, 63 Gb of uncompressed data) but this
allows us to quickly get to work. Dataset is available for download <a href="https://yadi.sk/d/pOZxpa42sDdgm">here</a>.
Also you may download it from the original datasource <a
href="docs/en/getting_started/example_datasets/ontime.html"
rel="external nofollow">as described here</a>.</p>
<p>Firstly we will deploy ClickHouse to a single server. Below that we will also review the process of deployment to
a cluster with support for sharding and replication.</p>
<p>On Ubuntu and Debian Linux ClickHouse can be installed from <a href="/#quick-start">packages</a>.
For other Linux distributions you can <a href="docs/en/development/build.html"
rel="external nofollow">compile
ClickHouse from sources</a> and then install.</p>
<p><b>clickhouse-client</b> package contains <a
href="docs/en/interfaces/cli.html">clickhouse-client</a> application —
interactive ClickHouse client. <b>clickhouse-server-base</b> contains a clickhouse-server binary file. <b>clickhouse-server-common</b>
— contains config files for the clickhouse-server.</p>
<p>Server config files are located in /etc/clickhouse-server/. Before getting to work please notice the <b>path</b>
element in config. <b>Path</b>&nbsp;determines the location for data storage. It's not really handy to directly
edit <b>config.xml</b> file considering package updates. Recommended way is to override the config elements in
<a href="docs/en/configuration_files.html">files of config.d directory</a>.
Also you may want to <a href="docs/en/access_rights.html">set up access
rights</a> at the start.</p>
<p><b>clickhouse-server</b> won't be launched automatically after package installation. It won't be automatically
restarted after updates either. Start the server with:
<pre>sudo service clickhouse-server start</pre>
Default location for server logs is /var/log/clickhouse-server/
Server is ready to handle client connections once "Ready&nbsp;for&nbsp;connections" message was logged.</p>
<p>Use <b>clickhouse-client</b> to connect to the server.</p>
<div class="spoiler"><a class="spoiler_title">Tips for clickhouse-client</a>
<div class="spoiler_body">
Interactive mode:
<pre>
clickhouse-client
clickhouse-client --host=... --port=... --user=... --password=...
</pre>
Enable multiline queries:
<pre>
clickhouse-client -m
clickhouse-client --multiline
</pre>
Run queries in batch-mode:
<pre>
clickhouse-client --query='SELECT 1'
echo 'SELECT 1' | clickhouse-client
</pre>
Insert data from file of a specified format:
<pre>
clickhouse-client --query='INSERT INTO table VALUES' &lt; data.txt
clickhouse-client --query='INSERT INTO table FORMAT TabSeparated' &lt; data.tsv
</pre>
</div>
</div>
<h3>Create table for sample dataset</h3>
<div class="spoiler"><a class="spoiler_title">Create table query</a>
<div class="spoiler_body">
<pre>
$ clickhouse-client --multiline
ClickHouse client version 0.0.53720.
Connecting to localhost:9000.
Connected to ClickHouse server version 0.0.53720.
:) CREATE TABLE ontime
(
Year UInt16,
Quarter UInt8,
Month UInt8,
DayofMonth UInt8,
DayOfWeek UInt8,
FlightDate Date,
UniqueCarrier FixedString(7),
AirlineID Int32,
Carrier FixedString(2),
TailNum String,
FlightNum String,
OriginAirportID Int32,
OriginAirportSeqID Int32,
OriginCityMarketID Int32,
Origin FixedString(5),
OriginCityName String,
OriginState FixedString(2),
OriginStateFips String,
OriginStateName String,
OriginWac Int32,
DestAirportID Int32,
DestAirportSeqID Int32,
DestCityMarketID Int32,
Dest FixedString(5),
DestCityName String,
DestState FixedString(2),
DestStateFips String,
DestStateName String,
DestWac Int32,
CRSDepTime Int32,
DepTime Int32,
DepDelay Int32,
DepDelayMinutes Int32,
DepDel15 Int32,
DepartureDelayGroups String,
DepTimeBlk String,
TaxiOut Int32,
WheelsOff Int32,
WheelsOn Int32,
TaxiIn Int32,
CRSArrTime Int32,
ArrTime Int32,
ArrDelay Int32,
ArrDelayMinutes Int32,
ArrDel15 Int32,
ArrivalDelayGroups Int32,
ArrTimeBlk String,
Cancelled UInt8,
CancellationCode FixedString(1),
Diverted UInt8,
CRSElapsedTime Int32,
ActualElapsedTime Int32,
AirTime Int32,
Flights Int32,
Distance Int32,
DistanceGroup UInt8,
CarrierDelay Int32,
WeatherDelay Int32,
NASDelay Int32,
SecurityDelay Int32,
LateAircraftDelay Int32,
FirstDepTime String,
TotalAddGTime String,
LongestAddGTime String,
DivAirportLandings String,
DivReachedDest String,
DivActualElapsedTime String,
DivArrDelay String,
DivDistance String,
Div1Airport String,
Div1AirportID Int32,
Div1AirportSeqID Int32,
Div1WheelsOn String,
Div1TotalGTime String,
Div1LongestGTime String,
Div1WheelsOff String,
Div1TailNum String,
Div2Airport String,
Div2AirportID Int32,
Div2AirportSeqID Int32,
Div2WheelsOn String,
Div2TotalGTime String,
Div2LongestGTime String,
Div2WheelsOff String,
Div2TailNum String,
Div3Airport String,
Div3AirportID Int32,
Div3AirportSeqID Int32,
Div3WheelsOn String,
Div3TotalGTime String,
Div3LongestGTime String,
Div3WheelsOff String,
Div3TailNum String,
Div4Airport String,
Div4AirportID Int32,
Div4AirportSeqID Int32,
Div4WheelsOn String,
Div4TotalGTime String,
Div4LongestGTime String,
Div4WheelsOff String,
Div4TailNum String,
Div5Airport String,
Div5AirportID Int32,
Div5AirportSeqID Int32,
Div5WheelsOn String,
Div5TotalGTime String,
Div5LongestGTime String,
Div5WheelsOff String,
Div5TailNum String
)
ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);
</pre>
</div>
</div>
<p>Now we have a table of <a href="docs/en/table_engines/mergetree.html">MergeTree type</a>.
MergeTree table type is recommended for usage in production. Table of this kind has a primary key used for
incremental sort of table data. This allows fast execution of queries in ranges of a primary key.</p>
<p><b>Note</b>
We store ad network banners impressions logs in ClickHouse. Each table entry looks like:
[Advertiser ID, Impression ID, attribute1, attribute2, &hellip;].
Let assume that our aim is to provide a set of reports for each advertiser. Common and frequently demanded query
would be to count impressions for a specific Advertiser ID. This means that table primary key should start with
Advertiser ID. In this case ClickHouse needs to read smaller amount of data to perform the query for a
given Advertiser ID.
</p>
<h3>Load data</h3>
<pre>xz -v -c -d &lt; ontime.csv.xz | clickhouse-client --query="INSERT INTO ontime FORMAT CSV"</pre>
<p>ClickHouse INSERT query allows to load data in any <a href="docs/en/formats/index.html">supported
format</a>. Data load requires just O(1) RAM consumption. INSERT query can receive any data volume as input.
It's strongly recommended to insert data with <a
href="docs/en/introduction/performance.html#performance-on-data-insertion">not too small
size blocks</a>. Notice that insert of blocks with size up to max_insert_block_size (= 1&nbsp;048&nbsp;576
rows by default) is an atomic operation: data block will be inserted completely or not inserted at all. In case
of disconnect during insert operation you may not know if the block was inserted successfully. To achieve
exactly-once semantics ClickHouse supports idempotency for <a
href="docs/en/table_engines/replication.html">replicated tables</a>. This means
that you may retry insert of the same data block (possibly on a different replicas) but this block will be
inserted just once. Anyway in this guide we will load data from our localhost so we may not take care about data
blocks generation and exactly-once semantics.</p>
<p>INSERT query into tables of MergeTree type is non-blocking (so does a SELECT query). You can execute SELECT
queries right after of during insert operation.</p>
<p>Our sample dataset is a bit not optimal. There are two reasons.</p>
<p>The first is that String data type is used in cases when <a
href="docs/en/data_types/enum.html">Enum</a> or numeric type would fit best.</p>
<p class="tip"><b></b> When set of possible values is determined and known to be small. (E.g. OS name, browser
vendors etc.) it's&nbsp;recommended to use Enums or numbers to improve performance.
When set of possible values is not limited (search&nbsp;query, URL, etc.) just go ahead with String.</p>
<p>The second is that dataset contains redundant fields like Year, Quarter, Month, DayOfMonth, DayOfWeek. In fact a
single FlightDate would be enough. Most likely they have been added to improve performance for other DBMS'es
which DateTime handling functions may be not efficient.</p>
<p class="tip"><b></b> ClickHouse <a
href="docs/en/functions/date_time_functions.html">functions
for operating with DateTime fields</a> are well-optimized so such redundancy is not required. Anyway much
columns is not a reason to worry — ClickHouse is a <a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS"
rel="external nofollow">column-oriented
DBMS</a>. This allows you to have as much fields as you need. Hundreds of columns in a table is fine for
ClickHouse.</p>
<h3>Querying the sample dataset</h3>
<p>Here are some examples of the queries from our test data.</p>
<ul>
<li>
<div class="spoiler"><a class="spoiler_title">the most popular destinations in 2015;</a>
<div class="spoiler_body">
<pre>
SELECT
OriginCityName,
DestCityName,
count(*) AS flights,
bar(flights, 0, 20000, 40)
FROM ontime WHERE Year = 2015 GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20
</pre>
<img src="images/tutorial/a8518a200d6d405a95ee80ea1c8e1c90.png"/>
<pre>
SELECT
OriginCityName &lt; DestCityName ? OriginCityName : DestCityName AS a,
OriginCityName &lt; DestCityName ? DestCityName : OriginCityName AS b,
count(*) AS flights,
bar(flights, 0, 40000, 40)
FROM ontime WHERE Year = 2015 GROUP BY a, b ORDER BY flights DESC LIMIT 20
</pre>
<img src="images/tutorial/d3578db55e304bd7b5eba818abdb53f5.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">the most popular cities of departure;</a>
<div class="spoiler_body">
<pre>
SELECT OriginCityName, count(*) AS flights
FROM ontime GROUP BY OriginCityName ORDER BY flights DESC LIMIT 20
</pre>
<img src="images/tutorial/ef4141f348234773a5349c4bd3e8f804.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">cities of departure which offer maximum variety of
destinations;</a>
<div class="spoiler_body">
<pre>
SELECT OriginCityName, uniq(Dest) AS u
FROM ontime GROUP BY OriginCityName ORDER BY u DESC LIMIT 20
</pre>
<img src="images/tutorial/2409f49d11fb4aa1b8b5ff34cf9ca75d.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">flight delay dependence on the day of week;</a>
<div class="spoiler_body">
<pre>
SELECT DayOfWeek, count() AS c, avg(DepDelay &gt; 60) AS delays
FROM ontime GROUP BY DayOfWeek ORDER BY DayOfWeek
</pre>
<img src="images/tutorial/885e507930e34b7c8f788d25e7ca2bcf.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">cities of departure with most frequent delays for 1 hour or
longer;</a>
<div class="spoiler_body">
<pre>
SELECT OriginCityName, count() AS c, avg(DepDelay &gt; 60) AS delays
FROM ontime
GROUP BY OriginCityName
HAVING c &gt; 100000
ORDER BY delays DESC
LIMIT 20
</pre>
<img src="images/tutorial/ac292656d03946d0aba35c75783a31f2.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">flights of maximum duration;</a>
<div class="spoiler_body">
<pre>
SELECT OriginCityName, DestCityName, count(*) AS flights, avg(AirTime) AS duration
FROM ontime
GROUP BY OriginCityName, DestCityName
ORDER BY duration DESC
LIMIT 20
</pre>
<img src="images/tutorial/7b3c2e685832439b8c373bf2015131d2.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">distribution of arrival time delays split by aircompanies;</a>
<div class="spoiler_body">
<pre>
SELECT Carrier, count() AS c, round(quantileTDigest(0.99)(DepDelay), 2) AS q
FROM ontime GROUP BY Carrier ORDER BY q DESC
</pre>
<img src="images/tutorial/49c332e3d93146ba8f46beef6b2b02b0.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">aircompanies who stopped flights operation;</a>
<div class="spoiler_body">
<pre>
SELECT Carrier, min(Year), max(Year), count()
FROM ontime GROUP BY Carrier HAVING max(Year) &lt; 2015 ORDER BY count() DESC
</pre>
<img src="images/tutorial/24956f1a2efc48d78212586958aa036c.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">most trending destination cities in 2015;</a>
<div class="spoiler_body">
<pre>
SELECT
DestCityName,
sum(Year = 2014) AS c2014,
sum(Year = 2015) AS c2015,
c2015 / c2014 AS diff
FROM ontime
WHERE Year IN (2014, 2015)
GROUP BY DestCityName
HAVING c2014 &gt; 10000 AND c2015 &gt; 1000 AND diff &gt; 1
ORDER BY diff DESC
</pre>
<img src="images/tutorial/f3132f4d1c0d42eab26d9111afe7771a.png"/></div>
</div>
</li>
<li>
<div class="spoiler"><a class="spoiler_title">destination cities with maximum popularity-season
dependency.</a>
<div class="spoiler_body">
<pre>
SELECT
DestCityName,
any(total),
avg(abs(monthly * 12 - total) / total) AS avg_month_diff
FROM
(
SELECT DestCityName, count() AS total
FROM ontime GROUP BY DestCityName HAVING total &gt; 100000
)
ALL INNER JOIN
(
SELECT DestCityName, Month, count() AS monthly
FROM ontime GROUP BY DestCityName, Month HAVING monthly &gt; 10000
)
USING DestCityName
GROUP BY DestCityName
ORDER BY avg_month_diff DESC
LIMIT 20
</pre>
<img src="images/tutorial/26b2c7aae21a4c76800cb1c7a33a374d.png"/></div>
</div>
</li>
</ul>
<h3>ClickHouse deployment to cluster</h3>
<p>ClickHouse cluster is a homogenous cluster. Steps to set up:
<ol>
<li>Install ClickHouse server on all machines of the cluster</li>
<li>Set up cluster configs in configuration file</li>
<li>Create local tables on each instance</li>
<li>Create a <a href="docs/en/table_engines/distributed.html">Distributed table</a></li>
</ol>
</p>
<p><a href="docs/en/table_engines/distributed.html">Distributed-table</a> is actually a kind of
"view" to local tables of ClickHouse cluster. SELECT query from a distributed table will be executed using
resources of all cluster's shards. You may specify configs for multiple clusters and create multiple
Distributed-tables providing views to different clusters.</p>
<div class="spoiler"><a class="spoiler_title">Config for cluster of three shards. Each shard stores data on a single
replica</a>
<div class="spoiler_body">
<pre>
&lt;remote_servers&gt;
&lt;perftest_3shards_1replicas&gt;
&lt;shard&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest01j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;/shard&gt;
&lt;shard&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest02j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;/shard&gt;
&lt;shard&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest03j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;/shard&gt;
&lt;/perftest_3shards_1replicas&gt;
&lt;/remote_servers&gt;
</pre>
</div>
</div>
Creating a local table:
<pre>CREATE TABLE ontime_local (...) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);</pre>
Creating a distributed table providing a view into local tables of the cluster:
<pre>CREATE TABLE ontime_all AS ontime_local
ENGINE = Distributed(perftest_3shards_1replicas, default, ontime_local, rand());</pre>
<p>You can create a Distributed table on all machines in the cluster. This would allow to run distributed queries on
any machine of the cluster. Besides distributed table you can also use <a
href="docs/en/table_functions/remote.html">*remote* table function</a>.</p>
<p>Let's run <a href="docs/en/query_language/queries.html#insert">INSERT SELECT</a> into Distributed table
to spread the table to multiple servers.</p>
<pre>INSERT INTO ontime_all SELECT * FROM ontime;</pre>
<p class="tip"><b></b> Worth to notice that the approach given above wouldn't fit for sharding of large
tables.<br/>Please use <a href="docs/en/table_engines/resharding.html">built-in sharding
feature</a>.</p>
<p>As you could expect heavy queries are executed N times faster being launched on 3 servers instead of one.</p>
<div class="spoiler"><a class="spoiler_title">See here</a>
<div class="spoiler_body">
<img src="images/tutorial/ece020129fdf4a18a6e75daf2e699cb9.png"/>
<p>You may have noticed that quantiles calculation are slightly different. This happens due to <a
href="https://github.com/tdunning/t-digest/raw/master/docs/t-digest-paper/histo.pdf">t-digest</a>
algorithm implementation which is non-deterministic — it depends on the order of data processing.</p>
</div>
</div>
<p>In this case we have used a cluster with 3 shards each contains a single replica.</p>
<p>To provide for resilience in production environment we recommend that each shard should contain 2-3 replicas
distributed between multiple data-centers. Note that ClickHouse supports unlimited number of replicas.</p>
<div class="spoiler"><a class="spoiler_title">Config for cluster of one shard containing three replicas</a>
<div class="spoiler_body">
<pre>
&lt;remote_servers&gt;
...
&lt;perftest_1shards_3replicas&gt;
&lt;shard&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest01j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest02j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;replica&gt;
&lt;host&gt;example-perftest03j.yandex.ru&lt;/host&gt;
&lt;port&gt;9000&lt;/port&gt;
&lt;/replica&gt;
&lt;/shard&gt;
&lt;/perftest_1shards_3replicas&gt;
&lt;/remote_servers&gt;
</pre>
</div>
</div>
<p>To enable replication <a href="http://zookeeper.apache.org/" rel="external nofollow">ZooKeeper</a> is required.
ClickHouse will take care of data consistency on all replicas and run restore procedure after failure
automatically. It's recommended to deploy ZooKeeper cluster to separate servers.</p>
<p>ZooKeeper is not a requirement — in some simple cases you can duplicate the data by writing it into all the
replicas from your application code. This approach is not recommended — in this case ClickHouse is not able to
guarantee data consistency on all replicas. This remains the responsibility of your application.</p>
<div class="spoiler"><a class="spoiler_title">Set ZooKeeper locations in configuration file</a>
<div class="spoiler_body">
<pre>
&lt;zookeeper-servers&gt;
&lt;node&gt;
&lt;host&gt;zoo01.yandex.ru&lt;/host&gt;
&lt;port&gt;2181&lt;/port&gt;
&lt;/node&gt;
&lt;node&gt;
&lt;host&gt;zoo02.yandex.ru&lt;/host&gt;
&lt;port&gt;2181&lt;/port&gt;
&lt;/node&gt;
&lt;node&gt;
&lt;host&gt;zoo03.yandex.ru&lt;/host&gt;
&lt;port&gt;2181&lt;/port&gt;
&lt;/node&gt;
&lt;/zookeeper-servers&gt;
</pre>
</div>
</div>
<p>Also we need to set macros for identifying shard and replica — it will be used on table creation</p>
<pre>
&lt;macros&gt;
&lt;shard&gt;01&lt;/shard&gt;
&lt;replica&gt;01&lt;/replica&gt;
&lt;/macros&gt;
</pre>
<p>If there are no replicas at the moment on replicated table creation — a new first replica will be instantiated.
If there are already live replicas — new replica will clone the data from existing ones. You have an option to
create all replicated tables first and that insert data to it. Another option is to create some replicas and add
the others after or during data insertion.</p>
<pre>
CREATE TABLE ontime_replica (...)
ENGINE = ReplicatedMergeTree(
'/clickhouse_perftest/tables/{shard}/ontime',
'{replica}',
FlightDate,
(Year, FlightDate),
8192);
</pre>
<p>Here we use <a href="docs/en/table_engines/replication.html#replicatedmergetree">ReplicatedMergeTree</a>
table type. In parameters we specify ZooKeeper path containing shard and replica identifiers.</p>
<pre>INSERT INTO ontime_replica SELECT * FROM ontime;</pre>
<p>Replication operates in multi-master mode. Data can be loaded into any replica — it will be synced with other
instances automatically. Replication is asynchronous so at a given moment of time not all replicas may contain
recently inserted data. To allow data insertion at least one replica should be up. Others will sync up data and
repair consistency once they will become active again. Please notice that such scheme allows for the possibility
of just appended data loss.</p>
<p class="warranty"><a href="https://github.com/yandex/ClickHouse/blob/master/LICENSE"
rel="external nofollow" target="_blank">
ClickHouse source code is published under Apache 2.0 License.</a> Software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.</p>
<p id="footer">&copy; 20162017 <a href="https://yandex.com/company/" rel="external nofollow">YANDEX</a> LLC</p>
</div>
<script type="text/javascript" src="https://yastatic.net/jquery/3.1.1/jquery.min.js"></script>
<script type="text/javascript">
$('.spoiler_title').click(function () {
$(this).next('.spoiler_body').toggle(100);
});
var name = document.getElementById('main_title').textContent.trim().toLowerCase();
var feedback_address = name + '-feedback' + '@yandex-team.com';
var feedback_email = document.getElementById('feedback_email');
feedback_email.setAttribute('href', 'mailto:' + feedback_address);
feedback_email.textContent = feedback_address;
</script>
<!-- Yandex.Metrika counter -->
<script type="text/javascript">
(function (d, w, c) {
(w[c] = w[c] || []).push(function() {
try {
w.yaCounter18343495 = new Ya.Metrika2({
id:18343495,
clickmap:true,
trackLinks:true,
accurateTrackBounce:true,
webvisor:true
});
} catch(e) { }
});
var n = d.getElementsByTagName("script")[0],
s = d.createElement("script"),
f = function () { n.parentNode.insertBefore(s, n); };
s.type = "text/javascript";
s.async = true;
s.src = "https://mc.yandex.ru/metrika/tag.js";
if (w.opera == "[object Opera]") {
d.addEventListener("DOMContentLoaded", f, false);
} else { f(); }
})(document, window, "yandex_metrika_callbacks2");
</script>
<noscript>
<div><img src="https://mc.yandex.ru/watch/18343495" style="position:absolute; left:-9999px;" alt=""/></div>
</noscript>
<!-- /Yandex.Metrika counter -->
</body>
</html>