ClickHouse/doc/getting_started/getting_started_en.md
2016-06-19 09:34:43 +03:00

Introduction

To start, we need a computer. For example, let's create a virtual instance in OpenStack with the following characteristics:

RAM             61GB
VCPUs           16 VCPU
Disk            40GB
Ephemeral Disk  100GB

OS:

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04 LTS
Release:	16.04
Codename:	xenial

We will try ClickHouse out on open data from the On Time database, which is maintained by the United States Department of Transportation. Information about the data, the structure of the table, and example queries can be found here:

https://github.com/yandex/ClickHouse/blob/master/doc/example_datasets/1_ontime.txt

Building

To build ClickHouse, we will follow the instructions located here:

https://github.com/yandex/ClickHouse/blob/master/doc/build.md

Install the required packages. After that, run the following command from the directory with the ClickHouse source code:

~/ClickHouse$ DISABLE_MONGODB=1 ./release --standalone

The build should complete successfully.

Install the resulting packages and start ClickHouse:

sudo apt-get install ../clickhouse-server-base_1.1.53960_amd64.deb ../clickhouse-server-common_1.1.53960_amd64.deb
sudo apt-get install ../clickhouse-client_1.1.53960_amd64.deb
sudo service clickhouse-server start

Creating a table

Before loading the data into ClickHouse, let's start the ClickHouse console client in order to create a table with the necessary fields:

$ clickhouse-client

The table is created with the following query:

:) create table `ontime` (
  `Year` UInt16,
  `Quarter` UInt8,
  `Month` UInt8,
  `DayofMonth` UInt8,
  `DayOfWeek` UInt8,
  `FlightDate` Date,
  `UniqueCarrier` FixedString(7),
  `AirlineID` Int32,
  `Carrier` FixedString(2),
  `TailNum` String,
  `FlightNum` String,
  `OriginAirportID` Int32,
  `OriginAirportSeqID` Int32,
  `OriginCityMarketID` Int32,
  `Origin` FixedString(5),
  `OriginCityName` String,
  `OriginState` FixedString(2),
  `OriginStateFips` String,
  `OriginStateName` String,
  `OriginWac` Int32,
  `DestAirportID` Int32,
  `DestAirportSeqID` Int32,
  `DestCityMarketID` Int32,
  `Dest` FixedString(5),
  `DestCityName` String,
  `DestState` FixedString(2),
  `DestStateFips` String,
  `DestStateName` String,
  `DestWac` Int32,
  `CRSDepTime` Int32,
  `DepTime` Int32,
  `DepDelay` Int32,
  `DepDelayMinutes` Int32,
  `DepDel15` Int32,
  `DepartureDelayGroups` String,
  `DepTimeBlk` String,
  `TaxiOut` Int32,
  `WheelsOff` Int32,
  `WheelsOn` Int32,
  `TaxiIn` Int32,
  `CRSArrTime` Int32,
  `ArrTime` Int32,
  `ArrDelay` Int32,
  `ArrDelayMinutes` Int32,
  `ArrDel15` Int32,
  `ArrivalDelayGroups` Int32,
  `ArrTimeBlk` String,
  `Cancelled` UInt8,
  `CancellationCode` FixedString(1),
  `Diverted` UInt8,
  `CRSElapsedTime` Int32,
  `ActualElapsedTime` Int32,
  `AirTime` Int32,
  `Flights` Int32,
  `Distance` Int32,
  `DistanceGroup` UInt8,
  `CarrierDelay` Int32,
  `WeatherDelay` Int32,
  `NASDelay` Int32,
  `SecurityDelay` Int32,
  `LateAircraftDelay` Int32,
  `FirstDepTime` String,
  `TotalAddGTime` String,
  `LongestAddGTime` String,
  `DivAirportLandings` String,
  `DivReachedDest` String,
  `DivActualElapsedTime` String,
  `DivArrDelay` String,
  `DivDistance` String,
  `Div1Airport` String,
  `Div1AirportID` Int32,
  `Div1AirportSeqID` Int32,
  `Div1WheelsOn` String,
  `Div1TotalGTime` String,
  `Div1LongestGTime` String,
  `Div1WheelsOff` String,
  `Div1TailNum` String,
  `Div2Airport` String,
  `Div2AirportID` Int32,
  `Div2AirportSeqID` Int32,
  `Div2WheelsOn` String,
  `Div2TotalGTime` String,
  `Div2LongestGTime` String,
  `Div2WheelsOff` String,
  `Div2TailNum` String,
  `Div3Airport` String,
  `Div3AirportID` Int32,
  `Div3AirportSeqID` Int32,
  `Div3WheelsOn` String,
  `Div3TotalGTime` String,
  `Div3LongestGTime` String,
  `Div3WheelsOff` String,
  `Div3TailNum` String,
  `Div4Airport` String,
  `Div4AirportID` Int32,
  `Div4AirportSeqID` Int32,
  `Div4WheelsOn` String,
  `Div4TotalGTime` String,
  `Div4LongestGTime` String,
  `Div4WheelsOff` String,
  `Div4TailNum` String,
  `Div5Airport` String,
  `Div5AirportID` Int32,
  `Div5AirportSeqID` Int32,
  `Div5WheelsOn` String,
  `Div5TotalGTime` String,
  `Div5LongestGTime` String,
  `Div5WheelsOff` String,
  `Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)

You can get information about the table using the following queries:

:) desc ontime
:) show create ontime

Filling in the data

Let's download the data:

for s in `seq 1987 2015`; do
	for m in `seq 1 12`; do
		wget http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
	done
done
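
Before starting the download, it can help to see what the loop is going to request. The sketch below only prints the URLs and downloads nothing. The function name `ontime_urls` is an illustrative assumption and not part of ClickHouse or of the dataset tooling. The URL pattern is the one used in the loop above.

```shell
# Hypothetical helper: print one download URL per (year, month) pair
# for the inclusive year range given as the two arguments.
ontime_urls() {
    first_year=$1
    last_year=$2
    for year in $(seq "$first_year" "$last_year"); do
        for month in $(seq 1 12); do
            echo "http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${year}_${month}.zip"
        done
    done
}

# 29 years x 12 months: the full range asks for 348 archives.
ontime_urls 1987 2015 | wc -l
```

If the list is saved to a file, it can be handed to wget with the -i option, which reads the URLs to fetch from that file.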

After that, add the data to ClickHouse:

for i in *.zip; do
	echo "$i"
	unzip -cq "$i" '*.csv' | sed 's/\.00//g' | clickhouse-client --host=example-perftest01j --query="insert into ontime format CSVWithNames"
done
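
The sed step in the loading loop deletes every occurrence of `.00`, presumably so that numeric fields written with a zero fraction can be parsed into the integer columns of the table. Here is a minimal sketch of its effect on a made-up CSV line. The helper name `strip_zero_fraction` and the sample values are assumptions for illustration and are not taken from the dataset.

```shell
# Same substitution as in the loading loop, wrapped in a hypothetical helper.
strip_zero_fraction() {
    sed 's/\.00//g'
}

# Made-up sample row: a date followed by three numeric fields.
echo '"2015-01-01",15.00,-3.00,2475.00' | strip_zero_fraction
# prints: "2015-01-01",15,-3,2475
```

Note that the pattern is not tied to the end of a field: a value such as 1.005 would be turned into 15. The approach therefore relies on the files never containing such values, which this guide does not verify.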

Working with data

Let's check that the table contains some data:

:) select FlightDate, FlightNum, OriginCityName, DestCityName from ontime limit 10; 

Now let's craft a more complicated query. For example, output the percentage of flights delayed by more than 10 minutes, for each year:

select Year, c1/c2
from
(
	select
		Year,
		count(*)*100 as c1
	from ontime
	where DepDelay > 10
	group by Year
)
any inner join
(
	select
		Year,
		count(*) as c2
	from ontime
	group by Year
) using (Year)
order by Year;
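
The same percentage can likely be computed in a single pass over the table. A comparison such as DepDelay > 10 evaluates to 0 or 1, so its average is the share of rows for which it holds. The sketch below wraps such a query in a shell helper. The name `delayed_percent_query` and the column alias `delayed_percent` are assumptions for illustration. Unlike the inner join above, this form would also list a year that has no delayed flights at all, with a value of 0.

```shell
# Hypothetical helper: print a single-pass version of the query.
delayed_percent_query() {
    cat <<'SQL'
SELECT Year, avg(DepDelay > 10) * 100 AS delayed_percent
FROM ontime
GROUP BY Year
ORDER BY Year
SQL
}

# Show the query text. To execute it (assuming a server on localhost):
#   delayed_percent_query | clickhouse-client
delayed_percent_query
```

Keeping the query in a function makes it easy to send the same text to different hosts, for example by adding the --host option used in the loading step.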

Additional notes

Copying a table

Suppose we need to copy 1% of the records (the luckiest ones) from the ontime table to a new table named ontime_ltd. To achieve that, run the following queries:

:) create table ontime_ltd as ontime;
:) insert into ontime_ltd select * from ontime where rand() % 100 = 42;

Working with multiple tables

If you need to collect data from multiple tables, use the merge(database, regexp) function:

:) select (select count(*) from merge(default, 'ontime.*'))/(select count() from ontime);
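
The regular expression 'ontime.*' selects both ontime and ontime_ltd, so the query above divides the combined row count by the row count of ontime alone and should return roughly 1.01. The sketch below uses grep -E purely as a stand-in to show which names such a pattern picks out. The helper name `tables_matching` and the table list are made up, and ClickHouse's own matching rules (for instance, whether the pattern has to match the whole name) are not covered by it.

```shell
# Hypothetical helper: filter table names (one per line on stdin) by a regexp.
# grep -E is only a stand-in for the matching that ClickHouse performs itself.
tables_matching() {
    grep -E "$1"
}

# Made-up list of table names in the default database.
printf '%s\n' ontime ontime_ltd flights | tables_matching 'ontime.*'
# prints:
# ontime
# ontime_ltd
```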

List of running queries

For diagnostics, one usually needs to know what exactly ClickHouse is doing at the moment. Let's execute a query that will run for a long time:

:) select sleep(1000);

If we then run clickhouse-client in another terminal, we can output the list of running queries along with some additional useful information about them:

:) show processlist;