2016-06-19 06:34:43 +00:00

## Introduction

To begin, we need a machine. For example, let's create a virtual instance in OpenStack with the following characteristics:

```
RAM 61GB
VCPUs 16 VCPU
Disk 40GB
Ephemeral Disk 100GB
```

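
Once the instance is up, it can be useful to confirm that it really has the resources listed above. The following is only a minimal sketch for a Linux guest: the function name `show_resources` is made up for this example, and it reports the filesystem holding the current directory rather than the ephemeral disk specifically.

```shell
# Print the resources visible to the current machine (Linux only).
show_resources() {
    # Number of processing units available to this process.
    echo "VCPUs: $(nproc)"
    # Total RAM in kB, as reported by the kernel.
    echo "RAM (kB): $(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
    # Size of the filesystem that holds the current directory.
    echo "Disk: $(df -hP . | awk 'NR==2 {print $2}')"
}

show_resources
```
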
OS:

```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
```

We will try ClickHouse out on open data from the On Time database, which is maintained by the United States Department of Transportation. Information about the data, the structure of the table, and example queries can be found here:

```
https://github.com/yandex/ClickHouse/blob/master/doc/example_datasets/1_ontime.txt
```

## Building

To build ClickHouse, we will follow the instructions located here:

```
https://github.com/yandex/ClickHouse/blob/master/doc/build.md
```

Install the required packages. Then run the following command from the directory containing the ClickHouse source code:

```
~/ClickHouse$ ENABLE_MONGODB=0 ./release --standalone
```

The build completed successfully:

![](images/build_completed.png)

Install the packages and start ClickHouse:

```
sudo apt-get install ../clickhouse-server-base_1.1.53960_amd64.deb ../clickhouse-server-common_1.1.53960_amd64.deb
sudo apt-get install ../clickhouse-client_1.1.53960_amd64.deb
sudo service clickhouse-server start
```

## Creating a table

Before loading the data into ClickHouse, let's start the ClickHouse console client so that we can create a table with the necessary fields:

```
$ clickhouse-client
```

The table is created with the following query:

```
:) create table `ontime` (
  `Year` UInt16,
  `Quarter` UInt8,
  `Month` UInt8,
  `DayofMonth` UInt8,
  `DayOfWeek` UInt8,
  `FlightDate` Date,
  `UniqueCarrier` FixedString(7),
  `AirlineID` Int32,
  `Carrier` FixedString(2),
  `TailNum` String,
  `FlightNum` String,
  `OriginAirportID` Int32,
  `OriginAirportSeqID` Int32,
  `OriginCityMarketID` Int32,
  `Origin` FixedString(5),
  `OriginCityName` String,
  `OriginState` FixedString(2),
  `OriginStateFips` String,
  `OriginStateName` String,
  `OriginWac` Int32,
  `DestAirportID` Int32,
  `DestAirportSeqID` Int32,
  `DestCityMarketID` Int32,
  `Dest` FixedString(5),
  `DestCityName` String,
  `DestState` FixedString(2),
  `DestStateFips` String,
  `DestStateName` String,
  `DestWac` Int32,
  `CRSDepTime` Int32,
  `DepTime` Int32,
  `DepDelay` Int32,
  `DepDelayMinutes` Int32,
  `DepDel15` Int32,
  `DepartureDelayGroups` String,
  `DepTimeBlk` String,
  `TaxiOut` Int32,
  `WheelsOff` Int32,
  `WheelsOn` Int32,
  `TaxiIn` Int32,
  `CRSArrTime` Int32,
  `ArrTime` Int32,
  `ArrDelay` Int32,
  `ArrDelayMinutes` Int32,
  `ArrDel15` Int32,
  `ArrivalDelayGroups` Int32,
  `ArrTimeBlk` String,
  `Cancelled` UInt8,
  `CancellationCode` FixedString(1),
  `Diverted` UInt8,
  `CRSElapsedTime` Int32,
  `ActualElapsedTime` Int32,
  `AirTime` Int32,
  `Flights` Int32,
  `Distance` Int32,
  `DistanceGroup` UInt8,
  `CarrierDelay` Int32,
  `WeatherDelay` Int32,
  `NASDelay` Int32,
  `SecurityDelay` Int32,
  `LateAircraftDelay` Int32,
  `FirstDepTime` String,
  `TotalAddGTime` String,
  `LongestAddGTime` String,
  `DivAirportLandings` String,
  `DivReachedDest` String,
  `DivActualElapsedTime` String,
  `DivArrDelay` String,
  `DivDistance` String,
  `Div1Airport` String,
  `Div1AirportID` Int32,
  `Div1AirportSeqID` Int32,
  `Div1WheelsOn` String,
  `Div1TotalGTime` String,
  `Div1LongestGTime` String,
  `Div1WheelsOff` String,
  `Div1TailNum` String,
  `Div2Airport` String,
  `Div2AirportID` Int32,
  `Div2AirportSeqID` Int32,
  `Div2WheelsOn` String,
  `Div2TotalGTime` String,
  `Div2LongestGTime` String,
  `Div2WheelsOff` String,
  `Div2TailNum` String,
  `Div3Airport` String,
  `Div3AirportID` Int32,
  `Div3AirportSeqID` Int32,
  `Div3WheelsOn` String,
  `Div3TotalGTime` String,
  `Div3LongestGTime` String,
  `Div3WheelsOff` String,
  `Div3TailNum` String,
  `Div4Airport` String,
  `Div4AirportID` Int32,
  `Div4AirportSeqID` Int32,
  `Div4WheelsOn` String,
  `Div4TotalGTime` String,
  `Div4LongestGTime` String,
  `Div4WheelsOff` String,
  `Div4TailNum` String,
  `Div5Airport` String,
  `Div5AirportID` Int32,
  `Div5AirportSeqID` Int32,
  `Div5WheelsOn` String,
  `Div5TotalGTime` String,
  `Div5LongestGTime` String,
  `Div5WheelsOff` String,
  `Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
```

![](images/table_created.png)

You can get information about the table with the following queries:

```
:) desc ontime
```

```
:) show create ontime
```

## Loading the data

Let's download the data:

```
for s in `seq 1987 2015`; do
    for m in `seq 1 12`; do
        wget http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
    done
done
```

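
Before starting a download of this size, it may help to see exactly which files the loop will request. The sketch below only prints the URLs and downloads nothing; the function names `ontime_url` and `list_ontime_urls` are made up for this example, while the URL pattern and the year and month ranges are the ones used in the loop.

```shell
# Build the archive URL for one year ($1) and month ($2),
# using the same pattern as the download loop.
ontime_url() {
    echo "http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${1}_${2}.zip"
}

# Print every URL the download loop would request, without downloading.
list_ontime_urls() {
    for year in $(seq 1987 2015); do
        for month in $(seq 1 12); do
            ontime_url "$year" "$month"
        done
    done
}

# Count the archives: 29 years x 12 months.
list_ontime_urls | wc -l
```

The printed list can also be fed to `wget -i -`, which reads URLs from standard input.
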
After that, load the data into ClickHouse:

```
for i in *.zip; do
    echo "$i"
    unzip -cq "$i" '*.csv' | sed 's/\.00//g' | clickhouse-client --query="insert into ontime format CSVWithNames"
done
```

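
The `sed` step in the loading loop removes every occurrence of `.00` from the CSV text, presumably so that values such as `10.00` can be read into integer columns like `DepDelay`. The sketch below applies the same substitution to two made-up sample lines (they are not taken from the real files); the function name `strip_double_zero` is hypothetical. The second line shows a caveat: the pattern is not anchored to the end of a number, so it also rewrites values that merely contain `.00`.

```shell
# Apply the same substitution as the loading loop to CSV text on stdin.
strip_double_zero() {
    sed 's/\.00//g'
}

# Hypothetical sample values, not taken from the real files.
printf '%s\n' '1987,10.00,-3.00' | strip_double_zero   # prints 1987,10,-3
printf '%s\n' '1.005'            | strip_double_zero   # prints 15
```
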
## Working with data

Let's check that the table contains something:

```
:) select FlightDate, FlightNum, OriginCityName, DestCityName from ontime limit 10;
```

![](images/something.png)

Now let's write a more complicated query. For example, output the percentage of flights delayed by more than 10 minutes, for each year:

```
select Year, c1/c2
from
(
    select
        Year,
        count(*)*100 as c1
    from ontime
    where DepDelay > 10
    group by Year
)
any inner join
(
    select
        Year,
        count(*) as c2
    from ontime
    group by Year
) using (Year)
order by Year;
```

![](images/complicated.png)

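
To make the arithmetic behind `c1/c2` concrete, here is a small sketch that computes the same per-year percentage with `awk` on a handful of made-up `Year,DepDelay` pairs. The function name `delayed_percent_by_year` and the sample values are assumptions for illustration only. One difference from the query: a year with no delayed flights would be dropped by the inner join, whereas this sketch prints it with 0.0.

```shell
# Read "Year,DepDelay" lines on stdin and print "Year percent" per year.
delayed_percent_by_year() {
    awk -F, '
        { total[$1]++ }             # c2: all flights in the year
        $2 > 10 { delayed[$1]++ }   # c1 before scaling: flights delayed by more than 10 minutes
        END {
            for (y in total)
                printf "%d %.1f\n", y, delayed[y] * 100 / total[y]
        }
    ' | sort -n
}

# Hypothetical sample: 1 of 2 flights delayed in 1987, 3 of 4 in 1988.
printf '%s\n' 1987,15 1987,5 1988,20 1988,30 1988,0 1988,11 | delayed_percent_by_year
# prints:
# 1987 50.0
# 1988 75.0
```
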
## Additionally

### Copying a table

Suppose we need to copy 1% of the records (the luckiest ones) from the ```ontime``` table to a new table named ```ontime_ltd```. To do that, run the following queries:

```
:) create table ontime_ltd as ontime;
:) insert into ontime_ltd select * from ontime where rand() % 100 = 42;
```

### Working with multiple tables

If you need to collect data from multiple tables, use the function ```merge(database, regexp)```:

```
:) select (select count(*) from merge(default, 'ontime.*'))/(select count() from ontime);
```

![](images/1_percent.png)

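
The pattern `'ontime.*'` selects both `ontime` and `ontime_ltd`, so the ratio in the query above should come out at about 1.01 rather than 0.01. The sketch below illustrates which names such a pattern picks up, using `grep -E` as a stand-in; this is an assumption, since ClickHouse's regular expression matching may differ from `grep` in details. The function name `matching_tables` and the table name `flights` are hypothetical.

```shell
# Print the names from stdin that match the pattern given as $1.
matching_tables() {
    grep -E "$1"
}

# ontime and ontime_ltd are the tables created earlier; flights is a hypothetical extra name.
printf '%s\n' ontime ontime_ltd flights | matching_tables 'ontime.*'
# prints:
# ontime
# ontime_ltd
```
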
### List of running queries

For diagnostics, one usually needs to know what exactly ClickHouse is doing at the moment. Let's execute a query that will run for a long time:

```
:) select sleep(1000);
```

If we then run ```clickhouse-client``` in another terminal, we can list the running queries along with some useful additional information about them:

```
:) show processlist;
```

![](images/long_query.png)