6.3 KiB
Introduction
For a start, we need a computer. For example, let's create virtual instance in Openstack with following characteristics:
RAM 61GB
VCPUs 16 VCPU
Disk 40GB
Ephemeral Disk 100GB
OS:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
We gonna have a try with open data from On Time database, which is arranged by United States Department of Transportation). One can find information about it, structure of the table and examples of queries here:
https://github.com/yandex/ClickHouse/blob/master/doc/example_datasets/1_ontime.txt
Building
To build ClickHouse we will use a manual located here:
https://github.com/yandex/ClickHouse/blob/master/doc/build.md
Install required packages. After that let's run the following command from directory with source code of ClickHouse:
~/ClickHouse$ ./release
The build successfully completed:
Installing packages and running ClickHouse:
sudo apt-get install ../clickhouse-server-base_1.1.53960_amd64.deb ../clickhouse-server-common_1.1.53960_amd64.deb
sudo apt-get install ../clickhouse-client_1.1.53960_amd64.deb
sudo service clickhouse-server start
Creation of table
Before loading the date into ClickHouse, let's start console client of ClickHouse in order to create a table with necessary fields:
$ clickhouse-client
The table is being created with following query:
:) create table `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
One can get information about table using following queries:
:) desc ontime
:) show create ontime
Filling in the data
Let's download the data:
for s in `seq 1987 2015`; do
for m in `seq 1 12`; do
wget http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
After that, add the data to ClickHouse:
for i in *.zip; do
echo $i
unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --query="insert into ontime format CSVWithNames"
done
Working with data
Let's check if the table contains something:
:) select FlightDate, FlightNum, OriginCityName, DestCityName from ontime limit 10;
Now we'll craft more complicated query. E.g., output fraction of flights delayed for more than 10 minutes, for each year:
select Year, c1/c2
from
(
select
Year,
count(*)*100 as c1
from ontime
where DepDelay > 10
group by Year
)
any inner join
(
select
Year,
count(*) as c2
from ontime
group by Year
) using (Year)
order by Year;
Additionally
Copying a table
Suppose we need to copy 1% of records (luckiest ones) from ontime
table to new table named ontime_ltd
.
To achieve that make the queries:
:) create table ontime_ltd as ontime;
:) insert into ontime_ltd select * from ontime where rand() % 100 = 42;
Working with multiple tables
If one need to collect data from multiple tables, use function merge(database, regexp)
:
:) select (select count(*) from merge(default, 'ontime.*'))/(select count() from ontime);
List of running queries
For diagnostics one usually need to know what exactly ClickHouse is doing at the moment. Let's execute a query that gonna last long:
:) select sleep(1000);
If we run clickhouse-client
after that in another terminal, one can output list of queries and some additional useful information about them:
:) show processlist;