## Introduction
To start, we need a computer. For example, let's create a virtual instance in OpenStack with the following characteristics:
```
RAM 61GB
VCPUs 16 VCPU
Disk 40GB
Ephemeral Disk 100GB
```
OS:
```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
```
We will try ClickHouse on open data from the On Time database, which is maintained by the United States Department of
Transportation. Information about it, the table structure, and example queries can be found here:
```
https://github.com/yandex/ClickHouse/blob/master/doc/example_datasets/1_ontime.txt
```
## Building
To build ClickHouse, we will follow the manual located here:
```
https://github.com/yandex/ClickHouse/blob/master/doc/build.md
```
Install the required packages. After that, run the following command from the directory containing the ClickHouse source code:
```
~/ClickHouse$ DISABLE_MONGODB=1 ./release --standalone
```
The build completed successfully:
![](images/build_completed.png)
Install the packages and start ClickHouse:
```
sudo apt-get install ../clickhouse-server-base_1.1.53960_amd64.deb ../clickhouse-server-common_1.1.53960_amd64.deb
sudo apt-get install ../clickhouse-client_1.1.53960_amd64.deb
sudo service clickhouse-server start
```
## Creation of table
Before loading the data into ClickHouse, let's start the ClickHouse console client in order to create a table with the necessary fields:
```
$ clickhouse-client
```
The table is created with the following query:
```
:) create table `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
```
![](images/table_created.png)
You can get information about the table using the following queries:
```
:) desc ontime
```
```
:) show create ontime
```
## Filling in the data
Let's download the data:
```
for s in `seq 1987 2015`; do
for m in `seq 1 12`; do
wget http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
```
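Before starting a long download, it can help to do a dry run that only builds the URL list. The sketch below reuses the URL pattern from the loop above; the variable name `urls` is just an illustration. It should print 348, one archive per month for 29 years.

```shell
# Build the list of archive URLs without downloading anything.
urls=$(for s in $(seq 1987 2015); do
    for m in $(seq 1 12); do
        echo "http://tsdata.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip"
    done
done)

# Count the URLs: 29 years x 12 months.
echo "$urls" | wc -l

# To actually fetch them, resuming partial files, one option is:
#   echo "$urls" | wget -c -i -
```

Here `wget -c` continues interrupted downloads and `-i -` reads the URL list from standard input, so the list can be re-fed after a network failure.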
After that, add the data to ClickHouse:
```
for i in *.zip; do
echo $i
unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --query="insert into ontime format CSVWithNames"
done
```
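The `sed` step in the loop above strips the `.00` suffix so that values such as `12.00` can be parsed into the `Int32` columns. A minimal sketch of its effect, using made-up sample rows (the real files have many more columns):

```shell
# Hypothetical sample row, not taken from the real dataset.
sample='2015,1,"AA",12.00,-3.00,"N123AA"'
cleaned=$(printf '%s\n' "$sample" | sed 's/\.00//g')
echo "$cleaned"

# The pattern is not anchored to the end of a field, so it also
# rewrites values that merely contain ".00".
tricky='1.005'
mangled=$(printf '%s\n' "$tricky" | sed 's/\.00//g')
echo "$mangled"
```

The first line is printed as `2015,1,"AA",12,-3,"N123AA"`. The second shows the limitation: `1.005` becomes `15`. This is worth keeping in mind if the same filter is applied to data with other decimal values.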
## Working with data
Let's check if the table contains something:
```
:) select FlightDate, FlightNum, OriginCityName, DestCityName from ontime limit 10;
```
![](images/something.png)
Now let's craft a more complicated query. For example, output the percentage of flights delayed by more than 10 minutes, for each year:
```
select Year, c1/c2
from
(
select
Year,
count(*)*100 as c1
from ontime
where DepDelay > 10
group by Year
)
any inner join
(
select
Year,
count(*) as c2
from ontime
group by Year
) using (Year)
order by Year;
```
![](images/complicated.png)
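The same percentage can probably be obtained without a join. A comparison in ClickHouse returns 0 or 1, so averaging it gives the share of matching rows in a single pass. The sketch below passes the query through `clickhouse-client --query`, as in the loading step; the alias `pct_delayed` and the shell variable `query` are names chosen here for illustration. Unlike the inner join, this form also returns a row (with 0) for a year that has no delayed flights.

```shell
# Single-pass alternative: avg() over a 0/1 comparison is the fraction
# of delayed flights; multiplying by 100 turns it into a percentage.
query='select Year, avg(DepDelay > 10) * 100 as pct_delayed from ontime group by Year order by Year'

# Run the query only if the client is installed; otherwise just print it.
if command -v clickhouse-client >/dev/null 2>&1; then
    clickhouse-client --query="$query"
else
    echo "$query"
fi
```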
## Additionally
### Copying a table
Suppose we need to copy 1% of the records (the luckiest ones) from the ```ontime``` table to a new table named ```ontime_ltd```.
To achieve that, run the following queries:
```
:) create table ontime_ltd as ontime;
:) insert into ontime_ltd select * from ontime where rand() % 100 = 42;
```
### Working with multiple tables
If you need to collect data from multiple tables, use the function ```merge(database, regexp)```:
```
:) select (select count(*) from merge(default, 'ontime.*'))/(select count() from ontime);
```
![](images/1_percent.png)
### List of running queries
For diagnostics, one usually needs to know what exactly ClickHouse is doing at the moment. Let's execute a query that will run for a long time:
```
:) select sleep(1000);
```
If we then run ```clickhouse-client``` in another terminal, we can output the list of running queries along with some additional useful information about them:
```
:) show processlist;
```
![](images/long_query.png)