mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-11-27 18:12:02 +00:00
173 lines
7.6 KiB
Markdown
173 lines
7.6 KiB
Markdown
---
|
|
slug: /en/getting-started/example-datasets/environmental-sensors
|
|
sidebar_label: Environmental Sensors Data
|
|
---
|
|
|
|
# Environmental Sensors Data
|
|
|
|
[Sensor.Community](https://sensor.community/en/) is a contributors-driven global sensor network that creates Open Environmental Data. The data is collected from sensors all over the globe. Anyone can purchase a sensor and place it wherever they like. The APIs to download the data is in [GitHub](https://github.com/opendata-stuttgart/meta/wiki/APIs) and the data is freely available under the [Database Contents License (DbCL)](https://opendatacommons.org/licenses/dbcl/1-0/).
|
|
|
|
:::important
|
|
The dataset has over 20 billion records, so be careful just copying-and-pasting the commands below unless your resources can handle that type of volume. The commands below were executed on a **Production** instance of [ClickHouse Cloud](https://clickhouse.cloud).
|
|
:::
|
|
|
|
1. The data is in S3, so we can use the `s3` table function to create a table from the files. We can also query the data in place. Let's look at a few rows before attempting to insert it into ClickHouse:
|
|
|
|
```sql
|
|
SELECT *
|
|
FROM s3(
|
|
'https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/sensors/monthly/2019-06_bmp180.csv.zst',
|
|
'CSVWithNames'
|
|
)
|
|
LIMIT 10
|
|
SETTINGS format_csv_delimiter = ';';
|
|
```
|
|
|
|
The data is in CSV files but uses a semi-colon for the delimiter. The rows look like:
|
|
|
|
```response
|
|
┌─sensor_id─┬─sensor_type─┬─location─┬────lat─┬────lon─┬─timestamp───────────┬──pressure─┬─altitude─┬─pressure_sealevel─┬─temperature─┐
|
|
│ 9119 │ BMP180 │ 4594 │ 50.994 │ 7.126 │ 2019-06-01T00:00:00 │ 101471 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 19.9 │
|
|
│ 21210 │ BMP180 │ 10762 │ 42.206 │ 25.326 │ 2019-06-01T00:00:00 │ 99525 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 19.3 │
|
|
│ 19660 │ BMP180 │ 9978 │ 52.434 │ 17.056 │ 2019-06-01T00:00:04 │ 101570 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 15.3 │
|
|
│ 12126 │ BMP180 │ 6126 │ 57.908 │ 16.49 │ 2019-06-01T00:00:05 │ 101802.56 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 8.07 │
|
|
│ 15845 │ BMP180 │ 8022 │ 52.498 │ 13.466 │ 2019-06-01T00:00:05 │ 101878 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 23 │
|
|
│ 16415 │ BMP180 │ 8316 │ 49.312 │ 6.744 │ 2019-06-01T00:00:06 │ 100176 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 14.7 │
|
|
│ 7389 │ BMP180 │ 3735 │ 50.136 │ 11.062 │ 2019-06-01T00:00:06 │ 98905 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 12.1 │
|
|
│ 13199 │ BMP180 │ 6664 │ 52.514 │ 13.44 │ 2019-06-01T00:00:07 │ 101855.54 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 19.74 │
|
|
│ 12753 │ BMP180 │ 6440 │ 44.616 │ 2.032 │ 2019-06-01T00:00:07 │ 99475 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 17 │
|
|
│ 16956 │ BMP180 │ 8594 │ 52.052 │ 8.354 │ 2019-06-01T00:00:08 │ 101322 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 17.2 │
|
|
└───────────┴─────────────┴──────────┴────────┴───────┴─────────────────────┴──────────┴──────────┴───────────────────┴─────────────┘
|
|
```
|
|
|
|
2. We will use the following `MergeTree` table to store the data in ClickHouse:
|
|
|
|
```sql
|
|
CREATE TABLE sensors
|
|
(
|
|
sensor_id UInt16,
|
|
sensor_type Enum('BME280', 'BMP180', 'BMP280', 'DHT22', 'DS18B20', 'HPM', 'HTU21D', 'PMS1003', 'PMS3003', 'PMS5003', 'PMS6003', 'PMS7003', 'PPD42NS', 'SDS011'),
|
|
location UInt32,
|
|
lat Float32,
|
|
lon Float32,
|
|
timestamp DateTime,
|
|
P1 Float32,
|
|
P2 Float32,
|
|
P0 Float32,
|
|
durP1 Float32,
|
|
ratioP1 Float32,
|
|
durP2 Float32,
|
|
ratioP2 Float32,
|
|
pressure Float32,
|
|
altitude Float32,
|
|
pressure_sealevel Float32,
|
|
temperature Float32,
|
|
humidity Float32,
|
|
date Date MATERIALIZED toDate(timestamp)
|
|
)
|
|
ENGINE = MergeTree
|
|
ORDER BY (timestamp, sensor_id);
|
|
```
|
|
|
|
3. ClickHouse Cloud services have a cluster named `default`. We will use the `s3Cluster` table function, which reads S3 files in parallel from the nodes in your cluster. (If you do not have a cluster, just use the `s3` function and remove the cluster name.)
|
|
|
|
This query will take a while - it's about 1.67T of data uncompressed:
|
|
|
|
```sql
|
|
INSERT INTO sensors
|
|
SELECT *
|
|
FROM s3Cluster(
|
|
'default',
|
|
'https://clickhouse-public-datasets.s3.amazonaws.com/sensors/monthly/*.csv.zst',
|
|
'CSVWithNames',
|
|
$$ sensor_id UInt16,
|
|
sensor_type String,
|
|
location UInt32,
|
|
lat Float32,
|
|
lon Float32,
|
|
timestamp DateTime,
|
|
P1 Float32,
|
|
P2 Float32,
|
|
P0 Float32,
|
|
durP1 Float32,
|
|
ratioP1 Float32,
|
|
durP2 Float32,
|
|
ratioP2 Float32,
|
|
pressure Float32,
|
|
altitude Float32,
|
|
pressure_sealevel Float32,
|
|
temperature Float32,
|
|
humidity Float32 $$
|
|
)
|
|
SETTINGS
|
|
format_csv_delimiter = ';',
|
|
input_format_allow_errors_ratio = '0.5',
|
|
input_format_allow_errors_num = 10000,
|
|
input_format_parallel_parsing = 0,
|
|
date_time_input_format = 'best_effort',
|
|
max_insert_threads = 32,
|
|
parallel_distributed_insert_select = 1;
|
|
```
|
|
|
|
Here is the response - showing the number of rows and the speed of processing. It is input at a rate of over 6M rows per second!
|
|
|
|
```response
|
|
0 rows in set. Elapsed: 3419.330 sec. Processed 20.69 billion rows, 1.67 TB (6.05 million rows/s., 488.52 MB/s.)
|
|
```
|
|
|
|
4. Let's see how much storage disk is needed for the `sensors` table:
|
|
|
|
```sql
|
|
SELECT
|
|
disk_name,
|
|
formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
|
|
formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
|
|
round(usize / size, 2) AS compr_rate,
|
|
sum(rows) AS rows,
|
|
count() AS part_count
|
|
FROM system.parts
|
|
WHERE (active = 1) AND (table = 'sensors')
|
|
GROUP BY
|
|
disk_name
|
|
ORDER BY size DESC;
|
|
```
|
|
|
|
The 1.67T is compressed down to 310 GiB, and there are 20.69 billion rows:
|
|
|
|
```response
|
|
┌─disk_name─┬─compressed─┬─uncompressed─┬─compr_rate─┬────────rows─┬─part_count─┐
|
|
│ s3disk │ 310.21 GiB │ 1.30 TiB │ 4.29 │ 20693971809 │ 472 │
|
|
└───────────┴────────────┴──────────────┴────────────┴─────────────┴────────────┘
|
|
```
|
|
|
|
5. Let's analyze the data now that it's in ClickHouse. Notice the quantity of data increases over time as more sensors are deployed:
|
|
|
|
```sql
|
|
SELECT
|
|
date,
|
|
count()
|
|
FROM sensors
|
|
GROUP BY date
|
|
ORDER BY date ASC;
|
|
```
|
|
|
|
We can create a chart in the SQL Console to visualize the results:
|
|
|
|
![Number of events per day](./images/sensors_01.png)
|
|
|
|
6. This query counts the number of overly hot and humid days:
|
|
|
|
```sql
|
|
WITH
|
|
toYYYYMMDD(timestamp) AS day
|
|
SELECT day, count() FROM sensors
|
|
WHERE temperature >= 40 AND temperature <= 50 AND humidity >= 90
|
|
GROUP BY day
|
|
ORDER BY day asc;
|
|
```
|
|
|
|
Here's a visualization of the result:
|
|
|
|
![Hot and humid days](./images/sensors_02.png)
|
|
|