Merge pull request #48139 from ClickHouse/sensors-dataset

Add environmental sensors dataset
2024-11-23 16:12:01 +00:00 · 2023-03-28 22:18:58 -06:00 · 2023-03-28 22:18:58 -06:00 · 330e85aea6
commit 330e85aea6
parent 8e8babc469 0ca26b09eb
3 changed files with 172 additions and 0 deletions
--- a/docs/en/getting-started/example-datasets/environmental-sensors.md
+++ b/docs/en/getting-started/example-datasets/environmental-sensors.md
@ -0,0 +1,172 @@
+---
+slug: /en/getting-started/example-datasets/environmental-sensors
+sidebar_label: Environmental Sensors Data
+---
+
+# Environmental Sensors Data
+
+[Sensor.Community](https://sensor.community/en/) is a contributors-driven global sensor network that creates Open Environmental Data. The data is collected from sensors all over the globe. Anyone can purchase a sensor and place it wherever they like. The APIs to download the data is in [GitHub](https://github.com/opendata-stuttgart/meta/wiki/APIs) and the data is freely available under the [Database Contents License (DbCL)](https://opendatacommons.org/licenses/dbcl/1-0/).
+
+:::important
+The dataset has over 20 billion records, so be careful just copying-and-pasting the commands below unless your resources can handle that type of volume. The commands below were executed on a **Production** instance of [ClickHouse Cloud](https://clickhouse.cloud).
+:::
+
+1. The data is in S3, so we can use the `s3` table function to create a table from the files. We can also query the data in place. Let's look at a few rows before attempting to insert it into ClickHouse:
+
+```sql
+SELECT *
+FROM s3(
+    'https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/sensors/monthly/2019-06_bmp180.csv.zst',
+    'CSVWithNames'
+   )
+LIMIT 10
+SETTINGS format_csv_delimiter = ';';
+```
+
+The data is in CSV files but uses a semi-colon for the delimiter. The rows look like:
+
+```response
+┌─sensor_id─┬─sensor_type─┬─location─┬────lat─┬────lon─┬─timestamp───────────┬──pressure─┬─altitude─┬─pressure_sealevel─┬─temperature─┐
+│      9119 │ BMP180      │     4594 │ 50.994 │  7.126 │ 2019-06-01T00:00:00 │    101471 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │        19.9 │
+│     21210 │ BMP180      │    10762 │ 42.206 │ 25.326 │ 2019-06-01T00:00:00 │     99525 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │        19.3 │
+│     19660 │ BMP180      │     9978 │ 52.434 │ 17.056 │ 2019-06-01T00:00:04 │    101570 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │        15.3 │
+│     12126 │ BMP180      │     6126 │ 57.908 │  16.49 │ 2019-06-01T00:00:05 │ 101802.56 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │        8.07 │
+│     15845 │ BMP180      │     8022 │ 52.498 │ 13.466 │ 2019-06-01T00:00:05 │    101878 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │          23 │
+│     16415 │ BMP180      │     8316 │ 49.312 │  6.744 │ 2019-06-01T00:00:06 │    100176 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │        14.7 │
+│      7389 │ BMP180      │     3735 │ 50.136 │ 11.062 │ 2019-06-01T00:00:06 │     98905 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │        12.1 │
+│     13199 │ BMP180      │     6664 │ 52.514 │  13.44 │ 2019-06-01T00:00:07 │ 101855.54 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │       19.74 │
+│     12753 │ BMP180      │     6440 │ 44.616 │  2.032 │ 2019-06-01T00:00:07 │     99475 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │          17 │
+│     16956 │ BMP180      │     8594 │ 52.052 │ 8.354 │ 2019-06-01T00:00:08 │   101322 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ              │        17.2 │
+└───────────┴─────────────┴──────────┴────────┴───────┴─────────────────────┴──────────┴──────────┴───────────────────┴─────────────┘
+```
+
+2. We will use the following `MergeTree` table to store the data in ClickHouse:
+
+```sql
+CREATE TABLE sensors
+(
+    sensor_id UInt16,
+    sensor_type Enum('BME280', 'BMP180', 'BMP280', 'DHT22', 'DS18B20', 'HPM', 'HTU21D', 'PMS1003', 'PMS3003', 'PMS5003', 'PMS6003', 'PMS7003', 'PPD42NS', 'SDS011'),
+    location UInt32,
+    lat Float32,
+    lon Float32,
+    timestamp DateTime,
+    P1 Float32,
+    P2 Float32,
+    P0 Float32,
+    durP1 Float32,
+    ratioP1 Float32,
+    durP2 Float32,
+    ratioP2 Float32,
+    pressure Float32,
+    altitude Float32,
+    pressure_sealevel Float32,
+    temperature Float32,
+    humidity Float32,
+    date Date MATERIALIZED toDate(timestamp)
+)
+ENGINE = MergeTree
+ORDER BY (timestamp, sensor_id);
+```
+
+3. ClickHouse Cloud services have a cluster named `default`. We will use the `s3Cluster` table function, which reads S3 files in parallel from the nodes in your cluster. (If you do not have a cluster, just use the `s3` function and remove the cluster name.)
+
+This query will take a while - it's about 1.67T of data uncompressed:
+
+```sql
+INSERT INTO sensors
+    SELECT *
+    FROM s3Cluster(
+        'default',
+        'https://clickhouse-public-datasets.s3.amazonaws.com/sensors/monthly/*.csv.zst',
+        'CSVWithNames',
+        $$ sensor_id UInt16,
+        sensor_type String,
+        location UInt32,
+        lat Float32,
+        lon Float32,
+        timestamp DateTime,
+        P1 Float32,
+        P2 Float32,
+        P0 Float32,
+        durP1 Float32,
+        ratioP1 Float32,
+        durP2 Float32,
+        ratioP2 Float32,
+        pressure Float32,
+        altitude Float32,
+        pressure_sealevel Float32,
+        temperature Float32,
+        humidity Float32 $$
+    )
+SETTINGS
+    format_csv_delimiter = ';',
+    input_format_allow_errors_ratio = '0.5',
+    input_format_allow_errors_num = 10000,
+    input_format_parallel_parsing = 0,
+    date_time_input_format = 'best_effort',
+    max_insert_threads = 32,
+    parallel_distributed_insert_select = 1;
+```
+
+Here is the response - showing the number of rows and the speed of processing. It is input at a rate of over 6M rows per second!
+
+```response
+0 rows in set. Elapsed: 3419.330 sec. Processed 20.69 billion rows, 1.67 TB (6.05 million rows/s., 488.52 MB/s.)
+```
+
+4. Let's see how much storage disk is needed for the `sensors` table:
+
+```sql
+SELECT
+    disk_name,
+    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
+    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
+    round(usize / size, 2) AS compr_rate,
+    sum(rows) AS rows,
+    count() AS part_count
+FROM system.parts
+WHERE (active = 1) AND (table = 'sensors')
+GROUP BY
+    disk_name
+ORDER BY size DESC;
+```
+
+The 1.67T is compressed down to 1.30T, and there are 20.69 billion rows:
+
+```response
+┌─disk_name─┬─compressed─┬─uncompressed─┬─compr_rate─┬────────rows─┬─part_count─┐
+│ s3disk    │ 310.21 GiB │ 1.30 TiB     │       4.29 │ 20693971809 │        472 │
+└───────────┴────────────┴──────────────┴────────────┴─────────────┴────────────┘
+```
+
+5. Let's analyze the data now that it's in ClickHouse. Notice the quantity of data increases over time as more sensors are deployed:
+
+```sql
+SELECT
+    date,
+    count()
+FROM sensors
+GROUP BY date
+ORDER BY date ASC;
+```
+
+We can create a chart in the SQL Console to visualize the results:
+
+![Number of events per day](./images/sensors_01.png)
+
+6. This query counts the number of overly hot and humid days:
+
+```sql
+WITH
+    toYYYYMMDD(timestamp) AS day
+SELECT day, count() FROM sensors
+WHERE temperature >= 40 AND temperature <= 50 AND humidity >= 90
+GROUP BY day
+ORDER BY day asc;
+```
+
+Here's a visualization of the result:
+
+![Hot and humid days](./images/sensors_02.png)
+
--- a/docs/en/getting-started/example-datasets/images/sensors_01.png
+++ b/docs/en/getting-started/example-datasets/images/sensors_01.png
--- a/docs/en/getting-started/example-datasets/images/sensors_02.png
+++ b/docs/en/getting-started/example-datasets/images/sensors_02.png