weather data

2024-11-10 01:25:21 +00:00 · 2024-01-12 17:28:45 +00:00 · 2024-01-12 17:28:45 +00:00 · 1dacfc53ff
commit 1dacfc53ff
parent 8fbe832b71
1 changed files with 340 additions and 0 deletions
--- a/docs/en/getting-started/example-datasets/noaa.md
+++ b/docs/en/getting-started/example-datasets/noaa.md
@ -0,0 +1,340 @@
+---
+slug: /en/getting-started/example-datasets/noaa
+sidebar_label: NOAA Global Historical Climatology Network 
+sidebar_position: 1
+description: 2.5 billion rows of climate data for the last 120 yrs
+---
+
+# NOAA Global Historical Climatology Network 
+
+This dataset contains weather measurements for the last 120 years. Each row is a measurement for a point in time and station.
+
+More precisely and according to the [origin of this data](https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-ghcn):
+
+> GHCN-Daily is a dataset that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurements only (Menne et al., 2012). GHCN-Daily is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews (Durre et al., 2010). The archive includes the following meteorological elements:
+
+    - Daily maximum temperature
+    - Daily minimum temperature
+    - Temperature at the time of observation
+    - Precipitation (i.e., rain, melted snow)
+    - Snowfall
+    - Snow depth
+    - Other elements where available
+
+## Downloading the data
+
+- A [pre-prepared version](#pre-prepared-data) of the data for ClickHouse, which has been cleansed, re-structured, and enriched. This data covers the years 1900 to 2022.
+- [Download the original data](#original-data) and convert to the format required by ClickHouse. Users wanting to add their own columns may wish to explore this approach.
+
+### Pre-prepared data
+
+More specifically, rows have been removed that did not fail any quality assurance checks by Noaa. The data has also been restructured from a measurement per line to a row per station id and date i.e.
+
+```csv
+"station_id","date","tempAvg","tempMax","tempMin","precipitation","snowfall","snowDepth","percentDailySun","averageWindSpeed","maxWindSpeed","weatherType"
+"AEM00041194","2022-07-30",347,0,308,0,0,0,0,0,0,0
+"AEM00041194","2022-07-31",371,413,329,0,0,0,0,0,0,0
+"AEM00041194","2022-08-01",384,427,357,0,0,0,0,0,0,0
+"AEM00041194","2022-08-02",381,424,352,0,0,0,0,0,0,0
+```
+
+This is simpler to query and ensures the resulting table is less sparse. Finally, the data has also been enriched with latitude and longitude.
+
+This data is available in the following S3 location. Either download the data to your local filesystem (and insert using the ClickHouse client) or insert directly into ClickHouse (see [Inserting from S3](#inserting-from-s3)).
+
+To download:
+
+```bash
+wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/noaa/noaa_enriched.parquet
+```
+
+### Original data
+
+The following details the steps to download and transform the original data in preparation for loading into ClickHouse.
+
+#### Download
+
+To download the original data:
+
+```bash
+for i in {1900..2023}; do wget https://noaa-ghcn-pds.s3.amazonaws.com/csv.gz/${i}.csv.gz; done
+```
+
+#### Sampling the data
+
+```bash
+zcat 2021.csv.gz | head
+AE000041196,20210101,TMAX,278,,,S,
+AE000041196,20210101,PRCP,0,D,,S,
+AE000041196,20210101,TAVG,214,H,,S,
+AEM00041194,20210101,TMAX,266,,,S,
+AEM00041194,20210101,TMIN,178,,,S,
+AEM00041194,20210101,PRCP,0,,,S,
+AEM00041194,20210101,TAVG,217,H,,S,
+AEM00041217,20210101,TMAX,262,,,S,
+AEM00041217,20210101,TMIN,155,,,S,
+AEM00041217,20210101,TAVG,202,H,,S,
+```
+
+Summarizing the [format documentation](https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-ghcn):
+
+
+Summarizing the format documentation and the columns in order:
+
+ - An 11 character station identification code. This itself encodes some useful information
+ - YEAR/MONTH/DAY = 8 character date in YYYYMMDD format (e.g. 19860529 = May 29, 1986)
+ - ELEMENT = 4 character indicator of element type. Effectively the measurement type. While there are many measurements available, we select the following:
+    - PRCP - Precipitation (tenths of mm)
+    - SNOW - Snowfall (mm)
+    - SNWD - Snow depth (mm)
+    - TMAX - Maximum temperature (tenths of degrees C)
+    - TAVG - Average temperature (tenths of a degrees C)
+    - TMIN - Minimum temperature (tenths of degrees C)
+    - PSUN - Daily percent of possible sunshine (percent)
+    - AWND - Average daily wind speed (tenths of meters per second)
+    - WSFG - Peak gust wind speed (tenths of meters per second)
+    - WT** = Weather Type where ** defines the weather type. Full list of weather types here.
+- DATA VALUE = 5 character data value for ELEMENT i.e. the value of the measurement.
+- M-FLAG = 1 character Measurement Flag. This has 10 possible values. Some of these values indicate questionable data accuracy. We accept data where this is set to “P” - identified as missing presumed zero, as this is only relevant to the PRCP, SNOW and SNWD measurements.
+- Q-FLAG is the measurement quality flag with 14 possible values. We are only interested in data with an empty value i.e. it did not fail any quality assurance checks.
+- S-FLAG is the source flag for the observation. Not useful for our analysis and ignored.
+- OBS-TIME = 4-character time of observation in hour-minute format (i.e. 0700 =7:00 am). Typically not present in older data. We ignore this for our purposes.
+
+A measurement per line would result in a sparse table structure in ClickHouse. We should transform to a row per time and station, with measurements as columns. First, we limit the dataset to those rows without issues i.e. where `qFlag` is equal to an empty string.
+
+#### Clean the data
+
+Using [ClickHouse local](https://clickhouse.com/blog/extracting-converting-querying-local-files-with-sql-clickhouse-local) we can filter rows that represent measurements of interest and pass our quality requirements:
+
+```bash
+clickhouse local --query "SELECT count() 
+FROM file('*.csv.gz', CSV, 'station_id String, date String, measurement String, value Int64, mFlag String, qFlag String, sFlag String, obsTime String') WHERE qFlag = '' AND (measurement IN ('PRCP', 'SNOW', 'SNWD', 'TMAX', 'TAVG', 'TMIN', 'PSUN', 'AWND', 'WSFG') OR startsWith(measurement, 'WT'))"
+
+2679264563
+```
+
+With over 2.6 billion rows, this isn’t a fast query since it involves parsing all the files. On our 8 core  machine, this takes around 160 seconds.
+
+
+### Pivot data
+
+While the measurement per line structure can be used with ClickHouse, it will unnecessarily complicate future queries. Ideally, we need a row per station id and date, where each measurement type and associated value are a column i.e.
+
+```csv
+"station_id","date","tempAvg","tempMax","tempMin","precipitation","snowfall","snowDepth","percentDailySun","averageWindSpeed","maxWindSpeed","weatherType"
+"AEM00041194","2022-07-30",347,0,308,0,0,0,0,0,0,0
+"AEM00041194","2022-07-31",371,413,329,0,0,0,0,0,0,0
+"AEM00041194","2022-08-01",384,427,357,0,0,0,0,0,0,0
+"AEM00041194","2022-08-02",381,424,352,0,0,0,0,0,0,0
+```
+
+Using ClickHouse local and a simple `GROUP BY`, we can repivot our data to this structure. To limit memory overhead, we do this one file at a time.
+
+```bash
+for i in {1900..2022}
+do
+clickhouse-local --query "SELECT station_id,
+       toDate32(date) as date,
+       anyIf(value, measurement = 'TAVG') as tempAvg,
+       anyIf(value, measurement = 'TMAX') as tempMax,
+       anyIf(value, measurement = 'TMIN') as tempMin,
+       anyIf(value, measurement = 'PRCP') as precipitation,
+       anyIf(value, measurement = 'SNOW') as snowfall,
+       anyIf(value, measurement = 'SNWD') as snowDepth,
+       anyIf(value, measurement = 'PSUN') as percentDailySun,
+       anyIf(value, measurement = 'AWND') as averageWindSpeed,
+       anyIf(value, measurement = 'WSFG') as maxWindSpeed,
+       toUInt8OrZero(replaceOne(anyIf(measurement, startsWith(measurement, 'WT') AND value = 1), 'WT', '')) as weatherType
+FROM file('$i.csv.gz', CSV, 'station_id String, date String, measurement String, value Int64, mFlag String, qFlag String, sFlag String, obsTime String')
+ WHERE qFlag = '' AND (measurement IN ('PRCP', 'SNOW', 'SNWD', 'TMAX', 'TAVG', 'TMIN', 'PSUN', 'AWND', 'WSFG') OR startsWith(measurement, 'WT'))
+GROUP BY station_id, date
+ORDER BY station_id, date FORMAT CSV" >> "noaa.csv";
+done
+```
+
+This query produces a single 50GB file `noaa.csv`.
+
+### Enriching the data
+
+The data has no indication of location aside from a station id, which includes a prefix country code. Ideally, each station would have a latitude and longitude associated with it. To achieve this, NOAA conveniently provides the details of each station as a separate [ghcnd-stations.txt](https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-ghcn#format-of-ghcnd-stationstxt-file). This file has [several columns](https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-ghcn#format-of-ghcnd-stationstxt-file), of which five are useful to our future analysis: id, latitude, longitude, elevation, and name.
+
+```bash
+wget http://noaa-ghcn-pds.s3.amazonaws.com/ghcnd-stations.txt
+```
+
+```bash
+clickhouse local --query "WITH stations AS (SELECT id, lat, lon, elevation, splitByString(' GSN ',name)[1] as name FROM file('ghcnd-stations.txt', Regexp, 'id String, lat Float64, lon Float64, elevation Float32, name String'))
+SELECT station_id,
+       date,
+       tempAvg,
+       tempMax,
+       tempMin,
+       precipitation,
+       snowfall,
+       snowDepth,
+       percentDailySun,
+       averageWindSpeed,
+       maxWindSpeed,
+       weatherType,
+       tuple(lon, lat) as location,
+       elevation,
+       name
+FROM file('noaa.csv', CSV,
+          'station_id String, date Date32, tempAvg Int32, tempMax Int32, tempMin Int32, precipitation Int32, snowfall Int32, snowDepth Int32, percentDailySun Int8, averageWindSpeed Int32, maxWindSpeed Int32, weatherType UInt8') as noaa LEFT OUTER
+         JOIN stations ON noaa.station_id = stations.id INTO OUTFILE 'noaa_enriched.parquet' FORMAT Parquet SETTINGS format_regexp='^(.{11})\s+(\-?\d{1,2}\.\d{4})\s+(\-?\d{1,3}\.\d{1,4})\s+(\-?\d*\.\d*)\s+(.*)\s+(?:[\d]*)'" 
+```
+This query takes a few minutes to run and produces an 6.4GB file `noaa_enriched.parquet`.
+
+## Create table
+
+Create a MergeTree table in ClickHouse (from the ClickHouse client).
+
+```sql
+CREATE TABLE noaa
+(
+   `station_id` LowCardinality(String),
+   `date` Date32,
+   `tempAvg` Int32 COMMENT 'Average temperature (tenths of a degrees C)',
+   `tempMax` Int32 COMMENT 'Maximum temperature (tenths of degrees C)',
+   `tempMin` Int32 COMMENT 'Minimum temperature (tenths of degrees C)',
+   `precipitation` UInt32 COMMENT 'Precipitation (tenths of mm)',
+   `snowfall` UInt32 COMMENT 'Snowfall (mm)',
+   `snowDepth` UInt32 COMMENT 'Snow depth (mm)',
+   `percentDailySun` UInt8 COMMENT 'Daily percent of possible sunshine (percent)',
+   `averageWindSpeed` UInt32 COMMENT 'Average daily wind speed (tenths of meters per second)',
+   `maxWindSpeed` UInt32 COMMENT 'Peak gust wind speed (tenths of meters per second)',
+   `weatherType` Enum8('Normal' = 0, 'Fog' = 1, 'Heavy Fog' = 2, 'Thunder' = 3, 'Small Hail' = 4, 'Hail' = 5, 'Glaze' = 6, 'Dust/Ash' = 7, 'Smoke/Haze' = 8, 'Blowing/Drifting Snow' = 9, 'Tornado' = 10, 'High Winds' = 11, 'Blowing Spray' = 12, 'Mist' = 13, 'Drizzle' = 14, 'Freezing Drizzle' = 15, 'Rain' = 16, 'Freezing Rain' = 17, 'Snow' = 18, 'Unknown Precipitation' = 19, 'Ground Fog' = 21, 'Freezing Fog' = 22),
+   `location` Point,
+   `elevation` Float32,
+   `name` LowCardinality(String)
+) ENGINE = MergeTree() ORDER BY (station_id, date);
+
+```
+
+## Inserting into ClickHouse
+
+### Inserting from local file
+
+Data can be inserted from local file as follows (from the ClickHouse client):
+
+```sql
+INSERT INTO noaa FROM INFILE '<path>/noaa_enriched.parquet'
+```
+
+where `<path>` represents the full path to the local file on disk. 
+
+See [here](https://clickhouse.com/blog/real-world-data-noaa-climate-data#load-the-data) for how to speed this load up.
+
+### Inserting from S3
+
+```sql
+INSERT INTO noaa SELECT *
+FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/noaa/noaa_enriched.parquet')
+
+```
+For how to speed this up, see our blog post on [tuning large data loads](https://clickhouse.com/blog/supercharge-your-clickhouse-data-loads-part2).
+
+## Sample queries
+
+### Highest temperature ever
+
+```sql
+SELECT
+    tempMax / 10 AS maxTemp,
+    location,
+    name,
+    date
+FROM blogs.noaa
+WHERE tempMax > 500
+ORDER BY
+    tempMax DESC,
+    date ASC
+LIMIT 5
+
+┌─maxTemp─┬─location──────────┬─name───────────────────────────────────────────┬───────date─┐
+│    56.7 │ (-116.8667,36.45) │ CA GREENLAND RCH                               │ 1913-07-10 │
+│    56.7 │ (-115.4667,32.55) │ MEXICALI (SMN)                                 │ 1949-08-20 │
+│    56.7 │ (-115.4667,32.55) │ MEXICALI (SMN)                                 │ 1949-09-18 │
+│    56.7 │ (-115.4667,32.55) │ MEXICALI (SMN)                                 │ 1952-07-17 │
+│    56.7 │ (-115.4667,32.55) │ MEXICALI (SMN)                                 │ 1952-09-04 │
+└─────────┴───────────────────┴────────────────────────────────────────────────┴────────────┘
+
+5 rows in set. Elapsed: 0.514 sec. Processed 1.06 billion rows, 4.27 GB (2.06 billion rows/s., 8.29 GB/s.)
+```
+
+Reassuringly consistent with the [documented record](https://en.wikipedia.org/wiki/List_of_weather_records#Highest_temperatures_ever_recorded) at [Furnace Creek](https://www.google.com/maps/place/36%C2%B027'00.0%22N+116%C2%B052'00.1%22W/@36.1329666,-116.1104099,8.95z/data=!4m5!3m4!1s0x0:0xf2ed901b860f4446!8m2!3d36.45!4d-116.8667) as of 2023.
+
+### Best ski resorts
+
+Using a [list of ski resorts](https://gist.githubusercontent.com/gingerwizard/dd022f754fd128fdaf270e58fa052e35/raw/622e03c37460f17ef72907afe554cb1c07f91f23/ski_resort_stats.csv) in the united states and their respective locations, we join these against the top 1000 weather stations with the most in any month in the last 5 yrs. Sorting this join by [geoDistance](https://clickhouse.com/docs/en/sql-reference/functions/geo/coordinates/#geodistance) and restricting the results to those where the distance is less than 20km, we select the top result per resort and sort this by total snow. Note we also restrict resorts to those above 1800m, as a broad indicator of good skiing conditions.
+
+```sql
+SELECT
+   resort_name,
+   total_snow / 1000 AS total_snow_m,
+   resort_location,
+   month_year
+FROM
+(
+   WITH resorts AS
+       (
+           SELECT
+               resort_name,
+               state,
+               (lon, lat) AS resort_location,
+               'US' AS code
+           FROM url('https://gist.githubusercontent.com/gingerwizard/dd022f754fd128fdaf270e58fa052e35/raw/622e03c37460f17ef72907afe554cb1c07f91f23/ski_resort_stats.csv', CSVWithNames)
+       )
+   SELECT
+       resort_name,
+       highest_snow.station_id,
+       geoDistance(resort_location.1, resort_location.2, station_location.1, station_location.2) / 1000 AS distance_km,
+       highest_snow.total_snow,
+       resort_location,
+       station_location,
+       month_year
+   FROM
+   (
+       SELECT
+           sum(snowfall) AS total_snow,
+           station_id,
+           any(location) AS station_location,
+           month_year,
+           substring(station_id, 1, 2) AS code
+       FROM noaa
+       WHERE (date > '2017-01-01') AND (code = 'US') AND (elevation > 1800)
+       GROUP BY
+           station_id,
+           toYYYYMM(date) AS month_year
+       ORDER BY total_snow DESC
+       LIMIT 1000
+   ) AS highest_snow
+   INNER JOIN resorts ON highest_snow.code = resorts.code
+   WHERE distance_km < 20
+   ORDER BY
+       resort_name ASC,
+       total_snow DESC
+   LIMIT 1 BY
+       resort_name,
+       station_id
+)
+ORDER BY total_snow DESC
+LIMIT 5
+
+┌─resort_name──────────┬─total_snow_m─┬─resort_location─┬─month_year─┐
+│ Sugar Bowl, CA       │        7.799 │ (-120.3,39.27)  │     201902 │
+│ Donner Ski Ranch, CA │        7.799 │ (-120.34,39.31) │     201902 │
+│ Boreal, CA           │        7.799 │ (-120.35,39.33) │     201902 │
+│ Homewood, CA         │        4.926 │ (-120.17,39.08) │     201902 │
+│ Alpine Meadows, CA   │        4.926 │ (-120.22,39.17) │     201902 │
+└──────────────────────┴──────────────┴─────────────────┴────────────┘
+
+5 rows in set. Elapsed: 0.750 sec. Processed 689.10 million rows, 3.20 GB (918.20 million rows/s., 4.26 GB/s.)
+Peak memory usage: 67.66 MiB.
+```
+
+## Credits
+
+We would like to acknowledge the efforts of the Global Historical Climatology Network for preparing, cleansing, and distributing this data. We appreciate your efforts.
+
+Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.25]. NOAA National Centers for Environmental Information. http://doi.org/10.7289/V5D21VHZ [17/08/2020]