mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-12-15 10:52:30 +00:00
266 lines
12 KiB
Markdown
266 lines
12 KiB
Markdown
---
|
|
slug: /en/getting-started/example-datasets/covid19
|
|
sidebar_label: COVID-19 Open-Data
|
|
---
|
|
|
|
# COVID-19 Open-Data
|
|
|
|
COVID-19 Open-Data attempts to assemble the largest Covid-19 epidemiological database, in addition to a powerful set of expansive covariates. It includes open, publicly sourced, licensed data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, weather, and more.
|
|
|
|
The details are in GitHub [here](https://github.com/GoogleCloudPlatform/covid-19-open-data).
|
|
|
|
It's easy to insert this data into ClickHouse...
|
|
|
|
:::note
|
|
The following commands were executed on a **Production** instance of [ClickHouse Cloud](https://clickhouse.cloud). You can easily run them on a local install as well.
|
|
:::
|
|
|
|
1. Let's see what the data looks like:
|
|
|
|
```sql
|
|
DESCRIBE url(
|
|
'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
|
|
'CSVWithNames'
|
|
);
|
|
```
|
|
|
|
The CSV file has 10 columns:
|
|
|
|
```response
|
|
┌─name─────────────────┬─type─────────────┐
|
|
│ date │ Nullable(Date) │
|
|
│ location_key │ Nullable(String) │
|
|
│ new_confirmed │ Nullable(Int64) │
|
|
│ new_deceased │ Nullable(Int64) │
|
|
│ new_recovered │ Nullable(Int64) │
|
|
│ new_tested │ Nullable(Int64) │
|
|
│ cumulative_confirmed │ Nullable(Int64) │
|
|
│ cumulative_deceased │ Nullable(Int64) │
|
|
│ cumulative_recovered │ Nullable(Int64) │
|
|
│ cumulative_tested │ Nullable(Int64) │
|
|
└──────────────────────┴──────────────────┘
|
|
|
|
10 rows in set. Elapsed: 0.745 sec.
|
|
```
|
|
|
|
2. Now let's view some of the rows:
|
|
|
|
```sql
|
|
SELECT *
|
|
FROM url('https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv')
|
|
LIMIT 100;
|
|
```
|
|
|
|
Notice the `url` function easily reads data from a CSV file:
|
|
|
|
```response
|
|
┌─c1─────────┬─c2───────────┬─c3────────────┬─c4───────────┬─c5────────────┬─c6─────────┬─c7───────────────────┬─c8──────────────────┬─c9───────────────────┬─c10───────────────┐
|
|
│ date │ location_key │ new_confirmed │ new_deceased │ new_recovered │ new_tested │ cumulative_confirmed │ cumulative_deceased │ cumulative_recovered │ cumulative_tested │
|
|
│ 2020-04-03 │ AD │ 24 │ 1 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 466 │ 17 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
|
|
│ 2020-04-04 │ AD │ 57 │ 0 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 523 │ 17 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
|
|
│ 2020-04-05 │ AD │ 17 │ 4 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 540 │ 21 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
|
|
│ 2020-04-06 │ AD │ 11 │ 1 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 551 │ 22 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
|
|
│ 2020-04-07 │ AD │ 15 │ 2 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 566 │ 24 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
|
|
│ 2020-04-08 │ AD │ 23 │ 2 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 589 │ 26 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
|
|
└────────────┴──────────────┴───────────────┴──────────────┴───────────────┴────────────┴──────────────────────┴─────────────────────┴──────────────────────┴───────────────────┘
|
|
```
|
|
|
|
3. We will create a table now that we know what the data looks like:
|
|
|
|
```sql
|
|
CREATE TABLE covid19 (
|
|
date Date,
|
|
location_key LowCardinality(String),
|
|
new_confirmed Int32,
|
|
new_deceased Int32,
|
|
new_recovered Int32,
|
|
new_tested Int32,
|
|
cumulative_confirmed Int32,
|
|
cumulative_deceased Int32,
|
|
cumulative_recovered Int32,
|
|
cumulative_tested Int32
|
|
)
|
|
ENGINE = MergeTree
|
|
ORDER BY (location_key, date);
|
|
```
|
|
|
|
4. The following command inserts the entire dataset into the `covid19` table:
|
|
|
|
```sql
|
|
INSERT INTO covid19
|
|
SELECT *
|
|
FROM
|
|
url(
|
|
'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
|
|
CSVWithNames,
|
|
'date Date,
|
|
location_key LowCardinality(String),
|
|
new_confirmed Int32,
|
|
new_deceased Int32,
|
|
new_recovered Int32,
|
|
new_tested Int32,
|
|
cumulative_confirmed Int32,
|
|
cumulative_deceased Int32,
|
|
cumulative_recovered Int32,
|
|
cumulative_tested Int32'
|
|
);
|
|
```
|
|
|
|
5. It goes pretty quick - let's see how many rows were inserted:
|
|
|
|
```sql
|
|
SELECT formatReadableQuantity(count())
|
|
FROM covid19;
|
|
```
|
|
|
|
```response
|
|
┌─formatReadableQuantity(count())─┐
|
|
│ 12.53 million │
|
|
└─────────────────────────────────┘
|
|
```
|
|
|
|
6. Let's see how many total cases of Covid-19 were recorded:
|
|
|
|
```sql
|
|
SELECT formatReadableQuantity(sum(new_confirmed))
|
|
FROM covid19;
|
|
```
|
|
|
|
```response
|
|
┌─formatReadableQuantity(sum(new_confirmed))─┐
|
|
│ 1.39 billion │
|
|
└────────────────────────────────────────────┘
|
|
```
|
|
|
|
7. You will notice the data has a lot of 0's for dates - either weekends or days when numbers were not reported each day. We can use a window function to smooth out the daily averages of new cases:
|
|
|
|
```sql
|
|
SELECT
|
|
AVG(new_confirmed) OVER (PARTITION BY location_key ORDER BY date ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS cases_smoothed,
|
|
new_confirmed,
|
|
location_key,
|
|
date
|
|
FROM covid19;
|
|
```
|
|
|
|
8. This query determines the latest values for each location. We can't use `max(date)` because not all countries reported every day, so we grab the last row using `ROW_NUMBER`:
|
|
|
|
```sql
|
|
WITH latest_deaths_data AS
|
|
( SELECT location_key,
|
|
date,
|
|
new_deceased,
|
|
new_confirmed,
|
|
ROW_NUMBER() OVER (PARTITION BY location_key ORDER BY date DESC) as rn
|
|
FROM covid19)
|
|
SELECT location_key,
|
|
date,
|
|
new_deceased,
|
|
new_confirmed,
|
|
rn
|
|
FROM latest_deaths_data
|
|
WHERE rn=1;
|
|
```
|
|
|
|
9. We can use `lagInFrame` to determine the `LAG` of new cases each day. In this query we filter by the `US_DC` location:
|
|
|
|
```sql
|
|
SELECT
|
|
new_confirmed - lagInFrame(new_confirmed,1) OVER (PARTITION BY location_key ORDER BY date) AS confirmed_cases_delta,
|
|
new_confirmed,
|
|
location_key,
|
|
date
|
|
FROM covid19
|
|
WHERE location_key = 'US_DC';
|
|
```
|
|
|
|
The response look like:
|
|
|
|
```response
|
|
┌─confirmed_cases_delta─┬─new_confirmed─┬─location_key─┬───────date─┐
|
|
│ 0 │ 0 │ US_DC │ 2020-03-08 │
|
|
│ 2 │ 2 │ US_DC │ 2020-03-09 │
|
|
│ -2 │ 0 │ US_DC │ 2020-03-10 │
|
|
│ 6 │ 6 │ US_DC │ 2020-03-11 │
|
|
│ -6 │ 0 │ US_DC │ 2020-03-12 │
|
|
│ 0 │ 0 │ US_DC │ 2020-03-13 │
|
|
│ 6 │ 6 │ US_DC │ 2020-03-14 │
|
|
│ -5 │ 1 │ US_DC │ 2020-03-15 │
|
|
│ 4 │ 5 │ US_DC │ 2020-03-16 │
|
|
│ 4 │ 9 │ US_DC │ 2020-03-17 │
|
|
│ -1 │ 8 │ US_DC │ 2020-03-18 │
|
|
│ 24 │ 32 │ US_DC │ 2020-03-19 │
|
|
│ -26 │ 6 │ US_DC │ 2020-03-20 │
|
|
│ 15 │ 21 │ US_DC │ 2020-03-21 │
|
|
│ -3 │ 18 │ US_DC │ 2020-03-22 │
|
|
│ 3 │ 21 │ US_DC │ 2020-03-23 │
|
|
```
|
|
|
|
10. This query calculates the percentage of change in new cases each day, and includes a simple `increase` or `decrease` column in the result set:
|
|
|
|
```sql
|
|
WITH confirmed_lag AS (
|
|
SELECT
|
|
*,
|
|
lagInFrame(new_confirmed) OVER(
|
|
PARTITION BY location_key
|
|
ORDER BY date
|
|
) AS confirmed_previous_day
|
|
FROM covid19
|
|
),
|
|
confirmed_percent_change AS (
|
|
SELECT
|
|
*,
|
|
COALESCE(ROUND((new_confirmed - confirmed_previous_day) / confirmed_previous_day * 100), 0) AS percent_change
|
|
FROM confirmed_lag
|
|
)
|
|
SELECT
|
|
date,
|
|
new_confirmed,
|
|
percent_change,
|
|
CASE
|
|
WHEN percent_change > 0 THEN 'increase'
|
|
WHEN percent_change = 0 THEN 'no change'
|
|
ELSE 'decrease'
|
|
END AS trend
|
|
FROM confirmed_percent_change
|
|
WHERE location_key = 'US_DC';
|
|
```
|
|
|
|
The results look like
|
|
|
|
```response
|
|
┌───────date─┬─new_confirmed─┬─percent_change─┬─trend─────┐
|
|
│ 2020-03-08 │ 0 │ nan │ decrease │
|
|
│ 2020-03-09 │ 2 │ inf │ increase │
|
|
│ 2020-03-10 │ 0 │ -100 │ decrease │
|
|
│ 2020-03-11 │ 6 │ inf │ increase │
|
|
│ 2020-03-12 │ 0 │ -100 │ decrease │
|
|
│ 2020-03-13 │ 0 │ nan │ decrease │
|
|
│ 2020-03-14 │ 6 │ inf │ increase │
|
|
│ 2020-03-15 │ 1 │ -83 │ decrease │
|
|
│ 2020-03-16 │ 5 │ 400 │ increase │
|
|
│ 2020-03-17 │ 9 │ 80 │ increase │
|
|
│ 2020-03-18 │ 8 │ -11 │ decrease │
|
|
│ 2020-03-19 │ 32 │ 300 │ increase │
|
|
│ 2020-03-20 │ 6 │ -81 │ decrease │
|
|
│ 2020-03-21 │ 21 │ 250 │ increase │
|
|
│ 2020-03-22 │ 18 │ -14 │ decrease │
|
|
│ 2020-03-23 │ 21 │ 17 │ increase │
|
|
│ 2020-03-24 │ 46 │ 119 │ increase │
|
|
│ 2020-03-25 │ 48 │ 4 │ increase │
|
|
│ 2020-03-26 │ 36 │ -25 │ decrease │
|
|
│ 2020-03-27 │ 37 │ 3 │ increase │
|
|
│ 2020-03-28 │ 38 │ 3 │ increase │
|
|
│ 2020-03-29 │ 59 │ 55 │ increase │
|
|
│ 2020-03-30 │ 94 │ 59 │ increase │
|
|
│ 2020-03-31 │ 91 │ -3 │ decrease │
|
|
│ 2020-04-01 │ 67 │ -26 │ decrease │
|
|
│ 2020-04-02 │ 104 │ 55 │ increase │
|
|
│ 2020-04-03 │ 145 │ 39 │ increase │
|
|
```
|
|
|
|
:::note
|
|
As mentioned in the [GitHub repo](https://github.com/GoogleCloudPlatform/covid-19-open-data), the dataset is no longer updated as of September 15, 2022.
|
|
:::
|